JP2003208188A

JP2003208188A - Japanese text-to-speech synthesis method

Info

Publication number: JP2003208188A
Application number: JP2002006545A
Authority: JP
Inventors: Tomoki Toda (戸田 智基); Hisashi Kawai (河井 恒); Minoru Tsuzaki (津崎 実)
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2002-01-15
Filing date: 2002-01-15
Publication date: 2003-07-25

Abstract

(57) [Summary] [Problem] To provide a new Japanese text-to-speech synthesis method that uses not only phoneme units but also diphone units whose boundaries lie at phoneme centers, and that, compared with the conventional method using phoneme units only, synthesizes more natural speech and uses the corpus more effectively. [Solution] In a Japanese text-to-speech synthesis method, for waveform concatenation within a sequence of a vowel and a vowel, speech synthesis units are selected by considering both concatenation at the boundary between the vowels and concatenation at the vowel centers.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a text-to-speech synthesis method capable of reading Japanese text aloud as synthesized speech.

[0002]

[Prior Art] [1] Description of a conventional Japanese text-to-speech synthesizer

[0003] FIG. 1 shows a schematic configuration of a conventional Japanese text-to-speech synthesizer.

[0004] The input Japanese text undergoes morphological analysis and dependency analysis in the language processing unit 1 and is converted into phoneme symbols, accent symbols, and the like.

[0005] The prosody pattern generation unit 2 uses the phoneme symbols, the accent symbol string, and the part-of-speech information of the input text obtained from the morphological analysis to estimate the phoneme duration (length of the voice), the fundamental frequency (pitch of the voice, F₀), the power at the vowel center (loudness of the voice), and so on.

[0006] The phoneme unit selection unit 3 uses dynamic programming to select, from the synthesis units (phoneme segments) stored in the waveform dictionary 5, the combination that is closest to the estimated phoneme duration, fundamental frequency, vowel-center power, and so on, and that minimizes the distortion produced when the units are concatenated.

[0007] The phoneme waveform generation unit 4 generates speech by concatenating the phoneme segments according to the selected combination while modifying their pitch.

[0008] In the unit selection performed by the phoneme unit selection unit 3, it is important to use a measure (cost) that agrees with perceptual characteristics (see Reference 1).

[0009] Reference 1: E. Klabbers and R. Veldhuis, "Reducing audible spectral discontinuities", IEEE Trans. Speech and Audio Processing, vol. 9, no. 1, pp. 39-51, 2001.

[0010] It is therefore necessary to map observable features (physical quantities and linguistic information) to psychological quantities by means of a cost function. FIG. 2 shows a conceptual diagram of the cost function.

[0011] Because the cost function must reflect auditory impressions, it should be estimated from perceptual experiments. When the space of explanatory variables is relatively simple, as with the fundamental frequency F₀ and the phoneme duration, the mapping from physical quantities to psychological quantities can be estimated by perceptual experiments. In a complex space whose explanatory variables are higher-dimensional features, however, estimation by perceptual experiments is difficult to carry out. Much research has also sought physical quantities that reflect auditory characteristics well enough to represent psychological quantities directly, but no physical quantity with sufficient accuracy has yet been found (see References 2 and 3).

[0012] Reference 2: Y. Stylianou and A. K. Syrdal, "Perceptual and objective detection of discontinuities in concatenative speech synthesis", Proc. ICASSP, pp. 837-840, Salt Lake City, U.S.A., May 2001.
Reference 3: M. Tsuzaki, "Feature extraction by auditory modeling for unit selection in concatenative speech synthesis", Proc. EUROSPEECH, pp. 2223-2226, Aalborg, Denmark, Sep. 2001.

[0013] The mapping from linguistic information to psychological information is comparatively easy to estimate by perceptual experiments. Using this mapping may make it possible to capture perceptual features that physical quantities cannot express adequately. Linguistic information alone, however, can express only categorical features and cannot distinguish between samples within the same category, so it must be combined with physical quantities that can represent the features of each individual sample.

[0014] This example uses a cost function WC built from physical quantities and from psychological quantities obtained directly from the results of perceptual experiments. The cost function WC consists of the five sub-cost functions listed in Table 1. C_pro and C_F0 are sub-cost functions based on sound source information; C_typ, C_env, and C_spec are sub-cost functions based on vocal tract information.

[0015]

[Table 1]

[0016] That is, as shown in FIG. 3, let u_{i-1}, u_i, and u_{i+1} be candidate units (phoneme segments), and let t_{i-1}, t_i, and t_{i+1} be the environments in which they are actually used (targets). The sub-cost functions for u_i are then C_pro(u_i, t_i), C_typ(u_i, t_i), C_env(u_i, u_{i-1}), C_spec(u_i, u_{i-1}), and C_F0(u_i, u_{i-1}).

[0017] In FIG. 3, u_{i-1} and u_i are not necessarily contiguous in the corpus; each denotes one unit candidate for the (i-1)-th target t_{i-1} and the i-th target t_i, respectively.

[0018] C_pro(u_i, t_i) and C_typ(u_i, t_i) represent the distortion between the unit candidate u_i extracted for the i-th phoneme and the environment in which it is actually used (target t_i). C_env(u_i, u_{i-1}), C_spec(u_i, u_{i-1}), and C_F0(u_i, u_{i-1}) represent the distortion that arises when the i-th unit candidate u_i is concatenated with the (i-1)-th unit candidate u_{i-1}.

[0019] When the preceding unit candidate is u_{i-1}, the cost function WC(u_i, t_i) of the unit candidate u_i for the phoneme target t_i to be synthesized is given by the following equation (1).

[0020]

[Equation 1]

[0021] Here, C_pro, C_F0, C_env, C_spec, and C_typ are the sub-cost functions, which are described in detail below. Each sub-cost is normalized to approximately the same dynamic range.

[0022] w_pro, w_F0, w_env, w_spec, and w_typ are the weights for the respective sub-costs, and their sum is 1, as shown in the following equation (2). In this example, an equal weight (0.2) is used for every sub-cost.
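The images for equations (1) and (2) are not reproduced in this text. As a minimal sketch, assuming equation (1) is a plain weighted sum of the five normalized sub-costs (an assumption based on the surrounding description, not a reproduction of the patent's formula), the local cost could be computed as follows. The names `WEIGHTS` and `weighted_cost` are hypothetical.

```python
# Sub-cost names follow the text: prosody, F0 discontinuity, phonetic
# environment substitution, spectral discontinuity, phoneme suitability.
SUB_COSTS = ("pro", "f0", "env", "spec", "typ")

# Equal weights of 0.2, as stated in the text; they sum to 1.
WEIGHTS = {name: 0.2 for name in SUB_COSTS}


def weighted_cost(sub_costs, weights=WEIGHTS):
    """Combine normalized sub-costs into one local cost WC.

    Assumption: WC is a weighted sum of the sub-costs. `sub_costs` maps
    each sub-cost name to a value already normalized to a common range.
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * sub_costs[name] for name in weights)
```

Because the weights sum to 1 and the sub-costs share a dynamic range, the combined cost stays in that same range, which makes it easier to compare candidates across targets.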

[0023]

[Equation 2]

[0024] Taking each phoneme in the input sentence as a target, the phoneme unit selection unit 3 uses dynamic programming to select the combination of unit candidates that minimizes the total cost TC given by the following equation (3).

[0025]

[Equation 3]

[0026] Here, N is the number of target phonemes in the sentence, and WC(u_i, t_i) is the cost of the unit candidate u_i for the target t_i to be synthesized.
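The dynamic programming search described here can be sketched as a Viterbi-style pass over the candidate lattice. The sketch below is illustrative only: it assumes the total cost is the sum of per-target costs, and it splits each local cost into a part that depends on the target (`target_cost`) and a part that depends on the preceding candidate (`concat_cost`). The function and parameter names are hypothetical, not taken from the patent.

```python
def select_units(candidates, target_cost, concat_cost):
    """Return (total_cost, best_path) over a lattice of unit candidates.

    candidates[i]      : non-empty list of candidates for the i-th target
    target_cost(i, u)  : cost of candidate u against target i
    concat_cost(p, u)  : cost of joining preceding candidate p to u
    """
    n = len(candidates)
    if n == 0:
        return 0.0, []

    # best[i][k] = (lowest accumulated cost ending in candidates[i][k],
    #               index of the best predecessor in candidates[i-1])
    best = [[(target_cost(0, u), None) for u in candidates[0]]]
    for i in range(1, n):
        row = []
        for u in candidates[i]:
            local = target_cost(i, u)
            row.append(min(
                (best[i - 1][j][0] + concat_cost(prev, u) + local, j)
                for j, prev in enumerate(candidates[i - 1])
            ))
        best.append(row)

    # Backtrack from the cheapest final candidate.
    k = min(range(len(best[-1])), key=lambda j: best[-1][j][0])
    total = best[-1][k][0]
    path = []
    for i in range(n - 1, -1, -1):
        path.append(candidates[i][k])
        k = best[i][k][1]
    path.reverse()
    return total, path
```

The search visits every pair of adjacent candidates once, so its cost grows linearly with the number of targets rather than exponentially with the number of candidate combinations.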

[0027] Each sub-cost function is described below.

[0028] (1) Sub-cost function for prosody: C_pro
The sub-cost function C_pro(u_i, t_i) represents the degradation of naturalness caused by the difference in prosody (F₀ contour and phoneme duration) between the unit candidate u_i and the target t_i, and is given by the following equation (4).

[0029]

[Equation 4]

[0030] Here, D_F0(u_i, t_i, m) is the difference between the unit candidate u_i and the target t_i in the mean log F₀ over the m-th sub-segment, and M is the number of sub-segments into which the phoneme is divided. D_d(u_i, t_i) is the difference in phoneme duration between the unit candidate u_i and the target t_i. P is a nonlinear function of D_F0(u_i, t_i, m) and D_d(u_i, t_i).

[0031] This example assumes that prosody modification is performed at synthesis time, and the function P is determined from the results of perceptual experiments on the degradation of naturalness caused by prosody modification. When prosody modification is not performed, a perceptual experiment must instead be carried out on the degradation of naturalness caused by using prosody that differs from the target, and the function P must be determined from those results.
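The two inputs to the function P can be illustrated with a short sketch. It assumes that each phoneme is split into M consecutive sub-segments of roughly equal length, that every frame is voiced (F₀ > 0), and that the differences are signed; none of these details is stated in the text. The function P itself comes from perceptual experiments and is not modeled here. The names `segment_means` and `prosody_distances` are hypothetical.

```python
import math


def segment_means(values, m):
    """Mean of `values` over m consecutive, roughly equal sub-segments."""
    n = len(values)
    if m < 1 or n < m:
        raise ValueError("need at least one value per sub-segment")
    means = []
    for k in range(m):
        seg = values[k * n // m:(k + 1) * n // m]
        means.append(sum(seg) / len(seg))
    return means


def prosody_distances(unit_f0, target_f0, unit_dur, target_dur, m=2):
    """Return (D_F0 per sub-segment, D_d) for one unit/target pair.

    unit_f0, target_f0 : per-frame F0 values in Hz (voiced frames only)
    unit_dur, target_dur : phoneme durations in seconds
    """
    unit_log = segment_means([math.log(f) for f in unit_f0], m)
    target_log = segment_means([math.log(f) for f in target_f0], m)
    d_f0 = [u - t for u, t in zip(unit_log, target_log)]
    d_d = unit_dur - target_dur
    return d_f0, d_d
```

Working in log F₀ means that a candidate one octave above the target gives the same difference (ln 2) regardless of the speaker's absolute pitch.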

[0032] (2) Sub-cost function for F₀ discontinuity: C_F0
The sub-cost function C_F0(u_i, u_{i-1}) represents the degradation of naturalness caused by the F₀ discontinuity at the concatenation boundary between u_{i-1} and u_i, and is given by the following equation (5).

[0033]

[Equation 5]

[0034] Here, D_F0(u_i, u_{i-1}) is a distance based on the difference in log F₀ at the concatenation boundary between u_{i-1} and u_i. For unit candidates u_{i-1} and u_i that are contiguous in the corpus, so that no concatenation occurs, this sub-cost is 0.

[0035] (3) Sub-cost function for phonetic environment substitution: C_env
The sub-cost function C_env(u_i, u_{i-1}) represents the degradation of naturalness caused by substituting the phonetic environment between the unit candidate and the target, and is determined from the results of perceptual experiments (see Reference 4). It is given by the following equation (6).

[0036] Reference 4: H. Kawai, M. Tsuzaki, T. Masuda, and H. Iwasawa, "Perceptual evaluation of naturalness degradation caused by phonetic environment substitution in waveform segment concatenation" (in Japanese), IEICE Technical Report, SP2001-22, pp. 51-57, May 2001.

[0037]

[Equation 6]

[0038] Here, S_s(u_i, E_s(u_{i-1}), t_i) is the sub-cost arising from substitution of the succeeding environment, and S_p(u_i, E_p(u_i), t_{i-1}) is the sub-cost arising from substitution of the preceding environment. E_s(u_{i-1}) denotes the phoneme that follows u_{i-1} in the corpus, and E_p(u_i) denotes the phoneme that precedes u_i in the corpus.

[0039] For example, when the "a" of "ae" (あえ; u_{i-1} = /a/, E_s(u_{i-1}) = /e/) and the "o" of "No" (んお; u_i = /o/, E_p(u_i) = /N/) are concatenated to synthesize "ao" (あお; t_{i-1} = /a/, t_i = /o/), the sub-cost function C_env(u_i, u_{i-1}) is given by the following equation (7).

[0040]

[Equation 7]

[0041] Even when the environments match, S_s and S_p are not necessarily 0. In that case the cost value represents the difficulty of concatenation between the phonemes (u_i, u_{i-1}) arising from, for example, ambiguity in labeling. For unit candidates u_{i-1} and u_i that are contiguous in the corpus, so that no concatenation occurs, this sub-cost is 0.
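The "ae" + "No" → "ao" example can be made concrete with a small lookup sketch. Several things here are assumptions rather than content of the patent: the two substitution costs are averaged, the cost tables are keyed by (unit phoneme, environment in the corpus, environment in the target), and the numeric values are invented for illustration. Real values would come from the perceptual experiments cited in the text.

```python
def c_env(prev_unit, cur_unit, prev_target, cur_target,
          s_succ, s_prec, contiguous=False):
    """Phonetic environment substitution sub-cost for one join.

    prev_unit / cur_unit : dicts with the unit's phoneme and its corpus
                           neighbour ("next" for prev_unit, "prev" for cur_unit)
    s_succ / s_prec      : hypothetical lookup tables for the succeeding- and
                           preceding-environment substitution costs
    """
    # Units that are adjacent in the corpus need no concatenation.
    if contiguous:
        return 0.0
    succ = s_succ[(prev_unit["phoneme"], prev_unit["next"], cur_target)]
    prec = s_prec[(cur_unit["phoneme"], cur_unit["prev"], prev_target)]
    return (succ + prec) / 2.0  # averaging is an assumption


# Illustrative values only.
S_SUCC = {("a", "e", "o"): 0.3}   # /a/ recorded before /e/, used before /o/
S_PREC = {("o", "N", "a"): 0.5}   # /o/ recorded after /N/, used after /a/

example_cost = c_env({"phoneme": "a", "next": "e"},
                     {"phoneme": "o", "prev": "N"},
                     "a", "o", S_SUCC, S_PREC)
```

Because the tables are looked up even when the corpus environment equals the target environment, a matching environment can still carry a nonzero cost, as the text notes.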

[0042] (4) Sub-cost function for spectral discontinuity: C_spec
The sub-cost function C_spec(u_i, u_{i-1}) represents the degradation of naturalness caused by the spectral discontinuity at the concatenation boundary between u_{i-1} and u_i, and is given by the following equation (8).

[0043]

[Equation 8]

[0044] Here, h(f) is a triangular window of length ω. MCD(u_i, u_{i-1}, f) is the mel-cepstral distortion between the f-th frame from the end of the unit candidate u_{i-1} and the f-th frame from the beginning of the unit candidate u_i. For unit candidates u_{i-1} and u_i that are contiguous in the corpus, so that no concatenation occurs, this sub-cost function is 0.

[0045] (5) Sub-cost function for phoneme suitability: C_typ
The sub-cost function C_typ(u_i, t_i) represents phoneme suitability, that is, the degradation of naturalness caused by the difference in average spectrum between the unit candidate u_i and the target t_i. It is given by the following equation (9).

[0046]

[Equation 9]

[0047] Here, Cen denotes a log-spectral centroid, and MCD is the mel-cepstral distortion between the centroid Cen(u_i) of the unit candidate u_i and the centroid Cen(t_i) of the target t_i. The centroid Cen(t_i) of the target t_i is computed from the mel-cepstrum output by an HMM-based speech synthesis method (see Reference 5).

[0048] Reference 5: T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis" (in Japanese), IEICE Trans. (D-II), vol. J83-D-II, no. 11, pp. 2099-2107, 2000.

[0049] [2] Description of conventional unit selection methods

[0050] In recent years, corpus-based approaches have become mainstream in text-to-speech (TTS) synthesizers, and synthesizing speech by concatenating speech segment waveforms has made high-quality synthetic speech attainable. The quality is still far from sufficient, however, and much remains to be improved even when synthesis is limited to text in a read-aloud style. The present inventors aim to build a higher-quality Japanese text-to-speech synthesizer for read-aloud text.

[0051] In Japanese, apart from devoiced vowels, syllables consist of CV (C: consonant, V: vowel) and V. A Japanese text-to-speech synthesizer can therefore build a speech corpus efficiently by treating CV as the synthesis unit. In addition, because the transition from C to V is considered perceptually important for phonemic identity, the CV unit is well suited as the synthesis unit in a Japanese text-to-speech synthesizer.

[0052] However, when speech is synthesized with CV units, concatenation from V to V often produces a sense of discontinuity. The V-to-V transition is an interval in which the formants change smoothly, so a formant discontinuity introduced by concatenation severely degrades naturalness. Achieving smooth concatenation between V and V is therefore an important problem.

[0053] Longer units have been proposed to avoid concatenating synthesis units between V and V (see References 6, 7, and 8).

[0054] Reference 6: H. Kawai, N. Higuchi, T. Shimizu and S. Yamamoto, "Development of a text-to-speech system for Japanese based on waveform splicing", Proc. ICASSP, pp. 569-572, Adelaide, Australia, Apr. 1994.
Reference 7: S. Takano, K. Tanaka, H. Mizuno, M. Abe and S. Nakajima, "A Japanese TTS system based on multiform units and a speech modification algorithm with harmonics reconstruction", IEEE Trans. Speech and Audio Processing, vol. 9, no. 1, pp. 3-10, 2001.
Reference 8: N. Iwahashi, N. Kaiki and Y. Sagisaka, "Speech segment selection for concatenative synthesis based on spectral distortion minimization", IEICE Trans. Fundamentals, vol. E76-A, no. 11, pp. 1942-1948, 1993.

[0055] As described in Reference 6, the CV unit proposed by Kawai et al. treats a frequently occurring sequence of a CV followed by vowels as a single unit, and is thus an extension of the CV unit.

[0056] Another unit is the variable-length unit proposed by Iwahashi et al., described in Reference 8. In this method, a cost function is defined for the phoneme sequence to be synthesized, and the optimum units are selected from the speech corpus by the criterion of minimizing the summed cost. By searching with dynamic programming over phoneme units, units consisting of phoneme sequences of various lengths can be selected (see Reference 9).

[0057] Reference 9: A. Black and N. Campbell, "Optimising selection of units from speech databases for concatenative synthesis", Proc. EUROSPEECH, pp. 581-584, Madrid, Spain, Sept. 1995.

[0058] Consequently, if the vowel sequence to be synthesized exists in the corpus and is selected as a unit, concatenation of synthesis units between V and V is avoided. In Japanese, however, the variety of vowel sequences is unbounded, so a corpus containing every vowel sequence cannot be built.

[0059] Moreover, once prosodic coverage (see Reference 10) is also taken into account, the required corpus becomes even larger. Concatenation of synthesis units between V and V is therefore a problem that cannot be avoided.

[0060] Reference 10: H. Kawai, N. Higuchi, and S. Yamamoto, "Creation of a waveform segment data set for speech synthesis taking fundamental frequency and phoneme duration into account" (in Japanese), IEICE Trans. (D-II), vol. J82-D-II, no. 8, pp. 1229-1238, 1999.

[0061] In a vowel sequence, the formant transition near the vowel center is more stable than the formant transition at the vowel boundary. Concatenating at the vowel center therefore often reduces the sense of discontinuity compared with concatenating at the vowel boundary. This was also confirmed in preliminary experiments carried out by the present inventors. The VCV unit described in Reference 11 is a synthesis unit devised on the basis of this idea.

[0062] Reference 11: H. Sato (佐藤大和), "Speech synthesis method using PARCOR-VCV concatenation units" (in Japanese), IEICE Trans. (D), vol. J61-D, no. 11, pp. 858-865, 1978.

[0063] However, synthesis units that concatenate smoothly at the vowel boundary may also exist, so concatenating only at vowel centers is not always the better choice. For vowel sequences, therefore, both units concatenated at the vowel center and units concatenated at the vowel boundary should be considered.

[0064]

[Problems to Be Solved by the Invention] The present invention was made on the basis of the above considerations, and its object is to provide a new Japanese text-to-speech synthesis method that uses not only phoneme units but also diphone units whose boundaries lie at phoneme centers.

[0065] Another object of the present invention is to provide a Japanese text-to-speech synthesis method that can synthesize more natural speech and use the corpus more effectively than the conventional method, which uses phoneme units only.

[0066]

[Means for Solving the Problems] The invention according to claim 1 is a Japanese text-to-speech synthesis method characterized in that, for waveform concatenation within a sequence of a vowel and a vowel, speech synthesis units are selected by considering both concatenation at the boundary between the vowels and concatenation at the vowel centers.

[0067] The invention according to claim 2 is the invention according to claim 1, characterized in that, for waveform concatenation within a sequence of a vowel and a semivowel, speech synthesis units are selected by considering both concatenation at the boundary between them and concatenation at the center of the vowel preceding the semivowel.

[0068] The invention according to claim 3 is the invention according to claim 2, characterized in that, for waveform concatenation within a sequence of a vowel and a nasal, speech synthesis units are selected by considering both concatenation at the boundary between them and concatenation at the center of the vowel preceding the nasal.

[0069]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to FIGS. 4 to 10. Concatenation at the vowel center is examined first, followed by a description of the unit selection method using phoneme units and diphone units, which is the characteristic feature of the Japanese text-to-speech synthesis method of the present invention.

[0070] [1] Examination of concatenation at the vowel center

[0071] When waveforms are concatenated at a vowel boundary within a vowel sequence, a larger discontinuity in the formant transition appears than when they are concatenated at a vowel center. This is thought to be because the spectrum is more stable at the vowel center than at the vowel boundary: statistically, the static features have smaller variance and change less.

[0072] From this, concatenating at the vowel center is expected to give better joins in vowel sequences than concatenating at the vowel boundary. This expectation is verified by an experiment that uses mel-cepstral distortion as the objective evaluation measure.

[0073] [1-1] Experimental method

[0074] The mel-cepstral distortion around the concatenation boundary is examined for waveform concatenation at the vowel boundary and for waveform concatenation at the vowel center.

[0075] FIG. 4 shows the two concatenation methods for a vowel sequence. FIG. 4(a) shows waveform concatenation at the vowel boundary, and FIG. 4(b) shows waveform concatenation at the vowel center.

[0076] FIG. 4 shows the case where a vowel V1 and a vowel V2 are concatenated. In FIG. 4, V* denotes any vowel. V1_fh denotes the first half of the vowel V1, and V1_lh denotes the latter half of the vowel V1. In FIG. 4(a), V1 of Unit1 is concatenated with V2 of Unit2. In FIG. 4(b), V1_fh of Unit1 is concatenated with (V1_lh + V2) of Unit2.

[0077] From a Japanese speech corpus uttered by one male Japanese speaker, 450 manually labeled sentences (about 30 minutes) from the ATR phonetically balanced sentences are used. For every vowel sequence in the corpus, concatenation is performed at the phoneme boundary and at the phoneme center, and the distribution of the weighted sum of mel-cepstral distortions (given by equation (8) above and hereinafter called the concatenation distortion) is obtained for each case. In equation (8), the number of frames ω used in computing the mel-cepstrum is 4, and the frame shift is 5 ms. The mel-cepstral distortion in each frame is computed with the following equation (10).

[0078]

[Equation 10]

[0079] Here, mc_i^(p) is the p-th mel-cepstral coefficient obtained from the F₀-adaptive smoothed spectrum (STRAIGHT spectrum; see Reference 12) of a frame in the succeeding vowel. Similarly, mc_{i-1}^(p) is the mel-cepstral coefficient in the preceding vowel. The sampling frequency is 16 kHz.
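The image for equation (10) is not reproduced in this text. The sketch below uses a widely used definition of mel-cepstral distortion in dB, (10 / ln 10) · sqrt(2 · Σ_p (mc_i^(p) − mc_{i-1}^(p))²), which is an assumption and may differ from the patent's formula in its constant factor or in which coefficients are included. The concatenation distortion is then sketched as a weighted sum over ω frame pairs with triangular weights normalized to sum to 1; the normalization and the way frames are paired are also assumptions. All function names are hypothetical.

```python
import math


def mel_cepstral_distortion(mc_a, mc_b):
    """Mel-cepstral distortion in dB between two coefficient vectors."""
    if len(mc_a) != len(mc_b):
        raise ValueError("coefficient vectors must have the same length")
    squared = sum((a - b) ** 2 for a, b in zip(mc_a, mc_b))
    return (10.0 / math.log(10.0)) * math.sqrt(2.0 * squared)


def triangular_window(length):
    """Symmetric triangular weights that sum to 1."""
    raw = [min(k + 1, length - k) for k in range(length)]
    total = sum(raw)
    return [r / total for r in raw]


def concatenation_distortion(frame_pairs):
    """Weighted sum of per-frame distortions around a join.

    frame_pairs : list of (preceding-side frame, succeeding-side frame)
                  coefficient vectors; the text uses 4 frames at a
                  5 ms frame shift.
    """
    weights = triangular_window(len(frame_pairs))
    return sum(w * mel_cepstral_distortion(a, b)
               for w, (a, b) in zip(weights, frame_pairs))
```

With normalized weights, the concatenation distortion stays in the same dB range as a single-frame distortion, so distributions for boundary joins and center joins can be compared directly.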

[0080] Reference 12: H. Kawahara, I. Masuda-Katsuse and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F₀ extraction: possible role of a repetitive structure in sounds", Speech Communication, vol. 27, no. 3-4, pp. 187-207, 1999.

【００８１】In the following description, the moraic nasal /N/ is also regarded as a vowel.

【００８２】〔１−２〕音素環境が異なる母音を含む場
合[1-2] When vowels having different phoneme environments are included

【００８３】Connection in vowel chains is examined for the case where vowels with different phoneme environments are included. As preceding vowels, all vowels in the corpus whose succeeding environment is a vowel are used (including the case V^* ≠ V2 in FIG. 4).

【００８４】FIG. 5 shows the frequency distribution of the connection distortion. FIG. 5 shows that connection at the vowel center (Vowel center; solid line) can often yield a smaller connection distortion than connection at the vowel boundary (Vowel boundary; broken line). In FIG. 5, Av. denotes the mean and Sd. denotes the standard deviation.

【００８５】In unit selection, it is important to find units that reduce not only the spectral discontinuity but the sum of all sub-costs. A distribution shifted toward smaller connection distortion therefore means that there are more unit candidates that may lower the total cost. Hence, when the phoneme environments differ, connecting at the vowel center allows connections with smaller spectral discontinuity, and more good unit candidates become available.

【００８６】〔１−３〕音素環境が等しい母音を用いる
場合[1-3] When vowels having the same phoneme environment are used

【００８７】次に、音素環境が等しい母音を用いる場合
について検討する。これは、図４においてＶ^*＝Ｖ２の
場合に相当する。この場合のコーパス中における単位候
補数は、先の実験の場合（Ｖ^*≠Ｖ２の場合も含む）と
比較すると約１／３である。Next, the case where vowels having the same phoneme environment are used will be examined. This corresponds to the case of V ^* = V2 in FIG. The number of unit candidates in the corpus in this case is about 1/3 as compared with the case of the previous experiment (including the case of V ^* ≠ V2).

【００８８】FIG. 6 shows the frequency distribution of the connection distortion when vowels having the same phoneme environment are used. As FIG. 6 shows, the frequency distributions of the connection distortion are almost the same for connection at the vowel center (Vowel center; solid line) and connection at the vowel boundary (Vowel boundary; broken line).

【００８９】したがって、音素環境が等しい際には、母
音中心における接続と母音境界における接続の間には大
きな違いはない。Therefore, when the phoneme environments are the same, there is no great difference between the connection at the vowel center and the connection at the vowel boundary.

【００９０】Therefore, for each unit candidate, the connection at the vowel boundary and the connection at the vowel center are compared, and the one giving the smaller connection distortion is used. The resulting frequency distribution, that is, the distribution of the connection distortion when the smaller of the two is used, is indicated in FIG. 6 by a chain line (Vowel center & Vowel boundary). This distribution shows that the connection distortion can be made smaller than when only the connection at the vowel boundary or only the connection at the vowel center is used.
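The per-candidate comparison described above can be sketched as follows. This is a minimal illustration, assuming that the two connection distortions of each candidate have already been computed. The function names and the numeric values in the usage note are hypothetical and are not taken from FIG. 6.

```python
from typing import List, Sequence, Tuple


def best_connection(boundary: float, center: float) -> Tuple[str, float]:
    """Pick the connection point with the smaller connection distortion."""
    if center < boundary:
        return ("center", center)
    return ("boundary", boundary)


def combined_distortions(
    boundary_values: Sequence[float], center_values: Sequence[float]
) -> List[float]:
    """Per-candidate minimum of the boundary and center connection distortions."""
    if len(boundary_values) != len(center_values):
        raise ValueError("one boundary and one center value per candidate expected")
    return [min(b, c) for b, c in zip(boundary_values, center_values)]
```

For hypothetical inputs `[3.0, 1.0, 2.5]` (boundary) and `[2.0, 1.5, 2.5]` (center), the combined list is `[2.0, 1.0, 2.5]`. Because each element is the minimum of the pair, the combined mean can never exceed the mean of either single strategy, which is the tendency the chain line in FIG. 6 is described as showing.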

【００９１】This means that considering both the connection at the vowel boundary and the connection at the vowel center increases the number of unit candidates and makes it possible to use unit candidates that allow connections with smaller spectral discontinuity.

【００９２】以上のことから、母音連鎖においては母音
中心における接続と母音境界における接続の両者を併用
することにより、よりよい単位選択を行うことができる
といえる。From the above, it can be said that in the vowel chain, a better unit selection can be performed by using both the connection at the vowel center and the connection at the vowel boundary.

【００９３】〔２〕この発明による日本語テキスト音声
合成方法についての説明[2] Description of Japanese text-to-speech method according to the present invention

【００９４】The Japanese text-to-speech synthesis method according to the present invention is characterized in that, for vowel chains (V-V chains), suitable unit selection is performed by using both connection at the vowel center (unit selection using diphone units) and connection at the vowel boundary (conventional unit selection using phoneme units). For all other chains, conventional unit selection using phoneme units is used.

【００９５】Furthermore, in this embodiment, connection by diphone units is considered not only for V-V chains but also for V-semivowel chains and V-nasal chains. For V-semivowel and V-nasal chains, the connection considered is the one at the center of the vowel preceding the semivowel or nasal. In addition, to prevent an increase in the number of connection points, units consisting only of a semivowel are not used. However, silence is regarded as a phoneme, and an exception is made when the semivowel is preceded or followed by silence.

【００９６】〔２−１〕音素単位とダイフォン単位を用
いた単位選択法[2-1] Unit selection method using phoneme unit and diphone unit

【００９７】音声単位を用いた単位選択において用いら
れるサブコスト関数としては、従来技術の説明の欄にお
いて、既に説明したので、ここではその説明を省略す
る。The sub-cost function used in the unit selection using the voice unit has already been described in the section of the description of the prior art, so the description thereof will be omitted here.

【００９８】以下、ダイフォン単位を用いた単位選択に
おいて用いられるサブコスト関数について説明する。The sub-cost function used in unit selection using the diphone unit will be described below.

【００９９】A vowel unit candidate for which connection at the vowel center is considered is divided into a first-half and a second-half half-phoneme so that diphone units can be considered in unit selection. Here, each half-phoneme is assumed to have half the duration of the original phoneme. The cost of a vowel for which diphone units are considered is still calculated on a phoneme basis: the combination of a first-half half-phoneme unit candidate and a second-half half-phoneme unit candidate is regarded as one phoneme.
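The division into half-phonemes can be sketched as below. The `Segment` type, its fields, and the function name are hypothetical; the `_fh` and `_lh` suffixes follow the labels used in FIG. 4. The only rule taken from the text is that each half receives half the duration of the original phoneme.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Segment:
    label: str    # phoneme label, e.g. "a"
    start: float  # start time in seconds
    end: float    # end time in seconds

    @property
    def dur(self) -> float:
        return self.end - self.start


def split_into_half_phonemes(seg: Segment) -> Tuple[Segment, Segment]:
    """Split a vowel segment at its temporal center into first and last halves."""
    mid = seg.start + seg.dur / 2.0
    first_half = Segment(seg.label + "_fh", seg.start, mid)
    last_half = Segment(seg.label + "_lh", mid, seg.end)
    return first_half, last_half
```

The two halves share the midpoint as their common edge, so a first half taken from one corpus vowel can be paired with a last half taken from another when the connection is made at the vowel center.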

【０１００】Let the first-half half-phoneme unit candidate be u_i^f (u_1i before division into half-phonemes) and the second-half half-phoneme unit candidate be u_i^l (u_2i before division into half-phonemes). Each sub-cost for the target t_i (first half t_i^f, second half t_i^l) is then calculated as follows.

【０１０１】(1) Sub-cost function C_pro(u_i, t_i) for prosody
As shown in Equations (11) and (12) below, the sub-cost function C_pro(u_i, t_i) is calculated for each half-phoneme, and the results are weighted according to the duration dur of each half-phoneme.

【０１０２】[0102]

【数１１】 [Equation 11]

【０１０３】[0103]

【数１２】 [Equation 12]

【０１０４】Here, the number of phoneme divisions in C_pro(u_i^f, t_i^f) and C_pro(u_i^l, t_i^l) is M/2.
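Equations (11) and (12) are not reproduced in this text. As a hedged sketch of what duration-dependent weighting of the two half-phoneme sub-costs could look like, the following assumes that the weights ω_f and ω_l are the durations of the two halves normalized to sum to 1. That normalization, the function names, and the choice of which durations are passed in are assumptions, not the patent's stated formulas.

```python
from typing import Tuple


def duration_weights(dur_f: float, dur_l: float) -> Tuple[float, float]:
    """Assumed weights: each half's share of the total duration."""
    total = dur_f + dur_l
    if total <= 0.0:
        raise ValueError("total duration must be positive")
    return dur_f / total, dur_l / total


def combine_half_costs(cost_f: float, cost_l: float, dur_f: float, dur_l: float) -> float:
    """Duration-weighted combination of first-half and last-half sub-costs."""
    w_f, w_l = duration_weights(dur_f, dur_l)
    return w_f * cost_f + w_l * cost_l
```

Under this assumption, a first half lasting 30 ms with sub-cost 1.0 and a last half lasting 10 ms with sub-cost 3.0 combine to 0.75 × 1.0 + 0.25 × 3.0 = 1.5. The same weighting pattern would apply to C_typ, for which the text states that ω_f and ω_l are the same as in Equation (12).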

【０１０５】(2) Sub-cost function C_F0(u_i, u_{i-1}) for F₀ discontinuity
As shown in Equation (13) below, the sub-cost function C_F0(u_i, u_{i-1}) is calculated as the sum of the sub-costs at the phoneme boundary (u_{i-1} and u_i^f) and at the phoneme center (u_i^f and u_i^l).

【０１０６】[0106]

【数１３】 [Equation 13]

【０１０７】(3) Sub-cost function C_env(u_i, u_{i-1}) for phoneme environment substitution
The sub-cost function C_env(u_i, u_{i-1}) is given by Equation (14) below. The preceding and succeeding phoneme environments of a half-phoneme are taken to be the same as those of the phoneme before division into half-phonemes. However, the cost functions S_s^d and S_p^d at the phoneme center differ from those used between phonemes.

【０１０８】[0108]

【数１４】 [Equation 14]

【０１０９】(4) Sub-cost function C_spec(u_i, u_{i-1}) for spectral discontinuity
As shown in Equation (15) below, the sub-cost function C_spec(u_i, u_{i-1}) is calculated as the sum of the sub-costs at the phoneme boundary (u_{i-1} and u_i^f) and at the phoneme center (u_i^f and u_i^l).

【０１１０】[0110]

【数１５】 [Equation 15]

【０１１１】(5) Sub-cost function C_typ(u_i, t_i) for phoneme suitability
As shown in Equation (16) below, the sub-cost function C_typ(u_i, t_i) is calculated for each half-phoneme, and the results are weighted according to the duration dur of each half-phoneme.

【０１１２】[0112]

【数１６】 [Equation 16]

【０１１３】Here, ω_f and ω_l are the same as in Equation (12) above.

【０１１４】FIG. 7 shows the targets and unit candidates used to calculate each sub-cost in the cost calculation of the unit candidates u_i^f and u_i^l for a target t_i for which diphone units are considered. Here, u_{i-1}, u_i^f, and u_i^l are not necessarily contiguous in the corpus; each denotes one unit candidate for the (i−1)-th target, the first half of the i-th target, and the second half of the i-th target, respectively.

【０１１５】When a diphone unit is used (that is, when u_{i-1} and u_i^f are contiguous in the corpus), C_env(u_i^f, u_{i-1}), C_spec(u_i^f, u_{i-1}), and C_F0(u_i^f, u_{i-1}) are 0. When a phoneme unit is used (that is, when u_i^f and u_i^l are contiguous in the corpus), C_env(u_i^l, u_i^f), C_spec(u_i^l, u_i^f), and C_F0(u_i^l, u_i^f) are 0.
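The zero-cost rule for units that are contiguous in the corpus, together with the boundary-plus-center sum described for C_F0 and C_spec, can be sketched as follows. The `Unit` fields (`utt_id`, `index`), the contiguity test, and the `raw_cost` callback are hypothetical and chosen only for illustration. The sketch applies to any one of the connection sub-costs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Unit:
    utt_id: int  # hypothetical: index of the corpus utterance the unit comes from
    index: int   # hypothetical: position of the (half-)phoneme within that utterance


def contiguous(prev: Unit, cur: Unit) -> bool:
    """True when cur immediately follows prev in the same corpus utterance."""
    return prev.utt_id == cur.utt_id and cur.index == prev.index + 1


def join_cost(prev: Unit, cur: Unit, raw_cost: Callable[[Unit, Unit], float]) -> float:
    """A connection sub-cost is 0 when no actual concatenation takes place."""
    return 0.0 if contiguous(prev, cur) else raw_cost(prev, cur)


def vowel_join_cost(
    u_prev: Unit, u_f: Unit, u_l: Unit, raw_cost: Callable[[Unit, Unit], float]
) -> float:
    """Sum of the phoneme-boundary term and the phoneme-center term."""
    return join_cost(u_prev, u_f, raw_cost) + join_cost(u_f, u_l, raw_cost)
```

When u_{i-1} and u_i^f are contiguous (a diphone unit is used), only the center term contributes; when u_i^f and u_i^l are contiguous (a phoneme unit is used), only the boundary term contributes.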

【０１１６】Taking the above into account, the combination of unit candidates that minimizes the total cost TC given by Equation (3) above is selected by dynamic programming. For a vowel for which diphone units are considered, if there are F candidates for the first half and L candidates for the second half, F×L paths are evaluated and, as a result, L paths remain.
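One step of such a dynamic-programming search can be sketched as below. This is a generic Viterbi-style update, not the patent's implementation: `prev_costs` holds the best accumulated cost for each of the F first-half candidates, and `step_cost(i, j)` is a hypothetical callback returning the cost of moving from first-half candidate i to second-half candidate j (the second-half target sub-costs plus the center connection sub-costs). Equation (3) itself is not reproduced here.

```python
from typing import Callable, List, Sequence, Tuple


def viterbi_step(
    prev_costs: Sequence[float],
    n_next: int,
    step_cost: Callable[[int, int], float],
) -> Tuple[List[float], List[int]]:
    """Evaluate len(prev_costs) * n_next paths and keep the best one per next candidate."""
    new_costs: List[float] = []
    back_pointers: List[int] = []
    for j in range(n_next):
        # Accumulated cost of reaching candidate j through each predecessor i.
        totals = [prev_costs[i] + step_cost(i, j) for i in range(len(prev_costs))]
        best_i = min(range(len(totals)), key=totals.__getitem__)
        new_costs.append(totals[best_i])
        back_pointers.append(best_i)
    return new_costs, back_pointers
```

With F = 3 first-half candidates and L = 2 second-half candidates, the callback is invoked 3 × 2 = 6 times and two surviving paths are returned, matching the F×L evaluations and L remaining paths described above. Following the back pointers from the cheapest final candidate recovers the selected unit sequence.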

【０１１７】図８は、音素単位とダイフォン単位とを用
いた単位選択の例を示している。FIG. 8 shows an example of unit selection using a phoneme unit and a diphone unit.

【０１１８】In this example, the input sentence is "ついやす" (tsuiyasu; /ts[C] u[V] i[V] y[C] a[V] S[C]/), where y is a semivowel.

【０１１９】In FIG. 8, "<V^*]" (where V^* denotes any vowel) indicates the second half of a vowel, and "[V^*>" indicates the first half of a vowel.

【０１２０】In this example, unit selection for the chains /ts-u/, /u-i/, and /i-y/ considers diphone units as well as phoneme units. For the other chains, /y-a/ and /a-s/, unit selection uses phoneme units only. As a result of unit selection, variable-length units are selected in which connection is allowed not only at phoneme boundaries but also at vowel centers.

【０１２１】〔３〕評価実験[3] Evaluation experiment

【０１２２】上記実施の形態による日本語テキスト音声
合成方法（以下、提案法という）の性能を評価するため
に、音素単位に基づいた従来法との比較を行う。In order to evaluate the performance of the Japanese text-to-speech synthesis method (hereinafter referred to as the proposed method) according to the above-described embodiment, a comparison with the conventional method based on the phoneme unit is performed.

【０１２３】〔３−１〕主観評価実験[3-1] Subjective evaluation experiment

【０１２４】〔３−１−１〕実験条件[3-1-1] Experimental conditions

【０１２５】The experiment was conducted using the same speech corpus as in the experiment of [1-1] above, namely 450 manually labeled sentences (about 30 minutes) from the ATR phonetically balanced sentence set, taken from a Japanese speech corpus uttered by one Japanese male speaker. In addition to the labeling, F₀ was also corrected by hand.

【０１２６】日本語１０文の合成音声を用いて対比較実
験を行った。これらの文は単位選択に用いるコーパスに
含まれていない。提案法と従来法を用いて単位選択を行
い、音声を合成する。単位選択法の性能のみを評価する
ために、原音声から抽出した自然な韻律情報とメルケプ
ストラム系列を用いて単位選択を行った。A pair comparison experiment was conducted using synthetic speech of 10 Japanese sentences. These sentences are not included in the corpus used for unit selection. Using the proposed method and the conventional method, we select units and synthesize speech. In order to evaluate only the performance of the unit selection method, unit selection was performed using the natural prosody information extracted from the original speech and the mel cepstrum sequence.

【０１２７】For speech synthesis, the prosody (F₀ contour, phoneme duration, and power) was controlled using STRAIGHT (see Reference 12 above), a high-quality analysis-synthesis method. The subjects were ten Japanese adults, both men and women. In each trial, speech synthesized by the proposed method and speech synthesized by the conventional method were presented as a pair in random order, and the subject was asked to select the one that sounded more unnatural. Subjects were allowed to listen to the synthesized speech as many times as needed.

【０１２８】〔３−１−２〕実験結果[3-1-2] Experimental Results

【０１２９】The ten synthesized sentences consist of 366 phonemes. The proposed method made 145 phoneme-boundary connections (V-C: 125, V-V: 6, V-semivowel: 3, V-nasal: 11) and 25 vowel-center connections. The conventional method made 163 phoneme-boundary connections (V-C: 124, V-V: 16, V-semivowel: 3, V-nasal: 20).

【０１３０】FIG. 9 shows the results of the paired comparison experiment. In FIG. 9, the preference score is the proportion of trials in which the speech synthesized by each method was judged the more natural of the pair. FIG. 9 makes clear that the proposed method can synthesize more natural speech than the conventional method.

【０１３１】〔３−２〕客観評価実験[3-2] Objective evaluation experiment

【０１３２】〔３−２−１〕実験条件[3-2-1] Experimental conditions

【０１３３】Because the proposed method also considers diphone units, it can increase the number of unit candidates and is therefore expected to use the corpus more effectively. The corpus reduction rate of the proposed method relative to the conventional method was therefore measured. The corpus reduction rate is the difference between the corpus sizes of the conventional method and the proposed method at equal average cost (the corpus reduction amount) divided by the corpus size of the conventional method at that point. The corpora used for unit selection were obtained by reducing a speech corpus of slightly less than 8 hours (about 10,000 sentences) to 2^(-n/2) of its size (n = 0, 1, 2, ..., 9). As evaluation sentences, 53 sentences not included in this speech corpus were used.
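The corpus sizes and the corpus reduction rate defined above can be written out as a short sketch. The function names are hypothetical, and the value 8.0 hours in the usage note is only a round stand-in for "slightly less than 8 hours". The two corpus sizes at equal average cost must come from the measured cost curves, which are not reproduced in this text.

```python
from typing import List


def corpus_sizes(full_size: float, n_max: int = 9) -> List[float]:
    """Sizes obtained by scaling the full corpus by 2^(-n/2) for n = 0..n_max."""
    return [full_size * 2.0 ** (-n / 2.0) for n in range(n_max + 1)]


def corpus_reduction_rate(size_conventional: float, size_proposed: float) -> float:
    """(conventional size - proposed size) / conventional size, at equal average cost."""
    if size_conventional <= 0.0:
        raise ValueError("conventional corpus size must be positive")
    return (size_conventional - size_proposed) / size_conventional
```

For example, `corpus_sizes(8.0)` yields ten sizes from 8.0 down to about 0.35 hours, each step shrinking the corpus by a factor of √2. If, hypothetically, the conventional method needed 4.0 hours to reach an average cost that the proposed method reached with 3.0 hours, the reduction rate would be 0.25.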

【０１３４】〔３−２−２〕実験結果[3-2-2] Experimental results

【０１３５】FIG. 10 shows the relationship between corpus size and the corpus reduction rate. FIG. 10 shows that the proposed method can reduce the corpus compared with the conventional method, that is, it uses the corpus more effectively. This effect becomes larger as the corpus becomes larger.

【０１３６】[0136]

【発明の効果】According to the present invention, a new Japanese text-to-speech synthesis method is obtained that uses not only phoneme units but also diphone units whose boundaries lie at phoneme centers.

【０１３７】Furthermore, according to the present invention, more natural speech can be synthesized than with the conventional method using only phoneme units, and the corpus can be used more effectively.

[Brief description of drawings]

【図１】従来の日本語テキスト音声合成装置の概略構成
を示すブロック図である。FIG. 1 is a block diagram showing a schematic configuration of a conventional Japanese text-to-speech synthesizer.

【図２】コスト関数の概念を示す模式図である。FIG. 2 is a schematic diagram showing the concept of a cost function.

【図３】FIG. 3 is a schematic diagram showing the targets and unit candidates used to calculate each sub-cost in the cost calculation of a unit candidate u_i for a phoneme target t_i.

【図４】母音境界での波形接続方法と母音中心での波形
接続方法とをそれぞれ示す模式図である。FIG. 4 is a schematic diagram showing a waveform connection method at a vowel boundary and a waveform connection method at a vowel center.

【図５】音素環境が異なる母音を含む場合の接続歪みの
頻度分布を示すグラフである。FIG. 5 is a graph showing a frequency distribution of connection distortion when vowels having different phoneme environments are included.

【図６】音素環境が等しい母音を用いた際の接続歪みの
頻度分布を示すグラフである。FIG. 6 is a graph showing a frequency distribution of connection distortion when vowels having the same phoneme environment are used.

【図７】ダイフォン単位を考慮するターゲットｔ_iに対
する単位候補ｕ_i ^f，ｕ_i ^lのコスト計算において、各
サブコストの計算に用いられるターゲットと単位候補と
を示す模式図である。FIG. 7 is a schematic diagram showing targets and unit candidates used for calculation of each sub-cost in cost calculation of unit candidates u _i ^f and u _i ^l for a target t _i considering a diphone unit.

【図８】音素単位とダイフォン単位とを用いた単位選択
の例を示す模式図である。FIG. 8 is a schematic diagram showing an example of unit selection using a phoneme unit and a diphone unit.

【図９】主観評価実験の結果を示すグラフである。FIG. 9 is a graph showing the results of a subjective evaluation experiment.

【図１０】客観評価実験の結果を示すグラフである。FIG. 10 is a graph showing the results of an objective evaluation experiment.

Continuation of front page
(72) Inventor: Minoru Tsuzaki, c/o ATR Advanced Telecommunications Research Institute International, 2-2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto
F-term (reference): 5D045 AB02 AB30

Claims


1. A Japanese text-to-speech synthesis method, wherein, in waveform connection in a chain of a vowel and a vowel, speech synthesis units are selected in consideration of both connection at the boundary between the vowels and connection at the vowel centers.

2. The Japanese text-to-speech synthesis method according to claim 1, wherein, in waveform connection in a chain of a vowel and a semivowel, speech synthesis units are selected in consideration of both connection at the boundary between them and connection at the center of the vowel preceding the semivowel.

3. The Japanese text-to-speech synthesis method according to claim 2, wherein, in waveform connection in a chain of a vowel and a nasal, speech synthesis units are selected in consideration of both connection at the boundary between them and connection at the center of the vowel preceding the nasal.