
JPH01292400A - Speech synthesis system - Google Patents

Speech synthesis system

Info

Publication number
JPH01292400A
JPH01292400A (Application JP12253588A)
Authority
JP
Japan
Prior art keywords
phoneme
time
interpolation
value
sectional area
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP12253588A
Other languages
Japanese (ja)
Inventor
Takahiro Suzuki
貴博 鈴木
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Meidensha Corp
Meidensha Electric Manufacturing Co Ltd
Original Assignee
Meidensha Corp
Meidensha Electric Manufacturing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Meidensha Corp, Meidensha Electric Manufacturing Co Ltd filed Critical Meidensha Corp
Priority to JP12253588A priority Critical patent/JPH01292400A/en
Publication of JPH01292400A publication Critical patent/JPH01292400A/en
Pending legal-status Critical Current


Landscapes

  • Electrophonic Musical Instruments (AREA)

Abstract

PURPOSE: To generate a natural synthesized voice with smoother articulatory coupling by performing interpolation computations both along the time axis and along the vocal tract, thereby simulating the motion of the tongue. CONSTITUTION: A parameter interpolation processing section 5 performs interpolation not only in the time-axis direction but also in the vocal-tract direction. To simulate the motion of the tongue, the tongue-movement interpolation is carried out only within the range of acoustic tube stages corresponding to the positions of the vocal tract where the tongue can move. Namely, denoting those positions KI to KE, 0<=KI<=KE<=n-1 holds (n: number of acoustic tube stages), and for the cross-sectional area parameter Ai at an arbitrary time the ordinary cross-sectional area interpolation is performed when 0<=i<=KI or KE<=i<=n-1. A computation section 6 then finds the digital value of the output current at the same timing as the interpolation computation, and this value is D/A-converted by a converter 7 so that a voice is generated by a sound generating section 8.

Description

DETAILED DESCRIPTION OF THE INVENTION
A. Field of Industrial Application
The present invention relates to a speech synthesis device using an acoustic tube model.

B. Summary of the Invention
The present invention regards the human vocal tract as a group of acoustic tubes and maps them onto a group of circuit elements characterized by surge impedance components, so that speech is simulated from the current wave at the output end of the circuit element group. In this scheme, the interpolation computation is performed not only along the time axis but also along the vocal tract, so that the movement of the tongue is also simulated, articulatory coupling becomes smoother, and a more natural synthesized voice is produced.

C. Prior Art
Electronic devices that artificially synthesize and output sound, such as speech synthesizers and music synthesizers (electronic musical instruments), have recently become realizable at low cost as speech recognition and speech synthesis LSIs of one to several chips, thanks to speech information processing and large-scale semiconductor integration technology, and various schemes have been proposed depending on the purpose of use and the applicable constraints. Speech synthesis methods fall into two classes: the recorded-speech editing method, in which raw human utterances are recorded and then combined and edited appropriately into sentences, and the parameter method, which does not use the human voice directly but extracts only the parameters of human speech and controls those parameters during the synthesis process to create a speech signal artificially.

In the parameter method, the speech waveform is sampled at a fixed period, the value of the speech signal at each sampling point is analog-to-digital converted, and the values are represented as codes of 0s and 1s. To record faithfully to the analog signal, however, the number of bits must be increased, which requires a large memory capacity.

Therefore, various highly efficient encoding methods have been researched and developed to reduce this amount of information as much as possible.

One such method is delta modulation, which assigns a minimum of one bit to each speech signal sample. The single bit is used to indicate whether the next signal value is higher or lower than the current one: a code "1" is assigned if it is higher and a code "0" if it is lower, and the speech signal is encoded in this way. In a practical system configuration a fixed amplitude step (the delta) is defined and, so that errors do not accumulate, encoding is applied to the residual signal between the speech value reconstructed from the codes produced so far and the incoming speech signal.

Such schemes are called predictive coding; they include the linear prediction method (which predicts from several preceding sample values) and the PARCOR method (which replaces the prediction coefficients of the linear prediction method with PARCOR coefficients, i.e. partial autocorrelation coefficients).

D. Problems to Be Solved by the Invention
Among conventional speech synthesis methods, the recorded-speech editing method has the problem that the vocabulary and the kinds of sentences that can be synthesized are limited.

In methods using predictive coding, the articulatory coupling that corresponds to the junction between sounds is difficult, and no method of joining synthesis units has been established. For example, in an utterance that runs from a vowel through a consonant to another vowel, the sound breaks at the vowel junctions as it passes from the steady part of the first vowel through a transition to the consonant and then through another transition to the steady part of the next vowel. The sound therefore lacks smoothness and does not feel natural to a human listener.

The object of the present invention is to provide a speech synthesis device that can synthesize arbitrary vocabulary and sentences, and whose output is smooth, close to actual human speech, and natural to the listener.

E. Means and Action for Solving the Problems
To radiate speech out of the mouth a sound source is required, and this source is created by the vocal cords. The vocal cords interrupt the breath intermittently by opening and closing two folds; this interruption produces a flow of air called puffs. When the vocal cords are tensed, tension is applied to the folds, the frequency of their opening and closing rises, and a higher-frequency puff sound is produced. Increasing the breath flow makes the sound louder.

When this source wave passes through a cylindrical acoustic tube such as the vocal tract, resonance emphasizes some components of the sound wave radiated from the open end and attenuates others, producing the complex waveforms of vowels.

Even when the source wave has the same waveform, the voice emitted from the mouth is affected by the shape of the vocal tract it passes through before being radiated from the lips. That is, human speech sounds are determined by the length and cross-sectional area of the vocal tract from the vocal cords to the lips and by the manner in which the vocal cords vibrate.

The present invention was made with this in mind. Its starting point is to regard the vocal tract as a group of acoustic tubes of variable cross-sectional area and to realize the traveling-wave phenomenon that describes the propagation of sound waves by means of an equivalent circuit.

From the above, the vocal tract is represented by the equivalent model of Fig. 2(a), in which n one-dimensional acoustic tubes with cross-sectional areas A_1 to A_n are connected in cascade. Regarding each acoustic tube S_1 to S_n as a lossless distributed-constant line T_1 to T_n consisting only of L and C components converts the model into the electric circuit of Fig. 2(b), and the output waveform obtained when an impulsive voltage or current signal substituted for the sound source is applied to this circuit can be regarded as a response output corresponding to the speech waveform. Changing the articulatory state of the vocal tract then amounts to changing the surge impedance by changing the constants L and C of each line T_1 to T_n.

In the equivalent circuit of Fig. 2(b), when an impulsive current substituted for the sound source is applied, each line T_1 to T_n has a current source whose internal resistance is its surge impedance Z_1 to Z_n, and the model is converted into an equivalent circuit with reflection and transmission at the junction of each pair of adjacent lines.

This equivalent circuit is shown in Fig. 2(c). In the figure, E is the voltage of the voltage source corresponding to the sound source, Z_0 is its output impedance, Z_L is the radiation impedance at the lips, i_0A to i_(n-1)A are the forward-wave currents at the junctions of the lines T_1 to T_n, i_0B to i_(n-1)B are the backward-wave currents at those junctions, Z_1 to Z_n are the surge impedances ρc/A_1 to ρc/A_n determined by the cross-sectional areas A_1 to A_n, the air density ρ, and the speed of sound c, I_1A to I_(n-1)A and I_1B to I_(n-1)B are the current-source currents appearing at the junctions of the lines T_1 to T_n, and a_0A to a_(n-1)A and a_0B to a_(n-1)B are the shunt currents of the current sources.

The relations among these currents are as shown in Fig. 2(d). In the equations, S_1A = A_1/(A_1 + A_2), S_1B = A_2/(A_1 + A_2), and so on up to S_(n-1)A = A_(n-1)/(A_(n-1) + A_n), S_(n-1)B = A_n/(A_(n-1) + A_n). The memory items denote items that hold the computation result of one step before the current value. To evaluate these relational expressions, the memory items are given the initial value 0; with the voltage E applied, each memory item is computed sequentially, line by line up to T_n, at the fundamental period, the final-stage current i_nB is obtained at the fundamental period, and the articulated speech output data is obtained from this current i_nB.

Here, the equivalent circuit of Fig. 2(c) is a linear circuit, so the principle of superposition holds: the currents at each point can be found when the voltage E is set to zero and a voltage E_L is applied in series with the radiation impedance Z_L, that is, the current through the impedance Z_0 can be found. Since the cross-sectional area of the vocal tract can thus be varied in simulation by varying the surge impedances, varying the surge impedances continuously makes the articulatory coupling between phonemes or syllables smooth.

The present invention is therefore characterized in that the interpolation of the cross-sectional area changes of the cascaded acoustic tubes is performed not only along the time axis but also along the vocal tract, so that the movement of the tongue within the oral cavity is also simulated.

F. Embodiment
Fig. 1 shows the block configuration of an embodiment of the present invention. 1 is a Japanese-language processing section, which segments an input Japanese sentence into phrases and performs kana-reading conversion and the like by referring to a dictionary. 2 is a sentence processing section, which adds intonation to the sentence. 3 is a syllable processing section, which adds accents, in accordance with the intonation, to the syllables that make up the sentence. 4 is a phoneme processing section and 4_1 is a syllable parameter storage section; the phoneme processing section 4 decomposes the input syllable data into phonemes by referring to the data in the syllable parameter storage section 4_1, which defines the correspondence between syllables and phonemes, the units of vowels and consonants.

5 is a parameter interpolation processing section, 5_1 is a phoneme parameter storage section, and 5_2 is a sound source parameter storage section.

As shown in Fig. 3, the phoneme parameter storage section 5_1 divides the utterance time of each phoneme into a plurality of time segments, for example three: a rising part t_I, a steady part t_M, and a falling part t_E. For each time segment it stores the duration; the initial values of the pitch (the repetition frequency of the source wave), of the energy of the source wave, and of the cross-sectional areas of the acoustic tubes; time constants defining how each initial value of the segment changes toward the corresponding initial value of the next segment; and the source wave pattern. In this embodiment the human vocal tract (about 17 cm for a male) is modeled as 17 cascaded acoustic tubes each 1 cm long, so 17 cross-sectional area values (A_1 to A_17) are defined per time segment. The parameter interpolation processing section 5 is the part that performs the interpolation of pitch, energy, and cross-sectional area in each time segment (t_I to t_E).
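The contents of the phoneme parameter storage section 5_1 can be pictured as the following data structure; the field names and types are a sketch inferred from the description of Fig. 3, not the patent's actual storage layout.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SegmentParams:
    """Parameters held for one time segment (rising, steady, or falling)."""
    duration_ratio: int      # share of the phoneme's time parameter, e.g. I:M:E = 1:3:2
    pitch: float             # initial pitch: repetition frequency of the source wave
    energy: float            # initial energy of the source wave
    areas: List[float]       # initial cross-sectional areas A1..A17 of the 17 tubes
    time_constant: float     # governs the change toward the next segment's initial values
    source_pattern: str      # identifier of the stored source wave pattern

@dataclass
class PhonemeParams:
    """One phoneme: rising part tI, steady part tM, falling part tE."""
    name: str
    rising: SegmentParams
    steady: SegmentParams
    falling: SegmentParams
```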

6 is a computation section which, from the parameters calculated by the parameter interpolation processing section 5, finds the digital value of the output current of Fig. 2(c) at the same timing as the interpolation computation, for example at 100 µs intervals. 7 is a digital-to-analog (D/A) converter, which produces a current wave (an analog current) from the digital values found by the computation section 6. 8 is a sound generating section such as a loudspeaker, which generates the voice from the analog current.

Next, the specific interpolation computations of the parameter interpolation processing section 5 will be explained.

As shown in Fig. 4, an utterance of a vowel, for example, has a rising part t_I, a steady part t_M, and a falling part t_E. To build the cross-sectional area parameters A_i (i = 1 to n, the number of acoustic tubes) of some phoneme P, data are constructed for t_I, t_M, and t_E of P as shown in Fig. 3, and similar data are constructed for any other phoneme Q.

Writing the cross-sectional area as A, the phoneme as P, the rising part as I, the steady part as M, the falling part as E, and numbering the acoustic tubes 0 to 17, the parameters of phoneme P are written A_PI0, A_PI1, ..., A_PI17; A_PM0, ..., A_PM17; A_PE0, ..., A_PE17, and likewise those of phoneme Q are written A_QI0, ..., A_QI17; A_QM0, ..., A_QM17; A_QE0, ..., A_QE17.

The cross-sectional area parameters of the acoustic tubes A_0 to A_17 are given time parameters T_P and T_Q for each of the phonemes P and Q by the phoneme processing section in the preceding stage. These time parameters T_P and T_Q are apportioned among t_I, t_M, and t_E at ratios predetermined for each phoneme (for example, for phoneme P, rising part I = 1, steady part M = 3, falling part E = 2; for phoneme Q, I = 1, M = 2, E = 1).

When the phonemes P and Q are uttered in sequence starting at time 0, the ratios above divide T_P at (1/6)T_P and (4/6)T_P and T_Q at (1/4)T_Q and (3/4)T_Q, and the interpolation proceeds as follows:

(1) From time 0 to (1/6)T_P, the current value A_i is moved at each interpolation period toward the target A_PIi (the rising-part cross-sectional area of phoneme P in acoustic tube i). The linear interpolation formula is
A_i(t) = (A_PIi - A_i(t=0)) · t/((1/6)T_P) + A_i(t=0).

(2) From (1/6)T_P to (4/6)T_P, the target at each interpolation period is A_PMi:
A_i(t) = (A_PMi - A_i(t=(1/6)T_P)) · (t - (1/6)T_P)/((3/6)T_P) + A_i(t=(1/6)T_P).

(3) From (4/6)T_P to T_P, the target is A_PEi:
A_i(t) = (A_PEi - A_i(t=(4/6)T_P)) · (t - (4/6)T_P)/((2/6)T_P) + A_i(t=(4/6)T_P).

(4) From T_P to T_P + (1/4)T_Q, the target is A_QIi:
A_i(t) = (A_QIi - A_i(t=T_P)) · (t - T_P)/((1/4)T_Q) + A_i(t=T_P).

(5) From T_P + (1/4)T_Q to T_P + (3/4)T_Q, the target is A_QMi:
A_i(t) = (A_QMi - A_i(t=T_P+(1/4)T_Q)) · (t - T_P - (1/4)T_Q)/((2/4)T_Q) + A_i(t=T_P+(1/4)T_Q).

(6) From T_P + (3/4)T_Q to T_P + T_Q, the target is A_QEi:
A_i(t) = (A_QEi - A_i(t=T_P+(3/4)T_Q)) · (t - T_P - (3/4)T_Q)/((1/4)T_Q) + A_i(t=T_P+(3/4)T_Q).

The above is an example of linear interpolation; exponential-approximation interpolation and the like can be expressed in the same way.
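A sketch of the six-interval schedule above, assuming the linear form just reconstructed; the durations and target areas in the usage lines are illustrative values, not data from the patent.

```python
def interpolate_time_axis(a0, targets, boundaries, t):
    # Piecewise linear interpolation of one tube's area A_i(t): within each
    # interval, move linearly from the value reached at the interval start
    # toward that interval's target, arriving exactly at the boundary.
    #   a0         -- value of A_i at time 0
    #   targets    -- targets per interval, e.g. [A_PIi, A_PMi, A_PEi, A_QIi, A_QMi, A_QEi]
    #   boundaries -- interval end times, e.g. [Tp/6, 4Tp/6, Tp, Tp+Tq/4, Tp+3Tq/4, Tp+Tq]
    start_t, start_a = 0.0, a0
    for end_t, target in zip(boundaries, targets):
        if t <= end_t:
            return start_a + (target - start_a) * (t - start_t) / (end_t - start_t)
        start_t, start_a = end_t, target
    return targets[-1]

# usage: phoneme P (Tp = 60 ms, ratios 1:3:2) followed by Q (Tq = 40 ms, 1:2:1)
Tp, Tq = 60.0, 40.0
bounds = [Tp / 6, 4 * Tp / 6, Tp, Tp + Tq / 4, Tp + 3 * Tq / 4, Tp + Tq]
targs = [2.0, 3.5, 2.5, 1.5, 1.0, 1.2]  # A_PIi, A_PMi, A_PEi, A_QIi, A_QMi, A_QEi
print([round(interpolate_time_axis(1.0, targs, bounds, t), 3)
       for t in (0, 10, 30, 60, 80, 100)])
```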

These cross-sectional area parameters regard the vocal tract as n consecutive stages of acoustic tube (17 stages in the embodiment) divided at equal intervals along the vocal tract, and the cross-sectional area of each stage is held for each phoneme. The synthesized sound is generated by smoothly interpolating the cross-sectional area value of each stage along the time axis. The so-called shape of the mouth can therefore be simulated well, but the movement of the tongue inside the oral cavity cannot be simulated well.

The present invention therefore performs interpolation not only in the time-axis direction but also in the vocal-tract direction.

To simulate the movement of the tongue, interpolation is performed only within the range of acoustic tube stages corresponding to the positions of the vocal tract where the tongue can move.

That is, denoting the vocal tract positions where the tongue can move by KI to KE, the movement of the tongue is interpolated within this range KI to KE. Here 0 ≤ KI ≤ KE ≤ n-1 (n being the number of acoustic tube stages), and the value of A_i at an arbitrary time t is obtained by the cross-sectional area interpolation described above when 0 ≤ i ≤ KI or KE ≤ i ≤ n-1.

Next, this specific interpolation computation will be explained.

The following computation is performed within the range KI ≤ i ≤ KE.

With the other conditions (phoneme and time) held constant, the cross-sectional area parameter A_i is compared with the parameters of the preceding and following stages, A_(i-1) and A_(i+1). The value of i for which A_i < A_(i-1) and A_i < A_(i+1), that is, for which A_i is a valley (the mark X in Fig. 4), is found.
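A minimal sketch of this valley search over the tongue range KI to KE; the area profile in the usage lines is an invented illustration.

```python
def find_valley(areas, ki, ke):
    # Return the first index i in [ki, ke] whose area is strictly smaller
    # than both neighbours (A_i < A_{i-1} and A_i < A_{i+1}); None if no
    # valley exists in the range.
    for i in range(max(ki, 1), min(ke, len(areas) - 2) + 1):
        if areas[i] < areas[i - 1] and areas[i] < areas[i + 1]:
            return i
    return None

# usage: the constriction made by the tongue shows up as the valley at index 4
profile = [8, 7, 5, 3, 2, 3, 5, 6, 7, 8, 8, 8, 7, 6, 6, 7, 8]
print(find_valley(profile, ki=2, ke=12))  # -> 4
```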

These values of i are taken as a time series, set against the phoneme data from the phoneme processing section, at the times t_PI, t_PM, t_PE, t_QI, t_QM, t_QE.

In this time series of i, when the change Δi of i has an absolute value of 2 or more, the times before and after the change are denoted tα and tβ.

The interval between tα and tβ is divided into |Δi| equal parts, and the resulting instants are denoted t_1, t_2, ..., t_(|Δi|-1).

The cross-sectional area parameters A_(i+1), A_(i+2), ..., A_(i+Δi-1) lying between the valley parameter A_i at tα and the valley A_(i+Δi) at tβ (the dotted section in Fig. 4) are then interpolated; up to time tα the interpolation follows the ordinary method described above.

Between tα and tβ, interpolation is performed as follows.

(1) For A_(i+1): from tα to tα + t_1,
A_(i+1)(t) = (A_(i+1)(tα + t_1) - A_(i+1)(tα)) · (t - tα)/t_1 + A_(i+1)(tα),
and from tα + t_1 to tβ,
A_(i+1)(t) = (A_(i+1)(tβ) - A_(i+1)(tα + t_1)) · (t - tα - t_1)/(tβ - tα - t_1) + A_(i+1)(tα + t_1).

(2) For A_(i+2): from tα to tα + t_2,
A_(i+2)(t) = (A_(i+2)(tα + t_2) - A_(i+2)(tα)) · (t - tα)/t_2 + A_(i+2)(tα),
and from tα + t_2 to tβ,
A_(i+2)(t) = (A_(i+2)(tβ) - A_(i+2)(tα + t_2)) · (t - tα - t_2)/(tβ - tα - t_2) + A_(i+2)(tα + t_2).

The remaining parameters are computed in the same way.
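A sketch of this valley-tracking interpolation. The reading that each intermediate tube dips to the valley value exactly at its own division time is an assumption made to fill in the garbled formulas; all numbers in the usage lines are illustrative.

```python
import numpy as np

def moving_valley_trajectories(a_start, a_end, valley_value, t_alpha, t_beta,
                               num_samples=7):
    # The valley travels across the intermediate tubes i+1 .. i+di-1 between
    # t_alpha and t_beta. Each tube's area is interpolated in two linear
    # segments: down from its t_alpha value to the valley value at its own
    # division time, then up to its t_beta value.
    #   a_start, a_end -- A_{i+k}(t_alpha) and A_{i+k}(t_beta) for k = 1..di-1
    di = len(a_start) + 1                               # |Δi|: number of equal divisions
    marks = np.linspace(t_alpha, t_beta, di + 1)[1:-1]  # instants t_1 .. t_{di-1}
    ts = np.linspace(t_alpha, t_beta, num_samples)
    return {k + 1: np.interp(ts, [t_alpha, marks[k], t_beta],
                             [a_start[k], valley_value, a_end[k]])
            for k in range(len(a_start))}

# usage: a valley of area 1.0 sweeping across three intermediate tubes (di = 4)
traj = moving_valley_trajectories(np.array([4.0, 5.0, 6.0]),
                                  np.array([6.0, 5.0, 4.0]),
                                  valley_value=1.0, t_alpha=0.0, t_beta=30.0)
for k, a in traj.items():
    print(f"tube i+{k}:", np.round(a, 2))
```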

That is, the interpolation computation is performed not only in the time-axis direction but also in the vocal-tract direction.

The above interpolation computations can be carried out without modifying the phoneme parameters in any way.

Note that, in the articulation between the two phonemes P and Q, if the positions of the respective minima of A_i within the vocal tract range KI to KE are close the movement of the tongue is small, and conversely if they are far apart the movement is large.

G. Effects of the Invention
As described above, the interpolation computation can be performed in the vocal-tract direction as well as in the time-axis direction, so the movement of the tongue can be simulated; articulatory coupling therefore becomes still smoother and natural synthesized speech can be generated. The effect is particularly marked in the articulation between vowels.

Moreover, since this can be carried out without modifying the phoneme parameters in any way, the capacity of the memory that stores the phoneme parameters can remain as small as before.

Furthermore, since in the articulation between phonemes the movement of the tongue is small when the positions of the respective minima of A_i within the vocal tract range KI to KE are close and large when they are far apart, the utterance duration can be adjusted accordingly to increase naturalness further.

BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 is an equivalent circuit diagram of an embodiment of the present invention; Fig. 2 shows the equivalent model of the acoustic tubes and the manner of computation; Fig. 3 is an explanatory diagram of the phoneme parameters; and Fig. 4 shows the transition of the vocal tract cross-sectional area within the oral cavity for explaining the present invention.
1: Japanese-language processing section; 2: sentence processing section; 3: syllable processing section; 4: phoneme processing section; 5: parameter interpolation processing section; 6: computation section; 7: D/A conversion section; 8: sound generating section.
(Figure captions: Fig. 1, equivalent circuit diagram of the embodiment; Fig. 4, tα and tβ.)

Claims (1)

[Claims]
(1) A speech synthesis system in which the human vocal tract is regarded as a plurality of acoustic tubes joined in cascade, the group of acoustic tubes is made to correspond to a group of circuit elements of surge impedance components and the sound source to a current source, and the speech wave emitted from the output end of the acoustic tube group is thereby created in simulation on the basis of the current wave at the output end of the circuit element group, the system comprising:
a phoneme parameter storage section which, for each phoneme constituting a syllable, divides the utterance time of the phoneme into one or more time segments and stores, for each time segment, the initial values of the pitch (the repetition frequency of the source wave), of the energy of the source wave, and of the cross-sectional areas of the acoustic tubes, together with constants defining how each initial value X_0 of that segment changes to the corresponding initial value X_r of the next segment, and the source wave pattern;
a parameter interpolation processing section which selects from the phoneme parameter storage section the initial values of pitch, energy, and cross-sectional area corresponding to input phoneme data and performs interpolation on these initial values;
a computation section which computes the current value output from the output end of the circuit element group on the basis of the parameters interpolated there and the source wave pattern in the phoneme parameter storage section corresponding to the input phoneme data; and
a sound generating section which produces speech on the basis of the computation results of the computation section;
wherein the parameter interpolation section, within the range of acoustic tube stages corresponding to the vocal tract positions where the tongue can move, compares each cross-sectional area parameter with those of the preceding and following stages, extracts the cross-sectional area values that are smaller than both and thus form valleys as a time series over the sounds of successive phonemes, divides the interval before and after any change of this series whose absolute value is 2 or more into equal parts of arbitrary duration, and joins the value of the change at each of the equally divided times so as to interpolate the cross-sectional area.
JP12253588A 1988-05-19 1988-05-19 Speech synthesis system Pending JPH01292400A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP12253588A JPH01292400A (en) 1988-05-19 1988-05-19 Speech synthesis system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP12253588A JPH01292400A (en) 1988-05-19 1988-05-19 Speech synthesis system

Publications (1)

Publication Number Publication Date
JPH01292400A true JPH01292400A (en) 1989-11-24

Family

ID=14838263

Family Applications (1)

Application Number Title Priority Date Filing Date
JP12253588A Pending JPH01292400A (en) 1988-05-19 1988-05-19 Speech synthesis system

Country Status (1)

Country Link
JP (1) JPH01292400A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7216074B2 (en) 2001-10-04 2007-05-08 At&T Corp. System for bandwidth extension of narrow-band speech
US7613604B1 (en) 2001-10-04 2009-11-03 At&T Intellectual Property Ii, L.P. System for bandwidth extension of narrow-band speech
US8069038B2 (en) 2001-10-04 2011-11-29 At&T Intellectual Property Ii, L.P. System for bandwidth extension of narrow-band speech
US8595001B2 (en) 2001-10-04 2013-11-26 At&T Intellectual Property Ii, L.P. System for bandwidth extension of narrow-band speech

Similar Documents

Publication Publication Date Title
JP3563772B2 (en) Speech synthesis method and apparatus, and speech synthesis control method and apparatus
US5097511A (en) Sound synthesizing method and apparatus
JPH01292400A (en) Speech synthesis system
JP2990693B2 (en) Speech synthesizer
JP2992995B2 (en) Speech synthesizer
Sondhi Articulatory modeling: a possible role in concatenative text-to-speech synthesis
JP2990691B2 (en) Speech synthesizer
JPH01219899A (en) Speech synthesizing device
JPH0833749B2 (en) Sound synthesis method
JPS63257000A (en) Voice synthesization
JPH01177097A (en) Voice synthesizing system
JPH01219898A (en) Speech synthesizing device
JPH01219896A (en) Speech synthesis system
JPH01177098A (en) Voice synthesizing device
JPH0833751B2 (en) Speech synthesis method
JPH01185700A (en) Speech synthesizing method
JPH01167800A (en) Sound synthesizing method
JPH01177096A (en) Voice synthesizing system
JPH01171000A (en) Voice synthesis system
JPS63311399A (en) Sound synchronization
JPH01182900A (en) Voice synthesizing system
JPH01177100A (en) Voice synthesizing system
JPH01197799A (en) Articulation and sound source parameter producing method for sound synthesizer
JP2725381B2 (en) Inflection control method for speech synthesizer
JPH01219894A (en) Speech synthesis system