
JPH01292400A - Speech synthesis system - Google Patents

Speech synthesis system

Info

Publication number
JPH01292400A
JPH01292400A (Application JP12253588A)
Authority
JP
Japan
Prior art keywords
phoneme
time
interpolation
value
sectional area
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP12253588A
Other languages
Japanese (ja)
Inventor
Takahiro Suzuki
貴博 鈴木
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Meidensha Corp
Meidensha Electric Manufacturing Co Ltd
Original Assignee
Meidensha Corp
Meidensha Electric Manufacturing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Meidensha Corp, Meidensha Electric Manufacturing Co Ltd filed Critical Meidensha Corp
Priority to JP12253588A priority Critical patent/JPH01292400A/en
Publication of JPH01292400A publication Critical patent/JPH01292400A/en
Pending legal-status Critical Current


Landscapes

  • Electrophonic Musical Instruments (AREA)

Abstract

PURPOSE: To generate a natural synthesized voice with smoother articulatory coupling by performing interpolation computations both along the time axis and along the vocal tract, thereby simulating the motion of the tongue. CONSTITUTION: A parameter interpolation processing section 5 performs interpolation not only in the time-axis direction but also in the vocal-tract direction. To simulate the motion of the tongue, the tongue-movement interpolation is carried out only within the range of acoustic tube stages corresponding to the positions of the vocal tract where the tongue can move. Namely, denoting those positions KI to KE, 0<=KI<=KE<=n-1 holds (n: number of acoustic tube stages), and for the cross-sectional area parameter Ai at an arbitrary time the ordinary cross-sectional area interpolation is performed when 0<=i<=KI or KE<=i<=n-1. A computation section 6 then finds the digital value of the output current at the same timing as the interpolation computation, and this value is D/A-converted by a converter 7 so that a voice is generated by a sound generating section 8.

Description

DETAILED DESCRIPTION OF THE INVENTION
A. Field of Industrial Application
The present invention relates to a speech synthesis device using an acoustic tube model.

B. Summary of the Invention
The present invention regards the human vocal tract as a group of acoustic tubes and maps them onto a group of circuit elements characterized by surge impedance components, so that speech is simulated from the current wave at the output end of the circuit element group. In this scheme, the interpolation computation is performed not only along the time axis but also along the vocal tract, so that the movement of the tongue is also simulated, articulatory coupling becomes smoother, and a more natural synthesized voice is produced.

C. Prior Art
Electronic devices that artificially synthesize and output sound, such as speech synthesizers and music synthesizers (electronic musical instruments), have recently become realizable at low cost as speech recognition and speech synthesis LSIs of one to several chips, thanks to speech information processing and large-scale semiconductor integration technology, and various schemes have been proposed depending on the purpose of use and the applicable constraints. Speech synthesis methods fall into two classes: the recorded-speech editing method, in which raw human utterances are recorded and then combined and edited appropriately into sentences, and the parameter method, which does not use the human voice directly but extracts only the parameters of human speech and controls those parameters during the synthesis process to create a speech signal artificially.

In the parameter method, the speech waveform is sampled at a fixed period, the value of the speech signal at each sampling point is analog-to-digital converted, and the values are represented as codes of 0s and 1s. To record faithfully to the analog signal, however, the number of bits must be increased, which requires a large memory capacity.

Therefore, various highly efficient encoding methods have been researched and developed to reduce this amount of information as much as possible.

One such method is delta modulation, which assigns a minimum of one bit to each speech signal sample. The single bit is used to indicate whether the next signal value is higher or lower than the current one: a code "1" is assigned if it is higher and a code "0" if it is lower, and the speech signal is encoded in this way. In a practical system configuration a fixed amplitude step (the delta) is defined and, so that errors do not accumulate, encoding is applied to the residual signal between the speech value reconstructed from the codes produced so far and the incoming speech signal.

Such schemes are called predictive coding; they include the linear prediction method (which predicts from several preceding sample values) and the PARCOR method (which replaces the prediction coefficients of the linear prediction method with PARCOR coefficients, i.e. partial autocorrelation coefficients).

D. Problems to Be Solved by the Invention
Among conventional speech synthesis methods, the recorded-speech editing method has the problem that the vocabulary and the kinds of sentences that can be synthesized are limited.

In methods using predictive coding, the articulatory coupling that corresponds to the junction between sounds is difficult, and no method of joining synthesis units has been established. For example, in an utterance that runs from a vowel through a consonant to another vowel, the sound breaks at the vowel junctions as it passes from the steady part of the first vowel through a transition to the consonant and then through another transition to the steady part of the next vowel. The sound therefore lacks smoothness and does not feel natural to a human listener.

The object of the present invention is to provide a speech synthesis device that can synthesize arbitrary vocabulary and sentences, and whose output is smooth, close to actual human speech, and natural to the listener.

E. Means and Action for Solving the Problems
To radiate speech out of the mouth a sound source is required, and this source is created by the vocal cords. The vocal cords interrupt the breath intermittently by opening and closing two folds; this interruption produces a flow of air called puffs. When the vocal cords are tensed, tension is applied to the folds, the frequency of their opening and closing rises, and a higher-frequency puff sound is produced. Increasing the breath flow makes the sound louder.

When this source wave passes through a cylindrical acoustic tube such as the vocal tract, resonance emphasizes some components of the sound wave radiated from the open end and attenuates others, producing the complex waveforms of vowels.

Even when the source wave has the same waveform, the voice emitted from the mouth is affected by the shape of the vocal tract it passes through before being radiated from the lips. That is, human speech sounds are determined by the length and cross-sectional area of the vocal tract from the vocal cords to the lips and by the manner in which the vocal cords vibrate.

The present invention was made with this in mind. Its starting point is to regard the vocal tract as a group of acoustic tubes of variable cross-sectional area and to realize the traveling-wave phenomenon that describes the propagation of sound waves by means of an equivalent circuit.

From the above, the vocal tract is represented by the equivalent model of Fig. 2(a), in which n one-dimensional acoustic tubes with cross-sectional areas A_1 to A_n are connected in cascade. Regarding each acoustic tube S_1 to S_n as a lossless distributed-constant line T_1 to T_n consisting only of L and C components converts the model into the electric circuit of Fig. 2(b), and the output waveform obtained when an impulsive voltage or current signal substituted for the sound source is applied to this circuit can be regarded as a response output corresponding to the speech waveform. Changing the articulatory state of the vocal tract then amounts to changing the surge impedance by changing the constants L and C of each line T_1 to T_n.

In the equivalent circuit of Fig. 2(b), when an impulsive current substituted for the sound source is applied, each line T_1 to T_n has a current source whose internal resistance is its surge impedance Z_1 to Z_n, and the model is converted into an equivalent circuit with reflection and transmission at the junction of each pair of adjacent lines.

This equivalent circuit is shown in Fig. 2(c). In the figure, E is the voltage of the voltage source corresponding to the sound source, Z_0 is its output impedance, Z_L is the radiation impedance at the lips, i_0A to i_(n-1)A are the forward-wave currents at the junctions of the lines T_1 to T_n, i_0B to i_(n-1)B are the backward-wave currents at those junctions, Z_1 to Z_n are the surge impedances ρc/A_1 to ρc/A_n determined by the cross-sectional areas A_1 to A_n, the air density ρ, and the speed of sound c, I_1A to I_(n-1)A and I_1B to I_(n-1)B are the current-source currents appearing at the junctions of the lines T_1 to T_n, and a_0A to a_(n-1)A and a_0B to a_(n-1)B are the shunt currents of the current sources.

The relations among these currents are as shown in Fig. 2(d). In the equations, S_1A = A_1/(A_1 + A_2), S_1B = A_2/(A_1 + A_2), and so on up to S_(n-1)A = A_(n-1)/(A_(n-1) + A_n), S_(n-1)B = A_n/(A_(n-1) + A_n). The memory items denote items that hold the computation result of one step before the current value. To evaluate these relational expressions, the memory items are given the initial value 0; with the voltage E applied, each memory item is computed sequentially, line by line up to T_n, at the fundamental period, the final-stage current i_nB is obtained at the fundamental period, and the articulated speech output data is obtained from this current i_nB.

Here, the equivalent circuit of Fig. 2(c) is a linear circuit, so the principle of superposition holds: the currents at each point can be found when the voltage E is set to zero and a voltage E_L is applied in series with the radiation impedance Z_L, that is, the current through the impedance Z_0 can be found. Since the cross-sectional area of the vocal tract can thus be varied in simulation by varying the surge impedances, varying the surge impedances continuously makes the articulatory coupling between phonemes or syllables smooth.

The present invention is therefore characterized in that the interpolation of the cross-sectional area changes of the cascaded acoustic tubes is performed not only along the time axis but also along the vocal tract, so that the movement of the tongue within the oral cavity is also simulated.

F. Embodiment
Fig. 1 shows the block configuration of an embodiment of the present invention. 1 is a Japanese-language processing section, which segments an input Japanese sentence into phrases and performs kana-reading conversion and the like by referring to a dictionary. 2 is a sentence processing section, which adds intonation to the sentence. 3 is a syllable processing section, which adds accents, in accordance with the intonation, to the syllables that make up the sentence. 4 is a phoneme processing section and 4_1 is a syllable parameter storage section; the phoneme processing section 4 decomposes the input syllable data into phonemes by referring to the data in the syllable parameter storage section 4_1, which defines the correspondence between syllables and phonemes, the units of vowels and consonants.

5 is a parameter interpolation processing section, 5_1 is a phoneme parameter storage section, and 5_2 is a sound source parameter storage section.

As shown in Fig. 3, the phoneme parameter storage section 5_1 divides the utterance time of each phoneme into a plurality of time segments, for example three: a rising part t_I, a steady part t_M, and a falling part t_E. For each time segment it stores the duration; the initial values of the pitch (the repetition frequency of the source wave), of the energy of the source wave, and of the cross-sectional areas of the acoustic tubes; time constants defining how each initial value of the segment changes toward the corresponding initial value of the next segment; and the source wave pattern. In this embodiment the human vocal tract (about 17 cm for a male) is modeled as 17 cascaded acoustic tubes each 1 cm long, so 17 cross-sectional area values (A_1 to A_17) are defined per time segment. The parameter interpolation processing section 5 is the part that performs the interpolation of pitch, energy, and cross-sectional area in each time segment (t_I to t_E).
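The contents of the phoneme parameter storage section 5_1 can be pictured as the following data structure; the field names and types are a sketch inferred from the description of Fig. 3, not the patent's actual storage layout.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SegmentParams:
    """Parameters held for one time segment (rising, steady, or falling)."""
    duration_ratio: int      # share of the phoneme's time parameter, e.g. I:M:E = 1:3:2
    pitch: float             # initial pitch: repetition frequency of the source wave
    energy: float            # initial energy of the source wave
    areas: List[float]       # initial cross-sectional areas A1..A17 of the 17 tubes
    time_constant: float     # governs the change toward the next segment's initial values
    source_pattern: str      # identifier of the stored source wave pattern

@dataclass
class PhonemeParams:
    """One phoneme: rising part tI, steady part tM, falling part tE."""
    name: str
    rising: SegmentParams
    steady: SegmentParams
    falling: SegmentParams
```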

6 is a computation section which, from the parameters calculated by the parameter interpolation processing section 5, finds the digital value of the output current of Fig. 2(c) at the same timing as the interpolation computation, for example at 100 µs intervals. 7 is a digital-to-analog (D/A) converter, which produces a current wave (an analog current) from the digital values found by the computation section 6. 8 is a sound generating section such as a loudspeaker, which generates the voice from the analog current.

Next, the specific interpolation computations of the parameter interpolation processing section 5 will be explained.

As shown in Fig. 4, an utterance of a vowel, for example, has a rising part t_I, a steady part t_M, and a falling part t_E. To build the cross-sectional area parameters A_i (i = 1 to n, the number of acoustic tubes) of some phoneme P, data are constructed for t_I, t_M, and t_E of P as shown in Fig. 3, and similar data are constructed for any other phoneme Q.

Writing the cross-sectional area as A, the phoneme as P, the rising part as I, the steady part as M, the falling part as E, and numbering the acoustic tubes 0 to 17, the parameters of phoneme P are written A_PI0, A_PI1, ..., A_PI17; A_PM0, ..., A_PM17; A_PE0, ..., A_PE17, and likewise those of phoneme Q are written A_QI0, ..., A_QI17; A_QM0, ..., A_QM17; A_QE0, ..., A_QE17.

The cross-sectional area parameters of the acoustic tubes A_0 to A_17 are given time parameters T_P and T_Q for each of the phonemes P and Q by the phoneme processing section in the preceding stage. These time parameters T_P and T_Q are apportioned among t_I, t_M, and t_E at ratios predetermined for each phoneme (for example, for phoneme P, rising part I = 1, steady part M = 3, falling part E = 2; for phoneme Q, I = 1, M = 2, E = 1).

When the phonemes P and Q are uttered in sequence starting at time 0, the ratios above divide T_P at (1/6)T_P and (4/6)T_P and T_Q at (1/4)T_Q and (3/4)T_Q, and the interpolation proceeds as follows:

(1) From time 0 to (1/6)T_P, the current value A_i is moved at each interpolation period toward the target A_PIi (the rising-part cross-sectional area of phoneme P in acoustic tube i). The linear interpolation formula is
A_i(t) = (A_PIi - A_i(t=0)) · t/((1/6)T_P) + A_i(t=0).

(2) From (1/6)T_P to (4/6)T_P, the target at each interpolation period is A_PMi:
A_i(t) = (A_PMi - A_i(t=(1/6)T_P)) · (t - (1/6)T_P)/((3/6)T_P) + A_i(t=(1/6)T_P).

(3) From (4/6)T_P to T_P, the target is A_PEi:
A_i(t) = (A_PEi - A_i(t=(4/6)T_P)) · (t - (4/6)T_P)/((2/6)T_P) + A_i(t=(4/6)T_P).

(4) From T_P to T_P + (1/4)T_Q, the target is A_QIi:
A_i(t) = (A_QIi - A_i(t=T_P)) · (t - T_P)/((1/4)T_Q) + A_i(t=T_P).

(5) From T_P + (1/4)T_Q to T_P + (3/4)T_Q, the target is A_QMi:
A_i(t) = (A_QMi - A_i(t=T_P+(1/4)T_Q)) · (t - T_P - (1/4)T_Q)/((2/4)T_Q) + A_i(t=T_P+(1/4)T_Q).

(6) From T_P + (3/4)T_Q to T_P + T_Q, the target is A_QEi:
A_i(t) = (A_QEi - A_i(t=T_P+(3/4)T_Q)) · (t - T_P - (3/4)T_Q)/((1/4)T_Q) + A_i(t=T_P+(3/4)T_Q).

The above is an example of linear interpolation; exponential-approximation interpolation and the like can be expressed in the same way.
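A sketch of the six-interval schedule above, assuming the linear form just reconstructed; the durations and target areas in the usage lines are illustrative values, not data from the patent.

```python
def interpolate_time_axis(a0, targets, boundaries, t):
    # Piecewise linear interpolation of one tube's area A_i(t): within each
    # interval, move linearly from the value reached at the interval start
    # toward that interval's target, arriving exactly at the boundary.
    #   a0         -- value of A_i at time 0
    #   targets    -- targets per interval, e.g. [A_PIi, A_PMi, A_PEi, A_QIi, A_QMi, A_QEi]
    #   boundaries -- interval end times, e.g. [Tp/6, 4Tp/6, Tp, Tp+Tq/4, Tp+3Tq/4, Tp+Tq]
    start_t, start_a = 0.0, a0
    for end_t, target in zip(boundaries, targets):
        if t <= end_t:
            return start_a + (target - start_a) * (t - start_t) / (end_t - start_t)
        start_t, start_a = end_t, target
    return targets[-1]

# usage: phoneme P (Tp = 60 ms, ratios 1:3:2) followed by Q (Tq = 40 ms, 1:2:1)
Tp, Tq = 60.0, 40.0
bounds = [Tp / 6, 4 * Tp / 6, Tp, Tp + Tq / 4, Tp + 3 * Tq / 4, Tp + Tq]
targs = [2.0, 3.5, 2.5, 1.5, 1.0, 1.2]  # A_PIi, A_PMi, A_PEi, A_QIi, A_QMi, A_QEi
print([round(interpolate_time_axis(1.0, targs, bounds, t), 3)
       for t in (0, 10, 30, 60, 80, 100)])
```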

These cross-sectional area parameters regard the vocal tract as n consecutive stages of acoustic tube (17 stages in the embodiment) divided at equal intervals along the vocal tract, and the cross-sectional area of each stage is held for each phoneme. The synthesized sound is generated by smoothly interpolating the cross-sectional area value of each stage along the time axis. The so-called shape of the mouth can therefore be simulated well, but the movement of the tongue inside the oral cavity cannot be simulated well.

The present invention therefore performs interpolation not only in the time-axis direction but also in the vocal-tract direction.

To simulate the movement of the tongue, interpolation is performed only within the range of acoustic tube stages corresponding to the positions of the vocal tract where the tongue can move.

That is, denoting the vocal tract positions where the tongue can move by KI to KE, the movement of the tongue is interpolated within this range KI to KE. Here 0 ≤ KI ≤ KE ≤ n-1 (n being the number of acoustic tube stages), and the value of A_i at an arbitrary time t is obtained by the cross-sectional area interpolation described above when 0 ≤ i ≤ KI or KE ≤ i ≤ n-1.

Next, this specific interpolation computation will be explained.

The following computation is performed within the range KI ≤ i ≤ KE.

With the other conditions (phoneme and time) held constant, the cross-sectional area parameter A_i is compared with the parameters of the preceding and following stages, A_(i-1) and A_(i+1). The value of i for which A_i < A_(i-1) and A_i < A_(i+1), that is, for which A_i is a valley (the mark X in Fig. 4), is found.
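A minimal sketch of this valley search over the tongue range KI to KE; the area profile in the usage lines is an invented illustration.

```python
def find_valley(areas, ki, ke):
    # Return the first index i in [ki, ke] whose area is strictly smaller
    # than both neighbours (A_i < A_{i-1} and A_i < A_{i+1}); None if no
    # valley exists in the range.
    for i in range(max(ki, 1), min(ke, len(areas) - 2) + 1):
        if areas[i] < areas[i - 1] and areas[i] < areas[i + 1]:
            return i
    return None

# usage: the constriction made by the tongue shows up as the valley at index 4
profile = [8, 7, 5, 3, 2, 3, 5, 6, 7, 8, 8, 8, 7, 6, 6, 7, 8]
print(find_valley(profile, ki=2, ke=12))  # -> 4
```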

These values of i are taken as a time series, set against the phoneme data from the phoneme processing section, at the times t_PI, t_PM, t_PE, t_QI, t_QM, t_QE.

In this time series of i, when the change Δi of i has an absolute value of 2 or more, the times before and after the change are denoted tα and tβ.

The interval between tα and tβ is divided into |Δi| equal parts, and the resulting instants are denoted t_1, t_2, ..., t_(|Δi|-1).

The cross-sectional area parameters A_(i+1), A_(i+2), ..., A_(i+Δi-1) lying between the valley parameter A_i at tα and the valley A_(i+Δi) at tβ (the dotted section in Fig. 4) are then interpolated; up to time tα the interpolation follows the ordinary method described above.

Between tα and tβ, interpolation is performed as follows.

(1) For A_(i+1): from tα to tα + t_1,
A_(i+1)(t) = (A_(i+1)(tα + t_1) - A_(i+1)(tα)) · (t - tα)/t_1 + A_(i+1)(tα),
and from tα + t_1 to tβ,
A_(i+1)(t) = (A_(i+1)(tβ) - A_(i+1)(tα + t_1)) · (t - tα - t_1)/(tβ - tα - t_1) + A_(i+1)(tα + t_1).

(2) For A_(i+2): from tα to tα + t_2,
A_(i+2)(t) = (A_(i+2)(tα + t_2) - A_(i+2)(tα)) · (t - tα)/t_2 + A_(i+2)(tα),
and from tα + t_2 to tβ,
A_(i+2)(t) = (A_(i+2)(tβ) - A_(i+2)(tα + t_2)) · (t - tα - t_2)/(tβ - tα - t_2) + A_(i+2)(tα + t_2).

The remaining parameters are computed in the same way.
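A sketch of this valley-tracking interpolation. The reading that each intermediate tube dips to the valley value exactly at its own division time is an assumption made to fill in the garbled formulas; all numbers in the usage lines are illustrative.

```python
import numpy as np

def moving_valley_trajectories(a_start, a_end, valley_value, t_alpha, t_beta,
                               num_samples=7):
    # The valley travels across the intermediate tubes i+1 .. i+di-1 between
    # t_alpha and t_beta. Each tube's area is interpolated in two linear
    # segments: down from its t_alpha value to the valley value at its own
    # division time, then up to its t_beta value.
    #   a_start, a_end -- A_{i+k}(t_alpha) and A_{i+k}(t_beta) for k = 1..di-1
    di = len(a_start) + 1                               # |Δi|: number of equal divisions
    marks = np.linspace(t_alpha, t_beta, di + 1)[1:-1]  # instants t_1 .. t_{di-1}
    ts = np.linspace(t_alpha, t_beta, num_samples)
    return {k + 1: np.interp(ts, [t_alpha, marks[k], t_beta],
                             [a_start[k], valley_value, a_end[k]])
            for k in range(len(a_start))}

# usage: a valley of area 1.0 sweeping across three intermediate tubes (di = 4)
traj = moving_valley_trajectories(np.array([4.0, 5.0, 6.0]),
                                  np.array([6.0, 5.0, 4.0]),
                                  valley_value=1.0, t_alpha=0.0, t_beta=30.0)
for k, a in traj.items():
    print(f"tube i+{k}:", np.round(a, 2))
```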

That is, the interpolation computation is performed not only in the time-axis direction but also in the vocal-tract direction.

The above interpolation computations can be carried out without modifying the phoneme parameters in any way.

Note that, in the articulation between the two phonemes P and Q, if the positions of the respective minima of A_i within the vocal tract range KI to KE are close the movement of the tongue is small, and conversely if they are far apart the movement is large.

G. Effects of the Invention
As described above, the interpolation computation can be performed in the vocal-tract direction as well as in the time-axis direction, so the movement of the tongue can be simulated; articulatory coupling therefore becomes still smoother and natural synthesized speech can be generated. The effect is particularly marked in the articulation between vowels.

Moreover, since this can be carried out without modifying the phoneme parameters in any way, the capacity of the memory that stores the phoneme parameters can remain as small as before.

Furthermore, since in the articulation between phonemes the movement of the tongue is small when the positions of the respective minima of A_i within the vocal tract range KI to KE are close and large when they are far apart, the utterance duration can be adjusted accordingly to increase naturalness further.

BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 is an equivalent circuit diagram of an embodiment of the present invention; Fig. 2 shows the equivalent model of the acoustic tubes and the manner of computation; Fig. 3 is an explanatory diagram of the phoneme parameters; and Fig. 4 shows the transition of the vocal tract cross-sectional area within the oral cavity for explaining the present invention.
1: Japanese-language processing section; 2: sentence processing section; 3: syllable processing section; 4: phoneme processing section; 5: parameter interpolation processing section; 6: computation section; 7: D/A conversion section; 8: sound generating section.
(Figure captions: Fig. 1, equivalent circuit diagram of the embodiment; Fig. 4, tα and tβ.)

Claims (1)

[Claims]
(1) A speech synthesis system in which the human vocal tract is regarded as a plurality of acoustic tubes joined in cascade, the group of acoustic tubes is made to correspond to a group of circuit elements of surge impedance components and the sound source to a current source, and the speech wave emitted from the output end of the acoustic tube group is thereby created in simulation on the basis of the current wave at the output end of the circuit element group, the system comprising:
a phoneme parameter storage section which, for each phoneme constituting a syllable, divides the utterance time of the phoneme into one or more time segments and stores, for each time segment, the initial values of the pitch (the repetition frequency of the source wave), of the energy of the source wave, and of the cross-sectional areas of the acoustic tubes, together with constants defining how each initial value X_0 of that segment changes to the corresponding initial value X_r of the next segment, and the source wave pattern;
a parameter interpolation processing section which selects from the phoneme parameter storage section the initial values of pitch, energy, and cross-sectional area corresponding to input phoneme data and performs interpolation on these initial values;
a computation section which computes the current value output from the output end of the circuit element group on the basis of the parameters interpolated there and the source wave pattern in the phoneme parameter storage section corresponding to the input phoneme data; and
a sound generating section which produces speech on the basis of the computation results of the computation section;
wherein the parameter interpolation section, within the range of acoustic tube stages corresponding to the vocal tract positions where the tongue can move, compares each cross-sectional area parameter with those of the preceding and following stages, extracts the cross-sectional area values that are smaller than both and thus form valleys as a time series over the sounds of successive phonemes, divides the interval before and after any change of this series whose absolute value is 2 or more into equal parts of arbitrary duration, and joins the value of the change at each of the equally divided times so as to interpolate the cross-sectional area.
JP12253588A 1988-05-19 1988-05-19 Speech synthesis system Pending JPH01292400A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP12253588A JPH01292400A (en) 1988-05-19 1988-05-19 Speech synthesis system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP12253588A JPH01292400A (en) 1988-05-19 1988-05-19 Speech synthesis system

Publications (1)

Publication Number Publication Date
JPH01292400A true JPH01292400A (en) 1989-11-24

Family

ID=14838263

Family Applications (1)

Application Number Title Priority Date Filing Date
JP12253588A Pending JPH01292400A (en) 1988-05-19 1988-05-19 Speech synthesis system

Country Status (1)

Country Link
JP (1) JPH01292400A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7216074B2 (en) 2001-10-04 2007-05-08 At&T Corp. System for bandwidth extension of narrow-band speech
US7613604B1 (en) 2001-10-04 2009-11-03 At&T Intellectual Property Ii, L.P. System for bandwidth extension of narrow-band speech
US8069038B2 (en) 2001-10-04 2011-11-29 At&T Intellectual Property Ii, L.P. System for bandwidth extension of narrow-band speech
US8595001B2 (en) 2001-10-04 2013-11-26 At&T Intellectual Property Ii, L.P. System for bandwidth extension of narrow-band speech

Similar Documents

Publication Publication Date Title
JP3563772B2 (en) Speech synthesis method and apparatus, and speech synthesis control method and apparatus
US5097511A (en) Sound synthesizing method and apparatus
JPH01292400A (en) Speech synthesis system
JP2990693B2 (en) Speech synthesizer
JP2992995B2 (en) Speech synthesizer
Sondhi Articulatory modeling: a possible role in concatenative text-to-speech synthesis
JP2990691B2 (en) Speech synthesizer
JPH01219899A (en) Speech synthesizing device
JPH0833749B2 (en) Sound synthesis method
JPS63257000A (en) Voice synthesization
JPH01177097A (en) Voice synthesizing system
JPH01219898A (en) Speech synthesizing device
JPH01219896A (en) Speech synthesis system
JPH01177098A (en) Voice synthesizing device
JPH0833751B2 (en) Speech synthesis method
JPH01185700A (en) Speech synthesizing method
JPH01167800A (en) Sound synthesizing method
JPH01177096A (en) Voice synthesizing system
JPH01171000A (en) Voice synthesis system
JPS63311399A (en) Sound synchronization
JPH01182900A (en) Voice synthesizing system
JPH01177100A (en) Voice synthesizing system
JPH01197799A (en) Articulation and sound source parameter producing method for sound synthesizer
JP2725381B2 (en) Inflection control method for speech synthesizer
JPH01219894A (en) Speech synthesis system