JP2880433B2 - Speech synthesizer

Info

Publication number: JP2880433B2
Application number: JP7241460A
Authority: JP
Inventors: 俊男平井; 芳典匂坂; 宜男樋口
Original assignee: Ei Tei Aaru Onsei Honyaku Tsushin Kenkyusho Kk
Current assignee: Ei Tei Aaru Onsei Honyaku Tsushin Kenkyusho Kk
Priority date: 1995-09-20
Filing date: 1995-09-20
Publication date: 1999-04-12
Anticipated expiration: 2015-09-20
Also published as: JPH0990970A

Description

DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

[Technical Field of the Invention] The present invention relates to a speech synthesizer that synthesizes speech based on an input character string.

【０００２】[0002]

[Prior Art] To model the pitch frequency, which is the fundamental frequency of speech (hereinafter, the F₀ frequency), superposition models have conventionally been used, because they can efficiently parameterize the time-series pattern of the F₀ frequency (hereinafter, the F₀ pattern) with a small number of parameters (see, for example, Conventional Document 1: Fujisaki et al., "A model for the fundamental frequency patterns of Japanese word accent and their generation mechanism", Journal of the Acoustical Society of Japan, Vol. 27, No. 9, pp. 445-453, September 1971). The superposition model treats the F₀ pattern as the sum of a phrase component (also called the speech-tone component), which declines gradually from the beginning to the end of a phrase, and accent components corresponding to accent phrases. Parameterizing the F₀ pattern with the superposition model has the following advantages. (1) The model uses few free parameters, so F₀ control is easy to optimize by statistical analysis. (2) The F₀ pattern is separated into two components, the phrase component and the accent component, so the control rules obtained by the optimization are relatively easy to interpret.

[0003] To diversify rule-synthesized speech, conversion rules among three speaking styles, a normal style, a commercial style, and a reading style (hereinafter, the conventional example), have been proposed, for example, in Conventional Document 2: Abe et al., "Changes in speaking style and their evaluation", Proceedings of the Acoustical Society of Japan, 3-P-18, October 1993. In this conventional example, the formant frequency, duration, fundamental frequency, and power parameters are converted to prepare a total of five speaking styles: speech in the normal, commercial, and reading styles, speech converted from the normal style to the commercial style, and speech converted from the normal style to the reading style. The similarity among these styles is then evaluated.

【０００４】[0004]

[Problem to Be Solved by the Invention] However, although the conventional example above addresses conversion of speaking style, it synthesizes speech without considering speaker individuality. The accent height for each accent type differs from person to person, so the conventional example cannot synthesize the voice of one designated speaker.

[0005] An object of the present invention is to solve the above problem and to provide a speech synthesizer capable of synthesizing the voice of one designated speaker.

【０００６】[0006]

[Means for Solving the Problem] A speech synthesizer according to the present invention comprises speech data storage means (11-13), F₀ control rule storage means (31-33), learning means (20, 21), parameter sequence generating means (1), and speech synthesis means (2). The speech data storage means (11-13) stores speech data, and the F₀ control rule storage means (31-33) stores F₀ control rules. The learning means (20, 21) comprises extraction means, generation means, and rule generation means. The extraction means extracts a pitch pattern from the speech data stored in the speech data storage means (11-13). The generation means analyzes the pitch pattern extracted by the extraction means using an analysis method based on the critical damping model, and generates model parameters including accent commands, phrase commands, and accent phrase boundaries. The rule generation means, based on the pitch pattern extracted by the extraction means and the model parameters generated by the generation means, focuses on predetermined control factors to generate F₀ control rules that control the pattern of the pitch frequency of speech, and stores them in the F₀ control rule storage means (31-33). The control factors include the number of morae of a phrase, the number of morae of the preceding phrase, the accent type of an accent phrase, and the position of the accent phrase in the sentence. The parameter sequence generating means (1) generates a pitch frequency pattern based on the input character string and the F₀ control rules stored in the F₀ control rule storage means (31-33), and generates an acoustic parameter sequence that includes the pitch frequency. The speech synthesis means (2) synthesizes speech based on the acoustic parameter sequence output by the parameter sequence generating means (1).

【０００７】[0007]

【０００８】[0008]

【０００９】[0009]

[Embodiments of the Invention] An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram of the speech synthesizer of this embodiment, which includes an F₀ control rule learning unit 20 that generates F₀ control rules for controlling the F₀ frequency. In FIG. 1, the speech synthesizer comprises a parameter sequence generation unit 1 and a speech synthesis unit 2. The parameter sequence generation unit 1 generates a feature parameter sequence for speech synthesis from an input character string, using the selectively connected F₀ control rules of one speaker (one of 31, 32, and 33), voice quality control rules 41, and phoneme duration control rules 42. The speech synthesis unit 2 generates a speech signal from the generated feature parameter sequence and outputs it to a loudspeaker 3. This embodiment is characterized in particular by converting the input character string into the voice of a speaker designated in advance, using the F₀ control rules 31, 32, and 33, which are created for each speaker and control the pitch frequency of speech. The F₀ control rules 31, 32, and 33 control the pitch frequency of speech by controlling the magnitude of the phrase based on the number of morae of the phrase to be synthesized and the number of morae of the preceding phrase, and by controlling the magnitude of the accent phrase based on the accent type of the accent phrase to be synthesized and the position of that accent phrase in the sentence.

[0010] A working memory 21, used as a work area when executing the F₀ control rule learning process described in detail later, is connected to the F₀ control rule learning unit 20. One of the speech data 11, 12, and 13 of speakers A, B, and C is selectively connected to the F₀ control rule learning unit 20 via a switch SW1, and one of the F₀ control rules 31, 32, and 33 of speakers A, B, and C is selectively connected via a switch SW2. The F₀ control rule learning unit 20 switches SW1 and SW2 in an interlocked manner so that the speech data and the F₀ control rules of the same speaker are connected at the same time. In addition, one of the F₀ control rules 31, 32, and 33 of speakers A, B, and C is selectively connected to the parameter sequence generation unit 1 via a switch SW3. The operator sets switch SW3 to select the F₀ control rules of the speaker whose voice is to be synthesized. The conventional voice quality conversion control rules 41 and the conventional phoneme duration control rules 42, described in detail later, are also connected to the parameter sequence generation unit 1.

[0011] In this embodiment, the speech data 11, 12, and 13, the working memory 21, the F₀ control rules 31, 32, and 33, the voice quality control rules 41, and the phoneme duration control rules 42 are held in storage such as a hard disk. The F₀ control rule learning unit 20 and the parameter sequence generation unit 1 are implemented by, for example, a digital computer.

[0012] FIG. 2 is a flowchart of the F₀ control rule learning process executed by the F₀ control rule learning unit 20 of FIG. 1. First, in step S1, the F₀ pattern is extracted from the speech data held in the speech data 11, 12, and 13. In step S2, the model parameters of the critical damping model are generated from the extracted F₀ pattern using an analysis method based on that model. In step S3, control rules that control the pitch frequency of speech are generated from the extracted F₀ pattern and the model parameters of the critical damping model, focusing on predetermined control factors. The control factors are the number of morae of the phrase to be synthesized, the number of morae of the preceding phrase, the accent type of the accent phrase to be synthesized, and the position of that accent phrase in the sentence. The F₀ control rules control the pitch frequency of speech by controlling the magnitude of the phrase based on the number of morae of the phrase to be synthesized and the number of morae of the preceding phrase, and by controlling the magnitude of the accent phrase based on the accent type of the accent phrase to be synthesized and its position in the sentence. The processing of each step is described in detail below.

[0013] First, the processing in step S1 is described. Each of the speech data 11, 12, and 13 contains speech signal data of sentences read aloud by one speaker (also called uttered speech sentences). In step S1, A/D conversion and LPC analysis are applied to the speech signal data to extract feature parameter data. Based on the extracted feature parameter data, the data are analyzed by, for example, a known analysis method based on the critical damping model (see, for example, Conventional Document 3: Fujisaki et al., "Analysis of voice fundamental frequency contours for declarative sentences of Japanese", Journal of the Acoustical Society of Japan (E), Vol. 5, No. 4, pp. 233-244, April 1984), the F₀ pattern and the model parameters are detected, and the results are labeled in phoneme units, accent phrase units, and phrase units. Here, the feature parameter data include log power, 16th-order cepstrum coefficients, Δ log power, and 16th-order Δ cepstrum coefficients. The model parameters include accent commands and phrase commands, together with accent phrase boundary information. The analysis decomposes the F₀ pattern into a phrase component, which is the gradually declining component of the F₀ frequency, and an accent component, which represents the local rises and falls of the F₀ frequency. In the critical damping model, the phrase component and the accent component are treated as the responses of critically damped second-order linear systems to the phrase command and the accent command, respectively. The precise timing and magnitude of each command can be obtained by automatic analysis-by-synthesis, starting from the approximate timing of the phrase and accent commands derived from phoneme labeling information, accent phrase information, and phrase boundary information.
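The decomposition described in this paragraph can be sketched as follows. The formulas are the commonly cited form of the Fujisaki model and are not quoted from this document; the function names and the constants `alpha=3.0`, `beta=20.0`, and `gamma=0.9` are illustrative assumptions.

```python
import math

def phrase_response(t, alpha=3.0):
    """Impulse response of the critically damped phrase mechanism (assumed alpha)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0.0 else 0.0

def accent_response(t, beta=20.0, gamma=0.9):
    """Step response of the critically damped accent mechanism, clipped at gamma."""
    if t < 0.0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)

def log_f0(t, base_hz, phrase_cmds, accent_cmds):
    """ln F0(t) = ln(base) + sum of phrase components + sum of accent components.

    phrase_cmds: list of (onset_time, magnitude)
    accent_cmds: list of (rise_time, fall_time, magnitude)
    """
    value = math.log(base_hz)
    for onset, magnitude in phrase_cmds:
        value += magnitude * phrase_response(t - onset)
    for rise, fall, magnitude in accent_cmds:
        value += magnitude * (accent_response(t - rise) - accent_response(t - fall))
    return value

# Hypothetical commands: one phrase command and one accent command.
contour = [math.exp(log_f0(0.01 * k, 100.0, [(0.0, 0.6)], [(0.2, 0.5, 0.4)]))
           for k in range(100)]
```

Before any command starts, the contour sits at the base frequency; each command then adds its response in the log-frequency domain, which is why the two components can be estimated separately.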

[0014] Next, the processing in step S2 is described. The Fujisaki model, researched and proposed by Fujisaki, is known as one of the superposition models described above (see, for example, Conventional Document 1). Parameterization with the Fujisaki model has conventionally used a hill-climbing method (see, for example, Conventional Document 3): all free parameters of the Fujisaki model are varied, and the parameter set that minimizes the mean squared estimation error of the F₀ pattern is taken as the analysis result for that F₀ pattern. This can be viewed as a search problem whose search space has as many dimensions as there are parameters. Conventionally, the angular frequency, onset time, and magnitude of each phrase command and the angular frequency, rise time, fall time, and magnitude of each accent command were treated as free parameters, so the search space had (3I + 4J) dimensions, where I is the number of phrase commands and J is the number of accent commands. Of these parameters, the accent command magnitudes can be determined uniquely by the least-squares method once the measured F₀ values and the other parameters are given, which lowers the dimensionality of the search space by J and shortens the computation time. This method is formulated so that the reliability of the F₀ value at each time point (here, the local maximum of the autocorrelation function obtained when computing the F₀ frequency from speech) can be used to weight the F₀ estimation error at that time point. This accommodates the fact that the reliability of the F₀ values obtained from speech data is not uniform over time. In this embodiment, the F₀ pattern was parameterized by hill climbing using this method.
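The least-squares step can be illustrated with a minimal sketch. It assumes that the accent basis functions (the unit-magnitude accent responses for fixed rise and fall times) and the residual (measured ln F₀ minus the base value and the phrase components) have already been sampled at the same time points; the names `accent_magnitudes`, `basis`, `residual`, and `weights` are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def accent_magnitudes(basis, residual, weights):
    """Weighted least squares for the J accent magnitudes.

    basis[j][t]: unit-magnitude response of accent command j at time index t
    residual[t]: measured ln F0 minus base value and phrase components
    weights[t]:  reliability of the F0 value at time index t
    """
    j = len(basis)
    gram = [[sum(w * basis[p][t] * basis[q][t] for t, w in enumerate(weights))
             for q in range(j)] for p in range(j)]
    rhs = [sum(w * basis[p][t] * residual[t] for t, w in enumerate(weights))
           for p in range(j)]
    return solve_linear(gram, rhs)

# Synthetic check: the residual is built from magnitudes 0.4 and 0.2.
basis = [[0.0, 1.0, 1.0, 1.0, 0.0], [0.0, 0.0, 1.0, 1.0, 1.0]]
residual = [0.4 * g0 + 0.2 * g1 for g0, g1 in zip(*basis)]
magnitudes = accent_magnitudes(basis, residual, [1.0, 0.9, 0.8, 0.5, 1.0])
```

Because the magnitudes enter the model linearly, the hill-climbing search only needs to vary the remaining 3I + 3J parameters and can call a routine like this inside each evaluation.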

[0015] Next, generation of the F₀ control rules in step S3 is described. Rules for estimating the attributes of each command from the control factors above, which are considered to affect the phrase and accent commands, are obtained by the known spatial multiple-split quantification method (Multiple Split Regression; hereinafter, the MSR method; see, for example, Conventional Document 4: Iwahashi et al., "Statistical modeling of speech control by the space-split quantification method", Proceedings of the Acoustical Society of Japan, 1-5-11, pp. 237-238, October 1992). As in regression-tree analysis, the MSR method builds a model by growing a binary tree, at each step choosing the split that minimizes the sum of squared errors between the model estimates and the measured values. The MSR method additionally allows a node other than a leaf node to share a branching condition across the entire subtree below it, so modeling can be done efficiently with few parameters. A control factor used to grow the tree at a node near the root affects the estimates for many samples, so such factors can be judged to be important control factors deeply involved in controlling the F₀ frequency.
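The split criterion shared with ordinary regression trees can be sketched as below. This is a plain single-factor regression-tree split, not the MSR method itself: it omits the shared branching conditions, and the function names and sample data are hypothetical.

```python
def sse(values):
    """Sum of squared errors around the mean of the group."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(factor, observed):
    """Return (threshold, cost) of the split 'factor <= threshold' with minimal SSE.

    Returns None when the factor takes a single value and cannot be split.
    """
    best = None
    for threshold in sorted(set(factor))[:-1]:
        left = [y for x, y in zip(factor, observed) if x <= threshold]
        right = [y for x, y in zip(factor, observed) if x > threshold]
        cost = sse(left) + sse(right)
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best

# Hypothetical data: phrase length in morae and measured phrase command magnitude.
morae = [2, 3, 5, 8, 10, 13]
magnitude = [0.45, 0.45, 0.55, 0.575, 0.575, 0.575]
split = best_split(morae, magnitude)
```

A tree grower would apply this search to every candidate factor at every candidate node and keep the split that lowers the total squared error the most.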

[0016] The quantities to be estimated by a command estimation model are the magnitude of each command and timing information such as the rise time. Timing information can be estimated with a small number of rules, whereas estimating command magnitude requires complex rules, so this embodiment takes the magnitude of each command as the estimation target. Here, the command estimation models for the phrase commands and the accent commands are together called the F₀ control rules.

[0017] Representative statistical modeling methods for generating an estimation model include quantification theory type I (see, for example, Conventional Document 5: Hayashi et al., "Quantification Theory and Data Processing", Asakura Shoten, 1982) and regression trees (see, for example, Conventional Document 6: Breiman et al., "Classification And Regression Trees", Wadsworth Statistics/Probability Series, U.S.A., 1984). Quantification type I is a linear multiple regression model whose explanatory variable space is formed by the control factors. Because it assumes that the control factors are independent, it cannot express dependencies among factors. A regression tree, which splits the explanatory variable space successively, assumes that the subspaces after a split are independent, so it cannot express dependencies between the split subspaces. The MSR method resolves both problems by introducing, into the regression-tree analysis procedure, parameters that are shared by multiple splits. The result obtained by quantification type I can be regarded as the special solution of the MSR method in which splitting is allowed only at the root node, and the result obtained by a regression tree as the special solution in which simultaneous splitting at multiple nodes is prohibited.

[0018] Like a regression tree, MSR analysis produces a tree-structured model. FIG. 3 shows an example of a model obtained by the MSR method. In this example, the observed value can be estimated from two control factors, C₁ and C₂. The estimate is obtained by following, from the root node at the top, the branches whose conditions the control factors satisfy while adding up the quantities aᵢ written at the nodes; the estimate is the total accumulated at the terminal node. The quantities aᵢ are computed from the control factors and observed values of a large data set using the normal equations. However, as with quantification type I and regression trees, the parameter values cannot be determined uniquely, so some parameter values must be constrained. In this embodiment, the quantity at each node on the side that does not satisfy the condition (for example, the node reached by branching to No on the condition C₁ ≤ 5) is set to 0, and the other quantities are then determined. In the example of FIG. 3, a₃, a₇, a₉, and a₁₁ are set to 0. Under this constraint, because a₃, a₇, a₉, and a₁₁ are all 0, the root quantity a₁ equals the mean observed value of the data group that reaches the bottom-right node (that is, the data that take the non-satisfying side at every branch). The quantity a₁ can therefore be regarded as the initial value when computing an estimate.
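The estimation procedure, adding node quantities along the path selected by the control factors, can be sketched as follows. FIG. 3 is not reproduced in this text, so the tree, its thresholds, and its quantities below are hypothetical, and the sketch does not model the shared parameters specific to the MSR method.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Node:
    quantity: float                                   # the value a_i at this node
    condition: Optional[Callable[[dict], bool]] = None  # None marks a terminal node
    yes: Optional["Node"] = None                      # child when the condition holds
    no: Optional["Node"] = None                       # child when it does not (quantity 0)

def estimate(root, factors):
    """Sum the quantities on the path from the root to a terminal node."""
    total = 0.0
    node = root
    while node is not None:
        total += node.quantity
        if node.condition is None:
            break
        node = node.yes if node.condition(factors) else node.no
    return total

# Hypothetical two-factor tree; the No-side quantities are fixed to 0.
tree = Node(
    quantity=0.6,
    condition=lambda f: f["C1"] <= 5,
    yes=Node(
        quantity=-0.1,
        condition=lambda f: f["C2"] <= 2,
        yes=Node(quantity=0.05),
        no=Node(quantity=0.0),
    ),
    no=Node(quantity=0.0),
)
value = estimate(tree, {"C1": 3, "C2": 1})
```

With every No-side quantity fixed to 0, a sample that fails every condition receives exactly the root quantity, which matches the reading of a₁ as an initial value.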

[0019] The part of the tree enclosed by the dotted line in FIG. 3 is an example of an analysis result specific to the MSR method. This subtree arises as follows: a split at node a₃ generates nodes a₆ and a₇, and then a second split, again at node a₃, divides both nodes a₆ and a₇. The quantities a₁₀ and a₁₁ can be viewed as shared parameters. A split within a subtree like the one in this example cannot be expressed by quantification type I, which allows splitting only at the root node, or by a regression tree, which allows splitting only at terminal nodes.

[0020] As described above, the MSR method used in this embodiment subsumes and extends the concepts of quantification type I and regression trees. In addition, the shared parameters limit the growth in the number of model parameters, so modeling can be done efficiently with few parameters. For these reasons, this embodiment uses the MSR method as its statistical modeling technique.

[0021] Using the MSR method to analyze the relationship between the phrase and accent commands obtained from a large amount of speech data by the above processing and the control factors that affect each command yields models that estimate the phrase and accent commands from the control factors. Each model consists of a binary tree structure and model parameters. The binary tree serves as the rules for classifying each command by control factor, and the model parameters are used to compute the estimate. Examining the structure of the binary tree obtained by the analysis makes it possible to analyze, for example, which control factors affect each command. The magnitude of a model parameter also indicates whether the classification associated with that parameter strongly affects each command.

[0022] FIGS. 4 and 5 show the F₀ control rules of four speakers M1, M2, F1, and F2 (M1 and M2 are male speakers; F1 and F2 are female speakers). FIG. 4 shows the control amounts for the number of morae of the phrase and for the number of morae of the preceding phrase, and FIG. 5 shows the control amounts for the accent type of the accent phrase and for the position of the accent phrase in the sentence. A mora is a beat that corresponds essentially to one kana character. As for the accent type, an accent phrase whose accent falls on the first mora is called type 1, one whose accent falls on the second mora is called type 2, and so on. Table 1 shows the F₀ control rules for speaker F2 in FIGS. 4 and 5.

【００２３】[0023]

[Table 1] Specific example of F₀ control rules for each phoneme string <speaker F2> (corresponds to (d) in FIG. 4 and (d) in FIG. 5)
───────────────────────────────────
(1) Initialize the magnitudes of the current phrase, the preceding phrase, and the accent command each to the speaker's predetermined initial value (0.6).
───────────────────────────────────
(2) Decision control on the number of morae of the current phrase
(2-1) If the phrase is 1 to 3 morae long, reduce the phrase magnitude by 0.15 from the initial value.
(2-2) If the phrase is 4 to 6 morae long, reduce the phrase magnitude by 0.05 from the initial value.
(2-3) If the phrase is 7 to 12 morae long, reduce the phrase magnitude by 0.025 from the initial value.
(2-4) If the phrase is 13 or more morae long, reduce the phrase magnitude by 0.025 from the initial value.
───────────────────────────────────
(3) Decision control on the number of morae of the preceding phrase
(3-1) If the preceding phrase is 1 or more morae long, reduce the preceding-phrase magnitude by 0.0125 from the initial value.
(3-2) If there is no preceding phrase, leave the preceding-phrase magnitude at the initial value.
───────────────────────────────────
(4) Decision control on the accent type of the accent phrase
(4-1) If the accent type is type 1 or type 2, increase the accent phrase magnitude by 0.05 from the initial value.
(4-2) If the accent type is type 3 or higher, leave the accent phrase magnitude at the initial value.
(4-3) If there is no accent phrase, reduce the accent phrase magnitude by 0.2 from the initial value.
───────────────────────────────────
(5) Decision control on the position of the accent phrase in the sentence
(5-1) If the accent phrase is at the beginning of the sentence, leave the accent phrase magnitude at the initial value.
(5-2) If the accent phrase is in the middle of the sentence, leave the accent phrase magnitude at the initial value.
(5-3) If the accent phrase is at the end of the sentence, reduce the accent phrase magnitude by 0.25 from the initial value.
───────────────────────────────────
(Note) The phrase command magnitude is controlled by the sum of the control amounts in (2) and (3) of Table 1, and the accent phrase magnitude is controlled by the sum of the control amounts in (4) and (5) of Table 1.
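The rules of Table 1 can be written out as a small sketch. It follows the note to the table by adding the offsets of (2) and (3) for the phrase command and of (4) and (5) for the accent phrase. The function names, the use of `preceding_morae=0` for "no preceding phrase", the use of `accent_type=0` for the "no accent phrase" case, and the position labels are assumptions made for illustration.

```python
INITIAL = 0.6  # initial magnitude for speaker F2, rule (1)

def phrase_magnitude(morae, preceding_morae):
    """Phrase command magnitude: initial value plus the offsets of rules (2) and (3)."""
    if morae < 1:
        raise ValueError("a phrase needs at least one mora")
    if morae <= 3:
        offset = -0.15       # (2-1)
    elif morae <= 6:
        offset = -0.05       # (2-2)
    else:
        offset = -0.025      # (2-3) and (2-4) use the same value in Table 1
    if preceding_morae >= 1:
        offset -= 0.0125     # (3-1); (3-2) leaves the value unchanged
    return INITIAL + offset

def accent_magnitude(accent_type, position):
    """Accent phrase magnitude: initial value plus the offsets of rules (4) and (5).

    accent_type 0 encodes the 'no accent phrase' case of (4-3) (an assumption);
    position is one of 'start', 'middle', 'end' (hypothetical labels).
    """
    if position not in ("start", "middle", "end"):
        raise ValueError("unknown position: %r" % (position,))
    if accent_type == 0:
        offset = -0.2        # (4-3)
    elif accent_type <= 2:
        offset = 0.05        # (4-1)
    else:
        offset = 0.0         # (4-2)
    if position == "end":
        offset -= 0.25       # third rule of (5); the other positions leave it unchanged
    return INITIAL + offset

# A 5-mora phrase after a 3-mora phrase, and a type-1 accent phrase at sentence end.
example_phrase = phrase_magnitude(5, 3)
example_accent = accent_magnitude(1, "end")
```

Keeping the per-speaker numbers in one place like this reflects the document's point that switching speakers amounts to switching rule sets; a table for another speaker would replace only the constants.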

【００２４】[0024] Next, Tables 2, 3, and 4 show one example each of the F₀ control rules 31, 32, 33, the voice quality control rules 41, and the phoneme duration control rules 42, which are used to control the respective parameters for each phoneme or phoneme string when producing the synthesized speech "Kyō wa yoi tenki desu" ("The weather is nice today"). In Table 3, the acoustic feature parameters are a 34-dimensional parameter vector comprising the log power, the 16th-order cepstrum coefficients, the Δ log power, and the 16th-order Δ cepstrum coefficients.
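The dimension count follows from 1 + 16 + 1 + 16 = 34. As a small illustrative sketch (the function name and the ordering of the components inside the vector are assumptions, since the text only lists the components), such a vector could be assembled as follows:

```python
from typing import List, Sequence

CEPSTRUM_ORDER = 16  # order stated in the text


def build_feature_vector(log_power: float,
                         cepstrum: Sequence[float],
                         delta_log_power: float,
                         delta_cepstrum: Sequence[float]) -> List[float]:
    """Concatenate the four components into one 34-dimensional vector.

    The component order used here is an assumption for illustration.
    """
    if len(cepstrum) != CEPSTRUM_ORDER or len(delta_cepstrum) != CEPSTRUM_ORDER:
        raise ValueError("expected 16 cepstrum and 16 delta-cepstrum coefficients")
    return [log_power, *cepstrum, delta_log_power, *delta_cepstrum]
```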

【００２５】[0025]

[Table 2] Example of F₀ control rules
───────────────────────────────────
Phoneme string     F₀ control rules
───────────────────────────────────
kyo'uwa            F₀ control rule for the phrase magnitude
                   F₀ control rule for the accent magnitude
───────────────────────────────────
yo'ite'Nkidesu     F₀ control rule for the phrase magnitude
                   F₀ control rule for the magnitude of accent 1
                   F₀ control rule for the magnitude of accent 2
───────────────────────────────────

【００２６】[0026]

[Table 3] Example of voice quality control rules
─────────────────
Phoneme    Acoustic feature parameters
─────────────────
ky         (0.05, 0.03, …)
o          (0.45, 0.38, …)
u          (0.25, 0.42, …)
w          (0.32, 0.30, …)
a          (0.12, 0.45, …)
…          …
─────────────────

【００２７】[0027]

[Table 4] Example of phoneme duration control rules
─────────────
Phoneme    Phoneme duration
─────────────
ky         0.054 s
o          0.120 s
u          0.095 s
w          0.080 s
a          0.110 s
…          …
─────────────
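Table 4 amounts to a lookup from phoneme to duration. The sketch below accumulates those durations into start times for a phoneme sequence, using only the five values listed in the table. The dictionary name and function name are hypothetical, and the accent mark in "kyo'uwa" is left out of the sequence.

```python
from typing import List, Sequence, Tuple

# Durations in seconds, copied from Table 4 (only the listed entries).
PHONEME_DURATION_S = {"ky": 0.054, "o": 0.120, "u": 0.095, "w": 0.080, "a": 0.110}


def phoneme_start_times(phonemes: Sequence[str]) -> Tuple[List[Tuple[str, float]], float]:
    """Return (phoneme, start time) pairs and the total duration in seconds."""
    starts: List[Tuple[str, float]] = []
    elapsed = 0.0
    for phoneme in phonemes:
        starts.append((phoneme, elapsed))
        elapsed += PHONEME_DURATION_S[phoneme]  # KeyError for unlisted phonemes
    return starts, elapsed
```

For the sequence ky, o, u, w, a the durations add up to 0.054 + 0.120 + 0.095 + 0.080 + 0.110 = 0.459 s.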

【００２８】[0028] The operation of the speech synthesizer shown in FIG. 1 is described below. As shown in FIG. 1, the character string to be synthesized is input to the parameter sequence generation unit 1. Based on the input character string, the parameter sequence generation unit 1 selects control parameter data, comprising the F₀ frequency, the acoustic feature parameters, and the phoneme durations, using the F₀ control rules that control the F₀ frequency (one of 31, 32, and 33), the voice quality control rules 41 that control the acoustic feature parameters, and the phoneme duration control rules 42 that control the phoneme durations. Based on the selected parameter data, it performs processing such as time alignment, for example by the DTW (dynamic time warping) method, and interpolation of the speech spectrum, generates time-series data of, for example, 16th-order cepstrum coefficients, and outputs the data to the speech synthesis unit 2. The speech synthesis unit 2 comprises a pulse generator, a noise generator, a variable gain amplifier, and a filter. It generates a speech signal based on the input time-series data and outputs the signal to the loudspeaker 3, thereby producing synthesized speech corresponding to the input character string.
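The excitation side of such a synthesis unit (pulse generator, noise generator, variable gain amplifier) can be sketched roughly as below. This is an illustration under stated assumptions, not the circuit of FIG. 1: the 16 kHz sample rate, the convention that `f0_hz == 0` marks an unvoiced segment, and the function name are all hypothetical, and the filter driven by the cepstrum coefficients is deliberately omitted.

```python
import random
from typing import List


def excitation(f0_hz: float, gain: float, n_samples: int,
               sample_rate: int = 16000, seed: int = 0) -> List[float]:
    """Generate a gain-scaled excitation signal.

    Voiced (f0_hz > 0): unit pulses spaced one pitch period apart.
    Unvoiced (f0_hz == 0): uniform white noise in [-1, 1].
    """
    if f0_hz > 0:
        period = max(1, round(sample_rate / f0_hz))  # pitch period in samples
        return [gain if n % period == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [gain * rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
```

In a complete synthesizer, this signal would then be shaped by a filter whose coefficients follow the cepstrum time series. That step is left out here because the text does not specify the filter structure.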

【００２９】[0029] In the above embodiment, the F₀ control rules may also be generated by having the conversion-target speaker utter a small amount of speech data and substituting the F₀ control rules generated from that data for those generated from a large amount of speech data.

【００３０】[0030]

EXAMPLE Using the speech synthesizer of FIG. 1, the inventors applied the F₀ control rule learning process to speech databases and generated F₀ control rules that estimate the magnitudes of the phrase commands and accent commands. The F₀ control rules of several speakers were generated and analyzed to examine which important control factors are common across speakers.

【００３１】[0031] As speech material for generating the F₀ control rules, 500 sentences uttered by each of two male and two female speakers were used, 2,000 sentences in total (see, for example, Reference 7: Abe et al., Proceedings of the Acoustical Society of Japan, pp. 267-268, October 1989). The utterances are sentences selected from newspapers and magazines. Table 5 shows the number of phrase commands and accent commands in each set of speech data. The F₀ control rules for each speech database were generated using the processing method described above, and the individual control rules were analyzed.

【００３２】[0032]

[Table 5] Number of commands contained in each speech database
───────────────────────────────────
Speaker            M1      M2      F1      F2
───────────────────────────────────
Phrase commands    1903    1684    1425    1532
Accent commands    3200    3176    3306    3119
───────────────────────────────────
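As a quick arithmetic check on Table 5, the following sketch totals the commands and computes the average number of accent commands per phrase command for each speaker. The dictionary layout and function name are hypothetical, and the ratio is an illustrative derived quantity rather than a figure reported in the document.

```python
from typing import Dict

# Counts copied from Table 5.
COMMAND_COUNTS: Dict[str, Dict[str, int]] = {
    "M1": {"phrase": 1903, "accent": 3200},
    "M2": {"phrase": 1684, "accent": 3176},
    "F1": {"phrase": 1425, "accent": 3306},
    "F2": {"phrase": 1532, "accent": 3119},
}


def accents_per_phrase(counts: Dict[str, Dict[str, int]]) -> Dict[str, float]:
    """Average number of accent commands per phrase command, by speaker."""
    return {spk: c["accent"] / c["phrase"] for spk, c in counts.items()}


total_phrase = sum(c["phrase"] for c in COMMAND_COUNTS.values())
total_accent = sum(c["accent"] for c in COMMAND_COUNTS.values())
```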

【００３３】[0033] The control factors used to generate the F₀ control rules and the analysis of the control rules are described next. The parameters used in the critical damping model are, for a phrase command, its input time and magnitude, and, for an accent command, its onset time, offset time, and magnitude. Of these, timing information such as the input time has been reported to be controllable with a small number of simple rules (see, for example, Reference 8: Kaiki et al., IEICE Technical Report, SP92-6, March 1992). In contrast, appropriate control of the command magnitudes is important for improving the naturalness and intelligibility of the synthesized speech. Therefore, the magnitudes of the phrase commands and accent commands were made the targets of estimation when generating the F₀ control rules.
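To make these parameters concrete, the sketch below evaluates a log-F₀ contour with the critically damped response functions commonly given for the Fujisaki superposition model. It is an illustration, not the formulation used in this document: the constants `alpha`, `beta`, and `gamma` are typical illustrative values, and the function names and tuple layouts are assumptions.

```python
import math
from typing import Sequence, Tuple


def phrase_response(t: float, alpha: float = 3.0) -> float:
    """Impulse response of the phrase control mechanism (zero before onset)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0


def accent_response(t: float, beta: float = 20.0, gamma: float = 0.9) -> float:
    """Step response of the accent control mechanism, capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def log_f0(t: float, base_hz: float,
           phrase_cmds: Sequence[Tuple[float, float]],
           accent_cmds: Sequence[Tuple[float, float, float]]) -> float:
    """ln F0(t) = ln(base) + phrase components + accent components.

    phrase_cmds: (input time, magnitude) pairs.
    accent_cmds: (onset time, offset time, magnitude) triples.
    """
    value = math.log(base_hz)
    for onset, magnitude in phrase_cmds:
        value += magnitude * phrase_response(t - onset)
    for rise, fall, magnitude in accent_cmds:
        value += magnitude * (accent_response(t - rise) - accent_response(t - fall))
    return value
```

The two tuple layouts mirror the parameters listed in the paragraph: a phrase command has an input time and a magnitude, and an accent command has an onset time, an offset time, and a magnitude. Only the magnitudes are the estimation targets discussed in the text.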

【００３４】[0034] First, the control factors used to estimate the phrase command magnitude and their effects are described. The following four control factors were considered in estimating the phrase command magnitude.
(A1) Length of the current phrase (specifically, its number of morae; divided into 5 categories).
(A2) Length of the preceding phrase (specifically, its number of morae; divided into 6 categories).
(A3) Position of the current phrase in the sentence (divided into 2 categories: sentence-final or non-final).
(A4) Accent type of the first accent phrase in the current phrase (divided into 4 categories).

【００３５】[0035] When the current phrase is short, the phrase component does not need to be kept high for long, so the shorter the phrase, the smaller the phrase command is expected to be. When the preceding phrase is short, the current phrase begins before the phrase component of the preceding phrase has decayed sufficiently, and here too the phrase command is expected to be smaller. For these reasons, the lengths of the current and preceding phrases were used as control factors for estimating the phrase command magnitude. In addition, the F₀ frequency drops markedly at the end of a sentence in speech, so a phrase command at the end of a sentence is expected to be smaller than one located elsewhere. The position of the phrase in the sentence was therefore used in estimating the phrase command magnitude. Furthermore, to keep the F₀ frequency from becoming too high at the beginning of a phrase, the accent type, which correlates strongly with the size of the accent component, was used as a factor that restrains the phrase command magnitude.

【００３６】[0036] When a model that estimates the phrase command magnitude from these control factors was generated and analyzed, the lengths of the current and preceding phrases were confirmed to be important control factors in all of the speech databases. As for factor (A4), for three of the four speakers the phrase magnitude became smaller when an accent phrase with an accent nucleus (hereinafter, an undulating accent phrase) was at the beginning of the phrase.

【００３７】[0037] Next, the control factors used to estimate the accent command magnitude and their effects are described. The following four control factors were considered in estimating the accent command magnitude.
(B1) Length of the current accent phrase (specifically, its number of morae; divided into 4 categories).
(B2) Accent type of the current accent phrase (divided into 4 categories).
(B3) Accent type of the preceding accent phrase (divided into 5 categories).
(B4) Position of the current accent phrase in the sentence (divided into 3 categories: sentence-initial, sentence-medial, and sentence-final).

【００３８】[0038] As is well known, the accent component becomes smaller when the accent phrase is short and when the accent type is flat, so these were considered as control factors. In the inventors' experimental results, the accent command tended to be larger the smaller the number denoting the accent type, that is, the fewer the morae (beats) pronounced "high", so the undulating accent phrases were classified more finely (type 1, type 2, types 3 to 5, and type 6 or higher) for the analysis. When the preceding accent phrase is undulating, the energy for raising the F₀ frequency may be consumed in the preceding accent phrase so that the current accent phrase becomes smaller. The accent type of the preceding accent phrase was therefore added as a control factor. Furthermore, as noted above, position in the sentence was treated as a control factor for estimating the phrase command magnitude. Since the accent command magnitude may likewise differ among sentence-initial, sentence-medial, and sentence-final positions, this too was considered as a factor.
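The accent-type grouping used in the analysis, and the relation between the accent-type number and the number of "high" morae, can be sketched as follows. The category labels, the function names, and the treatment of the flat type as type 0 are assumptions for illustration. The high/low pattern follows the simplified textbook description of Tokyo-dialect pitch accent, not a rule stated in this document.

```python
def accent_type_category(accent_type: int) -> str:
    """Map an accent type to the grouping named in the text (flat = type 0)."""
    if accent_type < 0:
        raise ValueError("accent type must be non-negative")
    if accent_type == 0:
        return "flat"
    if accent_type == 1:
        return "type 1"
    if accent_type == 2:
        return "type 2"
    if accent_type <= 5:
        return "types 3-5"
    return "type 6+"


def pitch_pattern(mora_count: int, accent_type: int) -> str:
    """Simplified high/low pattern over the morae of one accent phrase.

    Type 0: low first mora, high afterwards.
    Type 1: high first mora, low afterwards.
    Type n >= 2: high from the second mora through mora n.
    """
    if mora_count < 1 or not 0 <= accent_type <= mora_count:
        raise ValueError("accent type must lie between 0 and the mora count")
    pattern = []
    for i in range(1, mora_count + 1):
        if accent_type == 1:
            high = i == 1
        elif accent_type == 0:
            high = i >= 2
        else:
            high = 2 <= i <= accent_type
        pattern.append("H" if high else "L")
    return "".join(pattern)
```

In this simplified pattern, a four-mora phrase of type 1 has one high mora and one of type 3 has two. That matches the observation that smaller type numbers go with fewer high morae.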

【００３９】[0039] When an accent command estimation model was generated from these control factors and the measured accent command magnitudes and then analyzed, the estimation models of all speakers confirmed that, for factor (B4), the accent command of an accent phrase at the end of a sentence becomes smaller. In this experiment, which handled a larger amount of speech data, no particular individual differences were found in the effects on the magnitudes of the phrase commands and accent commands.

【００４０】[0040] As described above, according to this embodiment of the present invention, an input character string is converted into the speech of a speaker designated in advance using F₀ control rules, created for each speaker, that control the pitch frequency of speech. The F₀ control rules control the pitch frequency of speech by controlling the phrase magnitude based on the number of morae in the current phrase to be synthesized and the number of morae in the phrase preceding it, and by controlling the accent phrase magnitude based on the accent type of the accent phrase to be synthesized and the position of that accent phrase in the sentence. Accordingly, a speech synthesizer capable of synthesizing the speech of one designated speaker can be provided. In addition, the F₀ control rule learning unit 20 extracts the pitch frequency pattern of speech from the speech data, generates the model parameters of the critical control model from the extracted pitch frequency pattern using an analysis method based on the critical control model, and can thereby generate control rules that control the pitch frequency of speech. Accordingly, F₀ control rules that are optimal and faithful for synthesizing the speech of one designated speaker can be created automatically and easily.

【００４１】[0041]

EFFECTS OF THE INVENTION As detailed above, the speech synthesizer according to the present invention comprises speech data storage means (11-13), F₀ control rule storage means (31-33), learning means (20, 21), parameter sequence generation means (1), and speech synthesis means (2). The speech data storage means (11-13) stores speech data, and the F₀ control rule storage means (31-33) stores F₀ control rules. The learning means (20, 21) comprises extraction means, generation means, and rule generation means. The extraction means extracts pitch patterns from the speech data stored in the speech data storage means (11-13). The generation means analyzes the pitch patterns extracted by the extraction means using an analysis method based on the critical control model and generates model parameters including accent commands, phrase commands, and accent phrase boundaries. The rule generation means, based on the pitch patterns extracted by the extraction means and the model parameters generated by the generation means and focusing on predetermined control factors, generates F₀ control rules that control the pitch frequency pattern of speech and stores them in the F₀ control rule storage means (31-33). The control factors include the number of morae in a phrase, the number of morae in the phrase preceding that phrase, the accent type of an accent phrase, and the position of the accent phrase in the sentence. The parameter sequence generation means (1) generates a pitch frequency pattern based on the input character string and the F₀ control rules stored in the F₀ control rule storage means (31-33), and generates an acoustic parameter sequence including the pitch frequency. The speech synthesis means (2) synthesizes speech based on the acoustic parameter sequence output by the parameter sequence generation means (1). Therefore, according to the present invention, F₀ control rules that are optimal and faithful for synthesizing the speech of one designated speaker can be created automatically and easily. Furthermore, a speech synthesizer capable of easily synthesizing the speech of one designated speaker can be provided.

【００４２】[0042]

[Brief description of the drawings]

FIG. 1 is a block diagram of a speech synthesizer according to one embodiment of the present invention.

FIG. 2 is a flowchart showing the F₀ control rule learning process executed by the F₀ control rule learning unit of FIG. 1.

FIG. 3 is a diagram showing an example of modeling by the space multiple-split quantification method (MSR) used in the F₀ control rule learning unit of FIG. 1.

FIG. 4 is a graph showing an example of the F₀ control rules for phrase commands created by the F₀ control rule learning unit of FIG. 1.

FIG. 5 is a graph showing an example of the F₀ control rules for accent phrases created by the F₀ control rule learning unit of FIG. 1.

[Explanation of symbols]

1 ... parameter sequence generation unit, 2 ... speech synthesis unit, 3 ... loudspeaker, 11 ... speech data of speaker A, 12 ... speech data of speaker B, 13 ... speech data of speaker C, 20 ... F₀ control rule learning unit, 21 ... working memory, 31 ... F₀ control rules of speaker A, 32 ... F₀ control rules of speaker B, 33 ... F₀ control rules of speaker C, 41 ... sound quality control rules, 42 ... phoneme duration data.

Continuation of the front page
(72) Inventor: Norio Higuchi, c/o ATR Interpreting Telecommunications Research Laboratories, 5 Sanpeidani, Inuidani, Seika-cho, Soraku-gun, Kyoto
(56) References: JP H03-249800 A; JP H06-43891 A; JP S64-61796 A; JP H02-113299 A; JP H06-27984 A; JP S64-28695 A
(58) Fields searched (Int. Cl.⁶, DB name): G10L 3/00, G10L 5/02

Claims

(57) [Claims]

A speech synthesizer comprising speech data storage means (11-13), F₀ control rule storage means (31-33), learning means (20, 21), parameter sequence generation means (1), and speech synthesis means (2), wherein:
the speech data storage means (11-13) stores speech data;
the F₀ control rule storage means (31-33) stores F₀ control rules;
the learning means (20, 21) comprises extraction means, generation means, and rule generation means;
the extraction means extracts pitch patterns from the speech data stored in the speech data storage means (11-13);
the generation means analyzes the pitch patterns extracted by the extraction means using an analysis method based on a critical control model and generates model parameters including accent commands, phrase commands, and accent phrase boundaries;
the rule generation means, based on the pitch patterns extracted by the extraction means and the model parameters generated by the generation means and focusing on predetermined control factors, generates F₀ control rules that control the pitch frequency pattern of speech and stores them in the F₀ control rule storage means (31-33);
the control factors include the number of morae in a phrase, the number of morae in the phrase preceding that phrase, the accent type of an accent phrase, and the position of the accent phrase in the sentence;
the parameter sequence generation means (1) generates a pitch frequency pattern based on an input character string and the F₀ control rules stored in the F₀ control rule storage means (31-33), and generates an acoustic parameter sequence including the pitch frequency; and
the speech synthesis means (2) synthesizes speech based on the acoustic parameter sequence output by the parameter sequence generation means (1).