JP2018084604A

JP2018084604A - Cross lingual voice synthesis model learning device, cross lingual voice synthesis device, cross lingual voice synthesis model learning method, and program

Info

Publication number: JP2018084604A
Application number: JP2016225808A
Authority: JP
Inventors: 勇祐井島; Yusuke Ijima; 隆能勢; Takashi Nose
Original assignee: Tohoku University NUC; Nippon Telegraph and Telephone Corp
Current assignee: Tohoku University NUC; NTT Inc
Priority date: 2016-11-21
Filing date: 2016-11-21
Publication date: 2018-05-31
Anticipated expiration: 2036-11-21
Also published as: JP6631883B2

Abstract

PROBLEM TO BE SOLVED: To provide a cross-lingual speech synthesis technology that can synthesize speech by a target speaker in a synthesis target language even when the only available speech data in that language comes from a single speaker who is not the target speaker.
SOLUTION: Provided are: a time information adjustment unit 101 that generates time-adjusted target speaker speech data and time-adjusted learning target language speech data from target speaker speech data and learning target language speech data; a voice quality converter learning unit 103 that learns an unspecified speaker voice quality converter from sets of the time-adjusted target speaker speech data and the time-adjusted learning target language speech data; a voice quality conversion unit 111 that uses the unspecified speaker voice quality converter to generate, from synthesis target language speech data, voice-converted synthesis target language speech data having the target speaker's voice quality; and a synthesis model learning unit 113 that learns a cross-lingual speech synthesis model from the voice-converted synthesis target language speech data and a set of synthesis target language utterance information.
SELECTED DRAWING: Figure 2

Description

The present invention relates to speech synthesis technology, and more particularly to cross-lingual speech synthesis technology.

Statistical parametric speech synthesis has become the mainstream approach to speech synthesis in recent years. Examples include HMM (Hidden Markov Model) speech synthesis (Non-Patent Document 1) and DNN (Deep Neural Network) speech synthesis (Non-Patent Document 2). These methods statistically model speech parameters, that is, speech features such as spectral parameters (cepstrum, mel-cepstrum, etc.) and pitch parameters (F0). This makes it possible to generate synthesized speech of stable quality for an arbitrary speaker from a relatively small amount of training speech data.

To synthesize speech in an arbitrary language, such as Japanese or English, with the voice quality of an arbitrary speaker using statistical parametric speech synthesis, speech data uttered by that speaker in that language is generally required. However, such speech data is not always available, for example because the speaker cannot speak the language. Cross-lingual speech synthesis methods have been proposed to solve this problem (Non-Patent Documents 3 and 4).

These cross-lingual speech synthesis methods use speech data of the target speaker (the speaker whose synthesized speech is to be generated) in a learning target language that differs from the synthesis target language. This makes it possible to generate synthesized speech in the synthesis target language with the voice quality of the target speaker.

Non-Patent Document 1: Takashi Masuko, Keiichi Tokuda, Takao Kobayashi, Satoshi Imai, “HMM-based speech synthesis using dynamic features”, IEICE Transactions D-II, vol.J79-D-II, No.12, pp.2184-2190, 1996.
Non-Patent Document 2: Heiga Zen, Andrew Senior, Mike Schuster, “Statistical parametric speech synthesis using deep neural networks”, IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP 2013, pp.7962-7966, 2013.
Non-Patent Document 3: Masanobu Abe, Kiyohiro Shikano, Hisao Kuwabara, “Cross-language voice conversion”, IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP-90, pp.345-348, 1990.
Non-Patent Document 4: Yi-Jian Wu, Yoshihiko Nankaku, Keiichi Tokuda, “State mapping based method for cross-lingual speaker adaptation in HMM-based speech synthesis”, INTERSPEECH 2009, 10th Annual Conference of the International Speech Communication Association, pp.528-531, 2009.

However, the method of Non-Patent Document 3 requires speech data in both languages uttered by a bilingual speaker of the synthesis target language and the learning target language. The method of Non-Patent Document 4 requires speech data from many speakers in each of the synthesis target language and the learning target language.

Neither method can therefore be applied when it is difficult to collect speech data from bilingual speakers or speech data from many speakers in each language. This problem becomes more pronounced as the number of synthesis target languages grows.

An object of the present invention is therefore to provide a cross-lingual speech synthesis technology that can synthesize speech by the target speaker in the synthesis target language even when speech data in the synthesis target language is available from only one speaker, and that speaker is not the target speaker.

In one aspect of the present invention, N is an integer of 1 or more and n is an integer satisfying 1≦n≦N. Target speaker speech data, which is speech data of a target speaker in a learning target language, and learning target language speech data_n (1≦n≦N), which is speech data of learning target language input speaker_n in the learning target language, are speech data of the same sentences. A cross-lingual speech synthesis model learning device learns, from the target speaker speech data and the learning target language speech data_n (1≦n≦N), a cross-lingual speech synthesis model for synthesizing speech by the target speaker in a synthesis target language. The device includes: a time information adjustment unit that adjusts the time information of the target speaker speech data and the learning target language speech data_n to generate time-adjusted target speaker speech data_n and time-adjusted learning target language speech data_n; a voice quality converter learning unit that learns, from the sets (1≦n≦N) of the time-adjusted target speaker speech data_n and the time-adjusted learning target language speech data_n, an unspecified speaker voice quality converter that converts arbitrary speech data into speech data having the voice quality of the target speaker; a voice quality conversion unit that uses the unspecified speaker voice quality converter to generate, from synthesis target language speech data, which is speech data of a synthesis target language input speaker in the synthesis target language, voice-converted synthesis target language speech data having the voice quality of the target speaker; and a synthesis model learning unit that learns the cross-lingual speech synthesis model from the voice-converted synthesis target language speech data and a synthesis target language utterance information set, which is the set of utterance information of the utterances included in the synthesis target language speech data.

In another aspect of the present invention, N is an integer of 1 or more and n is an integer satisfying 1≦n≦N. Target speaker speech data, which is speech data of a target speaker in a learning target language, and learning target language speech data_n (1≦n≦N), which is speech data of learning target language input speaker_n in the learning target language, are speech data of the same sentences. A cross-lingual speech synthesis model learning device learns, from the target speaker speech data and the learning target language speech data_n (1≦n≦N), a cross-lingual speech synthesis model for synthesizing speech by the target speaker in a synthesis target language. The device includes: a recording unit that records a pronunciation vector creation rule indicating the correspondence between phonemes of the learning target language and phonemes of the synthesis target language; a time information adjustment unit that adjusts the time information of the target speaker speech data and the learning target language speech data_n to generate time-adjusted target speaker speech data_n and time-adjusted learning target language speech data_n; a first pronunciation vector generation unit that uses the pronunciation vector creation rule to generate, from learning target language utterance information set_n, which is the set of utterance information of the utterances included in the learning target language speech data_n, learning target language pronunciation vector set_n, which is a set of pronunciation vectors calculated from the utterance information; a voice quality converter learning unit that learns, from the sets (1≦n≦N) of the time-adjusted target speaker speech data_n, the time-adjusted learning target language speech data_n, and the learning target language pronunciation vector set_n, an unspecified speaker voice quality converter that converts arbitrary speech data into speech data having the voice quality of the target speaker; a second pronunciation vector generation unit that uses the pronunciation vector creation rule to generate, from a synthesis target language utterance information set, which is the set of utterance information of the utterances included in synthesis target language speech data (speech data of a synthesis target language input speaker in the synthesis target language), a synthesis target language pronunciation vector set, which is a set of pronunciation vectors calculated from the utterance information; a voice quality conversion unit that uses the unspecified speaker voice quality converter to generate, from the synthesis target language speech data and the synthesis target language pronunciation vector set, voice-converted synthesis target language speech data having the voice quality of the target speaker; and a synthesis model learning unit that learns the cross-lingual speech synthesis model from the voice-converted synthesis target language speech data and the synthesis target language utterance information set.

According to the present invention, learning a voice quality converter that converts arbitrary speech data into speech data having the voice quality of the target speaker makes it possible to generate a model for synthesizing speech by the target speaker in the synthesis target language, even when speech data in the synthesis target language is available from only one speaker who is not the target speaker.

FIG. 1: A diagram showing an example of phoneme segmentation information.
FIG. 2: A diagram showing an example of the configuration of the cross-lingual speech synthesis model learning device 100.
FIG. 3: A diagram showing an example of the operation of the cross-lingual speech synthesis model learning device 100.
FIG. 4: A diagram showing an example of the configuration of the unspecified speaker voice quality converter learning device 110.
FIG. 5: A diagram showing an example of the operation of the unspecified speaker voice quality converter learning device 110.
FIG. 6: A diagram showing an example of the configuration of the speech synthesis model learning device 120.
FIG. 7: A diagram showing an example of the operation of the speech synthesis model learning device 120.
FIG. 8: A diagram showing an example of the configuration of the cross-lingual speech synthesis device 200.
FIG. 9: A diagram showing an example of the operation of the cross-lingual speech synthesis device 200.
FIG. 10: A diagram showing an example of a pronunciation vector creation rule.
FIG. 11: A diagram showing an example of the configuration of the cross-lingual speech synthesis model learning device 300.
FIG. 12: A diagram showing an example of the configuration of the unspecified speaker voice quality converter learning device 310.
FIG. 13: A diagram showing an example of the operation of the unspecified speaker voice quality converter learning device 310.
FIG. 14: A diagram showing an example of the configuration of the speech synthesis model learning device 320.
FIG. 15: A diagram showing an example of the operation of the speech synthesis model learning device 320.
FIG. 16: A diagram showing an example of the configuration of the pronunciation vector creation rule generation device 400.
FIG. 17: A diagram showing an example of the operation of the pronunciation vector creation rule generation device 400.
FIG. 18: A diagram showing an example of the operation of the first pronunciation statistics information generation unit 410.
FIG. 19: A diagram showing an example of the operation of the pronunciation similarity determination unit 420.
FIG. 20: A diagram showing an example of the configuration of the speech synthesis model learning device 520.
FIG. 21: A diagram showing an example of the operation of the speech synthesis model learning device 520.
FIG. 22: A diagram showing an example of the configuration of the speech synthesis model learning device 620.
FIG. 23: A diagram showing an example of the operation of the speech synthesis model learning device 620.
FIG. 24: A diagram showing an example of the operation of the speaker similarity determination unit 612.
FIG. 25: A diagram showing an example of the configuration of the speech synthesis model learning device 720.
FIG. 26: A diagram showing an example of the operation of the speech synthesis model learning device 720.
FIG. 27: A diagram showing an example of the configuration of the speech synthesis model learning device 820.
FIG. 28: A diagram showing an example of the operation of the speech synthesis model learning device 820.

Embodiments of the present invention are described in detail below. Components having the same function are given the same reference numeral, and duplicate descriptions are omitted.

<Notation>
_ (underscore) denotes a subscript. For example, x^y_z means that y_z is a superscript of x, and x_{y_z} means that y_z is a subscript of x.

<Definitions>
The terms used in each embodiment are explained below.

[Speech synthesis model]
A speech synthesis model is a statistical model of the speech features or acoustic features (hereinafter, speech parameters) required for speech synthesis. Examples of speech parameters include spectral parameters such as the cepstrum and mel-cepstrum, and pitch parameters such as the fundamental frequency (F0). A speech synthesis model is learned from the speech data and utterance information described below. Several learning methods for speech synthesis models have been proposed, for example HMM speech synthesis (Non-Patent Document 1) and DNN speech synthesis (Non-Patent Document 2).

[Speech data]
Speech data is speech recorded in advance for use in each learning step (specifically, learning of the unspecified speaker voice quality converter and learning of the cross-lingual speech synthesis model). Speech data is the speech of sentences uttered by a speaker. It may also be stored as the speech parameters (spectral parameters, pitch parameters) obtained by applying signal processing to the recorded speech.

Learning the unspecified speaker voice quality converter requires speech data of the target speaker in the learning target language (hereinafter, target speaker speech data) and speech data of N speakers in the learning target language (hereinafter, learning target language speech data_n (1≦n≦N)). Here, the target speaker is the speaker for whom synthesized speech in the synthesis target language is to be generated. The target speaker speech data and the learning target language speech data_n (1≦n≦N) must be speech data of the same sentences.

Learning the cross-lingual speech synthesis model requires speech data of M speakers in the synthesis target language (hereinafter, synthesis target language speech data_m (1≦m≦M)). M may be 1; that is, synthesis target language speech data from at least one speaker is sufficient. When M=1, the data is simply called synthesis target language speech data instead of synthesis target language speech data_1.

The speaker who uttered the learning target language speech data_n is called learning target language input speaker_n, and the speaker who uttered the synthesis target language speech data_m is called synthesis target language input speaker_m. In general, the learning target language input speakers_n (1≦n≦N) and the synthesis target language input speakers_m (1≦m≦M) are different from the target speaker.

[Utterance information]
Utterance information is information, such as pronunciation information, attached to each utterance (the speech of one sentence uttered by a speaker) in the speech data. One piece of utterance information is attached to each utterance in the speech data.

Utterance information includes at least the pronunciation information (reading, phonemes) corresponding to each utterance. It may also include phoneme segmentation information, that is, the start time and end time of each phoneme. The start and end times are elapsed times measured from the start of each utterance, which is taken as 0 seconds. FIG. 1 shows an example of phoneme segmentation information.
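As an illustration of how utterance information with phoneme segmentation might be held in memory, here is a minimal sketch. The class names, field names, and sample values are hypothetical and are not taken from the patent or from FIG. 1.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PhonemeSegment:
    phoneme: str      # phoneme label (hypothetical label set)
    start_sec: float  # elapsed time from the start of the utterance (0 s)
    end_sec: float    # elapsed time at which the phoneme ends


@dataclass
class UtteranceInfo:
    reading: str                    # pronunciation (reading) of the sentence
    segments: List[PhonemeSegment]  # phoneme segmentation information

    def phonemes(self) -> List[str]:
        """Return the phoneme sequence in time order."""
        return [seg.phoneme for seg in self.segments]

    def is_consistent(self) -> bool:
        """Check that segments are ordered, non-overlapping, and non-empty."""
        prev_end = 0.0
        for seg in self.segments:
            if seg.start_sec < prev_end or seg.end_sec <= seg.start_sec:
                return False
            prev_end = seg.end_sec
        return True


# Made-up values for illustration only.
info = UtteranceInfo(
    reading="ohayo",
    segments=[
        PhonemeSegment("o", 0.00, 0.12),
        PhonemeSegment("h", 0.12, 0.18),
        PhonemeSegment("a", 0.18, 0.30),
    ],
)
```

Accent or part-of-speech information could be added as further optional fields on the same record.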

Utterance information may also include information other than phonemes and phoneme segmentation information, for example accent information (accent type, accent phrase length) and part-of-speech information.

<First embodiment>
The cross-lingual speech synthesis model learning device 100 and the cross-lingual speech synthesis device 200 are described below with reference to FIGS. 2 to 9.

[Cross-lingual speech synthesis model learning device 100]
As shown in FIG. 2, the cross-lingual speech synthesis model learning device 100 includes an unspecified speaker voice quality converter learning device 110, a speech synthesis model learning device 120, and a recording unit 190. The recording unit 190 records, as needed, the information required for the processing of the cross-lingual speech synthesis model learning device 100. The cross-lingual speech synthesis model learning device 100 takes as input the target speaker speech data, the learning target language speech data_n (1≦n≦N), the synthesis target language speech data, and the set of utterance information of the utterances included in the synthesis target language speech data (hereinafter, the synthesis target language utterance information set). It learns and outputs a cross-lingual speech synthesis model for synthesizing speech by the target speaker in the synthesis target language.

The operation of the cross-lingual speech synthesis model learning device 100 is described with reference to FIG. 3. The unspecified speaker voice quality converter learning device 110 takes the target speaker speech data and the learning target language speech data_n (1≦n≦N) as input, and learns and outputs an unspecified speaker voice quality converter that converts arbitrary speech data into speech data having the voice quality of the target speaker (S110). The speech synthesis model learning device 120 takes the synthesis target language speech data and the synthesis target language utterance information set as input, and learns and outputs the cross-lingual speech synthesis model (S120). In doing so, it uses the unspecified speaker voice quality converter to convert the synthesis target language speech data into speech data having the voice quality of the target speaker (hereinafter, voice-converted synthesis target language speech data).

The configurations and operations of the unspecified speaker voice quality converter learning device 110 and the speech synthesis model learning device 120 are described in detail below.

First, the unspecified speaker voice quality converter learning device 110 is described with reference to FIGS. 4 and 5. As shown in FIG. 4, the unspecified speaker voice quality converter learning device 110 includes a time information adjustment unit 101 and a voice quality converter learning unit 103. Its operation is described with reference to FIG. 5.

The time information adjustment unit 101 takes the target speaker speech data and the learning target language speech data_n (1≦n≦N) as input. It generates and outputs speech data in which the time information of each spoken sentence (that is, each utterance) in the target speaker speech data and the learning target language speech data_n has been aligned (hereinafter, time-adjusted target speaker speech data_n and time-adjusted learning target language speech data_n) (S101). Because the target speaker speech data is time-aligned with each of the learning target language speech data_n (1≦n≦N), time-adjusted target speaker speech data_n (1≦n≦N) is generated, one for each of the N learning target language input speakers. The time information can be adjusted by any method that establishes a frame-level correspondence between the target speaker speech data and the learning target language speech data_n. For example, as in many voice quality conversion methods, dynamic time warping (DTW) or DP (dynamic programming) matching can be used.
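The frame-level alignment of step S101 can be illustrated with a textbook DTW. The sketch below is not the patent's implementation: it assumes each utterance is a list of per-frame feature vectors, uses Euclidean frame distance, and uses the three standard step directions. The function names `dtw_path` and `align` are chosen here for illustration.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[float]


def frame_distance(a: Frame, b: Frame) -> float:
    """Euclidean distance between two feature frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def dtw_path(src: List[Frame], tgt: List[Frame]) -> List[Tuple[int, int]]:
    """Return the minimum-cost list of (src_index, tgt_index) frame pairs."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    # cost[i][j]: best accumulated cost aligning src[:i] with tgt[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(src[i - 1], tgt[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def align(src: List[Frame], tgt: List[Frame]) -> Tuple[List[Frame], List[Frame]]:
    """Repeat frames along the DTW path so both sequences have equal length."""
    path = dtw_path(src, tgt)
    return [src[i] for i, _ in path], [tgt[j] for _, j in path]
```

In the setting above, `align` would be called once per learning target language input speaker_n, pairing that speaker's utterance with the target speaker's utterance of the same sentence.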

The voice quality converter learning unit 103 takes the sets (1≦n≦N) of time-adjusted target speaker speech data_n and time-adjusted learning target language speech data_n as input, and learns and outputs the unspecified speaker voice quality converter (S103). For example, when the time-adjusted target speaker speech data_n and the time-adjusted learning target language speech data_n are represented as spectral parameters, the unspecified speaker voice quality converter is learned as follows.

Let C_{n,target}(t) denote the spectral parameters of the time-adjusted target speaker speech data_n and C_{n,train}(t) those of the time-adjusted learning target language speech data_n, where t is the frame number. Using the pairs of C_{n,train}(t) and C_{n,target}(t) as training data, the unit learns an unspecified speaker voice quality converter f_{train→target} that converts an input speaker spectral parameter c_train(t) into a target speaker spectral parameter ĉ_target(t) corresponding to the voice quality of the target speaker.
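Equation (1), which later paragraphs refer to, is not reproduced in this text. Judging only from the definitions of c_train(t), ĉ_target(t), and f_{train→target} given here, it presumably has the following form. This is a reconstruction under that assumption, not the published equation.

```latex
\hat{c}_{\mathrm{target}}(t) = f_{\mathrm{train}\to\mathrm{target}}\bigl(c_{\mathrm{train}}(t)\bigr) \tag{1}
```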

Here, the unspecified speaker voice quality converter f_{train→target} converts the spectral parameter of an arbitrary input speaker at a given frame t into a target speaker spectral parameter.

With a typical voice quality converter learning method, if speech data of speaker A is used to learn a converter that produces speech data with the voice quality of speaker B, the converter can only produce speaker B's voice quality when the input is speech of that same speaker A. For this reason, the neural-network-based method of Reference Non-Patent Document 1 is used here as the learning method for the voice quality converter.
(Reference Non-Patent Document 1) Takashi Nose, Takahiro Shinozaki, Yojiro Ito, Akinori Ito, “Examination of many-to-one voice quality conversion that does not require user speech based on neural network”, Proceedings of the Acoustical Society of Japan, March 2015, 3-2-1, pp.271-274, 2015.
In Reference Non-Patent Document 1, the training data for the voice quality converter consists of speech data from many learning target language input speakers, which serves as the converter input, and speech data from one target speaker, which serves as the converter output. As a result, speech data from any speaker can be converted into speech data having the voice quality of the target speaker.
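The many-to-one idea can be sketched as pooling the time-aligned frame pairs of all N input speakers and fitting a single mapping to the pooled data. The sketch below deliberately swaps the neural network of Reference Non-Patent Document 1 for a per-dimension affine least-squares fit so that it stays short and dependency-free. That substitution, and the function names `pool_pairs`, `fit_affine`, and `convert`, are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[float]
Pair = Tuple[Frame, Frame]  # (input speaker frame, target speaker frame)


def pool_pairs(per_speaker: List[Tuple[List[Frame], List[Frame]]]) -> List[Pair]:
    """Merge the aligned (C_{n,train}, C_{n,target}) sequences of all speakers."""
    pooled: List[Pair] = []
    for train_frames, target_frames in per_speaker:
        if len(train_frames) != len(target_frames):
            raise ValueError("sequences must be time-aligned before pooling")
        pooled.extend(zip(train_frames, target_frames))
    return pooled


def fit_affine(pairs: List[Pair]) -> List[Tuple[float, float]]:
    """Least-squares fit of y = a * x + b, independently for each dimension.

    Stand-in for the neural network; one (a, b) pair per feature dimension.
    """
    dim = len(pairs[0][0])
    count = len(pairs)
    params = []
    for d in range(dim):
        xs = [x[d] for x, _ in pairs]
        ys = [y[d] for _, y in pairs]
        mean_x, mean_y = sum(xs) / count, sum(ys) / count
        var_x = sum((x - mean_x) ** 2 for x in xs)
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        a = cov / var_x if var_x > 0 else 0.0
        params.append((a, mean_y - a * mean_x))
    return params


def convert(frame: Frame, params: List[Tuple[float, float]]) -> List[float]:
    """Apply the fitted mapping to one input frame."""
    return [a * x + b for x, (a, b) in zip(frame, params)]
```

Because the mapping is fitted on frames from every input speaker at once, it does not depend on which speaker a frame came from, which is the property the paragraph above describes. A real converter would need a far more expressive model than an affine map.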

Next, the speech synthesis model learning device 120 will be described with reference to FIGS. 6 and 7. As shown in FIG. 6, the speech synthesis model learning device 120 includes a voice quality conversion unit 111 and a synthesis model learning unit 113. The operation of the speech synthesis model learning device 120 will be described with reference to FIG. 7.

The voice quality conversion unit 111 receives the synthesis target language speech data as input, and generates and outputs voice-quality-converted synthesis target language speech data using the unspecified speaker voice quality converter (S111). For example, when the unspecified speaker voice quality converter is expressed by Equation (1), the spectral parameter C_{synth,org}(t) of the synthesis target language speech data is converted into the target speaker spectral parameter C^_{synth,target}(t) of the synthesis target language, which has the voice quality of the target speaker (t is the frame number).

When (Reference Non-Patent Document 1) is used for learning the voice quality converter, voice quality conversion may be performed according to (Reference Non-Patent Document 1). The target speaker spectral parameter C^_{synth,target}(t) calculated according to Equation (1) is the voice-quality-converted synthesis target language speech data.

The synthesis model learning unit 113 receives the voice-quality-converted synthesis target language speech data generated in S111 and the synthesis target language utterance information set as input, and learns and outputs a cross-lingual speech synthesis model (S113). The methods of (Non-Patent Document 1), (Non-Patent Document 2), and the like can be used for learning the cross-lingual speech synthesis model.

The cross-lingual speech synthesizer 200 is described below. The cross-lingual speech synthesis model is assumed to be an HMM speech synthesis model as in (Non-Patent Document 1).

[Cross-Lingual Speech Synthesizer 200]
As shown in FIG. 8, the cross-lingual speech synthesizer 200 includes a text analysis unit 210, a speech parameter generation unit 220, a speech waveform generation unit 230, and a recording unit 290. The recording unit 290 is a component that records, as appropriate, information necessary for the processing of the cross-lingual speech synthesizer 200. The cross-lingual speech synthesizer 200 receives as input a synthesis text, that is, a text in the synthesis target language to be synthesized, and uses the cross-lingual speech synthesis model to generate and output synthesized speech that reads the synthesis text aloud.

The operation of the cross-lingual speech synthesizer 200 will be described with reference to FIG. 9. The text analysis unit 210 receives the synthesis text as input, and generates and outputs context information such as the reading and accent of the synthesis text by text analysis (S210). The speech parameter generation unit 220 receives the context information generated in S210 as input, and generates and outputs speech parameters (spectral parameters and pitch parameters) using the cross-lingual speech synthesis model (S220). The speech waveform generation unit 230 receives the speech parameters generated in S220 as input, and generates and outputs synthesized speech using a speech synthesis filter (S230). As the speech synthesis filter, for example, the filter described in (Reference Non-Patent Document 2) may be used.
(Reference Non-Patent Document 2) Satoshi Imai, Kazuo Sumita, Chieko Furuichi, "Mel Log Spectrum Approximation (MLSA) Filter for Speech Synthesis" (in Japanese), IEICE Transactions A, Vol. J66-A, No. 2, pp. 122-129, 1983.
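The three steps S210, S220, and S230 form a simple pipeline in which each unit consumes the previous unit's output. A minimal sketch of that data flow follows; the function `synthesize` and its callable parameters are hypothetical placeholders, and the actual text analysis, parameter generation, and synthesis filter are not implemented here.

```python
def synthesize(text, analyze_text, generate_parameters, generate_waveform):
    """Chain the three processing steps of the synthesizer.

    The three callables are placeholders supplied by the caller:
      analyze_text:        text -> context information           (S210)
      generate_parameters: context -> speech parameters          (S220)
      generate_waveform:   speech parameters -> synthesized speech (S230)
    """
    context = analyze_text(text)                # S210: reading, accent, etc.
    parameters = generate_parameters(context)   # S220: spectral and pitch parameters
    return generate_waveform(parameters)        # S230: synthesis filter output
```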

Note that, instead of the target speaker speech data and the learning target language speech data_n (1≦n≦N), the device may be configured so that time-information-adjusted target speaker speech data_n (1≦n≦N) and time-information-adjusted learning target language speech data_n (1≦n≦N), whose time information has been adjusted in advance, are input to the cross-lingual speech synthesis model learning device.

According to the invention of this embodiment, an unspecified speaker voice quality converter, learned from target speaker speech data prepared in advance and learning target language speech data of N speakers, is used to convert synthesis target language speech data of one speaker other than the target speaker into voice-quality-converted synthesis target language speech data having the voice quality of the target speaker. Next, from the voice-quality-converted synthesis target language speech data and the set of utterance information of the synthesis target language speech data, a cross-lingual speech synthesis model for statistical parametric speech synthesis is learned, that is, a model for synthesizing speech in the synthesis target language spoken by the target speaker.

This makes it possible to generate a model for cross-lingual speech synthesis from learning target language speech data of N speakers and synthesis target language speech data of one speaker, and cross-lingual speech synthesis becomes possible by using this model. It is therefore unnecessary to prepare in advance speech data from a speaker bilingual in the learning target language and the synthesis target language, or speech data from many speakers for each of the two languages. Consequently, cross-lingual speech synthesis can be realized at lower cost than before.

<Second embodiment>
In the first embodiment, only speech data (spectral parameters) is used for learning the unspecified speaker voice quality converter and for voice quality conversion by that converter. In other words, voice quality conversion is performed with an unspecified speaker voice quality converter learned without using the utterance information that accompanies the speech data. As a result, the accuracy of voice quality conversion may degrade, and the quality of the final synthesized speech may degrade as well.

Therefore, this embodiment also uses utterance information for learning the unspecified speaker voice quality converter and for voice quality conversion by that converter. In general, however, pronunciation information such as the phoneme inventory differs between languages, so utterance information cannot be used as it is for these purposes. To make utterance information usable, a correspondence between similar pronunciation information (phonemes) of the two languages is created in order to absorb the difference in pronunciation information between the learning target language and the synthesis target language. This realizes learning of the unspecified speaker voice quality converter, and voice quality conversion by it, that take the difference in pronunciation information between the two languages into account.

Hereinafter, the correspondence between similar pronunciation information (phonemes) of the learning target language and the synthesis target language is referred to as a pronunciation vector creation rule. An example is shown in FIG. 10. Each row of the table in FIG. 10 pairs a phoneme of the learning target language with similar phonemes of the synthesis target language and assigns the pair an index; the number of rows equals the number L of phonemes of the learning target language. In this example, the learning target language is Japanese and the synthesis target language is English. For example, the third row (the row whose index is 3) shows that the Japanese phoneme "s" is similar to, and corresponds to, the English phonemes "s" and "th".

The representation of the pronunciation vector creation rule is not limited to a table such as FIG. 10, which has as many rows as the number L of phonemes of the learning target language and lists the synthesis target language phonemes similar to each learning target language phoneme. In the following description, however, the pronunciation vector creation rule is assumed to be represented by a table with L rows as in FIG. 10.

The cross-lingual speech synthesis model learning device 300 will be described below with reference to FIGS. 11 to 15. Note that the cross-lingual speech synthesizer 200 can be used to generate synthesized speech.

[Cross-Lingual Speech Synthesis Model Learning Device 300]
As shown in FIG. 11, the cross-lingual speech synthesis model learning device 300 includes an unspecified speaker voice quality converter learning device 310, a speech synthesis model learning device 320, and a recording unit 190. The pronunciation vector creation rule is assumed to be recorded in the recording unit 190 in advance. The cross-lingual speech synthesis model learning device 300 receives as input the target speaker speech data, the learning target language speech data_n (1≦n≦N), the set of utterance information of the utterances included in the learning target language speech data_n (hereinafter referred to as the learning target language utterance information set_n (1≦n≦N)), the synthesis target language speech data, and the synthesis target language utterance information set, and learns and outputs a cross-lingual speech synthesis model.

The configuration and operation of the unspecified speaker voice quality converter learning device 310 and the speech synthesis model learning device 320 are described in detail below.

First, the unspecified speaker voice quality converter learning device 310 will be described with reference to FIGS. 12 and 13. As shown in FIG. 12, the unspecified speaker voice quality converter learning device 310 includes a time information adjustment unit 101, a first pronunciation vector generation unit 302, and a voice quality converter learning unit 303. The operation of the unspecified speaker voice quality converter learning device 310 will be described with reference to FIG. 13.

The time information adjustment unit 101 receives the target speaker speech data and the learning target language speech data_n (1≦n≦N) as input, and generates and outputs time-information-adjusted target speaker speech data_n and time-information-adjusted learning target language speech data_n, which are speech data in which the time information of the speech (utterance) of each sentence included in the target speaker speech data and the learning target language speech data_n has been aligned (S101).

The first pronunciation vector generation unit 302 receives the learning target language utterance information set_n as input, and generates and outputs a learning target language pronunciation vector set_n using the pronunciation vector creation rule read from the recording unit 190 (S302). Here, a pronunciation vector is a vector calculated for each frame of an utterance in the speech data, and indicates which pronunciation information (phoneme) that frame corresponds to. When the pronunciation vector creation rule is represented by a table with L rows as in FIG. 10, the pronunciation vector has L dimensions. Pronunciation vectors (hereinafter referred to as learning target language pronunciation vectors) are calculated from the utterance information in the learning target language utterance information set_n, and their collection is generated as the learning target language pronunciation vector set_n.

A method for calculating a pronunciation vector from the pronunciation information (phoneme) included in a piece of utterance information is described concretely below. First, the index idx corresponding to the pronunciation information (phoneme) is obtained using the pronunciation vector creation rule (where 1≦idx≦L). In the example of FIG. 10, the row containing the learning target language phoneme (here, a Japanese phoneme) is found in the middle column, and its index idx is obtained. Next, the obtained index idx is converted into a numerical vector to calculate the pronunciation vector. As the vectorization method, for example, the following 1-of-K representation is used to calculate the pronunciation vector V=(v₁, v₂, …, v_L) from the index idx (L is the number of phonemes of the learning target language):
v_j = 1 if j = idx, and v_j = 0 otherwise,
where j=1,…,L.

The 1-of-K representation is used for the pronunciation vector because, in general, when information that takes L distinct values (here, the index idx) is fed to a neural network, treating it as a 1-of-K vector yields a better learning result (that is, a better unspecified speaker voice quality converter) than treating it as a one-dimensional numerical value.
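The lookup of idx and its 1-of-K encoding can be sketched as follows. The table `RULES` is a hypothetical stand-in for FIG. 10: only the row pairing Japanese "s" with English "s" and "th" comes from the text, and the other rows, like all function names, are invented for illustration.

```python
# Hypothetical rule table in the style of FIG. 10. Row i (1-based) pairs one
# learning target language phoneme with similar synthesis target language
# phonemes. Only the third row ("s" <-> "s", "th") is taken from the text.
RULES = [
    ("a", ("aa",)),
    ("i", ("iy",)),
    ("s", ("s", "th")),
]

def phoneme_index(phoneme, language, rules=RULES):
    """Return the 1-based row index idx for a phoneme of the given language."""
    for idx, (train_phoneme, synth_phonemes) in enumerate(rules, start=1):
        if language == "train" and phoneme == train_phoneme:
            return idx
        if language == "synth" and phoneme in synth_phonemes:
            return idx
    raise KeyError(f"{phoneme!r} is not covered by the rule table")

def one_of_k(idx, size):
    """Return the 1-of-K vector of length `size` with a 1 at position idx."""
    if not 1 <= idx <= size:
        raise ValueError("idx must satisfy 1 <= idx <= size")
    return [1 if j == idx else 0 for j in range(1, size + 1)]

def pronunciation_vector(phoneme, language, rules=RULES):
    """Map a phoneme of either language to its L-dimensional pronunciation vector."""
    return one_of_k(phoneme_index(phoneme, language, rules), len(rules))
```

With this table, the Japanese phoneme "s" and the English phoneme "th" map to the same vector, which is how the rule absorbs the difference between the two phoneme inventories.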

The voice quality converter learning unit 303 receives as input the sets (1≦n≦N) of time-information-adjusted target speaker speech data_n, time-information-adjusted learning target language speech data_n, and learning target language pronunciation vector set_n, and learns and outputs an unspecified speaker voice quality converter (S303). For example, when the time-information-adjusted target speaker speech data_n and the time-information-adjusted learning target language speech data_n are expressed as spectral parameters, the unspecified speaker voice quality converter is learned as follows.

Let C_{n,target}(t) denote the spectral parameters of the time-information-adjusted target speaker speech data_n, C_{n,train}(t) the spectral parameters of the time-information-adjusted learning target language speech data_n, and V_{n,train}(t) the learning target language pronunciation vector that is an element of the learning target language pronunciation vector set_n, where t is the frame number. Using as training data the pairs consisting of the tuple (C_{n,train}(t), V_{n,train}(t)) and the spectral parameter C_{n,target}(t), an unspecified speaker voice quality converter f_{train→target} is learned that converts an extended input speaker spectral parameter (c_train(t), v_train(t)) into a target speaker spectral parameter c^_target(t) corresponding to the voice quality of the target speaker:
c^_target(t) = f_{train→target}(c_train(t), v_train(t))   (2)

Here, the unspecified speaker voice quality converter f_{train→target} is a voice quality converter that converts an arbitrary extended input speaker spectral parameter at a frame t into a target speaker spectral parameter.

As the learning algorithm for the unspecified speaker voice quality converter f_{train→target}, the method described in (Reference Non-Patent Document 1) may be used, as in the first embodiment. The difference is the training data: each training pair consists of the tuple of a spectral parameter of the time-information-adjusted learning target language speech data_n and its corresponding learning target language pronunciation vector on the input side, and the spectral parameter of the time-information-adjusted target speaker speech data_n on the output side.
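One straightforward way to present the tuple of spectral parameter and pronunciation vector to a frame-wise model is to concatenate them per frame. The text only says the two are used as a pair, so concatenation is an assumption here, and the function name `extend_input` is hypothetical.

```python
import numpy as np

def extend_input(spectral, pronunciation_vectors):
    """Build per-frame extended inputs by concatenation (an assumed encoding).

    spectral:              (T, D) array, one spectral parameter vector per frame
    pronunciation_vectors: (T, L) array, one 1-of-K vector per frame
    Returns a (T, D + L) array whose row t is the extended input for frame t.
    """
    spectral = np.asarray(spectral, dtype=float)
    pronunciation_vectors = np.asarray(pronunciation_vectors, dtype=float)
    if spectral.shape[0] != pronunciation_vectors.shape[0]:
        raise ValueError("spectral and pronunciation data must have the same frame count")
    return np.hstack([spectral, pronunciation_vectors])
```

The same function would serve at conversion time, when synthesis target language spectral parameters are paired with synthesis target language pronunciation vectors of the same dimension L.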

Next, the speech synthesis model learning device 320 will be described with reference to FIGS. 14 and 15. As shown in FIG. 14, the speech synthesis model learning device 320 includes a voice quality conversion unit 311, a second pronunciation vector generation unit 312, and a synthesis model learning unit 313. The operation of the speech synthesis model learning device 320 will be described with reference to FIG. 15.

The second pronunciation vector generation unit 312 receives the synthesis target language utterance information set as input, and generates and outputs a synthesis target language pronunciation vector set using the pronunciation vector creation rule read from the recording unit 190 (S312). The synthesis target language pronunciation vector set is generated in the same way as the learning target language pronunciation vector set_n in the first pronunciation vector generation unit 302: the collection of pronunciation vectors calculated from the utterance information in the synthesis target language utterance information set (hereinafter referred to as synthesis target language pronunciation vectors) is the synthesis target language pronunciation vector set.

The voice quality conversion unit 311 receives the synthesis target language speech data and the synthesis target language pronunciation vector set generated in S312 as input, and generates and outputs voice-quality-converted synthesis target language speech data using the unspecified speaker voice quality converter (S311). For example, when the unspecified speaker voice quality converter is expressed by Equation (2), the tuple (C_{synth,org}(t), V_{synth,org}(t)) of the spectral parameter C_{synth,org}(t) of the synthesis target language speech data and the corresponding synthesis target language pronunciation vector V_{synth,org}(t) is converted into the target speaker spectral parameter C^_{synth,target}(t) of the synthesis target language, which has the voice quality of the target speaker.

When (Reference Non-Patent Document 1) is used for learning the voice quality converter, voice quality conversion may be performed according to (Reference Non-Patent Document 1). The target speaker spectral parameter C^_{synth,target}(t) calculated according to Equation (2) is the voice-quality-converted synthesis target language speech data.

The synthesis model learning unit 313 receives the voice-quality-converted synthesis target language speech data generated in S311 and the synthesis target language utterance information set as input, and learns and outputs a cross-lingual speech synthesis model (S313).

The pronunciation vector creation rule generation device 400 will be described below with reference to FIGS. 16 and 17. For the pronunciation vector creation rule to be generated from speech data of the learning target language and speech data of the synthesis target language, the utterance information of each set of speech data must include phoneme segmentation information.

Note that when the International Phonetic Alphabet (IPA) can be used, the pronunciation vector creation rule can be generated easily. The International Phonetic Alphabet is a set of phonetic symbols defined by the International Phonetic Association that can represent the pronunciation information (phonemes) of any language (see the reference URL).
(Reference URL: https://www.internationalphoneticassociation.org/content/full-ipa-chart)

When a rule for converting utterance information of the learning target language into IPA and a rule for converting utterance information of the synthesis target language into IPA are both available, the correspondence, that is, the pronunciation vector creation rule, can be created by way of the conversion rules from each language into IPA.
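A minimal sketch of building the correspondence through IPA follows. It assumes each language's conversion rule is available as a dictionary from phoneme label to a single IPA symbol, and it matches phonemes only when their IPA symbols are identical; both are simplifying assumptions, and the function name `build_rules_via_ipa` is hypothetical.

```python
def build_rules_via_ipa(train_to_ipa, synth_to_ipa):
    """Create (index, learning phoneme, synthesis phonemes) rows via shared IPA symbols.

    train_to_ipa / synth_to_ipa: dict mapping a phoneme label to one IPA symbol.
    Phonemes are matched only on identical IPA symbols (a simplification);
    synthesis phonemes with no identical symbol stay unassigned.
    """
    # Invert the synthesis-side rule: IPA symbol -> synthesis phoneme labels.
    ipa_to_synth = {}
    for phoneme, ipa in synth_to_ipa.items():
        ipa_to_synth.setdefault(ipa, []).append(phoneme)

    rules = []
    for idx, (train_phoneme, ipa) in enumerate(train_to_ipa.items(), start=1):
        rules.append((idx, train_phoneme, sorted(ipa_to_synth.get(ipa, []))))
    return rules
```

Exact matching leaves some synthesis target language phonemes without a partner, so a practical rule would need an additional similarity criterion for those cases, such as the statistics-based procedure of the pronunciation vector creation rule generation device 400.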

[Pronunciation Vector Creation Rule Generation Device 400]
As shown in FIG. 16, the pronunciation vector creation rule generation device 400 includes a first pronunciation statistical information generation unit 410, a second pronunciation statistical information generation unit 415, a pronunciation similarity determination unit 420, and a recording unit 490. The recording unit 490 is a component that records, as appropriate, information necessary for the processing of the pronunciation vector creation rule generation device 400. The pronunciation vector creation rule generation device 400 receives as input the target speaker speech data, the set of utterance information of the utterances included in the target speaker speech data (hereinafter referred to as the target speaker utterance information set), the synthesis target language speech data, and the synthesis target language utterance information set, and generates and outputs a pronunciation vector creation rule.

The operation of the pronunciation vector creation rule generation device 400 will be described below with reference to FIG. 17.

The first pronunciation statistical information generation unit 410 receives the target speaker speech data and the target speaker utterance information set as input, and generates and outputs learning target language pronunciation statistical information (S410). Specifically, the first pronunciation statistical information generation unit 410 operates as follows (see FIG. 18). The speech segments of learning target language phoneme_j (j=1,…,L, where L is the number of phonemes of the learning target language) are identified in the target speaker speech data, and the speech data (for example, spectral parameters) of learning target language phoneme_j is extracted (S410-1). The pronunciation statistical information of learning target language phoneme_j is calculated from the extracted speech data (S410-2). The learning target language pronunciation statistical information is generated as the collection of the pronunciation statistical information of learning target language phoneme₁, …, learning target language phoneme_L (S410-3). As the statistical information, the mean, variance, or quartiles of all the speech data of learning target language phoneme_j can be used, or a statistical model such as a Gaussian mixture model (GMM).
Note that the learning target language speech data_n and the learning target language utterance information set_n may be used instead of the target speaker speech data and the target speaker utterance information set (where n is an integer from 1 to N).
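Steps S410-1 to S410-3 can be sketched for the mean-and-variance case as follows. The segmentation format (phoneme label, start frame, exclusive end frame) and the function name `phoneme_statistics` are assumptions made for illustration.

```python
import numpy as np

def phoneme_statistics(frames, segments):
    """Compute per-phoneme mean and variance of spectral parameters.

    frames:   (T, D) array of spectral parameters for the whole recording
    segments: iterable of (phoneme, start_frame, end_frame), end exclusive,
              taken from the phoneme segmentation information (assumed format)
    Returns a dict: phoneme -> (mean vector, variance vector).
    """
    frames = np.asarray(frames, dtype=float)
    collected = {}
    for phoneme, start, end in segments:
        # S410-1: extract the frames belonging to this phoneme's segment.
        collected.setdefault(phoneme, []).append(frames[start:end])

    statistics = {}
    for phoneme, chunks in collected.items():
        data = np.vstack(chunks)
        # S410-2: statistics over all frames of this phoneme.
        statistics[phoneme] = (data.mean(axis=0), data.var(axis=0))
    # S410-3: the dict as a whole is the pronunciation statistical information.
    return statistics
```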

The second pronunciation statistical information generation unit 415 receives the synthesis target language speech data and the synthesis target language utterance information set as input, and generates and outputs synthesis target language pronunciation statistical information (S415). The second pronunciation statistical information generation unit 415 operates in the same way as the first pronunciation statistical information generation unit 410, and generates the synthesis target language pronunciation statistical information as the collection of the pronunciation statistical information of synthesis target language phoneme₁, …, synthesis target language phoneme_K (where K is the number of phonemes of the synthesis target language).

The pronunciation similarity determination unit 420 receives the learning target language pronunciation statistical information generated in S410 and the synthesis target language pronunciation statistical information generated in S415 as input, and generates and outputs a pronunciation vector creation rule (S420). Specifically, the pronunciation similarity determination unit 420 operates as follows (see FIG. 19). First, the distance d_ij between the pronunciation statistical information of synthesis target language phoneme_i (i=1,…,K) and the pronunciation statistical information of learning target language phoneme_j (j=1,…,L) is calculated (S420-1). How the distance is calculated depends on the pronunciation statistical information used. For example, when a GMM is used, the KL divergence, a measure of the difference between probability distributions, can be used. When the mean and variance are used, the Mahalanobis distance may be used, and when quartiles are used, the Euclidean distance between the quartiles may be used to calculate d_ij.
Next, for synthesis target language phoneme_i, the learning target language phoneme_{j_min} (1≦j_min≦L) with the smallest distance among d_i1 (the distance to learning target language phoneme₁), …, d_iL (the distance to learning target language phoneme_L) is identified (S420-2). Treating synthesis target language phoneme_i and learning target language phoneme_{j_min} as corresponding, the tuple of index, learning target language phoneme_{j_min}, and synthesis target language phoneme_i is added to the table, and the pronunciation vector creation rule is finally generated (S420-3).
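Steps S420-1 to S420-3 can be sketched for the mean-and-variance case as follows. The distance is a Mahalanobis distance with a diagonal covariance taken from the learning target language phoneme; that particular choice, the small constant `eps`, and the function names are assumptions, not details given in the text.

```python
import numpy as np

def mahalanobis_diag(mean_a, mean_b, var_b, eps=1e-8):
    """Mahalanobis distance from mean_a to a distribution (mean_b, diagonal var_b)."""
    diff = np.asarray(mean_a, dtype=float) - np.asarray(mean_b, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2 / (np.asarray(var_b, dtype=float) + eps))))

def build_rules(train_stats, synth_stats):
    """Assign each synthesis phoneme to its nearest learning phoneme.

    train_stats / synth_stats: dict phoneme -> (mean vector, variance vector).
    Returns rows (index, learning phoneme, [synthesis phonemes]) with L rows.
    """
    train_phonemes = list(train_stats)
    assigned = {phoneme: [] for phoneme in train_phonemes}
    for synth_phoneme, (synth_mean, _) in synth_stats.items():
        # S420-1: distances d_i1 ... d_iL to every learning phoneme.
        distances = [
            mahalanobis_diag(synth_mean, *train_stats[t]) for t in train_phonemes
        ]
        # S420-2: the learning phoneme with the smallest distance.
        nearest = train_phonemes[int(np.argmin(distances))]
        assigned[nearest].append(synth_phoneme)
    # S420-3: the indexed table is the pronunciation vector creation rule.
    return [(idx, t, assigned[t]) for idx, t in enumerate(train_phonemes, start=1)]
```

Every synthesis target language phoneme receives exactly one row, while a learning target language phoneme may collect several synthesis phonemes or none, which matches the table layout of L rows described for FIG. 10.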

According to the invention of this embodiment, generating the unspecified speaker voice quality converter using the utterance information of the learning target language as well makes it possible to generate a cross-lingual speech synthesis model that improves the quality of synthesized speech. Using this cross-lingual speech synthesis model in turn improves the quality of the synthesized speech.

<Third embodiment>
In the first embodiment, the speech data of a single speaker is used as the synthesis target language speech data.

In the present embodiment, the speech data of a plurality of speakers is used as the synthesis target language speech data. This improves the quality of the synthesized speech.

Hereinafter, the speech synthesis model learning device 520 will be described with reference to FIGS. 20 to 21. The unspecified speaker voice quality converter learning device 110 can be used to generate the unspecified speaker voice quality converter. That is, a cross-lingual speech synthesis model learning device can be configured that includes the unspecified speaker voice quality converter learning device 110, the speech synthesis model learning device 520, and the recording unit 190. The cross-lingual speech synthesizer 200 can be used to generate synthesized speech.

[Speech Synthesis Model Learning Device 520]
As shown in FIG. 20, the speech synthesis model learning device 520 includes a voice quality conversion unit 511 and a synthesis model learning unit 513. The operation of the speech synthesis model learning device 520 will be described with reference to FIG. 21. Here, synthesis target language speech data_m (1 ≦ m ≦ M) is required.

The voice quality conversion unit 511 receives the synthesis target language speech data_m as input, generates voice-quality-converted synthesis target language speech data_m using the unspecified speaker voice quality converter, and outputs it (S511). For example, when the unspecified speaker voice quality converter is expressed by equation (1), the spectral parameter C_{synth,org,m}(t) of the synthesis target language speech data_m is converted into the target speaker spectral parameter_m Ĉ_{synth,target,m}(t) of the synthesis target language, which has the voice quality of the target speaker.

When the method of (Reference Non-Patent Document 1) is used to learn the voice quality converter, voice quality conversion may be performed according to (Reference Non-Patent Document 1). The target speaker spectral parameter_m Ĉ_{synth,target,m}(t) calculated according to equation (1) is the voice-quality-converted synthesis target language speech data_m.

The synthesis model learning unit 513 receives the voice-quality-converted synthesis target language speech data_m generated in S511 and the synthesis target language utterance information set_m as inputs, learns a cross-lingual speech synthesis model, and outputs it (S513). The methods of (Non-Patent Document 1) and (Non-Patent Document 2) can be used to learn the cross-lingual speech synthesis model. An average voice model learned from the speech data of a plurality of speakers, as in (Reference Non-Patent Document 3), can also be used as the speech synthesis model.
(Reference Non-Patent Document 3) Masatsune Tamura, Takashi Masuko, Keiichi Tokuda, Takao Kobayashi, "Speaker adaptation of pitch and spectrum in HMM-based speech synthesis", IEICE Transactions D, Vol.J85-D2, No.4, pp.545-553, 2002.

In general, when a speech synthesis model is learned, the larger the amount of speech data and utterance information available for learning, the higher the quality of the synthesized speech that can be generated. In the present embodiment, the synthesis target language speech data of M speakers is used. In that case, the voice-quality-converted synthesis target language speech data of the M speakers, all of which has the voice quality of the target speaker, can be used as learning data for the speech synthesis model. Compared with using the synthesis target language speech data of only one speaker, this increases the amount of speech data available for learning the speech synthesis model, so a cross-lingual speech synthesis model that improves the quality of synthesized speech can be generated. Using this cross-lingual speech synthesis model in turn improves the quality of the synthesized speech.
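
The pooling of converted data from M speakers can be pictured with the short Python sketch below. It is only an illustration under stated assumptions: each utterance is a list of spectral parameter frames, and `toy_converter` is a hypothetical stand-in that merely shifts each value. It is not equation (1) or the converter of (Reference Non-Patent Document 1). All names and values are invented for the example.

```python
def convert_utterance(frames, converter):
    """Apply a frame-wise converter to one utterance (a list of spectral parameter frames)."""
    return [converter(frame) for frame in frames]

def pool_training_data(speech_by_speaker, info_by_speaker, converter):
    """Convert every speaker's utterances (S511) and pool them with their utterance information."""
    pooled = []
    for m in sorted(speech_by_speaker):
        for frames, info in zip(speech_by_speaker[m], info_by_speaker[m]):
            pooled.append((m, convert_utterance(frames, converter), info))
    return pooled

def toy_converter(frame):
    """Hypothetical stand-in for the unspecified speaker voice quality converter."""
    return [x + 1.0 for x in frame]

# Hypothetical data: speaker 1 has two utterances, speaker 2 has one (one frame each)
speech_by_speaker = {
    1: [[[0.0, 0.5]], [[1.0, 1.5]]],
    2: [[[2.0, 2.5]]],
}
info_by_speaker = {1: ["utt 1-1", "utt 1-2"], 2: ["utt 2-1"]}

pooled = pool_training_data(speech_by_speaker, info_by_speaker, toy_converter)
print(len(pooled))  # number of (speaker, converted utterance, utterance information) items
```

The pooled list is what the synthesis model learning unit would receive in S513. Its size grows with the number of speakers M, which is the source of the quality gain described above.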

<Fourth embodiment>
In the third embodiment, all of the voice-quality-converted synthesis target language speech data generated from the synthesis target language speech data of the M speakers is used for learning the speech synthesis model.

However, when the speaker characteristics of synthesis target language input speaker_m and the target speaker differ greatly, the conversion accuracy of the voice-quality-converted synthesis target language speech data_m generated by the voice quality conversion unit 511 may deteriorate. It is therefore not always appropriate to use the voice-quality-converted synthesis target language speech data of all M speakers for learning the speech synthesis model.

Therefore, to avoid this problem, the present embodiment uses only the data with high conversion accuracy, out of the voice-quality-converted synthesis target language speech data of the M speakers, for learning the speech synthesis model.
This improves the quality of the synthesized speech.

Hereinafter, the speech synthesis model learning device 620 will be described with reference to FIGS. 22 to 23. The unspecified speaker voice quality converter learning device 110 can be used to generate the unspecified speaker voice quality converter. That is, a cross-lingual speech synthesis model learning device can be configured that includes the unspecified speaker voice quality converter learning device 110, the speech synthesis model learning device 620, and the recording unit 190. The cross-lingual speech synthesizer 200 can be used to generate synthesized speech.

[Speech Synthesis Model Learning Device 620]
As shown in FIG. 22, the speech synthesis model learning device 620 includes a voice quality conversion unit 511, a speaker similarity determination unit 612, and a synthesis model learning unit 513. The operation of the speech synthesis model learning device 620 will be described with reference to FIG. 23.

The voice quality conversion unit 511 receives the synthesis target language speech data_m as input, generates voice-quality-converted synthesis target language speech data_m using the unspecified speaker voice quality converter, and outputs it (S511).

The speaker similarity determination unit 612 receives the voice-quality-converted synthesis target language speech data_m (1 ≦ m ≦ M) generated in S511 and the target speaker speech data as inputs, and calculates a similarity_m indicating how similar the voice-quality-converted synthesis target language speech data_m is to the target speaker speech data. When the voice-quality-converted synthesis target language speech data_m is determined, based on the similarity_m, to be similar to the target speaker speech data, it is added to the target speaker similar synthesis target language speech data set, which is the set of voice-quality-converted synthesis target language speech data similar to the target speaker speech data. The target speaker similar synthesis target language speech data set is generated in this way (S612). The similarity is calculated from the difference between the voice-quality-converted synthesis target language speech data_m and the target speaker speech data, for example the difference in spectral parameters. One such method is that of (Reference Non-Patent Document 4), which uses a Gaussian mixture model (GMM).
(Reference Non-Patent Document 4) D. A. Reynolds, "Speaker identification and verification using Gaussian mixture speaker models", Speech Communication, vol.17, Issues 1-2, pp.91-108, 1995.

Specifically, the speaker similarity determination unit 612 operates as follows (see FIG. 24). First, the GMM of the target speaker is learned using the target speaker speech data (S612-1). Next, the voice-quality-converted synthesis target language speech data_m is input to the learned GMM of the target speaker, and the likelihood for synthesis target language input speaker_m is calculated as the similarity_m (S612-2). Finally, when the similarity_m is equal to or greater than a predetermined threshold (or greater than the predetermined threshold), the voice-quality-converted synthesis target language speech data_m and the target speaker speech data are determined to be similar (that is, synthesis target language input speaker_m is determined to be a speaker highly similar to the target speaker), and the voice-quality-converted synthesis target language speech data_m is made an element of the target speaker similar synthesis target language speech data set (S612-3).
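
Steps S612-1 to S612-3 can be sketched as follows. To keep the example dependency-free, a single diagonal-covariance Gaussian stands in for the GMM of the target speaker, and the similarity_m is taken to be the average per-frame log-likelihood. Both choices are simplifications made for illustration, not what (Reference Non-Patent Document 4) prescribes. The function names, frame values, and threshold are hypothetical.

```python
import math

def fit_diag_gaussian(frames):
    """S612-1 (simplified): fit one diagonal Gaussian to the target speaker's frames."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Floor the variance so the log-likelihood stays finite
    var = [max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, 1e-6) for d in range(dim)]
    return mean, var

def avg_log_likelihood(frames, model):
    """S612-2: average per-frame log-likelihood of converted frames under the model."""
    mean, var = model
    total = 0.0
    for f in frames:
        for x, mu, v in zip(f, mean, var):
            total += -0.5 * (math.log(2.0 * math.pi * v) + (x - mu) ** 2 / v)
    return total / len(frames)

def select_similar(converted_by_speaker, target_frames, threshold):
    """S612-3: keep speakers whose similarity is equal to or greater than the threshold."""
    model = fit_diag_gaussian(target_frames)
    similarity = {m: avg_log_likelihood(fr, model) for m, fr in converted_by_speaker.items()}
    selected = {m for m, s in similarity.items() if s >= threshold}
    return similarity, selected

# Hypothetical two-dimensional spectral parameter frames
target_frames = [[0.0, 1.0], [0.2, 0.8], [-0.2, 1.2], [0.1, 0.9]]
converted_by_speaker = {
    1: [[0.1, 1.0], [0.0, 0.9]],     # close to the target speaker
    2: [[3.0, -2.0], [2.5, -1.5]],   # far from the target speaker
}

similarity, selected = select_similar(converted_by_speaker, target_frames, threshold=0.0)
print(sorted(selected))
```

With a real GMM the structure stays the same: only `fit_diag_gaussian` and `avg_log_likelihood` would be replaced by mixture training and mixture scoring. The threshold would have to be tuned on held-out data.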

The synthesis model learning unit 513 receives the target speaker similar synthesis target language speech data set generated in S612 and the synthesis target language utterance information set_m (1 ≦ m ≦ M) as inputs, learns a cross-lingual speech synthesis model, and outputs it (S513). Here, only the synthesis target language utterance information set_m that corresponds to voice-quality-converted synthesis target language speech data_m belonging to the target speaker similar synthesis target language speech data set is used.

Instead of generating the target speaker similar synthesis target language speech data set in S612, the voice-quality-converted synthesis target language speech data with the highest similarity may be generated as the target speaker similar synthesis target language speech data. In this case, the synthesis model learning unit 513 receives the target speaker similar synthesis target language speech data generated in S612 and the synthesis target language utterance information set_m (1 ≦ m ≦ M) as inputs, learns a cross-lingual speech synthesis model, and outputs it.
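
The two selection variants, and the pairing of the selected speech data with its utterance information set, can be compared in a few lines of Python. This is a sketch under assumptions: similarities are already computed and stored per speaker index m, and the data items are placeholder strings. The function names and values are hypothetical.

```python
def select_set(similarity, threshold):
    """Threshold variant: every speaker whose similarity_m reaches the threshold."""
    return sorted(m for m, s in similarity.items() if s >= threshold)

def select_best(similarity):
    """Alternative variant: only the speaker with the highest similarity_m."""
    return [max(similarity, key=similarity.get)]

def training_pairs(selected, converted, utterance_info):
    """Pair each selected converted speech data_m with its utterance information set_m."""
    return [(converted[m], utterance_info[m]) for m in selected]

# Hypothetical similarities (for example, average log-likelihoods) for M = 3 speakers
similarity = {1: -3.2, 2: -0.4, 3: -1.1}
converted = {m: f"converted speech {m}" for m in similarity}
utterance_info = {m: f"utterance info {m}" for m in similarity}

print(training_pairs(select_set(similarity, -1.5), converted, utterance_info))
print(training_pairs(select_best(similarity), converted, utterance_info))
```

In both variants the utterance information sets of unselected speakers are simply never paired, which matches the statement that only the corresponding synthesis target language utterance information set_m is used.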

In the present embodiment, only the voice-quality-converted synthesis target language speech data with high similarity is used as learning data for the speech synthesis model. This makes it possible to generate a cross-lingual speech synthesis model that improves the quality of synthesized speech. Using this cross-lingual speech synthesis model in turn improves the quality of the synthesized speech.

In particular, when only the voice-quality-converted synthesis target language speech data with the highest similarity is used as learning data, it is possible to generate a cross-lingual speech synthesis model whose synthesized speech quality is close to that of a speech synthesis model that would be generated if speech data of the target speaker in the synthesis target language were available. Using this cross-lingual speech synthesis model makes it possible to generate synthesized speech whose quality is close to that of synthesized speech from such a model.

<Fifth embodiment>
In the second embodiment, the speech data of a single speaker is used as the synthesis target language speech data.

Hereinafter, the speech synthesis model learning device 720 will be described with reference to FIGS. 25 to 26. The unspecified speaker voice quality converter learning device 310 can be used to generate the unspecified speaker voice quality converter. That is, a cross-lingual speech synthesis model learning device can be configured that includes the unspecified speaker voice quality converter learning device 310, the speech synthesis model learning device 720, and the recording unit 190. It is assumed here that the pronunciation vector creation rule has been recorded in the recording unit 190 in advance. The cross-lingual speech synthesizer 200 can be used to generate synthesized speech.

[Speech Synthesis Model Learning Device 720]
As shown in FIG. 25, the speech synthesis model learning device 720 includes a voice quality conversion unit 711, a second pronunciation vector generation unit 712, and a synthesis model learning unit 713. The operation of the speech synthesis model learning device 720 will be described with reference to FIG. 26. Here, synthesis target language speech data_m (1 ≦ m ≦ M) is required.

The second pronunciation vector generation unit 712 receives the synthesis target language utterance information set_m as input, generates a synthesis target language pronunciation vector set_m using the pronunciation vector creation rule read from the recording unit 190, and outputs it (S712). The synthesis target language pronunciation vector set_m is generated in the same way as the synthesis target language pronunciation vector set in the second pronunciation vector generation unit 312: the set of pronunciation vectors calculated from the utterance information in the synthesis target language utterance information set_m (hereinafter referred to as synthesis target language pronunciation vectors) is the synthesis target language pronunciation vector set_m.

The voice quality conversion unit 711 receives the synthesis target language speech data_m and the synthesis target language pronunciation vector set_m generated in S712 as inputs, generates voice-quality-converted synthesis target language speech data_m using the unspecified speaker voice quality converter, and outputs it (S711). For example, when the unspecified speaker voice quality converter is expressed by equation (2), the pair (C_{synth,org,m}(t), V_{synth,org,m}(t)) of the spectral parameter C_{synth,org,m}(t) of the synthesis target language speech data_m and the corresponding synthesis target language pronunciation vector V_{synth,org,m}(t) is converted into the target speaker spectral parameter_m Ĉ_{synth,target,m}(t) of the synthesis target language, which has the voice quality of the target speaker.

When the method of (Reference Non-Patent Document 1) is used to learn the voice quality converter, voice quality conversion may be performed according to (Reference Non-Patent Document 1). The target speaker spectral parameter_m Ĉ_{synth,target,m}(t) calculated according to equation (2) is the voice-quality-converted synthesis target language speech data_m.

The synthesis model learning unit 713 receives the voice-quality-converted synthesis target language speech data_m generated in S711 and the synthesis target language utterance information set_m as inputs, learns a cross-lingual speech synthesis model, and outputs it (S713). The methods of (Non-Patent Document 1), (Non-Patent Document 2), and (Reference Non-Patent Document 3) can be used to learn the cross-lingual speech synthesis model.

In the present embodiment, the synthesis target language speech data of M speakers is used. In that case, the voice-quality-converted synthesis target language speech data of the M speakers, all of which has the voice quality of the target speaker, can be used as learning data for the speech synthesis model. Compared with using the synthesis target language speech data of only one speaker, this increases the amount of speech data available for learning the speech synthesis model, so a cross-lingual speech synthesis model that improves the quality of synthesized speech can be generated. Moreover, as in the second embodiment, the unspecified speaker voice quality converter is learned using the utterance information of the learning target language as well, so a cross-lingual speech synthesis model that further improves the quality of synthesized speech can be generated. Using this cross-lingual speech synthesis model in turn improves the quality of the synthesized speech.

<Sixth embodiment>
In the fifth embodiment, all of the voice-quality-converted synthesis target language speech data generated from the synthesis target language speech data of the M speakers is used for learning the speech synthesis model.

However, when the speaker characteristics of synthesis target language input speaker_m and the target speaker differ greatly, the conversion accuracy of the voice-quality-converted synthesis target language speech data_m generated by the voice quality conversion unit 711 may deteriorate. It is therefore not always appropriate to use the voice-quality-converted synthesis target language speech data of all M speakers for learning the speech synthesis model.

Hereinafter, the speech synthesis model learning device 820 will be described with reference to FIGS. 27 to 28. The unspecified speaker voice quality converter learning device 310 can be used to generate the unspecified speaker voice quality converter. That is, a cross-lingual speech synthesis model learning device can be configured that includes the unspecified speaker voice quality converter learning device 310, the speech synthesis model learning device 820, and the recording unit 190. It is assumed here that the pronunciation vector creation rule has been recorded in the recording unit 190 in advance. The cross-lingual speech synthesizer 200 can be used to generate synthesized speech.

[Speech Synthesis Model Learning Device 820]
As shown in FIG. 27, the speech synthesis model learning device 820 includes a voice quality conversion unit 711, a second pronunciation vector generation unit 712, a speaker similarity determination unit 812, and a synthesis model learning unit 713. The operation of the speech synthesis model learning device 820 will be described with reference to FIG. 28.

The second pronunciation vector generation unit 712 receives the synthesis target language utterance information set_m as input, generates a synthesis target language pronunciation vector set_m using the pronunciation vector creation rule read from the recording unit 190, and outputs it (S712).

The voice quality conversion unit 711 receives the synthesis target language speech data_m and the synthesis target language pronunciation vector set_m generated in S712 as inputs, generates voice-quality-converted synthesis target language speech data_m using the unspecified speaker voice quality converter, and outputs it (S711).

The speaker similarity determination unit 812 receives the voice-quality-converted synthesis target language speech data_m (1 ≦ m ≦ M) generated in S711 and the target speaker speech data as inputs, and calculates a similarity_m indicating how similar the voice-quality-converted synthesis target language speech data_m is to the target speaker speech data. When the voice-quality-converted synthesis target language speech data_m is determined, based on the similarity_m, to be similar to the target speaker speech data, it is added to the target speaker similar synthesis target language speech data set, which is the set of voice-quality-converted synthesis target language speech data similar to the target speaker speech data. The target speaker similar synthesis target language speech data set is generated in this way (S812). The speaker similarity determination unit 812 may operate in the same way as the speaker similarity determination unit 612.

The synthesis model learning unit 713 receives the target speaker similar synthesis target language speech data set generated in S812 and the synthesis target language utterance information set_m (1 ≦ m ≦ M) as inputs, learns a cross-lingual speech synthesis model, and outputs it (S713). Here, only the synthesis target language utterance information set_m that corresponds to voice-quality-converted synthesis target language speech data_m belonging to the target speaker similar synthesis target language speech data set is used.

Instead of generating the target speaker similar synthesis target language speech data set in S812, the voice-quality-converted synthesis target language speech data with the highest similarity may be generated as the target speaker similar synthesis target language speech data. In this case, the synthesis model learning unit 713 receives the target speaker similar synthesis target language speech data generated in S812 and the synthesis target language utterance information set_m (1 ≦ m ≦ M) as inputs, learns a cross-lingual speech synthesis model, and outputs it.

In the present embodiment, only the voice-quality-converted synthesis target language speech data with high similarity is used as learning data for the speech synthesis model. This makes it possible to generate a cross-lingual speech synthesis model that improves the quality of synthesized speech. Moreover, as in the second embodiment, the unspecified speaker voice quality converter is learned using the utterance information of the learning target language as well, so a cross-lingual speech synthesis model that further improves the quality of synthesized speech can be generated. Using this cross-lingual speech synthesis model in turn improves the quality of the synthesized speech.

<Modification>
It goes without saying that the present invention is not limited to the embodiments described above and can be modified as appropriate without departing from the spirit of the invention. The various processes described in the above embodiments may be executed not only in time series in the order described but also in parallel or individually, depending on the processing capability of the device that executes them or as necessary.

<Supplementary note>
The device of the present invention has, for example as a single hardware entity, an input unit to which a keyboard or the like can be connected; an output unit to which a liquid crystal display or the like can be connected; a communication unit to which a communication device (for example, a communication cable) capable of communicating with the outside of the hardware entity can be connected; a CPU (Central Processing Unit, which may include a cache memory, registers, and the like); RAM and ROM, which are memories; an external storage device, which is a hard disk; and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the hardware entity may also be provided with a device (drive) that can read from and write to a recording medium such as a CD-ROM. A general-purpose computer is one example of a physical entity equipped with such hardware resources.

The external storage device of the hardware entity stores the program needed to implement the functions described above, the data needed for the processing of that program, and the like (the program is not limited to the external storage device and may, for example, be stored in a ROM, which is a read-only storage device). Data and the like obtained by the processing of these programs are stored as appropriate in the RAM, the external storage device, or the like.

In the hardware entity, each program stored in the external storage device (or ROM or the like) and the data needed for the processing of each program are read into memory as necessary and are interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU implements predetermined functions (the components referred to above as "... unit", "... means", and so on).

The present invention is not limited to the embodiments described above and can be modified as appropriate without departing from the spirit of the present invention. The processes described in the above embodiments may be executed not only in time series in the order described but also in parallel or individually, depending on the processing capability of the device that executes them or as necessary.

As described above, when the processing functions of the hardware entity (the apparatus of the present invention) described in the above embodiments are realized by a computer, the processing content of the functions that the hardware entity should have is described by a program. Executing this program on the computer realizes the processing functions of the hardware entity on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), CD-R (Recordable)/RW (ReWritable), or the like as the optical disc; an MO (Magneto-Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers over a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing a process, the computer reads the program stored in its own recording medium and executes the process according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program each time a program is transferred to it from the server computer. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized solely through execution instructions and retrieval of results. Note that the program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the processing of the computer).

In this embodiment, the hardware entity is configured by executing a predetermined program on a computer; however, at least part of this processing content may instead be realized in hardware.

Claims

With N being an integer of 1 or more and n being an integer satisfying 1 ≦ n ≦ N,
target speaker speech data being speech data in a learning target language uttered by a target speaker, and learning target language speech data_n (1 ≦ n ≦ N) being speech data in the learning target language uttered by a learning target language input speaker_n in which the same sentence as in the target speaker speech data is uttered,
a cross-lingual speech synthesis model learning device that learns, from the target speaker speech data and the learning target language speech data_n (1 ≦ n ≦ N), a cross-lingual speech synthesis model for synthesizing speech in a synthesis target language by the target speaker, the device comprising:
a time information adjustment unit that adjusts the time information of the target speaker speech data and the learning target language speech data_n to generate time-information-adjusted target speaker speech data_n and time-information-adjusted learning target language speech data_n;
a voice quality converter learning unit that learns, from the set (1 ≦ n ≦ N) of the time-information-adjusted target speaker speech data_n and the time-information-adjusted learning target language speech data_n, an unspecified speaker voice quality converter that converts arbitrary speech data into speech data having the voice quality of the target speaker;
a voice quality conversion unit that uses the unspecified speaker voice quality converter to generate, from synthesis target language speech data, which is speech data in the synthesis target language uttered by a synthesis target language input speaker, voice-quality-converted synthesis target language speech data having the voice quality of the target speaker; and
a synthesis model learning unit that learns the cross-lingual speech synthesis model from the voice-quality-converted synthesis target language speech data and a synthesis target language utterance information set, which is a set of utterance information of the utterances included in the synthesis target language speech data.
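The time information adjustment recited above can be illustrated with a small sketch. The claim does not name an alignment method; dynamic time warping (DTW) is used here only as one common way to pair frames of two utterances of the same sentence, and the function name `dtw_align` and the frame representation (tuples of floats) are assumptions made for illustration.

```python
from math import dist, inf


def dtw_align(target_frames, source_frames):
    """Align two frame sequences with DTW and return equal-length sequences.

    Assumption: each frame is a tuple of floats (for example, spectral features).
    """
    T, S = len(target_frames), len(source_frames)
    # cost[i][j]: minimal accumulated distance aligning the first i target
    # frames with the first j source frames.
    cost = [[inf] * (S + 1) for _ in range(T + 1)]
    cost[0][0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, S + 1):
            d = dist(target_frames[i - 1], source_frames[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end to recover the frame-to-frame correspondence.
    path = []
    i, j = T, S
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()

    aligned_target = [target_frames[a] for a, _ in path]
    aligned_source = [source_frames[b] for _, b in path]
    return aligned_target, aligned_source
```

Running this once per learning target language input speaker_n would yield N frame-aligned pairs, which is the kind of input a frame-wise voice quality converter typically needs.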

With N being an integer of 1 or more and n being an integer satisfying 1 ≦ n ≦ N,
target speaker speech data being speech data in a learning target language uttered by a target speaker, and learning target language speech data_n (1 ≦ n ≦ N) being speech data in the learning target language uttered by a learning target language input speaker_n in which the same sentence as in the target speaker speech data is uttered,
a cross-lingual speech synthesis model learning device that learns, from the target speaker speech data and the learning target language speech data_n (1 ≦ n ≦ N), a cross-lingual speech synthesis model for synthesizing speech in a synthesis target language by the target speaker, the device comprising:
a recording unit that records a pronunciation vector creation rule indicating a correspondence between phonemes of the learning target language and phonemes of the synthesis target language;
a time information adjustment unit that adjusts the time information of the target speaker speech data and the learning target language speech data_n to generate time-information-adjusted target speaker speech data_n and time-information-adjusted learning target language speech data_n;
a first pronunciation vector generation unit that uses the pronunciation vector creation rule to generate, from a learning target language utterance information set_n, which is a set of utterance information of the utterances included in the learning target language speech data_n, a learning target language pronunciation vector set_n, which is a set of pronunciation vectors calculated from the utterance information;
a voice quality converter learning unit that learns, from the set (1 ≦ n ≦ N) of the time-information-adjusted target speaker speech data_n, the time-information-adjusted learning target language speech data_n, and the learning target language pronunciation vector set_n, an unspecified speaker voice quality converter that converts arbitrary speech data into speech data having the voice quality of the target speaker;
a second pronunciation vector generation unit that uses the pronunciation vector creation rule to generate, from a synthesis target language utterance information set, which is a set of utterance information of the utterances included in synthesis target language speech data that is speech data in the synthesis target language uttered by a synthesis target language input speaker, a synthesis target language pronunciation vector set, which is a set of pronunciation vectors calculated from the utterance information;
a voice quality conversion unit that uses the unspecified speaker voice quality converter to generate, from the synthesis target language speech data and the synthesis target language pronunciation vector set, voice-quality-converted synthesis target language speech data having the voice quality of the target speaker; and
a synthesis model learning unit that learns the cross-lingual speech synthesis model from the voice-quality-converted synthesis target language speech data and the synthesis target language utterance information set.
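As a rough illustration of how a pronunciation vector creation rule could relate phonemes of the two languages, the sketch below maps every phoneme to the learning target language phonemes it corresponds to and spreads unit weight over them. The phoneme inventory, the rule entries, and the equal-weight scheme are all hypothetical; the claim only requires that the rule indicate a correspondence between the phonemes of the two languages.

```python
# Hypothetical phoneme inventory of the learning target language;
# each phoneme is one dimension of the pronunciation vector.
LEARNING_PHONEMES = ["a", "i", "u", "e", "o", "k", "s", "r"]

# Hypothetical pronunciation vector creation rule. Learning target language
# phonemes map to themselves; synthesis target language phonemes map to the
# learning target language phonemes they are assumed to correspond to.
RULE = {p: [p] for p in LEARNING_PHONEMES}
RULE.update({
    "l": ["r"],        # assumed single counterpart
    "ae": ["a", "e"],  # assumed to lie between two counterparts
})


def pronunciation_vector(phoneme):
    """Return a vector over LEARNING_PHONEMES for one phoneme of either language."""
    counterparts = RULE[phoneme]
    weight = 1.0 / len(counterparts)
    return [weight if p in counterparts else 0.0 for p in LEARNING_PHONEMES]


def pronunciation_vector_set(utterance_phonemes):
    """Convert the phoneme sequence of an utterance into a list of pronunciation vectors."""
    return [pronunciation_vector(p) for p in utterance_phonemes]
```

Because both languages are projected onto the same vector space, a converter trained with learning target language vectors can, under these assumptions, also accept vectors derived from synthesis target language utterances.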

The cross-lingual speech synthesis model learning device according to claim 1, wherein
M is an integer of 1 or more and m is an integer satisfying 1 ≦ m ≦ M,
the voice quality conversion unit generates, from synthesis target language speech data_m, which is speech data in the synthesis target language uttered by a synthesis target language input speaker_m, voice-quality-converted synthesis target language speech data_m having the voice quality of the target speaker, and
the synthesis model learning unit learns the cross-lingual speech synthesis model from the set (1 ≦ m ≦ M) of the voice-quality-converted synthesis target language speech data_m and a synthesis target language utterance information set_m, which is a set of utterance information of the utterances included in the synthesis target language speech data_m.

The cross-lingual speech synthesis model learning device according to claim 3,
further comprising
a speaker similarity determination unit that calculates a similarity_m indicating the degree of similarity between the voice-quality-converted synthesis target language speech data_m and the target speaker speech data and, when the voice-quality-converted synthesis target language speech data_m is determined on the basis of the similarity_m to be similar to the target speaker speech data, adds the voice-quality-converted synthesis target language speech data_m to a target-speaker-similar synthesis target language speech data set, which is a set of voice-quality-converted synthesis target language speech data similar to the target speaker speech data, thereby generating the target-speaker-similar synthesis target language speech data set,
wherein the synthesis model learning unit learns the cross-lingual speech synthesis model from the target-speaker-similar synthesis target language speech data set and the synthesis target language utterance information set_m (1 ≦ m ≦ M).
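The speaker similarity determination can be sketched as a simple filter over the M converted data sets. The claim does not specify how similarity_m is computed or how the decision is made; the cosine similarity between mean feature vectors and the fixed `threshold` below are assumptions chosen only to make the selection logic concrete.

```python
from math import sqrt


def mean_vector(frames):
    """Average a sequence of equal-length feature frames dimension by dimension."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def cosine_similarity(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_similar(converted_sets, target_frames, threshold=0.9):
    """Keep only converted data judged similar to the target speaker.

    converted_sets[m - 1] holds the frames of converted speech data_m.
    The threshold value is a hypothetical choice, not taken from the source.
    """
    target_mean = mean_vector(target_frames)
    similar = []
    for m, frames in enumerate(converted_sets, start=1):
        similarity_m = cosine_similarity(mean_vector(frames), target_mean)
        if similarity_m >= threshold:
            similar.append((m, frames))
    return similar
```

The returned list plays the role of the target-speaker-similar data set: only the converted data that passed the check would be handed to synthesis model learning.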

A cross-lingual speech synthesis device that uses the cross-lingual speech synthesis model learned by the cross-lingual speech synthesis model learning device according to any one of claims 1 to 4 to generate, from synthesis text, which is text in the synthesis target language that is the target of speech synthesis, synthesized speech of the synthesis text, the device comprising:
a text analysis unit that generates context information of the synthesis text from the synthesis text;
a speech parameter generation unit that uses the cross-lingual speech synthesis model to generate speech parameters from the context information; and
a speech waveform generation unit that generates the synthesized speech from the speech parameters.
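The three stages of the synthesis device (text analysis, speech parameter generation, waveform generation) form a straight pipeline, which the toy sketch below makes explicit. Everything inside each stage is a hypothetical stand-in: the "context information" is just a list of characters, the "model" is a lookup table from a label to a pitch and a sample count, and the waveform is a plain sine tone. Only the order of the stages and the data passed between them reflect the claim.

```python
import math

SAMPLE_RATE = 16000  # assumed sampling rate in Hz


def analyze_text(text):
    """Toy text analysis: one context label per non-space character."""
    return [ch for ch in text.lower() if not ch.isspace()]


def generate_parameters(model, context):
    """Toy parameter generation: look up (f0 in Hz, number of samples) per label."""
    return [model.get(label, model["<unk>"]) for label in context]


def generate_waveform(params):
    """Toy waveform generation: a sine tone for each (f0, number of samples) pair."""
    samples = []
    for f0, n_samples in params:
        samples.extend(
            math.sin(2.0 * math.pi * f0 * t / SAMPLE_RATE) for t in range(n_samples)
        )
    return samples


def synthesize(model, text):
    """Run the pipeline: text -> context information -> speech parameters -> waveform."""
    context = analyze_text(text)
    params = generate_parameters(model, context)
    return generate_waveform(params)
```

In a real system each stand-in would be replaced by a proper front end, the learned cross-lingual speech synthesis model, and a vocoder, while `synthesize` would keep the same shape.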

With N being an integer of 1 or more and n being an integer satisfying 1 ≦ n ≦ N,
target speaker speech data being speech data in a learning target language uttered by a target speaker, and learning target language speech data_n (1 ≦ n ≦ N) being speech data in the learning target language uttered by a learning target language input speaker_n in which the same sentence as in the target speaker speech data is uttered,
a cross-lingual speech synthesis model learning method in which a cross-lingual speech synthesis model learning device learns, from the target speaker speech data and the learning target language speech data_n (1 ≦ n ≦ N), a cross-lingual speech synthesis model for synthesizing speech in a synthesis target language by the target speaker, the method comprising:
a time information adjustment step in which the cross-lingual speech synthesis model learning device adjusts the time information of the target speaker speech data and the learning target language speech data_n to generate time-information-adjusted target speaker speech data_n and time-information-adjusted learning target language speech data_n;
a voice quality converter learning step in which the cross-lingual speech synthesis model learning device learns, from the set (1 ≦ n ≦ N) of the time-information-adjusted target speaker speech data_n and the time-information-adjusted learning target language speech data_n, an unspecified speaker voice quality converter that converts arbitrary speech data into speech data having the voice quality of the target speaker;
a voice quality conversion step in which the cross-lingual speech synthesis model learning device uses the unspecified speaker voice quality converter to generate, from synthesis target language speech data, which is speech data in the synthesis target language uttered by a synthesis target language input speaker, voice-quality-converted synthesis target language speech data having the voice quality of the target speaker; and
a synthesis model learning step in which the cross-lingual speech synthesis model learning device learns the cross-lingual speech synthesis model from the voice-quality-converted synthesis target language speech data and a synthesis target language utterance information set, which is a set of utterance information of the utterances included in the synthesis target language speech data.
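The order of the four steps in this method, and what each step passes to the next, can be summarized in a short orchestration sketch. The function and parameter names are hypothetical, and the three callables are placeholders for whatever alignment, converter training, and synthesis model training procedures are actually used; the sketch shows data flow only.

```python
def learn_cross_lingual_model(
    target_speech,
    learning_speech_list,
    synthesis_speech,
    synthesis_utterance_info,
    *,
    adjust_time,
    learn_converter,
    learn_synthesis_model,
):
    """Chain the four steps; each callable is an injected, hypothetical stand-in."""
    # Step 1: time information adjustment, once per input speaker n.
    aligned_pairs = [
        adjust_time(target_speech, learning_speech_n)
        for learning_speech_n in learning_speech_list
    ]
    # Step 2: learn the voice quality converter from all N aligned pairs.
    converter = learn_converter(aligned_pairs)
    # Step 3: convert the synthesis target language speech to the target voice quality.
    converted_speech = converter(synthesis_speech)
    # Step 4: learn the synthesis model from converted speech and utterance information.
    return learn_synthesis_model(converted_speech, synthesis_utterance_info)
```

Note that the target speaker never has to speak the synthesis target language: that speech enters only at step 3, from a different speaker, and acquires the target voice quality through the converter.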

With N being an integer of 1 or more and n being an integer satisfying 1 ≦ n ≦ N,
target speaker speech data being speech data in a learning target language uttered by a target speaker, and learning target language speech data_n (1 ≦ n ≦ N) being speech data in the learning target language uttered by a learning target language input speaker_n in which the same sentence as in the target speaker speech data is uttered,
a cross-lingual speech synthesis model learning method in which a cross-lingual speech synthesis model learning device, having a recording unit that records a pronunciation vector creation rule indicating a correspondence between phonemes of the learning target language and phonemes of a synthesis target language, learns, from the target speaker speech data and the learning target language speech data_n (1 ≦ n ≦ N), a cross-lingual speech synthesis model for synthesizing speech in the synthesis target language by the target speaker, the method comprising:
a time information adjustment step in which the cross-lingual speech synthesis model learning device adjusts the time information of the target speaker speech data and the learning target language speech data_n to generate time-information-adjusted target speaker speech data_n and time-information-adjusted learning target language speech data_n;
a first pronunciation vector generation step in which the cross-lingual speech synthesis model learning device uses the pronunciation vector creation rule to generate, from a learning target language utterance information set_n, which is a set of utterance information of the utterances included in the learning target language speech data_n, a learning target language pronunciation vector set_n, which is a set of pronunciation vectors calculated from the utterance information;
a voice quality converter learning step in which the cross-lingual speech synthesis model learning device learns, from the set (1 ≦ n ≦ N) of the time-information-adjusted target speaker speech data_n, the time-information-adjusted learning target language speech data_n, and the learning target language pronunciation vector set_n, an unspecified speaker voice quality converter that converts arbitrary speech data into speech data having the voice quality of the target speaker;
a second pronunciation vector generation step in which the cross-lingual speech synthesis model learning device uses the pronunciation vector creation rule to generate, from a synthesis target language utterance information set, which is a set of utterance information of the utterances included in synthesis target language speech data that is speech data in the synthesis target language uttered by a synthesis target language input speaker, a synthesis target language pronunciation vector set, which is a set of pronunciation vectors calculated from the utterance information;
a voice quality conversion step in which the cross-lingual speech synthesis model learning device uses the unspecified speaker voice quality converter to generate, from the synthesis target language speech data and the synthesis target language pronunciation vector set, voice-quality-converted synthesis target language speech data having the voice quality of the target speaker; and
a synthesis model learning step in which the cross-lingual speech synthesis model learning device learns the cross-lingual speech synthesis model from the voice-quality-converted synthesis target language speech data and the synthesis target language utterance information set.

A program for causing a computer to function as the cross-lingual speech synthesis model learning device according to any one of claims 1 to 4 or the cross-lingual speech synthesis device according to claim 5.