WO2012115212A1

WO2012115212A1 - Speech-synthesis system, speech-synthesis method, and speech-synthesis program

Info

Publication number: WO2012115212A1
Application number: PCT/JP2012/054482
Authority: WO
Inventors: 康行三井; 玲史近藤; 正徳加藤
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2011-02-22
Filing date: 2012-02-17
Publication date: 2012-08-30
Anticipated expiration: 2013-08-22
Also published as: JP6036681B2; JPWO2012115212A1

Abstract

Provided is a technology for generating rules that allow highly-natural speech synthesis without unnecessarily collecting large amount of learning data. This speech-synthesis system includes: a learning database that stores learning data, said learning data being a set of feature quantities extracted from speech waveform data; a feature-quantity-space partitioning means that partitions a feature-quantity space, which is a space related to the learning data, into subspaces; a density-detection means that detects the density of each of the subspaces into which the feature-quantity space has been partitioned and generates and outputs density information indicating said densities; and a rule-generation means that, on the basis of the outputted density information, generates speech-synthesis rules for generating pronunciation information used in speech synthesis.

Description

Speech synthesis system, speech synthesis method, and speech synthesis program

　本発明は、音声合成システム、音声合成方法、および音声合成プログラムに関し、特に、自然性の高い音声合成を実現する技術に関する。 The present invention relates to a speech synthesis system, a speech synthesis method, and a speech synthesis program, and more particularly, to a technique for realizing speech synthesis with high naturalness.

　近年、テキスト音声合成技術（Ｔｅｘｔ−ｔｏ−Ｓｐｅｅｃｈ：ＴＴＳ）の進歩により、人間らしさを備えた合成音声を用いたサービスや製品が数多くみられるようになってきた。一般的に、ＴＴＳは、まず形態素解析等により入力されたテキストの言語構造等を解析し（言語解析処理）、その結果を元にアクセント等が付与された音韻情報を生成する。さらに、ＴＴＳは、発音情報に基づいて基本周波数（Ｆ０）パタンや音素継続時間長を推定し、韻律情報を生成する（韻律生成処理）。最終的に、ＴＴＳは生成した韻律情報と音韻情報に基づいて波形を生成する（波形生成処理）。
　前述の韻律生成処理の方法として、非特許文献１に示されているように、Ｆ０パタンを単純なルールで表現できるようにモデル化して、そのルールを用いて韻律を生成する方法が知られている。このようにルールを用いた方法は、単純なモデルでＦ０パタンが生成できるため広く使われているが、韻律が不自然で合成音声が機械的になってしまうという問題があった。
　これに対し、近年では統計的手法を用いた音声合成方式が注目されている。その代表的な手法が、非特許文献２に記されている。非特許文献２は、統計的手法として隠れマルコフモデル（ｈｉｄｄｅｎ　Ｍａｒｋｏｖ　ｍｏｄｅｌ：ＨＭＭ）を用いたＨＭＭ音声合成を開示する。ＨＭＭ音声合成の技術は、大量の学習データを用いてモデル化した韻律モデルおよび音声合成単位（パラメータ）モデルを使って音声を生成する。ＨＭＭ音声合成の技術は、実際の人間が発声した音声を学習データとしているため、前述のＦ０生成モデルに比べて、より人間らしい韻律が生成できる。 In recent years, with the advance of text-to-speech (TTS), many services and products using human-synthesized synthesized speech have been seen. In general, the TTS first analyzes the language structure or the like of text input by morphological analysis or the like (language analysis processing), and generates phoneme information to which accents or the like are given based on the result. Further, the TTS estimates the fundamental frequency (F0) pattern and phoneme duration based on the pronunciation information, and generates prosodic information (prosodic generation processing). Finally, the TTS generates a waveform based on the generated prosodic information and phonological information (waveform generation process).
As a method of prosody generation processing described above, as shown in Non-Patent Document 1, a method of modeling F0 patterns so that they can be expressed by simple rules and generating prosody using the rules is known. Yes. The method using rules is widely used because the F0 pattern can be generated with a simple model. However, there is a problem that the synthesized speech becomes mechanical because the prosody is unnatural.
On the other hand, in recent years, a speech synthesis method using a statistical method has attracted attention. A typical technique is described in Non-Patent Document 2. Non-Patent Document 2 discloses HMM speech synthesis using a hidden Markov model (HMM) as a statistical method. The technology of HMM speech synthesis generates speech using a prosodic model and a speech synthesis unit (parameter) model modeled using a large amount of learning data. Since the technology of HMM speech synthesis uses speech uttered by an actual human as learning data, a prosody that is more human can be generated compared to the F0 generation model described above.

藤崎博也，須藤寛，「日本語単語アクセントの基本周波数パタンとその生成機構のモデル」，日本音響学会誌，２７巻，９号，ｐｐ．４４５−４５３，１９７１．Hiroya Fujisaki and Hiroshi Sudo, “Basic frequency pattern of Japanese word accent and model of its generation mechanism”, Journal of Acoustical Society of Japan, Vol.27, No.9, pp. 445-453, 1971. 徳田恵一，「隠れマルコフモデルの音声合成への応用」，電気通信学会技術研究報告，ＳＰ９９−６１，ｐｐ．４７−５４，１９９９．Keiichi Tokuda, “Application of Hidden Markov Model to Speech Synthesis”, Technical Report of IEICE, SP99-61, pp. 47-54, 1999.

　しかし、上記非特許文献に記載されるような統計的手法を用いた音声合成方式では、正しいＦ０パタンが生成されず不自然な音声になる場合がある。その理由は、統計的手法を用いた音声合成方式では、主に学習データの情報量を基準として学習データ空間を部分空間に分割（クラスタリング）するため、空間内に情報量の粗密状態が発生し、学習データが少ない疎な空間が存在するからである。
　この問題を解決する方法の１つとして、さらに大量のデータでモデル学習するという方法が考えられる。しかし、大量の学習データを収集するのは困難であり、また、どのくらいのデータ量を収集すれば十分であるかが不明であるため、現実的ではない。
　以上より、本発明の目的は、不要に大量な学習データを収集することなく、自然性の高い音声合成を可能にする規則を生成する技術を提供することである。 However, in a speech synthesis method using a statistical method as described in the above-mentioned non-patent document, a correct F0 pattern may not be generated, resulting in an unnatural speech. The reason for this is that in the speech synthesis method using the statistical method, the learning data space is divided into subspaces (clustering) mainly based on the information amount of the learning data, resulting in a dense and dense state of information amount in the space. This is because there is a sparse space with little learning data.
As one method for solving this problem, a model learning with a larger amount of data is conceivable. However, it is difficult to collect a large amount of learning data, and it is unrealistic because it is unclear how much data amount should be collected.
As described above, an object of the present invention is to provide a technique for generating a rule that enables speech synthesis with high naturalness without collecting an unnecessarily large amount of learning data.

　上記目的を達成するため、本発明の音声合成システムは、音声波形データから抽出された特徴量の集合である学習データを格納する学習用データベースと、前記学習用データベースが格納する学習データに関する空間である特徴量空間を、部分空間に分割する特徴量空間分割手段と、前記特徴量空間分割手段で分割された特徴量空間である各部分空間に対する疎密状態を検出し、該疎密状態を示す情報である疎密情報を発生して出力する疎密状態検出手段と、前記疎密状態検出手段から出力された疎密情報に基づいて、音声合成に用いる発音情報を生成するための規則である音声合成用規則を生成する規則生成手段と、を含む。
　上記目的を達成するため、本発明の音声合成方法は、音声波形データから抽出された特徴量の集合である学習データを格納し、前記格納する学習データに関する空間である特徴量空間を、部分空間に分割し、前記分割された特徴量空間である各部分空間に対する疎密状態を検出し、該疎密状態を示す情報である疎密情報を発生して出力し、前記出力された疎密情報に基づいて、音声合成に用いる発音情報を生成するための規則である音声合成用規則を生成する。
　上記目的を達成するため、本発明の記録媒体が格納するプログラムは、音声波形データから抽出された特徴量の集合である学習データを格納し、前記格納する学習データに関する空間である特徴量空間を、部分空間に分割し、前記分割された特徴量空間である各部分空間に対する疎密状態を検出し、該疎密状態を示す情報である疎密情報を発生して出力し、前記出力された疎密情報に基づいて、音声合成に用いる発音情報を生成するための規則である音声合成用規則を生成する、処理をコンピュータに実行させる。 In order to achieve the above object, a speech synthesis system according to the present invention includes a learning database that stores learning data that is a set of feature amounts extracted from speech waveform data, and a space related to the learning data that is stored in the learning database. A feature amount space dividing unit that divides a certain feature amount space into partial spaces, and a sparse state with respect to each partial space that is a feature amount space divided by the feature amount space dividing unit, and information indicating the sparse state A sparse / dense state detection unit that generates and outputs certain sparse / dense information, and generates a speech synthesis rule that is a rule for generating pronunciation information used for speech synthesis based on the sparse / dense information output from the sparse / dense state detection unit And rule generation means.
To achieve the above object, the speech synthesis method of the present invention stores learning data, which is a set of feature amounts extracted from speech waveform data, and sets a feature amount space, which is a space related to the stored learning data, as a partial space. , And detects the sparse / dense state for each partial space that is the divided feature amount space, generates and outputs the sparse / dense information that is information indicating the sparse / dense state, and based on the output sparse / dense information, A speech synthesis rule, which is a rule for generating pronunciation information used for speech synthesis, is generated.
In order to achieve the above object, a program stored in the recording medium of the present invention stores learning data that is a set of feature values extracted from speech waveform data, and a feature amount space that is a space related to the stored learning data. Divide into subspaces, detect a sparse / dense state for each subspace that is the divided feature amount space, generate and output sparse / dense information that is information indicating the sparse / dense state, and output the sparse / dense information Based on this, the computer is caused to execute processing for generating a speech synthesis rule that is a rule for generating pronunciation information used for speech synthesis.

　本発明の音声合成システム、音声合成方法、および音声合成プログラムによれば、不要に大量な学習データを収集することなく、自然性の高い音声合成が実現可能な規則を生成することができる。 According to the speech synthesis system, speech synthesis method, and speech synthesis program of the present invention, it is possible to generate a rule capable of realizing highly natural speech synthesis without collecting an unnecessarily large amount of learning data.

本発明の第１実施形態に係る音声合成システム１０００の構成例を示すブロック図である。It is a block diagram which shows the structural example of the speech synthesis system 1000 which concerns on 1st Embodiment of this invention. 本発明の第１実施形態に係る音声合成システム１０００の動作の一例を示すフローチャートである。It is a flowchart which shows an example of operation | movement of the speech synthesis system 1000 which concerns on 1st Embodiment of this invention. 本発明の第２実施形態に係る音声合成システム２０００の構成例を示すブロック図である。It is a block diagram which shows the structural example of the speech synthesis system 2000 which concerns on 2nd Embodiment of this invention. 特徴量空間分割部１において学習された結果として、二分木構造クラスタリングで作成された決定木構造の模式図である。FIG. 4 is a schematic diagram of a decision tree structure created by binary tree structure clustering as a result of learning in the feature amount space dividing unit 1; 特徴量空間分割部１による学習データのクラスタリング結果を表す、特徴量空間の概念的な模式図である。FIG. 3 is a conceptual schematic diagram of a feature amount space that represents a clustering result of learning data by a feature amount space dividing unit 1. 音声合成システム２０００における、準備段階のうち音声合成用規則を生成する動作の一例を示すフローチャートである。12 is a flowchart illustrating an example of an operation of generating a speech synthesis rule in a preparation stage in the speech synthesis system 2000. 音声合成システム２０００における、準備段階のうち韻律生成モデルを作成する動作の一例を示すフローチャートである。10 is a flowchart illustrating an example of an operation of creating a prosody generation model in the preparation stage in the speech synthesis system 2000. 音声合成システム２０００における、実際に音声合成処理を行う音声合成段階の動作の一例を示すフローチャートである。6 is a flowchart illustrating an example of an operation in a speech synthesis stage in which speech synthesis processing is actually performed in the speech synthesis system 2000. 本発明の第３実施形態に係る音声合成システム３０００の構成例を示すブロック図である。It is a block diagram which shows the structural example of the speech synthesis system 3000 which concerns on 3rd Embodiment of this invention. 音声合成システム３０００における、準備段階のうち波形生成モデルを作成する動作の一例を示すフローチャートである。10 is a flowchart illustrating an example of an operation of creating a waveform generation model in the preparation stage in the speech synthesis system 3000. 音声合成システム３０００における、実際に音声合成処理を行う音声合成段階の動作の一例を示すフローチャートである。10 is a flowchart illustrating an example of an operation in a speech synthesis stage in which speech synthesis processing is actually performed in the speech synthesis system 3000. 第２実施形態に係る音声合成システム２０００を実現するハードウェア構成の一例を示すブロック図である。It is a block diagram which shows an example of the hardware constitutions which implement | achieve the speech synthesis system 2000 which concerns on 2nd Embodiment.

　まず、本発明の実施形態の理解を容易にするために、本発明の背景を説明する。
　非特許文献２に記載されるような統計的手法を用いた技術では、正しいＦ０パタンが生成されず不自然な音声になる場合がある。
　具体的に説明すると、例えば、「人」（２モーラ）、「単語」（３モーラ）、「音声」（４モーラ）といった数モーラ程度の学習データは十分な数が存在する。ここで、モーラとは、一定の時間的長さをもった音の文節単位であり、日本語では一般に拍とも呼ばれる。そのため、統計的手法を用いた技術は、数モーラ程度の音については正しいＦ０パタンを生成することができる。しかし、例えば「アルバートアインシュタイン医科大学」（１８モーラ）のような学習データは極端に数が少ない、あるいは存在しない恐れがある。そのため、このような単語を含むテキストが入力された場合、Ｆ０パタンが乱れてしまい、アクセント位置がずれる等の問題が発生する。
　以下に説明される本発明の実施形態によれば、学習データが少ない部分空間に属する言語解析結果は生成されない、あるいは生成されにくくなる。そのため、本発明の実施形態によれば、学習データ不足を要因とした音声合成の不安定性を回避することができ、自然性の高い合成音声を生成することが可能となる。
　以下、本発明の実施形態について図面を参照して説明する。なお、各実施形態について、同様な構成要素には同じ符号を付し、適宜説明を省略する。また、以下の各実施形態では日本語の場合を例に説明するが、本願発明の適用は日本語の場合に限定されない。
　＜第１実施形態＞
　図１は、本発明の第１実施形態に係る音声合成システム１０００の構成例を示すブロック図である。図１を参照すると、本実施形態に係る音声合成システム１０００は、特徴量空間分割部１と、疎密状態検出部２と、規則生成部３と、学習用データベース４とを含む。
　学習用データベース４は、音声波形データから抽出された特徴量の集合を学習データとして格納する。学習用データベース４は、音声波形データに対応した文字列である発音情報を格納する。学習用データベース４は、時間長情報やピッチ情報等を格納していても良い。
　ここで、学習データである特徴量は、少なくとも音声波形におけるＦ０の時間変化情報であるＦ０パタンを含む。さらに、学習データである特徴量は、音声波形を高速フーリエ変換（ＦＦＴ）して求められるスペクトル情報や各音素の時間長情報であるセグメンテーション情報等を含んでも良い。
　特徴量空間分割部１は、学習用データベース４が格納する学習データに関する空間（以下、「特徴量空間」と呼ぶ。）を、部分空間に分割する。ここで特徴量空間とは、Ｎ個の所定の特徴量を軸とするＮ次元の空間である。次元の数Ｎは任意であり、例えば、スペクトル情報及びセグメンテーション情報の２つの特徴量を軸とした場合、特徴量空間は２次元の空間である。
　特徴量空間分割部１は、情報量を基準とした二分木構造クラスタリング等によって特徴量空間を部分空間に分割しても良い。特徴量空間分割部１は、部分空間に分割された学習データを疎密状態検出部２に出力する。
　疎密状態検出部２は、特徴量空間分割部１で生成された各部分空間に対する疎密状態を検出し、該疎密状態を示す情報である疎密情報を発生する。疎密状態検出部２は発生した疎密情報を規則生成部３に出力する。
　ここで疎密情報とは、学習データの情報量の粗密状態を示す情報である。疎密情報は、部分空間に属する学習データ群の特徴量ベクトルの平均値と分散値でも良い。
　規則生成部３は、疎密状態検出部２から出力された疎密情報に基づいて、音声合成用規則を生成する。
　ここで、音声合成用規則とは、音声を合成するために必要な情報である発音情報を生成するための規則である。音声合成用規則は、少なくとも言語解析情報を含む。ここで言語解析情報とは、テキストの言語解析処理に必要なデータや規則に関する情報である。言語解析情報は、例えば形態素解析のためのデータや規則に関する情報である。
　音声合成用規則は、言語解析情報の他に、アクセント位置やアクセント句境界位置などの情報である、音声合成のための付加的情報の付加の方法を示す情報を含む。
　音声合成用規則は、学習データが少ない（疎な）部分空間に属するＦ０パタンで表現されるような言語について、言語解析結果として出力されないように、辞書内のスコアを極端に低くする、又は０とするような規則でも良い。
　なお、発音情報とは、音声を合成するために必要な情報であり、発声内容を表現する音素、音節列、アクセント位置等の情報を含んでも良い。具体的には、発音情報は、テキストに対し形態素解析といった言語解析処理を行い、該言語解析処理の結果にアクセント位置やアクセント句境界位置といった音声合成のための付加的情報を付与したり、変更を加えたりする処理を行うことで生成される。
　例えば、「アルバートアインシュタイン医科大学」という単語が含まれるテキストが入力された場合を考える。この場合、上記単語に関する発音情報は、例えば日本語読みで「ａ　ｒｕ　ｂａ−　ｔｏ　ａ　ｉ　Ｎ　ｓｙｕ　ｔａ　ｉ　Ｎ　ｉ　ｋａ　ｄａ　＠　ｉ　ｇａ　ｋｕ」という文字列等である。“＠”は、アクセント位置を示している。発音情報をどのようにして生成するかを定めた規則が、上述の音声合成用規則である。
　図２は、本発明の第１実施形態に係る音声合成システム１０００の動作の一例を示すフローチャートである。
　図２に示すように、まず、特徴量空間分割部１は、学習用データベース４が格納する学習データに関する空間である特徴量空間を分割する（ステップＳ１）。
　次に、疎密状態検出部２は、特徴量空間分割部１で分割された特徴量空間の一部である各部分空間における学習データの情報量の疎密状態を検出し、該疎密状態を示す情報である疎密情報を発生する（ステップＳ２）。疎密状態検出部２は、発生した疎密情報を規則生成部３に出力する。
　次に、規則生成部３は、疎密状態検出部２から出力された疎密情報に基づいて、音声合成用規則を生成する（ステップＳ３）。
　以上のように、本実施形態に係る音声合成システム１０００によれば、学習データ不足を要因とした音声合成の不安定性を回避することができ、自然性の高い合成音声を生成することが可能となる。その理由は、音声合成システム１０００は、学習データが少ない部分空間に属する発音情報は生成されない、あるいは生成されにくくなる規則を生成するためである。
　＜第２実施形態＞
　続いて、本発明の第２実施形態について説明する。
　図３は、本発明の第２実施形態に係る音声合成システム２０００の構成例を示すブロック図である。図３を参照すると、本実施形態に係る音声合成システム２０００は、学習用データベース４と、音声合成学習装置２０と、韻律生成モデル格納部６と、言語解析用辞書７と、修正言語解析用辞書８と、音声合成装置４０とを含む。
　音声合成学習装置２０は、特徴量空間分割部１と、疎密状態検出部２と、規則生成部３と、韻律学習部５とを含む。特徴量空間分割部１及び疎密状態検出部２は、第１実施形態と同様の構成である。
　なお、本実施形態では、統計的手法としてＨＭＭを、特徴量空間の分割方法として二分木構造クラスタリングを用いるものとする。統計的手法としてＨＭＭを用いる場合は、クラスタリングと学習を交互に行う場合が一般的である。そのため、本実施形態では特徴量空間分割部１と韻律学習部５を併せてＨＭＭ学習部３０とし、明示的に分割された構成を取らないものとする。しかしながら本実施形態はあくまで発明の実施態様の一例であり、ＨＭＭ以外の統計的手法を用いる場合等の発明の構成は、この限りではない。
　図３を参照すると、音声合成装置４０は、言語解析部９と、韻律生成部１０と、波形生成部１１とを含む。
　本実施形態において、学習用データベース４には予め十分な学習データが格納されているものとする。すなわち、学習用データベース４は多量の音声波形データから抽出した特徴量を格納している。学習用データベース４は、Ｆ０パタン、セグメンテーション情報及びスペクトル情報を音声波形データの特徴量として格納しているものとする。そしてこれらの特徴量の集合が学習データとして用いられる。また、学習データは１人の話者の音声を収集したものとする。
　まず、ＨＭＭ学習部４１（特徴量空間分割部１及び韻律学習部５）において、学習用データベース４を用いた統計的手法による学習が行われる。
　ＨＭＭ学習部３０において特徴量空間分割部１は、第１実施形態と同様に学習用データベース４が格納する特徴量空間を、部分空間に分割する。具体的には、特徴量空間分割部１は、学習用データベース４が格納する特徴量空間を、二分木構造クラスタリングにより部分空間に分割する。以下では、特徴量空間分割部１によって生成された部分空間のことをクラスタとも呼ぶ。
　図４は、特徴量空間分割部１において学習された結果として、二分木構造クラスタリングで作成された決定木構造の模式図である。図４に示すように、二分木構造クラスタリングとは、学習データを、各ノードＰ１~Ｐ６に配置された質問により２つのノードに分割する処理を繰り返し、最終的に分割された各クラスタの情報量が均等になるようにクラスタリングする手法である。
　例えば図４では、特徴量空間分割部１は、現在のノードに配置された質問に基づいて「ＹＥＳ」と「ＮＯ」のいずれに該当するかを判断して、学習データを分割する。図４の例では、特徴量空間分割部１は、最初にノードＰ１に配置された質問である「当該音素が有声音」か否かに基づいて学習データを分割する。次に、例えば「ＹＥＳ」と判断されて分割された学習データを、特徴量空間分割部１は、ノードＰ２に配置された質問である「先行音素が無声音」か否かに基づいて分割する。特徴量空間分割部１は、このような分割を繰り返して所定の学習データ数に分割された段階で、その分割された学習データを一つのクラスタとする。
　図５は、特徴量空間分割部１による学習データのクラスタリング結果を表す、特徴量空間の概念的な模式図である。図５における縦軸及び横軸は所定の特徴量を示す。
　図５では、各クラスタに属する学習データ数が４つであるような場合を示している。図５には、特徴量空間分割部１によって学習データ数が４つになるまで分割された結果、各クラスタに該当する学習データのモーラ数とアクセント核の型が、どのようになっているかが示されている。ここで、アクセント核の型とは、一つのアクセント句の中で音程が大きく下がる直前の位置を示す類型である。
　なお、図５はあくまで概念を示した模式図であり、軸は２つに限定されない。特徴量空間は、例えば１０個の特徴量を軸とした１０次元の空間でも良い。
　図５に示すように、特徴量空間分割部１は、１０モーラ以上８型以上クラスタのような学習データ数が疎である空間に、大きなクラスタを生成する。このようなクラスタは非常に学習データ数が少ない疎なクラスタとなる。
　特徴量空間分割部１は、部分空間に分割した学習データを、疎密状態検出部２及び韻律学習部５に出力する。
　ＨＭＭ学習部３０は、特徴量空間の分割とともに韻律生成モデルを作成する。
　ＨＭＭ学習部３０において韻律学習部５は、特徴量空間分割部１で分割された特徴量の空間内で、韻律モデルの学習を行い、韻律生成モデルを作成する。すなわち、韻律学習部５は、特徴量空間分割部１における学習データのクラスタリング結果（例えば図４に示す二分木構造クラスタリングの結果）を用いて韻律生成モデルを作成する。
　韻律生成モデル格納部６は、韻律学習部５によって作成された韻律生成モデルを格納する。
　具体的には韻律学習部５は、クラスタ毎に学習用データベース４が格納している音声波形データに対応する発音情報に対し、どのような韻律を生成すれば良いかを統計的に学習する。韻律学習部５は、その学習の結果をモデル（韻律生成モデル）にし、各クラスタに対応させて韻律生成モデル格納部６に格納する。
　なお、学習用データベース４は時間長情報及びピッチ情報を格納しない構成とし、韻律学習部５が、入力された音声波形データから発音情報に対応する時間長情報やピッチ情報を学習する構成としても良い。
　次に、疎密状態検出部２は、特徴量空間分割部１から入力された学習データにおける各クラスタの疎密情報を抽出する。疎密情報は、例えば、アクセント句のモーラ数とアクセント核の相対位置に関する分散値でも良い。このとき、例えば図５に示す３モーラ１型クラスタにおいては、全てのデータが３モーラ１型である。そのため、分散値は０となる。
　疎密状態検出部２は、抽出した各クラスタの疎密情報を、規則生成部３に出力する。
　次に、規則生成部３は、各クラスタの疎密情報に基づいて音声合成用規則を生成する。ここでは、規則生成部３は、既存の言語解析用辞書７を修正することで音声合成用規則を生成することとする。ここで言語解析用辞書７とは、テキストの言語解析処理に必要なデータや規則である上述の言語解析情報を格納する辞書である。
　本実施形態において規則生成部３は、言語解析用辞書７を「言語解析結果として、疎なクラスタに属するアクセント句の発音情報が生成されないようにする」という方針で修正する。
　具体的には、疎密情報に対応する分散値の閾値が設定され、分散値が閾値以上であるようなクラスタに属するアクセント句の発音情報が生成されないように、規則生成部３は辞書内の該当するデータを削除する。例えば、６~８モーラ３型クラスタの分散値をσＡ、１０モーラ以上８型以上クラスタの分散値をσＢと仮定した場合、規則生成部３は、σＡ＜σＴ＜σＢを満たす分散値の閾値σＴを設定する。
　この場合、３モーラ１型クラスタは分散値が０なので、規則生成部３は、「僕は」「枕」といったような３モーラ１型のアクセント句については、辞書の修正を行わない。同様に、「核開発（６モーラ）」といったような６~８モーラ３型クラスタに属するアクセント句についても、規則生成部３は辞書の修正を行わない。
　一方、「アルバートアインシュタイン医科大学（１８モーラ１５型）」といったような１０モーラ以上８型以上クラスタに属するアクセント句については、規則生成部３は辞書内から該当のデータを削除し、言語解析結果として出力されないようにする。
　または、言語解析用辞書７が言語解析用のスコアを格納しており言語解析にスコア計算が用いられる場合、規則生成部３は、該当するデータが選択されないよう該当するデータのスコアを極端に低い値に置換することで、言語解析用辞書７を修正しても良い。また、規則生成部３は、言語解析用辞書７を修正するのではなく、音声合成エンジン内の言語解析部９やその周辺におけるアルゴリズムに変更を加えることで、音声合成用規則を生成しても良い。
　規則生成部３は、修正した言語解析用辞書７の内容である音声合成用規則を修正言語解析用辞書８に出力する。
　修正言語解析用辞書８は、規則生成部３が上記の規則に基づいて修正した言語解析用辞書７の内容である音声合成用規則を格納する。
　次に、テキストを入力して行う音声合成の動作について説明する。
　言語解析部９は音声合成の対象となるテキストが入力されると、該入力されたテキストに対し修正言語解析辞書８を用いて、形態素解析等により言語解析処理を行う。言語解析部９は、言語解析処理の結果から発音情報を生成し、該発音情報を韻律生成部１０に出力する。
　次に、韻律生成部１０は、言語解析部９から入力された発音情報に対して、韻律生成モデル格納部６が格納する韻律生成モデルを用いて韻律情報を生成する。韻律生成部１０は、発音情報と、生成した韻律情報を波形生成部１１に出力する。
　波形生成部１１は、発音情報と、韻律生成部１０が生成した韻律情報とを元に、音声波形を生成する。波形生成部１１は、生成した音声波形を合成音声として出力する。波形の生成は関連する技術に基づいて行えば良く、波形はいかなる方法で生成されても良い。波形生成部１１は、生成した音声波形を合成音声として出力する。
　次に図６及び図７を参照して、音声合成システム２０００の動作の流れを、音声合成用規則及び韻律生成モデルを生成する準備段階と、実際に音声合成処理を行う音声合成段階の２段階に分けて順に説明する。
　図６は、音声合成システム２０００における、準備段階のうち音声合成用規則を生成する動作の一例を示すフローチャートである。
　図６に示すように、ステップＳ１~Ｓ３の処理は、図２における処理と同様である。
　Ｓ３の処理の後、規則生成部３は、修正した言語解析用辞書７の内容である音声合成用規則を修正言語解析用辞書８に格納する（ステップＳ４）。
　図７は、音声合成システム２０００における、準備段階のうち韻律生成モデルを作成する動作の一例を示すフローチャートである。
　ステップＳ１の処理は、図２及び図６における処理と同様である。
　ステップＳ１の後、韻律学習部５は、特徴量空間分割部１で分割された特徴量空間内で、韻律モデルの学習を行い、韻律生成モデルを作成する（ステップＳ２Ａ）。
　次に、韻律生成モデル格納部６は、韻律学習部５によって作成された韻律生成モデルを格納する（ステップＳ３Ａ）。
　なお、上記図６及び図７で説明した準備段階の処理は、逆の順序で行われても良いし、並行して行われても良い。
　図８は、音声合成システム２０００における、実際に音声合成処理を行う音声合成段階の動作の一例を示すフローチャートである。
　図８に示すように、まず、言語解析部９は、音声合成の対象となるテキストが入力される（ステップＳ１Ｂ）。
　次に、言語解析部９は、入力されたテキストに対し、修正言語解析用辞書８が格納する音声合成用規則に従って言語解析処理を行い、発音情報を生成する（ステップＳ２Ｂ）。言語解析部９は、生成した発音情報を韻律生成部１０に出力する。
　次に、韻律生成部１０は、言語解析部９から入力された発音情報に対して、韻律生成モデル格納部６が格納する韻律生成モデルを用いて韻律情報を生成する（ステップＳ３Ｂ）。韻律生成部１０は、発音情報と韻律情報を波形生成部１１に出力する。
　次に、波形生成部１１は、韻律生成部１０から入力された発音情報と韻律情報とに基づいて、音声波形を生成し（ステップＳ４Ｂ）、該音声波形を合成音声として出力する。
　以上のように、本実施形態に係る音声合成システム２０００によれば、学習データ不足を要因としたＦ０パタンの乱れを回避することができ、自然性の高い音声合成をすることが可能となる。その理由は、同一のクラスタリング結果に基づいて韻律学習と疎密情報の抽出が行われ、規則生成部３が該疎密情報に基づいて音声合成用規則を生成することで、学習データが十分な発音情報が生成されるからである。
　また、本実施形態では、学習用データベースとして、１人の話者の音声を収集したものを想定したが、複数の話者の音声を収集したものを学習用データベースとしても良い。単独話者の学習用データベースの場合は、話者の癖といった話者性を再現できる音声合成用規則が作成できるという効果がある。複数話者の学習用データベースの場合は、汎用的な音声合成用規則が作成できるという効果がある。
　＜第３実施形態＞
　続いて、本発明の第３実施形態について説明する。
　図９は、本発明の第３実施形態に係る音声合成システム３０００の構成例を示すブロック図である。
　図９を参照すると、第３実施形態に係る音声合成システム３０００は、第２実施形態に係る音声合成学習装置２０及び音声合成装置４０に代わって、音声合成学習装置２１及び音声合成装置４１を含み、さらに波形生成モデル格納部１２を含む。また、音声合成システム３０００は、言語解析用辞書７及び修正言語解析用辞書８に代わって、音声合成用辞書１４及び修正音声合成用辞書１５を含む。
　音声合成学習装置２１は、ＨＭＭ学習部３０に代わって、学習用データベース４を用いて韻律生成モデルと波形生成モデルを生成するＨＭＭ学習部３１を含む。ＨＭＭ学習部３１は、ＨＭＭ学習部３０と同様の構成に加えて、波形学習部１２をさらに含む。
　音声合成装置４１は、波形生成部１０に代わって、波形生成モデル格納部１６を用いて波形を生成する波形生成部１７を含む。
　波形学習部１２は、特徴量空間分割部１で分割された特徴量の空間内で、波形モデルの学習を行い、波形生成モデルを作成する。
　波形生成モデルとは、学習用データベース内の波形のスペクトル特徴量をモデル化したものである。具体的には、特徴量はケプストラム等でも良い。なお、本実施形態においては波形生成のためのデータとして、ＨＭＭにより生成したモデルを用いる。しかし、本発明に適用する音声合成方式はこれに限定されず、別の音声合成方式、例えば波形接続方式を用いても構わない。なお、その場合ＨＭＭ学習部３１で学習されるのは韻律生成モデルのみである。
　波形生成モデル格納部１６は、波形学習部１２によって作成された波形生成モデルを格納する。
　規則生成部１３は、各クラスタの疎密情報に基づいて音声合成用規則を生成する。ここでは、規則生成部３は、既存の音声合成用辞書１４を修正することで音声合成用規則を生成することとする。ここで音声合成用辞書１４とは、テキストの言語解析処理に必要なデータや規則の他に、言語解析処理の結果に音声合成のための付加的情報を付与したり、変更を加えたりするための規則を格納する辞書である。
　規則生成部１３は、アクセント位置やアクセント句境界に関する規則以外についての規則を修正する。以下では具体例として、規則生成部１３が「ポーズの挿入／削除」、および「言い回しの変更」に関する規則を修正する動作を説明する。
　「ポーズの挿入／削除」に関する規則とは、音声が人間らしいものになるように、「自然な位置でポーズを挿入する」、「不自然な位置のポーズを削除する」といった規則である。具体的な規則としては、「１つの呼気段落がＮモーラ以下」、「接続詞の後はポーズを入れる」等の規則である。
　また、「言い回しの変更」に関する規則とは、言語として標準的なテキストから生成された言語解析結果を話者特有の言い回しに変更する規則である。例えば「放送」という単語は通常「ほーそー」という読みが付けられる。しかし話者によってはこれを「ほうそう」とはっきり読む場合がある。これを表す規則は、「長音を母音として読む」という規則になる。
　音声合成用辞書１４の修正は、第２実施形における言語解析用辞書７の修正と同様の方針で行われる。具体的には、分散値の閾値が設定される。そして分散値が閾値以上であるようなクラスタに属する表現が生成されないように、規則生成部１３が音声合成用辞書１４の内容について該当する規則を削除し、又は追加する。
　具体例として、「そして、放送が開始された」というテキストが入力された場合について説明する。
　学習用データベース４には、「途中でポーズを入れずに話す」、「『放送』という単語を『ほーそー』ではなく『ほうそう』と発音する」という特徴を持った話者の音声波形データが格納されているとする。この場合、学習データである特徴量空間を分割すると、「『そして』の後のポーズ」というクラスタ、及び「長音化した母音の連続」というクラスタが非常に疎か、又はクラスタとして存在しないことが想定される。
　この場合、例えば、「ポーズの挿入／削除」に関する規則の修正として、規則生成部１３は、音声合成用辞書１４が格納する規則のうち「接続詞の後はポーズを入れる」という規則を削除する。あるいは規則生成部１３は、音声合成用辞書１４が格納する規則に「『そして』の後はポーズを入れない」という規則を追加する。
　また、「言い回しの変更」に関する規則の修正として、規則生成部１３は、通常「ほーそー」と発音される「放送」というテキストについて、「ほうそう」という発音がされるように、規則生成部１３は、「長音を母音に変更する」という規則を追加する。
　修正音声合成用辞書１５は、規則生成部１３が生成した音声合成用規則を格納する。ここで、規則生成部１３が生成した音声合成用規則とは、既存の音声合成用辞書１４が格納する規則について、上記のようにして規則生成部１３が修正した後の規則のことである。
　次に図を参照して、音声合成システム３０００の動作の流れを、音声合成用規則、韻律生成モデル及び波形生成モデルを作成する準備段階と、実際に音声合成処理を行う音声合成段階の２段階に分けて順に説明する。
　まず、準備段階のうち、音声合成用規則及び韻律生成モデルを作成する動作については、生成する音声合成用規則が異なる点を除いて、第２実施形態における図６及び図７に示した動作と同様である。
　図１０は、音声合成システム３０００における、準備段階のうち波形生成モデルを作成する動作の一例を示すフローチャートである。
　ステップＳ１の処理は、図２、図６及び図７における処理と同様である。
　ステップＳ１の後、波形学習部１２は、特徴量空間分割部１で分割された特徴量空間内で、波形モデルの学習を行い、波形生成モデルを作成する（ステップＳ２Ｃ）。
　次に、波形生成モデル格納部１６は、波形学習部１２によって作成された波形生成モデルを格納する（ステップＳ３Ｃ）。
　なお、準備段階における音声合成用規則、韻律生成モデル及び波形生成モデルを作成する処理は、いかなる順序で行われても良いし、並行して行われても良い。
　図１１は、音声合成システム３０００における、実際に音声合成処理を行う音声合成段階の動作の一例を示すフローチャートである。
　図１１に示すように、ステップＳ１Ｂの処理は、図８における処理と同様である。
　ステップＳ１Ｂの後、言語解析部９は、入力されたテキストに対し、修正音声合成用辞書１５が格納する音声合成用規則に従って言語解析処理を行い、発音情報を生成する。言語解析部９は、発音情報を生成する際、該発音情報に修正音声合成用辞書１５が格納する規則に従って、例えば「長音を母音に変更する」といった付加的情報を付与する（ステップＳ２Ｄ）。言語解析部９は、付加的情報を付与された発音情報を韻律生成部１０に出力する。
　ステップＳ３Ｂの処理は、図８における処理と同様である。
　次に、波形生成部１７は、韻律生成部１０から入力された発音情報と韻律情報とに基づいて、波形生成モデル格納部１６が格納する波形生成モデルを用いて音声波形を生成する（ステップＳ４Ｄ）。波形生成部１７は、該音声波形を合成音声として出力する。
　以上のように、本実施形態に係る音声合成システム３０００によれば、修正音声合成用辞書１５が修正された付加的情報を発音情報に付与するため、話者ごとの癖といった特徴を忠実に再現できる。また、本実施形態によれば、波形学習と、発音情報の修正に用いる疎密情報の抽出に、同一のクラスタリング結果を用いることにより、疎であるクラスタに属する波形生成モデルで波形を生成した場合、その部分だけ音質が劣化すると言った問題が回避できる。
　なお、波形生成にＨＭＭを用いない波形接続方式等においても、学習データが疎であるクラスタに属するデータは、対応する単位波形のデータ量も不足している。そのため、本実施形態によれば、波形接続方式等を用いた場合も、疎なクラスタに属するデータを使用しないという点で音質劣化を回避することができるという効果が得られる。
　以上、各実施形態を参照して本発明を説明したが、本発明は以上の実施形態に限定されるものではない。本発明の構成や詳細には、本発明のスコープ内で同業者が理解し得る様々な変更をすることができる。例えば、各実施形態に係る音声合成システムは、抽出した疎密情報を図示しないデータベースに格納しておき、対応表等を参照した適宜利用するようにしても良い。
　図１２は、第２実施形態に係る音声合成システム２０００を実現するハードウェア構成の一例を示すブロック図である。なお、ここでは第２実施形態を例にとって説明するが、他の実施形態に係る音声合成システムも同様のハードウェア構成によって実現されても良い。
　図１２に示すように、音声合成システム２０００を構成する各部は、ＣＰＵ（Ｃｅｎｔｒａｌ　Ｐｒｏｃｅｓｓｉｎｇ　Ｕｎｉｔ）１００と、ネットワーク接続用の通信ＩＦ（インターフェース）２００と、メモリ３００と、プログラムを格納するハードディスク等の記憶装置４００と、入力装置５００と、出力装置６００とを含む、コンピュータ装置によって実現される。ただし、音声合成システム２０００の構成は、図１２に示すコンピュータ装置に限定されない。
　ＣＰＵ１００は、オペレーティングシステムを動作させて音声合成システム２０００の全体を制御する。また、ＣＰＵ１００は、例えばドライブ装置などに装着された記録媒体からメモリ３００にプログラムやデータを読み出し、これにしたがって各種の処理を実行する。
　記録装置４００は、例えば光ディスク、フレキシブルディスク、磁気光ディスク、外付けハードディスク、半導体メモリ等であって、コンピュータプログラムをコンピュータ読み取り可能に記録する。記憶装置４００は、例えば、学習用データベース４や韻律生成モデル格納部６等でも良い。また、コンピュータプログラムは、通信網に接続されている図示しない外部コンピュータからダウンロードされても良い。
　入力装置５００は、例えば音声学習装置４０において、ユーザから入力テキストを受け付ける。出力装置６００は、最終的に生成した合成音声を出力する。
　なお、これまでに説明した各実施形態において利用するブロック図は、ハードウェア単位の構成ではなく、機能単位のブロックを示している。また、音声合成システム２０００の構成部の実現手段は特に限定されない。すなわち、音声合成システム２０００は、物理的に結合した一つの装置により実現されても良いし、物理的に分離した二つ以上の装置を有線又は無線で接続し、これら複数の装置により実現されても良い。その場合物理的に分離した二つの装置をそれぞれ音声合成学習装置２０及び音声合成装置４０としても良い。
　本発明のプログラムは、上記の各実施形態で説明した各動作を、コンピュータに実行させるプログラムであれば良い。
　上記の各実施の形態においては、以下に示すような音声合成システム、音声合成方法、および音声合成プログラムの特徴的構成が示されている。
（付記１）
　音声波形データから抽出された特徴量の集合である学習データを格納する学習用データベースと、
　前記学習用データベースが格納する学習データに関する空間である特徴量空間を、部分空間に分割する特徴量空間分割手段と、
　前記特徴量空間分割手段で分割された特徴量空間である各部分空間に対する疎密状態を検出し、該疎密状態を示す情報である疎密情報を発生して出力する疎密状態検出手段と、
　前記疎密状態検出手段から出力された疎密情報に基づいて、音声合成に用いる発音情報を生成するための規則である音声合成用規則を生成する規則生成手段と、
　を含む音声合成システム。
（付記２）
　前記特徴量空間分割手段で分割された特徴量空間である部分空間内で、韻律モデルの学習を行い、韻律生成モデルを作成する韻律学習手段と、
　前記韻律学習手段によって作成された韻律生成モデルを格納する韻律生成モデル格納手段と、
　前記規則生成手段が生成した音声合成用規則に従って生成された発音情報に対して、前記韻律生成モデル格納手段が格納する韻律生成モデルを用いて韻律情報を生成する韻律生成手段と、
　をさらに含む付記１に記載の音声合成システム。
（付記３）
　テキストの言語解析処理に必要な規則を格納する辞書をさらに含み、
　前記規則生成手段は、前記辞書が格納する規則の修正を行うことで音声合成用規則を生成する、
　付記１又は２に記載の音声合成システム。
（付記４）
　前記規則生成手段が生成した修正後の規則を音声合成用規則として格納する修正辞書と、
　テキストの入力を受けて、該テキストから前記修正辞書が格納する音声合成用規則に基づいて発音情報を生成し、該発音情報を前記韻律生成手段に出力する言語解析手段と、
　をさらに含む付記３に記載の音声合成システム。
（付記５）
　前記規則生成手段は、前記疎密情報に基づいて疎な部分空間に属すると判断されたアクセント句のデータを削除することで音声合成用規則を修正する、
　付記４に記載の音声合成システム。
（付記６）
　前記規則生成手段は、ポーズ挿入位置又は入力テキストの言い回し等に関する音声合成用規則を修正する、
　付記３~５のいずれかに記載の音声合成システム。
（付記７）
　前記特徴量空間分割手段は、情報量を基準とした二分木構造クラスタリングによって特徴量空間を部分空間に分割する、
　付記１~６のいずれかに記載の音声合成システム。
（付記８）
　前記韻律学習手段は、前記韻律モデルの学習をＨＭＭ学習により行う、
　付記２~７のいずれかに記載の音声合成システム。
（付記９）
　前記特徴量空間分割手段で分割された特徴量空間である部分空間内で、波形モデルの学習を行い、波形生成モデルを作成する波形学習手段と、
　前記波形学習手段によって作成された波形生成モデルを格納する波形生成モデル格納手段と、
　前記韻律生成手段が生成した韻律情報から、前記波形生成モデル格納手段が格納する波形生成モデルを用いて音声波形を生成し、生成した音声波形を合成音声として出力する波形生成手段と、
　をさらに含む付記１~８のいずれかに記載の音声合成システム。
（付記１０）
　音声波形データから抽出された特徴量の集合である学習データを格納し、
　前記格納する学習データに関する空間である特徴量空間を、部分空間に分割し、
　前記分割された特徴量空間である各部分空間に対する疎密状態を検出し、該疎密状態を示す情報である疎密情報を発生して出力し、
　前記出力された疎密情報に基づいて、音声合成に用いる発音情報を生成するための規則である音声合成用規則を生成する、
　音声合成方法。
（付記１１）
　音声波形データから抽出された特徴量の集合である学習データを格納し、
　前記格納する学習データに関する空間である特徴量空間を、部分空間に分割し、
　前記分割された特徴量空間である各部分空間に対する疎密状態を検出し、該疎密状態を示す情報である疎密情報を発生して出力し、
　前記出力された疎密情報に基づいて、音声合成に用いる発音情報を生成するための規則である音声合成用規則を生成する、
　処理をコンピュータに実行させるプログラムを格納する記録媒体。
　以上、実施形態を参照して本願発明を説明したが、本願発明は以上の実施形態に限定されるものではない。本願発明の構成や詳細には、本願発明のスコープ内で同業者が理解し得る様々な変更をすることができる。
　この出願は、２０１１年２月２２日に出願された日本出願特願２０１１−０３５５４３を基礎とする優先権を主張し、その開示の全てをここに取り込む。 First, in order to facilitate understanding of the embodiments of the present invention, the background of the present invention will be described.
In a technique using a statistical method as described in Non-Patent Document 2, a correct F0 pattern may not be generated, resulting in an unnatural voice.
More specifically, for example, there is a sufficient number of learning data of about several mora such as “person” (2 mora), “word” (3 mora), and “voice” (4 mora). Here, mora is a phrase unit of sound having a certain length of time, and is generally called a beat in Japanese. Therefore, a technique using a statistical method can generate a correct F0 pattern for a sound of several mora. However, there is a possibility that learning data such as “Albert Einstein Medical University” (18 mora) is extremely small or does not exist. For this reason, when a text including such a word is input, the F0 pattern is disturbed, causing a problem such as a shift of the accent position.
According to the embodiment of the present invention described below, a language analysis result belonging to a partial space with a small amount of learning data is not generated or is hardly generated. Therefore, according to the embodiment of the present invention, it is possible to avoid instability of speech synthesis due to lack of learning data, and it is possible to generate synthesized speech with high naturalness.
Embodiments of the present invention will be described below with reference to the drawings. In addition, about each embodiment, the same code | symbol is attached | subjected to the same component and description is abbreviate | omitted suitably. In the following embodiments, the case of Japanese is described as an example, but the application of the present invention is not limited to the case of Japanese.
<First Embodiment>
FIG. 1 is a block diagram showing a configuration example of a speech synthesis system 1000 according to the first embodiment of the present invention. Referring to FIG. 1, a speech synthesis system 1000 according to the present embodiment includes a feature amount space division unit 1, a sparse / dense state detection unit 2, a rule generation unit 3, and a learning database 4.
The learning database 4 stores a set of feature amounts extracted from speech waveform data as learning data. The learning database 4 stores pronunciation information that is a character string corresponding to the speech waveform data. The learning database 4 may store time length information, pitch information, and the like.
Here, the feature amount that is the learning data includes at least an F0 pattern that is time change information of F0 in the speech waveform. Furthermore, the feature amount that is learning data may include spectrum information obtained by fast Fourier transform (FFT) of a speech waveform, segmentation information that is time length information of each phoneme, and the like.
The feature amount space dividing unit 1 divides a space related to learning data stored in the learning database 4 (hereinafter referred to as “feature amount space”) into partial spaces. Here, the feature amount space is an N-dimensional space with N predetermined feature amounts as axes. The number N of dimensions is arbitrary. For example, when two feature amounts of spectrum information and segmentation information are used as axes, the feature amount space is a two-dimensional space.
The feature amount space dividing unit 1 may divide the feature amount space into partial spaces by binary tree structure clustering based on the information amount. The feature amount space dividing unit 1 outputs the learning data divided into partial spaces to the sparse / dense state detecting unit 2.
The sparse / dense state detecting unit 2 detects a sparse / dense state for each partial space generated by the feature amount space dividing unit 1 and generates sparse / dense information which is information indicating the sparse / dense state. The sparse / dense state detection unit 2 outputs the generated sparse / dense information to the rule generation unit 3.
Here, the density information is information indicating a density state of the amount of information of learning data. The density information may be an average value and a variance value of feature quantity vectors of learning data groups belonging to the partial space.
The rule generation unit 3 generates a speech synthesis rule based on the density information output from the density state detection unit 2.
Here, the speech synthesis rules are rules for generating pronunciation information, which is information necessary for synthesizing speech. The speech synthesis rules include at least language analysis information. Here, the language analysis information is information related to data and rules necessary for text language analysis processing. The language analysis information is information on data and rules for morphological analysis, for example.
The speech synthesis rules include information indicating a method for adding additional information for speech synthesis, which is information such as accent positions and accent phrase boundary positions, in addition to language analysis information.
The speech synthesis rule makes the score in the dictionary extremely low so that it is not output as a language analysis result for a language that is expressed by the F0 pattern belonging to a (sparse) subspace with little learning data, or 0 A rule such as
Note that the pronunciation information is information necessary for synthesizing speech, and may include information such as phonemes, syllable strings, and accent positions that express the utterance content. Specifically, the pronunciation information is subjected to language analysis processing such as morphological analysis on the text, and additional information for speech synthesis such as accent position and accent phrase boundary position is added to the result of the language analysis processing or changed. It is generated by performing processing such as adding.
For example, consider a case where a text including the word “Albert Einstein Medical University” is input. In this case, the pronunciation information related to the word is, for example, a character string “a ru ba-to ai N syu ta i N i kada @ i gaku” in Japanese. “@” Indicates an accent position. The rule that defines how the pronunciation information is generated is the above-described speech synthesis rule.
FIG. 2 is a flowchart showing an example of the operation of the speech synthesis system 1000 according to the first embodiment of the present invention.
As shown in FIG. 2, the feature amount space dividing unit 1 first divides a feature amount space that is a space related to learning data stored in the learning database 4 (step S1).
Next, the sparse / dense state detecting unit 2 detects the sparse / dense state of the information amount of the learning data in each partial space that is a part of the feature amount space divided by the feature amount space dividing unit 1, and indicates the sparse / dense state. The density information is generated (step S2). The sparse / dense state detection unit 2 outputs the generated sparse / dense information to the rule generation unit 3.
Next, the rule generation unit 3 generates a speech synthesis rule based on the density information output from the density state detection unit 2 (step S3).
As described above, according to the speech synthesis system 1000 according to the present embodiment, it is possible to avoid the instability of speech synthesis due to lack of learning data and to generate highly natural synthesized speech. Become. The reason is that the speech synthesis system 1000 generates a rule that does not generate or is difficult to generate pronunciation information belonging to a partial space with less learning data.
Second Embodiment
Subsequently, a second embodiment of the present invention will be described.
FIG. 3 is a block diagram showing a configuration example of the speech synthesis system 2000 according to the second embodiment of the present invention. Referring to FIG. 3, the speech synthesis system 2000 according to the present embodiment includes a learning database 4, a speech synthesis learning device 20, a prosody generation model storage unit 6, a language analysis dictionary 7, and a modified language analysis dictionary. 8 and the speech synthesizer 40.
The speech synthesis learning device 20 includes a feature amount space dividing unit 1, a sparse / dense state detecting unit 2, a rule generating unit 3, and a prosody learning unit 5. The feature amount space dividing unit 1 and the sparse / dense state detecting unit 2 have the same configuration as in the first embodiment.
In the present embodiment, HMM is used as a statistical method, and binary tree clustering is used as a feature space dividing method. When an HMM is used as a statistical method, clustering and learning are generally performed alternately. For this reason, in the present embodiment, the feature space division unit 1 and the prosody learning unit 5 are combined into the HMM learning unit 30 and do not take an explicitly divided configuration. However, this embodiment is merely an example of an embodiment of the invention, and the configuration of the invention is not limited to this when a statistical method other than the HMM is used.
Referring to FIG. 3, the speech synthesizer 40 includes a language analysis unit 9, a prosody generation unit 10, and a waveform generation unit 11.
In the present embodiment, it is assumed that sufficient learning data is stored in the learning database 4 in advance. That is, the learning database 4 stores feature amounts extracted from a large amount of speech waveform data. It is assumed that the learning database 4 stores F0 patterns, segmentation information, and spectrum information as feature values of speech waveform data. A set of these feature amounts is used as learning data. Further, it is assumed that the learning data is collected from the voice of one speaker.
First, learning by a statistical method using the learning database 4 is performed in the HMM learning unit 41 (the feature amount space dividing unit 1 and the prosody learning unit 5).
In the HMM learning unit 30, the feature amount space dividing unit 1 divides the feature amount space stored in the learning database 4 into partial spaces, as in the first embodiment. Specifically, the feature amount space dividing unit 1 divides the feature amount space stored in the learning database 4 into partial spaces by binary tree structure clustering. Hereinafter, the partial space generated by the feature amount space dividing unit 1 is also referred to as a cluster.
FIG. 4 is a schematic diagram of a decision tree structure created by binary tree structure clustering as a result of learning in the feature amount space dividing unit 1. As shown in FIG. 4, the binary tree clustering is a process of dividing the learning data into two nodes according to the questions arranged in the nodes P1 to P6, and finally the information amount of each divided cluster This is a method of clustering so that is uniform.
For example, in FIG. 4, the feature amount space division unit 1 determines whether “YES” or “NO” corresponds to the question arranged at the current node, and divides the learning data. In the example of FIG. 4, the feature amount space dividing unit 1 divides the learning data based on whether or not “the phoneme is a voiced sound”, which is a question initially placed at the node P1. Next, for example, the feature amount space dividing unit 1 divides the learning data divided by being determined as “YES” based on whether or not “the preceding phoneme is an unvoiced sound” which is a question arranged in the node P2. The feature amount space dividing unit 1 repeats such division and divides the learning data into one cluster at a stage where the division is made into a predetermined number of learning data.
FIG. 5 is a conceptual schematic diagram of the feature amount space showing the clustering result of the learning data by the feature amount space dividing unit 1. The vertical and horizontal axes in FIG. 5 indicate predetermined feature amounts.
FIG. 5 shows a case where the number of learning data belonging to each cluster is four. FIG. 5 shows how the number of mora of the learning data corresponding to each cluster and the type of the accent kernel are as a result of the division by the feature amount space dividing unit 1 until the number of learning data becomes four. It is shown. Here, the type of the accent kernel is a type indicating the position immediately before the pitch is greatly lowered in one accent phrase.
Note that FIG. 5 is a schematic diagram illustrating the concept to the last, and the number of axes is not limited to two. The feature amount space may be a 10-dimensional space with ten feature amounts as axes, for example.
As shown in FIG. 5, the feature amount space dividing unit 1 generates a large cluster in a space where the number of learning data is sparse, such as a cluster of 10 mora or more and an 8-type or more cluster. Such a cluster is a sparse cluster with a very small number of learning data.
The feature amount space dividing unit 1 outputs the learning data divided into the partial spaces to the sparse / dense state detecting unit 2 and the prosody learning unit 5.
The HMM learning unit 30 creates a prosody generation model together with the division of the feature amount space.
In the HMM learning unit 30, the prosody learning unit 5 learns the prosody model in the feature amount space divided by the feature amount space dividing unit 1, and creates a prosody generation model. That is, the prosody learning unit 5 creates a prosody generation model using the clustering result of the learning data (for example, the result of the binary tree clustering shown in FIG. 4) in the feature amount space dividing unit 1.
The prosody generation model storage unit 6 stores the prosody generation model created by the prosody learning unit 5.
Specifically, the prosody learning unit 5 statistically learns what prosody should be generated for the pronunciation information corresponding to the speech waveform data stored in the learning database 4 for each cluster. The prosody learning unit 5 uses the learning result as a model (prosody generation model) and stores it in the prosody generation model storage unit 6 in association with each cluster.
The learning database 4 may be configured not to store time length information and pitch information, and the prosody learning unit 5 may be configured to learn time length information and pitch information corresponding to pronunciation information from input speech waveform data. .
Next, the sparse / dense state detecting unit 2 extracts the sparse / dense information of each cluster in the learning data input from the feature amount space dividing unit 1. The density information may be, for example, a variance value regarding the number of mora of the accent phrase and the relative position of the accent kernel. At this time, for example, in the 3 mora 1 type cluster shown in FIG. 5, all data is 3 mora 1 type. Therefore, the variance value is 0.
The sparse / dense state detection unit 2 outputs the extracted sparse / dense information of each cluster to the rule generation unit 3.
Next, the rule generation unit 3 generates a speech synthesis rule based on the density information of each cluster. Here, it is assumed that the rule generation unit 3 generates a speech synthesis rule by correcting the existing language analysis dictionary 7. Here, the language analysis dictionary 7 is a dictionary that stores the above-described language analysis information that is data and rules necessary for text language analysis processing.
In the present embodiment, the rule generation unit 3 corrects the language analysis dictionary 7 with a policy of “no generation of accent phrase pronunciation information belonging to a sparse cluster as a language analysis result”.
Specifically, the rule generation unit 3 sets the threshold value of the variance value corresponding to the density information, and the rule generation unit 3 does not generate the pronunciation information of the accent phrase belonging to the cluster whose variance value is equal to or greater than the threshold value. Delete data. For example, assuming that the variance value of a 6-8 mora type 3 cluster is σA, and the variance value of a 10 mora or more type 8 cluster is σB, the rule generation unit 3 sets the threshold value σT of the variance value satisfying σA <σT <σB Set.
In this case, since the variance value of the 3 mora type 1 cluster is 0, the rule generation unit 3 does not correct the dictionary for the 3 mora type 1 accent phrase such as “I am” or “pillow”. Similarly, the rule generation unit 3 does not modify the dictionary for accent phrases belonging to 6-8 mora type 3 clusters such as “nuclear development (6 mora)”.
On the other hand, for an accent phrase belonging to a cluster of 10 mora or more and 8 types or more, such as “Albert Einstein Medical University (18 mora type 15)”, the rule generation unit 3 deletes the corresponding data from the dictionary and obtains the language analysis result Prevent output.
Or, when the language analysis dictionary 7 stores a score for language analysis, and the score calculation is used for language analysis, the rule generation unit 3 extremely reduces the score of the corresponding data so that the corresponding data is not selected. The language analysis dictionary 7 may be modified by replacing it with a value. Further, the rule generation unit 3 does not modify the language analysis dictionary 7 but can generate a speech synthesis rule by changing the language analysis unit 9 in the speech synthesis engine or an algorithm in the vicinity thereof. good.
The rule generation unit 3 outputs the speech synthesis rules that are the contents of the corrected language analysis dictionary 7 to the corrected language analysis dictionary 8.
The corrected language analysis dictionary 8 stores a speech synthesis rule that is the content of the language analysis dictionary 7 corrected by the rule generation unit 3 based on the above rules.
Next, a speech synthesis operation performed by inputting text will be described.
When a text to be subjected to speech synthesis is input, the language analysis unit 9 performs a language analysis process by morphological analysis or the like using the corrected language analysis dictionary 8 for the input text. The language analysis unit 9 generates pronunciation information from the result of the language analysis process, and outputs the pronunciation information to the prosody generation unit 10.
Next, the prosody generation unit 10 generates prosody information for the pronunciation information input from the language analysis unit 9 using the prosody generation model stored in the prosody generation model storage unit 6. The prosody generation unit 10 outputs the pronunciation information and the generated prosody information to the waveform generation unit 11.
The waveform generation unit 11 generates a speech waveform based on the pronunciation information and the prosody information generated by the prosody generation unit 10. The waveform generation unit 11 outputs the generated speech waveform as synthesized speech. The waveform may be generated based on a related technique, and the waveform may be generated by any method. The waveform generation unit 11 outputs the generated speech waveform as synthesized speech.
Next, referring to FIG. 6 and FIG. 7, the operation flow of the speech synthesis system 2000 is divided into two stages: a preparation stage for generating speech synthesis rules and prosody generation models, and a speech synthesis stage for actually performing speech synthesis processing. These will be described in order.
FIG. 6 is a flowchart illustrating an example of an operation of generating a speech synthesis rule in the preparation stage in the speech synthesis system 2000.
As shown in FIG. 6, the processing in steps S1 to S3 is the same as the processing in FIG.
After the process of S3, the rule generation unit 3 stores the speech synthesis rules, which are the contents of the corrected language analysis dictionary 7, in the corrected language analysis dictionary 8 (step S4).
FIG. 7 is a flowchart illustrating an example of an operation for creating a prosody generation model in the preparation stage in the speech synthesis system 2000.
The processing in step S1 is the same as the processing in FIGS.
After step S1, the prosody learning unit 5 learns the prosody model in the feature amount space divided by the feature amount space dividing unit 1, and creates a prosody generation model (step S2A).
Next, the prosody generation model storage unit 6 stores the prosody generation model created by the prosody learning unit 5 (step S3A).
Note that the processing in the preparation stage described in FIGS. 6 and 7 may be performed in the reverse order or in parallel.
FIG. 8 is a flowchart illustrating an example of the operation of the speech synthesis stage in which speech synthesis processing is actually performed in the speech synthesis system 2000.
As shown in FIG. 8, first, the language analysis unit 9 receives text to be speech synthesized (step S1B).
Next, the language analysis unit 9 performs language analysis processing on the input text according to the speech synthesis rules stored in the corrected language analysis dictionary 8 to generate pronunciation information (step S2B). The language analysis unit 9 outputs the generated pronunciation information to the prosody generation unit 10.
Next, the prosody generation unit 10 generates prosody information for the pronunciation information input from the language analysis unit 9 using the prosody generation model stored in the prosody generation model storage unit 6 (step S3B). The prosody generation unit 10 outputs the pronunciation information and the prosody information to the waveform generation unit 11.
Next, the waveform generation unit 11 generates a speech waveform based on the pronunciation information and the prosody information input from the prosody generation unit 10 (step S4B), and outputs the speech waveform as synthesized speech.
As described above, according to the speech synthesis system 2000 according to the present embodiment, disturbance of the F0 pattern due to insufficient learning data can be avoided, and speech synthesis with high naturalness can be performed. The reason is that prosody learning and extraction of sparse / dense information are performed based on the same clustering result, and the rule generation unit 3 generates a speech synthesis rule based on the sparse / dense information, so that the pronunciation data has sufficient pronunciation information. Is generated.
In this embodiment, the learning database is assumed to have collected the voice of one speaker. However, the learning database may be a collection of voices of a plurality of speakers. In the case of a database for learning a single speaker, there is an effect that it is possible to create a speech synthesis rule that can reproduce speaker characteristics such as a speaker's habit. In the case of a multi-speaker learning database, a general-purpose speech synthesis rule can be created.
<Third Embodiment>
Subsequently, a third embodiment of the present invention will be described.
FIG. 9 is a block diagram illustrating a configuration example of a speech synthesis system 3000 according to the third embodiment of the present invention.
Referring to FIG. 9, a speech synthesis system 3000 according to the third embodiment includes a speech synthesis learning device 21 and a speech synthesis device 41 in place of the speech synthesis learning device 20 and the speech synthesis device 40 according to the second embodiment. Further, a waveform generation model storage unit 12 is included. The speech synthesis system 3000 includes a speech synthesis dictionary 14 and a modified speech synthesis dictionary 15 in place of the language analysis dictionary 7 and the modified language analysis dictionary 8.
The speech synthesis learning device 21 includes an HMM learning unit 31 that generates a prosody generation model and a waveform generation model using the learning database 4 instead of the HMM learning unit 30. The HMM learning unit 31 further includes a waveform learning unit 12 in addition to the same configuration as the HMM learning unit 30.
The speech synthesizer 41 includes a waveform generation unit 17 that generates a waveform using the waveform generation model storage unit 16 instead of the waveform generation unit 10.
The waveform learning unit 12 learns the waveform model in the feature amount space divided by the feature amount space dividing unit 1 and creates a waveform generation model.
The waveform generation model is obtained by modeling the spectral feature amount of the waveform in the learning database. Specifically, the feature amount may be a cepstrum or the like. In the present embodiment, a model generated by the HMM is used as data for waveform generation. However, the speech synthesis method applied to the present invention is not limited to this, and another speech synthesis method, for example, a waveform connection method may be used. In this case, only the prosody generation model is learned by the HMM learning unit 31.
The waveform generation model storage unit 16 stores the waveform generation model created by the waveform learning unit 12.
The rule generation unit 13 generates a speech synthesis rule based on the density information of each cluster. Here, it is assumed that the rule generation unit 3 generates a speech synthesis rule by modifying the existing speech synthesis dictionary 14. Here, the speech synthesis dictionary 14 is used to add or change additional information for speech synthesis to the result of language analysis processing in addition to data and rules necessary for text language analysis processing. It is a dictionary that stores the rules.
The rule generation unit 13 corrects rules other than those relating to accent positions and accent phrase boundaries. Hereinafter, as a specific example, an operation in which the rule generation unit 13 corrects rules relating to “pause insertion / deletion” and “phrase change” will be described.
The rules relating to “insertion / deletion of poses” are rules such as “insert a pose at a natural position” and “delete a pose at an unnatural position” so that the sound becomes human-like. Specific rules include “one expiratory paragraph is N mora or less”, “pause after conjunction”, and the like.
Further, the rule relating to “change of wording” is a rule for changing a language analysis result generated from a standard text as a language to a wording unique to a speaker. For example, the word “broadcast” is usually read “hosoo”. However, some speakers may clearly read this as “Housou”. The rule representing this is a rule of “reading a long sound as a vowel”.
The correction of the speech synthesis dictionary 14 is performed according to the same policy as the correction of the language analysis dictionary 7 in the second embodiment. Specifically, a threshold value of the dispersion value is set. Then, the rule generation unit 13 deletes or adds a rule corresponding to the contents of the speech synthesis dictionary 14 so that an expression belonging to a cluster whose variance value is equal to or greater than a threshold value is not generated.
As a specific example, a case where a text “and broadcasting has started” is input will be described.
In the database 4 for learning, the voice of a speaker having the characteristics of “speak without pause” or “pronounce the word“ broadcast ”as“ hoso ”instead of“ hoso ”” ” Assume that waveform data is stored. In this case, when the feature space that is the learning data is divided, the cluster “pause after“ and ”and the cluster“ continuous vowels ”are very sparse or do not exist as clusters. Is done.
In this case, for example, as a modification of the rule relating to “pause insertion / deletion”, the rule generation unit 13 deletes the rule “insert a pose after a conjunction” from the rules stored in the speech synthesis dictionary 14. Alternatively, the rule generation unit 13 adds a rule “no pause after“ and ”” to the rule stored in the speech synthesis dictionary 14.
Further, as a modification of the rule regarding “change of wording”, the rule generation unit 13 is configured so that the text “broadcast” normally pronounced “hoso” is pronounced “hoso”. The generation unit 13 adds a rule “change a long sound to a vowel”.
The modified speech synthesis dictionary 15 stores the speech synthesis rules generated by the rule generation unit 13. Here, the speech synthesis rule generated by the rule generation unit 13 is a rule after the rule generation unit 13 corrects the rule stored in the existing speech synthesis dictionary 14 as described above.
Next, referring to the figure, the operation flow of the speech synthesis system 3000 is divided into two stages: a preparation stage for creating a speech synthesis rule, a prosody generation model and a waveform generation model, and a speech synthesis stage for actually performing speech synthesis processing These will be described in order.
First, in the preparation stage, the operation for creating the speech synthesis rules and the prosody generation model is the same as the operations shown in FIGS. 6 and 7 in the second embodiment, except that the speech synthesis rules to be generated are different. It is the same.
FIG. 10 is a flowchart illustrating an example of an operation of creating a waveform generation model in the preparation stage in the speech synthesis system 3000.
The processing in step S1 is the same as the processing in FIGS.
After step S1, the waveform learning unit 12 learns the waveform model in the feature amount space divided by the feature amount space dividing unit 1, and creates a waveform generation model (step S2C).
Next, the waveform generation model storage unit 16 stores the waveform generation model created by the waveform learning unit 12 (step S3C).
Note that the processing for creating the speech synthesis rules, the prosody generation model, and the waveform generation model in the preparation stage may be performed in any order or may be performed in parallel.
FIG. 11 is a flowchart illustrating an example of an operation in a speech synthesis stage in which speech synthesis processing is actually performed in the speech synthesis system 3000.
As shown in FIG. 11, the process in step S1B is the same as the process in FIG.
After step S1B, the language analysis unit 9 performs language analysis processing on the input text in accordance with the speech synthesis rules stored in the modified speech synthesis dictionary 15 to generate pronunciation information. When generating the pronunciation information, the language analysis unit 9 gives additional information such as “changes the long sound to a vowel” according to the rules stored in the corrected speech synthesis dictionary 15 in the pronunciation information (step S2D). The language analysis unit 9 outputs the pronunciation information to which the additional information is given to the prosody generation unit 10.
The process of step S3B is the same as the process in FIG.
Next, the waveform generation unit 17 generates a speech waveform using the waveform generation model stored in the waveform generation model storage unit 16 based on the pronunciation information and prosodic information input from the prosody generation unit 10 (step S4D). ). The waveform generation unit 17 outputs the speech waveform as synthesized speech.
As described above, according to the speech synthesis system 3000 according to the present embodiment, the modified speech synthesis dictionary 15 adds the corrected additional information to the pronunciation information, and thus faithfully reproduces a feature such as a song for each speaker. it can. Further, according to the present embodiment, when a waveform is generated with a waveform generation model belonging to a sparse cluster by using the same clustering result for waveform learning and extraction of sparse and dense information used for correcting pronunciation information, The problem that the sound quality deteriorates only in that part can be avoided.
Note that, even in a waveform connection method that does not use an HMM for waveform generation, data belonging to a cluster in which learning data is sparse lacks the data amount of the corresponding unit waveform. Therefore, according to this embodiment, even when the waveform connection method or the like is used, it is possible to avoid deterioration in sound quality in that data belonging to a sparse cluster is not used.
As mentioned above, although this invention was demonstrated with reference to each embodiment, this invention is not limited to the above embodiment. Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention. For example, the speech synthesis system according to each embodiment may store the extracted density information in a database (not shown) and use it appropriately with reference to a correspondence table or the like.
FIG. 12 is a block diagram illustrating an example of a hardware configuration that implements the speech synthesis system 2000 according to the second embodiment. Although the second embodiment will be described as an example here, a speech synthesis system according to another embodiment may be realized by a similar hardware configuration.
As shown in FIG. 12, each unit constituting the speech synthesis system 2000 includes a CPU (Central Processing Unit) 100, a communication IF (interface) 200 for network connection, a memory 300, and a storage such as a hard disk for storing a program. It is realized by a computer device including a device 400, an input device 500, and an output device 600. However, the configuration of the speech synthesis system 2000 is not limited to the computer apparatus shown in FIG.
The CPU 100 controls the entire speech synthesis system 2000 by operating an operating system. Further, the CPU 100 reads programs and data from a recording medium mounted on, for example, a drive device to the memory 300, and executes various processes according to the programs and data.
The recording device 400 is, for example, an optical disk, a flexible disk, a magnetic optical disk, an external hard disk, a semiconductor memory, etc., and records a computer program so that it can be read by a computer. The storage device 400 may be, for example, the learning database 4 or the prosody generation model storage unit 6. The computer program may be downloaded from an external computer (not shown) connected to the communication network.
The input device 500 receives input text from the user in the speech learning device 40, for example. The output device 600 outputs the synthesized speech that is finally generated.
In addition, the block diagram utilized in each embodiment described so far has shown the block of a functional unit instead of the structure of a hardware unit. Further, the means for realizing the constituent parts of the speech synthesis system 2000 is not particularly limited. That is, the speech synthesis system 2000 may be realized by one physically coupled device, or may be realized by connecting two or more physically separated devices in a wired or wireless manner and by a plurality of these devices. Also good. In that case, the two physically separated devices may be used as the speech synthesis learning device 20 and the speech synthesis device 40, respectively.
The program of the present invention may be a program that causes a computer to execute the operations described in the above embodiments.
In each of the above embodiments, characteristic configurations of a speech synthesis system, a speech synthesis method, and a speech synthesis program as described below are shown.
(Appendix 1)
A learning database that stores learning data that is a set of feature values extracted from speech waveform data;
A feature amount space dividing means for dividing a feature amount space, which is a space related to learning data stored in the learning database, into subspaces;
A sparse / dense state detecting unit that detects a sparse / dense state for each partial space that is a feature amount space divided by the feature amount space dividing unit, and generates and outputs sparse / dense information that is information indicating the sparse / dense state;
Rule generating means for generating a speech synthesis rule, which is a rule for generating pronunciation information used for speech synthesis, based on the density information output from the density state detecting means;
Speech synthesis system including
(Appendix 2)
Prosody learning means for learning a prosodic model and creating a prosody generation model in a subspace which is a feature amount space divided by the feature amount space dividing means;
Prosody generation model storage means for storing the prosody generation model created by the prosody learning means;
Prosody generation means for generating prosody information using the prosody generation model stored in the prosody generation model storage means for the pronunciation information generated according to the speech synthesis rules generated by the rule generation means;
The speech synthesis system according to claim 1, further comprising:
(Appendix 3)
It further includes a dictionary for storing rules necessary for text language analysis processing,
The rule generation means generates a rule for speech synthesis by correcting a rule stored in the dictionary.
The speech synthesis system according to appendix 1 or 2.
(Appendix 4)
A modified dictionary that stores the modified rules generated by the rule generating means as rules for speech synthesis;
Language analysis means for receiving input of text, generating pronunciation information based on the speech synthesis rules stored in the correction dictionary from the text, and outputting the pronunciation information to the prosody generation means;
The speech synthesis system according to supplementary note 3, further comprising:
(Appendix 5)
The rule generating means corrects the speech synthesis rule by deleting the data of the accent phrase determined to belong to a sparse partial space based on the sparse / dense information;
The speech synthesis system according to attachment 4.
(Appendix 6)
The rule generation means corrects a speech synthesis rule related to a pause insertion position or an input text phrase,
The speech synthesis system according to any one of appendices 3 to 5.
(Appendix 7)
The feature amount space dividing means divides the feature amount space into partial spaces by binary tree structure clustering based on the information amount.
The speech synthesis system according to any one of appendices 1 to 6.
(Appendix 8)
The prosody learning means performs learning of the prosody model by HMM learning.
The speech synthesis system according to any one of appendices 2 to 7.
(Appendix 9)
Waveform learning means for learning a waveform model and creating a waveform generation model in a partial space that is a feature amount space divided by the feature amount space dividing means;
Waveform generation model storage means for storing the waveform generation model created by the waveform learning means;
From the prosody information generated by the prosody generation unit, a waveform generation unit that generates a speech waveform using a waveform generation model stored in the waveform generation model storage unit and outputs the generated speech waveform as synthesized speech;
The speech synthesis system according to any one of appendices 1 to 8, further including:
(Appendix 10)
Stores learning data that is a set of features extracted from speech waveform data,
Dividing the feature amount space, which is a space related to the learning data to be stored, into subspaces;
Detects a sparse / dense state for each partial space that is the divided feature amount space, generates and outputs sparse / dense information that is information indicating the sparse / dense state,
Generate a speech synthesis rule that is a rule for generating pronunciation information used for speech synthesis based on the output density information.
Speech synthesis method.
(Appendix 11)
Stores learning data that is a set of features extracted from speech waveform data,
Dividing the feature amount space, which is a space related to the learning data to be stored, into subspaces;
Detects a sparse / dense state for each partial space that is the divided feature amount space, generates and outputs sparse / dense information that is information indicating the sparse / dense state,
Generate a speech synthesis rule that is a rule for generating pronunciation information used for speech synthesis based on the output density information.
A recording medium for storing a program that causes a computer to execute processing.
Although the present invention has been described with reference to the embodiments, the present invention is not limited to the above embodiments. Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention.
This application claims the priority on the basis of Japanese application Japanese Patent Application No. 2011-035543 for which it applied on February 22, 2011, and takes in those the indications of all here.

　以上説明したように、本発明は、情報量が限定された学習データを用いた音声合成システムを構築する際に好適に適用可能である。例えば、ニュース記事や自動応答文等といったテキスト全般の読み上げシステムに好適に適用される。 As described above, the present invention can be suitably applied when constructing a speech synthesis system using learning data with a limited amount of information. For example, the present invention is suitably applied to a reading system for general text such as news articles and automatic response sentences.

　１　　特徴量空間分割部
　２　　疎密情報抽出部
　３、１３　規則生成部
　４　　学習用データベース
　５　　韻律学習部
　６　　韻律生成モデル格納部
　７　　言語解析用辞書
　８　　修正言語解析用辞書
　９　　言語解析部
　１０　　韻律生成部
　１１、１７　波形生成部
　１２　　波形学習部
　１４　　音声合成用辞書
　１５　　修正音声合成用辞書
　１６　　波形生成モデル格納部
　１７　　波形生成部
　２０、２１　音声合成学習装置
　３０、３１　ＨＭＭ学習部
　４０、４１　音声合成装置
　１００　　ＣＰＵ
　２００　　通信ＩＦ
　３００　　メモリ
　４００　　記憶装置
　５００　　入力装置
　６００　　出力装置
　１０００、２０００、３０００　音声合成システム DESCRIPTION OF SYMBOLS 1 Feature-value space division part 2 Density information extraction part 3, 13 Rule generation part 4 Learning database 5 Prosody learning part 6 Prosody generation model storage part 7 Language analysis dictionary 8 Modified language analysis dictionary 9 Language analysis part 10 Prosody generation part DESCRIPTION OF SYMBOLS 11, 17 Waveform production | generation part 12 Waveform learning part 14 Dictionary for speech synthesis 15 Dictionary for correction speech synthesis 16 Waveform generation model storage part 17 Waveform generation part 20, 21 Speech synthesis learning apparatus 30, 31 HMM learning part 40, 41 Speech synthesis apparatus 100 CPU
200 Communication IF
300 memory 400 storage device 500 input device 600 output device 1000, 2000, 3000 speech synthesis system

Claims

A learning database that stores learning data that is a set of feature values extracted from speech waveform data;
A feature amount space dividing means for dividing a feature amount space, which is a space related to learning data stored in the learning database, into subspaces;
A sparse / dense state detecting unit that detects a sparse / dense state for each partial space that is a feature amount space divided by the feature amount space dividing unit, and generates and outputs sparse / dense information that is information indicating the sparse / dense state;
Rule generating means for generating a speech synthesis rule, which is a rule for generating pronunciation information used for speech synthesis, based on the density information output from the density state detecting means;
Speech synthesis system including

Prosody learning means for learning a prosodic model and creating a prosody generation model in a subspace which is a feature amount space divided by the feature amount space dividing means;
Prosody generation model storage means for storing the prosody generation model created by the prosody learning means;
Prosody generation means for generating prosody information using the prosody generation model stored in the prosody generation model storage means for the pronunciation information generated according to the speech synthesis rules generated by the rule generation means;
The speech synthesis system according to claim 1, further comprising:

It further includes a dictionary for storing rules necessary for text language analysis processing,
The rule generation means generates a rule for speech synthesis by correcting a rule stored in the dictionary.
The speech synthesis system according to claim 1 or 2.

A modified dictionary that stores the modified rules generated by the rule generating means as rules for speech synthesis;
Language analysis means for receiving input of text, generating pronunciation information based on the speech synthesis rules stored in the correction dictionary from the text, and outputting the pronunciation information to the prosody generation means;
The speech synthesis system according to claim 3, further comprising:

The rule generating means corrects the speech synthesis rule by deleting the data of the accent phrase determined to belong to a sparse partial space based on the sparse / dense information;
The speech synthesis system according to claim 4.

The rule generation means corrects a speech synthesis rule related to a pause insertion position or an input text phrase,
The speech synthesis system according to any one of claims 3 to 5.

The feature amount space dividing means divides the feature amount space into partial spaces by binary tree structure clustering based on the information amount.
The speech synthesis system according to any one of claims 1 to 6.

The prosody learning means performs learning of the prosody model by HMM learning.
The speech synthesis system according to any one of claims 2 to 7.

Stores learning data that is a set of features extracted from speech waveform data,
Dividing the feature amount space, which is a space related to the learning data to be stored, into subspaces;
Detects a sparse / dense state for each partial space that is the divided feature amount space, generates and outputs sparse / dense information that is information indicating the sparse / dense state,
Generate a speech synthesis rule that is a rule for generating pronunciation information used for speech synthesis based on the output density information.
Speech synthesis method.

Stores learning data that is a set of features extracted from speech waveform data,
Dividing the feature amount space, which is a space related to the learning data to be stored, into subspaces;
Detects a sparse / dense state for each partial space that is the divided feature amount space, generates and outputs sparse / dense information that is information indicating the sparse / dense state,
Generate a speech synthesis rule that is a rule for generating pronunciation information used for speech synthesis based on the output density information.
A recording medium for storing a program that causes a computer to execute processing.