JP2011141470A

JP2011141470A - Phoneme information-creating device, voice synthesis system, voice synthesis method and program

Info

Publication number: JP2011141470A
Application number: JP2010002697A
Authority: JP
Inventors: Masanori Kato; 正徳加藤; Reiji Kondo; 玲史近藤
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2010-01-08
Filing date: 2010-01-08
Publication date: 2011-07-21

Abstract

PROBLEM TO BE SOLVED: To provide a phoneme information-creating device capable of reducing the data amount of phoneme information, which is the basis of the voice to be synthesized, while preventing degradation of synthesized voice quality.

SOLUTION: A phoneme information-creating device 300 includes: a frame period determination section 301 that determines a frame period expressing a time interval, based on rhythm-relevant information, which is information regarding the rhythm of the voice; and a phoneme information creation section 302 that, for each of a plurality of time frames positioned so that the starting positions of two contiguous time frames are separated by the determined frame period, extracts a feature parameter expressing a feature of the part within that time frame of a phoneme that is a part of a basic voice serving as the base of voice synthesis processing, and that creates phoneme information including time-sequence data of the extracted feature parameters.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to speech synthesis technology, and more particularly to a segment information generation device that generates the segment information used when speech is synthesized.

Speech synthesizers are known that analyze character string information representing a character string and perform speech synthesis processing, which synthesizes the speech represented by that character string according to a rule-based synthesis method. Details of the configuration and operation of this type of speech synthesizer are described in, for example, Non-Patent Documents 1 to 3 and Patent Documents 1 and 2.

As shown in FIG. 1, this type of speech synthesizer includes a language processing unit 1, a prosody generation unit 2, a segment information storage unit 12, a segment selection unit 3, and a waveform generation unit 4. The segment information storage unit 12 stores segment information that includes speech segment information representing speech segments and attribute information representing the attributes of each speech segment. Here, a speech segment is a part of the basic speech (speech uttered by a human, that is, natural speech) on which speech synthesis processing is based, and is generated by dividing the basic speech into speech synthesis units.

The speech segment information includes either information representing the speech waveform of a speech segment, or feature parameters that are extracted from the speech segment and represent its features (for example, parameters extracted by linear prediction (linear prediction analysis parameters) or cepstrum coefficients). The speech synthesis unit is often a phoneme, a syllable, a demisyllable such as CV (where V is a vowel and C is a consonant), CVC, VCV, or the like. The attribute information includes the environment of each speech segment in the basic speech (phoneme environment) and prosodic information (fundamental frequency (pitch frequency), amplitude, duration, and the like).

The speech segment information and the attribute information are generated based on, for example, speech uttered by an announcer or a voice actor. The person (speaker) who uttered the speech on which the speech segment information is based is also called the original speaker. Details of speech segment lengths and speech synthesis units are described in Non-Patent Documents 1 and 3.

The language processing unit 1 analyzes the input character string information by morphological analysis, syntactic analysis, reading assignment, and the like. The language processing unit 1 then outputs, as the language analysis result, information representing a symbol string that expresses the "reading" (such as phoneme symbols) and information representing the part of speech, conjugation, accent type, and the like of each morpheme to the prosody generation unit 2 and the segment selection unit 3.

Based on the language analysis result output from the language processing unit 1, the prosody generation unit 2 generates the prosody of the speech to be synthesized (synthesized speech), that is, information on pitch (pitch frequency), length (duration), loudness (power), and the like. It outputs prosodic information representing the generated prosody, as target prosodic information, to the segment selection unit 3 and the waveform generation unit 4.

Based on the language analysis result and the target prosodic information, the segment selection unit 3 selects segment information from the segment information stored in the segment information storage unit 12 as described below, and outputs the selected segment information to the waveform generation unit 4.

Specifically, based on the input language analysis result and target prosodic information, the segment selection unit 3 obtains, for each speech synthesis unit, information representing the features of the synthesized speech (hereinafter called the "target segment environment"). The target segment environment includes the current, preceding, and following phonemes, the presence or absence of stress, the distance from the accent nucleus, and, for each speech synthesis unit, the pitch frequency, power, duration, cepstrum, MFCC (Mel Frequency Cepstral Coefficients), and the Δ quantities of these (amounts of change per unit time).
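The Δ quantities mentioned above can be illustrated with a short sketch. It assumes a first-order difference between consecutive values divided by the time step; the function name `delta` and the sample values are hypothetical, and the source does not specify how the Δ quantities are actually computed.

```python
def delta(values, step_seconds):
    """Change per unit time between consecutive values.

    Assumption: the source only defines a delta quantity as "the amount of
    change per unit time"; a simple forward difference is used here.
    """
    if step_seconds <= 0:
        raise ValueError("step_seconds must be positive")
    return [(b - a) / step_seconds for a, b in zip(values, values[1:])]


# Hypothetical pitch frequencies (Hz) observed 0.5 s apart.
pitch_hz = [100.0, 110.0, 105.0]
print(delta(pitch_hz, 0.5))  # [20.0, -10.0]
```

A sequence of n values yields n - 1 Δ values under this definition; other conventions (for example, regression over several neighboring values) are equally possible.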

Next, the segment selection unit 3 acquires from the segment information storage unit 12 multiple pieces of segment information, each representing a speech segment whose phoneme corresponds to (for example, matches) specific information (mainly the current phoneme) included in the obtained target segment environment. The acquired pieces of segment information are candidates for the segment information used to synthesize speech.

The segment selection unit 3 then calculates, for each acquired piece of segment information, a cost, which is an index of how appropriate that segment information is for synthesizing the speech. The cost is a value representing the difference between the target segment environment and the attribute information of the candidate segment information; the smaller the difference (that is, the more similar the two are), the smaller the cost. The lower the cost of the segment information used, the higher the naturalness of the synthesized speech, where naturalness is the degree to which the speech resembles speech uttered by a human. The segment selection unit 3 therefore selects the segment information with the lowest calculated cost and outputs it to the waveform generation unit 4.
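The cost-based selection described above can be sketched as follows. This is a minimal illustration, not the cost function of any cited document: the `Attributes` fields, the weighted absolute-difference cost, and the weights are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Attributes:
    """Hypothetical subset of the attribute information of one segment."""
    name: str
    pitch_hz: float
    power_db: float
    duration_ms: float


def cost(target, candidate, weights=(1.0, 1.0, 1.0)):
    """Weighted absolute difference between target and candidate attributes.

    A smaller value means the candidate is closer to the target environment.
    """
    w_pitch, w_power, w_dur = weights
    return (w_pitch * abs(target.pitch_hz - candidate.pitch_hz)
            + w_power * abs(target.power_db - candidate.power_db)
            + w_dur * abs(target.duration_ms - candidate.duration_ms))


def select_segment(target, candidates):
    """Return the candidate with the lowest cost."""
    return min(candidates, key=lambda c: cost(target, c))


target = Attributes("target", 120.0, 60.0, 100.0)
candidates = [
    Attributes("a", 130.0, 60.0, 100.0),  # cost 10.0
    Attributes("b", 121.0, 61.0, 102.0),  # cost 4.0
]
print(select_segment(target, candidates).name)  # b
```

In practice the weights would have to balance quantities with different units; the equal weights here are only for readability.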

Based on the target prosodic information supplied from the prosody generation unit 2 and the segment information supplied from the segment selection unit 3, the waveform generation unit 4 generates speech waveforms whose prosody matches or is similar to the target prosody, and concatenates the generated speech waveforms to produce the synthesized speech.

In this specification, generating a speech waveform whose prosody matches or is similar to the target prosody based on segment information is called prosody control, in the sense that it controls the prosody of a speech segment. A speech waveform generated from segment information is also called a segment waveform to distinguish it from other speech waveforms.

The speech segments represented by the segment information supplied from the segment selection unit 3 are classified into speech segments consisting of voiced sounds and speech segments consisting of unvoiced sounds. The method used for prosody control of voiced sounds differs from the method used for prosody control of unvoiced sounds. The method used for prosody control of unvoiced sounds is, for example, the method described in Non-Patent Document 6 or Non-Patent Document 7.

The method used for prosody control of voiced sounds, on the other hand, is, for example, the PSOLA (Pitch Synchronous OverLap Add) method described in Non-Patent Document 4. The method described in Patent Document 3 may also be used. In that method, a complex cepstrum representing the spectral envelope of the original speech waveform is obtained, and a filter expressing the complex cepstrum is driven at time intervals corresponding to the desired pitch frequency, which generates a segment waveform having the desired pitch frequency.

Next, a segment information generation device that generates the segment information stored in the segment information storage unit 12 will be described with reference to FIG. 2. As shown in FIG. 2, the segment information generation device includes a recorded sentence storage unit 21, a natural speech storage unit 22, an attribute information extraction unit 7, a feature parameter extraction unit 6, and a segment information storage unit 12.

The natural speech storage unit 22 stores information representing the basic speech (natural speech waveform) from which segment information is generated. The recorded sentence storage unit 21 stores language information that includes information representing the character string corresponding to the basic speech stored in the natural speech storage unit 22. The language information is, for example, information representing a sentence in mixed kanji and kana. The language information may further include information such as readings, accent positions, and the parts of speech of morphemes (for example, information in the same format as the information output by the language processing unit 1).

The feature parameter extraction unit 6 extracts time-series data of feature parameters from the natural speech waveform supplied from the natural speech storage unit 22 and outputs the extracted time-series data to the segment information storage unit 12. The extraction method depends on the method (waveform generation method) that the waveform generation unit 4 of the speech synthesizer shown in FIG. 1 uses to generate speech waveforms. For example, when the waveform generation unit 4 uses the PSOLA method described in Non-Patent Document 4 as the waveform generation method, the feature parameter extraction unit 6 preferably uses the method described in Non-Patent Document 4 to extract the feature parameters.

When linear prediction analysis parameters or cepstrum coefficients are used as the feature parameters, the waveform generation unit 4 uses the method described in Non-Patent Document 5. The attribute information extraction unit 7 acquires attribute information based on the natural speech waveform supplied from the natural speech storage unit 22 and the language information supplied from the recorded sentence storage unit 21, and outputs the acquired attribute information to the segment information storage unit 12.

When the speech segment information generated by the segment information generation device is not the information representing the speech waveform itself but time-series data of feature parameters, each feature parameter is generated by applying a window function to the natural speech waveform, as described in Non-Patent Document 5.

Specifically, the feature parameter extraction unit 6 extracts a feature parameter representing the features of the portion of the natural speech waveform that lies within a time frame of a predetermined time length (frame length). The feature parameter extraction unit 6 acquires the feature parameters extracted for each of a plurality of time frames as the time-series data of feature parameters. The time frames are arranged so that the start positions of two consecutive time frames are separated by a predetermined frame period.
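The arrangement of time frames by frame length and frame period can be sketched as follows. The function names are hypothetical, lengths are expressed in samples, and the per-frame feature is mean-square energy, which is only a stand-in for the linear prediction or cepstrum analysis named in the text. No window function other than a rectangular cut is applied in this sketch.

```python
def frame_starts(num_samples, frame_length, frame_period):
    """Start indices of all frames that fit entirely inside the signal.

    Consecutive start positions are separated by exactly frame_period.
    """
    if frame_length <= 0 or frame_period <= 0:
        raise ValueError("frame_length and frame_period must be positive")
    if num_samples < frame_length:
        return []
    return list(range(0, num_samples - frame_length + 1, frame_period))


def extract_features(waveform, frame_length, frame_period):
    """Return one feature value per time frame (the time-series data).

    Mean-square energy is used as a placeholder feature.
    """
    features = []
    for start in frame_starts(len(waveform), frame_length, frame_period):
        frame = waveform[start:start + frame_length]
        features.append(sum(x * x for x in frame) / frame_length)
    return features


waveform = [1.0] * 10
print(frame_starts(len(waveform), 4, 2))  # [0, 2, 4, 6]
print(frame_starts(len(waveform), 4, 3))  # [0, 3, 6]
print(extract_features(waveform, 4, 2))   # [1.0, 1.0, 1.0, 1.0]
```

A shorter frame period yields more frames for the same waveform, which is the quantity the rest of the discussion turns on.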

The values of the frame length and the frame period differ depending on the method used to extract the feature parameters. In some cases the frame length and the frame period are set to values synchronized with the pitch frequency of the natural speech waveform from which the feature parameters are extracted (pitch frequency synchronous method); in other cases constant values are always used without synchronizing with the pitch frequency (pitch frequency asynchronous method). One example of the pitch frequency synchronous method is the PSOLA method.

The pitch frequency asynchronous method, on the other hand, is used when extracting linear prediction analysis parameters or cepstrum coefficients. For example, the segment information generation device described in Patent Document 4 extracts feature parameters using the pitch frequency asynchronous method. That device uses empirically determined values for the frame period and the frame length.

Patent Document 1: JP 2005-91551 A
Patent Document 2: JP 2006-84854 A
Patent Document 3: Japanese Patent No. 2812184
Patent Document 4: JP 2003-66982 A

Non-Patent Document 1: Huang, Acero, Hon, "Spoken Language Processing", Prentice Hall, 2001, p. 689-836
Non-Patent Document 2: Ishikawa, "Basics of Prosodic Control for Speech Synthesis", IEICE Technical Report, IEICE, 2000, Vol. 100, No. 392, p. 27-34
Non-Patent Document 3: Abe, "Basics of Synthesis Units for Speech Synthesis", IEICE Technical Report, IEICE, 2000, Vol. 100, No. 392, p. 35-42
Non-Patent Document 4: Moulines, Charpentier, "Pitch-Synchronous Waveform Processing Techniques For Text-To-Speech Synthesis Using Diphones", Speech Communication, 1990, Vol. 9, p. 435-467
Non-Patent Document 5: Furui, "Speech Information Processing", Morikita Publishing, 1998
Non-Patent Document 6: R. Suzuki, M. Misaki, "Time-scale modification of speech signals using cross-correlation functions", IEEE Trans. Consum. Electron., IEEE, 1992, Vol. 38, p. 357-363
Non-Patent Document 7: Seiyama et al., "Development of a high-quality real-time speech rate conversion system", IEICE Transactions, IEICE, 2001, Vol. J84-D-II, No. 6, p. 918-926

However, when speech segment information (here, time-series data of feature parameters) is generated using an empirically determined frame period, the quality (sound quality) of speech synthesized from that speech segment information may be low. For example, if the frame period is set to an excessively large value, feature parameters are extracted less frequently. As a result, the continuity in the time direction of the feature parameters used in speech synthesis processing becomes insufficient (that is, the feature parameters no longer change smoothly enough over time), and the sound quality deteriorates. This phenomenon is described with reference to FIG. 3.

FIG. 3 is an explanatory diagram conceptually showing the process of synthesizing speech by extracting time-series data of feature parameters from natural speech and reconstructing the extracted time-series data at a desired pitch period. The left column of FIG. 3 shows the case where the frame period is relatively short, and the right column shows the case where it is relatively long.

As shown in FIG. 3, the shorter the frame period, the larger the number of feature parameters (the amount of time-series data) extracted from a natural speech waveform of a given time length. In this example, feature parameters are extracted for each of seven time frames when the frame period is short, and for each of three time frames when the frame period is long.

When a speech synthesizer synthesizes speech by reconstructing the time-series data of feature parameters at a desired pitch period, it selects a feature parameter for each pitch period and generates a speech waveform from the selected feature parameters. The pitch synchronization positions shown in FIG. 3 are positions (times) arranged so that the interval between adjacent pitch synchronization positions equals the pitch period.

In this example, when the frame period is short (left column of FIG. 3), the speech synthesizer selects each of the time frames with frame numbers 2, 3, ..., 7 exactly once. When the frame period is long (right column of FIG. 3), the speech synthesizer selects the time frame with frame number 1 twice, the time frame with frame number 2 three times, and the time frame with frame number 3 once. The speech synthesizer synthesizes speech based on the speech segment information extracted for the selected time frames.

In this example, when selecting a feature parameter for each pitch period, the speech synthesizer selects the feature parameter closest to the pitch synchronization position. When the frame period is short (left column of FIG. 3), no time frame (that is, no feature parameter) is selected repeatedly, so the continuity of the feature parameters in the time direction is preserved (that is, the feature parameters change smoothly enough over time). When the frame period is long (right column of FIG. 3), the speech synthesizer selects the same time frame repeatedly. As a result, the continuity of the feature parameters in the time direction is impaired (that is, the feature parameters do not change smoothly enough over time).
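The nearest-frame selection at each pitch synchronization position can be sketched as follows. The numbers are illustrative and do not reproduce FIG. 3; the function name is hypothetical, frame indices are 0-based, each frame is represented by its start time, and times are integer milliseconds. All of these are assumptions made for the example.

```python
def select_frames(frame_period_ms, num_frames, pitch_period_ms, num_pitch_marks):
    """For each pitch synchronization position, pick the nearest frame index.

    Frames are located at 0, P, 2P, ... (P = frame period); pitch
    synchronization positions are located at 0, T, 2T, ... (T = pitch period).
    """
    frame_times = [i * frame_period_ms for i in range(num_frames)]
    selected = []
    for k in range(num_pitch_marks):
        position = k * pitch_period_ms
        nearest = min(range(num_frames),
                      key=lambda i: abs(frame_times[i] - position))
        selected.append(nearest)
    return selected


# Short frame period: every pitch position gets a different frame.
print(select_frames(5, 7, 5, 6))   # [0, 1, 2, 3, 4, 5]

# Long frame period: the same frame is reused for several pitch positions.
print(select_frames(12, 3, 5, 6))  # [0, 0, 1, 1, 2, 2]
```

The repeated indices in the second result correspond to the loss of time-direction continuity described above.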

In other words, when the frame period is excessively long, there are too few feature parameters available when synthesizing speech, and the speech synthesizer synthesizes speech from an excessively small amount of information. To ensure sufficient sound quality, the frame period must therefore be sufficiently short. However, if the frame period is made excessively short, the amount of segment information data becomes excessively large. As a result, the storage capacity required of the storage device that stores the segment information becomes excessively large.
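The relation between frame period and data amount can be made concrete with a rough estimate. The sketch below assumes a fixed number of feature parameters per frame and a fixed storage size per parameter; the function name and all numeric values are hypothetical.

```python
def segment_data_bytes(duration_ms, frame_length_ms, frame_period_ms,
                       params_per_frame, bytes_per_param=4):
    """Approximate storage needed for the feature time series of one segment."""
    if frame_length_ms <= 0 or frame_period_ms <= 0:
        raise ValueError("frame length and frame period must be positive")
    if duration_ms < frame_length_ms:
        return 0
    # Number of frames that fit entirely inside the segment.
    num_frames = (duration_ms - frame_length_ms) // frame_period_ms + 1
    return num_frames * params_per_frame * bytes_per_param


# A 200 ms segment, 20 ms frames, 25 parameters per frame (all assumed).
print(segment_data_bytes(200, 20, 5, 25))   # 3700
print(segment_data_bytes(200, 20, 20, 25))  # 1000
```

Under these assumptions, lengthening the frame period from 5 ms to 20 ms cuts the stored data to roughly a quarter, at the price of the repeated-frame problem described above.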

Thus, the segment information generation devices described in the above documents suffer either from degraded quality of the synthesized speech or from an excessively large amount of segment information data.

Accordingly, an object of the present invention is to provide a segment information generation device capable of solving the problem described above, namely that the quality of the synthesized speech may be degraded or the amount of segment information data may become excessively large.

To achieve this object, a segment information generation device according to one aspect of the present invention includes:
frame period determination means for determining a frame period, which represents a time interval, based on prosody-related information, which is information on the prosody of speech; and
segment information generation means for extracting, for each of a plurality of time frames arranged so that the start positions of two consecutive time frames are separated by the determined frame period, a feature parameter representing the features of the portion within that time frame of a speech segment that is a part of the basic speech on which speech synthesis processing for synthesizing speech is based, and for generating segment information including time-series data of the extracted feature parameters.

Configured as described above, the present invention can reduce the amount of segment information data on which synthesized speech is based while preventing the quality of the synthesized speech from being degraded.

FIG. 1 is a diagram showing the schematic configuration of a speech synthesizer according to the background art.
FIG. 2 is a diagram showing the schematic configuration of a segment information generation device according to the background art.
FIG. 3 is an explanatory diagram conceptually showing the process of synthesizing speech by extracting time-series data of feature parameters from natural speech and reconstructing the extracted time-series data at a desired pitch period.
FIG. 4 is a diagram showing the schematic configuration of a segment information generation device according to a first embodiment of the present invention.
FIG. 5 is a diagram showing the schematic configuration of a segment information generation device according to a second embodiment of the present invention.
FIG. 6 is a diagram showing the schematic configuration of a segment information generation device according to a third embodiment of the present invention.
FIG. 7 is a diagram showing the schematic configuration of a segment information generation device according to a fourth embodiment of the present invention.
FIG. 8 is a diagram showing the schematic configuration of a speech synthesizer according to a fifth embodiment of the present invention.
FIG. 9 is a diagram showing the schematic configuration of a segment information generation device according to a sixth embodiment of the present invention.

Embodiments of a segment information generation device, a speech synthesis system, a speech synthesis method, and a program according to the present invention will be described below with reference to FIGS. 1 to 9.

<First embodiment>
(Configuration)
As shown in FIG. 4, the segment information generation device 100 according to the first embodiment of the present invention includes a recorded sentence storage unit 21, a natural speech storage unit 22, an attribute information extraction unit 7, a feature parameter extraction unit 6, a segment information storage unit 12, a frame period storage unit 13, a frame period determination unit (frame period determination means) 31, a synthesized speech prosodic information estimation unit (prosodic information estimation means) 33, and a synthesized speech prosodic information storage unit 34. The feature parameter extraction unit 6 and the attribute information extraction unit 7 constitute the segment information generation means.

The natural speech storage unit 22 stores information representing the basic speech (natural speech waveform) from which segment information is generated. The segment information includes speech segment information representing speech segments and attribute information representing the attributes of each speech segment. Here, a speech segment is a part of the basic speech (speech uttered by a human, that is, natural speech) on which the speech synthesis processing for synthesizing speech is based, and is generated by dividing the basic speech into speech synthesis units.

In this example, the speech segment information includes time-series data of feature parameters that are extracted from a speech segment and represent its features. The speech synthesis unit is a syllable. The speech synthesis unit may instead be a phoneme, a demisyllable such as CV (where V is a vowel and C is a consonant), CVC, VCV, or the like. The attribute information includes the environment of each speech segment in the basic speech (phoneme environment) and prosodic information (fundamental frequency (pitch frequency), amplitude, duration, and the like).

The recorded sentence storage unit 21 stores language information that includes information representing the character string (recorded sentence) corresponding to the basic speech stored in the natural speech storage unit 22. The language information is, for example, information representing a sentence in mixed kanji and kana. The language information may further include information such as readings, accent positions, and the parts of speech of morphemes.

The feature parameter extraction unit 6 extracts time-series data of feature parameters from the natural speech waveform supplied from the natural speech storage unit 22. Specifically, the feature parameter extraction unit 6 extracts a feature parameter representing the features of the portion of the natural speech waveform that lies within a time frame of a preset time length (frame length).

特徴パラメータ抽出部6は、複数の時間フレームのそれぞれに対して抽出された特徴パラメータを、特徴パラメータの時系列データとして取得する。複数の時間フレームは、連続する2つの時間フレームの開始位置が、フレーム周期決定部31により決定されるフレーム周期(後述するように、フレーム周期記憶部13に記憶されているフレーム周期)だけ離れるように配置される。ここで、フレーム周期は、時間間隔を表す。特徴パラメータ抽出部6は、取得した時系列データを音声素片情報として素片情報記憶部12へ出力する。 The feature parameter extraction unit 6 acquires the feature parameters extracted for each of a plurality of time frames as time-series data of feature parameters. The time frames are arranged so that the start positions of two consecutive time frames are separated by the frame period determined by the frame period determination unit 31 (as described later, the frame period stored in the frame period storage unit 13). Here, the frame period represents a time interval. The feature parameter extraction unit 6 outputs the acquired time-series data to the segment information storage unit 12 as speech unit information.
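As an illustration of the frame arrangement described above, the following is a minimal Python sketch, not the implementation of the feature parameter extraction unit 6. The function names `frame_starts` and `extract_features`, the use of sample counts rather than seconds for the frame length and frame period, and the caller-supplied `feature_fn` are all assumptions made for this example.

```python
def frame_starts(num_samples, frame_length, frame_period):
    """Return start indices of time frames whose consecutive starts are frame_period apart.

    All quantities are in samples (an assumption; the description states the
    frame period only as a time interval).
    """
    if frame_length <= 0 or frame_period <= 0:
        raise ValueError("frame_length and frame_period must be positive")
    last_start = num_samples - frame_length  # last start that still fits a full frame
    return list(range(0, last_start + 1, frame_period))


def extract_features(waveform, frame_length, frame_period, feature_fn):
    """Apply feature_fn (a hypothetical feature extractor) to each frame, in time order."""
    starts = frame_starts(len(waveform), frame_length, frame_period)
    return [feature_fn(waveform[s:s + frame_length]) for s in starts]
```

For a given waveform, doubling `frame_period` roughly halves the number of frames, and hence the length of the time-series data stored as speech unit information.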

属性情報抽出部7は、自然音声記憶部22から供給される自然音声波形と、収録文記憶部21から供給される言語情報と、に基づいて、属性情報を取得し、取得した属性情報を素片情報記憶部12へ出力する。 The attribute information extraction unit 7 acquires attribute information based on the natural speech waveform supplied from the natural speech storage unit 22 and the language information supplied from the recorded sentence storage unit 21, and outputs the acquired attribute information to the segment information storage unit 12.
即ち、特徴パラメータ抽出部6及び属性情報抽出部7は、特徴パラメータの時系列データを含む素片情報を生成している、と言うことができる。 That is, the feature parameter extraction unit 6 and the attribute information extraction unit 7 can be said to generate segment information including time-series data of feature parameters.

素片情報記憶部12は、特徴パラメータ抽出部6から供給された音声素片情報と、属性情報抽出部7から供給された属性情報と、を含む素片情報を記憶する。 The segment information storage unit 12 stores segment information including the speech segment information supplied from the feature parameter extraction unit 6 and the attribute information supplied from the attribute information extraction unit 7.

合成音声韻律情報記憶部34は、素片情報記憶部12に記憶されている(即ち、素片情報生成装置100が生成した)素片情報に基づいて音声合成処理を行う図示しない音声合成装置(音声合成処理手段)から出力された韻律情報を予め記憶している。即ち、合成音声韻律情報記憶部34は、上記音声合成装置が過去の時点にて合成した音声である過去合成音声の韻律を表す韻律情報を予め記憶している。 The synthesized speech prosody information storage unit 34 stores in advance prosodic information output from a speech synthesizer (speech synthesis processing means), not shown, that performs speech synthesis processing based on the segment information stored in the segment information storage unit 12 (that is, generated by the segment information generation device 100). In other words, the synthesized speech prosody information storage unit 34 stores in advance prosodic information representing the prosody of past synthesized speech, which is speech that the speech synthesizer synthesized at a past time.

なお、合成音声韻律情報記憶部34が記憶する韻律情報の統計量に基づいて、後述する合成音声韻律情報推定部33が、上記音声合成装置が将来の時点にて合成する音声である将来合成音声の韻律を表す韻律情報を推定する。従って、合成音声韻律情報記憶部34が記憶する韻律情報の情報量は、十分に多い量であることが好適である。 Note that the synthesized speech prosody information estimation unit 33, described later, estimates prosodic information representing the prosody of future synthesized speech (speech that the speech synthesizer will synthesize at a future time) based on statistics of the prosodic information stored in the synthesized speech prosody information storage unit 34. Accordingly, it is preferable that the synthesized speech prosody information storage unit 34 store a sufficiently large amount of prosodic information.

合成音声韻律情報推定部33は、合成音声韻律情報記憶部34に記憶されている、過去合成音声の韻律を表す韻律情報に基づいて、将来合成音声の韻律を表す韻律情報を推定する。合成音声韻律情報推定部33は、推定された韻律情報を韻律関連情報としてフレーム周期決定部31へ出力する。韻律関連情報は、音声の韻律に関する情報である。 The synthesized speech prosody information estimation unit 33 estimates prosodic information representing the prosody of future synthesized speech based on the prosodic information representing the prosody of past synthesized speech stored in the synthesized speech prosody information storage unit 34. The synthesized speech prosody information estimation unit 33 outputs the estimated prosodic information to the frame period determination unit 31 as prosody related information. The prosody related information is information relating to the prosody of speech.

ところで、将来の時点にて行われる音声合成処理の対象となるテキスト(文字列)を特定することはできない。従って、本例では、合成音声韻律情報推定部33は、将来の時点にて、任意のテキストが入力された場合に音声合成装置が合成する音声の韻律を表す韻律情報を推定する。 By the way, it is not possible to specify text (character string) to be subjected to speech synthesis processing performed at a future time. Therefore, in this example, the synthesized speech prosody information estimation unit 33 estimates prosody information representing the speech prosody synthesized by the speech synthesizer when an arbitrary text is input at a future time point.

なお、任意のテキストが入力された場合に音声合成装置が合成する音声の韻律は、ある一定の範囲内の値を有する。従って、過去合成音声の韻律を表す韻律情報の統計量に基づいて将来合成音声の韻律を表す韻律情報を推定することが好適であると考えられる。 Note that the prosody of the speech synthesized by the speech synthesizer when any text is input has a value within a certain range. Therefore, it is considered preferable to estimate prosodic information representing the prosody of the future synthesized speech based on the statistics of the prosodic information representing the prosody of the past synthesized speech.

そこで、合成音声韻律情報推定部33は、過去合成音声の韻律を表す韻律情報の統計量を算出する。合成音声韻律情報推定部33は、算出した過去合成音声の韻律を表す韻律情報の統計量を、将来合成音声の韻律を表す韻律情報として推定する。 Therefore, the synthesized speech prosody information estimation unit 33 calculates a statistic of prosodic information representing the prosody of the past synthesized speech. The synthesized speech prosody information estimation unit 33 estimates the calculated statistics of the prosody information representing the prosody of the past synthesized speech as the prosodic information representing the prosody of the future synthesized speech.

ここで、韻律情報の統計量は、ピッチ周波数、又は、継続時間長の、平均値、分散、頻度(ヒストグラム)、最大値、最小値、又は、最頻値等である。これらの統計量は、合成音声韻律情報記憶部34に記憶されている韻律情報の全体から一意に算出されてもよく、また、音素、音節、半音節、又は、素片等の単位毎に算出されてもよい。 Here, the statistic of the prosodic information is, for example, the mean, variance, frequency distribution (histogram), maximum, minimum, or mode of the pitch frequency or the duration. These statistics may be calculated as single values over all of the prosodic information stored in the synthesized speech prosody information storage unit 34, or may be calculated for each unit such as a phoneme, syllable, semi-syllable, or segment.

本例では、合成音声韻律情報推定部33は、韻律情報の統計量として、音声合成単位毎に、ピッチ周波数の平均値及び分散を算出する。 In this example, the synthesized speech prosody information estimation unit 33 calculates an average value and variance of pitch frequencies for each speech synthesis unit as a statistic of prosodic information.
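The per-unit statistics described here can be sketched in Python as follows. This is only an illustration under stated assumptions: the input format (pairs of a speech synthesis unit label and a pitch frequency in Hz), the function name `pitch_statistics`, and the choice of the population variance are not specified in the description.

```python
from collections import defaultdict
from statistics import fmean, pvariance


def pitch_statistics(records):
    """Return {unit: (mean_f0, variance_f0)} from (unit, f0_hz) pairs.

    The record format and the use of population variance are assumptions
    made for this sketch.
    """
    by_unit = defaultdict(list)
    for unit, f0_hz in records:
        by_unit[unit].append(f0_hz)
    return {unit: (fmean(values), pvariance(values))
            for unit, values in by_unit.items()}
```

For example, `pitch_statistics([("wa", 100.0), ("wa", 120.0)])` groups both observations under the syllable "wa" and reports their mean and variance as the estimate for that unit.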

フレーム周期決定部31は、合成音声韻律情報推定部33から供給された韻律関連情報に基づいて、音声合成単位(即ち、予め設定された処理単位)毎にフレーム周期を決定する。フレーム周期決定部31は、決定したフレーム周期をフレーム周期記憶部13へ出力する。 The frame period determining unit 31 determines a frame period for each speech synthesis unit (that is, a preset processing unit) based on the prosody related information supplied from the synthesized speech prosody information estimation unit 33. The frame cycle determination unit 31 outputs the determined frame cycle to the frame cycle storage unit 13.

ところで、音声合成処理によって合成される音声(合成音声)のピッチ周波数が高くなるほど、合成音声の品質(音質)を十分に高くするために必要とされる時間フレームの数(フレーム数)は多くなる。即ち、ピッチ周波数の平均値が大きくなるほど、フレーム周期を小さくすることが好適である。 Incidentally, the higher the pitch frequency of the speech synthesized by the speech synthesis processing (synthesized speech), the larger the number of time frames (frame count) required to make the quality (sound quality) of the synthesized speech sufficiently high. That is, the larger the mean pitch frequency, the smaller the frame period should preferably be.

また、ある音声合成単位に対するピッチ周波数の分散が大きくなるほど、フレーム数が不十分であることにより、音質が劣化する可能性が高くなる。 In addition, the greater the variance of the pitch frequency for a given speech synthesis unit, the more likely it is that sound quality will deteriorate because the number of frames is insufficient.
そこで、本例では、フレーム周期決定部31は、ピッチ周波数の平均値及び分散に基づいてフレーム周期を決定する。 Therefore, in this example, the frame period determination unit 31 determines the frame period based on the mean and variance of the pitch frequency.

具体的には、フレーム周期決定部31は、数式1に従って、音声合成単位毎にフレーム周期を決定する。ここで、F0_meanは、韻律関連情報が表すピッチ周波数の平均値であり、F0_varは、韻律関連情報が表すピッチ周波数の分散である。また、T_pは、フレーム周期であり、β₁は、予め設定された閾値である。α₁は、予め設定された正の値を有する(即ち、0よりも大きい)定数である。また、T_p1は、予め設定された正の値を有する第1のフレーム周期であり、T_p2は、予め設定された正の値を有する第2のフレーム周期である。更に、第1のフレーム周期T_p1は、第2のフレーム周期T_p2よりも小さい。即ち、第1のフレーム周期T_p1及び第2のフレーム周期T_p2は、0<T_p1<T_p2を満足する。

Specifically, the frame period determination unit 31 determines the frame period for each speech synthesis unit according to Equation 1. Here, F0_mean is the mean of the pitch frequency represented by the prosody related information, and F0_var is the variance of the pitch frequency represented by the prosody related information. T_p is the frame period, and β₁ is a preset threshold. α₁ is a preset positive constant (that is, greater than 0). T_p1 is a first frame period having a preset positive value, and T_p2 is a second frame period having a preset positive value. Furthermore, the first frame period T_p1 is smaller than the second frame period T_p2. That is, T_p1 and T_p2 satisfy 0 < T_p1 < T_p2.

このように、フレーム周期決定部31は、韻律関連情報が表すピッチ周波数が大きくなるほど小さくなる値を、フレーム周期として決定する。具体的には、フレーム周期決定部31は、韻律関連情報が表すピッチ周波数の平均値が大きくなるほど小さくなるとともに、韻律関連情報が表すピッチ周波数の分散が大きくなるほど小さくなる値を、フレーム周期として決定する。 In this way, the frame period determination unit 31 determines, as the frame period, a value that decreases as the pitch frequency represented by the prosody related information increases. Specifically, the frame period determination unit 31 determines, as the frame period, a value that decreases as the mean of the pitch frequency represented by the prosody related information increases and that also decreases as the variance of that pitch frequency increases.
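Equation 1 itself is not reproduced in this text. The following Python sketch shows one threshold rule that is consistent with the symbol definitions and the monotonic behavior described above. The specific form (comparing F0_mean + α₁·F0_var with β₁, and using "greater than or equal to" rather than "greater than") is an assumption, as are the function and parameter names.

```python
def decide_frame_period(f0_mean, f0_var, alpha1, beta1, t_p1, t_p2):
    """Choose between two preset frame periods for one speech synthesis unit.

    Assumed rule (not confirmed by the source): use the shorter period t_p1
    when f0_mean + alpha1 * f0_var reaches the threshold beta1, else t_p2.
    """
    if not (0 < t_p1 < t_p2):
        raise ValueError("frame periods must satisfy 0 < t_p1 < t_p2")
    if alpha1 <= 0:
        raise ValueError("alpha1 must be positive")
    score = f0_mean + alpha1 * f0_var
    return t_p1 if score >= beta1 else t_p2
```

Under this rule, a unit with a high mean pitch, or a moderate mean pitch with a large variance, receives the shorter frame period and therefore more frames.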

なお、本例では、フレーム周期決定部31は、2つのフレーム周期から1つのフレーム周期を選択するように構成されていたが、3つ以上のフレーム周期から1つのフレーム周期を選択するように構成されていてもよい。この場合、フレーム周期をより高い精度にて制御することができる。この結果、合成音声の品質が低下することを確実に防止しながら、合成音声の基となる素片情報のデータ量をより一層低減することができる。 Note that, in this example, the frame period determination unit 31 is configured to select one frame period from two frame periods, but it may be configured to select one frame period from three or more frame periods. In that case, the frame period can be controlled more precisely. As a result, the data amount of the segment information on which the synthesized speech is based can be reduced further while reliably preventing degradation of the quality of the synthesized speech.

また、本例では、フレーム周期決定部31は、ピッチ周波数の平均値及び分散に基づいてフレーム周期を決定するように構成されていたが、ピッチ周波数の平均値のみに基づいてフレーム周期を決定するように構成されていてもよい。 Also, in this example, the frame period determination unit 31 is configured to determine the frame period based on the mean and variance of the pitch frequency, but it may instead be configured to determine the frame period based only on the mean of the pitch frequency.

フレーム周期記憶部13は、フレーム周期決定部31から出力されたフレーム周期(の値)を素片情報と対応付けて記憶する。上述したように、フレーム周期記憶部13が記憶しているフレーム周期は、特徴パラメータ抽出部6が特徴パラメータを抽出する際に用いられる値であるとともに、上記音声合成装置が音声合成処理を実行する際に用いられる値である。 The frame period storage unit 13 stores the frame period (its value) output from the frame period determination unit 31 in association with the segment information. As described above, the frame period stored in the frame period storage unit 13 is the value used when the feature parameter extraction unit 6 extracts feature parameters, and is also the value used when the speech synthesizer executes speech synthesis processing.

(作動)
(Operation)
次に、上記のように構成された素片情報生成装置100の作動について説明する。 Next, the operation of the segment information generation device 100 configured as described above will be described.
先ず、音声合成装置は、音声合成処理を行うことによって韻律情報を生成し、生成した韻律情報を素片情報生成装置100へ出力する。合成音声韻律情報記憶部34は、音声合成装置から出力された韻律情報を記憶する。このような処理が繰り返されることにより、合成音声韻律情報記憶部34は、十分に多い量の韻律情報を蓄積する。 First, the speech synthesizer generates prosodic information by performing speech synthesis processing, and outputs the generated prosodic information to the segment information generation device 100. The synthesized speech prosody information storage unit 34 stores the prosodic information output from the speech synthesizer. By repeating this processing, the synthesized speech prosody information storage unit 34 accumulates a sufficiently large amount of prosodic information.

その後、素片情報生成装置100のユーザが素片情報を生成する旨を指示する情報を入力する。これにより、素片情報生成装置100は、素片情報を生成する処理を開始する。 Thereafter, the user of the segment information generating apparatus 100 inputs information instructing to generate segment information. Thereby, the segment information generating apparatus 100 starts a process of generating segment information.

先ず、合成音声韻律情報推定部33は、合成音声韻律情報記憶部34に記憶されている韻律情報に基づいて、将来合成音声の韻律を表す韻律情報として、音声合成単位毎に、ピッチ周波数の平均値及び分散を算出(推定)する。 First, the synthesized speech prosody information estimation unit 33 calculates the average pitch frequency for each speech synthesis unit as prosody information representing the prosody of the future synthesized speech based on the prosody information stored in the synthesized speech prosody information storage unit 34. Calculate (estimate) values and variances.

次いで、合成音声韻律情報推定部33は、推定された韻律情報を韻律関連情報としてフレーム周期決定部31へ出力する。 Next, the synthesized speech prosody information estimation unit 33 outputs the estimated prosodic information to the frame period determination unit 31 as prosody related information.
そして、フレーム周期決定部31は、合成音声韻律情報推定部33から出力された韻律関連情報に基づいて、音声合成単位毎にフレーム周期を決定する。次いで、フレーム周期記憶部13は、フレーム周期決定部31により決定されたフレーム周期を記憶する。 Then, the frame period determination unit 31 determines a frame period for each speech synthesis unit based on the prosody related information output from the synthesized speech prosody information estimation unit 33. Next, the frame period storage unit 13 stores the frame period determined by the frame period determination unit 31.

そして、特徴パラメータ抽出部6は、自然音声記憶部22に記憶されている自然音声波形に基づいて、連続する2つの時間フレームの開始位置が、フレーム周期記憶部13に記憶されているフレーム周期だけ離れるように配置された複数の時間フレームのそれぞれに対する特徴パラメータを抽出する。また、属性情報抽出部7は、属性情報を取得する。 Then, based on the natural speech waveform stored in the natural speech storage unit 22, the feature parameter extraction unit 6 extracts a feature parameter for each of a plurality of time frames arranged so that the start positions of two consecutive time frames are separated by the frame period stored in the frame period storage unit 13. The attribute information extraction unit 7 also acquires attribute information.

このようにして、特徴パラメータ抽出部6及び属性情報抽出部7は、特徴パラメータの時系列データを含む音声素片情報と、属性情報と、からなる素片情報を生成する。そして、素片情報記憶部12は、特徴パラメータ抽出部6及び属性情報抽出部7により生成された素片情報を記憶する。 In this way, the feature parameter extraction unit 6 and the attribute information extraction unit 7 generate unit information composed of speech unit information including time series data of feature parameters and attribute information. Then, the segment information storage unit 12 stores the segment information generated by the feature parameter extraction unit 6 and the attribute information extraction unit 7.

その後、音声合成装置は、音声合成処理の対象となるテキストを受け付けると、素片情報記憶部12に記憶されている素片情報に基づいて、受け付けたテキストを表す音声を合成するための音声合成処理を行う。 Thereafter, upon receiving text to be subjected to speech synthesis processing, the speech synthesizer performs speech synthesis processing to synthesize speech representing the received text, based on the segment information stored in the segment information storage unit 12.

以上、説明したように、第1実施形態に係る素片情報生成装置100によれば、素片情報を生成する際に用いられるフレーム周期を、音声の韻律に基づいて適切に設定することができる。この結果、合成される音声の品質が低下することを防止しながら、当該合成される音声の基となる素片情報のデータ量を低減することができる。 As described above, according to the segment information generating apparatus 100 according to the first embodiment, the frame period used when generating segment information can be appropriately set based on the prosody of speech. . As a result, it is possible to reduce the data amount of the segment information that is the basis of the synthesized voice while preventing the quality of the synthesized voice from being lowered.

更に、第1実施形態において、合成音声韻律情報推定部33は、音声合成処理手段によって過去の時点にて合成された音声である過去合成音声の韻律を表す韻律情報に基づいて将来合成音声の韻律を表す韻律情報を推定する。 Further, in the first embodiment, the synthesized speech prosody information estimation unit 33 estimates prosodic information representing the prosody of future synthesized speech based on prosodic information representing the prosody of past synthesized speech, which is speech synthesized at a past time by the speech synthesis processing means.
これによれば、将来の時点にて音声合成処理において用いられる可能性が比較的高い韻律に基づいて、フレーム周期を適切に決定することができる。 Accordingly, the frame period can be determined appropriately based on a prosody that is relatively likely to be used in speech synthesis processing at a future time.

加えて、第1実施形態において、フレーム周期決定部31は、ピッチ周波数(基本周波数)が大きくなるほど小さくなる値を、フレーム周期として決定する。 In addition, in the first embodiment, the frame period determination unit 31 determines, as the frame period, a value that decreases as the pitch frequency (fundamental frequency) increases.
これにより、合成される音声の品質が低下することを確実に防止することができる。 This makes it possible to reliably prevent degradation of the quality of the synthesized speech.

更に、第1実施形態において、フレーム周期決定部31は、予め設定された処理単位毎に、韻律関連情報を取得するとともに当該取得した韻律関連情報に基づいてフレーム周期を決定する。 Furthermore, in the first embodiment, the frame period determination unit 31 acquires prosody related information for each preset processing unit and determines the frame period based on the acquired prosody related information.
これによれば、例えば、音節毎に、又は、半音節毎に、フレーム周期を設定することができる。この結果、合成される音声の品質が低下することを防止しながら、当該合成される音声の基となる素片情報のデータ量を低減することができる。 Accordingly, the frame period can be set, for example, for each syllable or for each semi-syllable. As a result, the data amount of the segment information on which the synthesized speech is based can be reduced while preventing degradation of the quality of the synthesized speech.

<第2実施形態>
<Second Embodiment>
次に、本発明の第2実施形態に係る素片情報生成装置について説明する。第2実施形態に係る素片情報生成装置は、上記第1実施形態に係る素片情報生成装置に対して、基礎音声の韻律を表す韻律情報に基づいて将来合成音声の韻律を表す韻律情報を推定する点において相違している。従って、以下、かかる相違点を中心として説明する。 Next, a segment information generation device according to a second embodiment of the present invention will be described. The segment information generation device according to the second embodiment differs from that of the first embodiment in that it estimates the prosodic information representing the prosody of future synthesized speech based on prosodic information representing the prosody of the basic speech. The following description therefore focuses on this difference.

図5に示したように、第2実施形態に係る素片情報生成装置100は、第1実施形態に係る素片情報生成装置100が有する構成に加えて、出現数算出部(出現数取得手段)46を備える。また、第2実施形態に係る素片情報生成装置100は、フレーム周期決定部31に代わるフレーム周期決定部41と、合成音声韻律情報推定部33に代わる合成音声韻律情報推定部43と、合成音声韻律情報記憶部34に代わる素片韻律情報記憶部44と、を備える。 As shown in FIG. 5, the segment information generation device 100 according to the second embodiment includes an appearance number calculation unit (appearance number acquisition means) 46 in addition to the configuration of the segment information generation device 100 according to the first embodiment. The segment information generation device 100 according to the second embodiment also includes a frame period determination unit 41 in place of the frame period determination unit 31, a synthesized speech prosody information estimation unit 43 in place of the synthesized speech prosody information estimation unit 33, and a segment prosody information storage unit 44 in place of the synthesized speech prosody information storage unit 34.

属性情報抽出部7は、音声素片毎に、属性情報を取得し、当該取得した属性情報に含まれる韻律情報(素片韻律情報)を素片韻律情報記憶部44へ出力する。 The attribute information extraction unit 7 acquires attribute information for each speech unit, and outputs the prosodic information (segment prosody information) included in the acquired attribute information to the segment prosody information storage unit 44.
素片韻律情報記憶部44は、属性情報抽出部7から供給された素片韻律情報を音声素片毎に記憶する。 The segment prosody information storage unit 44 stores the segment prosody information supplied from the attribute information extraction unit 7 for each speech unit.

出現数算出部46は、収録文記憶部21に記憶されている言語情報に基づいて、音声合成単位(処理単位)毎に、当該音声合成単位に対応する音声が基礎音声において出現する回数である出現数を算出(取得)する。出現数算出部46は、出現数を算出する対象となる音声合成単位に対応する文字列が収録文において出現する回数を計数することにより、音声合成単位毎の出現数を算出する。 Based on the language information stored in the recorded sentence storage unit 21, the appearance number calculation unit 46 calculates (acquires), for each speech synthesis unit (processing unit), the number of appearances, which is the number of times speech corresponding to that speech synthesis unit appears in the basic speech. The appearance number calculation unit 46 calculates the number of appearances for each speech synthesis unit by counting the number of times the character string corresponding to that speech synthesis unit appears in the recorded sentences.

例えば、音声合成単位を音節とする場合において「わ」の出現数を算出するとき、出現数算出部46は、収録文記憶部21に記憶されている言語情報が表す収録文において「わ」が出現する回数を計数する。 For example, when the speech synthesis unit is a syllable and the number of appearances of "wa" is to be calculated, the appearance number calculation unit 46 counts the number of times "wa" appears in the recorded sentences represented by the language information stored in the recorded sentence storage unit 21.
出現数算出部46は、算出された出現数を合成音声韻律情報推定部43へ出力する。 The appearance number calculation unit 46 outputs the calculated number of appearances to the synthesized speech prosody information estimation unit 43.
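The counting performed by the appearance number calculation unit 46 can be sketched in Python as follows. The sketch assumes that each recorded sentence has already been split into speech synthesis units (here, romanized syllables); that segmentation step and the function name `count_appearances` are assumptions and are not part of the description.

```python
from collections import Counter


def count_appearances(sentences):
    """Count how often each speech synthesis unit appears across recorded sentences.

    `sentences` is an iterable of unit sequences, e.g. lists of syllables
    (an assumed input format for this sketch).
    """
    counts = Counter()
    for units in sentences:
        counts.update(units)
    return counts
```

Because `Counter` returns 0 for a missing key, a speech synthesis unit that never occurs in the recorded sentences is reported with zero appearances.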

合成音声韻律情報推定部43は、素片韻律情報記憶部44に記憶されている音声素片毎の韻律情報(即ち、基礎音声の韻律を表す韻律情報)と、出現数算出部46から供給された音声合成単位毎の出現数と、に基づいて、音声素片毎の将来合成音声の韻律を表す韻律情報を推定する。合成音声韻律情報推定部43は、推定された韻律情報を韻律関連情報としてフレーム周期決定部41へ出力する。 The synthesized speech prosody information estimation unit 43 estimates prosodic information representing the prosody of future synthesized speech for each speech unit, based on the prosodic information for each speech unit stored in the segment prosody information storage unit 44 (that is, prosodic information representing the prosody of the basic speech) and the number of appearances for each speech synthesis unit supplied from the appearance number calculation unit 46. The synthesized speech prosody information estimation unit 43 outputs the estimated prosodic information to the frame period determination unit 41 as prosody related information.

ところで、上記音声合成装置は、合成音声の韻律の目標値である目標韻律に最も近い韻律を有する素片情報を選択するように構成される。従って、将来合成音声中の、ある音声素片の韻律と、当該音声素片の基となった基礎音声中の音声素片の韻律と、が著しく異なることは少ない。 Incidentally, the speech synthesizer is configured to select the segment information whose prosody is closest to the target prosody, which is the target value of the prosody of the synthesized speech. Therefore, the prosody of a given speech unit in future synthesized speech rarely differs markedly from the prosody of the speech unit in the basic speech from which it was derived.

但し、出現数が比較的少ない音声合成単位に関しては、音声合成装置が目標韻律に十分に近い韻律を有する素片情報を選択することができない。従って、将来合成音声中の、ある音声素片の韻律と、当該音声素片の基となった基礎音声中の音声素片の韻律と、が比較的大きく異なる場合がある。 However, for a speech synthesis unit with a relatively small number of appearances, the speech synthesizer cannot select segment information whose prosody is sufficiently close to the target prosody. Therefore, the prosody of a given speech unit in future synthesized speech may differ relatively greatly from the prosody of the speech unit in the basic speech from which it was derived.

即ち、将来合成音声中の、ある音声素片の韻律と、当該音声素片の基となった基礎音声中の音声素片の韻律と、が異なる程度は、出現数に応じて変化する、と言うことができる。そこで、合成音声韻律情報推定部43は、このような性質を利用して、音声素片毎の将来合成音声の韻律を表す韻律情報を推定する。 That is, the degree to which the prosody of a given speech unit in future synthesized speech differs from the prosody of the speech unit in the basic speech from which it was derived can be said to vary with the number of appearances. The synthesized speech prosody information estimation unit 43 therefore uses this property to estimate prosodic information representing the prosody of future synthesized speech for each speech unit.

本例では、合成音声韻律情報推定部43は、数式2に従って、音声素片毎の将来合成音声の韻律を表す韻律情報を推定する。なお、本例では、推定する対象となる韻律がピッチ周波数である場合について説明する。

In this example, the synthesized speech prosody information estimation unit 43 estimates prosodic information representing the prosody of the future synthesized speech for each speech unit according to Equation 2. In this example, a case where the prosody to be estimated is a pitch frequency will be described.

ここで、N(u)は、音声合成単位uに対応する音声素片sが基礎音声において出現する回数である出現数である。また、F0_est(s)は、音声素片sに対する将来合成音声の韻律(ピッチ周波数)の推定値であり、F0_org(s)は、音声素片sに対する基礎音声の韻律である。また、β₂は、予め設定された閾値である。また、F0_mod1は、予め設定された正の値を有する第1の係数であり、F0_mod2は、予め設定された正の値を有する第2の係数である。更に、第1の係数F0_mod1は、第2の係数F0_mod2よりも小さい。即ち、第1の係数F0_mod1及び第2の係数F0_mod2は、0<F0_mod1<F0_mod2を満足する。 Here, N(u) is the number of appearances, that is, the number of times the speech unit s corresponding to the speech synthesis unit u appears in the basic speech. F0_est(s) is the estimated value of the prosody (pitch frequency) of future synthesized speech for the speech unit s, and F0_org(s) is the prosody of the basic speech for the speech unit s. β₂ is a preset threshold. F0_mod1 is a first coefficient having a preset positive value, and F0_mod2 is a second coefficient having a preset positive value. Furthermore, the first coefficient F0_mod1 is smaller than the second coefficient F0_mod2. That is, F0_mod1 and F0_mod2 satisfy 0 < F0_mod1 < F0_mod2.

このように、合成音声韻律情報推定部43は、基礎音声の韻律を表す韻律情報に、第1の係数F0_mod1又は第2の係数F0_mod2を乗じた値を、将来合成音声の韻律を表す韻律情報として推定する。また、合成音声韻律情報推定部43は、出現数が大きくなるほど小さくなる値を、将来合成音声の韻律を表す韻律情報として推定する。 In this way, the synthesized speech prosody information estimation unit 43 estimates, as the prosodic information representing the prosody of future synthesized speech, the value obtained by multiplying the prosodic information representing the prosody of the basic speech by the first coefficient F0_mod1 or the second coefficient F0_mod2. The synthesized speech prosody information estimation unit 43 also estimates, as that prosodic information, a value that decreases as the number of appearances increases.
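Equation 2 itself is not reproduced in this text. The Python sketch below shows one rule consistent with the definitions above: the pitch frequency of the basic speech is multiplied by the smaller coefficient when the number of appearances reaches the threshold β₂, and by the larger coefficient otherwise. The direction and strictness of the comparison, and the function and parameter names, are assumptions.

```python
def estimate_future_f0(f0_org, appearances, beta2, f0_mod1, f0_mod2):
    """Estimate the pitch frequency of future synthesized speech for one speech unit.

    Assumed rule (not confirmed by the source): many appearances
    (appearances >= beta2) select the smaller coefficient f0_mod1,
    few appearances select the larger coefficient f0_mod2.
    """
    if not (0 < f0_mod1 < f0_mod2):
        raise ValueError("coefficients must satisfy 0 < f0_mod1 < f0_mod2")
    coefficient = f0_mod1 if appearances >= beta2 else f0_mod2
    return coefficient * f0_org
```

With this rule, a rarely occurring unit receives a higher pitch estimate, which in turn favors a shorter frame period when the estimate is passed to the frame period determination.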

フレーム周期決定部41は、第1実施形態に係るフレーム周期決定部31と同様に、合成音声韻律情報推定部43により推定された韻律情報に基づいてフレーム周期を決定する。 The frame cycle determination unit 41 determines the frame cycle based on the prosodic information estimated by the synthesized speech prosody information estimation unit 43, similarly to the frame cycle determination unit 31 according to the first embodiment.

以上、説明したように、第2実施形態に係る素片情報生成装置100において、合成音声韻律情報推定部43は、基礎音声の韻律を表す韻律情報に基づいて将来合成音声の韻律を表す韻律情報を推定する。 As described above, in the segment information generating apparatus 100 according to the second embodiment, the synthesized speech prosody information estimation unit 43 uses the prosodic information representing the prosody of the future synthesized speech based on the prosodic information representing the prosody of the basic speech. Is estimated.

ところで、音声合成処理において用いられる素片情報が、合成される音声の韻律と類似する(可能な限り近い)韻律を有する情報である可能性は比較的高い。そこで、上記のように素片情報生成装置100を構成することによって、将来合成音声の韻律を表す韻律情報を高い精度にて推定することができる。この結果、将来の時点にて音声合成処理において用いられる可能性が比較的高い韻律に基づいて、フレーム周期を適切に決定することができる。 By the way, it is relatively likely that the segment information used in the speech synthesis process is information having a prosody similar (as close as possible) to the prosody of the synthesized speech. Therefore, by configuring the segment information generating apparatus 100 as described above, prosody information representing the prosody of a synthesized speech in the future can be estimated with high accuracy. As a result, the frame period can be appropriately determined based on the prosody that is relatively likely to be used in speech synthesis processing at a future time.

更に、第2実施形態において、合成音声韻律情報推定部43は、出現数算出部46により取得された出現数と、基礎音声の韻律を表す韻律情報と、に基づいて将来合成音声の韻律を表す韻律情報を推定する。 Furthermore, in the second embodiment, the synthesized speech prosody information estimation unit 43 represents the prosody of the future synthesized speech based on the appearance number acquired by the appearance number calculation unit 46 and the prosody information representing the prosody of the basic speech. Estimate prosodic information.

ところで、ある音声合成単位に対して、出現数が多くなるほど、合成される音声の韻律と一致している程度がより高い韻律を有する音声素片を表す素片情報が、音声合成処理において用いられる。従って、上記のように素片情報生成装置100を構成することによって、将来合成音声の韻律を表す韻律情報をより一層高い精度にて推定することができる。この結果、将来の時点にて音声合成処理において用いられる可能性がより一層高い韻律に基づいて、フレーム周期を適切に決定することができる。 Incidentally, for a given speech synthesis unit, the larger the number of appearances, the more closely the prosody of the speech unit whose segment information is used in the speech synthesis processing matches the prosody of the speech to be synthesized. Therefore, by configuring the segment information generation device 100 as described above, prosodic information representing the prosody of future synthesized speech can be estimated with even higher accuracy. As a result, the frame period can be determined appropriately based on a prosody that is even more likely to be used in speech synthesis processing at a future time.

なお、第2実施形態において、素片情報生成装置100は、音声合成単位毎に算出した出現数に基づいて将来合成音声の韻律を表す韻律情報を推定するように構成されていたが、音声合成単位毎に算出した出現数の統計量(平均値、及び/又は、分散等)に基づいて将来合成音声の韻律を表す韻律情報を推定するように構成されていてもよい。この場合、例えば、合成音声韻律情報推定部43は、出現数に代えて、出現数の統計量に基づく値(例えば、出現数の分散と予め設定された係数との積を、出現数の平均値に加えた値)を用いるように構成されることが好適である。 In the second embodiment, the segment information generation device 100 is configured to estimate prosodic information representing the prosody of future synthesized speech based on the number of appearances calculated for each speech synthesis unit, but it may instead be configured to estimate that prosodic information based on statistics (the mean and/or variance, etc.) of the numbers of appearances calculated for the speech synthesis units. In this case, for example, the synthesized speech prosody information estimation unit 43 is preferably configured to use, in place of the number of appearances, a value based on those statistics (for example, the value obtained by adding the product of the variance of the numbers of appearances and a preset coefficient to the mean of the numbers of appearances).
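The example value mentioned above (the mean of the numbers of appearances plus a preset coefficient times their variance) can be sketched in Python as follows. The function name `appearance_statistic`, the parameter name `coefficient`, and the use of the population variance are assumptions made for this illustration.

```python
from statistics import fmean, pvariance


def appearance_statistic(appearance_counts, coefficient):
    """Return mean + coefficient * variance of the per-unit numbers of appearances.

    Population variance is an assumption; the description does not say which
    variance estimator is intended.
    """
    values = list(appearance_counts)
    return fmean(values) + coefficient * pvariance(values)
```

The result would then be used in place of the per-unit number of appearances when estimating the prosody of future synthesized speech.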

また、第2実施形態において、素片情報生成装置100は、基礎音声の韻律を表す韻律情報に基づいて、将来合成音声の韻律を表す韻律情報を推定するように構成されていたが、基礎音声の韻律を表す韻律情報の統計量に基づいて、将来合成音声の韻律を表す韻律情報を推定するように構成されていてもよい。 In the second embodiment, the unit information generation apparatus 100 is configured to estimate prosody information representing the prosody of the synthesized speech based on prosody information representing the prosody of the basic speech. The prosody information representing the prosody of the synthesized speech in the future may be estimated based on the statistics of the prosody information representing the prosody.

例えば、韻律情報の統計量は、ピッチ周波数、又は、継続時間長の、平均値、分散、頻度(ヒストグラム)、最大値、最小値、又は、最頻値等である。 For example, the statistic of the prosodic information is the mean, variance, frequency distribution (histogram), maximum, minimum, or mode of the pitch frequency or the duration.
ここでは、基礎音声の韻律を表す韻律情報の平均値及び分散に基づいて、将来合成音声の韻律を表す韻律情報を推定するように構成された変形例に係る素片情報生成装置100について説明する。 Here, a segment information generation device 100 according to a modification will be described that is configured to estimate prosodic information representing the prosody of future synthesized speech based on the mean and variance of the prosodic information representing the prosody of the basic speech.

この変形例に係る合成音声韻律情報推定部43は、数式3に従って、将来合成音声の韻律を表す韻律情報を推定する。ここで、F0_est(u)は、音声合成単位uに対する将来合成音声の韻律(本例では、ピッチ周波数)の推定値である。また、F0_m(u)は、基礎音声のうちの、音声合成単位uに対応する音声素片の韻律の平均値であり、F0_v(u)は、基礎音声のうちの、音声合成単位uに対応する音声素片の韻律の分散である。また、α₂は、予め設定された正の値を有する定数である。

The synthesized speech prosody information estimation unit 43 according to this modification estimates prosodic information representing the prosody of future synthesized speech according to Equation 3. Here, F0_est(u) is the estimated value of the prosody (in this example, the pitch frequency) of future synthesized speech for the speech synthesis unit u. F0_m(u) is the mean of the prosody of the speech units in the basic speech that correspond to the speech synthesis unit u, and F0_v(u) is the variance of the prosody of those speech units. α₂ is a preset positive constant.
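Equation 3 itself is not reproduced in this text. One form consistent with the symbols defined above is F0_est(u) = F0_m(u) + α₂·F0_v(u); the Python sketch below implements that form. The additive form is an assumption, as are the function name `estimate_unit_f0`, the input format (a list of pitch frequencies observed for one speech synthesis unit), and the use of the population variance.

```python
from statistics import fmean, pvariance


def estimate_unit_f0(f0_values, alpha2):
    """Estimate future pitch for one speech synthesis unit from basic-speech statistics.

    Assumed form (not confirmed by the source): mean + alpha2 * variance.
    """
    if alpha2 <= 0:
        raise ValueError("alpha2 must be positive")
    f0_m = fmean(f0_values)      # mean pitch of the unit in the basic speech
    f0_v = pvariance(f0_values)  # variance of that pitch (population variance assumed)
    return f0_m + alpha2 * f0_v
```

Under this form, a unit whose pitch varies widely in the basic speech receives a higher estimate than a unit with the same mean and a small variance.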

このように構成された素片情報生成装置100であっても、第2実施形態に係る素片情報生成装置100と同様の作用及び効果を奏することができる。 Even the segment information generating apparatus 100 configured as described above can achieve the same operations and effects as those of the segment information generating apparatus 100 according to the second embodiment.
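The statistics-based estimate described above can be illustrated with a minimal Python sketch. Equation 3 is not reproduced in this text, so the combination used below (mean plus α₂ times variance) is only an assumed form, chosen by analogy with the mean-plus-weighted-variance score described for the fourth embodiment. The function names and the value of `alpha2` are hypothetical.

```python
from statistics import fmean, pvariance


def pitch_statistics(pitches_hz):
    """Return (mean, variance) of the pitch frequencies of the speech units
    in the basic speech that correspond to one speech synthesis unit u."""
    return fmean(pitches_hz), pvariance(pitches_hz)


def estimate_future_pitch(pitches_hz, alpha2=0.01):
    """Estimate F0_est(u) from F0_m(u) and F0_v(u).

    ASSUMPTION: the form mean + alpha2 * variance is illustrative only;
    the actual Equation 3 is not available in this text.
    """
    if alpha2 <= 0:
        raise ValueError("alpha2 must be a positive constant")
    f0_m, f0_v = pitch_statistics(pitches_hz)
    return f0_m + alpha2 * f0_v


# Example: three speech units for the same synthesis unit (values are made up).
estimate = estimate_future_pitch([200.0, 220.0, 240.0])
```

Other statistics listed above (maximum, minimum, mode, histogram) could be substituted in the same structure.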

<Third embodiment>
Next, a segment information generation apparatus according to a third embodiment of the present invention will be described. The segment information generation apparatus according to the third embodiment differs from that of the second embodiment in that it determines the frame period without estimating the prosody of the future synthesized speech. The following description therefore focuses on this difference.

As shown in FIG. 6, the segment information generation apparatus 100 according to the third embodiment does not include the synthesized speech prosody information estimation unit 43 included in the segment information generation apparatus 100 according to the second embodiment. In addition, the segment information generation apparatus 100 according to the third embodiment includes a frame period determination unit 51 in place of the frame period determination unit 41.

The frame period determination unit 51 acquires, as prosody-related information, the prosody information for each speech unit stored in the segment prosody information storage unit 44 (that is, the prosody information representing the prosody of the basic speech). The frame period determination unit 51 then determines the frame period based on the acquired prosody-related information and the number of appearances for each speech synthesis unit supplied from the appearance number calculation unit 46, and outputs the determined frame period to the frame period storage unit 13.

In this example as well, the case where the pitch frequency is used as the prosody information is described.
In this example, the frame period determination unit 51 determines the frame period for each speech unit according to Equation 4. Here, F0_org(s) is the prosody of the basic speech for the speech unit s corresponding to the speech synthesis unit u. β₃(s) is a threshold for the speech unit s. T_p3 is a third frame period having a preset positive value, and T_p4 is a fourth frame period having a preset positive value. The third frame period T_p3 is smaller than the fourth frame period T_p4; that is, T_p3 and T_p4 satisfy 0 < T_p3 < T_p4.

In this way, the frame period determination unit 51 determines, as the frame period, a value that decreases as the pitch frequency represented by the prosody-related information increases. In this example, the frame period determination unit 51 selects one frame period from two frame periods, but it may be configured to select one frame period from three or more frame periods. In that case, the frame period can be controlled with higher precision. As a result, the data amount of the segment information on which the synthesized speech is based can be reduced further while reliably preventing degradation of the quality of the synthesized speech.
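As a rough illustration of the threshold-based selection, the following Python sketch picks the shorter period T_p3 when the pitch of a speech unit reaches the threshold and the longer period T_p4 otherwise. Equation 4 is not reproduced in this text, so the comparison direction and the use of "greater than or equal" are assumptions inferred from the statement that the frame period decreases as the pitch frequency increases. The period values (5 ms and 10 ms) and all function names are hypothetical. The second function sketches the variant with three or more frame periods.

```python
from bisect import bisect_right


def select_frame_period(f0_org_hz, beta3_hz, t_p3=0.005, t_p4=0.010):
    """Two-level selection (assumed form of Equation 4).

    Returns the shorter period T_p3 when the pitch is at or above the
    threshold beta3, otherwise the longer period T_p4 (seconds).
    """
    if not 0 < t_p3 < t_p4:
        raise ValueError("periods must satisfy 0 < T_p3 < T_p4")
    return t_p3 if f0_org_hz >= beta3_hz else t_p4


def select_frame_period_multi(f0_org_hz, thresholds_hz, periods):
    """Multi-level variant: thresholds ascending, periods descending.

    len(periods) must equal len(thresholds_hz) + 1. A higher pitch maps
    to a later (shorter) entry of periods.
    """
    if len(periods) != len(thresholds_hz) + 1:
        raise ValueError("need exactly one more period than thresholds")
    return periods[bisect_right(thresholds_hz, f0_org_hz)]


# Example with made-up values: a 300 Hz unit against a 200 Hz threshold.
period = select_frame_period(300.0, 200.0)
```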

Furthermore, the frame period determination unit 51 determines the threshold β₃(s) for the speech unit s based on the number of appearances N(u), which is the number of times the speech unit s corresponding to the speech synthesis unit u appears in the basic speech. In this example, the frame period determination unit 51 determines the threshold β₃(s) for the speech unit s according to Equation 5.

Here, β₀ is a constant having a preset positive value, and β₄ is a preset threshold. β_mod1 is a first coefficient having a preset positive value, and β_mod2 is a second coefficient having a preset positive value. The first coefficient β_mod1 is smaller than the second coefficient β_mod2; that is, β_mod1 and β_mod2 satisfy 0 < β_mod1 < β_mod2.

In this way, the frame period determination unit 51 determines, as the threshold β₃(s), a value that decreases as the number of appearances N(u) increases. Accordingly, the frame period determination unit 51 determines, as the frame period, a value that decreases as the pitch frequency represented by the prosody-related information increases and that decreases as the number of appearances decreases.
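The threshold computation can be sketched as follows. Equation 5 is not reproduced in this text, so the form below (β₀ scaled by β_mod1 when N(u) reaches β₄, and by β_mod2 otherwise) is only an assumption that matches the stated properties: 0 < β_mod1 < β_mod2, and a threshold that decreases as the number of appearances increases. All default values and the function name are hypothetical.

```python
def unit_threshold(appearance_count, beta0=200.0, beta4=50,
                   beta_mod1=0.8, beta_mod2=1.2):
    """Threshold beta3(s) for a speech unit (assumed form of Equation 5).

    appearance_count is N(u), the number of times the speech unit for
    synthesis unit u appears in the basic speech.
    """
    if beta0 <= 0:
        raise ValueError("beta0 must be positive")
    if not 0 < beta_mod1 < beta_mod2:
        raise ValueError("coefficients must satisfy 0 < beta_mod1 < beta_mod2")
    # ASSUMPTION: frequent units get the smaller coefficient, so the
    # threshold decreases as the number of appearances increases.
    coefficient = beta_mod1 if appearance_count >= beta4 else beta_mod2
    return beta0 * coefficient


# Example with made-up counts: a frequent unit and a rare unit.
frequent = unit_threshold(100)
rare = unit_threshold(10)
```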

As described above, in the segment information generation apparatus 100 according to the third embodiment, the frame period determination unit 51 acquires the prosody information representing the prosody of the basic speech as the prosody-related information, and determines the frame period based on the acquired prosody-related information.

The segment information used in the speech synthesis process is relatively likely to have a prosody similar to (as close as possible to) the prosody of the speech to be synthesized. Therefore, by configuring the segment information generation apparatus 100 as described above, the frame period can be determined appropriately based on a prosody that is relatively likely to be used in the speech synthesis process.

Furthermore, in the third embodiment, the frame period determination unit 51 determines the frame period based on the acquired number of appearances and the prosody-related information.

For a given processing unit (speech synthesis unit), the larger the number of appearances, the more closely the prosody of the speech unit whose segment information is used in the speech synthesis process matches the prosody of the speech to be synthesized. Therefore, by configuring the segment information generation apparatus 100 as described above, the frame period can be determined appropriately based on a prosody that is even more likely to be used in the speech synthesis process.

In the third embodiment, the frame period determination unit 51 is configured to acquire the prosody information representing the prosody of the basic speech as the prosody-related information, but it may instead be configured to acquire the prosody information representing the prosody of the past synthesized speech as the prosody-related information.

The prosody of speech synthesized at a future time is relatively likely to be similar to the prosody of speech synthesized at a past time. Therefore, by configuring the segment information generation apparatus 100 as described above, the frame period can be determined appropriately based on a prosody that is relatively likely to be used in the speech synthesis process at a future time.

<Fourth embodiment>
Next, a segment information generation apparatus according to a fourth embodiment of the present invention will be described. The segment information generation apparatus according to the fourth embodiment differs from that of the first embodiment in that it determines the frame period based on a prosody control amount, which represents the difference between the synthesized speech prosody (the prosody of the past synthesized speech) and the basic speech prosody (the prosody of the basic speech). The following description therefore focuses on this difference.

As shown in FIG. 7, the segment information generation apparatus 100 according to the fourth embodiment does not include the synthesized speech prosody information estimation unit 33 included in the segment information generation apparatus 100 according to the first embodiment. In addition, the segment information generation apparatus 100 according to the fourth embodiment includes a frame period determination unit 61 in place of the frame period determination unit 31, and a prosody control amount information storage unit 64 in place of the synthesized speech prosody information storage unit 34.

The prosody control amount information storage unit 64 stores in advance the prosody control amounts output from a speech synthesizer (speech synthesis processing means, not shown) that performs speech synthesis processing based on the segment information stored in the segment information storage unit 12 (that is, the segment information generated by the segment information generation apparatus 100).

The prosody control amount is information representing the difference between the synthesized speech prosody, which is the prosody of speech synthesized at a past time by the speech synthesizer (past synthesized speech), and the basic speech prosody, which is the prosody of the basic speech. In this example, the prosody control amount information storage unit 64 stores the magnitude of the difference between the basic speech prosody and the synthesized speech prosody as the prosody control amount.

Specifically, when the pitch frequency is used as the prosody, and for a given speech unit the pitch frequency of the basic speech is 250 Hz and the pitch frequency of the past synthesized speech is 300 Hz, the prosody control amount information storage unit 64 stores 50 (= |300 - 250|) as the prosody control amount.

Likewise, when the duration is used as the prosody, and for a given speech unit the duration of the basic speech is 250 ms and the duration of the past synthesized speech is 300 ms, the prosody control amount information storage unit 64 stores 50 (= |300 - 250|) as the prosody control amount. The prosody control amount information storage unit 64 may instead be configured to store, as the prosody control amount, a value based on the ratio between the basic speech prosody and the synthesized speech prosody (a value that increases as the magnitude of the difference between the basic speech prosody and the synthesized speech prosody increases).
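The two numeric examples above can be reproduced with a short Python sketch. The absolute-difference function follows the text directly. The ratio-based variant is only one possible reading: the text says the value is based on the ratio and grows with the magnitude of the difference, and the larger-over-smaller ratio used here is an assumption. Both function names are hypothetical.

```python
def prosody_control_amount(basic_prosody, synthesized_prosody):
    """Magnitude of the difference between basic and synthesized prosody.

    Works for pitch frequency (Hz) or duration (ms) alike.
    """
    return abs(synthesized_prosody - basic_prosody)


def prosody_control_amount_ratio(basic_prosody, synthesized_prosody):
    """Ratio-based variant (ASSUMPTION: larger value divided by smaller).

    Equals 1.0 when the two prosodies match and grows as they diverge.
    """
    if basic_prosody <= 0 or synthesized_prosody <= 0:
        raise ValueError("prosody values must be positive")
    high = max(basic_prosody, synthesized_prosody)
    low = min(basic_prosody, synthesized_prosody)
    return high / low


# Pitch example from the text: 250 Hz basic speech, 300 Hz past synthesized speech.
pitch_amount = prosody_control_amount(250, 300)
# Duration example from the text: 250 ms basic speech, 300 ms past synthesized speech.
duration_amount = prosody_control_amount(250, 300)
```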

The frame period determination unit 61 calculates (acquires) statistics of the prosody control amount based on the prosody control amounts stored in the prosody control amount information storage unit 64. Here, the statistics of the prosody control amount are the mean, variance, frequency (histogram), maximum, minimum, mode, or the like of the prosody control amount. These statistics may be calculated as a single set over all the prosody control amounts stored in the prosody control amount information storage unit 64, or they may be calculated for each unit such as a phoneme, syllable, semi-syllable, or speech unit.

In this example, the frame period determination unit 61 calculates, as the statistics of the prosody control amount, the mean and variance of the prosody control amount for each speech synthesis unit. Also in this example, the frame period determination unit 61 uses, as the prosody control amount, the magnitude of the difference between the duration of the past synthesized speech and the duration of the basic speech.

The frame period determination unit 61 acquires the calculated mean and variance of the prosody control amount as the prosody-related information. The frame period determination unit 61 determines the frame period based on the acquired prosody-related information and outputs the determined frame period to the frame period storage unit 13.

The larger the prosody control amount, the more easily the quality of the synthesized speech degrades. The frame period determination unit 61 therefore determines, as the frame period, a value that decreases as the prosody control amount increases. As a result, degradation of the quality of the synthesized speech can be reliably prevented.

In this example, the frame period determination unit 61 determines, as the frame period, a value that decreases as the following sum increases: the mean of the prosody control amount plus the variance of the prosody control amount multiplied by a preset coefficient. The frame period determination unit 61 may instead be configured to determine the frame period based only on the mean of the prosody control amount.
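A minimal Python sketch of this rule is given below. The score (mean plus coefficient times variance) follows the text. The mapping from score to frame period is not specified in the text beyond being decreasing, so the bounded mapping used here, along with `t_min`, `t_max`, `scale`, the coefficient value, and the function names, are all assumptions for illustration.

```python
from statistics import fmean, pvariance


def control_amount_score(control_amounts, coefficient=0.5):
    """Mean of the prosody control amounts plus coefficient * variance."""
    return fmean(control_amounts) + coefficient * pvariance(control_amounts)


def frame_period_from_score(score, t_min=0.005, t_max=0.010, scale=100.0):
    """Map a score to a frame period in seconds.

    ASSUMPTION: any monotonically decreasing mapping would satisfy the
    text; this one equals t_max at score 0 and approaches t_min as the
    score grows.
    """
    if score < 0:
        raise ValueError("score must be non-negative")
    return t_min + (t_max - t_min) / (1.0 + score / scale)


# Example: duration differences (ms) observed for one speech synthesis unit.
score = control_amount_score([40.0, 50.0, 60.0])
period = frame_period_from_score(score)
```

Setting `coefficient=0.0` gives the variant that uses only the mean of the prosody control amount.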

As described above, in the segment information generation apparatus 100 according to the fourth embodiment, the frame period determination unit 61 acquires, as the prosody-related information, the prosody control amount representing the difference between the synthesized speech prosody and the basic speech prosody, and determines the frame period based on the acquired prosody-related information. This allows the frame period to be made smaller as the prosody control amount becomes larger. As a result, degradation of the quality of the synthesized speech can be reliably prevented.

<Fifth embodiment>
Next, a speech synthesis system according to a fifth embodiment of the present invention will be described. The speech synthesis system according to the fifth embodiment performs speech synthesis processing based on the segment information generated by the segment information generation apparatus according to the first embodiment. The following description focuses on the components other than the segment information generation apparatus.

This speech synthesis system includes the segment information generation apparatus 100 according to the first embodiment and the speech synthesizer 200 shown in FIG. 8. The speech synthesizer 200 includes a language processing unit 1, a prosody generation unit 2, a segment selection unit 3, and a waveform generation unit 8.

Based on the language analysis result output from the language processing unit 1, the prosody generation unit 2 generates the prosody of the speech to be synthesized (synthesized speech), that is, information on the pitch (pitch frequency), length (duration), loudness (power), and the like of the sound. It outputs prosody information representing the generated prosody to the segment selection unit 3 and the waveform generation unit 8 as target prosody information.

Based on the language analysis result and the target prosody information, the segment selection unit 3 selects segment information from the segment information stored in the segment information storage unit 12 as described below, and outputs the selected segment information to the waveform generation unit 8.

Specifically, based on the input language analysis result and target prosody information, the segment selection unit 3 obtains, for each speech synthesis unit, information representing the features of the synthesized speech (hereinafter referred to as the "target segment environment"). The target segment environment includes the current, preceding, and following phonemes, the presence or absence of stress, the distance from the accent nucleus, and, for each speech synthesis unit, the pitch frequency, power, duration, cepstrum, MFCC (Mel Frequency Cepstral Coefficients), and the Δ amounts of these (the amounts of change per unit time).

The segment selection unit 3 then calculates, for each piece of acquired segment information, a cost, which is an index of how suitable that segment information is for synthesizing the speech. The cost is a value representing the difference between the target segment environment and the attribute information of the candidate segment information; the smaller the difference (that is, the more similar the two are), the smaller the cost. The lower the cost of the segment information used, the higher the naturalness of the synthesized speech, where naturalness is the degree of similarity to speech uttered by a human. The segment selection unit 3 therefore selects the segment information with the lowest calculated cost and outputs the selected segment information to the waveform generation unit 8.
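The lowest-cost selection described above can be sketched in Python as follows. The text only states that the cost reflects the difference between the target segment environment and the candidate's attribute information; the weighted sum of absolute differences used here is an assumption, as are the attribute names, the weights, and the function names.

```python
def cost(target_environment, candidate_attributes, weights):
    """Weighted sum of absolute differences over the weighted attributes.

    ASSUMPTION: the actual cost function is not given in the text; this
    form simply becomes smaller as the two become more similar.
    """
    return sum(
        weight * abs(target_environment[name] - candidate_attributes[name])
        for name, weight in weights.items()
    )


def select_segment(target_environment, candidates, weights):
    """Return the candidate segment information with the lowest cost."""
    return min(candidates, key=lambda c: cost(target_environment, c, weights))


# Example with made-up attribute values for one speech synthesis unit.
target = {"pitch_hz": 220.0, "duration_ms": 80.0}
candidates = [
    {"id": "a", "pitch_hz": 200.0, "duration_ms": 90.0},
    {"id": "b", "pitch_hz": 225.0, "duration_ms": 85.0},
]
weights = {"pitch_hz": 1.0, "duration_ms": 0.5}
best = select_segment(target, candidates, weights)
```

Categorical attributes such as the preceding and following phonemes would need a mismatch penalty rather than a numeric difference; they are left out of this sketch.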

Based on the target prosody information supplied from the prosody generation unit 2, the segment information supplied from the segment selection unit 3, and the frame period stored in the frame period storage unit 13, the waveform generation unit 8 generates speech waveforms whose prosody matches or is similar to the target prosody, and concatenates the generated speech waveforms to generate the synthesized speech.

In this way, the speech synthesizer 200 constitutes speech synthesis processing means that performs speech synthesis processing for synthesizing speech based on the frame period determined by the segment information generation apparatus 100 and the segment information generated by the segment information generation apparatus 100.

As described above, according to the speech synthesis system of the fifth embodiment, the frame period used when generating the segment information can be set appropriately based on the prosody of speech. As a result, the data amount of the segment information on which the synthesized speech is based can be reduced while preventing degradation of the quality of the synthesized speech.

<Sixth embodiment>
Next, a segment information generation apparatus according to a sixth embodiment of the present invention will be described with reference to FIG. 9.
The segment information generation apparatus 300 according to the sixth embodiment includes:
a frame period determination unit (frame period determination means) 301 that determines a frame period representing a time interval based on prosody-related information, which is information relating to the prosody of speech; and
a segment information generation unit (segment information generation means) 302 that, for each of a plurality of time frames arranged so that the start positions of two consecutive time frames are separated by the determined frame period, extracts a feature parameter representing a feature of the portion within that time frame of a speech unit that is a part of the basic speech on which speech synthesis processing for synthesizing speech is based, and generates segment information including time-series data of the extracted feature parameters.

With this configuration, the frame period used when generating the segment information can be set appropriately based on the prosody of speech. As a result, the data amount of the segment information on which the synthesized speech is based can be reduced while preventing degradation of the quality of the synthesized speech.
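The relationship between the frame period and the data amount can be illustrated with a small Python sketch of the framing step. Time frames are placed so that consecutive start positions are one frame period apart, and one feature parameter is extracted per frame. The mean-energy feature is only a stand-in, since the text does not fix the feature parameter here; the function names and the sample-based units are likewise assumptions.

```python
def frame_starts(num_samples, frame_length, frame_period):
    """Start indices of time frames whose consecutive starts are
    frame_period samples apart and which fit inside the speech unit."""
    if frame_length <= 0 or frame_period <= 0:
        raise ValueError("frame_length and frame_period must be positive")
    if num_samples < frame_length:
        return []
    return list(range(0, num_samples - frame_length + 1, frame_period))


def mean_energy(frame):
    """Stand-in feature parameter (ASSUMPTION): mean squared amplitude."""
    return sum(x * x for x in frame) / len(frame)


def generate_segment_information(samples, frame_length, frame_period,
                                 feature=mean_energy):
    """Time series of feature parameters for one speech unit."""
    return [
        feature(samples[start:start + frame_length])
        for start in frame_starts(len(samples), frame_length, frame_period)
    ]


# A longer frame period yields fewer frames, hence less segment data.
unit_samples = [1.0] * 10
short_period = generate_segment_information(unit_samples, 4, 2)
long_period = generate_segment_information(unit_samples, 4, 3)
```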

The present invention has been described above with reference to the embodiments, but the present invention is not limited to those embodiments. Various changes that can be understood by those skilled in the art may be made to the configuration and details of the present invention within the scope of the present invention.

In each of the above embodiments, the functions of the segment information generation apparatus are implemented by hardware such as circuits. Alternatively, the segment information generation apparatus may include a processing device and a storage device that stores a program (software), and may be configured so that the processing device implements each function by executing the program. In this case, the program may be stored in a computer-readable recording medium. The recording medium is, for example, a portable medium such as a flexible disk, an optical disc, a magneto-optical disk, or a semiconductor memory.

As a further modification of the above embodiments, any combination of the embodiments and modifications described above may be adopted.

The present invention is applicable to, for example, a speech synthesis system that performs speech synthesis processing for synthesizing speech.

<Appendix>
(Appendix 1)
A segment information generation apparatus comprising:
frame period determination means for determining a frame period representing a time interval based on prosody-related information, which is information relating to the prosody of speech; and
segment information generation means for extracting, for each of a plurality of time frames arranged so that the start positions of two consecutive time frames are separated by the determined frame period, a feature parameter representing a feature of the portion within that time frame of a speech unit that is a part of basic speech on which speech synthesis processing for synthesizing speech is based, and for generating segment information including time-series data of the extracted feature parameters.

(Appendix 2)
The segment information generation apparatus according to Appendix 1, further comprising
prosody information estimation means for estimating prosody information representing the prosody of future synthesized speech, which is speech to be synthesized at a future time by speech synthesis processing means that performs the speech synthesis processing based on the generated segment information,
wherein the frame period determination means is configured to acquire the estimated prosody information as the prosody-related information and to determine the frame period based on the acquired prosody-related information.

(Appendix 3)
The segment information generation apparatus according to Appendix 2,
wherein the prosody information estimation means is configured to estimate the prosody information representing the prosody of the future synthesized speech based on prosody information representing the prosody of past synthesized speech, which is speech synthesized at a past time by the speech synthesis processing means.

The prosody of speech synthesized at a future time is relatively likely to be similar to the prosody of speech synthesized at a past time. Therefore, by configuring the segment information generation apparatus as described above, the frame period can be determined appropriately based on a prosody that is relatively likely to be used in the speech synthesis processing at a future time.

(Appendix 4)
The segment information generation apparatus according to Appendix 2,
wherein the prosody information estimation means is configured to estimate the prosody information representing the prosody of the future synthesized speech based on prosody information representing the prosody of the basic speech.

The segment information used in the speech synthesis processing is relatively likely to have a prosody similar to (as close as possible to) the prosody of the speech to be synthesized. Therefore, by configuring the segment information generation apparatus as described above, the prosody information representing the prosody of the future synthesized speech can be estimated with high accuracy. As a result, the frame period can be determined appropriately based on a prosody that is relatively likely to be used in the speech synthesis processing at a future time.

(Appendix 5)
The segment information generation apparatus according to Appendix 4, further comprising
appearance number acquisition means for acquiring, for each preset processing unit, the number of appearances, which is the number of times speech corresponding to that processing unit appears in the basic speech,
wherein the prosody information estimation means is configured to estimate the prosody information representing the prosody of the future synthesized speech based on the acquired number of appearances and the prosody information representing the prosody of the basic speech.

For a given processing unit (for example, a syllable, a phoneme, a semi-syllable such as CV (where V is a vowel and C is a consonant), CVC, or VCV), the larger the number of appearances, the more closely the prosody of the speech unit whose segment information is used in the speech synthesis processing matches the prosody of the speech to be synthesized.

Therefore, by configuring the segment information generation apparatus as described above, the prosody information representing the prosody of the future synthesized speech can be estimated with even higher accuracy. As a result, the frame period can be determined appropriately based on a prosody that is even more likely to be used in the speech synthesis processing at a future time.

(付記6)
付記1に記載の素片情報生成装置であって、
前記フレーム周期決定手段は、前記素片情報に基づいて前記音声合成処理を行う音声合成処理手段によって過去の時点にて合成された音声である過去合成音声の韻律を表す韻律情報を前記韻律関連情報として取得し、当該取得した韻律関連情報に基づいて前記フレーム周期を決定するように構成された素片情報生成装置。 (Appendix 6)
An element information generation device according to appendix 1,
The frame period determining means converts prosodic information representing the prosody of past synthesized speech that is speech synthesized at a past time point by speech synthesis processing means for performing the speech synthesis processing based on the segment information. A segment information generation device configured to determine the frame period based on the acquired prosodic information.

(Appendix 7)
The segment information generation device according to Appendix 1,
wherein the frame period determining means is configured to acquire prosody information representing the prosody of the basic speech as the prosody-related information, and to determine the frame period based on the acquired prosody-related information.

Incidentally, the segment information used in the speech synthesis processing is relatively likely to have a prosody similar (as close as possible) to the prosody of the speech to be synthesized. Therefore, configuring the segment information generation device as described above makes it possible to determine the frame period appropriately based on a prosody that is relatively likely to be used in the speech synthesis processing.

(Appendix 8)
The segment information generation device according to Appendix 7,
further comprising appearance count acquisition means for acquiring, for each preset processing unit, an appearance count that is the number of times speech corresponding to that processing unit appears in the basic speech,
wherein the frame period determining means is configured to determine the frame period based on the acquired appearance count and the prosody-related information.

Incidentally, for a given processing unit (for example, a syllable, a phoneme, a demisyllable such as CV (where V is a vowel and C is a consonant), CVC, or VCV), the larger the appearance count, the more closely the prosody of the speech segment whose segment information is used in the speech synthesis processing matches the prosody of the speech to be synthesized. Therefore, configuring the segment information generation device as described above makes it possible to determine the frame period appropriately based on a prosody that is even more likely to be used in the speech synthesis processing.

(Appendix 9)
The segment information generation device according to any one of Appendices 2 to 8,
wherein the frame period determining means is configured to acquire a statistic of the prosody information as the prosody-related information.
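As a purely illustrative sketch, a statistic of the prosody information could be computed as follows. The text only states that a statistic is used; the function name `prosody_statistics`, the choice of mean, median and maximum, and the convention that unvoiced frames carry an F0 of 0 are all assumptions.

```python
import statistics


def prosody_statistics(f0_track_hz):
    """Summarise an F0 track (Hz) with a few candidate statistics.

    Frames with F0 <= 0 are treated as unvoiced and ignored (assumption).
    """
    voiced = [f0 for f0 in f0_track_hz if f0 > 0.0]
    if not voiced:
        raise ValueError("the F0 track contains no voiced frames")
    return {
        "mean": statistics.fmean(voiced),
        "median": statistics.median(voiced),
        "max": max(voiced),
    }
```

Which statistic is appropriate depends on the design goal: a maximum is the conservative choice if the frame period must suit the highest pitch observed, whereas a mean or median follows typical usage.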

(Appendix 10)
The segment information generation device according to any one of Appendices 2 to 9,
wherein the prosody information includes information representing a fundamental frequency, and
the frame period determining means is configured to determine, as the frame period, a value that becomes smaller as the fundamental frequency becomes higher.

Incidentally, to make the quality of the synthesized speech sufficiently high, the frame period needs to be shortened as the fundamental frequency (pitch frequency) increases. Therefore, configuring the segment information generation device as described above reliably prevents the quality of the synthesized speech from degrading.
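The relationship between the fundamental frequency and the frame period can be sketched as follows. The concrete mapping is an assumption made for illustration: the frame period is set to a fixed number of pitch periods and clamped to a working range. The function name `frame_period_from_f0` and the parameters `periods_per_frame`, `min_ms` and `max_ms` are hypothetical and do not appear in the text.

```python
def frame_period_from_f0(f0_hz, periods_per_frame=2.0, min_ms=2.0, max_ms=10.0):
    """Return a frame period in milliseconds that shrinks as F0 rises.

    The period is proportional to the pitch period (1 / F0) and clamped to
    [min_ms, max_ms]; inside the clamp it strictly decreases with F0.
    """
    if f0_hz <= 0.0:
        raise ValueError("f0_hz must be positive")
    period_ms = 1000.0 * periods_per_frame / f0_hz
    return max(min_ms, min(max_ms, period_ms))
```

Because of the clamp, the mapping is non-increasing rather than strictly decreasing at the extremes; a design that must satisfy the strict wording would need to choose the limits so that they are never reached in practice.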

(Appendix 11)
The segment information generation device according to Appendix 1,
wherein the frame period determining means is configured to acquire, as the prosody-related information, a prosody control amount representing the difference between a synthesized speech prosody, which is the prosody of speech synthesized at a past point in time by speech synthesis processing means that performs the speech synthesis processing based on the segment information, and a basic speech prosody, which is the prosody of the basic speech, and to determine the frame period based on the acquired prosody-related information.

Incidentally, the quality of the synthesized speech tends to degrade as the prosody control amount increases. Therefore, configuring the segment information generation device as described above allows the frame period to be made shorter as the prosody control amount increases. As a result, degradation of the quality of the synthesized speech can be reliably prevented.

(Appendix 12)
The segment information generation device according to Appendix 11,
wherein the prosody control amount is an amount that increases as the magnitude of the difference between the synthesized speech prosody and the basic speech prosody increases, and
the frame period determining means is configured to determine, as the frame period, a value that becomes smaller as the prosody control amount becomes larger.
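A minimal sketch of Appendices 11 and 12 follows, under stated assumptions. The text requires only that the prosody control amount grow with the magnitude of the prosody difference and that the frame period shrink as that amount grows. Measuring the difference as a pitch ratio in octaves, and the particular decreasing function used, are illustrative choices; the names `prosody_control_amount`, `frame_period_from_control`, `base_ms`, `gain` and `min_ms` are hypothetical.

```python
import math


def prosody_control_amount(synth_f0_hz, base_f0_hz):
    """Magnitude of the pitch change, in octaves (assumed measure)."""
    if synth_f0_hz <= 0.0 or base_f0_hz <= 0.0:
        raise ValueError("F0 values must be positive")
    return abs(math.log2(synth_f0_hz / base_f0_hz))


def frame_period_from_control(amount, base_ms=5.0, gain=1.0, min_ms=1.0):
    """Frame period (ms) that decreases as the control amount increases."""
    if amount < 0.0:
        raise ValueError("amount must be non-negative")
    return max(min_ms, base_ms / (1.0 + gain * amount))
```

Raising or lowering the pitch by the same ratio yields the same control amount, which matches the requirement that only the magnitude of the difference matters.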

(Appendix 13)
The segment information generation device according to Appendix 11 or Appendix 12,
wherein the frame period determining means is configured to acquire a statistic of the prosody control amount as the prosody-related information.

(Appendix 14)
The segment information generation device according to any one of Appendices 1 to 13,
wherein the frame period determining means is configured to acquire the prosody-related information, and to determine the frame period based on the acquired prosody-related information, for each preset processing unit.

With this configuration, the frame period can be set, for example, for each syllable or for each demisyllable. As a result, the data amount of the segment information on which the synthesized speech is based can be reduced while preventing the quality of the synthesized speech from degrading.
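The data reduction obtained by setting the frame period per processing unit can be illustrated with simple arithmetic. The sketch below assumes that the data amount is proportional to the number of frames times a fixed number of feature parameters per frame; the function names, the value of 40 parameters per frame, and the example durations and periods are hypothetical.

```python
import math


def frame_count(duration_ms, frame_period_ms):
    """Number of frames needed to cover a unit of the given duration."""
    if frame_period_ms <= 0.0:
        raise ValueError("frame_period_ms must be positive")
    return math.ceil(duration_ms / frame_period_ms)


def segment_data_amount(units, period_ms_by_unit, params_per_frame=40):
    """Total number of stored feature values for a list of (label, duration_ms)."""
    total = 0
    for label, duration_ms in units:
        total += frame_count(duration_ms, period_ms_by_unit[label]) * params_per_frame
    return total


units = [("ka", 100.0), ("sa", 100.0)]
# A short period only where it is needed, a longer one elsewhere.
per_unit = segment_data_amount(units, {"ka": 5.0, "sa": 10.0})
# The same short period applied to every unit.
fixed = segment_data_amount(units, {"ka": 5.0, "sa": 5.0})
```

In this toy case the per-unit setting stores 1200 values against 1600 for the fixed setting, because the unit that tolerates a 10 ms period needs half as many frames.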

(Appendix 15)
A speech synthesis system comprising:
frame period determining means for determining a frame period representing a time interval based on prosody-related information, which is information relating to the prosody of speech;
segment information generating means for extracting, for each of a plurality of time frames arranged such that the start positions of two consecutive time frames are separated by the determined frame period, a feature parameter representing a feature of the portion, within that time frame, of a speech segment that is a part of basic speech on which speech synthesis processing for synthesizing speech is based, and for generating segment information including time-series data of the extracted feature parameters; and
speech synthesis processing means for performing the speech synthesis processing based on the determined frame period and the generated segment information.

(Appendix 16)
The speech synthesis system according to Appendix 15,
further comprising prosody information estimation means for estimating prosody information representing the prosody of future synthesized speech, which is speech to be synthesized at a future point in time by the speech synthesis processing means,
wherein the frame period determining means is configured to acquire the estimated prosody information as the prosody-related information, and to determine the frame period based on the acquired prosody-related information.

(Appendix 17)
The speech synthesis system according to Appendix 15,
wherein the frame period determining means is configured to acquire, as the prosody-related information, prosody information representing the prosody of past synthesized speech, which is speech synthesized at a past point in time by the speech synthesis processing means, and to determine the frame period based on the acquired prosody-related information.

(Appendix 18)
The speech synthesis system according to Appendix 15,
wherein the frame period determining means is configured to acquire prosody information representing the prosody of the basic speech as the prosody-related information, and to determine the frame period based on the acquired prosody-related information.

(Appendix 19)
The speech synthesis system according to Appendix 15,
wherein the frame period determining means is configured to acquire, as the prosody-related information, a prosody control amount representing the difference between a synthesized speech prosody, which is the prosody of speech synthesized at a past point in time by the speech synthesis processing means, and a basic speech prosody, which is the prosody of the basic speech, and to determine the frame period based on the acquired prosody-related information.

(Appendix 20)
A speech synthesis method comprising:
determining a frame period representing a time interval based on prosody-related information, which is information relating to the prosody of speech; and
extracting, for each of a plurality of time frames arranged such that the start positions of two consecutive time frames are separated by the determined frame period, a feature parameter representing a feature of the portion, within that time frame, of a speech segment that is a part of basic speech on which speech synthesis processing for synthesizing speech is based, and generating segment information including time-series data of the extracted feature parameters.
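The frame layout described in the method can be sketched as follows: frame start positions are spaced by the frame period, and one feature value is extracted per frame. This is a simplified illustration, not the disclosed implementation. The RMS energy stands in for whatever feature parameter is actually used, and the function names, the 20 ms frame length and the returned dictionary layout are assumptions.

```python
import math


def frame_starts(num_samples, sample_rate_hz, frame_period_ms, frame_length_ms):
    """Start indices of frames whose starts are one frame period apart."""
    hop = int(round(sample_rate_hz * frame_period_ms / 1000.0))
    length = int(round(sample_rate_hz * frame_length_ms / 1000.0))
    if hop <= 0 or length <= 0:
        raise ValueError("frame period and frame length must span at least one sample")
    last_start = max(num_samples - length, 0)
    return list(range(0, last_start + 1, hop)), length


def extract_segment_information(samples, sample_rate_hz, frame_period_ms,
                                frame_length_ms=20.0):
    """Return the frame period and a time series of per-frame features."""
    if not samples:
        raise ValueError("samples must not be empty")
    starts, length = frame_starts(len(samples), sample_rate_hz,
                                  frame_period_ms, frame_length_ms)
    features = []
    for start in starts:
        frame = samples[start:start + length]
        # Placeholder feature: RMS energy of the frame.
        features.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return {"frame_period_ms": frame_period_ms, "features": features}
```

Doubling the frame period roughly halves the length of the feature time series for the same speech segment, which is where the reduction in segment information comes from.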

(Appendix 21)
The speech synthesis method according to Appendix 20, further comprising:
estimating prosody information representing the prosody of future synthesized speech, which is speech to be synthesized at a future point in time by speech synthesis processing means that performs the speech synthesis processing based on the generated segment information; and
acquiring the estimated prosody information as the prosody-related information, and determining the frame period based on the acquired prosody-related information.

(Appendix 22)
The speech synthesis method according to Appendix 20, further comprising:
acquiring, as the prosody-related information, prosody information representing the prosody of past synthesized speech, which is speech synthesized at a past point in time by speech synthesis processing means that performs the speech synthesis processing based on the segment information, and determining the frame period based on the acquired prosody-related information.

(Appendix 23)
The speech synthesis method according to Appendix 20, further comprising:
acquiring prosody information representing the prosody of the basic speech as the prosody-related information, and determining the frame period based on the acquired prosody-related information.

(Appendix 24)
The speech synthesis method according to Appendix 20, further comprising:
acquiring, as the prosody-related information, a prosody control amount representing the difference between a synthesized speech prosody, which is the prosody of speech synthesized at a past point in time by speech synthesis processing means that performs the speech synthesis processing based on the segment information, and a basic speech prosody, which is the prosody of the basic speech, and determining the frame period based on the acquired prosody-related information.

(Appendix 25)
A program for causing an information processing device to implement:
frame period determining means for determining a frame period representing a time interval based on prosody-related information, which is information relating to the prosody of speech; and
segment information generating means for extracting, for each of a plurality of time frames arranged such that the start positions of two consecutive time frames are separated by the determined frame period, a feature parameter representing a feature of the portion, within that time frame, of a speech segment that is a part of basic speech on which speech synthesis processing for synthesizing speech is based, and for generating segment information including time-series data of the extracted feature parameters.

(Appendix 26)
The program according to Appendix 25,
the program further causing the information processing device to implement prosody information estimation means for estimating prosody information representing the prosody of future synthesized speech, which is speech to be synthesized at a future point in time by speech synthesis processing means that performs the speech synthesis processing based on the generated segment information,
wherein the frame period determining means is configured to acquire the estimated prosody information as the prosody-related information, and to determine the frame period based on the acquired prosody-related information.

(Appendix 27)
The program according to Appendix 25,
wherein the frame period determining means is configured to acquire, as the prosody-related information, prosody information representing the prosody of past synthesized speech, which is speech synthesized at a past point in time by speech synthesis processing means that performs the speech synthesis processing based on the segment information, and to determine the frame period based on the acquired prosody-related information.

(Appendix 28)
The program according to Appendix 25,
wherein the frame period determining means is configured to acquire prosody information representing the prosody of the basic speech as the prosody-related information, and to determine the frame period based on the acquired prosody-related information.

(Appendix 29)
The program according to Appendix 25,
wherein the frame period determining means is configured to acquire, as the prosody-related information, a prosody control amount representing the difference between a synthesized speech prosody, which is the prosody of speech synthesized at a past point in time by speech synthesis processing means that performs the speech synthesis processing based on the segment information, and a basic speech prosody, which is the prosody of the basic speech, and to determine the frame period based on the acquired prosody-related information.

1 Language processing unit
2 Prosody generation unit
3 Segment selection unit
4 Waveform generation unit
6 Feature parameter extraction unit
7 Attribute information extraction unit
8 Waveform generation unit
12 Segment information storage unit
13 Frame period storage unit
21 Recorded sentence storage unit
22 Natural speech storage unit
31 Frame period determination unit
33 Synthesized speech prosody information estimation unit
34 Synthesized speech prosody information storage unit
41 Frame period determination unit
43 Synthesized speech prosody information estimation unit
44 Segment prosody information storage unit
46 Appearance count calculation unit
51 Frame period determination unit
61 Frame period determination unit
64 Prosody control amount information storage unit
100 Segment information generation device
200 Speech synthesis device
300 Segment information generation device
301 Frame period determination unit
302 Segment information generation unit

Claims

A segment information generation device comprising:
frame period determining means for determining a frame period representing a time interval based on prosody-related information, which is information relating to the prosody of speech; and
segment information generating means for extracting, for each of a plurality of time frames arranged such that the start positions of two consecutive time frames are separated by the determined frame period, a feature parameter representing a feature of the portion, within that time frame, of a speech segment that is a part of basic speech on which speech synthesis processing for synthesizing speech is based, and for generating segment information including time-series data of the extracted feature parameters.

The segment information generation device according to claim 1,
further comprising prosody information estimation means for estimating prosody information representing the prosody of future synthesized speech, which is speech to be synthesized at a future point in time by speech synthesis processing means that performs the speech synthesis processing based on the generated segment information,
wherein the frame period determining means is configured to acquire the estimated prosody information as the prosody-related information, and to determine the frame period based on the acquired prosody-related information.

The segment information generation device according to claim 2,
wherein the prosody information estimation means is configured to estimate the prosody information representing the prosody of the future synthesized speech based on prosody information representing the prosody of past synthesized speech, which is speech synthesized at a past point in time by the speech synthesis processing means.

The segment information generation device according to claim 2,
wherein the prosody information estimation means is configured to estimate the prosody information representing the prosody of the future synthesized speech based on prosody information representing the prosody of the basic speech.

The segment information generation device according to claim 1,
wherein the frame period determining means is configured to acquire, as the prosody-related information, prosody information representing the prosody of past synthesized speech, which is speech synthesized at a past point in time by speech synthesis processing means that performs the speech synthesis processing based on the segment information, and to determine the frame period based on the acquired prosody-related information.

The segment information generation device according to claim 1,
wherein the frame period determining means is configured to acquire prosody information representing the prosody of the basic speech as the prosody-related information, and to determine the frame period based on the acquired prosody-related information.

The segment information generation device according to claim 1,
wherein the frame period determining means is configured to acquire, as the prosody-related information, a prosody control amount representing the difference between a synthesized speech prosody, which is the prosody of speech synthesized at a past point in time by speech synthesis processing means that performs the speech synthesis processing based on the segment information, and a basic speech prosody, which is the prosody of the basic speech, and to determine the frame period based on the acquired prosody-related information.

A speech synthesis system comprising:
frame period determining means for determining a frame period representing a time interval based on prosody-related information, which is information relating to the prosody of speech;
segment information generating means for extracting, for each of a plurality of time frames arranged such that the start positions of two consecutive time frames are separated by the determined frame period, a feature parameter representing a feature of the portion, within that time frame, of a speech segment that is a part of basic speech on which speech synthesis processing for synthesizing speech is based, and for generating segment information including time-series data of the extracted feature parameters; and
speech synthesis processing means for performing the speech synthesis processing based on the determined frame period and the generated segment information.

A speech synthesis method comprising:
determining a frame period representing a time interval based on prosody-related information, which is information relating to the prosody of speech; and
extracting, for each of a plurality of time frames arranged such that the start positions of two consecutive time frames are separated by the determined frame period, a feature parameter representing a feature of the portion, within that time frame, of a speech segment that is a part of basic speech on which speech synthesis processing for synthesizing speech is based, and generating segment information including time-series data of the extracted feature parameters.

A program for causing an information processing device to implement:
frame period determining means for determining a frame period representing a time interval based on prosody-related information, which is information relating to the prosody of speech; and
segment information generating means for extracting, for each of a plurality of time frames arranged such that the start positions of two consecutive time frames are separated by the determined frame period, a feature parameter representing a feature of the portion, within that time frame, of a speech segment that is a part of basic speech on which speech synthesis processing for synthesizing speech is based, and for generating segment information including time-series data of the extracted feature parameters.