JP2007041012A

JP2007041012A - Voice quality conversion device and speech synthesis device

Info

Publication number: JP2007041012A
Application number: JP2003392672A
Authority: JP
Inventors: Natsuki Saito; 夏樹齋藤; Takahiro Kamai; 孝浩釜井
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 2003-11-21
Filing date: 2003-11-21
Publication date: 2007-02-15
Also published as: WO2005050624A1

Abstract

PROBLEM TO BE SOLVED: To perform flexible voice quality conversion with good sound quality, and to allow voice quality conversion to be adjusted intuitively.

SOLUTION: In the voice quality conversion device, an acoustic feature parameter sequence 100 obtained by analyzing speech is modified, as instructed by a voice quality conversion control unit 103, on the basis of a voice quality conversion parameter database 105. When at least one acoustic feature parameter in the sequence, or at least one feature variable calculated from the acoustic feature parameters, exceeds a predetermined threshold value 106, the device judges that the conversion would degrade the voice quality and adjusts the content of the conversion processing so that no degradation arises.

COPYRIGHT: (C)2007,JPO&INPIT

Description

本発明は音声の声質を変換する声質変換装置、及び声質可変な音声合成装置に関する。 The present invention relates to a voice quality conversion device that converts voice quality and a voice synthesis device that can change voice quality.

音声を人工的に生成する音声合成装置において合成音声の声質を変換する方法として特許文献１（特開平０７−３１９４９５号公報），特許文献２（特開平０８−１５２９００号公報）に記載の技術が存在する。 Techniques for converting the voice quality of synthesized speech in a speech synthesizer, which generates speech artificially, are described in Patent Document 1 (Japanese Patent Laid-Open No. 07-319495) and Patent Document 2 (Japanese Patent Laid-Open No. 08-152900).

特許文献１に記載の技術では、複数話者の音声から得た合成単位データがデータベースにあらかじめ蓄積されている。音声合成に用いたい合成単位と声質とが指定されると、まず、指定された合成単位に最も近い合成単位データがデータベースから選択される。次に、その選択した合成単位データの話者の声質が指定された声質とどの程度異なるかチェックされ、所定の程度以上異なる場合には、指定声質に近くなるように合成単位データに対し声質変換が行われる。ここでは、選択した合成単位データの話者のコードブック（声質の特徴を表した情報）から、指定の声質にマッチする声質を持った話者のコードブックへコードブックマッピングを行うことにより、選択された合成単位データの声質を指定の声質に変換している。 In the technique described in Patent Document 1, synthesis unit data obtained from the voices of a plurality of speakers is stored in a database in advance. When the synthesis unit and the voice quality to be used for speech synthesis are specified, the synthesis unit data closest to the specified synthesis unit is first selected from the database. Next, the voice quality of the speaker of the selected synthesis unit data is compared with the specified voice quality, and if the two differ by a predetermined degree or more, voice quality conversion is applied to the synthesis unit data so that it approaches the specified voice quality. The conversion is performed by codebook mapping from the codebook (information representing voice quality characteristics) of the speaker of the selected synthesis unit data to the codebook of a speaker whose voice quality matches the specified one.

また、特許文献２に記載の技術では、デジタル音声データをアナログ音声信号に変換する際のサンプリング周波数を変換することにより合成音声の声質を変換している。さらに、出力される音声が適切になるように基本周波数や音韻継続時間などのいわゆる韻律情報（スペクトルパラメタ）をサンプリング周波数の変更に応じて適切に設定している。 In the technique described in Patent Document 2, the voice quality of synthesized speech is converted by changing the sampling frequency used when digital audio data is converted into an analog audio signal. In addition, so-called prosodic information (spectral parameters) such as the fundamental frequency and phoneme duration is set according to the change in sampling frequency so that the output speech remains appropriate.
特開平０７−３１９４９５号公報 特開平０８−１５２９００号公報 Patent Document 1: JP 07-319495 A; Patent Document 2: JP 08-152900 A

しかしながら、前記従来の方法を用いた場合、声質変換の自由度が狭かったり、声質の調整が非常に困難であったりする。 However, when the conventional method is used, the degree of freedom of voice quality conversion is narrow or the voice quality is very difficult to adjust.

例えば特許文献１のように音声素片データベースを切り替える方法では、１種類の声質に対して１つの音声素片データベースが必要となり、さらに声質の連続的な変化を表現することができない。 For example, in the method of switching speech unit databases as in Patent Document 1, one speech unit database is required for one type of voice quality, and further, continuous change in voice quality cannot be expressed.

また、例えば特許文献２のようにスペクトルパラメタを変換する方法では、「明るくかつ怒った声にする」などのように、複数の観点で声質変換を同時に行う場合、声質に破綻が起こらないように音声の品質を維持するのが困難である。 Also, with a method that converts spectral parameters as in Patent Document 2, it is difficult to maintain speech quality without the voice quality breaking down when conversion is performed from several viewpoints at once, such as “make the voice bright and angry”.

本発明による声質変換装置は、特徴パラメタ変換部と、声質変換制御部と、声質パラメタ調整部とを備える。特徴パラメタ変換部は、音声の音響的特徴を表す複数のパラメタ（音響特徴パラメタ列）に対して所定の変換処理を行う。声質変換制御部は、所望する声質に応じた変換指示を特徴パラメタ変換部に与える。特徴パラメタ変換部は、声質変換制御部からの変換指示に従って上記音響特徴パラメタ列に対する変換処理を行う。声質パラメタ調整部は、声質変換制御部からの変換指示に従った変換処理が上記音響特徴パラメタ列に対して行われたと仮定した場合における当該変換処理後の音響特徴パラメタ列の値が所定の範囲内に収まっているか否かを判断し、当該変換処理後の音響特徴パラメタ列のうち少なくとも１つのパラメタが上記所定の範囲を超えるときには、そのパラメタが上記所定の範囲内に収まるように声質変換制御部から特徴パラメタ変換部への変換指示を変更する。 The voice quality conversion device according to the present invention includes a feature parameter conversion unit, a voice quality conversion control unit, and a voice quality parameter adjustment unit. The feature parameter conversion unit performs a predetermined conversion process on a plurality of parameters representing the acoustic features of speech (an acoustic feature parameter sequence). The voice quality conversion control unit gives the feature parameter conversion unit a conversion instruction corresponding to the desired voice quality. The feature parameter conversion unit converts the acoustic feature parameter sequence in accordance with the conversion instruction from the voice quality conversion control unit. The voice quality parameter adjustment unit determines whether the values of the acoustic feature parameter sequence would fall within a predetermined range if the conversion specified by the conversion instruction were applied to the acoustic feature parameter sequence. When at least one parameter of the converted sequence would exceed the predetermined range, the voice quality parameter adjustment unit changes the conversion instruction sent from the voice quality conversion control unit to the feature parameter conversion unit so that the parameter falls within the predetermined range.

本発明によれば、音質が良く自由度の高い声質変換が可能になり、かつ声質変換の調整を直感的に行うことができるようになる。 According to the present invention, it is possible to perform voice quality conversion with good sound quality and high flexibility, and it is possible to intuitively adjust voice quality conversion.

以下、本発明の実施の形態について、図を参照しながら説明する。なお、図面において同一または相当部分には同一の参照符号を付してその説明は繰り返さない。 Hereinafter, embodiments of the present invention will be described with reference to the drawings. In the drawings, the same or corresponding parts are denoted by the same reference numerals, and the description thereof will not be repeated.

（実施の形態１） (Embodiment 1)
図１は、本発明の実施の形態１による声質変換装置の構成であり、音声を分析した結果得られ、これをもとに再合成を行うことで元の音声波形が得られる音響特徴パラメタ列１００を特徴パラメタ変換部１０１で変換し、変換済み音響特徴パラメタ列１０２を得ることができる。特徴パラメタ変換部１０１で行う変換処理の内容は声質変換制御部１０３によって指示され、声質変換パラメタ調整部１０４は、声質変換制御部１０３の指示する変換処理の内容が変形済み音響特徴パラメタ１０２に音質劣化を起こす場合を検知し、声質変換制御部１０３に働きかけて、音質劣化を起こさないように変換処理の内容を調整することができる。声質変換パラメタデータベース１０５は、特徴パラメタ変換部１０１が声質変換処理を行う際のテンプレートとなるデータを保持する。また、音声合成パラメタ限界値データベース１０６には、音響特徴パラメタ列について自然性を保った合成音声が得られるための限界条件（各音響特徴パラメタの限界値など）が格納されており、声質変換パラメタ調整部１０４は、音声合成パラメタ限界値データベース１０６に収められたデータを元に声質変換処理の内容の調整を行うこともできる。 FIG. 1 shows the configuration of a voice quality conversion device according to Embodiment 1 of the present invention. An acoustic feature parameter sequence 100, which is obtained by analyzing speech and from which the original speech waveform can be recovered by resynthesis, is converted by the feature parameter conversion unit 101 to obtain a converted acoustic feature parameter sequence 102. The content of the conversion performed by the feature parameter conversion unit 101 is specified by the voice quality conversion control unit 103. The voice quality conversion parameter adjustment unit 104 detects cases in which the conversion specified by the voice quality conversion control unit 103 would degrade the sound quality of the converted acoustic feature parameter sequence 102, and acts on the voice quality conversion control unit 103 to adjust the content of the conversion so that no degradation occurs. The voice quality conversion parameter database 105 holds the data that the feature parameter conversion unit 101 uses as a template when performing voice quality conversion. The speech synthesis parameter limit value database 106 stores the limit conditions (such as the limit value of each acoustic feature parameter) under which the acoustic feature parameter sequence still yields natural-sounding synthesized speech, and the voice quality conversion parameter adjustment unit 104 can also adjust the content of the conversion on the basis of the data stored in this database.

以下、本実施の形態において声質変換処理を行う際の処理の流れを具体的に示す。 Hereinafter, the flow of processing when performing voice quality conversion processing in the present embodiment will be specifically shown.

図２は、本実施の形態における声質変換装置の動作初期状態である。 FIG. 2 shows an initial operation state of the voice quality conversion apparatus according to the present embodiment.

声質変換制御部１０３は、１または複数の声質の各々に対応する声質変換パラメタ調整コントロールをユーザに提示する。声質変換パラメタ調整コントロールは、対応する声質の変換可能量に対応した移動可能範囲と、当該移動可能範囲内の任意の位置を指示可能なポインタとを有している。声質変換制御部１０３は、声質変換パラメタ調整コントロールのポインタをユーザに操作させることにより声質および変換量を指定させる。たとえば図２に示すように、声質変換制御部１０３は、変換可能な声質（ここでは、明るさ・暗さ・男性らしさ・早口の４つ）に対応する声質変換パラメタをそれぞれ０から１０の範囲内で連続的に調整可能なスライダ（もしくはボリュームスイッチ等でもよい）をユーザに提示する。ユーザは各声質変換パラメタに対応するスライダを０から１０の範囲内で動かすことにより変換したい声質および変換量を指定する。ここでは調整を加えない場合は０、調整の度合いが大きくなるにつれ１０に近付くものとする。なお、図２では、明るさを「明」、暗さを「暗」、男性らしさを「男」、早口を「早」で示している。音響特徴パラメタ列１００は、調整可能な音響特徴パラメタとして、各分析フレームごとに基本周波数（Ｆ０）、第一フォルマント周波数（Ｆ１）、第二フォルマント周波数（Ｆ２）、フレーム継続長（ＦＲ）、音源パワー（ＰＷ）の５つのパラメタを持つ。声質変換パラメタデータベース１０５は、声質変換制御部１０３の持つ声質変換パラメタの値を調整したとき（１増やしたとき）、それぞれの音響特徴パラメタ値に加えるべき値を保持している。図２では、声質変換制御部１０３上で全ての声質変換パラメタの設定値が０となっているため、出力の音響特徴パラメタ列１０２の内容は入力の音響特徴パラメタ列１００と同一である。 The voice quality conversion control unit 103 presents the user with a voice quality conversion parameter adjustment control for each of one or more voice qualities. Each control has a movable range corresponding to the convertible amount of its voice quality and a pointer that can indicate any position within that range. The voice quality conversion control unit 103 lets the user specify a voice quality and a conversion amount by operating the pointer. For example, as shown in FIG. 2, the voice quality conversion control unit 103 presents the user with sliders (volume knobs or the like may be used instead) with which the voice quality conversion parameters for the convertible voice qualities (here four: brightness, darkness, masculinity, and fast speech) can each be adjusted continuously in the range 0 to 10. The user specifies the voice quality to convert and the conversion amount by moving the corresponding slider within the range 0 to 10. Here, 0 means that no adjustment is applied, and the value approaches 10 as the degree of adjustment increases. In FIG. 2, brightness is labeled “bright” (明), darkness “dark” (暗), masculinity “male” (男), and fast speech “fast” (早). The acoustic feature parameter sequence 100 has five adjustable acoustic feature parameters for each analysis frame: fundamental frequency (F0), first formant frequency (F1), second formant frequency (F2), frame duration (FR), and sound source power (PW). The voice quality conversion parameter database 105 holds, for each voice quality conversion parameter of the voice quality conversion control unit 103, the value to be added to each acoustic feature parameter when that voice quality conversion parameter is increased by 1. In FIG. 2, all voice quality conversion parameters are set to 0 in the voice quality conversion control unit 103, so the output acoustic feature parameter sequence 102 is identical to the input acoustic feature parameter sequence 100.

図３は、声質変換制御部１０３で明るさの声質変換パラメタを５に、早口の声質変換パラメタを３に設定した状態での動作例を表す。入力の音響特徴パラメタ列１００のある分析フレームに着目したとき、例えばこの分析フレームにおける音響特徴パラメタＦ０の値が３００であったとすると、出力の音響特徴パラメタ列１０２における、対応する分析フレームでのＦ０の値は、明るさの声質変換パラメタを１増やすごとに＋５となり、早口の声質変換パラメタを１増やすごとに＋１となることが声質変換パラメタデータベース１０５から分かるので、３００＋５×（＋５）＋３×（＋１）＝３２８として求められる。他の音響特徴パラメタ値についても同様に計算を行えばよい。 FIG. 3 shows an operation example in which the voice quality conversion control unit 103 sets the brightness voice quality conversion parameter to 5 and the fast-speech voice quality conversion parameter to 3. Consider one analysis frame of the input acoustic feature parameter sequence 100, and suppose that the acoustic feature parameter F0 in that frame is 300. The voice quality conversion parameter database 105 shows that F0 changes by +5 for each unit increase of the brightness parameter and by +1 for each unit increase of the fast-speech parameter, so F0 in the corresponding frame of the output acoustic feature parameter sequence 102 is 300 + 5 × (+5) + 3 × (+1) = 328. The other acoustic feature parameter values are calculated in the same way.
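The FIG. 3 calculation can be sketched in a few lines of Python. This is only an illustration: the names `F0_DELTA_PER_STEP` and `convert_f0` are hypothetical, and only the F0 column of database 105 is modeled, using the per-step offsets that the text and the constraint list give (+5, −5, −3, +1).

```python
# Per-step F0 offsets as stated in the text; the other acoustic features
# (F1, F2, FR, PW) would be handled the same way with their own offsets.
F0_DELTA_PER_STEP = {
    "brightness": 5,
    "darkness": -5,
    "masculinity": -3,
    "fast_speech": 1,
}

def convert_f0(input_f0, sliders):
    """Add each slider value times its per-step offset to the input F0."""
    return input_f0 + sum(F0_DELTA_PER_STEP[name] * value
                          for name, value in sliders.items())

# FIG. 3 example: brightness = 5, fast speech = 3, input F0 = 300.
print(convert_f0(300, {"brightness": 5, "fast_speech": 3}))  # 328
```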

図４は、図３に示される状態の後、声質変換制御部１０３で暗さの声質変換パラメタを７に設定した結果を示す。声質変換パラメタデータベース１０５によれば、暗さの声質変換パラメタは明るさの声質変換パラメタと逆の内容になっているので、暗さの声質変換パラメタを増やそうとしたとき、明るさの声質変換パラメタが０より大きければ、暗さの声質変換パラメタを増やす代わりに明るさの声質変換パラメタをまず減らしてやることで同じ効果が得られる。図４の場合、図３で明るさの声質変換パラメタが５であった状態から暗さの声質変換パラメタが７に変更されたが、明るさの声質変換パラメタを０にすることで暗さの声質変換パラメタを５増やすのと同じ効果が得られるので、暗さの声質変換パラメタを７にする代わりに、明るさの声質変換パラメタを０に、暗さの声質変換パラメタを２にすればよい。このように、声質変換制御部１０３で指定された声質変換処理を、各声質変換パラメタの値が最小になるような組み合わせに声質変換パラメタ調整部１０４で調整することにより、ユーザにとって操作しやすい声質変換制御インタフェイスが実現される。 FIG. 4 shows the result of setting the darkness voice quality conversion parameter to 7 in the voice quality conversion control unit 103 after the state shown in FIG. 3. According to the voice quality conversion parameter database 105, the darkness parameter has exactly the opposite effect of the brightness parameter. Therefore, when the darkness parameter is to be increased and the brightness parameter is greater than 0, the same effect is obtained by first decreasing the brightness parameter instead of increasing the darkness parameter. In FIG. 4, the darkness parameter was changed to 7 from the state of FIG. 3, in which the brightness parameter was 5. Setting the brightness parameter to 0 has the same effect as increasing the darkness parameter by 5, so instead of setting the darkness parameter to 7 it suffices to set the brightness parameter to 0 and the darkness parameter to 2. By having the voice quality conversion parameter adjustment unit 104 reduce the conversion specified through the voice quality conversion control unit 103 to the combination in which each voice quality conversion parameter takes its smallest value, a voice quality conversion control interface that is easy for the user to operate is realized.
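The cancellation of opposite parameters in FIG. 4 can be sketched as follows. The function name `cancel_opposites` is hypothetical, and the sketch assumes, as the text states for brightness and darkness, that the two parameters have exactly opposite entries in database 105.

```python
def cancel_opposites(sliders, pair=("brightness", "darkness")):
    """Remove the amount that two mutually opposite sliders have in common.

    Assumes the pair has exactly opposite effects, so subtracting the same
    amount from both leaves the converted acoustic features unchanged.
    """
    first, second = pair
    common = min(sliders[first], sliders[second])
    adjusted = dict(sliders)
    adjusted[first] -= common
    adjusted[second] -= common
    return adjusted

# FIG. 4 example: brightness was 5 and the user sets darkness to 7.
print(cancel_opposites({"brightness": 5, "darkness": 7}))
# {'brightness': 0, 'darkness': 2}
```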

さらに、音響特徴パラメタＦ０の値が３５０を超えると、音響特徴パラメタ列の再合成によって出力される合成音の音質が著しく劣化することが分かっているとする。図５は、明るさの声質変換パラメタが１０に設定されており、その結果出力される音響特徴パラメタ列のある分析フレームでＦ０の値が３５０になっている状態を表す。 Furthermore, it is assumed that when the value of the acoustic feature parameter F0 exceeds 350, the sound quality of the synthesized sound output by re-synthesis of the acoustic feature parameter sequence is significantly deteriorated. FIG. 5 shows a state in which the voice quality conversion parameter for brightness is set to 10 and the value of F0 is 350 in an analysis frame having an acoustic feature parameter sequence output as a result.

図６は、図５の状態からさらに早口の声質変換パラメタ値を５に増やした状態を表す。早口の声質変換パラメタ値を５増やすと音響特徴パラメタＦ０の値が５増えるが、図５において音響特徴パラメタＦ０の値はある分析フレームで既に３５０となっているため、明るさの声質変換パラメタを１減らすことで音響特徴パラメタＦ０の値を３５０に収める。結果として、明るさの声質変換パラメタ値は９に、早口の声質変換パラメタ値は５になる。このように、声質変換パラメタ調整部１０４によって声質変換パラメタの値を自動的に調整することで、ユーザは各音響特徴パラメタの限界値を意識することなく、声質変換の操作を行うことができる。 FIG. 6 shows the state after the fast-speech voice quality conversion parameter is further increased to 5 from the state of FIG. 5. Increasing the fast-speech parameter by 5 raises the acoustic feature parameter F0 by 5, but F0 has already reached 350 in a certain analysis frame in FIG. 5, so the brightness parameter is decreased by 1 to keep F0 at 350. As a result, the brightness parameter becomes 9 and the fast-speech parameter becomes 5. Because the voice quality conversion parameter adjustment unit 104 adjusts the parameter values automatically in this way, the user can operate the voice quality conversion without being aware of the limit value of each acoustic feature parameter.
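One simple way to reproduce the FIG. 6 adjustment is to lower a chosen slider one step at a time until F0 is back within its limit. This is an illustrative sketch, not necessarily how the adjustment unit 104 is implemented: `back_off` and its `yielding` argument are hypothetical, and the choice of which slider gives way (here brightness, as in the text) is left to the caller.

```python
F0_DELTA_PER_STEP = {"brightness": 5, "fast_speech": 1}  # from the text
F0_LIMIT = 350                                           # from the text

def output_f0(input_f0, sliders):
    """F0 after conversion, for one analysis frame."""
    return input_f0 + sum(F0_DELTA_PER_STEP[n] * v for n, v in sliders.items())

def back_off(input_f0, sliders, yielding):
    """Lower the `yielding` slider step by step while F0 exceeds the limit."""
    adjusted = dict(sliders)
    while output_f0(input_f0, adjusted) > F0_LIMIT and adjusted[yielding] > 0:
        adjusted[yielding] -= 1
    return adjusted

# FIG. 6 example: brightness 10, fast speech raised to 5, input F0 = 300.
print(back_off(300, {"brightness": 10, "fast_speech": 5}, "brightness"))
# {'brightness': 9, 'fast_speech': 5}
```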

声質パラメタ調整部１０４は、声質変換制御部１０３からの変換指示に従った変換処理が音響特徴パラメタ列１００に対して行われたと仮定した場合における当該変換処理後の音響特徴パラメタ列の値が所定の範囲内に収まっているか否かを判断し、当該変換処理後の音響特徴パラメタ列のうち少なくとも１つのパラメタが上記所定の範囲を超えるときには、そのパラメタが上記所定の範囲内に収まるように声質変換制御部１０３から特徴パラメタ変換部１０１への変換指示を変更する。 The voice quality parameter adjustment unit 104 determines whether the values of the acoustic feature parameter sequence would fall within a predetermined range if the conversion specified by the conversion instruction from the voice quality conversion control unit 103 were applied to the acoustic feature parameter sequence 100. When at least one parameter of the converted sequence would exceed the predetermined range, the unit changes the conversion instruction sent from the voice quality conversion control unit 103 to the feature parameter conversion unit 101 so that the parameter falls within the range.

なお、音声合成パラメタ限界値データベース１０６に格納されている限界条件（たとえば、音響特徴パラメタＦ０の値が３５０を超えてはならないというような各音響特徴パラメタの限界値や、音響特徴パラメタＦ０と音響特徴パラメタＦ２の値を加算した結果は２０００を越えてはならないというようなデータ）を声質変換パラメタ調整部１０４が必要に応じて参照するという構成を取っても良い。 The voice quality conversion parameter adjustment unit 104 may also be configured to refer, as necessary, to the limit conditions stored in the speech synthesis parameter limit value database 106 (for example, a limit value for an individual acoustic feature parameter, such as “the value of F0 must not exceed 350”, or a condition such as “the sum of the values of F0 and F2 must not exceed 2000”).
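As an illustration, the two example limit conditions could be held as labeled predicates and checked against one converted analysis frame. The representation (`LIMIT_CONDITIONS`, `violated_conditions`, a frame as a dictionary) is an assumption made for this sketch; the text does not say how database 106 stores its conditions.

```python
# Each limit condition is a label plus a predicate over one converted frame.
LIMIT_CONDITIONS = [
    ("F0 <= 350", lambda frame: frame["F0"] <= 350),
    ("F0 + F2 <= 2000", lambda frame: frame["F0"] + frame["F2"] <= 2000),
]

def violated_conditions(frame):
    """Return the labels of all limit conditions that the frame breaks."""
    return [label for label, holds in LIMIT_CONDITIONS if not holds(frame)]

# Hypothetical frame values, chosen only to exercise both conditions.
print(violated_conditions({"F0": 340, "F2": 1700}))  # ['F0 + F2 <= 2000']
print(violated_conditions({"F0": 328, "F2": 1500}))  # []
```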

なお、特徴パラメタ変換部１０１によって音響特徴パラメタ列１００に与えられる変化は全分析フレームに対して一律でなくてもよく、分析フレームごとに声質変換パラメタデータベース１０５の内容が変化するようになっていてもよい。 The change that the feature parameter conversion unit 101 applies to the acoustic feature parameter sequence 100 need not be uniform across all analysis frames; the contents of the voice quality conversion parameter database 105 may vary from one analysis frame to another.

なお、声質変換パラメタ調整部１０４による声質変換パラメタ値の調整は、制約充足アルゴリズムを用いて自動的に行っても良い。制約充足アルゴリズムには、例えば参考文献１のＩｎｄｉｇｏアルゴリズムがある。図５と図６に示される声質変換パラメタの調整を行う制約条件の音響特徴パラメタＦ０に関する部分を、Ｉｎｄｉｇｏアルゴリズムの制約階層で記述すると以下のようになる。 Note that the voice quality conversion parameter value adjustment by the voice quality conversion parameter adjustment unit 104 may be automatically performed using a constraint satisfaction algorithm. An example of the constraint satisfaction algorithm is the Indigo algorithm of Reference 1. A portion relating to the acoustic feature parameter F0 of the constraint condition for adjusting the voice quality conversion parameter shown in FIGS. 5 and 6 is described in the constraint hierarchy of the Indigo algorithm as follows.

ＲＥＱＵＩＲＥＤ制約Ｃ１：出力Ｆ０≦３５０
ＲＥＱＵＩＲＥＤ制約Ｃ２：入力Ｆ０＝３００
ＲＥＱＵＩＲＥＤ制約Ｃ３：明るさ×５＝ｔ１
ＲＥＱＵＩＲＥＤ制約Ｃ４：暗さ×−５＝ｔ２
ＲＥＱＵＩＲＥＤ制約Ｃ５：男性らしさ×−３＝ｔ３
ＲＥＱＵＩＲＥＤ制約Ｃ６：早口×１＝ｔ４
ＲＥＱＵＩＲＥＤ制約Ｃ７：ｔ１＋ｔ２＝ｔ５
ＲＥＱＵＩＲＥＤ制約Ｃ８：ｔ３＋ｔ４＝ｔ６
ＲＥＱＵＩＲＥＤ制約Ｃ９：ｔ５＋ｔ６＝ｔ７
ＲＥＱＵＩＲＥＤ制約Ｃ１０：入力Ｆ０＋ｔ７＝ｔ８
ＲＥＱＵＩＲＥＤ制約Ｃ１１：ｔ８＝出力Ｆ０
ＳＴＲＯＮＧ制約Ｃ１２：早口＝５
ＷＥＡＫ制約Ｃ１３：男性らしさ＝０
ＷＥＡＫ制約Ｃ１４：暗さ＝０
ＷＥＡＫ制約Ｃ１５：明るさ＝１０
REQUIRED constraint C1: output F0 ≦ 350
REQUIRED constraint C2: input F0 = 300
REQUIRED constraint C3: brightness × 5 = t1
REQUIRED constraint C4: darkness × −5 = t2
REQUIRED constraint C5: masculinity × −3 = t3
REQUIRED constraint C6: fast speech × 1 = t4
REQUIRED constraint C7: t1 + t2 = t5
REQUIRED constraint C8: t3 + t4 = t6
REQUIRED constraint C9: t5 + t6 = t7
REQUIRED constraint C10: input F0 + t7 = t8
REQUIRED constraint C11: t8 = output F0
STRONG constraint C12: fast speech = 5
WEAK constraint C13: masculinity = 0
WEAK constraint C14: darkness = 0
WEAK constraint C15: brightness = 10
以上の制約条件を、制約グラフの形で表すと図７のようになる。なお、変数ｔ１からｔ８は、制約充足の計算の途中結果を保持するための変数である。また、簡単のため省略したが、より望ましい結果を出すためには、各声質変換パラメタの値を０以上１０以下の値に束縛するＲＥＱＵＩＲＥＤ制約を設けてあることが望ましい。 Expressed as a constraint graph, the above constraints are as shown in FIG. 7. Variables t1 to t8 hold intermediate results of the constraint satisfaction calculation. Although omitted here for simplicity, it is desirable, in order to obtain better results, to add REQUIRED constraints that bind the value of each voice quality conversion parameter to the range 0 to 10.
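The effect of this constraint hierarchy can be approximated with a very small interval-propagation sketch. It is not an implementation of the Indigo algorithm: it handles only the single linear limit on F0, folds the intermediate variables t1 to t8 into one sum, and starts each slider at the range [0, 10] suggested above instead of [−∞, +∞]. The function name `propagate` and its arguments are hypothetical. Preferences are processed from strongest to weakest, and each one is satisfied as closely as the remaining range allows.

```python
def propagate(input_f0, f0_limit, coef, preferences, lo=0, hi=10):
    """Fix sliders one by one, strongest preference first.

    coef maps each slider to its per-step F0 offset (C3 to C6).
    preferences is a list of (slider, wanted_value), strongest first;
    every slider is expected to appear in it exactly once.
    Infeasible cases (empty ranges) are not handled in this sketch.
    """
    bounds = {name: (lo, hi) for name in coef}
    for name, wanted in preferences:
        # Smallest F0 contribution the other sliders can still make.
        others_min = sum(min(coef[n] * b for b in bounds[n])
                         for n in coef if n != name)
        budget = f0_limit - input_f0 - others_min  # upper bound on coef * value
        low, high = bounds[name]
        c = coef[name]
        if c > 0:
            high = min(high, budget / c)
        elif c < 0:
            low = max(low, budget / c)
        value = min(max(wanted, low), high)  # closest allowed value
        bounds[name] = (value, value)
    return {name: b[0] for name, b in bounds.items()}

coef = {"brightness": 5, "darkness": -5, "masculinity": -3, "fast_speech": 1}
prefs = [("fast_speech", 5), ("masculinity", 0), ("darkness", 0), ("brightness", 10)]
print(propagate(300, 350, coef, prefs))
# brightness ends at 9, the other sliders keep their preferred values
```

Under these assumptions the weak preference brightness = 10 is relaxed to 9, which agrees with the result described for FIG. 6.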

以上の制約条件をＩｎｄｉｇｏアルゴリズムで解く場合の処理の概略を以下に示す。 An outline of processing when the above constraint conditions are solved by the Indigo algorithm is shown below.

初期状態：全ての変数の値域が[−∞，＋∞]
Ｃ１追加：出力Ｆ０の値域が[−∞，３５０]になる
Ｃ２追加：入力Ｆ０の値域が[３００，３００]になる
Ｃ３〜Ｃ１０追加：各変数の値域に変化無し
Ｃ１１追加：ｔ８の値域が[−∞，３５０]になる
Ｃ１０を伝播してｔ７の値域が[−∞，５０]になる
Ｃ１２追加：早口の値域が[５，５]になる
Ｃ６を伝播してｔ４の値域が[５，５]になる
Ｃ１３追加：男性らしさの値域が[０，０]になる
Ｃ５を伝播してｔ３の値域が[０，０]になる
Ｃ８を伝播してｔ６の値域が[５，５]になる
Ｃ９を伝播してｔ５の値域が[−∞，４５]になる
Ｃ１４追加：暗さの値域が[０，０]になる
Ｃ４を伝播してｔ２の値域が[０，０]になる
Ｃ７を伝播してｔ１の値域が[−∞，４５]になる
Ｃ３を伝播して明るさの値域が[−∞，９]になる
Ｃ１５追加：明るさの値域が[９，９]になる
Initial state: the range of every variable is [−∞, +∞]
Add C1: the range of output F0 becomes [−∞, 350]
Add C2: the range of input F0 becomes [300, 300]
Add C3 to C10: no change in the range of any variable
Add C11: the range of t8 becomes [−∞, 350]
Propagate through C10: the range of t7 becomes [−∞, 50]
Add C12: the range of fast speech becomes [5, 5]
Propagate through C6: the range of t4 becomes [5, 5]
Add C13: the range of masculinity becomes [0, 0]
Propagate through C5: the range of t3 becomes [0, 0]
Propagate through C8: the range of t6 becomes [5, 5]
Propagate through C9: the range of t5 becomes [−∞, 45]
Add C14: the range of darkness becomes [0, 0]
Propagate through C4: the range of t2 becomes [0, 0]
Propagate through C7: the range of t1 becomes [−∞, 45]
Propagate through C3: the range of brightness becomes [−∞, 9]
Add C15: the range of brightness becomes [9, 9]
＜参考文献１＞ <Reference 1>
A. Borning, R. Anderson, B. Freeman-Benson: The Indigo Algorithm, TR 96-05-01, Department of Computer Science and Engineering, University of Washington, July 1996

（実施の形態２） (Embodiment 2)
本発明の実施の形態２による声質変換パラメタ調整コントロールの配置決定方法の一例を、図８を用いて説明する。 An example of a method for determining the arrangement of the voice quality conversion parameter adjustment controls according to Embodiment 2 of the present invention will be described with reference to FIG. 8.

声質変換パラメタデータベース１０５に収められる、各音響特徴パラメタの差分値のデータが図８のようになっているものとする。各声質変換パラメタが音響特徴パラメタに及ぼす変化の内容の近さを、声質変換パラメタデータベース１０５に収められる各音響特徴パラメタの差分値同士のユークリッド距離であると定義すると、それぞれの距離は表８００のように計算できる。あとは、距離の近いものほど近くに配置されるように声質変換パラメタ調整コントロールの配置を行えばよい。表８００では男性らしさと暗さの声質変換パラメタが最も近く、次に明るさと早口、その次に早口と暗さという結果になっているので、８０１に示されるような配置を行うことができる。 Assume that the difference values for each acoustic feature parameter stored in the voice quality conversion parameter database 105 are as shown in FIG. 8. If the closeness between the changes that two voice quality conversion parameters make to the acoustic feature parameters is defined as the Euclidean distance between their difference values stored in the voice quality conversion parameter database 105, the distances can be calculated as shown in Table 800. The voice quality conversion parameter adjustment controls are then arranged so that controls at a smaller distance are placed closer together. In Table 800, the masculinity and darkness parameters are closest, followed by brightness and fast speech, and then fast speech and darkness, so the arrangement shown at 801 can be used.
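The distance calculation can be sketched as below. The values of FIG. 8 are not reproduced in the text, so the table `DELTAS` is made up for illustration: only the F0 column (+5, −5, −3, +1) comes from the text, and the remaining columns are hypothetical numbers chosen so that the ranking agrees with the ordering reported for Table 800. The name `ranked_pairs` is likewise hypothetical.

```python
from itertools import combinations
from math import dist

# Per-step offsets in the order (F0, F1, F2, FR, PW).
# Only the F0 column is taken from the text; the rest is illustrative.
DELTAS = {
    "brightness":  (5, 2, 4, 0, 1),
    "darkness":    (-5, -2, -4, 0, -1),
    "masculinity": (-3, -4, -6, 2, 0),
    "fast_speech": (1, 0, 0, -3, 0),
}

def ranked_pairs(deltas):
    """Return all slider pairs with their Euclidean distance, closest first."""
    pairs = [((a, b), dist(deltas[a], deltas[b]))
             for a, b in combinations(deltas, 2)]
    return sorted(pairs, key=lambda item: item[1])

for (a, b), d in ranked_pairs(DELTAS):
    print(f"{a:12s} {b:12s} {d:6.2f}")
```

Controls would then be laid out so that pairs near the top of this ranking sit next to each other.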

このような配置を行い、さらに以下に示すような方法で声質変換パラメタの調整を行うことによって、ユーザは直感的に分かり易い方法で声質変換の操作を行うことができる。 By performing such an arrangement and further adjusting the voice quality conversion parameters by the method as described below, the user can perform the voice quality conversion operation in an intuitive and easily understandable manner.

ＳＴＥＰ０：ある１つの声質変換パラメタの値が、ユーザにより変更される。たとえば図９の８０２ａ，８０２ｂに示すようにすべての声質パラメタが０の状態から明るさ（明）の声質変換パラメタが１０に変更されたものとする。なお、ここでは入力された音響特徴パラメタの値や音響特徴パラメタの限界値等の条件は図５に示したのと同様であるものとする。 STEP 0: The value of a certain voice quality conversion parameter is changed by the user. For example, as shown in 802a and 802b of FIG. 9, it is assumed that the voice quality conversion parameter of brightness (bright) is changed to 10 from the state where all voice quality parameters are 0. Here, it is assumed that conditions such as the input acoustic feature parameter value and the limit value of the acoustic feature parameter are the same as those shown in FIG.

ＳＴＥＰ１：声質変換パラメタｉをあとＸｉ増やすと、音響特徴パラメタもしくは音響特徴パラメタの値を元に計算できる特徴変数のうちいずれか１つがあらかじめ設定された値域を外れてしまうような数Ｘｉを求める。これを全ての声質変換パラメタｉについて計算する。たとえば、図９の８０２ｂのように明るさの声質変換パラメタが１０に設定された場合には、早口（早）の声質変換パラメタを０より大きくすると音響特徴パラメタＦ０の値が限界値３５０を超えてしまい合成音の音質が劣化してしまう。したがって早口の声質変換パラメタについてのＸｉは０となる。 STEP 1: For voice quality conversion parameter i, find the amount Xi by which the parameter can still be increased before any one of the acoustic feature parameters, or of the feature variables calculated from the acoustic feature parameter values, leaves its preset range. Calculate this for every voice quality conversion parameter i. For example, when the brightness parameter is set to 10 as at 802b in FIG. 9, raising the fast-speech (“fast”) parameter above 0 would push the acoustic feature parameter F0 past the limit value 350 and degrade the synthesized sound. Xi for the fast-speech parameter is therefore 0.

ＳＴＥＰ２：各声質変換パラメタｉの現在の値をＹｉとして、Ｘｉ＋Ｙｉを声質変換パラメタｉの声質変換パラメタ調整コントロールの最大値、０が最小値、Ｙｉが現在値となるように声質変換パラメタ調整コントロールを再描画する。これを全ての声質変換パラメタｉについて行う。上述の例の場合には図９の８０２ｂに示すように、早口の最大値＝０、最小値＝０、現在値＝０となるように再描画する。なお、この状態（８０２ｂ）で明るさの声質変換パラメタが１０から９に下げられた場合には、上述のＳＴＥＰ０〜２により、図９の８０２ｃに示すように早口の声質変換パラメタの最大値は０から５に上昇する。 STEP 2: With Yi denoting the current value of voice quality conversion parameter i, redraw its voice quality conversion parameter adjustment control so that Xi + Yi is the maximum value, 0 is the minimum value, and Yi is the current value. Do this for every voice quality conversion parameter i. In the example above, the fast-speech control is redrawn with maximum = 0, minimum = 0, and current value = 0, as shown at 802b in FIG. 9. If the brightness parameter is then lowered from 10 to 9 in this state (802b), STEPs 0 to 2 raise the maximum value of the fast-speech parameter from 0 to 5, as shown at 802c in FIG. 9.
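STEP 1 and STEP 2 can be sketched together as a function that returns the range to draw for each slider. Several simplifications are assumptions of this sketch: only the F0 limit of 350 is checked, the input F0 is fixed at 300 as in the earlier example, and the maximum is additionally capped at 10, which the text does not state explicitly. The name `slider_ranges` is hypothetical.

```python
F0_DELTA_PER_STEP = {"brightness": 5, "darkness": -5,
                     "masculinity": -3, "fast_speech": 1}  # from the text
INPUT_F0, F0_LIMIT, SLIDER_MAX = 300, 350, 10

def slider_ranges(current):
    """Return (minimum, maximum, current value) for every slider.

    STEP 1: Xi is the headroom left before F0 would pass its limit.
    STEP 2: the control is drawn with maximum Xi + Yi and minimum 0.
    """
    f0 = INPUT_F0 + sum(F0_DELTA_PER_STEP[n] * v for n, v in current.items())
    ranges = {}
    for name, y in current.items():
        delta = F0_DELTA_PER_STEP[name]
        if delta > 0:
            x = max(0.0, (F0_LIMIT - f0) / delta)
        else:
            x = float("inf")  # raising this slider cannot push F0 upward
        ranges[name] = (0, min(SLIDER_MAX, y + x), y)
    return ranges

state_802b = {"brightness": 10, "darkness": 0, "masculinity": 0, "fast_speech": 0}
state_802c = dict(state_802b, brightness=9)
print(slider_ranges(state_802b)["fast_speech"])  # maximum is 0
print(slider_ranges(state_802c)["fast_speech"])  # maximum rises to 5
```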

以上の方法によれば、ある声質変換パラメタの値が増やされたとき、音響特徴パラメタに加える変化の内容がそれと似通っている声質変換パラメタほど、どの音響特徴パラメタや音響特徴パラメタを元に計算される特徴変数の値もあらかじめ定められた値域を外れさせずに増加させられる余裕はなくなるので、ＳＴＥＰ１におけるＸｉが小さくなり、結果ＳＴＥＰ２の再描画で現在の設定値Ｙｉが、設定可能な最大値に近付いているように描画できる。本実施例では、各音響特徴パラメタに与える変化の内容が近い声質変換パラメタ調整コントロールほど近くに配置されるようになっているので、ユーザには、ある声質変換パラメタ調整コントロールを操作すると、近くに配置されている声質変換パラメタ調整コントロールほど同じ方向につられて動き、遠くに配置されている声質変換パラメタ調整コントロールほど逆の方向につられて動くように提示される。したがって、ユーザはスライダ（声質変換パラメタコントロール）の操作によって声質がどのように変換されるのかを直感的に分かりやすくなる。また、声質変換パラメタの設定に連動して各声質変換パラメタの設定可能な最大値が自動的に変更されるため、ユーザは音声の品質を破綻させずに声質を変換するための各音響特徴パラメタの限界値を意識することなく、声質変換の操作を行うことができる。 With this method, when the value of one voice quality conversion parameter is increased, the more similar another parameter's effect on the acoustic feature parameters is, the less room remains to increase that other parameter without pushing some acoustic feature parameter, or some feature variable calculated from the acoustic feature parameters, outside its predetermined range. Xi in STEP 1 therefore becomes smaller, and the redraw in STEP 2 shows the current setting Yi as closer to the maximum settable value. In this embodiment, controls whose effects on the acoustic feature parameters are similar are placed close together. To the user, operating one control therefore appears to drag nearby controls along in the same direction and distant controls in the opposite direction. This makes it easier for the user to understand intuitively how the voice quality is changed by operating the sliders (voice quality conversion parameter controls). In addition, because the maximum settable value of each voice quality conversion parameter changes automatically as the parameters are set, the user can operate the voice quality conversion without being aware of the acoustic feature parameter limits that must be respected to avoid breaking the speech quality.

The example above changes the maximum value of related voice quality conversion parameters in conjunction with the user's setting of a voice quality conversion parameter. As shown at 802d to 802f in FIG. 9, it is also possible to change the related voice quality conversion parameters themselves in conjunction with that setting. For example, as shown at 802e, when the brightness voice quality conversion parameter is set to 10, the fast-speech voice quality conversion parameter is also set to 10 in conjunction with it. Here, however, only the slider of the fast-speech parameter is moved to position 10; the acoustic feature parameters are not converted in response to the fast-speech parameter being set to 10. As described above, when the brightness parameter is set to 10, setting the fast-speech parameter to 1 or more causes the value of the acoustic feature parameter F0 to exceed the limit value of 350, and the quality of the synthesized sound deteriorates. As shown at 802e, however, the fast-speech parameter is set to 10 (its maximum value) in conjunction with the brightness parameter being set to 10, so the value of the fast-speech parameter cannot be increased any further. If the brightness parameter is lowered from 10 to 9 in this state (802e), the fast-speech parameter is lowered from 10 to 5, as shown at 802f in FIG. 9. In this case as well, only the slider of the fast-speech parameter is moved to position 5; the acoustic feature parameters are not converted in response to the fast-speech parameter being set to 5. Because the value of each voice quality conversion parameter is changed automatically in conjunction with the voice quality conversion parameter settings in this way, the user can perform voice quality conversion without having to be aware of the limit value of each acoustic feature parameter beyond which the speech quality breaks down.
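The way one slider setting restricts another in this embodiment can be sketched in Python as follows. This is only a minimal sketch under stated assumptions: the additive linear model, the base F0 of 250, and the per-step F0 contributions are hypothetical values chosen so that the result agrees with the figures quoted in the text (an F0 limit of 350, no headroom for fast speech when brightness is 10, and fast speech up to 5 when brightness is 9). The names `predicted_f0` and `max_settable` are illustrative and do not come from the patent.

```python
F0_LIMIT = 350.0   # limit value for F0 quoted in the text
BASE_F0 = 250.0    # hypothetical F0 before any conversion
SLIDER_MAX = 10    # slider range 0..10, as in the example

# Hypothetical F0 change per slider step for each voice quality parameter.
F0_PER_STEP = {"brightness": 10.0, "fast_speech": 2.0}


def predicted_f0(settings):
    """F0 that would result if every slider setting were applied (assumed additive)."""
    return BASE_F0 + sum(F0_PER_STEP[name] * value for name, value in settings.items())


def max_settable(name, settings):
    """Largest value of slider `name` that keeps F0 within the limit,
    with all other sliders held at their current values."""
    best = 0
    for candidate in range(SLIDER_MAX + 1):
        trial = dict(settings)
        trial[name] = candidate
        if predicted_f0(trial) <= F0_LIMIT:
            best = candidate
    return best


print(max_settable("fast_speech", {"brightness": 10, "fast_speech": 0}))  # 0
print(max_settable("fast_speech", {"brightness": 9, "fast_speech": 0}))   # 5
```

A user interface could call such a function after every slider movement and either shrink the movable range of the other sliders or reposition their pointers, which are the two variants described above.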

(Embodiment 3)
An example of a method for determining the arrangement of the voice quality conversion parameter adjustment controls according to Embodiment 3 of the present invention will be described with reference to FIG. 10.

Table 800 can be created from the voice quality conversion parameter database 105 by the same method as described in Embodiment 2. Reference numeral 900 shows an arrangement based on table 800 in which the lower limits of all voice quality conversion parameter adjustment controls are gathered at a single point and the controls are placed so that the closer two voice quality conversion parameters are to each other in table 800, the smaller the angle between their adjustment controls.

With this arrangement, when the voice quality conversion parameters are adjusted by the same method as described in Embodiment 2, operating one voice quality conversion parameter adjustment control makes the other adjustment controls appear to be pulled along in the same direction, as shown at 901. This makes it possible to provide a voice quality conversion interface that is intuitive and easy for the user to understand.
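One possible way to turn a mutual distance table into such a radial layout is sketched below in Python. The patent text does not specify the layout procedure, so the greedy nearest-neighbour ordering, the 180-degree arc, the parameter name `calm`, and all distance values are assumptions made for illustration only.

```python
def arrange_controls(names, distance, total_arc_deg=180.0):
    """Assign an angle to each control, all radiating from one common origin.

    `distance` maps frozenset({a, b}) to the distance between parameters a and b.
    Controls are chained greedily by nearest neighbour, and the angular gap
    between consecutive controls is proportional to their distance.
    """
    order = [names[0]]
    remaining = set(names[1:])
    while remaining:
        last = order[-1]
        # sorted() makes tie-breaking deterministic
        nearest = min(sorted(remaining), key=lambda n: distance[frozenset((last, n))])
        order.append(nearest)
        remaining.remove(nearest)

    gaps = [distance[frozenset((a, b))] for a, b in zip(order, order[1:])]
    total = sum(gaps) or 1.0
    angles = {order[0]: 0.0}
    accumulated = 0.0
    for name, gap in zip(order[1:], gaps):
        accumulated += gap
        angles[name] = total_arc_deg * accumulated / total
    return angles


# Hypothetical distance table in the spirit of table 800.
example_distance = {
    frozenset(("brightness", "fast_speech")): 1.0,
    frozenset(("brightness", "calm")): 4.0,
    frozenset(("fast_speech", "calm")): 3.0,
}
layout = arrange_controls(["brightness", "fast_speech", "calm"], example_distance)
print(layout)  # {'brightness': 0.0, 'fast_speech': 45.0, 'calm': 180.0}
```

This heuristic only guarantees that neighbouring controls in the chain are separated in proportion to their distance; a fuller implementation might instead use a multidimensional scaling method.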

(Embodiment 4)
FIG. 11 shows the configuration of a speech synthesizer according to Embodiment 4 of the present invention. This device is a voice-quality-variable speech synthesizer that takes text information as input and can synthesize speech in various voice qualities.

In this configuration, the speech synthesis unit 1002 selects, from the speech segments stored in the speech synthesis database 1000, the data that matches the input text information 1001 and generates the acoustic feature parameter sequence 100 from it. The acoustic feature parameter sequence 100 is transformed by the methods shown in Embodiments 1, 2, and 3, and the transformed acoustic feature parameter sequence 102 is output. The waveform generation unit 1003 generates a synthesized sound waveform 1004 from the transformed acoustic feature parameter sequence 102, and the speaker device 1005 plays it back. In this way, arbitrary text information 1001 input by the user can be played back as synthesized speech in the voice quality specified through the voice quality conversion control unit 103.

(Embodiment 5)
FIG. 12 shows the configuration of a speech synthesizer according to Embodiment 5 of the present invention. This device is a voice-quality-variable speech synthesizer capable of synthesizing speech in various voice qualities. It achieves the same effect as Embodiment 4 with a different configuration. The difference from Embodiment 4 is as follows. In Embodiment 4, the voice quality conversion parameter adjustment unit 104 detects quality degradation of the output speech using the acoustic feature parameter sequence output by the speech synthesis unit 1002. In this embodiment, it instead uses some or all of the data stored in the speech synthesis database 1000 to predict the tendency of the acoustic feature parameter values that will appear in the acoustic feature parameter sequence 100 output by the speech synthesis unit 1002 on the basis of that database, and adjusts the voice quality conversion parameters accordingly. For example, the voice quality conversion parameter adjustment unit 104 may select one item of segment data for every phoneme from the speech synthesis database 1000 and adjust the voice quality conversion parameters according to whether the quality of the output speech would deteriorate if that segment data were converted in accordance with the voice quality conversion parameter adjustment unit 104.
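The database-driven check described in this embodiment can be sketched in Python as follows. This is a sketch under stated assumptions, not the patent's implementation: the dictionary layout of the database, the additive conversion model, the lower F0 limit of 50, the per-step offset, and the strategy of lowering the requested amount one step at a time are all hypothetical. Only the upper F0 limit of 350 is taken from the text.

```python
# Allowed range per acoustic feature; the lower bound is a hypothetical value.
LIMITS = {"f0": (50.0, 350.0)}


def convert(segment, amount, per_step):
    """Apply an (assumed additive) conversion of `amount` steps to one segment."""
    return {key: value + per_step.get(key, 0.0) * amount for key, value in segment.items()}


def degrades(database, amount, per_step):
    """True if converting any segment in the database would leave the allowed range."""
    for segment in database.values():
        converted = convert(segment, amount, per_step)
        for key, (low, high) in LIMITS.items():
            if key in converted and not low <= converted[key] <= high:
                return True
    return False


def adjust_amount(database, requested, per_step):
    """Lower the requested conversion amount until no segment degrades."""
    amount = requested
    while amount > 0 and degrades(database, amount, per_step):
        amount -= 1
    return amount


# Hypothetical database: one representative segment per phoneme.
segments = {"a": {"f0": 250.0}, "i": {"f0": 280.0}}
print(adjust_amount(segments, 10, {"f0": 10.0}))  # 7
```

Because the check depends only on the database contents, its result can be reused for any input text as long as the database is not replaced.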

The advantage of this embodiment over Embodiment 4 is that, unless the speech synthesis database 1000 is replaced, the voice quality conversion parameter adjustment processing remains the same no matter what text information 1001 is input. However, if the content of the acoustic feature parameter sequence 100 varies greatly with the content of the text information 1001, the quality of the output speech may deteriorate for some text information 1001.

Note that the acoustic feature parameter sequence 100 in this embodiment need not be generated directly from the speech synthesis database 1000 by speech synthesis processing. The same effect is obtained as long as the voice quality of the acoustic feature parameter sequence 100, however it was generated, is sufficiently similar to the contents of the speech synthesis database 1000.

(Embodiment 6)
FIG. 13 shows the configuration of a speech synthesizer according to Embodiment 6 of the present invention. This device is a voice-quality-variable speech synthesizer capable of synthesizing speech in various voice qualities. It is another configuration from which the same effect as Embodiment 5 can be expected. The difference from Embodiment 5 is that, whereas in Embodiment 5 the voice quality conversion parameter adjustment unit 104 refers directly to the contents of the speech synthesis database 1000, this embodiment has a speech synthesis database feature table 1200 into which only the data necessary and sufficient for detecting quality degradation caused by voice quality conversion processing has been extracted in advance from the speech synthesis database 1000, and the voice quality conversion parameters are adjusted on the basis of this table. If, for example, the speech synthesis database feature table 1200 holds only the upper limit, lower limit, and average value of each acoustic feature parameter extracted from all the data stored in the speech synthesis database 1000, the voice quality conversion parameter adjustment unit 104 can adjust the voice quality conversion parameters while referring, at high speed, only to the data necessary and sufficient for detecting quality degradation of the output speech.
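A feature table of this kind can be sketched in Python as follows. The sketch assumes a hypothetical database layout (one dictionary of feature values per segment) and an additive conversion, under which checking only the shifted minimum and maximum is sufficient. The function names, the lower limit of 50, and the sample values are illustrative, not taken from the patent.

```python
from statistics import mean


def build_feature_table(database):
    """Extract only the lower limit, upper limit, and average of each feature."""
    values_by_feature = {}
    for segment in database.values():
        for key, value in segment.items():
            values_by_feature.setdefault(key, []).append(value)
    return {
        key: {"min": min(values), "max": max(values), "mean": mean(values)}
        for key, values in values_by_feature.items()
    }


def degrades_by_table(table, offsets, limits):
    """Check a conversion against the limits using only the table's extremes.

    `offsets` holds the (assumed additive) change applied to each feature.
    """
    for key, (low, high) in limits.items():
        if key not in table:
            continue
        shift = offsets.get(key, 0.0)
        if table[key]["min"] + shift < low or table[key]["max"] + shift > high:
            return True
    return False


# Hypothetical database and limits (only the upper F0 limit 350 is from the text).
segments = {"a": {"f0": 250.0}, "i": {"f0": 280.0}}
table = build_feature_table(segments)
print(table["f0"])  # {'min': 250.0, 'max': 280.0, 'mean': 265.0}
print(degrades_by_table(table, {"f0": 80.0}, {"f0": (50.0, 350.0)}))  # True
```

The table is built once, so each later check costs a few comparisons per feature instead of a pass over the whole database.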

As in Embodiment 5, the acoustic feature parameter sequence 100 in this embodiment need not be generated directly from the speech synthesis database 1000 by speech synthesis processing. The same effect is obtained as long as the voice quality of the acoustic feature parameter sequence 100, however it was generated, sufficiently matches the contents of the speech synthesis database feature table 1200.

(Embodiment 7)
FIG. 14 shows the configuration of a speech synthesizer according to Embodiment 7 of the present invention. This device is a voice-quality-variable speech synthesizer that takes text information as input and can synthesize speech in various voice qualities. In this configuration, the speech synthesis unit 1002 realizes voice-quality-variable speech synthesis by using the data stored in the speech synthesis database 1000 for synthesis after the feature parameter conversion unit 101 has applied voice quality conversion processing to it. The voice quality conversion parameter adjustment unit 104 adjusts the voice quality conversion parameters by referring to the speech synthesis database 1000, as in Embodiment 5.

Whether to adopt the configuration of this embodiment or that of Embodiment 5 may be decided according to ease of implementation.

As in Embodiment 6, the voice quality conversion parameter adjustment unit 104 may instead adjust the voice quality conversion parameters using a speech synthesis database feature table that holds the data necessary and sufficient for detecting quality degradation of the output speech.

(Embodiment 8)
FIG. 15 shows the configuration of a voice quality conversion device according to Embodiment 8 of the present invention. This device is a voice quality conversion device that takes speech waveform data as input and outputs converted speech waveform data. In this configuration, the input speech waveform 1400 is analyzed by the speech analysis unit 1401, and the resulting acoustic feature parameter sequence 100 is input to the voice quality conversion parameter adjustment unit 104 and the feature parameter conversion unit 101.

With this configuration, voice quality conversion processing can be applied, through an intuitive and easy-to-operate interface and without quality degradation, not only to synthesized speech generated from text information but also to waveform data of natural human speech.

(Embodiment 9)
FIG. 16 shows the configuration of a voice quality conversion device according to Embodiment 9 of the present invention. This device is a voice quality conversion device that takes speech waveform data as input and outputs converted speech waveform data. The difference between this configuration and that of Embodiment 8 is that the voice quality conversion parameter adjustment unit adjusts the voice quality conversion parameters by referring to a waveform feature table 1500 rather than to the acoustic feature parameter sequence 100. The waveform feature table 1500 holds only the data necessary and sufficient for detecting quality degradation caused by voice quality conversion processing, extracted in advance from, for example, the analysis of sample speech uttered beforehand by the same speaker who uttered the speech waveform 1400. For example, it contains only the upper limit, lower limit, and average value of each acoustic feature parameter in the analysis result of the sample speech.

There may be a plurality of waveform feature tables 1500. For example, the waveform feature table 1500 that best represents the features of the speech waveform 1400 and the acoustic feature parameter sequence 100 can be selected and used according to attributes of the speaker such as age and sex.
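Selecting among several waveform feature tables by speaker attributes could look like the following Python sketch. The table layout, the attribute names `sex` and `age_group`, the numeric values, and the match-count scoring rule are all assumptions for illustration; the patent text only states that a table may be chosen according to attributes such as age and sex.

```python
def select_table(tables, speaker):
    """Return the table whose attributes match the most speaker attributes.

    On a tie, the table listed first is returned.
    """
    def score(table):
        return sum(
            1 for key, value in speaker.items()
            if table["attributes"].get(key) == value
        )
    return max(tables, key=score)


# Hypothetical waveform feature tables holding F0 lower limit, upper limit, and average.
tables = [
    {"name": "adult_male", "attributes": {"sex": "male", "age_group": "adult"},
     "f0": {"min": 80.0, "max": 200.0, "mean": 120.0}},
    {"name": "adult_female", "attributes": {"sex": "female", "age_group": "adult"},
     "f0": {"min": 150.0, "max": 320.0, "mean": 220.0}},
    {"name": "child", "attributes": {"age_group": "child"},
     "f0": {"min": 200.0, "max": 340.0, "mean": 270.0}},
]

chosen = select_table(tables, {"sex": "female", "age_group": "adult"})
print(chosen["name"])  # adult_female
```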

With this configuration, the voice quality conversion parameter adjustment unit 104 can adjust the voice quality conversion parameters before the acoustic feature parameter sequence 100 is input. In addition, because the waveform feature table 1500 represents the features of the acoustic feature parameter sequence 100 compactly, the adjustment processing can be performed at high speed.

The voice quality conversion device of the present invention is useful as an agent application or text-to-speech application that uses synthesized speech, a communication device that uses a voice quality conversion function, a voice quality editor for speech, and the like.

FIG. 1 is a block diagram showing the configuration of the voice quality conversion device according to Embodiment 1.
FIG. 2 is a schematic diagram showing the initial operating state of the voice quality conversion device shown in FIG. 1.
FIG. 3 is a schematic diagram showing the state after two voice quality conversion parameter values have been operated from the state shown in FIG. 2.
FIG. 4 is a schematic diagram showing the voice quality conversion parameter adjustment processing when one more voice quality conversion parameter value is operated after the state shown in FIG. 3.
FIG. 5 is a schematic diagram showing the state after one voice quality conversion parameter value has been operated from the initial state in the voice quality conversion device shown in FIG. 1.
FIG. 6 is a schematic diagram showing the voice quality conversion parameter adjustment processing when one more voice quality conversion parameter value is operated after the state shown in FIG. 5.
FIG. 7 is a constraint graph for the case where the voice quality conversion parameter adjustment processing is performed by the Indigo algorithm.
FIG. 8 is a schematic diagram showing the method of arranging the voice quality conversion parameter adjustment controls according to Embodiment 2.
FIG. 9 is a schematic diagram showing the method of arranging the voice quality conversion parameter adjustment controls according to Embodiment 2.
FIG. 10 is a schematic diagram showing the arrangement method and an operation example of the voice quality conversion parameter adjustment controls according to Embodiment 3.
FIG. 11 is a block diagram showing the configuration of the voice-quality-variable speech synthesizer according to Embodiment 4.
FIG. 12 is a block diagram showing the configuration of the voice-quality-variable speech synthesizer according to Embodiment 5.
FIG. 13 is a block diagram showing the configuration of the voice-quality-variable speech synthesizer according to Embodiment 6.
FIG. 14 is a block diagram showing the configuration of the voice-quality-variable speech synthesizer according to Embodiment 7.
FIG. 15 is a block diagram showing the configuration of the voice quality conversion device according to Embodiment 8.
FIG. 16 is a block diagram showing the configuration of the voice quality conversion device according to Embodiment 9.

Explanation of symbols

100 Acoustic feature parameter sequence
101 Feature parameter conversion unit
102 Converted acoustic feature parameter sequence
103 Voice quality conversion control unit
104 Voice quality conversion parameter adjustment unit
105 Voice quality conversion parameter database
106 Speech synthesis parameter limit value database
800 Mutual distance table between voice quality conversion parameters
801 Example arrangement of voice quality conversion parameter adjustment controls in the voice quality conversion control unit
900 Example arrangement of voice quality conversion parameter adjustment controls in the voice quality conversion control unit
901 Example operation of the voice quality conversion parameter adjustment controls arranged in the voice quality conversion control unit
1000 Speech synthesis database
1001 Text information
1002 Speech synthesis unit
1003 Waveform generation unit
1004 Synthesized sound waveform
1005 Speaker device
1200 Speech synthesis database feature table
1400 Speech waveform
1401 Speech analysis unit
t1 to t8 Variables holding intermediate results of the constraint satisfaction calculation in the constraint graph of FIG. 7
C1 to C15 Constraint conditions in the constraint graph of FIG. 7

Claims

A feature parameter conversion unit that performs a predetermined conversion process on a plurality of parameters (acoustic feature parameter string) representing acoustic features of speech;
A voice quality conversion control unit that gives a conversion instruction according to a desired voice quality to the feature parameter conversion unit;
A voice quality parameter adjustment unit,
The feature parameter conversion unit
Performs the conversion process on the acoustic feature parameter string according to the conversion instruction from the voice quality conversion control unit,
The voice quality parameter adjustment unit
Determines, on the assumption that the conversion process according to the conversion instruction from the voice quality conversion control unit has been performed on the acoustic feature parameter string, whether the values of the acoustic feature parameter string after the conversion process are within a predetermined range, and, when at least one parameter in the acoustic feature parameter string after the conversion process is outside the predetermined range, changes the conversion instruction given by the voice quality conversion control unit to the feature parameter conversion unit so that the parameter falls within the predetermined range,
A voice quality conversion device characterized by that.

In claim 1,
A speech synthesis parameter limit value database that holds limit conditions for obtaining a synthesized speech that maintains naturalness for the acoustic feature parameter sequence;
The voice quality parameter adjustment unit
Perform the determination process and the conversion instruction change process with reference to the speech synthesis parameter limit value database,
A voice quality conversion device characterized by that.

In claim 1,
The voice quality parameter adjustment unit
Change the conversion instruction using the constraint satisfaction algorithm.
A voice quality conversion device characterized by that.

In claim 1,
Further comprising a voice quality conversion parameter database that, for each of one or more convertible voice qualities, associates a conversion amount of the voice quality with the conversion amount of the acoustic feature parameter string needed to convert the voice quality by that amount,
The voice quality conversion control unit
Presents the one or more convertible voice qualities to the user, lets the user specify a voice quality and a conversion amount for that voice quality, and gives the conversion instruction to the feature parameter conversion unit based on the voice quality and conversion amount specified by the user and on the voice quality conversion parameter database,
A voice quality conversion device characterized by that.

In claim 4,
The voice quality conversion control unit
Presenting to the user voice quality conversion parameter adjustment controls corresponding to each of the one or more voice qualities;
The voice quality conversion parameter adjustment control is
A movable range corresponding to the convertible amount of the corresponding voice quality, and a pointer capable of indicating an arbitrary position within the movable range,
The voice quality conversion control unit
Allowing the user to specify the voice quality and the conversion amount by operating the pointer of the voice quality conversion parameter adjustment control;
A voice quality conversion device characterized by that.

In claim 5,
The voice quality conversion parameter adjustment control has the shape of a slider or a volume switch.
A voice quality conversion device characterized by that.

In claim 5,
The voice quality conversion control unit
When the pointer of the voice quality conversion parameter adjustment control corresponding to a certain voice quality is operated by the user, changes the movable range of the voice quality conversion parameter adjustment control corresponding to another voice quality, based on the conversion amount of the acoustic feature parameter string corresponding to the conversion amount of the voice quality set by the operation and on the conversion amount of the acoustic feature parameter string corresponding to the other voice quality,
A voice quality conversion device characterized by that.

In claim 5,
The voice quality conversion control unit
When the pointer of the voice quality conversion parameter adjustment control corresponding to a certain voice quality is operated by the user, changes the pointer position of the voice quality conversion parameter adjustment control corresponding to another voice quality, based on the conversion amount of the acoustic feature parameter string corresponding to the conversion amount of the voice quality set by the operation and on the conversion amount of the acoustic feature parameter string corresponding to the other voice quality,
A voice quality conversion device characterized by that.

In any one of claims 5 to 8,
The voice quality conversion parameter adjustment control is
Arranged so that controls whose operation applies similar changes to the acoustic feature parameter sequence are placed closer to each other,
A voice quality conversion device characterized by that.

In any one of claims 5 to 8,
The voice quality conversion parameter adjustment control is
It is arranged so that the angle between the controls becomes smaller as the content of the change added to the acoustic feature parameter sequence is similar according to the operation of each voice quality conversion parameter adjustment control,
A voice quality conversion device characterized by that.

A voice quality conversion device according to claim 1;
A speech synthesis unit that receives text information as input and outputs an acoustic feature parameter string,
The feature parameter conversion unit of the voice quality conversion device performs the predetermined conversion process on the acoustic feature parameter string from the speech synthesis unit.
A speech synthesizer characterized by the above.

A voice quality conversion device according to claim 1;
A speech synthesis database used for speech synthesis processing,
The voice quality conversion parameter adjustment unit of the voice quality conversion device detects that the quality degradation of the output voice occurs when performing the conversion process as instructed by the voice quality conversion control unit with reference to the voice synthesis database,
A speech synthesizer characterized by the above.

A voice quality conversion device according to claim 1;
A speech synthesis database feature table containing characteristic information for detecting a conversion process that causes quality degradation in the output speech related to the speech synthesis database used in the speech synthesis process;
The voice quality conversion parameter adjustment unit of the voice quality conversion device detects that the quality degradation of the output voice occurs when performing the conversion process as instructed by the voice quality conversion control unit with reference to the voice synthesis database feature table,
A speech synthesizer characterized by the above.

A voice quality conversion device according to claim 1;
A speech synthesis unit that receives text information as input and outputs an acoustic feature parameter sequence;
A speech synthesis database used for speech synthesis processing,
The data stored in the speech synthesis database is converted by the feature parameter conversion unit, and speech synthesis is performed using the converted data.
A speech synthesizer characterized by the above.

In claim 1,
A voice analysis unit for converting the voice waveform data into an acoustic feature parameter sequence;
The input voice waveform data is converted into an acoustic feature parameter string by the voice analysis unit, and converted by the feature parameter conversion unit and output, thereby performing voice quality conversion processing of the voice waveform data.
A voice quality conversion device characterized by that.

In claim 15,
One or more waveform feature tables containing characteristic information for detecting a conversion process that causes degradation of quality in output speech when speech waveform data having a specific voice quality is converted,
Detecting quality degradation of the output voice using information corresponding to the input voice waveform data in the waveform feature table;
A voice quality conversion device characterized by that.