JP2013033103A

JP2013033103A - Voice quality conversion device and voice quality conversion method

Info

Publication number: JP2013033103A
Application number: JP2011168623A
Authority: JP
Inventors: Yoshifumi Hirose (廣瀬 良文); Takahiro Kamai (釜井 孝浩)
Original assignee: Panasonic Corp
Current assignee: Panasonic Corp
Priority date: 2011-08-01
Filing date: 2011-08-01
Publication date: 2013-02-14

Abstract

[Problem] To convert the utterance mode of input speech while preserving its naturalness.
[Solution] A voice quality conversion device includes: a vocal tract sound source separation unit 101 that separates input speech into vocal tract information and sound source information; an opening degree calculation unit 102 that calculates, from the vocal tract information separated by the vocal tract sound source separation unit 101 and at each predetermined playback time of the input speech, an opening degree corresponding to the volume of the oral cavity when the input speech was uttered; an opening degree conversion rate determination unit 105 that determines the conversion rate of the opening degree at each predetermined playback time based on an utterance mode conversion degree, which indicates the degree to which the opening degree is to be converted; a vocal tract information conversion unit 103 that, based on the conversion rate determined by the opening degree conversion rate determination unit 105, converts the vocal tract information separated by the vocal tract sound source separation unit 101 into vocal tract information corresponding to the opening degree obtained by converting, at that conversion rate, the opening degree calculated by the opening degree calculation unit 102; and a synthesis unit 104 that generates synthesized speech using the vocal tract information converted by the vocal tract information conversion unit 103 and the sound source information separated by the vocal tract sound source separation unit 101.
[Selected drawing] FIG. 5

Description

The present invention relates to a voice quality conversion device and a voice quality conversion method that convert the voice quality of speech using speech synthesis technology.

In recent years, advances in speech synthesis technology have led to the use of processed speech in a variety of fields.

For example, processed speech is increasingly used in music recording, where vocal tracks are finely corrected.

Recording speech requires an environment such as a dedicated recording studio. Consequently, if speech needs to be corrected after recording is complete, it must be re-recorded, which is costly.

Re-recording may be needed because of a slip of the tongue or a wrong accent. Beyond such errors, which are easy to notice during recording, changes may also become necessary during post-recording editing. For example, there are requests to change the manner of speaking (the "utterance mode" described later), such as making the recorded speech sound slightly more energetic or, conversely, slightly weaker.

For example, if the producer asks for part of a recorded narration to sound "a little more energetic", the narrator must come back to the studio and redo the recording. In practice, however, re-recording is very costly.

One countermeasure in such cases is speech enhancement technology. As a conventional technique for improving speech intelligibility, there is the speech enhancement device described in Patent Document 1. FIG. 15 is a configuration diagram of the speech enhancement device of Patent Document 1.

This speech enhancement device includes a speech enhancement unit 1, which receives pre-enhancement speech from a speech decoder 4; a noise estimation unit 3, which receives a digital signal obtained from an analog signal S2 from a microphone 9; and an enhancement characteristic determination unit 2, which determines an enhancement characteristic based on signals from the speech decoder 4 and the noise estimation unit 3.

Encoded data C1 is input to the speech decoder 4, and the decoded result is input as pre-enhancement speech to the speech enhancement unit 1 and the enhancement characteristic determination unit 2.

Meanwhile, the analog signal S2 (ambient noise) input to the microphone 9 is converted into a digital signal by an A/D converter 8 and input, as a transmission signal, to a speech encoder 7 and the noise estimation unit 3.

The noise estimation unit 3 determines whether the transmission signal is a background noise section or a speech section. For a noise section, it estimates the noise characteristic and supplies it to the enhancement characteristic determination unit 2.

The enhancement characteristic determination unit 2 determines the enhancement characteristic based on both the characteristic of the pre-enhancement speech and the noise characteristic. The enhancement characteristic is input to the speech enhancement unit 1, which enhances the decoded speech accordingly and supplies the enhanced speech to a D/A converter 5. The D/A converter 5 converts the enhanced speech into an analog signal, which is output from a speaker 6 as an analog signal S1.

In this way, the properties of the picked-up ambient noise are estimated, and the speech is adaptively enhanced according to both the properties of the ambient noise and those of the pre-enhancement speech, which provides highly intelligible speech.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2004-289614 (JP 2004-289614 A)

The speech enhancement device of Patent Document 1 enhances speech by changing formant intensity according to the background noise and a characteristic of the input speech (the formant center frequency).

The voice quality of natural speech is affected by various factors, such as the speaking rate, the position within the utterance, and the position within the accent phrase. For example, in natural speech, the beginning of a sentence is uttered clearly and with high intelligibility, whereas toward the end of a sentence articulation becomes lazy and intelligibility tends to fall. In addition, when a word is emphasized, its voice quality tends to be more intelligible than when it is not emphasized.

FIG. 1 shows the human vocal cords and vocal tract. The principle of human speech production is described below. As shown in FIG. 1, human speech is produced when a sound source waveform generated by vibration of the vocal cords 1601 passes through the vocal tract 1604, which extends from the glottis 1602 to the lips 1603, and is shaped by effects such as constrictions formed by articulators such as the tongue. Analysis-synthesis speech synthesis analyzes human speech on the basis of this production principle. Specifically, speech is separated into vocal tract information and sound source information, and the voice quality of the synthesized speech can be converted by modifying the separated parameters. For example, a model called the sound source vocal tract model is used as the speech analysis method. In analysis with this model, speech is separated into sound source information and vocal tract information according to its production process, and the voice quality can be changed by modifying each of them.

FIG. 2 shows the vocal tract transfer characteristics of the same vowel, with the same preceding phoneme, uttered by the same speaker. In FIG. 2, the horizontal axis represents frequency and the vertical axis represents spectral intensity.

Curve 211 in FIG. 2 shows the vocal tract transfer characteristic of the vowel /a/ of the /ma/ in "memai" when "memai ga shimasu" ("I feel dizzy", /memaigashimasxu/) is uttered. Curve 212 shows the vocal tract transfer characteristic of the vowel /a/ of the /ma/ when "oyu ga demasen" ("there is no hot water", /oyugademaseN/) is uttered. Even though both vowels have the same preceding phoneme, the positions and strengths of the formants (the upward peaks), which indicate the resonance frequencies, differ greatly.

One reason is that the vowel /a/ corresponding to curve 211 is close to the beginning of the sentence and belongs to a content word, whereas the vowel /a/ corresponding to curve 212 is close to the end of the sentence and belongs to a function word. Perceptually, too, the vowel /a/ corresponding to curve 211 sounds clearer.

Thus, in natural speech, the manner of utterance varies within a sentence. That is, there are conscious or unconscious differences in how speech is uttered, such as "crisply articulated, clear speech" versus "lazily articulated, unclear speech". Such differences in the manner of utterance are hereinafter called the "utterance mode".

The utterance mode varies under the influence not only of the phonological environment but also of various other linguistic and physiological factors.

The speech enhancement device of Patent Document 1 converts formant intensity based on the background noise and the formant center frequency of the input speech, without taking such variation in utterance mode into account. As a result, the temporal change pattern of the utterance mode of the enhanced speech differs from that of the input speech.

This temporal change in utterance mode is explained with the conceptual diagram of FIG. 3. FIG. 3(a) shows how the utterance mode (intelligibility) of each vowel changes in the input speech "memai ga shimasu" (/memaigashimasxu/). Region X denotes phonemes uttered distinctly, with high intelligibility. Region Y denotes phonemes uttered lazily, with low intelligibility. In this example, the first half of the utterance has a high-intelligibility utterance mode and the second half a low-intelligibility one.

FIG. 3(b), by contrast, shows the temporal change in the utterance mode of enhanced speech obtained by strengthening or weakening formants. Because the formant intensity is converted based only on the background noise and the formant center frequency of the input speech, the temporal variation pattern of the utterance mode cannot be preserved. For example, when the utterance mode varies as in FIG. 3(b), the resulting enhanced speech alternates between sections X, uttered distinctly with high intelligibility, and sections Y, uttered lazily with low intelligibility.

When the temporal change pattern of the utterance mode differs from that of the input speech in this way, the naturalness of the changes in utterance mode cannot be preserved in the converted speech, and as a result the naturalness of the converted speech is greatly degraded.

The present invention solves this conventional problem. Its object is to provide a voice quality conversion device that converts voice quality while preserving the temporal pattern of the utterance mode of the input speech, and thereby converts the utterance mode without reducing naturalness, in other words fluency.

A voice quality conversion device according to one aspect of the present invention includes: a vocal tract sound source separation unit that separates input speech into vocal tract information and sound source information; an opening degree calculation unit that calculates, from the vocal tract information separated by the vocal tract sound source separation unit and at each predetermined playback time of the input speech, an opening degree corresponding to the volume of the oral cavity when the input speech was uttered; an opening degree conversion rate determination unit that determines the conversion rate of the opening degree at each predetermined playback time based on an utterance mode conversion degree, which indicates the degree to which the opening degree is to be converted; a vocal tract information conversion unit that, based on the conversion rate determined by the opening degree conversion rate determination unit, converts the vocal tract information separated by the vocal tract sound source separation unit into vocal tract information corresponding to the opening degree obtained by converting, at that conversion rate, the opening degree calculated by the opening degree calculation unit; and a synthesis unit that generates synthesized speech using the vocal tract information converted by the vocal tract information conversion unit and the sound source information separated by the vocal tract sound source separation unit.

With this configuration, the temporal pattern of the utterance mode in the input speech can be preserved when the utterance mode is converted. Specifically, increasing the utterance mode conversion ratio makes the opening degree relatively larger, yielding distinctly articulated speech that sounds energetic. Conversely, decreasing the utterance mode conversion ratio makes the opening degree relatively smaller, yielding lazily articulated speech that sounds weak. Because the converted speech retains the temporal pattern of changes in utterance mode, voice quality can be converted without degrading naturalness (fluency).

The present invention can be implemented not only as a voice quality conversion device including such characteristic processing units, but also as a voice quality conversion method whose steps are the processes executed by those units. It can also be implemented as a program that causes a computer to function as the characteristic processing units of the voice quality conversion device, or as a program that causes a computer to execute the characteristic steps of the voice quality conversion method. Needless to say, such a program can be distributed via a computer-readable non-transitory recording medium such as a CD-ROM (Compact Disc-Read Only Memory) or via a communication network such as the Internet.

According to the voice quality conversion device of the present invention, the temporal change pattern of the utterance mode in the input speech can be maintained when the utterance mode is converted. Because the converted speech retains the temporal pattern of changes in utterance mode, voice quality can be converted without degrading naturalness (fluency).

FIG. 1: Diagram showing the human speech production mechanism
FIG. 2: Diagram showing differences in vocal tract transfer characteristics due to differences in utterance mode
FIG. 3: Conceptual diagram showing temporal variation in utterance mode
FIG. 4: Diagram showing differences in the vocal tract cross-sectional area function due to differences in utterance mode
FIG. 5: Block diagram showing the functional configuration of the voice quality conversion device according to the embodiment of the present invention
FIG. 6: Diagram showing an example of a vocal tract cross-sectional area function
FIG. 7: Diagram showing the temporal pattern of the opening degree within an utterance
FIG. 8: Diagram showing an example of conversion of the temporal pattern of the opening degree
FIG. 9: Diagram showing an example of conversion coefficients for the vocal tract cross-sectional area function
FIG. 10: Diagram showing an example of conversion of the vocal tract cross-sectional area function
FIG. 11: Diagram showing vocal tract transfer characteristics before and after conversion of the opening degree
FIG. 12: Flowchart showing the operation of the voice quality conversion device according to the embodiment of the present invention
FIG. 13: Block diagram showing the functional configuration of the voice quality conversion device according to a variation of the embodiment of the present invention
FIG. 14: Flowchart showing the operation of the voice quality conversion device according to the variation of the embodiment of the present invention
FIG. 15: Configuration diagram of a conventional voice quality conversion device

Embodiments of the present invention are described below with reference to the drawings. Each embodiment described below is a preferred specific example of the present invention. The numerical values, constituent elements, arrangement and connection of the constituent elements, steps, order of steps, and the like shown in the following embodiments are examples and are not intended to limit the present invention. The present invention is limited only by the claims. Accordingly, constituent elements in the following embodiments that are not recited in the independent claims, which express the broadest concept of the present invention, are not strictly necessary to achieve the object of the present invention, but are described as elements of a more preferable form.

A voice quality conversion device according to one embodiment of the present invention includes: a vocal tract sound source separation unit that separates input speech into vocal tract information and sound source information; an opening degree calculation unit that calculates, from the vocal tract information separated by the vocal tract sound source separation unit and at each predetermined playback time of the input speech, an opening degree corresponding to the volume of the oral cavity when the input speech was uttered; an opening degree conversion rate determination unit that determines the conversion rate of the opening degree at each predetermined playback time based on an utterance mode conversion degree, which indicates the degree to which the opening degree is to be converted; a vocal tract information conversion unit that, based on the conversion rate determined by the opening degree conversion rate determination unit, converts the vocal tract information separated by the vocal tract sound source separation unit into vocal tract information corresponding to the opening degree obtained by converting, at that conversion rate, the opening degree calculated by the opening degree calculation unit; and a synthesis unit that generates synthesized speech using the vocal tract information converted by the vocal tract information conversion unit and the sound source information separated by the vocal tract sound source separation unit.

With this configuration, the temporal pattern (temporal sequence of changes) of the utterance mode in the input speech can be preserved when the utterance mode is converted. Specifically, increasing the utterance mode conversion ratio makes the opening degree relatively larger, yielding distinctly articulated speech that sounds energetic. Conversely, decreasing the utterance mode conversion ratio makes the opening degree relatively smaller, yielding lazily articulated speech that sounds weak. Because the converted speech retains the temporal pattern of changes in utterance mode, voice quality can be converted without degrading naturalness (fluency).

Preferably, the opening degree calculation unit calculates, at each predetermined playback time, a vocal tract cross-sectional area function from the vocal tract information separated by the vocal tract sound source separation unit and, with the oral cavity divided into a plurality of sections, calculates the opening degree as the sum of the vocal tract cross-sectional areas of the sections given by the calculated function.
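The calculation described above can be sketched as follows. This is a minimal illustration, not the method of the publication: it assumes that the vocal tract information for one playback time is available as PARCOR (reflection) coefficients of a lossless acoustic tube, and that adjacent section areas are related by A[i+1] = A[i] * (1 + k[i]) / (1 - k[i]). Both the representation and the sign convention are assumptions, as are the names area_function, opening_degree and reference_area.

```python
def area_function(parcor, reference_area=1.0):
    """Return section areas of a lossless tube from reflection coefficients.

    ASSUMPTION: areas follow A[i+1] = A[i] * (1 + k[i]) / (1 - k[i]),
    starting from a normalised reference section. The sign convention and
    the end of the tube used as reference differ between formulations.
    """
    areas = [reference_area]
    for k in parcor:
        if not -1.0 < k < 1.0:
            raise ValueError("reflection coefficients must lie in (-1, 1)")
        areas.append(areas[-1] * (1.0 + k) / (1.0 - k))
    return areas


def opening_degree(areas, start=0, stop=None):
    """Opening degree as the sum of the section areas in areas[start:stop]."""
    return sum(areas[start:stop])


# One opening degree per playback time (one coefficient list per frame).
frames = [[0.0, 0.0], [0.5]]
degrees = [opening_degree(area_function(k)) for k in frames]
```

The start and stop arguments are there because the text sums over sections of the oral cavity; which tube sections count as the oral cavity is not specified in this passage and would have to be chosen by the implementer.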

More preferably, the voice quality conversion device described above further includes a vowel section extraction unit that extracts vowel sections, which are the time sections of the vowels contained in the input speech, and the vocal tract information conversion unit converts only the vocal tract information in the vowel sections extracted by the vowel section extraction unit, out of the vocal tract information separated by the vocal tract sound source separation unit.

With this configuration, the opening degree is not converted in consonant sections, where the vocal tract information changes rapidly over time. This prevents excessive conversion of the vocal tract shape and thus suppresses degradation of sound quality during voice quality conversion.
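One way to picture the vowel-only restriction is to treat a conversion rate of 1.0 as "leave this playback time unchanged", since the conversion rate is a factor applied to the opening degree. The sketch below assumes that vowel sections are given as (start, end) times in seconds and that rates are held per playback time; the function name restrict_to_vowels and the data layout are hypothetical.

```python
def restrict_to_vowels(times, rates, vowel_sections):
    """Keep the conversion rate inside vowel sections, use 1.0 elsewhere.

    times          -- playback times, one per frame (seconds, assumed)
    rates          -- conversion rate for each playback time
    vowel_sections -- list of (start, end) pairs, end exclusive (assumed)
    """
    masked = []
    for t, rate in zip(times, rates):
        in_vowel = any(start <= t < end for start, end in vowel_sections)
        masked.append(rate if in_vowel else 1.0)
    return masked


times = [0.00, 0.01, 0.02, 0.03]
rates = [1.2, 1.2, 1.2, 1.2]
masked = restrict_to_vowels(times, rates, [(0.01, 0.03)])
```

Frames outside every vowel section keep their original vocal tract information, which is the behaviour the paragraph above attributes to consonant sections.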

The opening degree conversion rate determination unit may also use the utterance mode conversion degree itself as the conversion rate at each predetermined playback time.

With this configuration, the vocal tract information can be converted by directly using an externally input utterance mode conversion degree.

Preferably, the opening degree conversion rate determination unit determines, for each predetermined playback time, the conversion rate by which the opening degree calculated by the opening degree calculation unit is multiplied to obtain the converted opening degree, such that the dynamic range of the temporal sequence of converted opening degrees equals the product of the utterance mode conversion degree and the dynamic range of the temporal sequence of calculated opening degrees.

With this configuration, the dynamic range of the temporal pattern (temporal sequence of changes) of the opening degree of the input speech can be converted based on an externally input utterance mode conversion degree. The voice quality can therefore be converted to speech with more pronounced contrast.
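The dynamic range condition above fixes only the range of the converted sequence, not where the sequence is anchored. As a hedged sketch, the code below anchors the scaling at the mean opening degree, which is an assumption made for illustration and is not stated in the text: each converted value is mean + degree * (value - mean), and the per-time conversion rate is the converted value divided by the original one. The name rates_for_dynamic_range is hypothetical.

```python
def rates_for_dynamic_range(opening, degree):
    """Per-time conversion rates that scale the dynamic range by `degree`.

    ASSUMPTION: the sequence is stretched around its mean, so the mean
    opening degree is unchanged. Other anchors (for example the minimum)
    would satisfy the same dynamic range condition.
    """
    if min(opening) <= 0:
        raise ValueError("opening degrees must be positive")
    mean = sum(opening) / len(opening)
    target = [mean + degree * (c - mean) for c in opening]
    if min(target) <= 0:
        raise ValueError("degree too large: converted opening degree <= 0")
    return [t / c for t, c in zip(target, opening)]


opening = [2.0, 4.0, 6.0]
rates = rates_for_dynamic_range(opening, 1.5)
converted = [r * c for r, c in zip(rates, opening)]
```

With this choice, playback times whose opening degree is above the mean receive a rate above 1 and those below the mean a rate below 1 when the degree exceeds 1, so the shape of the temporal pattern is kept while its contrast grows.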

The opening degree conversion rate determination unit may also determine, for each predetermined playback time, the conversion rate by which the opening degree calculated by the opening degree calculation unit is multiplied to obtain the converted opening degree, such that the mean of the temporal sequence of converted opening degrees equals the product of the utterance mode conversion degree and the mean of the temporal sequence of calculated opening degrees.

With this configuration, the mean of the temporal pattern (temporal sequence of changes) of the opening degree of the input speech can be converted based on an externally input utterance mode conversion degree. The voice quality can therefore be converted so that the utterance mode is changed on average.

The vocal tract information conversion unit may also convert the vocal tract information separated by the vocal tract sound source separation unit by converting, based on the conversion rate determined by the opening degree conversion rate determination unit, the vocal tract cross-sectional areas that represent that vocal tract information.

Preferably, with the oral cavity divided into a plurality of sections, the vocal tract information conversion unit converts the vocal tract information separated by the vocal tract sound source separation unit by multiplying the vocal tract cross-sectional area of each section by a conversion coefficient that corresponds to the conversion rate and is defined for each section.

With this configuration, the vocal tract cross-sectional area function can be transformed with a higher degree of freedom, which prevents degradation of the naturalness of the converted speech.

More preferably, the closer a section is to the lips, the larger the absolute value of the difference between its conversion coefficient and the conversion coefficient of the adjacent section.

With this configuration, the vocal tract cross-sectional area function can be transformed in a way that better matches the speech production mechanism, which prevents degradation of the naturalness of the converted speech.
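The per-section conversion coefficients can be illustrated with a small sketch. The text states only that each section has its own coefficient, that the coefficients correspond to the conversion rate, and that adjacent coefficients differ more near the lips. The quadratic profile used below, the glottis-to-lips ordering of the list, and the names section_coefficients and convert_areas are all assumptions chosen to satisfy those properties. The profile is scaled so that the sum of the converted areas equals the conversion rate times the sum of the original areas.

```python
def section_coefficients(areas, rate):
    """Per-section coefficients for one playback time.

    ASSUMPTIONS: areas[0] is the glottis end and areas[-1] the lips end;
    the coefficient profile is 1 + scale * (i / (n - 1)) ** 2, so the gap
    between neighbouring coefficients grows toward the lips.
    """
    n = len(areas)
    if n < 2:
        raise ValueError("need at least two sections")
    weights = [(i / (n - 1)) ** 2 for i in range(n)]
    weighted = sum(a * w for a, w in zip(areas, weights))
    if weighted <= 0:
        raise ValueError("areas must be positive")
    # Choose scale so that sum(converted areas) == rate * sum(areas).
    scale = (rate - 1.0) * sum(areas) / weighted
    coeffs = [1.0 + scale * w for w in weights]
    if min(coeffs) <= 0:
        raise ValueError("rate too small: a coefficient would be <= 0")
    return coeffs


def convert_areas(areas, coeffs):
    """Multiply the area of each section by its conversion coefficient."""
    return [a * c for a, c in zip(areas, coeffs)]


areas = [1.0, 1.0, 1.0]
coeffs = section_coefficients(areas, 1.5)
converted = convert_areas(areas, coeffs)
```

A rate of 1.0 yields coefficients of exactly 1.0 for every section, so unconverted playback times pass through unchanged. The section nearest the glottis is left untouched by this particular profile, which is a design choice of the sketch rather than something the text requires.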

(Embodiment)
This embodiment of the present invention describes a method of converting voice quality by converting, based on a conversion rate, the opening degree of the original speech (input speech). The opening degree, described later, represents the utterance mode.

As already described, when converting voice quality it is important to preserve the temporal variation of the utterance mode in the input speech. The utterance mode is, for example, a clear, highly intelligible utterance or a lazy, poorly intelligible one.

The utterance mode is affected by, for example, the speech rate, the position within the utterance, or the position within the accent phrase. In natural speech, for example, the beginning of a sentence tends to be uttered clearly, whereas toward the end of the sentence articulation becomes lazy and intelligibility drops. In addition, the utterance mode of a word that is emphasized in the input speech differs from its mode when it is not emphasized.

FIG. 4(a) shows the logarithmic vocal tract cross-sectional area function of the vowel /a/ of the /ma/ in "memai" when the aforementioned sentence "Memai ga shimasu" (/memaigashimasxu/, "I feel dizzy") is uttered. FIG. 4(b) shows the logarithmic vocal tract cross-sectional area function of the vowel /a/ of the /ma/ when "Oyu ga demasen" (/oyugademaseN/, "There is no hot water") is uttered.

The vowel /a/ in FIG. 4(a) is near the beginning of the sentence and belongs to a content word (independent word), so it is uttered clearly and distinctly. The vowel /a/ in FIG. 4(b), by contrast, is near the end of the sentence, so its articulation is lazy and its clarity is low.

By carefully observing the relationship between such differences in utterance mode and the logarithmic vocal tract cross-sectional area function, the present inventors found that the utterance mode is related to the volume of the oral cavity.

That is, the larger the volume of the oral cavity, the clearer and more distinct the utterance tends to be; conversely, the smaller the volume, the lazier and less intelligible the utterance tends to be.

Using the oral cavity volume, which can be calculated from the speech, as an index of the opening degree makes it possible to control the utterance mode of that speech.

In this embodiment, voice quality conversion is performed using the opening degree, which indicates the oral cavity volume, so that the temporal variation of the utterance mode is preserved. This makes it possible to generate synthesized speech with little degradation in naturalness.

FIG. 5 is a block diagram showing the functional configuration of the voice quality conversion device according to the embodiment. The voice quality conversion device includes a vocal tract sound source separation unit 101, an opening degree calculation unit 102, an opening degree conversion rate determination unit 105, a vocal tract information conversion unit 103, and a synthesis unit 104.

The vocal tract sound source separation unit 101 receives the input speech and separates it into vocal tract information and sound source information.

The opening degree calculation unit 102 uses the vocal tract information separated by the vocal tract sound source separation unit 101 to calculate the opening degree in each frame of the input speech. That is, from the separated vocal tract information, the opening degree calculation unit 102 calculates the opening degree, which corresponds to the volume of the oral cavity when the input speech was uttered, for each predetermined playback time (frame) of the input speech.

Based on an externally supplied utterance mode conversion degree, which indicates the degree to which the opening degree is to be converted, the opening degree conversion rate determination unit 105 calculates the opening degree conversion rate, that is, the rate at which the opening degree calculated for each frame by the opening degree calculation unit 102 is converted.

Based on the opening degree conversion rate determined by the opening degree conversion rate determination unit 105, the vocal tract information conversion unit 103 converts the vocal tract information separated by the vocal tract sound source separation unit 101 into vocal tract information that corresponds to the opening degree obtained by converting, at that conversion rate, the opening degree calculated by the opening degree calculation unit 102.

The synthesis unit 104 generates synthesized speech using the vocal tract information converted by the vocal tract information conversion unit 103 and the sound source information separated by the vocal tract sound source separation unit 101.

The voice quality conversion device configured as described above can convert voice quality while preserving the temporal variation of the utterance mode of the input speech.
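The data flow among the five units of FIG. 5 can be sketched as a single function. This is only an illustrative outline: the function name `convert_voice_quality` and the five callables passed in are hypothetical placeholders for units 101 to 105, not interfaces defined in the specification.

```python
from typing import Any, Callable, List, Sequence, Tuple


def convert_voice_quality(
    speech: Sequence[float],
    q: float,
    separate: Callable[[Sequence[float]], Tuple[List[Any], Any]],
    opening_degree: Callable[[Any], float],
    conversion_rates: Callable[[List[float], float], List[float]],
    convert_tract: Callable[[Any, float], Any],
    synthesize: Callable[[List[Any], Any], List[float]],
) -> List[float]:
    """Wire the five units together (all callables are hypothetical stand-ins)."""
    # Unit 101: split the input into per-frame vocal tract info and source info.
    tract_frames, source = separate(speech)
    # Unit 102: one opening degree per frame.
    v = [opening_degree(frame) for frame in tract_frames]
    # Unit 105: one conversion rate per frame, driven by the conversion degree q.
    r = conversion_rates(v, q)
    # Unit 103: convert each frame's vocal tract info with its own rate.
    converted = [convert_tract(frame, r_i) for frame, r_i in zip(tract_frames, r)]
    # Unit 104: resynthesize from the converted tract info and the original source.
    return synthesize(converted, source)
```

Only the vocal tract information passes through the conversion; the sound source information is handed to the synthesis step unchanged, as in the description above.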

Each component is described below.

<Vocal tract sound source separation unit 101>
The vocal tract sound source separation unit 101 separates the input speech into vocal tract information and sound source information by applying a vocal tract sound source model (a speech production model that models the speech production mechanism) to the input speech. The vocal tract sound source model is not restricted; any model may be used.

One example of a vocal tract sound source model is the linear prediction model (LPC model). The linear prediction model predicts a sample value s(n) of the speech waveform from the p sample values that precede it, and s(n) can be expressed as in Equation 1.

The coefficients α_i (i = 1 to p) for the p sample values can be calculated by the correlation method, the covariance method, or the like. With the calculated coefficients α_i, the input speech signal can be generated according to Equation 2.

Here, S(z) is the z-transform of the speech signal s(n), and U(z) is the z-transform of the voiced sound source signal u(n), that is, the signal obtained by inverse filtering the input speech S(z) with the vocal tract information 1/A(z).
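Equations 1 and 2 are not reproduced in this text. For orientation, the textbook forms of linear prediction that match the description above are given below; the sign convention of A(z) is an assumption and may differ from the one used in the original equations.

```latex
s(n) \simeq \sum_{i=1}^{p} \alpha_i \, s(n-i)
\qquad\text{and}\qquad
S(z) = \frac{1}{A(z)}\, U(z), \quad A(z) = 1 - \sum_{i=1}^{p} \alpha_i z^{-i}
```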

Further, PARCOR coefficients (partial autocorrelation coefficients) may be calculated from the linear prediction coefficients α_i obtained by the LPC analysis. PARCOR coefficients are known to have better interpolation properties than linear prediction coefficients. They can be calculated using the Levinson-Durbin-Itakura algorithm. PARCOR coefficients have the following characteristics.

(Characteristic 1) The lower the order of a coefficient, the larger the effect of its fluctuation on the spectrum; the effect becomes smaller as the order increases.

(Characteristic 2) The effect of fluctuations in high-order coefficients is spread flatly over the entire frequency range.
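As a rough illustration of the correlation method mentioned above, the following sketch runs the Levinson-Durbin recursion on the autocorrelation of one frame and returns both the linear prediction coefficients and the reflection (PARCOR) coefficients. The function names are hypothetical, and the sign convention of the PARCOR coefficients is an assumption (conventions differ between textbooks); windowing and numerical safeguards are omitted.

```python
from typing import List, Sequence, Tuple


def autocorrelation(frame: Sequence[float], order: int) -> List[float]:
    """Autocorrelation r(0)..r(order) of one analysis frame."""
    n = len(frame)
    return [
        sum(frame[t] * frame[t - lag] for t in range(lag, n))
        for lag in range(order + 1)
    ]


def levinson_durbin(r: Sequence[float], order: int) -> Tuple[List[float], List[float]]:
    """Return (alpha, k) with s(n) ~ sum_j alpha[j-1] * s(n-j).

    k[m-1] is the m-th reflection coefficient (assumed PARCOR sign convention).
    """
    a = [0.0] * (order + 1)  # a[j] holds alpha_j; a[0] is unused
    k = [0.0] * (order + 1)
    err = r[0]               # prediction error energy of the order-0 model
    for m in range(1, order + 1):
        acc = r[m] - sum(a[j] * r[m - j] for j in range(1, m))
        k[m] = acc / err
        updated = a[:]
        updated[m] = k[m]
        for j in range(1, m):
            updated[j] = a[j] - k[m] * a[m - j]
        a = updated
        err *= 1.0 - k[m] ** 2
    return a[1:], k[1:]
```

For a frame that follows a first-order recursion s(n) = 0.9 s(n-1), the recursion recovers a first coefficient close to 0.9 and a second coefficient close to zero.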

The following description uses PARCOR coefficients as the vocal tract information. The vocal tract information is not limited to PARCOR coefficients, however; linear prediction coefficients or line spectrum pairs (LSP) may be used instead.

An ARX (autoregressive with exogenous input) model may also be used as the vocal tract sound source model. ARX analysis differs greatly from LPC analysis in that it uses a mathematical (formula-based) sound source model as the sound source. Unlike LPC analysis, ARX analysis can separate vocal tract information and sound source information more accurately even when the analysis interval contains a plurality of fundamental periods (Non-Patent Literature 1: "Robust ARX speech analysis method taking the sound source pulse train into account," Journal of the Acoustical Society of Japan, Vol. 58, No. 7, 2002, pp. 386-397).

In ARX analysis, speech is generated by the generation process shown in Equation 3. In Equation 3, S(z) is the z-transform of the speech signal s(n), U(z) is the z-transform of the voiced sound source signal u(n), and E(z) is the z-transform of the unvoiced noise source e(n). That is, in ARX analysis a voiced sound is generated by the first term on the right-hand side of Equation 3, and an unvoiced sound is generated by the second term.

Here, the sound source model shown in Equation 4 is used as the model of the voiced sound source signal u(t) = u(nTs), where Ts is the sampling period.

Here, AV is the voiced sound source amplitude, T0 is the pitch period, and OQ is the glottal open quotient. For voiced sounds, the first term in the upper row of the right-hand side of Equation 4 is used; for unvoiced sounds, the second term in the upper row is used. The glottal open quotient OQ is the proportion of one pitch period during which the glottis is open. It is known that the larger OQ is, the softer the voice tends to sound.

ARX analysis has the following advantages over LPC analysis.

(Advantage 1) Because the analysis places a sound source pulse train corresponding to a plurality of pitch periods within the analysis window, vocal tract information can be extracted stably even from high-pitched speech such as that of women or children.

(Advantage 2) Vocal tract sound source separation performance is particularly high for close vowels such as /i/ and /u/, in which the pitch frequency F0 and the first formant frequency F1 are close to each other.

In voiced segments, as in LPC analysis, U(z) can be obtained by inverse filtering the input speech S(z) with the vocal tract information 1/A(z).
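Inverse filtering amounts to subtracting the linear prediction from each sample. The sketch below assumes the convention A(z) = 1 - Σ α_i z^{-i}, which is not confirmed by this text, and treats samples before the start of the frame as zero; the function name `inverse_filter` is hypothetical.

```python
from typing import List, Sequence


def inverse_filter(s: Sequence[float], alpha: Sequence[float]) -> List[float]:
    """Residual u(n) = s(n) - sum_j alpha[j-1] * s(n-j).

    Assumes A(z) = 1 - sum_j alpha_j z^-j; samples before n = 0 count as zero.
    """
    p = len(alpha)
    u = []
    for n in range(len(s)):
        prediction = sum(
            alpha[j] * s[n - 1 - j] for j in range(p) if n - 1 - j >= 0
        )
        u.append(s[n] - prediction)
    return u
```

Applied to the decaying sequence 1, 0.9, 0.81, ... with alpha = [0.9], the residual is a single unit pulse followed by values that are zero up to rounding error.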

In ARX analysis as well, the vocal tract information 1/A(z) has the same form as the system function in LPC analysis. PARCOR coefficients may therefore be obtained by the same method as in LPC analysis.

<Opening degree calculation unit 102>
The opening degree calculation unit 102 uses the vocal tract information separated by the vocal tract sound source separation unit 101 to calculate, for each speech frame of the input speech, the opening degree corresponding to the volume of the oral cavity. Each speech frame of the input speech includes information on the time at which the input speech was uttered.

For each predetermined playback time, the opening degree calculation unit 102 calculates a vocal tract cross-sectional area function from the vocal tract information separated by the vocal tract sound source separation unit 101. With the oral cavity divided into a plurality of sections, it then calculates the opening degree as the sum of the vocal tract cross-sectional areas of the sections given by that function.

Specifically, the opening degree calculation unit 102 calculates the vocal tract cross-sectional area function from the PARCOR coefficients extracted as the vocal tract information, using Equation 5.

Here, k_i is the i-th order PARCOR coefficient, A_i is the i-th vocal tract cross-sectional area, and A_{N+1} = 1.
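Equation 5 is not reproduced in this text. A commonly used relation between reflection coefficients and the areas of a lossless acoustic tube is A_i / A_{i+1} = (1 - k_i) / (1 + k_i); the sketch below assumes that form and the boundary condition A_{N+1} = 1 stated above. The function name and the sign convention are assumptions.

```python
from typing import List, Sequence


def parcor_to_areas(k: Sequence[float]) -> List[float]:
    """Return [A_1, ..., A_{N+1}] from PARCOR coefficients k_1..k_N.

    Assumed relation: A_i / A_{i+1} = (1 - k_i) / (1 + k_i), with A_{N+1} = 1.
    Requires |k_i| < 1 so that every area stays positive.
    """
    n = len(k)
    areas = [0.0] * (n + 1)
    areas[n] = 1.0  # A_{N+1} = 1 (glottis-side boundary)
    # Work from the glottis side toward the lips (index 0 is section 1).
    for i in range(n - 1, -1, -1):
        areas[i] = areas[i + 1] * (1.0 - k[i]) / (1.0 + k[i])
    return areas
```

With all coefficients equal to zero the tube is uniform (every area is 1); a single coefficient of 0.5 gives A_1 = 1/3.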

FIG. 6 shows the logarithmic vocal tract cross-sectional area function of the vowel /a/ in a certain utterance. The vocal tract from the glottis to the lips is divided into 11 sections; section 11 represents the glottis and section 1 represents the lips.

In FIG. 6, the shaded region can be regarded as roughly the oral cavity. If sections 1 through T are taken to be the oral cavity (T = 5 in FIG. 6), the opening degree V can be defined by Equation 6. T is desirably changed according to the order of the LPC analysis or ARX analysis. For 10th-order LPC analysis, for example, a value of about 3 to 5 is desirable. The specific order, however, is not limited.

The opening degree calculation unit 102 calculates the opening degree V defined above for each frame of the input speech.
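Equation 6 is not reproduced in this text. Following the description that the opening degree is the sum of the cross-sectional areas of the oral-cavity sections, a minimal sketch is shown below; the function names are hypothetical, and index 0 of each area list is assumed to be section 1 (the lips).

```python
from typing import List, Sequence


def opening_degree(areas: Sequence[float], t: int) -> float:
    """Opening degree V: sum of the areas of sections 1..t (lip side first)."""
    if not 1 <= t <= len(areas):
        raise ValueError("t must be between 1 and the number of sections")
    return sum(areas[:t])


def opening_degree_track(frames: Sequence[Sequence[float]], t: int) -> List[float]:
    """Opening degree for every frame, giving the temporal pattern V_i."""
    return [opening_degree(areas, t) for areas in frames]
```

Calling `opening_degree_track` on the per-frame area functions yields the kind of time series plotted in FIG. 7.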

FIG. 7 shows the temporal change in the opening degree calculated by Equation 6 for the utterance "Memai ga shimasu" (/memaigashimasxu/).

As shown, the opening degree varies over time, and naturalness deteriorates if this temporal variation pattern is disrupted.

Using the opening degree (the volume of the oral cavity) calculated from the vocal tract cross-sectional area function in this way makes it possible to take into account not only how far the lips are open but also the shape inside the oral cavity (for example, the position of the tongue), which cannot be observed directly from outside.

In the above description the opening degree is calculated per frame, but the unit is not limited to the frame. For example, the opening degree may be calculated per phoneme. In that case, the opening degree of the frame at the center of the phoneme may be used as the opening degree of the phoneme, or the average opening degree within the phoneme may be used. When the phoneme is the unit, the opening degree in each frame can be calculated by interpolating between the opening degrees of the phonemes before and after that frame. A mora or a syllable may also be used as the unit for calculating the opening degree.
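The phoneme-based variant can be sketched as follows, assuming linear interpolation between phoneme-center frames and holding the first and last values constant outside them. Both choices, and the function name `interpolate_opening`, are assumptions; the text only states that interpolation is used.

```python
from typing import List, Sequence


def interpolate_opening(
    centers: Sequence[int], values: Sequence[float], n_frames: int
) -> List[float]:
    """Per-frame opening degree from per-phoneme values.

    centers: strictly increasing frame indices of the phoneme centers.
    values:  opening degree assigned to each phoneme.
    """
    out = []
    for f in range(n_frames):
        if f <= centers[0]:
            out.append(values[0])       # before the first center: hold
        elif f >= centers[-1]:
            out.append(values[-1])      # after the last center: hold
        else:
            # Index of the last phoneme center at or before frame f.
            j = max(idx for idx, c in enumerate(centers) if c <= f)
            w = (f - centers[j]) / (centers[j + 1] - centers[j])
            out.append((1.0 - w) * values[j] + w * values[j + 1])
    return out
```

For two phonemes centered at frames 2 and 6 with opening degrees 1.0 and 3.0, frame 4 receives the midpoint value 2.0.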

<Opening degree conversion rate determination unit 105>
The opening degree conversion rate determination unit 105 calculates the opening degree conversion rate r_i for the opening degree calculated by the opening degree calculation unit 102 in each frame i. Changing the utterance mode of the input speech in this way yields speech with an utterance mode different from that of the input speech. Specifically, the speech can be converted to a clearer, more energetic utterance mode or to a lazier, less energetic one. The opening degree conversion rate r_i is a coefficient by which the opening degree is multiplied; multiplying the opening degree by r_i converts the opening degree.

One way to calculate the opening degree conversion rate r_i is to use the utterance mode conversion degree q directly as the conversion rate r_i of every frame. Specifically, the opening degree conversion rate determination unit 105 calculates the conversion rate r_i of each frame according to Equation 7.

Here, i is the frame number.

With the opening degree conversion rate r_i calculated in this way, the converted opening degree V'_i (solid line in FIG. 8(a)) is obtained by Equation 8 from the opening degree calculated by the opening degree calculation unit 102 (broken line in FIG. 8(a)).

Converting the opening degree in this way converts the opening degree of the input speech according to the utterance mode conversion degree. As a result, an opening degree pattern corresponding to speech uttered in a different utterance mode can be generated.
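Equations 7 and 8 are not reproduced in this text. From the description (q is used directly as the per-frame rate, and the rate is a multiplier on the opening degree), they can be read as follows; this is a reconstruction from the prose, not a copy of the original equations.

```latex
r_i = q, \qquad V'_i = r_i \, V_i
```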

Alternatively, the dynamic range of the opening degree sequence may be adjusted while the temporal pattern of the opening degree is preserved. That is, for each frame, the opening degree conversion rate determination unit 105 determines the conversion rate r_i so that the product of the utterance mode conversion degree and the dynamic range of the temporal sequence of the opening degree V_i calculated by the opening degree calculation unit 102 equals the dynamic range of the temporal sequence of the converted opening degree V'_i.

Specifically, let V_i be the temporal pattern of the opening degree calculated by the opening degree calculation unit 102, shown by the broken line in FIG. 8(b). The opening degree conversion rate determination unit 105 then uses Equation 9 to calculate the conversion rate r_i that converts the dynamic range of the opening degree of the input speech.

Here, E(x) is a function that calculates the average value of x. The dynamic range is defined by Equation 10. With V'_i denoting the opening degree obtained by converting V_i using the r_i calculated by Equation 9, the relationship between the dynamic range DR before conversion and the dynamic range DR' after conversion is expressed by Equation 11 using the utterance mode conversion degree q. Using the conversion rate r_i calculated in this way makes it possible to convert the dynamic range while preserving the movement of the temporal pattern of the opening degree.

With the opening degree conversion rate r_i calculated in this way, the converted opening degree V'_i can be expressed by Equation 8.

The temporal pattern of the opening degree converted as described above is shown by the solid line in FIG. 8(b); the dynamic range of the temporal pattern has been converted.
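Equations 9 to 11 are not reproduced in this text. One formulation consistent with the description is to scale each frame's deviation from the mean by q, V'_i = E(V) + q (V_i - E(V)), and to take r_i = V'_i / V_i, which gives DR' = q DR when the dynamic range is measured as maximum minus minimum. The sketch below implements that assumed formulation; it is not necessarily the exact form of Equation 9.

```python
from typing import List, Sequence


def dynamic_range_rates(v: Sequence[float], q: float) -> List[float]:
    """Per-frame rates r_i that scale the dynamic range of v by q.

    Assumed form: V'_i = mean + q * (V_i - mean), r_i = V'_i / V_i.
    Every V_i must be positive (it is a sum of cross-sectional areas).
    """
    mean = sum(v) / len(v)
    return [(mean + q * (x - mean)) / x for x in v]


def dynamic_range(v: Sequence[float]) -> float:
    """Dynamic range taken as max minus min (an assumption about Equation 10)."""
    return max(v) - min(v)
```

Because each frame is moved away from or toward the mean in proportion to its own deviation, the mean and the ordering of the frames are unchanged and only the spread is scaled.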

For example, when the utterance mode conversion degree q is greater than 1, the dynamic range of the opening degree pattern is expanded. Speech in segments of the input speech with a large opening degree is therefore converted so that the opening degree becomes larger, and speech in segments with a small opening degree is converted so that the opening degree becomes smaller. As a result, the input speech can be converted into speech with sharper contrast. Specifically, where the opening degree becomes larger, the speech is clearly articulated and sounds as if it were uttered energetically.

Conversely, when the utterance mode conversion degree q is less than 1, the dynamic range is reduced. Speech in segments of the input speech with a large opening degree is therefore converted so that the opening degree becomes smaller. This reduces the variation in utterance mode, so the speech can be converted into calm speech with little contrast overall. Specifically, where the opening degree becomes smaller, the speech is lazily articulated and sounds as if it were uttered without energy.

The opening degree conversion rate determination unit 105 may also adjust the opening degree so as to shift the average value of the opening degree sequence while preserving the temporal pattern of the opening degree. That is, for each frame, the opening degree conversion rate determination unit 105 determines the conversion rate r_i so that the product of the utterance mode conversion degree and the average of the opening degree V_i over the temporal sequence calculated by the opening degree calculation unit 102 equals the average of the converted opening degree V'_i over its temporal sequence.

Specifically, let V_i be the temporal pattern of the opening degree calculated by the opening degree calculation unit 102, shown by the broken line in FIG. 8(c). The opening degree conversion rate determination unit 105 then uses Equation 12 to calculate the conversion rate r_i that shifts (converts) the average value of the opening degree of the input speech.

The temporal pattern of the opening degree converted as described above is shown by the solid line in FIG. 8(c); the pattern has been converted so that its average value is shifted.
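Equation 12 is not reproduced in this text. One formulation consistent with the description is to add the same offset (q - 1) E(V) to every frame, so that the new average is q E(V) while frame-to-frame differences are unchanged, and to express the result as a multiplicative rate r_i = V'_i / V_i. The sketch below implements that assumed formulation; the function name is hypothetical.

```python
from typing import List, Sequence


def mean_shift_rates(v: Sequence[float], q: float) -> List[float]:
    """Per-frame rates r_i that move the average of v to q * mean(v).

    Assumed form: V'_i = V_i + (q - 1) * mean, r_i = V'_i / V_i.
    Every V_i must be positive (it is a sum of cross-sectional areas).
    """
    mean = sum(v) / len(v)
    shift = (q - 1.0) * mean
    return [(x + shift) / x for x in v]
```

Unlike the dynamic range variant, this leaves the spread of the pattern untouched and moves the whole curve up (q > 1) or down (q < 1).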

When the utterance mode conversion degree q is greater than 1, the speech can be converted into speech that is uniformly more clearly articulated and sounds energetic. When q is less than 1, the speech can be converted into speech that is uniformly more lazily articulated and sounds listless.

The method of calculating the opening degree conversion rate r_i is not limited to those described above; any method that changes the magnitude of the opening degree while preserving its temporal pattern may be used.

<Vocal tract information conversion unit 103>
The vocal tract information conversion unit 103 converts the vocal tract information calculated by the vocal tract sound source separation unit 101, using the opening degree conversion rate r_i of frame i calculated as described above. That is, based on the opening degree conversion rate determined by the opening degree conversion rate determination unit 105, the vocal tract information conversion unit 103 converts the vocal tract cross-sectional area that represents the vocal tract information separated by the vocal tract sound source separation unit 101, and thereby converts that vocal tract information.

Specifically, in each frame, the vocal tract information conversion unit 103 uses Equation 13 to calculate a conversion function C_i^k that gives the conversion coefficient for the vocal tract cross-sectional area indicated by the vocal tract cross-sectional area function.

Here, k is the section number in the vocal tract cross-sectional area function, and r_i is the opening degree conversion rate at the lip position. β is a predetermined constant representing the strength of the degree of change in the opening degree in each section. The larger β is, the more steeply the conversion function C_i^k changes in the sections on the lip side.

FIG. 9 is a graph showing the conversion coefficient C_i given by the conversion function C_i^k when β = 0.7 and r_i is varied over -0.4, -0.2, 0, 0.2, and 0.4.

As the figure shows, when the opening degree conversion rate r_i is negative, the conversion coefficient C_i decreases toward the lips, so the vocal tract cross-sectional area gradually narrows; when r_i is positive, C_i increases, so the vocal tract cross-sectional area widens instead. The opening degree can be adjusted by multiplying the vocal tract cross-sectional area function A_i of the input speech by this conversion coefficient C_i, as shown in Equation 14. Equation 14 expresses the product of the conversion function C_i^k and the vocal tract cross-sectional area function A_i^k.

In other words, with the oral cavity divided into a plurality of sections, the vocal tract information conversion unit 103 converts the vocal tract information separated by the vocal tract sound source separation unit 101 by multiplying the vocal tract cross-sectional area of each section by a conversion coefficient that corresponds to the opening degree conversion rate and is defined for each section. The conversion function has the shape shown in FIG. 9. Consequently, the closer a section is to the lips, the larger the absolute value of the difference between its conversion coefficient and the conversion coefficient of the adjacent section.

The conversion function C_i^k applied to the vocal tract cross-sectional area function A_i^k is not limited to Equation 13; any function may be used as long as the value of the conversion coefficient C_i changes more strongly the closer the section is to the lips. Alternatively, the conversion function C_i^k may be designed to convert the cross-sectional areas of the sections corresponding to the oral cavity uniformly.
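Equation 13 is not reproduced in this text, so the sketch below uses a stand-in conversion function, C^k = 1 + r exp(-β (k - 1)), chosen only because it has the properties described above: the coefficient changes most at the lips (k = 1), the change between adjacent sections shrinks toward the glottis, a larger β makes the change steeper, and r = 0 leaves the areas unchanged. It is a hypothetical example of an admissible function, not the function of Equation 13.

```python
import math
from typing import List, Sequence


def conversion_coefficients(r: float, beta: float, n_sections: int) -> List[float]:
    """Stand-in coefficients C^k for sections k = 1..n_sections (k = 1 is the lips).

    Hypothetical form: C^k = 1 + r * exp(-beta * (k - 1)), with r > -1.
    """
    return [1.0 + r * math.exp(-beta * (k - 1)) for k in range(1, n_sections + 1)]


def convert_areas(areas: Sequence[float], r: float, beta: float) -> List[float]:
    """Multiply each section's area by its coefficient (the step of Equation 14)."""
    coeffs = conversion_coefficients(r, beta, len(areas))
    return [c * a for c, a in zip(coeffs, areas)]
```

With r = -0.4 and β = 0.7 the lip section is scaled by 0.6 and the scaling approaches 1 toward the glottis, so the tube narrows gradually toward the lips.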

FIG. 10 shows how the vocal tract cross-sectional area changes when the opening degree of the vowel /a/ is converted using Equation 14. FIG. 10(a) is the logarithmic vocal tract cross-sectional area function of an uttered vowel /a/. The horizontal axis indicates the section number, and the vertical axis indicates the value of the vocal tract cross-sectional area. FIG. 10(b) is the vocal tract cross-sectional area function obtained by converting the opening degree of the function in FIG. 10(a) using Equation 14. Its axes are the same as in FIG. 10(a). As FIG. 10 shows, comparing the vocal tract cross-sectional areas before and after conversion, the area gradually narrows from around section 5 toward the lips (section 1). Converting the vocal tract cross-sectional area in this way simulates the movement of the oral cavity that accompanies actual speech, namely that the vocal tract cross-sectional area changes more the closer a section is to the lips.

FIG. 11 shows the vocal tract transfer characteristics of the vocal tract information obtained as described above. In the figure, the horizontal axis indicates frequency and the vertical axis indicates spectral intensity.

In FIG. 11, the solid line 1101 indicates the vocal tract transfer characteristic corresponding to the vocal tract cross-sectional area function of FIG. 10(a). The broken line 1102 indicates the vocal tract transfer characteristic corresponding to the function of FIG. 10(b), that is, the characteristic of the vocal tract information after conversion by the vocal tract information conversion unit 103. Because the vocal tract information conversion unit 103 converted the opening degree to a smaller value, the formant intensity tends to weaken. In addition, with the conversion of the opening degree, the formant center frequencies of the characteristic shown by the broken line 1102 have shifted relative to the characteristic before conversion (solid line 1101). Converting the opening degree thus modifies not only the formant intensity, as a conventional speech enhancement device does, but also the formant center frequency. Moreover, because the temporal variation pattern of the opening degree retains the same temporal variation as that of the input speech, the naturalness of the speech is also preserved.

<Synthesis unit 104>
The synthesis unit 104 synthesizes speech using the vocal tract information (vocal tract cross-sectional area function) A'_i^k converted by the vocal tract information conversion unit 103 and the sound source information separated by the vocal tract sound source separation unit 101. The synthesis method is not particularly limited. For example, when PARCOR coefficients are used as the vocal tract information, the vocal tract cross-sectional area function A'_i^k may be converted into PARCOR coefficients using Equation 5 and PARCOR synthesis may be used. Alternatively, the PARCOR coefficients may be converted into LPC coefficients before synthesis, or formants may be extracted and formant synthesis used. LSP coefficients may also be calculated from the PARCOR coefficients and used for LSP synthesis.
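The two conversions named above (area function to PARCOR coefficients, then PARCOR to LPC coefficients) can be sketched as follows. This is a minimal illustration, not the specification's implementation: the sign and indexing convention A_i / A_{i+1} = (1 - k_i) / (1 + k_i) is an assumption that may differ from Equation 5, and the function names `areas_to_parcor` and `parcor_to_lpc` are hypothetical. The second function is the standard step-up recursion.

```python
import numpy as np

def areas_to_parcor(areas):
    """Map section areas to reflection (PARCOR) coefficients.

    Assumed convention (may differ from Equation 5 in sign or indexing):
        A_i / A_{i+1} = (1 - k_i) / (1 + k_i)
    which gives k_i = (A_{i+1} - A_i) / (A_{i+1} + A_i).
    N areas yield N - 1 coefficients.
    """
    a = np.asarray(areas, dtype=float)
    return (a[1:] - a[:-1]) / (a[1:] + a[:-1])

def parcor_to_lpc(parcor):
    """Step-up recursion: reflection coefficients to LPC polynomial.

    Returns [1, a_1, ..., a_p] for the order-p predictor polynomial.
    """
    coeffs = np.array([1.0])
    for k in parcor:
        extended = np.append(coeffs, 0.0)
        # a_m[j] = a_{m-1}[j] + k_m * a_{m-1}[m - j]
        coeffs = extended + k * extended[::-1]
    return coeffs
```

A uniform tube (all areas equal) yields zero reflection coefficients and therefore the trivial polynomial [1, 0, ..., 0], which is a quick sanity check on the chosen convention.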

<Flowchart>
The specific operation of the voice quality conversion device according to the present embodiment is described with reference to the flowchart shown in FIG. 12.

In step S101, the vocal tract sound source separation unit 101 separates the input speech into vocal tract information and sound source information.

In step S102, the opening degree calculation unit 102 calculates the opening degree of each frame of the input speech using the vocal tract information separated in step S101.

In step S103, the opening degree conversion rate determination unit 105 determines the opening degree conversion rate based on the opening degree of each frame of the input speech calculated in step S102 and on the separately input utterance mode conversion degree.

In step S104, the vocal tract information conversion unit 103 converts the vocal tract information separated in step S101 based on the opening degree conversion rate determined in step S103.

In step S105, the synthesis unit 104 generates synthesized speech using the vocal tract information converted in step S104 and the sound source information separated in step S101.
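The data flow of steps S101 to S105 can be sketched as a small driver function. Everything here is a hypothetical scaffold: the function name `convert_voice_quality` and the five callables passed in stand in for units 101 to 105, whose actual processing is defined elsewhere in the specification.

```python
def convert_voice_quality(frames, conversion_degree, separate,
                          opening_degree, decide_rates,
                          convert_tract, synthesize):
    """Run the S101-S105 flow over a sequence of input frames.

    The five callables are placeholders for units 101, 102, 105, 103
    and 104 respectively; only the order of operations is illustrated.
    """
    # S101: separate each frame into vocal tract and sound source information.
    separated = [separate(frame) for frame in frames]
    tracts = [tract for tract, _ in separated]
    sources = [source for _, source in separated]

    # S102: opening degree of every frame.
    openings = [opening_degree(tract) for tract in tracts]

    # S103: per-frame conversion rate from the opening degrees and the
    # separately supplied utterance mode conversion degree.
    rates = decide_rates(openings, conversion_degree)

    # S104: convert the vocal tract information of every frame.
    converted = [convert_tract(t, r) for t, r in zip(tracts, rates)]

    # S105: synthesize from converted tract and the original source.
    return [synthesize(t, s) for t, s in zip(converted, sources)]
```

Note that the sound source information passes through unchanged from S101 to S105; only the vocal tract information is modified.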

<Effect>
With this configuration, the utterance mode of the input speech can be converted based on the utterance mode conversion degree while the temporal change pattern of the utterance mode in the input speech is preserved. Specifically, increasing the utterance mode conversion ratio makes the opening degree relatively larger, producing clearly articulated speech that sounds as if it were uttered energetically. Conversely, decreasing the utterance mode conversion ratio makes the opening degree relatively smaller, producing lazily articulated speech that sounds as if it were uttered without energy. In addition, because speech converted by the voice quality conversion device of the embodiment preserves the temporal pattern of changes in the utterance mode, the conversion does not degrade naturalness (fluency).

Specifically, the change pattern of the utterance mode (clarity) of the input speech, shown by the broken lines in FIGS. 8(a) to 8(c), is the same as the change pattern of the utterance mode of the converted speech (the temporal pattern of clear and lazy articulation), shown by the solid lines. As a result, no degradation in sound quality arises from an unnatural utterance mode.

Although the present embodiment has been described using Japanese speech, it is not limited to Japanese; voice quality conversion can be performed in the same way for English and other languages.

For example, in normal speech, the [ei] of "plain" at the end of the sentence "Can I make a phone call from this plain?" and the [ei] of "May" at the beginning of the sentence "May I have a thermometer?" are uttered in different modes (the symbols in [] are International Phonetic Alphabet (IPA) symbols). As in Japanese, the utterance mode changes with the position in the sentence, with whether the word is a content word or a function word, and with the presence or absence of emphasis. Consequently, converting only the formant intensity destroys the temporal change pattern of the utterance mode, just as in Japanese, and the naturalness of the converted speech deteriorates. Converting the voice quality based on the opening degree therefore makes it possible, in English as well, to convert to the target voice quality while preserving the temporal change pattern of the utterance mode in the input speech. Because the converted speech preserves the temporal pattern of changes in the utterance mode, the conversion does not degrade naturalness (fluency).

(Modification)
Next, a modification of the present embodiment is described. In this modification, the voice quality is converted for the vowel sections of the input speech.

FIG. 13 is a block diagram showing a modification of the voice quality conversion device according to the embodiment of the present invention. In FIG. 13, the same components as those in FIG. 5 are denoted by the same reference numerals, and their description is omitted.

The voice quality conversion device according to this modification has the configuration of the voice quality conversion device shown in FIG. 5 with a vowel section extraction unit 201 newly added and a vowel vocal tract information conversion unit 202 used in place of the vocal tract information conversion unit 103.

<Vowel section extraction unit 201>
The vowel section extraction unit 201 extracts vowel sections, which are the time sections of the vowels included in the input speech. The extraction method is not particularly limited. For example, the vowel section extraction unit 201 may recognize the phoneme sequence of the input speech using speech recognition technology and extract the vowel sections from the recognition result. Alternatively, the vowel section extraction unit 201 may calculate the similarity between a vowel speech waveform template held in advance and the speech waveform of the input speech, and extract a section as a vowel section when the calculated similarity exceeds a preset threshold.
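The template-similarity variant described above might look like the following minimal sketch. The similarity measure (cosine similarity between the template and a sliding window), the hop size, and the function name `extract_vowel_sections` are all assumptions for illustration; the specification does not fix any of them.

```python
import numpy as np

def extract_vowel_sections(signal, template, hop, threshold):
    """Return (start, end) sample ranges whose windows resemble the template.

    Assumed similarity: cosine similarity between the stored vowel
    template and each window of the input waveform. Consecutive windows
    above the threshold are merged into one section.
    """
    signal = np.asarray(signal, dtype=float)
    template = np.asarray(template, dtype=float)
    n = len(template)
    sections = []
    start = None
    end = None
    for pos in range(0, len(signal) - n + 1, hop):
        window = signal[pos:pos + n]
        denom = np.linalg.norm(window) * np.linalg.norm(template)
        similarity = float(np.dot(window, template) / denom) if denom > 0 else 0.0
        if similarity > threshold:
            if start is None:
                start = pos
            end = pos + n
        elif start is not None:
            sections.append((start, end))
            start = None
    if start is not None:
        sections.append((start, end))
    return sections
```

A practical system would more likely compare spectral features than raw waveforms, and would hold one template per vowel; the sketch only shows the thresholding and merging logic.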

<Vowel vocal tract information conversion unit 202>
The vowel vocal tract information conversion unit 202 converts, out of the vocal tract information separated by the vocal tract sound source separation unit 101, only the vocal tract information in the vowel sections extracted by the vowel section extraction unit 201.
The specific conversion method is the same as that used by the vocal tract information conversion unit 103, so its detailed description is not repeated.

<Flowchart>
The specific operation of the voice quality conversion device according to this modification is described with reference to the flowchart shown in FIG. 14.

In step S201, the vowel section extraction unit 201 extracts vowel sections from the input speech.

In step S202, the vowel vocal tract information conversion unit 202 converts, based on the opening degree conversion rate determined in step S103, the vocal tract information in the vowel sections extracted in step S201 out of the vocal tract information separated in step S101.

In step S105, the synthesis unit 104 generates synthesized speech using the vocal tract information converted in step S202 and the sound source information separated in step S101. The vocal tract information is not converted for consonant sections. For consonant sections, the synthesis unit 104 therefore generates synthesized speech using the vocal tract information and the sound source information as separated by the vocal tract sound source separation unit 101.
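The vowel-only conversion of step S202 can be sketched as below. The function name `convert_vowel_frames`, the representation of vowel sections as (start, end) time pairs, and the `convert_tract` callable are assumptions for illustration; the per-frame conversion itself is the same as in unit 103.

```python
def convert_vowel_frames(tracts, rates, frame_times, vowel_sections,
                         convert_tract):
    """Convert vocal tract information only for frames inside a vowel section.

    `vowel_sections` is a list of (start, end) times; a frame whose time
    falls in [start, end) is converted, and every other frame (treated
    as a consonant frame) is passed through unchanged.
    """
    converted = []
    for tract, rate, time in zip(tracts, rates, frame_times):
        in_vowel = any(start <= time < end for start, end in vowel_sections)
        converted.append(convert_tract(tract, rate) if in_vowel else tract)
    return converted
```

The unchanged consonant frames are then synthesized in step S105 exactly as they were separated in step S101.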

<Effect>
With this configuration, the utterance mode of the input speech can be converted based on the utterance mode conversion ratio while the temporal change pattern of the utterance mode in the input speech is preserved. Because the converted speech preserves the temporal pattern of changes in the utterance mode, the conversion does not degrade naturalness (fluency). In addition, vocal tract information changes quickly and by a large amount in consonant sections, so modifying it degrades sound quality more than in vowel sections, which last relatively long and in which the vocal tract information moves slowly. Converting only the vowel sections therefore suppresses the degradation in sound quality caused by voice quality conversion. Furthermore, because vowel sections last longer than consonant sections, the effect of the voice quality conversion is maintained.

An input speech acquisition unit (not shown) may acquire the input speech. The input speech acquisition unit acquires the input speech in association with the time at which it was acquired. This time corresponds to the predetermined reproduction time. The input speech acquisition unit sends the acquired input speech to the vocal tract sound source separation unit 101. The input speech acquisition unit may be configured with a microphone or the like. Alternatively, the input speech may be stored in a storage unit included in the input speech acquisition unit.

Each of the devices described above may specifically be configured as a computer system including a microprocessor, ROM, RAM, a hard disk drive, a display unit, a keyboard, a mouse, and the like. A computer program is stored in the RAM or the hard disk drive. Each device achieves its functions through the microprocessor operating according to the computer program. Here, the computer program is a combination of a plurality of instruction codes that give commands to the computer in order to achieve predetermined functions.

Furthermore, some or all of the components of each of the devices described above may be implemented as a single system LSI (Large Scale Integration). A system LSI is a super-multifunctional LSI manufactured by integrating a plurality of components on a single chip; specifically, it is a computer system including a microprocessor, ROM, RAM, and the like. A computer program is stored in the RAM. The system LSI achieves its functions through the microprocessor operating according to the computer program.

Furthermore, some or all of the components of each of the devices described above may be implemented as an IC card or a stand-alone module that can be attached to and detached from the device. The IC card or module is a computer system including a microprocessor, ROM, RAM, and the like. The IC card or module may include the super-multifunctional LSI described above. The IC card or module achieves its functions through the microprocessor operating according to the computer program. The IC card or module may be tamper resistant.

The present invention may also be the methods described above. It may also be a computer program that implements these methods on a computer, or a digital signal consisting of the computer program.

Furthermore, the present invention may be the computer program or the digital signal recorded on a computer-readable non-transitory recording medium, such as a flexible disk, a hard disk, a CD-ROM, an MO, a DVD, a DVD-ROM, a DVD-RAM, a BD (Blu-ray Disc (registered trademark)), or a semiconductor memory. It may also be the digital signal recorded on such a non-transitory recording medium.

The present invention may also transmit the computer program or the digital signal via a telecommunication line, a wireless or wired communication line, a network typified by the Internet, data broadcasting, or the like.

The present invention may also be a computer system including a microprocessor and a memory, in which the memory stores the computer program and the microprocessor operates according to the computer program.

The program or the digital signal may also be implemented by another independent computer system, either by recording it on the non-transitory recording medium and transferring the medium, or by transferring it via the network or the like.

Furthermore, the above embodiment and the above modification may be combined.

The embodiment disclosed here should be considered illustrative in all respects and not restrictive. The scope of the present invention is defined by the claims rather than by the above description, and is intended to include all modifications within the meaning and scope equivalent to the claims.

The voice quality conversion device according to the present invention has a function of converting speech to a target voice quality while preserving the temporal change pattern of the utterance mode in the input speech. It is useful for user interfaces of information equipment and home appliances that require a variety of voice qualities, and for entertainment applications such as ringtones converted to a personalized voice quality. It can also be applied to voice changers and similar uses in voice communication over mobile phones and the like.

DESCRIPTION OF SYMBOLS
101 Vocal tract sound source separation unit
102 Opening degree calculation unit
103 Vocal tract information conversion unit
104 Synthesis unit
105 Opening degree conversion rate determination unit
201 Vowel section extraction unit
202 Vowel vocal tract information conversion unit

Claims

A voice quality conversion device comprising:
a vocal tract sound source separation unit that separates input speech into vocal tract information and sound source information;
an opening degree calculation unit that calculates, for each predetermined reproduction time of the input speech, an opening degree corresponding to the volume in the oral cavity at the time of utterance of the input speech from the vocal tract information separated by the vocal tract sound source separation unit;
an opening degree conversion rate determination unit that determines a conversion rate of the opening degree for each predetermined reproduction time based on an utterance state conversion degree indicating a degree of conversion of the opening degree;
a vocal tract information conversion unit that, based on the conversion rate determined by the opening degree conversion rate determination unit, converts the vocal tract information separated by the vocal tract sound source separation unit into vocal tract information corresponding to the opening degree obtained by converting, at the conversion rate, the opening degree calculated by the opening degree calculation unit; and
a synthesis unit that generates a synthesized sound using the vocal tract information after conversion by the vocal tract information conversion unit and the sound source information separated by the vocal tract sound source separation unit.

The voice quality conversion device according to claim 1, wherein the opening degree calculation unit calculates, for each predetermined reproduction time, a vocal tract cross-sectional area function from the vocal tract information separated by the vocal tract sound source separation unit, and calculates the opening degree as the sum of the vocal tract cross-sectional areas of the sections indicated by the calculated vocal tract cross-sectional area function when the oral cavity is divided into a plurality of sections.

Furthermore, a vowel section extracting unit that extracts a vowel section that is a time section of a vowel included in the input speech is provided,
The vocal tract information conversion unit converts only the vocal tract information in the vowel segment extracted by the vowel segment extraction unit from the vocal tract information separated by the vocal tract sound source separation unit. The voice quality conversion device described in 1.

The voice quality conversion device according to any one of claims 1 to 3, wherein the opening degree conversion rate determination unit determines the utterance state conversion degree as the conversion rate for each predetermined reproduction time.

The voice quality conversion device according to any one of claims 1 to 3, wherein the opening degree conversion rate determination unit determines, for each predetermined reproduction time, the conversion rate by which the opening degree calculated by the opening degree calculation unit is multiplied to obtain the converted opening degree, such that the product of the dynamic range of the temporal change sequence of the opening degree calculated by the opening degree calculation unit and the utterance state conversion degree coincides with the dynamic range of the temporal change sequence of the converted opening degree.

The voice quality conversion device according to claim 1, wherein the opening degree conversion rate determination unit determines, for each predetermined reproduction time, the conversion rate by which the opening degree calculated by the opening degree calculation unit is multiplied to obtain the converted opening degree, such that the product of the average value of the opening degree in the temporal change sequence of the opening degree calculated by the opening degree calculation unit and the utterance state conversion degree coincides with the average value of the opening degree in the temporal change sequence of the converted opening degree.

The voice quality conversion device according to claim 1, wherein the vocal tract information conversion unit converts the vocal tract information separated by the vocal tract sound source separation unit by converting, based on the conversion rate determined by the opening degree conversion rate determination unit, the vocal tract cross-sectional area that represents the vocal tract information.

The voice quality conversion device according to claim 7, wherein the vocal tract information conversion unit converts the vocal tract information separated by the vocal tract sound source separation unit by multiplying, with the oral cavity divided into a plurality of sections, the vocal tract cross-sectional area of each section by a conversion coefficient that corresponds to the conversion rate and is determined for each section.

The voice quality conversion device according to claim 8, wherein the closer a section is to the lips, the larger the absolute value of the difference between the conversion coefficient of that section and the conversion coefficient of the adjacent section.

Separating input speech into vocal tract information and sound source information;
From the separated vocal tract information, calculating an opening degree corresponding to the volume in the oral cavity at the time of uttering the input sound for each predetermined reproduction time of the input sound;
Determining a conversion rate of the opening degree at each predetermined reproduction time based on an utterance state conversion degree indicating a conversion degree of the opening degree;
Converting the separated vocal tract information based on the determined conversion rate into vocal tract information corresponding to the open degree after the calculated open degree is converted at the conversion rate; and
A voice quality conversion method comprising: generating a synthesized sound using the converted vocal tract information and the separated sound source information.

A program for causing a computer to execute the voice quality conversion method according to claim 10.