CN102005205B - Emotional speech synthesizing method and device - Google Patents
- Publication number
- CN102005205B (application CN200910170713A)
- Authority
- CN
- China
- Prior art keywords
- mentioned
- neutral
- speaker
- regular
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The invention provides an emotional speech synthesis method and device. According to one aspect of the invention, an emotional speech synthesis method comprises the steps of: inputting a text sentence; predicting, with a neutral feature model trained from a first speaker's neutral speech corpus, a neutral feature vector of the text sentence in the first speaker's first feature space; transforming, with a speaker normalization model trained from the neutral corpus and a second speaker's parallel corpus, the neutral feature vector into a normalized neutral feature vector in the second speaker's second feature space; converting, with an emotion conversion model trained from the parallel corpus, the normalized neutral feature vector into a normalized emotional feature vector in the second feature space; inversely transforming, with the speaker normalization model, the normalized emotional feature vector into an emotional feature vector in the first feature space; and synthesizing the first speaker's emotional speech from the emotional feature vector in the first feature space.
Description
Technical Field
The present invention relates to information processing technology, in particular to speech synthesis, and more particularly to speaker-independent emotional speech synthesis.
Background Art
At present, most speech synthesis systems built on large speech corpora are based on speech read in a neutral style. For synthesizing emotional speech, the common approach is prosody and spectrum conversion that maps neutral speech to the target emotional speech, for example the GMM-based (Gaussian mixture model) methods described in Non-Patent Documents 1 and 2 and the CART-based (Classification And Regression Tree) method described in Non-Patent Document 2. These prosody and spectrum conversion methods only require building one additional small parallel corpus, which saves considerable development time and cost compared with recording a new large corpus of the target emotional speech. These methods can model the relation between neutral speech features and target emotional speech features, as in the GMM-based approach; alternatively, they can model the relation between linguistic information and the difference between neutral and target emotional speech features, as in the CART-based approach. The GMM-based method performs better than the CART-based method. Moreover, as described in Non-Patent Document 2, the CART and GMM methods can be combined: the CART method first performs a preliminary classification based on linguistic information, and a GMM-based prosody and spectrum conversion model is then built for each class.
However, the above GMM-based prosody and spectrum conversion models are heavily speaker-dependent. That is, if the large neutral corpus and the small parallel corpus do not come from the same speaker, conversion performance degrades severely. Therefore, to obtain high-quality conversion with these methods, the large neutral corpus and the small parallel corpus should come from the same speaker. In practice this is difficult to guarantee, because a customer's request may arrive at any time; for example, several years after the neutral corpus was recorded, even if the original speaker can still be found, his or her voice may have changed considerably over time.
Non-Patent Document 1: L. Mesbahi, V. Barreaud and O. Boeffard, "Comparing GMM-based speech transformation systems", Proc. of INTERSPEECH 2007, Antwerp, Belgium, Aug. 27-31, 2007, pp. 1989-1992, the entire contents of which are hereby incorporated by reference.
Non-Patent Document 2: J. Tao, Y. Kang and A. Li, "Prosody conversion from neutral speech to emotional speech", IEEE Trans. on Audio, Speech and Language Processing, Vol. 14, No. 4, 2006, pp. 1145-1154, the entire contents of which are hereby incorporated by reference.
Summary of the Invention
The present invention has been made in view of the above problems in the prior art, and its object is to provide a speaker-independent emotional speech synthesis method and device that effectively improve the performance of prosody and spectrum conversion.
According to one aspect of the present invention, an emotional speech synthesis method is provided, comprising the steps of: inputting a text sentence; predicting, with a neutral feature model trained from a first speaker's neutral speech corpus, a neutral feature vector of the text sentence in the first speaker's first feature space; transforming, with a speaker normalization model trained from the neutral corpus and a second speaker's parallel corpus, the neutral feature vector into a normalized neutral feature vector in the second speaker's second feature space; converting, with an emotion conversion model trained from the parallel corpus, the normalized neutral feature vector into a normalized emotional feature vector in the second feature space; inversely transforming, with the speaker normalization model, the normalized emotional feature vector into an emotional feature vector in the first feature space; and synthesizing the first speaker's emotional speech from the emotional feature vector in the first feature space.
According to another aspect of the present invention, an emotional speech synthesis device is provided, comprising: an input unit that inputs a text sentence; a prediction unit that predicts, with a neutral feature model trained from a first speaker's neutral speech corpus, a neutral feature vector of the text sentence in the first speaker's first feature space; a transformation unit that transforms, with a speaker normalization model trained from the neutral corpus and a second speaker's parallel corpus, the neutral feature vector into a normalized neutral feature vector in the second speaker's second feature space; a conversion unit that converts, with an emotion conversion model trained from the parallel corpus, the normalized neutral feature vector into a normalized emotional feature vector in the second feature space; an inverse transformation unit that inversely transforms, with the speaker normalization model, the normalized emotional feature vector into an emotional feature vector in the first feature space; and a synthesis unit that synthesizes the first speaker's emotional speech from the emotional feature vector in the first feature space.
Brief Description of the Drawings
It is believed that the above features, advantages and objects of the present invention will be better understood from the following description of specific embodiments of the present invention in conjunction with the accompanying drawings.
Fig. 1 is a flowchart of an emotional speech synthesis method according to an embodiment of the present invention.
Fig. 2 shows an example of a speaker normalization model according to an embodiment of the present invention.
Fig. 3 shows another example of a speaker normalization model according to an embodiment of the present invention.
Fig. 4 is a block diagram of an emotional speech synthesis device according to another embodiment of the present invention.
Detailed Description
Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Emotional Speech Synthesis Method
Fig. 1 is a flowchart of an emotional speech synthesis method according to an embodiment of the present invention. This embodiment is described below with reference to the figure.
As shown in Fig. 1, first, at step 101, a text sentence is input. In this embodiment, the input text sentence may be any sentence of text known to those skilled in the art, and may be in any language, such as Chinese, English or Japanese; the present invention places no restriction on this.
Next, at step 105, linguistic information 60 is extracted from the input text sentence by text analysis. In this embodiment, the linguistic information 60 includes the length of the sentence and, for each character (word) in it, the grapheme, pinyin, phoneme type, tone, part of speech, position in the sentence, boundary types with the preceding and following characters (words), distances to the preceding and following pauses, and so on. The text analysis method used to extract the linguistic information 60 may be any method known to those skilled in the art; the present invention places no restriction on this.
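As a concrete illustration, the linguistic information 60 for a single character might be organized as below. This is a hypothetical sketch; every field name is illustrative rather than defined by the patent.

```python
# Hypothetical linguistic-feature record for one character of an input
# sentence; field names are illustrative, not defined by the patent.
linguistic_info = {
    "sentence_length": 12,        # characters in the sentence
    "grapheme": "好",
    "pinyin": "hao3",
    "phoneme_type": "vowel-final",
    "tone": 3,
    "part_of_speech": "adjective",
    "position_in_sentence": 2,
    "left_boundary": "word-internal",
    "right_boundary": "word-final",
    "distance_to_prev_pause": 2,  # in characters
    "distance_to_next_pause": 5,
}
```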
It should be noted that step 105 is optional; the process may also proceed from inputting the text sentence at step 101 directly to step 110.
At step 110, the neutral feature model 30, trained from the first speaker's neutral speech corpus 10, is used to predict the neutral feature vector of the text sentence input at step 101 in the first speaker's first feature space.
In this embodiment, the neutral speech corpus 10 contains the first speaker's neutral speech, i.e. speech read in a neutral style. The neutral corpus 10 may be any speech corpus known to those skilled in the art, for example the neutral corpora described in Non-Patent Documents 1 and 2. Likewise, the method of training the neutral feature model 30 from the neutral corpus 10, and the resulting model itself, may be any known to those skilled in the art, for example those described in Non-Patent Documents 1 and 2. The feature vectors of the neutral feature model 30 may contain one or more of prosodic features (such as duration, fundamental frequency contour, pauses and energy) and spectral features. The present invention merely uses the neutral feature model 30 at step 110, and places no restriction on the neutral corpus 10, the training method, or the model itself.
At step 110, if no linguistic information 60 was extracted at step 105, the neutral feature model 30 alone is used to predict the neutral feature vector of the text sentence input at step 101 in the first speaker's first feature space. If linguistic information 60 was extracted at step 105, the neutral feature vector is predicted with the neutral feature model 30 based on that linguistic information. The prediction method may be any method known to those skilled in the art, for example the prediction methods described in Non-Patent Documents 1 and 2. The predicted neutral feature vector may contain one or more of prosodic features (such as duration, fundamental frequency contour, pauses and energy) and spectral features.
Next, at step 115, the speaker normalization model 50, trained from the neutral corpus 10 and the second speaker's parallel corpus 20, is used to transform the neutral feature vector predicted at step 110 into a normalized neutral feature vector in the second speaker's second feature space. The transformed normalized neutral feature vector may likewise contain one or more of prosodic features (such as duration, fundamental frequency contour, pauses and energy) and spectral features.
In this embodiment, the second speaker's parallel corpus 20 contains the second speaker's neutral speech and target emotional speech in pairs; that is, each text sentence is read in both the neutral style and the target emotional style.
One example of the speaker normalization model 50 and of the transformation performed at step 115 is described in detail below with reference to Fig. 2.
Fig. 2 shows an example of the speaker normalization model 50 according to an embodiment of the present invention. As shown in Fig. 2, to train the speaker normalization model 50, the neutral corpus 10 is first divided into m classes 1-1, 1-2, ..., 1-m according to a classification rule 70. The classification rule 70 may vary empirically with the feature, for example one or more of: classifying by phoneme type for duration and spectral features, by tone type for fundamental frequency contours, and by position in the sentence for energy. The parallel corpus 20 is then divided into the corresponding m classes 2-1, 2-2, ..., 2-m using the same classification rule 70. Next, statistics 71-i and 72-i are computed for each pair of classes 1-i and 2-i; these statistics may be, for example, the mean μ and the covariance matrix Σ of the feature vectors extracted from each class. In this case, the speaker normalization model 50 consists of the classification rule 70 and the statistics 71-i and 72-i.
Returning to Fig. 1, when the speaker normalization model 50 consists of the classification rule 70 and the statistics 71-i and 72-i, step 115 first uses the linguistic information 60 extracted at step 105 to look up the class x to which the neutral feature vector predicted at step 110 belongs, and then transforms that neutral feature vector into a normalized neutral feature vector in the second speaker's second feature space according to the following formula (1):
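The published equation appears only as an image in the original; a mean-variance mapping consistent with the surrounding description (the exact covariance scaling is an assumption) is:

$$v'_n = \mu_{2,x} + \Sigma_{2,x}^{1/2}\,\Sigma_{1,x}^{-1/2}\left(v_n - \mu_{1,x}\right) \tag{1}$$

where $(\mu_{1,x}, \Sigma_{1,x})$ are the statistics 71-x of class x in the neutral corpus 10 and $(\mu_{2,x}, \Sigma_{2,x})$ are the statistics 72-x of the corresponding class in the parallel corpus 20.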
Another example of the speaker normalization model 50 and of the transformation performed at step 115 is described in detail below with reference to Fig. 3.
Fig. 3 shows another example of the speaker normalization model according to an embodiment of the present invention. As shown in Fig. 3, to train the speaker normalization model 50, a GMM-based first feature space model (λ1, μ1, Σ1) 80 of the first speaker is first trained from the neutral corpus 10. In the first feature space model 80, λ1 denotes the set of component weights, μ1 the set of component means, and Σ1 the set of component covariance matrices. The parallel corpus 20 is then used to adapt the first speaker's first feature space model 80 into the second speaker's second feature space model (λ2, μ2, Σ2) 90, where λ2, μ2 and Σ2 likewise denote the sets of component weights, means and covariance matrices of the second feature space model 90. The number of components in these feature space models should be large enough for the models to describe the feature spaces accurately. Furthermore, the components of the two feature space models 80 and 90 before and after adaptation can be regarded as coupled. The adaptation method may be MAP (Maximum A Posteriori), MCE (Minimum Classification Error), MMI (Maximum Mutual Information) or any other available algorithm; the present invention places no restriction on this. Note that, because the data in the second speaker's parallel corpus 20 are limited, usually only the model means μ are adapted, so it is assumed that λ1 = λ2 = λ and Σ1 = Σ2 = Σ. The method of training the GMM-based feature space models may be any method known to those skilled in the art, for example the training methods described in Non-Patent Documents 1 and 2. In this case, the speaker normalization model 50 consists of the first feature space model 80 and the second feature space model 90.
Returning to Fig. 1, when the speaker normalization model 50 consists of the first feature space model 80 and the second feature space model 90, step 115 first computes the probability Pi of the neutral feature vector predicted at step 110 for each component i of the first feature space model (λ1, μ1, Σ1) 80. Optionally, Pi can be computed according to the following formula (2):
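The published equation is an image; a standard form consistent with the description, assuming $N(\cdot;\mu,\Sigma)$ denotes the multivariate Gaussian density, is:

$$P_i = N\!\left(v_n;\, \mu_{1,i},\, \Sigma_{1,i}\right) \tag{2}$$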
Next, the weight wi of the probability Pi for each component i of the first feature space model (λ1, μ1, Σ1) 80 is computed. Optionally, wi can be computed according to the following formula (3):
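Reconstructed here, on the same hedged basis, as the usual posterior weight over the shared component priors $\lambda_i$:

$$w_i = \frac{\lambda_i\,P_i}{\sum_{j=1}^{M} \lambda_j\,P_j} \tag{3}$$

where M is the number of mixture components.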
Then, the normalized neutral feature vector v′n produced by the transformation of step 115 is computed from the weights wi and the second feature space model (λ2, μ2, Σ2) 90. Optionally, v′n can be computed according to the following formula (4):
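Under the stated assumptions λ1 = λ2 = λ and Σ1 = Σ2 = Σ, the standard GMM mapping reduces to a posterior-weighted shift between the coupled component means, which is the reconstruction adopted here:

$$v'_n = v_n + \sum_{i=1}^{M} w_i\left(\mu_{2,i} - \mu_{1,i}\right) \tag{4}$$

A minimal sketch of steps (2)-(4), assuming the reconstructed formulas above:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_normalize(v, lam, means1, means2, covs):
    """Map a vector from speaker 1's feature space to speaker 2's.

    lam:    (M,)      shared component weights (lambda1 = lambda2)
    means1: (M, D)    component means of model 80
    means2: (M, D)    adapted component means of model 90
    covs:   (M, D, D) shared component covariances
    """
    # (2) component likelihoods under the first feature-space model
    p = np.array([multivariate_normal.pdf(v, mean=m, cov=c)
                  for m, c in zip(means1, covs)])
    # (3) posterior weights over the shared priors
    w = lam * p
    w = w / w.sum()
    # (4) posterior-weighted shift between the coupled means
    return v + w @ (means2 - means1)
```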
After step 115 has used the speaker normalization model 50 to transform the neutral feature vector vn predicted at step 110 into the normalized neutral feature vector v′n in the second speaker's second feature space, step 120 uses the emotion conversion model 40, trained from the parallel corpus 20, to convert v′n into a normalized emotional feature vector v′e in the second feature space. The converted normalized emotional feature vector v′e may contain one or more of prosodic features (such as duration, fundamental frequency contour, pauses and energy) and spectral features.
In this embodiment, the emotion conversion model 40 may be trained with a GMM-based method, a CART-based method, other methods, or any combination thereof. In line with the features contained in the feature vectors described above, the emotion conversion model 40 may also comprise one or more of a duration conversion model, a fundamental frequency contour conversion model, a pause conversion model, an energy conversion model, a spectrum conversion model, and so on.
Furthermore, in this embodiment, if the emotion conversion model 40 is trained with a CART-based method, the linguistic information 60 must be extracted from the input text sentence at step 105, so that step 120 can use it, together with the emotion conversion model 40, to convert the normalized neutral feature vector v′n into the normalized emotional feature vector v′e.
Next, at step 125, the speaker normalization model 50 is used to inversely transform the normalized emotional feature vector v′e into an emotional feature vector ve in the first feature space. The inversely transformed emotional feature vector ve may contain one or more of prosodic features (such as duration, fundamental frequency contour, pauses and energy) and spectral features.
Optionally, when the speaker normalization model 50 consists of the classification rule 70 and the statistics 71-i and 72-i as shown in Fig. 2, the normalized emotional feature vector v′e is inversely transformed into the emotional feature vector ve according to the following formula (5):
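Reconstructed, on the same hedged basis, as the inverse of the class-based mapping in formula (1) (the published equation is an image):

$$v_e = \mu_{1,x} + \Sigma_{1,x}^{1/2}\,\Sigma_{2,x}^{-1/2}\left(v'_e - \mu_{2,x}\right) \tag{5}$$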
Alternatively, when the speaker normalization model 50 consists of the first feature space model 80 and the second feature space model 90 as shown in Fig. 3, the probability P′i of the normalized emotional feature vector v′e for each component of the second feature space model (λ2, μ2, Σ2) 90 is first computed. Optionally, P′i can be computed according to the following formula (6):
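Reconstructed, by symmetry with formula (2), as the component likelihood under the second feature space model:

$$P'_i = N\!\left(v'_e;\, \mu_{2,i},\, \Sigma_{2,i}\right) \tag{6}$$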
Next, the weight w′i of the probability P′i for each component i of the second feature space model 90 is computed. Optionally, w′i can be computed according to the following formula (7):
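Reconstructed, by symmetry with formula (3):

$$w'_i = \frac{\lambda_i\,P'_i}{\sum_{j=1}^{M} \lambda_j\,P'_j} \tag{7}$$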
Then, the emotional feature vector ve produced by the inverse transformation of step 125 is computed from the weights w′i and the first feature space model (λ1, μ1, Σ1) 80. Optionally, ve can be computed according to the following formula (8):
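Reconstructed, by symmetry with formula (4), as the posterior-weighted shift back to the first speaker's component means:

$$v_e = v'_e + \sum_{i=1}^{M} w'_i\left(\mu_{1,i} - \mu_{2,i}\right) \tag{8}$$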
Finally, at step 130, the first speaker's emotional speech is synthesized from the emotional feature vector ve in the first feature space.
In this embodiment, the method of synthesizing the target emotional speech from the emotional feature vector may be any method known to those skilled in the art, for example the synthesis methods described in Non-Patent Documents 1 and 2; the present invention places no restriction on this.
With the emotional speech synthesis method of this embodiment, the first speaker's emotional speech can be synthesized using a parallel corpus from a second speaker different from the first speaker, which effectively improves the performance of prosody and spectrum conversion when the large neutral corpus and the small parallel corpus do not come from the same speaker.
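Tying steps 101 through 130 together, the overall flow of Fig. 1 can be summarized as the following sketch; all object interfaces here are hypothetical stand-ins for the models described above, not APIs defined by the patent.

```python
def synthesize_emotional_speech(text, frontend, neutral_model,
                                norm_model, emotion_model, vocoder):
    ling = frontend.analyze(text)                   # step 105 (optional)
    v_n = neutral_model.predict(text, ling)         # step 110: speaker 1 space
    v_n_reg = norm_model.forward(v_n, ling)         # step 115: to speaker 2 space
    v_e_reg = emotion_model.convert(v_n_reg, ling)  # step 120: neutral -> emotional
    v_e = norm_model.inverse(v_e_reg)               # step 125: back to speaker 1
    return vocoder.synthesize(v_e)                  # step 130
```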
Emotional Speech Synthesis Device
Based on the same inventive concept, Fig. 4 is a block diagram of an emotional speech synthesis device according to another embodiment of the present invention. This embodiment is described below with reference to the figure; descriptions of parts identical to the preceding embodiment are omitted where appropriate.
As shown in Fig. 4, the emotional speech synthesis device 400 of this embodiment comprises: an input unit 401 that inputs a text sentence; a prediction unit 410 that predicts, with the neutral feature model 30 trained from the first speaker's neutral speech corpus 10, the neutral feature vector of the text sentence in the first speaker's first feature space; a transformation unit 415 that transforms, with the speaker normalization model 50 trained from the neutral corpus 10 and the second speaker's parallel corpus 20, the neutral feature vector into a normalized neutral feature vector in the second speaker's second feature space; a conversion unit 420 that converts, with the emotion conversion model 40 trained from the parallel corpus 20, the normalized neutral feature vector into a normalized emotional feature vector in the second feature space; an inverse transformation unit 425 that inversely transforms, with the speaker normalization model 50, the normalized emotional feature vector into an emotional feature vector in the first feature space; and a synthesis unit 430 that synthesizes the first speaker's emotional speech from the emotional feature vector in the first feature space.
In this embodiment, the text sentence input by the input unit 401 may be any sentence of text known to those skilled in the art, and may be in any language, such as Chinese, English or Japanese; the present invention places no restriction on this.
The emotional speech synthesis device 400 of this embodiment optionally has an extraction unit 405 that extracts linguistic information 60 from the text sentence input by the input unit 401 by text analysis. In this embodiment, the linguistic information 60 includes the length of the sentence and, for each character (word) in it, the grapheme, pinyin, phoneme type, tone, part of speech, position in the sentence, boundary types with the preceding and following characters (words), distances to the preceding and following pauses, and so on. The text analysis method used by the extraction unit 405 may be any method known to those skilled in the art; the present invention places no restriction on this.
In this embodiment, the neutral speech corpus 10 contains the first speaker's neutral speech, i.e. speech read in a neutral style. The neutral corpus 10 may be any speech corpus known to those skilled in the art, for example the neutral corpora described in Non-Patent Documents 1 and 2. Likewise, the method of training the neutral feature model 30 from the neutral corpus 10, and the resulting model itself, may be any known to those skilled in the art, for example those described in Non-Patent Documents 1 and 2. The feature vectors of the neutral feature model 30 may contain one or more of prosodic features (such as duration, fundamental frequency contour, pauses and energy) and spectral features. The present invention merely uses the neutral feature model 30 in the prediction unit 410, and places no restriction on the neutral corpus 10, the training method, or the model itself.
When the extraction unit 405 has not extracted linguistic information 60, the prediction unit 410 uses the neutral feature model 30 to predict the neutral feature vector of the text sentence input by the input unit 401 in the first speaker's first feature space. When the extraction unit 405 has extracted linguistic information 60, the prediction unit 410 predicts the neutral feature vector with the neutral feature model 30 based on that information. The prediction method may be any method known to those skilled in the art, for example the prediction methods described in Non-Patent Documents 1 and 2. The predicted neutral feature vector may contain one or more of prosodic features (such as duration, fundamental frequency contour, pauses and energy) and spectral features.
The transformation unit 415 uses the speaker normalization model 50, trained from the neutral corpus 10 and the second speaker's parallel corpus 20, to transform the neutral feature vector predicted by the prediction unit 410 into a normalized neutral feature vector in the second speaker's second feature space. The transformed normalized neutral feature vector may likewise contain one or more of prosodic features (such as duration, fundamental frequency contour, pauses and energy) and spectral features.
In this embodiment, the second speaker's parallel corpus 20 contains the second speaker's neutral speech and target emotional speech in pairs; that is, each text sentence is read in both the neutral style and the target emotional style.
In this embodiment, the speaker normalization model 50 may be the model consisting of the classification rule 70 and the statistics 71-i and 72-i shown in Fig. 2, or the model consisting of the first feature space model 80 and the second feature space model 90 shown in Fig. 3.
When the speaker normalization model 50 consists of the classification rule 70 and the statistics 71-i and 72-i, the transformation unit 415 comprises a lookup unit and a calculation unit. The lookup unit uses the linguistic information 60 extracted by the extraction unit 405 to look up the class x corresponding to the neutral feature vector predicted by the prediction unit 410. The calculation unit computes the normalized neutral feature vector according to formula (1) above.
When the speaker normalization model 50 consists of the first feature space model 80 and the second feature space model 90, the transformation unit 415 comprises a probability calculation unit, a weight calculation unit and a feature vector calculation unit. The probability calculation unit computes the probability Pi of the neutral feature vector predicted by the prediction unit 410 for each component i of the first feature space model (λ1, μ1, Σ1) 80, optionally according to formula (2) above.
The weight calculation unit computes the weight wi of the probability Pi for each component i of the first feature space model (λ1, μ1, Σ1) 80, optionally according to formula (3) above.
The feature vector calculation unit computes the transformed normalized neutral feature vector v′n from the weights wi computed by the weight calculation unit and the second feature space model (λ2, μ2, Σ2) 90, optionally according to formula (4) above.
After the transformation unit 415 has used the speaker normalization model 50 to transform the neutral feature vector vn predicted by the prediction unit 410 into the normalized neutral feature vector v′n in the second speaker's second feature space, the conversion unit 420 uses the emotion conversion model 40, trained from the parallel corpus 20, to convert v′n into a normalized emotional feature vector v′e in the second feature space. The converted normalized emotional feature vector v′e may contain one or more of prosodic features (such as duration, fundamental frequency contour, pauses and energy) and spectral features.
In this embodiment, the emotion conversion model 40 may be trained with a GMM-based method, a CART-based method, other methods, or any combination thereof. In line with the features contained in the feature vectors described above, the emotion conversion model 40 may also comprise one or more of a duration conversion model, a fundamental frequency contour conversion model, a pause conversion model, an energy conversion model, a spectrum conversion model, and so on.
Furthermore, in this embodiment, if the emotion conversion model 40 is trained with a CART-based method, the extraction unit 405 must extract the linguistic information 60 from the input text sentence, so that the conversion unit 420 can use it, together with the emotion conversion model 40, to convert the normalized neutral feature vector v′n into the normalized emotional feature vector v′e.
The inverse transformation unit 425 uses the speaker normalization model 50 to inversely transform the normalized emotional feature vector v′e into an emotional feature vector ve in the first feature space. The inversely transformed emotional feature vector ve may contain one or more of prosodic features (such as duration, fundamental frequency contour, pauses and energy) and spectral features.
Optionally, when the speaker normalization model 50 consists of the classification rule 70 and the statistics 71-i and 72-i as shown in Fig. 2, the normalized emotional feature vector v′e is inversely transformed into the emotional feature vector ve according to formula (5) above.
Alternatively, when the speaker normalization model 50 consists of the first feature space model 80 and the second feature space model 90 as shown in Fig. 3, the inverse transformation unit 425 comprises a probability calculation unit, a weight calculation unit and a feature vector calculation unit. The probability calculation unit computes the probability P′i of the normalized emotional feature vector v′e for each component of the second feature space model (λ2, μ2, Σ2) 90, optionally according to formula (6) above.
The weight calculation unit computes the weight w′i of the probability P′i for each component i of the second feature space model 90, optionally according to formula (7) above.
The feature vector calculation unit computes the inversely transformed emotional feature vector ve from the weights w′i computed by the weight calculation unit and the first feature space model (λ1, μ1, Σ1) 80, optionally according to formula (8) above.
Finally, the synthesis unit 430 synthesizes the first speaker's emotional speech from the emotional feature vector ve in the first feature space.
In this embodiment, the method of synthesizing the target emotional speech from the emotional feature vector may be any method known to those skilled in the art, for example the synthesis methods described in Non-Patent Documents 1 and 2; the present invention places no restriction on this.
With the emotional speech synthesis device 400 of this embodiment, the first speaker's emotional speech can be synthesized using a parallel corpus from a second speaker different from the first speaker, which effectively improves the performance of prosody and spectrum conversion when the large neutral corpus and the small parallel corpus do not come from the same speaker.
Although the emotional speech synthesis method and device of the present invention have been described in detail above through several exemplary embodiments, these embodiments are not exhaustive, and those skilled in the art may make various changes and modifications within the spirit and scope of the present invention. Therefore, the present invention is not limited to these embodiments; its scope is defined solely by the appended claims.
In other words, the idea of the present invention is to use a speaker normalization model so that, given only an additional parallel corpus from another speaker, prosody and spectrum conversion can be performed independently of the speaker, thereby effectively improving conversion performance. The speaker normalization model is easy to combine with various existing prosody and spectrum conversion methods and is not limited to the methods described in the above embodiments. Nor is the application of the invention limited to emotional expression; it can be used more broadly to enrich TTS (Text-to-Speech, also called speech synthesis) with various expressive styles, such as a friendly speaking manner or semantic focus in dialogue. The invention is applicable both to unit-concatenation TTS systems and to parametric-synthesis TTS systems.
Claims (9)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN200910170713A CN102005205B (en) | 2009-09-03 | 2009-09-03 | Emotional speech synthesizing method and device |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN200910170713A CN102005205B (en) | 2009-09-03 | 2009-09-03 | Emotional speech synthesizing method and device |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN102005205A CN102005205A (en) | 2011-04-06 |
| CN102005205B true CN102005205B (en) | 2012-10-03 |
Family
ID=43812513
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN200910170713A Expired - Fee Related CN102005205B (en) | 2009-09-03 | 2009-09-03 | Emotional speech synthesizing method and device |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN102005205B (en) |
Families Citing this family (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2015184615A1 (en) * | 2014-06-05 | 2015-12-10 | Nuance Software Technology (Beijing) Co., Ltd. | Systems and methods for generating speech of multiple styles from text |
| US9824681B2 (en) * | 2014-09-11 | 2017-11-21 | Microsoft Technology Licensing, Llc | Text-to-speech with emotional content |
| CN106531150B (en) * | 2016-12-23 | 2020-02-07 | 云知声(上海)智能科技有限公司 | Emotion synthesis method based on deep neural network model |
| CN107103900B (en) * | 2017-06-06 | 2020-03-31 | 西北师范大学 | Cross-language emotion voice synthesis method and system |
| CN108766413B (en) * | 2018-05-25 | 2020-09-25 | 北京云知声信息技术有限公司 | Speech synthesis method and system |
| CN111192568B (en) * | 2018-11-15 | 2022-12-13 | 华为技术有限公司 | Speech synthesis method and speech synthesis device |
| CN110083702B (en) * | 2019-04-15 | 2021-04-09 | 中国科学院深圳先进技术研究院 | Aspect level text emotion conversion method based on multi-task learning |
| CN112765971B (en) * | 2019-11-05 | 2023-11-17 | 北京火山引擎科技有限公司 | Text-to-speech conversion method and device, electronic equipment and storage medium |
| CN110930977B (en) * | 2019-11-12 | 2022-07-08 | 北京搜狗科技发展有限公司 | Data processing method and device and electronic equipment |
| CN114724540B (en) * | 2020-12-21 | 2025-09-05 | 阿里巴巴集团控股有限公司 | Model processing method and device, emotional speech synthesis method and device |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101064104A (en) * | 2006-04-24 | 2007-10-31 | 中国科学院自动化研究所 | Emotion voice creating method based on voice conversion |
- 2009-09-03: application CN200910170713A filed; granted as CN102005205B; status: not active (Expired - Fee Related)
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101064104A (en) * | 2006-04-24 | 2007-10-31 | 中国科学院自动化研究所 | Emotion voice creating method based on voice conversion |
Also Published As
| Publication number | Publication date |
|---|---|
| CN102005205A (en) | 2011-04-06 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN102005205B (en) | Emotional speech synthesizing method and device | |
| US11450332B2 (en) | Audio conversion learning device, audio conversion device, method, and program | |
| Palo et al. | Wavelet based feature combination for recognition of emotions | |
| Ma et al. | Short utterance based speech language identification in intelligent vehicles with time-scale modifications and deep bottleneck features | |
| Hu et al. | GMM supervector based SVM with spectral features for speech emotion recognition | |
| US8935167B2 (en) | Exemplar-based latent perceptual modeling for automatic speech recognition | |
| CN101136199B (en) | Voice data processing method and equipment | |
| Mannepalli et al. | MFCC-GMM based accent recognition system for Telugu speech signals | |
| Wu et al. | Audio classification using attention-augmented convolutional neural network | |
| CN107301859B (en) | Speech conversion method under non-parallel text condition based on adaptive Gaussian clustering | |
| EP2888669B1 (en) | Method and system for selectively biased linear discriminant analysis in automatic speech recognition systems | |
| CN104200804A (en) | Various-information coupling emotion recognition method for human-computer interaction | |
| Gawali et al. | Marathi isolated word recognition system using MFCC and DTW features | |
| Latif et al. | Generative emotional AI for speech emotion recognition: The case for synthetic emotional speech augmentation | |
| Kadyan et al. | A comparative study of deep neural network based Punjabi-ASR system | |
| Tsai et al. | Discriminative training of Gaussian mixture bigram models with application to Chinese dialect identification | |
| Daouad et al. | An automatic speech recognition system for isolated Amazigh word using 1D & 2D CNN-LSTM architecture | |
| CN106297769A (en) | A kind of distinctive feature extracting method being applied to languages identification | |
| Jiang et al. | Task-aware deep bottleneck features for spoken language identification. | |
| Ahmed et al. | Efficient feature extraction and classification for the development of Pashto speech recognition system | |
| Tailor et al. | Deep learning approach for spoken digit recognition in Gujarati language | |
| Tobing et al. | Cyclic Spectral Modeling for Unsupervised Unit Discovery into Voice Conversion with Excitation and Waveform Modeling. | |
| Mahum et al. | Text to speech synthesis using deep learning | |
| Boulal et al. | Exploring data augmentation for Amazigh speech recognition with convolutional neural networks | |
| Li | An improved machine learning algorithm for text-voice conversion of English letters into phonemes |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| C06 | Publication | ||
| PB01 | Publication | ||
| C10 | Entry into substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| C14 | Grant of patent or utility model | ||
| GR01 | Patent grant | ||
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20121003; Termination date: 20160903 |
| CF01 | Termination of patent right due to non-payment of annual fee |