
CN102005205B - Emotional speech synthesizing method and device - Google Patents


Info

Publication number
CN102005205B
CN102005205B · CN200910170713A
Authority
CN
China
Prior art keywords
mentioned
neutral
speaker
regular
model
Prior art date: 2009-09-03
Legal status
Expired - Fee Related
Application number
CN200910170713A
Other languages
Chinese (zh)
Other versions
CN102005205A (en)
Inventor
栾剑
李健
Current Assignee
Toshiba Corp
Original Assignee
Toshiba Corp
Priority date: 2009-09-03
Filing date: 2009-09-03
Publication date: 2012-10-03
Application filed by Toshiba Corp
Priority to CN200910170713A
Publication of CN102005205A
Application granted
Publication of CN102005205B


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides an emotional speech synthesis method and device. According to one aspect of the invention, an emotional speech synthesis method is provided, comprising the steps of: inputting a text sentence; predicting, using a neutral feature model trained on a neutral speech database of a first speaker, a neutral feature vector of the text sentence in a first feature space of the first speaker; transforming, using a speaker normalization model trained on the neutral speech database and a parallel speech database of a second speaker, the neutral feature vector into a normalized neutral feature vector in a second feature space of the second speaker; converting, using an emotion conversion model trained on the parallel speech database, the normalized neutral feature vector into a normalized emotional feature vector in the second feature space; inversely transforming, using the speaker normalization model, the normalized emotional feature vector into an emotional feature vector in the first feature space; and synthesizing emotional speech of the first speaker from the emotional feature vector in the first feature space.

Description

Emotional Speech Synthesis Method and Device

Technical Field

The present invention relates to information processing technology, in particular to speech synthesis technology, and more particularly to speaker-independent emotional speech synthesis technology.

Background Art

At present, most speech synthesis systems based on large speech corpora are built on speech read in a neutral style. For synthesizing emotional speech, the common approach is prosody and spectrum conversion from neutral speech to the target emotional speech, such as the GMM (Gaussian mixture model) based methods described in Non-Patent Documents 1 and 2 and the CART (Classification And Regression Tree) based method described in Non-Patent Document 2. These prosody and spectrum conversion methods only require building one additional small parallel speech database, which saves a great deal of development time and cost compared with recording a new large database of the target emotional speech. These methods can model the relation between neutral speech features and target emotional speech features, as the GMM-based methods do; alternatively, they can model the relation between linguistic information and the differences between neutral and target emotional speech features, as the CART-based methods do. The GMM-based methods perform better than the CART-based methods. Furthermore, as described in Non-Patent Document 2, the CART and GMM methods can be combined: the CART method first performs a preliminary classification based on linguistic information, and a GMM prosody and spectrum conversion model is then built for each class.

However, the above GMM-based prosody and spectrum conversion models are heavily speaker-dependent: if the large neutral speech database and the small parallel speech database do not come from the same speaker, conversion performance degrades severely. Hence, to obtain high-quality conversion, the large neutral database and the small parallel database should come from the same speaker. This is difficult to achieve in actual product support, because customer demand can arise at any time; for example, several years after the neutral speech database was recorded, even if the original speaker can still be found, his or her voice may have changed considerably over time.

Non-Patent Document 1: L. Mesbahi, V. Barreaud and O. Boeffard, "Comparing GMM-based speech transformation systems", Proc. of INTERSPEECH 2007, Antwerp, Belgium, Aug. 27-31, 2007, pp. 1989-1992, the entire contents of which are hereby incorporated by reference.

Non-Patent Document 2: J. Tao, Y. Kang and A. Li, "Prosody conversion from neutral speech to emotional speech", IEEE Trans. on Audio, Speech and Language Processing, Vol. 14, No. 4, 2006, pp. 1145-1154, the entire contents of which are hereby incorporated by reference.

Summary of the Invention

The present invention has been made in view of the above problems in the prior art, and its object is to provide a speaker-independent emotional speech synthesis method and device capable of effectively improving the performance of prosody and spectrum conversion.

According to one aspect of the present invention, an emotional speech synthesis method is provided, comprising the steps of: inputting a text sentence; predicting, using a neutral feature model trained on a neutral speech database of a first speaker, a neutral feature vector of the text sentence in a first feature space of the first speaker; transforming, using a speaker normalization model trained on the neutral speech database and a parallel speech database of a second speaker, the neutral feature vector into a normalized neutral feature vector in a second feature space of the second speaker; converting, using an emotion conversion model trained on the parallel speech database, the normalized neutral feature vector into a normalized emotional feature vector in the second feature space; inversely transforming, using the speaker normalization model, the normalized emotional feature vector into an emotional feature vector in the first feature space; and synthesizing emotional speech of the first speaker from the emotional feature vector in the first feature space.

According to another aspect of the present invention, an emotional speech synthesis device is provided, comprising: an input unit that inputs a text sentence; a prediction unit that predicts, using a neutral feature model trained on a neutral speech database of a first speaker, a neutral feature vector of the text sentence in a first feature space of the first speaker; a transformation unit that transforms, using a speaker normalization model trained on the neutral speech database and a parallel speech database of a second speaker, the neutral feature vector into a normalized neutral feature vector in a second feature space of the second speaker; a conversion unit that converts, using an emotion conversion model trained on the parallel speech database, the normalized neutral feature vector into a normalized emotional feature vector in the second feature space; an inverse transformation unit that inversely transforms, using the speaker normalization model, the normalized emotional feature vector into an emotional feature vector in the first feature space; and a synthesis unit that synthesizes emotional speech of the first speaker from the emotional feature vector in the first feature space.

Brief Description of the Drawings

It is believed that the above features, advantages and objects of the present invention will be better understood from the following description of specific embodiments of the present invention in conjunction with the accompanying drawings.

Fig. 1 is a flowchart of an emotional speech synthesis method according to an embodiment of the present invention.

Fig. 2 shows an example of a speaker normalization model according to an embodiment of the present invention.

Fig. 3 shows another example of a speaker normalization model according to an embodiment of the present invention.

Fig. 4 is a block diagram of an emotional speech synthesis device according to another embodiment of the present invention.

Detailed Description of the Embodiments

Various preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.

Emotional Speech Synthesis Method

Fig. 1 is a flowchart of an emotional speech synthesis method according to an embodiment of the present invention. The present embodiment is described below with reference to this figure.

As shown in Fig. 1, first, at step 101, a text sentence is input. In this embodiment, the input text sentence may be any text sentence known to those skilled in the art, in any language, such as Chinese, English or Japanese; the present invention places no limitation on this.

Next, at step 105, linguistic information 60 is extracted from the input text sentence by text analysis. In this embodiment, the linguistic information 60 includes the sentence length of the text sentence and, for each character (word) in the sentence, its glyph, pinyin, phoneme type, tone, part of speech, position in the sentence, the boundary types with the preceding and following characters (words), the distances to the preceding and following pauses, and so on. The text analysis method used to extract the linguistic information 60 may be any method known to those skilled in the art; the present invention places no limitation on this.
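As a concrete illustration, the linguistic information for one character might be collected into a record such as the following sketch; all field names and values are hypothetical, chosen only to mirror the features listed above:

```python
# A hypothetical record of the linguistic information extracted for one
# character of a Chinese text sentence; field names are illustrative only.
linguistic_info = {
    "sentence_length": 12,       # characters in the sentence
    "glyph": "好",
    "pinyin": "hao3",
    "phoneme_type": "vowel-final",
    "tone": 3,                   # third tone
    "part_of_speech": "ADJ",
    "position_in_sentence": 5,   # 0-based character index
    "boundary_prev": "word-internal",
    "boundary_next": "word-boundary",
    "dist_to_prev_pause": 5,     # characters since the last pause
    "dist_to_next_pause": 7,     # characters until the next pause
}
```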

It should be noted that step 105 is optional; it is also possible to proceed directly from step 101 to step 110.

At step 110, the neutral feature model 30, trained on the neutral speech database 10 of the first speaker, is used to predict the neutral feature vector of the text sentence input at step 101 in the first feature space of the first speaker.

In this embodiment, the neutral speech database 10 contains the neutral speech of the first speaker, that is, speech read in a neutral style. It may be any speech database known to those skilled in the art, such as the neutral speech databases described in Non-Patent Documents 1 and 2. Likewise, the method of training the neutral feature model 30 from the neutral speech database 10 may be any method known to those skilled in the art, and the trained neutral feature model 30 may be any model known to those skilled in the art, such as the training methods and neutral feature models described in Non-Patent Documents 1 and 2. The feature vectors of the neutral feature model 30 may contain one or more of prosodic features (e.g., duration, fundamental frequency trajectory, pause, energy) and spectral features. The present invention merely uses the neutral feature model 30 in step 110 and places no limitation on the neutral speech database 10, the training method, or the model itself.

At step 110, if the linguistic information 60 was not extracted at step 105, the neutral feature model 30 is used to predict the neutral feature vector of the text sentence input at step 101 in the first feature space of the first speaker. If the linguistic information 60 was extracted at step 105, the neutral feature vector is predicted with the neutral feature model 30 based on the extracted linguistic information 60. In this embodiment, the prediction method may be any method known to those skilled in the art, such as the prediction methods described in Non-Patent Documents 1 and 2. The predicted neutral feature vector may contain one or more of prosodic features (e.g., duration, fundamental frequency trajectory, pause, energy) and spectral features.

Next, at step 115, the speaker normalization model 50, trained on the neutral speech database 10 and the parallel speech database 20 of a second speaker, is used to transform the neutral feature vector predicted at step 110 into a normalized neutral feature vector in the second feature space of the second speaker. The transformed normalized neutral feature vector may likewise contain one or more of prosodic features (e.g., duration, fundamental frequency trajectory, pause, energy) and spectral features.

In this embodiment, the parallel speech database 20 of the second speaker contains paired neutral speech and target emotional speech of the second speaker; that is, each text sentence is read aloud in both the neutral style and the target emotional style.

An example of the speaker normalization model 50 and of the transformation performed at step 115 is described in detail below with reference to Fig. 2.

Fig. 2 shows an example of the speaker normalization model 50 according to an embodiment of the present invention. As shown in Fig. 2, when training the speaker normalization model 50, the neutral speech database 10 is first split into m classes 1-1, 1-2, ..., 1-m according to a classification rule 70. The classification rule 70 may differ from feature to feature based on experience, for example one or more of: classifying by phoneme type for duration and spectral features, by tone type for fundamental frequency trajectories, and by position in the sentence for energy. Next, the parallel speech database 20 is split into the corresponding m classes 2-1, 2-2, ..., 2-m according to the same classification rule 70. Then, for each pair of classes 1-i and 2-i, statistics 71-i and 72-i are computed, where the statistics may be the mean μ and covariance matrix Σ of the feature vectors extracted from each class, and so on. In this case, the speaker normalization model 50 consists of the classification rule 70 and the statistics 71-i and 72-i.
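A minimal sketch of this training step, assuming the feature vectors of each database have already been grouped by the classification rule (the function and variable names are ours, not from the patent):

```python
import numpy as np

def per_class_statistics(classes):
    """For each class (an array of shape [n_samples, dim]),
    return the mean vector and covariance matrix."""
    stats = []
    for vectors in classes:
        mu = vectors.mean(axis=0)
        sigma = np.cov(vectors, rowvar=False)
        stats.append((mu, sigma))
    return stats

# neutral_classes[i] and parallel_classes[i] hold the feature vectors of
# class i of the neutral database 10 and the parallel database 20:
# stats_neutral = per_class_statistics(neutral_classes)    # the 71-i
# stats_parallel = per_class_statistics(parallel_classes)  # the 72-i
```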

Returning to Fig. 1, when the speaker normalization model 50 consists of the classification rule 70 and the statistics 71-i and 72-i, step 115 first uses the linguistic information 60 extracted at step 105 to look up the class x corresponding to the neutral feature vector predicted at step 110, and then transforms that neutral feature vector into the normalized neutral feature vector in the second feature space of the second speaker according to the following formula (1):

$$v'_n = (v_n - \mu_{1x})\,\Sigma_{1x}^{-1/2}\,\Sigma_{2x}^{1/2} + \mu_{2x} \qquad (1)$$

where v′n denotes the normalized neutral feature vector, vn the neutral feature vector, μ1x the mean extracted from the class x of the neutral speech database 10 corresponding to the neutral feature vector, Σ1x the covariance matrix extracted from that class of the neutral speech database 10, μ2x the mean extracted from the class x of the parallel speech database 20 corresponding to the neutral feature vector, and Σ2x the covariance matrix extracted from that class of the parallel speech database 20.
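The transform in formula (1) whitens the vector with the first speaker's class statistics and recolors it with the second speaker's. A minimal sketch, assuming matrix square roots are computed with scipy (names are ours, not from the patent):

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def normalize_to_second_space(v_n, mu1x, sigma1x, mu2x, sigma2x):
    """Formula (1): whiten v_n with the first speaker's class statistics,
    then recolor it with the second speaker's class statistics."""
    whiten = fractional_matrix_power(sigma1x, -0.5)
    recolor = fractional_matrix_power(sigma2x, 0.5)
    return (v_n - mu1x) @ whiten @ recolor + mu2x
```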

Another example of the speaker normalization model 50 and of the transformation performed at step 115 is described in detail below with reference to Fig. 3.

Fig. 3 shows another example of the speaker normalization model according to an embodiment of the present invention. As shown in Fig. 3, when training the speaker normalization model 50, a GMM-based first feature space model (λ1, μ1, Σ1) 80 of the first speaker is first trained on the neutral speech database 10. In the first feature space model 80, λ1 denotes the set of weights of its components, μ1 the set of component means, and Σ1 the set of component covariance matrices. Then, using the parallel speech database 20, the first feature space model 80 of the first speaker is adapted into a second feature space model (λ2, μ2, Σ2) 90 of the second speaker, in which λ2, μ2 and Σ2 denote the corresponding sets of component weights, means and covariance matrices. The number of components in these feature space models should be large enough for the models to describe their feature spaces accurately. In addition, the components of the two feature space models 80 and 90 before and after adaptation can be regarded as coupled. In this embodiment, the adaptation method may be MAP (Maximum a Posteriori), MCE (Minimum Classification Error), MMI (Maximum Mutual Information) or any other available algorithm; the present invention places no limitation on this. Note that, because the data in the second speaker's parallel speech database 20 is limited, usually only the component means μ are adapted, so it is assumed that λ1 = λ2 = λ and Σ1 = Σ2 = Σ. The method of training the GMM-based feature space models may be any method known to those skilled in the art, such as the training methods described in Non-Patent Documents 1 and 2. In this case, the speaker normalization model 50 consists of the first feature space model 80 and the second feature space model 90.
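One way the mean-only adaptation could look in code is the classic relevance-MAP update used in speaker modeling; the patent does not fix a particular formula, so the following is only a sketch under that assumption (the relevance factor `tau` and all names are ours):

```python
import numpy as np

def map_adapt_means(gmm_means, posteriors, data, tau=16.0):
    """Mean-only MAP adaptation of a GMM.
    gmm_means:  [n_components, dim] means of model 80.
    posteriors: [n_samples, n_components] component posteriors of the
                adaptation data under model 80.
    data:       [n_samples, dim] feature vectors from parallel database 20.
    Returns the adapted means of model 90; weights and covariances are
    kept unchanged, matching the assumption λ1 = λ2 and Σ1 = Σ2."""
    n_k = posteriors.sum(axis=0)                        # soft counts per component
    ex_k = posteriors.T @ data                          # [n_components, dim]
    x_bar = ex_k / np.maximum(n_k, 1e-10)[:, None]      # per-component data means
    alpha = (n_k / (n_k + tau))[:, None]                # data/prior balance
    return alpha * x_bar + (1.0 - alpha) * gmm_means
```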

Returning to Fig. 1, when the speaker normalization model 50 consists of the first feature space model 80 and the second feature space model 90, step 115 first computes the probability Pi of the neutral feature vector predicted at step 110 for each component i of the first feature space model (λ1, μ1, Σ1) 80. Optionally, Pi may be computed according to the following formula (2):

$$p_i = \lambda_i \cdot \frac{1}{(2\pi)^{n/2}\,\lvert\Sigma_i\rvert^{1/2}} \exp\!\left(-\tfrac{1}{2}\,(v_n-\mu_{1i})^{T}\,\Sigma_i^{-1}\,(v_n-\mu_{1i})\right) \qquad (2)$$

where λi denotes the weight of component i in the first feature space model 80, μ1i the mean of component i, Σi the covariance matrix of component i, and vn the neutral feature vector.

Next, the weight wi of the above probability Pi of the neutral feature vector for each component i of the first feature space model (λ1, μ1, Σ1) 80 is computed. Optionally, wi may be computed according to the following formula (3):

$$w_i = \frac{p_i}{\sum_{i=1}^{n} p_i} \qquad (3)$$

Next, based on the weights wi and the second feature space model (λ2, μ2, Σ2) 90, the normalized neutral feature vector v′n obtained by the transformation of step 115 is computed. Optionally, v′n may be computed according to the following formula (4):

$$v'_n = \sum_{i=1}^{n} \left(w_i \cdot \mu_{2i}\right) \qquad (4)$$

where μ2i denotes the mean of component i in the second feature space model 90.
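Formulas (2) through (4) amount to replacing each first-space component mean by its coupled second-space mean, weighted by the component posteriors. A minimal sketch, under the additional assumption of diagonal covariances (that restriction and all names are ours):

```python
import numpy as np

def gmm_posterior_map(v, weights, means_src, diag_covs, means_dst):
    """Formulas (2)-(4): map v from the source to the destination space.
    weights: λ_i; means_src: μ of the model v is scored under;
    diag_covs: diagonals of Σ_i; means_dst: coupled means of the other model."""
    diff = v - means_src                                    # [n_components, dim]
    log_p = (np.log(weights)
             - 0.5 * np.sum(np.log(2 * np.pi * diag_covs), axis=1)
             - 0.5 * np.sum(diff**2 / diag_covs, axis=1))   # formula (2), in log
    p = np.exp(log_p - log_p.max())                         # numerically stable
    w = p / p.sum()                                         # formula (3)
    return w @ means_dst                                    # formula (4)

# Step 115: v_n_prime = gmm_posterior_map(v_n, weights, means_1, diag_covs, means_2)
```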

After the speaker normalization model 50 has been used at step 115 to transform the neutral feature vector vn predicted at step 110 into the normalized neutral feature vector v′n in the second feature space of the second speaker, at step 120 the emotion conversion model 40, trained on the parallel speech database 20, is used to convert v′n into a normalized emotional feature vector v′e in the second feature space. The converted normalized emotional feature vector v′e may contain one or more of prosodic features (e.g., duration, fundamental frequency trajectory, pause, energy) and spectral features.

In this embodiment, the emotion conversion model 40 may be trained with a GMM-based method, a CART-based method, other methods, or any combination thereof. Analogously to the features contained in the above feature vectors, the emotion conversion model 40 may also comprise one or more of a duration conversion model, a fundamental frequency trajectory conversion model, a pause conversion model, an energy conversion model, a spectrum conversion model, and so on.
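For the GMM-based case, the conversion function commonly used in the literature cited above (e.g., the systems compared in Non-Patent Document 1) is a posterior-weighted linear regression over a joint GMM of paired neutral/emotional features. The patent does not spell out the formula, so the following is a sketch of that standard choice, with our own names:

```python
import numpy as np

def gmm_convert(x, posteriors, mu_x, mu_y, regressors):
    """Joint-GMM conversion function:
        F(x) = Σ_i P(i|x) * (μ^y_i + A_i (x - μ^x_i)),
    where A_i = Σ^{yx}_i (Σ^{xx}_i)^{-1} is precomputed per component.
    posteriors: P(i|x) for each component, shape [n_components]
    mu_x, mu_y: component means of the source/target halves of the joint GMM
    regressors: the per-component matrices A_i"""
    out = np.zeros_like(mu_y[0])
    for p, mx, my, A in zip(posteriors, mu_x, mu_y, regressors):
        out += p * (my + A @ (x - mx))
    return out
```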

In addition, in this embodiment, if the emotion conversion model 40 is trained with a CART-based method, the linguistic information 60 must be extracted from the input text sentence at step 105, so that at step 120 the normalized neutral feature vector v′n can be converted into the normalized emotional feature vector v′e with the emotion conversion model 40 based on the linguistic information 60.

Next, at step 125, the speaker normalization model 50 is used to inversely transform the normalized emotional feature vector v′e into an emotional feature vector ve in the first feature space. The inversely transformed emotional feature vector ve may contain one or more of prosodic features (e.g., duration, fundamental frequency trajectory, pause, energy) and spectral features.

Optionally, when the speaker normalization model 50 consists of the classification rule 70 and the statistics 71-i and 72-i shown in Fig. 2, the normalized emotional feature vector v′e is inversely transformed into the emotional feature vector ve according to the following formula (5):

$$v_e = (v'_e - \mu_{2x})\,\Sigma_{2x}^{-1/2}\,\Sigma_{1x}^{1/2} + \mu_{1x} \qquad (5)$$

where μ2x denotes the mean extracted from the class x of the parallel speech database 20 corresponding to the neutral feature vector vn, Σ2x the covariance matrix extracted from that class of the parallel speech database 20, and μ1x and Σ1x the corresponding mean and covariance matrix from the neutral speech database 10.
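Formula (5) is simply formula (1) with the roles of the two speakers exchanged, so the earlier sketch can be reused directly (again an illustration, not the patent's code):

```python
# Reusing normalize_to_second_space from the sketch after formula (1):
# swapping the statistics of the two databases yields the inverse transform.
def denormalize_to_first_space(v_e_prime, mu1x, sigma1x, mu2x, sigma2x):
    """Formula (5): whiten with the parallel-database class statistics,
    recolor with the neutral-database class statistics."""
    return normalize_to_second_space(v_e_prime, mu2x, sigma2x, mu1x, sigma1x)
```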

Alternatively, when the speaker normalization model 50 consists of the first feature space model 80 and the second feature space model 90 shown in Fig. 3, the probability P′i of the normalized emotional feature vector v′e for each component of the second feature space model (λ2, μ2, Σ2) 90 is computed first. Optionally, P′i may be computed according to the following formula (6):

$$p'_i = \lambda_i \cdot \frac{1}{(2\pi)^{n/2}\,\lvert\Sigma_i\rvert^{1/2}} \exp\!\left(-\tfrac{1}{2}\,(v'_e-\mu_{2i})^{T}\,\Sigma_i^{-1}\,(v'_e-\mu_{2i})\right) \qquad (6)$$

where λi denotes the weight of component i in the second feature space model 90, μ2i the mean of component i, Σi the covariance matrix of component i, and v′e the normalized emotional feature vector.

Next, the weight w′i of the above probability P′i of the normalized emotional feature vector v′e for each component i of the second feature space model 90 is computed. Optionally, w′i may be computed according to the following formula (7):

$$w'_i = \frac{p'_i}{\sum_{i=1}^{n} p'_i} \qquad (7)$$

Next, based on the weights w′i and the first feature space model (λ1, μ1, Σ1) 80, the emotional feature vector ve obtained by the inverse transformation of step 125 is computed. Optionally, ve may be computed according to the following formula (8):

$$v_e = \sum_{i=1}^{n} \left(w'_i \cdot \mu_{1i}\right) \qquad (8)$$

where μ1i denotes the mean of component i in the first feature space model 80.
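Formulas (6) through (8) mirror formulas (2) through (4) with the two feature space models exchanged, and since the weights and covariances are shared under the assumption λ1 = λ2 and Σ1 = Σ2, the earlier `gmm_posterior_map` sketch applies unchanged:

```python
# Reusing gmm_posterior_map from the sketch after formula (4), with the
# two models swapped: score v'_e under model 90, output over model 80 means.
# v_e = gmm_posterior_map(v_e_prime, weights, means_2, diag_covs, means_1)
```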

Finally, at step 130, the emotional speech of the first speaker is synthesized from the emotional feature vector ve in the first feature space.

In this embodiment, the method of synthesizing the target emotional speech from the emotional feature vector may be any method known to those skilled in the art, such as the synthesis methods described in Non-Patent Documents 1 and 2; the present invention places no limitation on this.

With the emotional speech synthesis method of this embodiment, the emotional speech of the first speaker can be synthesized using the parallel speech database of a second speaker different from the first speaker, thereby effectively improving the performance of prosody and spectrum conversion when the large neutral speech database and the small parallel speech database do not come from the same speaker.
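Putting the steps of Fig. 1 together, the whole method reduces to a short pipeline. The following sketch composes the illustrative functions introduced above; all function names are our own stand-ins for the trained models 30, 40 and 50, not an official implementation:

```python
def synthesize_emotional_speech(text_sentence):
    ling = extract_linguistic_info(text_sentence)     # step 105 (optional)
    v_n = predict_neutral_features(ling)              # step 110, model 30
    v_n_prime = speaker_normalize(v_n, ling)          # step 115, model 50
    v_e_prime = emotion_convert(v_n_prime, ling)      # step 120, model 40
    v_e = speaker_denormalize(v_e_prime, ling)        # step 125, model 50
    return synthesize_waveform(v_e)                   # step 130
```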

Emotional Speech Synthesis Device

Under the same inventive concept, Fig. 4 is a block diagram of an emotional speech synthesis device according to another embodiment of the present invention. The present embodiment is described below with reference to this figure; descriptions of parts identical to the previous embodiment are omitted where appropriate.

As shown in Fig. 4, the emotional speech synthesis device 400 of this embodiment comprises: an input unit 401 that inputs a text sentence; a prediction unit 410 that predicts, using the neutral feature model 30 trained on the neutral speech database 10 of the first speaker, the neutral feature vector of the text sentence in the first feature space of the first speaker; a transformation unit 415 that transforms, using the speaker normalization model 50 trained on the neutral speech database 10 and the parallel speech database 20 of the second speaker, the neutral feature vector into a normalized neutral feature vector in the second feature space of the second speaker; a conversion unit 420 that converts, using the emotion conversion model 40 trained on the parallel speech database 20, the normalized neutral feature vector into a normalized emotional feature vector in the second feature space; an inverse transformation unit 425 that inversely transforms, using the speaker normalization model 50, the normalized emotional feature vector into an emotional feature vector in the first feature space; and a synthesis unit 430 that synthesizes the emotional speech of the first speaker from the emotional feature vector in the first feature space.

In this embodiment, the text sentence input by the input unit 401 may be any text sentence known to those skilled in the art, in any language, such as Chinese, English or Japanese; the present invention places no limitation on this.

The emotional speech synthesis device 400 of this embodiment optionally has an extraction unit 405 that extracts linguistic information 60 from the text sentence input by the input unit 401 by text analysis. In this embodiment, the linguistic information 60 includes the sentence length of the text sentence and, for each character (word) in the sentence, its glyph, pinyin, phoneme type, tone, part of speech, position in the sentence, the boundary types with the preceding and following characters (words), the distances to the preceding and following pauses, and so on. The text analysis method used by the extraction unit 405 may be any method known to those skilled in the art; the present invention places no limitation on this.

In this embodiment, the neutral speech database 10 contains the neutral speech of the first speaker, that is, speech read in a neutral style. It may be any speech database known to those skilled in the art, such as the neutral speech databases described in Non-Patent Documents 1 and 2. Likewise, the method of training the neutral feature model 30 from the neutral speech database 10, and the trained model itself, may be any method or model known to those skilled in the art, such as those described in Non-Patent Documents 1 and 2. The feature vectors of the neutral feature model 30 may contain one or more of prosodic features (e.g., duration, fundamental frequency trajectory, pause, energy) and spectral features. The present invention merely uses the neutral feature model 30 in the prediction unit 410 and places no limitation on the neutral speech database 10, the training method, or the model itself.

When the linguistic information 60 has not been extracted by the extraction unit 405, the prediction unit 410 uses the neutral feature model 30 to predict the neutral feature vector of the text sentence input by the input unit 401 in the first feature space of the first speaker. When the linguistic information 60 has been extracted by the extraction unit 405, the prediction unit 410 predicts the neutral feature vector with the neutral feature model 30 based on the extracted linguistic information 60. In this embodiment, the prediction method used by the prediction unit 410 may be any method known to those skilled in the art, such as the prediction methods described in Non-Patent Documents 1 and 2. The predicted neutral feature vector may contain one or more of prosodic features (e.g., duration, fundamental frequency trajectory, pause, energy) and spectral features.

The transformation unit 415 transforms, using the speaker normalization model 50 trained on the neutral speech database 10 and the parallel speech database 20 of the second speaker, the neutral feature vector predicted by the prediction unit 410 into a normalized neutral feature vector in the second feature space of the second speaker. The transformed normalized neutral feature vector may likewise contain one or more of prosodic features (e.g., duration, fundamental frequency trajectory, pause, energy) and spectral features.

In this embodiment, the parallel speech database 20 of the second speaker contains paired neutral speech and target emotional speech of the second speaker; that is, each text sentence is read aloud in both the neutral style and the target emotional style.

In this embodiment, the speaker normalization model 50 may be the model consisting of the classification rule 70 and the statistics 71-i and 72-i shown in Fig. 2, or the model consisting of the first feature space model 80 and the second feature space model 90 shown in Fig. 3.

When the speaker normalization model 50 consists of the classification rule 70 and the statistics 71-i and 72-i, the transformation unit 415 comprises a lookup unit and a calculation unit. The lookup unit uses the linguistic information 60 extracted by the extraction unit 405 to look up the class x corresponding to the neutral feature vector predicted by the prediction unit 410. The calculation unit computes the normalized neutral feature vector according to formula (1) above.

When the speaker normalization model 50 consists of the first feature space model 80 and the second feature space model 90, the transformation unit 415 comprises a probability calculation unit, a weight calculation unit and a feature vector calculation unit. The probability calculation unit computes the probability Pi of the neutral feature vector predicted by the prediction unit 410 for each component i of the first feature space model (λ1, μ1, Σ1) 80. Optionally, Pi may be computed according to formula (2) above.

The weight calculation unit computes the weight wi of the above probability Pi of the neutral feature vector for each component i of the first feature space model (λ1, μ1, Σ1) 80. Optionally, wi may be computed according to formula (3) above.

The feature vector calculation unit computes the transformed normalized neutral feature vector v′n from the weights wi computed by the weight calculation unit and the second feature space model (λ2, μ2, Σ2) 90. Optionally, v′n may be computed according to formula (4) above.

After the transformation unit 415 has used the speaker normalization model 50 to transform the neutral feature vector vn predicted by the prediction unit 410 into the normalized neutral feature vector v′n in the second feature space of the second speaker, the conversion unit 420 converts v′n into a normalized emotional feature vector v′e in the second feature space, using the emotion conversion model 40 trained on the parallel speech database 20. The converted normalized emotional feature vector v′e may contain one or more of prosodic features (e.g., duration, fundamental frequency trajectory, pause, energy) and spectral features.

In this embodiment, the emotion conversion model 40 may be trained with a GMM-based method, a CART-based method, other methods, or any combination thereof. Analogously to the features contained in the above feature vectors, the emotion conversion model 40 may also comprise one or more of a duration conversion model, a fundamental frequency trajectory conversion model, a pause conversion model, an energy conversion model, a spectrum conversion model, and so on.

In addition, in this embodiment, if the emotion conversion model 40 is trained with a CART-based method, the extraction unit 405 must extract the linguistic information 60 from the input text sentence, so that the conversion unit 420 can convert the normalized neutral feature vector v′n into the normalized emotional feature vector v′e with the emotion conversion model 40 based on the linguistic information 60.

The inverse transformation unit 425 uses the speaker normalization model 50 to inversely transform the normalized emotional feature vector v′e into an emotional feature vector ve in the first feature space. The inversely transformed emotional feature vector ve may contain one or more of prosodic features (e.g., duration, fundamental frequency trajectory, pause, energy) and spectral features.

Optionally, when the speaker normalization model 50 consists of the classification rule 70 and the statistics 71-i and 72-i shown in Fig. 2, the normalized emotional feature vector v′e is inversely transformed into the emotional feature vector ve according to formula (5) above.

Alternatively, when the speaker normalization model 50 consists of the first feature space model 80 and the second feature space model 90 shown in Fig. 3, the inverse transformation unit 425 comprises a probability calculation unit, a weight calculation unit and a feature vector calculation unit. The probability calculation unit computes the probability P′i of the normalized emotional feature vector v′e for each component of the second feature space model (λ2, μ2, Σ2) 90. Optionally, the probability calculation unit may compute P′i according to formula (6) above.

The weight calculation unit computes the weight w′i of the above probability P′i of the normalized emotional feature vector v′e for each component i of the second feature space model 90. Optionally, the weight calculation unit may compute w′i according to formula (7) above.

The feature vector calculation unit computes the inversely transformed emotional feature vector ve from the weights w′i computed by the weight calculation unit and the first feature space model (λ1, μ1, Σ1) 80. Optionally, the feature vector calculation unit may compute ve according to formula (8) above.

Finally, the synthesis unit 430 synthesizes the emotional speech of the first speaker from the emotional feature vector ve in the first feature space.

In this embodiment, the method of synthesizing the target emotional speech from the emotional feature vector may be any method known to those skilled in the art, such as the synthesis methods described in Non-Patent Documents 1 and 2; the present invention places no limitation on this.

With the emotional speech synthesis device 400 of this embodiment, the emotional speech of the first speaker can be synthesized using the parallel speech database of a second speaker different from the first speaker, thereby effectively improving the performance of prosody and spectrum conversion when the large neutral speech database and the small parallel speech database do not come from the same speaker.

Although the emotional speech synthesis method and device of the present invention have been described in detail above through some exemplary embodiments, these embodiments are not exhaustive, and those skilled in the art may make various changes and modifications within the spirit and scope of the present invention. Therefore, the present invention is not limited to these embodiments; its scope is defined solely by the appended claims.

In other words, the idea of the present invention is that, with a speaker normalization model, prosody and spectrum conversion can be performed independently of the speaker given only an additional parallel speech database from another speaker, thereby effectively improving conversion performance. The speaker normalization model is easy to combine with various existing prosody and spectrum conversion methods and is not limited to the methods described in the above embodiments. Nor is the application of the present invention limited to emotional expression: it can be used more broadly to enrich the expression types of TTS (Text-to-Speech, also called speech synthesis), such as a friendly speaking style or semantic focus in dialogue. The present invention is applicable both to unit-concatenation TTS systems and to parameter-synthesis TTS systems.

Claims (9)

1. An emotional speech synthesis method, comprising the steps of:
inputting a text sentence;
predicting, using a neutral feature model trained on a neutral speech database of a first speaker, a neutral feature vector of the text sentence in a first feature space of the first speaker;
transforming, using a speaker normalization model trained on the neutral speech database and a parallel speech database of a second speaker, the neutral feature vector into a normalized neutral feature vector in a second feature space of the second speaker;
converting, using an emotion conversion model trained on the parallel speech database, the normalized neutral feature vector into a normalized emotional feature vector in the second feature space;
inversely transforming, using the speaker normalization model, the normalized emotional feature vector into an emotional feature vector in the first feature space; and
synthesizing emotional speech of the first speaker from the emotional feature vector in the first feature space.
2. The emotional speech synthesis method according to claim 1, further comprising the step of:
extracting linguistic information from the text sentence before the step of predicting, using the neutral feature model trained on the neutral speech database of the first speaker, the neutral feature vector of the text sentence in the first feature space of the first speaker.
3. The emotional speech synthesis method according to claim 2, wherein the step of predicting, using the neutral feature model trained on the neutral speech database of the first speaker, the neutral feature vector of the text sentence in the first feature space of the first speaker comprises the step of:
predicting the neutral feature vector with the neutral feature model based on the linguistic information.
4. The emotional speech synthesis method according to claim 2, wherein the step of converting, using the emotion conversion model trained on the parallel speech database, the normalized neutral feature vector into the normalized emotional feature vector in the second feature space comprises the step of:
converting the normalized neutral feature vector into the normalized emotional feature vector with the emotion conversion model based on the linguistic information.
5. The emotional speech synthesis method according to claim 1, wherein the speaker normalization model comprises a classification rule, the means and covariance matrices of the feature vectors extracted from each class into which the neutral speech database is divided according to the classification rule, and the means and covariance matrices of the feature vectors extracted from each class into which the parallel speech database is divided according to the classification rule.
6. The emotional speech synthesis method according to claim 5, wherein the classification rule comprises at least one of a phoneme-type classification rule for duration and spectral features, a tone-type classification rule for fundamental frequency trajectories, and a position-in-sentence classification rule for energy.
7. The emotional speech synthesis method according to claim 5, wherein the step of transforming, using the speaker normalization model trained on the neutral speech database and the parallel speech database of the second speaker, the neutral feature vector into the normalized neutral feature vector in the second feature space of the second speaker comprises the step of:
transforming the neutral feature vector into the normalized neutral feature vector according to the following formula,

$$v'_n = (v_n - \mu_{1x})\,\Sigma_{1x}^{-1/2}\,\Sigma_{2x}^{1/2} + \mu_{2x}$$

wherein v′n denotes the normalized neutral feature vector, vn the neutral feature vector, μ1x the mean extracted from the class x of the neutral speech database corresponding to the neutral feature vector, Σ1x the covariance matrix extracted from the class x of the neutral speech database corresponding to the neutral feature vector, μ2x the mean extracted from the class x of the parallel speech database corresponding to the neutral feature vector, and Σ2x the covariance matrix extracted from the class x of the parallel speech database corresponding to the neutral feature vector.
8. The emotional speech synthesis method according to claim 5, wherein the step of inversely transforming, using the speaker regularization model, the regularized emotional feature vector into the emotional feature vector in the first feature space comprises the following step:
inversely transforming the regularized emotional feature vector into the emotional feature vector according to the formula

$v_e = \Sigma_{1x}^{1/2}\,\Sigma_{2x}^{-1/2}\,(v_e' - \mu_{2x}) + \mu_{1x}$

where $v_e$ denotes the emotional feature vector, $v_e'$ denotes the regularized emotional feature vector, $\mu_{1x}$ and $\Sigma_{1x}$ denote the mean and covariance matrix extracted from the class $x$ of the neutral speech database corresponding to the neutral feature vector, and $\mu_{2x}$ and $\Sigma_{2x}$ denote the mean and covariance matrix extracted from the class $x$ of the parallel speech database corresponding to the neutral feature vector.
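Continuing the sketch above, the inverse mapping of claim 8 simply swaps the roles of the two sets of statistics; deregularize is again a hypothetical name.

```python
def deregularize(v_e_prime, mu_1x, sigma_1x, mu_2x, sigma_2x):
    """Inverse of regularize(): map the regularized emotional vector
    back into speaker 1's feature space (claim 8):
    v_e = Sigma_1x^{1/2} Sigma_2x^{-1/2} (v_e' - mu_2x) + mu_1x
    """
    A = sqrtm(sigma_1x) @ np.linalg.inv(sqrtm(sigma_2x))
    return np.real(A @ (v_e_prime - mu_2x) + mu_1x)
```

By construction, deregularize(regularize(v, ...), ...) recovers v up to numerical error, so any difference between the final emotional feature vector and the original neutral one comes from the emotion conversion step alone.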
9. An emotional speech synthesizer, comprising:
an input unit, which inputs a text sentence;
a prediction unit, which predicts the neutral feature vector of the text sentence in a first speaker's first feature space, using a neutral feature model obtained by training on the first speaker's neutral speech database;
a transform unit, which transforms the neutral feature vector into a regularized neutral feature vector in a second speaker's second feature space, using a speaker regularization model obtained by training on the neutral speech database and the second speaker's parallel speech database;
a conversion unit, which converts the regularized neutral feature vector into a regularized emotional feature vector in the second feature space, using an emotion conversion model obtained by training on the parallel speech database;
an inverse-transform unit, which inversely transforms the regularized emotional feature vector into an emotional feature vector in the first feature space, using the speaker regularization model; and
a synthesis unit, which synthesizes the first speaker's emotional speech from the emotional feature vector in the first feature space.
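Read end to end, claim 9 describes a five-stage pipeline. The sketch below strings the stages together under the assumption that each model exposes a single method; every class and method name here is hypothetical, since the patent defines functional units, not an API.

```python
# Hypothetical end-to-end pipeline for the synthesizer of claim 9.
# All names are illustrative; the patent specifies functional units only.
class EmotionalSpeechSynthesizer:
    def __init__(self, neutral_model, speaker_norm, emotion_model, vocoder):
        self.neutral_model = neutral_model  # trained on speaker 1's neutral database
        self.speaker_norm = speaker_norm    # trained on both databases
        self.emotion_model = emotion_model  # trained on speaker 2's parallel database
        self.vocoder = vocoder              # renders feature vectors to waveform

    def synthesize(self, text: str):
        v_n = self.neutral_model.predict(text)          # prediction unit
        v_n_reg = self.speaker_norm.forward(v_n)        # transform unit (claim 7)
        v_e_reg = self.emotion_model.convert(v_n_reg)   # conversion unit
        v_e = self.speaker_norm.inverse(v_e_reg)        # inverse-transform unit (claim 8)
        return self.vocoder.render(v_e)                 # synthesis unit
```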
CN200910170713A 2009-09-03 2009-09-03 Emotional speech synthesizing method and device Expired - Fee Related CN102005205B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910170713A CN102005205B (en) 2009-09-03 2009-09-03 Emotional speech synthesizing method and device

Publications (2)

Publication Number Publication Date
CN102005205A CN102005205A (en) 2011-04-06
CN102005205B true CN102005205B (en) 2012-10-03

Family

ID=43812513

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910170713A Expired - Fee Related CN102005205B (en) 2009-09-03 2009-09-03 Emotional speech synthesizing method and device

Country Status (1)

Country Link
CN (1) CN102005205B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015184615A1 (en) * 2014-06-05 2015-12-10 Nuance Software Technology (Beijing) Co., Ltd. Systems and methods for generating speech of multiple styles from text
US9824681B2 (en) * 2014-09-11 2017-11-21 Microsoft Technology Licensing, Llc Text-to-speech with emotional content
CN106531150B (en) * 2016-12-23 2020-02-07 云知声(上海)智能科技有限公司 Emotion synthesis method based on deep neural network model
CN107103900B (en) * 2017-06-06 2020-03-31 西北师范大学 Cross-language emotion voice synthesis method and system
CN108766413B (en) * 2018-05-25 2020-09-25 北京云知声信息技术有限公司 Speech synthesis method and system
CN111192568B (en) * 2018-11-15 2022-12-13 华为技术有限公司 Speech synthesis method and speech synthesis device
CN110083702B (en) * 2019-04-15 2021-04-09 中国科学院深圳先进技术研究院 Aspect level text emotion conversion method based on multi-task learning
CN112765971B (en) * 2019-11-05 2023-11-17 北京火山引擎科技有限公司 Text-to-speech conversion method and device, electronic equipment and storage medium
CN110930977B (en) * 2019-11-12 2022-07-08 北京搜狗科技发展有限公司 Data processing method and device and electronic equipment
CN114724540B (en) * 2020-12-21 2025-09-05 阿里巴巴集团控股有限公司 Model processing method and device, emotional speech synthesis method and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101064104A (en) * 2006-04-24 2007-10-31 中国科学院自动化研究所 Emotion voice creating method based on voice conversion


Similar Documents

Publication Publication Date Title
CN102005205B (en) Emotional speech synthesizing method and device
US11450332B2 (en) Audio conversion learning device, audio conversion device, method, and program
Palo et al. Wavelet based feature combination for recognition of emotions
Ma et al. Short utterance based speech language identification in intelligent vehicles with time-scale modifications and deep bottleneck features
Hu et al. GMM supervector based SVM with spectral features for speech emotion recognition
US8935167B2 (en) Exemplar-based latent perceptual modeling for automatic speech recognition
CN101136199B (en) Voice data processing method and equipment
Mannepalli et al. MFCC-GMM based accent recognition system for Telugu speech signals
Wu et al. Audio classification using attention-augmented convolutional neural network
CN107301859B (en) Speech conversion method under non-parallel text condition based on adaptive Gaussian clustering
EP2888669B1 (en) Method and system for selectively biased linear discriminant analysis in automatic speech recognition systems
CN104200804A (en) Various-information coupling emotion recognition method for human-computer interaction
Gawali et al. Marathi isolated word recognition system using MFCC and DTW features
Latif et al. Generative emotional AI for speech emotion recognition: The case for synthetic emotional speech augmentation
Kadyan et al. A comparative study of deep neural network based Punjabi-ASR system
Tsai et al. Discriminative training of Gaussian mixture bigram models with application to Chinese dialect identification
Daouad et al. An automatic speech recognition system for isolated Amazigh word using 1D & 2D CNN-LSTM architecture
CN106297769A (en) A kind of distinctive feature extracting method being applied to languages identification
Jiang et al. Task-aware deep bottleneck features for spoken language identification.
Ahmed et al. Efficient feature extraction and classification for the development of Pashto speech recognition system
Tailor et al. Deep learning approach for spoken digit recognition in Gujarati language
Tobing et al. Cyclic Spectral Modeling for Unsupervised Unit Discovery into Voice Conversion with Excitation and Waveform Modeling.
Mahum et al. Text to speech synthesis using deep learning
Boulal et al. Exploring data augmentation for Amazigh speech recognition with convolutional neural networks
Li An improved machine learning algorithm for text-voice conversion of English letters into phonemes

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20121003

Termination date: 20160903
