
CN113257225B - Emotional speech synthesis method and system integrating vocabulary and phoneme pronunciation features

Info

Publication number: CN113257225B
Application number: CN202110600732.4A
Authority: CN (China)
Other versions: CN113257225A (Chinese)
Legal status: Withdrawn - After Issue
Inventors: 郑书凯, 李太豪, 裴冠雄
Applicant/Assignee: Zhejiang Lab
Application filed by Zhejiang Lab; application granted; publication of CN113257225A and CN113257225B.

Classifications

    • G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme-to-phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/10 - Prosody rules derived from text; stress or intonation
    • G06F 40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/30 - Semantic analysis


Abstract

The invention belongs to the field of artificial intelligence and specifically relates to an emotional speech synthesis method and system integrating vocabulary and phoneme pronunciation features. The method collects text and emotion labels through a recording/collection device, preprocesses the text to obtain phonemes and phoneme alignment information, and generates word segmentation and word-segmentation semantic information. It then computes word pronunciation duration, word speech-rate, word pronunciation energy, and phoneme fundamental-frequency information, and trains a word speech-rate prediction network, a word energy prediction network, and a phoneme fundamental-frequency prediction network. The phoneme hidden features, word speech-rate hidden features, word energy hidden features, and phoneme fundamental-frequency hidden features are obtained and concatenated to synthesize emotional speech. By integrating the word and phoneme pronunciation features related to emotional pronunciation into an end-to-end speech synthesis model, the present invention makes the synthesized emotional speech more natural.

Description

Emotional speech synthesis method and system integrating vocabulary and phoneme pronunciation features

Technical Field

The invention belongs to the field of artificial intelligence and specifically relates to an emotional speech synthesis method and system that integrates vocabulary and phoneme pronunciation features.

Background Art

Spoken language is one of the earliest forms of human communication, and speech remains the main way humans express emotion. With the rise of human-computer interaction, there is an urgent need for conversational robots that possess human-like emotion and sound like real people. The most widely used emotion taxonomy is the seven categories proposed by Ekman in the last century: neutral, happy, sad, angry, fearful, disgusted, and surprised.

With the rise of deep learning in recent years, speech synthesis technology has matured considerably, and making a machine speak like a broadcast announcer is already achievable. However, making a machine produce emotional speech the way a human does remains very difficult. Current mainstream emotional speech synthesis falls into two approaches: segment-based methods built on traditional machine learning such as hidden Markov models, and end-to-end methods based on deep learning. Speech synthesized with hidden Markov models sounds mechanical and unnatural and is now rarely used, while speech synthesized with deep learning is comparatively natural. Nevertheless, current deep-learning emotional synthesis usually just fuses an emotion label into the text features, so the quality of the synthesized emotional speech cannot be effectively guaranteed.

In the current technology, emotional information is incorporated in a simplistic way: the emotion label is merely merged into the text features without considering how humans actually pronounce emotional speech. As a result, the model cannot learn the emotional information well, and the synthesized emotional speech is stiff and unnatural.

Summary of the Invention

To solve the above technical problems in the prior art, the present invention proposes an emotional speech synthesis method and system integrating vocabulary and phoneme pronunciation features. The specific technical scheme is as follows:

An emotional speech synthesis method integrating vocabulary and phoneme pronunciation features comprises the following steps:

Step 1: collect text and emotion labels through a recording/collection device;

Step 2: preprocess the text, obtain phonemes and phoneme alignment information, and generate word segmentation and word-segmentation semantic information;

Step 3: compute word pronunciation duration information, word speech-rate information, word pronunciation energy information, and phoneme fundamental-frequency (F0) information;

Step 4: train the word speech-rate prediction network Net_WordSpeed, the word energy prediction network Net_WordEnergy, and the phoneme F0 prediction network Net_PhonemeF0;

Step 5: obtain phoneme hidden features through the Tacotron2 Encoder, word speech-rate hidden features through Net_WordSpeed, word energy hidden features through Net_WordEnergy, and phoneme F0 hidden features through Net_PhonemeF0;

Step 6: concatenate the phoneme hidden features, word speech-rate hidden features, word energy hidden features, and phoneme F0 hidden features, and synthesize the emotional speech.

Further, step 1 specifically comprises step S1: through a recording/collection device, collect speech audio covering the seven emotion types of neutral, happy, sad, angry, fearful, disgusted, and surprised, denoted Audio; the text corresponding to the speech, denoted Text; and the emotion type corresponding to the speech, denoted Emotion.

Further, step 2 specifically comprises the following steps:

Step S2: convert the collected text Text into the corresponding phoneme text, denoted Phoneme, with the pypinyin toolkit; then, using the speech processing toolkit HTK together with the audio Audio, obtain the time alignment of the text and generate a phoneme-duration text containing the pronunciation duration of each phoneme, denoted PhonemeDur;

Step S3: segment the text Text into words with the jieba word segmentation tool, i.e., insert word boundary markers into the original text to generate the segmented text WordSeg; feed WordSeg into a Chinese pre-trained BERT network with output width D to obtain word features F_word of dimension N×D, specifically

F_word = Bert(WordSeg) = [f_1, f_2, ..., f_N],

where each f_i is a vector of dimension D.

Further, step 3 specifically comprises the following steps:

Step S4: using the generated phoneme-duration text PhonemeDur and the generated segmented text WordSeg, compute the pronunciation duration of each word to obtain the word-duration text WordDur;

Step S5: from the obtained word-duration text WordDur, compute the speech rate of each word and quantize it into 5 classes (slow, relatively slow, normal, relatively fast, fast), obtaining the speech-rate class labels Label_speed corresponding to the segmented text;

Step S6: for the audio Audio and the word-duration text WordDur, compute each word's pronunciation energy as the sum of squared audio amplitudes over the word's duration, and quantize the energy into five classes (low, relatively low, medium, relatively high, high), obtaining the energy labels Label_energy corresponding to the segmented text;

Step S7: for the audio Audio and the phoneme-duration text PhonemeDur, compute the fundamental frequency (F0) of each phoneme with the Librosa toolkit, and quantize the F0 into five classes (low, relatively low, medium, relatively high, high), obtaining the F0 labels Label_F0 corresponding to the phoneme text.

Further, step 4 specifically comprises the following steps:

Step S8, training the word speech-rate prediction network Net_WordSpeed: take the emotion type Emotion and the word features F_word as network input and the speech-rate class labels Label_speed as the network target, feed them into the deep-learning sequence prediction network BiLSTM-CRF, and obtain the word speech-rate prediction network Net_WordSpeed through training;

Step S9, training the word energy prediction network Net_WordEnergy: take the emotion type Emotion and the word features F_word as network input and the energy labels Label_energy as the network target, feed them into the BiLSTM-CRF sequence prediction network, and obtain the word energy prediction network Net_WordEnergy with the same processing as step S8;

Step S10, training the phoneme F0 prediction network Net_PhonemeF0: convert both the emotion type Emotion and the phoneme text Phoneme into vector form with one-hot encoding and use them as network input; convert the F0 labels Label_F0 into vector form with one-hot encoding and use them as the network target; feed them into the BiLSTM-CRF sequence prediction network and obtain the phoneme F0 prediction network Net_PhonemeF0 with the same training method as step S8.

Further, step S8 specifically comprises the following steps:

Step A: convert the emotion type Emotion into a one-hot vector of width 7 using one-hot encoding, then pass it through a single-layer fully connected network of width D to obtain the emotion input hidden feature H_emotion of dimension D;

Step B: concatenate the obtained F_word and H_emotion along the first dimension to obtain the network input Input_net, specifically

Input_net = Concat(F_word, H_emotion);

Step C: convert the label sequence of word length N into one-hot vectors of width 5 using one-hot encoding, finally obtaining the network label matrix M_label of dimension N×5, specifically

M_label = [m_1, m_2, ..., m_N],

where each m_i is a vector of dimension 5;

Step D: feed the network input Input_net and the network label matrix M_label into the BiLSTM-CRF network for training; through automatic learning, the network becomes the speech-rate prediction network Net_WordSpeed capable of predicting the speech rate of text.

Further, step 5 specifically comprises the following steps:

Step S11, obtaining phoneme hidden features through the Tacotron2 Encoder: feed the corresponding phoneme text Phoneme into the Encoder network of Tacotron2 to obtain the Encoder output features H_phoneme;

Step S12, obtaining word speech-rate hidden features through Net_WordSpeed: feed the word features F_word into the word speech-rate prediction network Net_WordSpeed to obtain the speech-rate hidden features output by the BiLSTM; according to the number of phonemes contained in each word, pad the speech-rate hidden features along the time dimension by copying, obtaining speech-rate hidden features H_speed whose length equals the number of phonemes;

Step S13, obtaining word energy hidden features through Net_WordEnergy: feed the word features F_word into the word energy prediction network Net_WordEnergy to obtain the energy hidden features output by the BiLSTM; according to the number of phonemes contained in each word, pad the energy hidden features along the time dimension by copying, obtaining energy hidden features H_energy whose length equals the number of phonemes;

Step S14, obtaining phoneme F0 hidden features through Net_PhonemeF0: feed the phoneme text Phoneme into the phoneme F0 prediction network Net_PhonemeF0 to obtain the phoneme F0 hidden features H_f0 output by the BiLSTM.

Further, step 6 specifically comprises the following steps:

Step S15: concatenate H_phoneme, H_speed, H_energy, and H_f0 to obtain the input Input_decoder of the final Tacotron2 Decoder network, specifically

Input_decoder = Concat(H_phoneme, H_speed, H_energy, H_f0);

Step S16: feed Input_decoder into the Tacotron2 Decoder network, then decode and synthesize the final emotional speech through the subsequent structure of the Tacotron2 network.

An emotional speech synthesis system integrating vocabulary and phoneme pronunciation features comprises:

a text collection module, which uses HTTP transmission to collect the text content to be synthesized and its emotion label;

a text preprocessing module, which preprocesses the collected text and performs word segmentation and phoneme conversion on it, including: unifying text symbols into English symbols, unifying number formats into Chinese text, segmenting the Chinese text into words, converting the segmented text into a semantic vector representation with a pre-trained BERT, converting the text into phoneme text with the pypinyin toolkit, and converting the emotion label into its vector representation with one-hot encoding, thereby generating data that can be processed by the neural networks;

an emotional speech synthesis module, which processes the text and emotion information with the designed network model and synthesizes emotional speech;

a data storage module, which stores already-synthesized emotional speech in a MySQL database;

a synthesized-speech scheduling module, which decides whether to synthesize speech with the model or retrieve already-synthesized speech from the database as the output, and opens an HTTP port for outputting the synthesized emotional speech.

Further, the output preferentially uses already-synthesized emotional speech and secondarily uses model synthesis, so as to improve the system response speed.

Advantages of the present invention:

1. The emotional speech synthesis method of the present invention indirectly controls the emotion of the synthesized speech by controlling how words are pronounced. Since words are the basic units of pronunciation prosody and people express different emotions by controlling the volume, speech rate, and fundamental frequency of different words, synthesizing emotional speech by imitating the way humans express emotion in pronunciation captures the emotion carried by speech better and makes the synthesized speech more natural.

2. The emotional speech synthesis method of the present invention uses independent speech-rate, energy, and fundamental-frequency prediction networks to predict the three key elements of emotional pronunciation, so the effect of the final output speech can be conveniently controlled simply by multiplying the output of each independent network by an adjustment coefficient.

3. The emotional speech synthesis method of the present invention uses Tacotron2 as the backbone network, which effectively improves the quality of the final synthesized speech.

4. The emotional speech synthesis system of the present invention is equipped with an emotional speech invocation interface, so high-quality emotional speech can be synthesized through a simple HTTP call. For scenarios requiring human-machine voice interaction, this greatly improves the user experience, for example in intelligent telephone customer-service dialogue, intelligent map navigation dialogue, conversational robots in children's education, and humanoid-robot dialogue in banks, airports, and similar settings.

Brief Description of the Drawings

Fig. 1 is a schematic structural diagram of the emotional speech synthesis system of the present invention;

Fig. 2 is a schematic flowchart of the emotional speech synthesis method of the present invention;

Fig. 3 is a schematic diagram of the network structure of the emotional speech synthesis method of the present invention;

Fig. 4 is a schematic diagram of the network structure of the Tacotron2 speech synthesis system.

Detailed Description of Embodiments

To make the objectives, technical solutions, and technical effects of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings.

As shown in Fig. 1, an emotional speech synthesis system integrating vocabulary and phoneme pronunciation features comprises:

a text collection module, which uses HTTP transmission to collect the text content to be synthesized and its emotion label;

a text preprocessing module, which preprocesses the collected text and performs word segmentation and phoneme conversion on it, including: unifying text symbols into English symbols, unifying number formats into Chinese text, segmenting the Chinese text into words, converting the segmented text into a semantic vector representation with a pre-trained BERT, converting the text into phoneme text with the pypinyin toolkit, and converting the emotion label into its vector representation with one-hot encoding, thereby generating data that can be processed by the neural networks;

an emotional speech synthesis module, which processes the text and emotion information with the designed network model and synthesizes emotional speech;

a data storage module, which stores already-synthesized emotional speech in a MySQL database;

a synthesized-speech scheduling module, which decides whether to synthesize speech with the model or retrieve already-synthesized speech from the database as the output, and opens an HTTP port for outputting the synthesized emotional speech; the output preferentially uses already-synthesized emotional speech and secondarily uses model synthesis to improve the system response speed.
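The scheduling logic can be sketched as a small HTTP service; this is only an illustration of the idea described above, and the web framework, endpoint name, and helper functions lookup_cached_speech, model_synthesize, and store_speech are hypothetical, not part of the patent.

```python
import io
from flask import Flask, request, send_file

app = Flask(__name__)

@app.post("/synthesize")
def synthesize():
    text = request.json["text"]
    emotion = request.json["emotion"]
    wav = lookup_cached_speech(text, emotion)   # query the MySQL store (hypothetical helper)
    if wav is None:
        wav = model_synthesize(text, emotion)   # fall back to the synthesis model (hypothetical helper)
        store_speech(text, emotion, wav)        # cache the result for future requests (hypothetical helper)
    return send_file(io.BytesIO(wav), mimetype="audio/wav")
```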

As shown in Figs. 2-4, an emotional speech synthesis method integrating vocabulary and phoneme pronunciation features comprises the following steps:

Step S1, collecting text and emotion labels: through a recording/collection device, collect speech audio covering the seven emotion types of neutral, happy, sad, angry, fearful, disgusted, and surprised, denoted Audio; the text corresponding to the speech, denoted Text; and the emotion type corresponding to the speech, denoted Emotion;

Step S2, preprocessing the text and obtaining phonemes and phoneme alignment information: convert the text Text collected in step S1 into the corresponding phoneme text, denoted Phoneme, with the pypinyin toolkit; then, using the speech processing toolkit HTK together with the audio Audio obtained in step S1, obtain the time alignment of the text and generate a phoneme-duration text containing the pronunciation duration of each phoneme, denoted PhonemeDur;
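As an illustration of the phoneme conversion in step S2, the following is a minimal sketch using pypinyin; the split into initials and finals with tone numbers is an assumed phoneme scheme (the patent does not fix one), and the HTK forced alignment is treated as an external step.

```python
from pypinyin import lazy_pinyin, Style

def text_to_phonemes(text: str) -> list[str]:
    # Initials and finals (with tone numbers) for each character, flattened into a
    # phoneme-like sequence; characters with no initial contribute only their final.
    initials = lazy_pinyin(text, style=Style.INITIALS, strict=False)
    finals = lazy_pinyin(text, style=Style.FINALS_TONE3, strict=False)
    phonemes = []
    for ini, fin in zip(initials, finals):
        if ini:
            phonemes.append(ini)
        if fin:
            phonemes.append(fin)
    return phonemes

if __name__ == "__main__":
    print(text_to_phonemes("今天天气真好"))   # e.g. ['j', 'in1', 't', 'ian1', ...]
```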

Step S3, preprocessing the text and generating word segmentation and word-segmentation semantic information: segment the text Text into words with the jieba word segmentation tool, i.e., insert word boundary markers into the original text to generate the segmented text WordSeg; feed WordSeg into a Chinese pre-trained BERT network with output width D to obtain word features F_word of dimension N×D, specifically

F_word = Bert(WordSeg) = [f_1, f_2, ..., f_N],

where each f_i is a vector of dimension D;
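A minimal sketch of step S3, assuming the Hugging Face transformers implementation of a Chinese pre-trained BERT; the model name bert-base-chinese, the per-character tokenization assumption, and the mean-pooling of character states into one vector per word are assumptions rather than details fixed by the patent.

```python
import jieba
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese").eval()

def word_features(text: str):
    words = list(jieba.cut(text))                  # word segmentation (WordSeg)
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state[0]  # (num_tokens, D), includes [CLS]/[SEP]
    feats, idx = [], 1                             # skip [CLS]
    for w in words:
        span = hidden[idx: idx + len(w)]           # Chinese BERT tokenizes roughly per character
        feats.append(span.mean(dim=0))             # one D-dim vector per word
        idx += len(w)
    return words, torch.stack(feats)               # F_word: (N, D)
```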

Step S4, computing word pronunciation duration information: using the phoneme-duration text PhonemeDur generated in step S2 and the segmented text WordSeg generated in step S3, compute the pronunciation duration of each word to obtain the word-duration text WordDur;

Step S5, computing word speech-rate information: from the word-duration text WordDur obtained in step S4, compute the speech rate of each word and quantize it into 5 classes (slow, relatively slow, normal, relatively fast, fast), obtaining the speech-rate class labels Label_speed corresponding to the segmented text;
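A sketch of the speech-rate labeling in step S5; the rate measure (characters pronounced per second) and the quantile-based binning into five classes are assumptions, since the patent does not state the exact thresholds.

```python
import numpy as np

SPEED_CLASSES = ["slow", "relatively slow", "normal", "relatively fast", "fast"]

def speed_labels(words, durations_sec):
    # Speech rate of a word, taken here as characters pronounced per second.
    rates = np.array([len(w) / max(d, 1e-3) for w, d in zip(words, durations_sec)])
    # Corpus quantiles serve as assumed class boundaries.
    edges = np.quantile(rates, [0.2, 0.4, 0.6, 0.8])
    return [SPEED_CLASSES[int(np.searchsorted(edges, r))] for r in rates]
```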

Step S6, computing word pronunciation energy information: for the audio obtained in step S1 and the word-duration text WordDur obtained in step S4, compute each word's pronunciation energy as the sum of squared audio amplitudes over the word's duration, and quantize the energy into five classes (low, relatively low, medium, relatively high, high), obtaining the energy labels Label_energy corresponding to the segmented text;
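A sketch of the per-word energy computation in step S6 (the sum of squared amplitudes over each word's time span, as stated above); the five-class quantization can reuse the same binning approach assumed for the speech-rate labels.

```python
import numpy as np

def word_energy(audio: np.ndarray, sr: int, word_spans):
    # Energy of a word = sum of squared sample amplitudes over its (start, end) span in seconds.
    return [float(np.sum(audio[int(s * sr): int(e * sr)] ** 2)) for s, e in word_spans]
```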

Step S7, computing phoneme fundamental-frequency information: for the audio Audio obtained in step S1 and the phoneme-duration text PhonemeDur obtained in step S2, compute the fundamental frequency (F0) of each phoneme with the Librosa toolkit, and quantize the F0 into five classes (low, relatively low, medium, relatively high, high), obtaining the F0 labels Label_F0 corresponding to the phoneme text;
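A sketch of the phoneme-level F0 extraction in step S7; the choice of librosa.pyin and the averaging of voiced frames within each phoneme span are assumptions, since the patent only states that Librosa is used.

```python
import librosa
import numpy as np

def phoneme_f0(wav_path: str, phoneme_spans):
    audio, sr = librosa.load(wav_path, sr=None)
    # Frame-level F0 with librosa's pYIN tracker; unvoiced frames come back as NaN.
    f0, _, _ = librosa.pyin(audio, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"), sr=sr)
    times = librosa.times_like(f0, sr=sr)
    out = []
    for start, end in phoneme_spans:              # (start, end) in seconds per phoneme
        frames = f0[(times >= start) & (times < end)]
        voiced = frames[~np.isnan(frames)]
        out.append(float(voiced.mean()) if voiced.size else 0.0)
    return out
```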

Step S8, training the word speech-rate prediction network Net_WordSpeed: take the emotion type Emotion obtained in step S1 and the word features F_word obtained in step S3 as network input and the speech-rate class labels Label_speed obtained in step S5 as the network target, and feed them into the deep-learning sequence prediction network BiLSTM-CRF; a bidirectional long short-term memory (BiLSTM) network is used because it is particularly suited to sequence tasks such as speech signal processing and text signal processing; the word speech-rate prediction network Net_WordSpeed is then obtained by training. Specifically, this comprises the following steps:

Step A: convert the emotion type Emotion into a one-hot vector of width 7 using one-hot encoding, then pass it through a single-layer fully connected network of width D to obtain the emotion input hidden feature H_emotion of dimension D;
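A minimal sketch of step A; the value of D (768, matching a common BERT hidden size) is an assumption.

```python
import torch
import torch.nn as nn

NUM_EMOTIONS, D = 7, 768      # D should match the BERT output width; 768 is an assumption

emotion_proj = nn.Linear(NUM_EMOTIONS, D)   # the single-layer fully connected network of width D

def emotion_hidden(emotion_id: int) -> torch.Tensor:
    one_hot = torch.zeros(NUM_EMOTIONS)
    one_hot[emotion_id] = 1.0                # one-hot vector of width 7
    return emotion_proj(one_hot)             # H_emotion, a vector of dimension D
```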

Step B: concatenate the F_word obtained in step S3 and the H_emotion obtained in step A along the first dimension to obtain the network input Input_net, specifically

Input_net = Concat(F_word, H_emotion);

Step C: convert the label sequence of word length N into one-hot vectors of width 5 using one-hot encoding, finally obtaining the network label matrix M_label of dimension N×5, specifically

M_label = [m_1, m_2, ..., m_N],

where each m_i is a vector of dimension 5;

Step D: feed the network input Input_net obtained in step B and the network label matrix M_label obtained in step C into the BiLSTM-CRF network for training; through automatic learning, the network becomes the speech-rate prediction network Net_WordSpeed capable of predicting the speech rate of text.
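A sketch of the BiLSTM-CRF tagger used in steps A-D; the hidden size, the use of the third-party pytorch-crf package, and the training setup are assumptions rather than details given in the patent.

```python
import torch
import torch.nn as nn
from torchcrf import CRF     # pip install pytorch-crf (third-party package, an assumed choice)

class BiLSTMCRF(nn.Module):
    """Sequence tagger mapping per-word input features to one of 5 classes."""
    def __init__(self, input_dim: int, hidden_dim: int = 256, num_classes: int = 5):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim // 2, batch_first=True, bidirectional=True)
        self.emit = nn.Linear(hidden_dim, num_classes)
        self.crf = CRF(num_classes, batch_first=True)

    def forward(self, x, tags=None):
        hidden, _ = self.lstm(x)          # (B, T, hidden_dim): the hidden features reused in step S12
        emissions = self.emit(hidden)
        if tags is not None:              # training: negative CRF log-likelihood as the loss
            return -self.crf(emissions, tags), hidden
        return self.crf.decode(emissions), hidden

# Training sketch: x is the Input_net sequence of shape (B, T, D), tags is (B, T) class ids.
model = BiLSTMCRF(input_dim=768)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```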

Step S9, training the word energy prediction network Net_WordEnergy: take the emotion type Emotion obtained in step S1 and the word features F_word obtained in step S3 as network input and the energy labels Label_energy obtained in step S6 as the network target, feed them into the BiLSTM-CRF sequence prediction network, and obtain the word energy prediction network Net_WordEnergy with the same processing as step S8.

Step S10, training the phoneme F0 prediction network Net_PhonemeF0: convert both the emotion type Emotion obtained in step S1 and the phoneme text Phoneme obtained in step S2 into vector form with one-hot encoding and use them as network input; convert the F0 labels Label_F0 obtained in step S7 into vector form with one-hot encoding and use them as the network target; feed them into the BiLSTM-CRF sequence prediction network and obtain the phoneme F0 prediction network Net_PhonemeF0 with the same training method as step S8.

Step S11, obtaining phoneme hidden features through the Tacotron2 Encoder: feed the phoneme text Phoneme obtained in step S2 into the Encoder network of Tacotron2 to obtain the Encoder output features H_phoneme;

Step S12, obtaining word speech-rate hidden features through Net_WordSpeed: feed the word features F_word obtained in step S3 into the word speech-rate prediction network Net_WordSpeed obtained in step S8 to obtain the speech-rate hidden features output by the BiLSTM; according to the number of phonemes contained in each word, pad the speech-rate hidden features along the time dimension by copying, obtaining speech-rate hidden features H_speed whose length equals the number of phonemes;
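A sketch of the length padding by copying used in steps S12 and S13, which expands word-level hidden features to the phoneme time axis.

```python
import torch

def expand_to_phonemes(word_hidden: torch.Tensor, phonemes_per_word: list[int]) -> torch.Tensor:
    # word_hidden: (N, H), one row per word; repeat each row once per phoneme in that word
    # so the result lines up with the phoneme sequence: (total_phonemes, H).
    counts = torch.tensor(phonemes_per_word)
    return torch.repeat_interleave(word_hidden, counts, dim=0)
```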

Step S13, obtaining word energy hidden features through Net_WordEnergy: feed the word features F_word obtained in step S3 into the word energy prediction network Net_WordEnergy obtained in step S9 to obtain the energy hidden features output by the BiLSTM; according to the number of phonemes contained in each word, pad the energy hidden features along the time dimension by copying, obtaining energy hidden features H_energy whose length equals the number of phonemes;

Step S14, obtaining phoneme F0 hidden features through Net_PhonemeF0: feed the phoneme text Phoneme obtained in step S2 into the phoneme F0 prediction network Net_PhonemeF0 obtained in step S10 to obtain the phoneme F0 hidden features H_f0 output by the BiLSTM;

Step S15, concatenating the phoneme hidden features, word speech-rate hidden features, word energy hidden features, and phoneme F0 hidden features: concatenate the H_phoneme obtained in step S11, the H_speed obtained in step S12, the H_energy obtained in step S13, and the H_f0 obtained in step S14 to obtain the input Input_decoder of the final Tacotron2 Decoder network, specifically

Input_decoder = Concat(H_phoneme, H_speed, H_energy, H_f0);
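A sketch of the feature concatenation in step S15, assuming all four feature sequences have already been aligned to the phoneme time axis as described in steps S11-S14.

```python
import torch

def decoder_input(h_phoneme, h_speed, h_energy, h_f0):
    # All four sequences share the phoneme time axis (T, *); concatenate them along
    # the feature dimension to form Input_decoder for the Tacotron2 Decoder.
    return torch.cat([h_phoneme, h_speed, h_energy, h_f0], dim=-1)
```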

Step S16, synthesizing the emotional speech: feed the Input_decoder obtained in step S15 into the Tacotron2 Decoder network, then decode and synthesize the final emotional speech through the subsequent structure of the Tacotron2 network.

In summary, the method provided by this embodiment improves the plausibility of the generated emotional speech features by controlling the pronunciation of the text's words, and can thereby improve the quality of the final synthesized emotional speech.

The above is only a preferred embodiment of the present invention and does not limit the present invention in any form. Although the implementation of the present invention has been described in detail above, those skilled in the art can still modify the technical solutions described in the foregoing examples or replace some of their technical features with equivalents. Any modification, equivalent replacement, or the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1.一种融合词汇及音素发音特征的情感语音合成方法,其特征在于,包括如下步骤:1. a kind of emotional speech synthesis method of fusion vocabulary and phoneme pronunciation feature, is characterized in that, comprises the steps: 步骤一,通过录音采集设备,采集文本及情感标签;Step 1: Collect text and emotional tags through recording equipment; 步骤二,对所述文本进行预处理,获取音素及音素对齐信息,生成分词及分词语义信息;Step 2, preprocessing the text, obtaining phoneme and phoneme alignment information, and generating word segmentation and word segmentation semantic information; 步骤三,分别计算并得到分词发音时长信息、分词发音语速信息、分词发音能量信息、音素基频信息;Step 3, respectively calculate and obtain word segmentation pronunciation duration information, word segmentation pronunciation speech rate information, word segmentation pronunciation energy information, phoneme fundamental frequency information; 步骤四,分别训练分词语速预测网络Net_WordSpeed、分词能量预测网络Net_WordEnergy、音素基频预测网络Net_PhonemeF0;Step 4, respectively train the word segmentation speed prediction network Net_WordSpeed, the word segmentation energy prediction network Net_WordEnergy, and the phoneme fundamental frequency prediction network Net_PhonemeF0; 步骤五,通过Tacotron2的Encoder,获取音素隐含信息,通过Net_WordSpeed,获取分词语速隐含信息,通过Net_WordEnergy,获取分词能量隐含信息,通过Net_PhonemeF0,获取音素基频隐含信息;Step 5: Obtain the implicit information of phoneme through the Encoder of Tacotron2, obtain the implicit information of word segmentation speed through Net_WordSpeed, obtain implicit information of word segmentation energy through Net_WordEnergy, and obtain the implicit information of fundamental frequency of phoneme through Net_PhonemeF0; 步骤六,拼接所述音素隐含信息、分词语速隐含信息、分词能量隐含信息、音素基频隐含信息,合成情感语音。Step 6, splicing the phoneme implicit information, word segmentation speed implicit information, word segmentation energy implicit information, and phoneme fundamental frequency implicit information to synthesize emotional speech. 2.如权利要求1所述的一种融合词汇及音素发音特征的情感语音合成方法,其特征在于,所述步骤一具体包括步骤S1:通过录音采集设备,采集涵盖Ekman提出的7种情感类型的语音音频,表示为
Figure DEST_PATH_IMAGE001
,语音对应的文本,表示为
Figure DEST_PATH_IMAGE002
,语音对应的情感类型,表示为
Figure DEST_PATH_IMAGE003
2. the emotion speech synthesis method of a kind of fusion vocabulary and phoneme pronunciation feature as claimed in claim 1, is characterized in that, described step one specifically comprises step S1: by recording collection equipment, collection covers 7 kinds of emotion types that Ekman proposes speech audio, expressed as
Figure DEST_PATH_IMAGE001
, the text corresponding to the voice, expressed as
Figure DEST_PATH_IMAGE002
, the emotion type corresponding to the speech, expressed as
Figure DEST_PATH_IMAGE003
.
3.如权利要求2所述的一种融合词汇及音素发音特征的情感语音合成方法,其特征在于,所述步骤二具体包括如下步骤:3. the emotional speech synthesis method of a kind of fusion vocabulary and phoneme pronunciation feature as claimed in claim 2, is characterized in that, described step 2 specifically comprises the steps: 步骤S2,对采集的文本
Figure 330242DEST_PATH_IMAGE002
,通过pypinyin工具包转换为对应的音素文本,表示为
Figure DEST_PATH_IMAGE004
,然后将音素文本
Figure DEST_PATH_IMAGE005
和得到的
Figure 388328DEST_PATH_IMAGE001
通过语音处理工具软件HTK,获取文本的时间对齐信息,生成包含每个音素发音时长的音素-时长文本,表示为
Figure DEST_PATH_IMAGE006
Step S2, for the collected text
Figure 330242DEST_PATH_IMAGE002
, converted to the corresponding phoneme text by the pypinyin toolkit, expressed as
Figure DEST_PATH_IMAGE004
, then the phoneme text
Figure DEST_PATH_IMAGE005
and obtained
Figure 388328DEST_PATH_IMAGE001
Through the speech processing tool software HTK, the time alignment information of the text is obtained, and the phoneme-duration text containing the pronunciation duration of each phoneme is generated, which is expressed as
Figure DEST_PATH_IMAGE006
;
步骤S3,对文本
Figure 105748DEST_PATH_IMAGE002
,通过结巴分词工具进行分词,即在原始文本中插入分词边界标识符,生成分词文本
Figure DEST_PATH_IMAGE007
,将分词文本
Figure 202755DEST_PATH_IMAGE007
输入到输出宽度为D中文预训练Bert网络,得到维度为N×D的分词特征
Figure DEST_PATH_IMAGE008
,具体的,
Step S3, to the text
Figure 105748DEST_PATH_IMAGE002
, perform word segmentation through the stuttering word segmentation tool, that is, insert the word segmentation boundary identifier into the original text to generate the word segmentation text
Figure DEST_PATH_IMAGE007
, which will segment the text
Figure 202755DEST_PATH_IMAGE007
The input to output width is D Chinese pre-trained Bert network, and the word segmentation feature of dimension N×D is obtained
Figure DEST_PATH_IMAGE008
,specific,
Figure DEST_PATH_IMAGE009
Figure DEST_PATH_IMAGE009
其中,
Figure DEST_PATH_IMAGE010
是一个维度为D的向量。
in,
Figure DEST_PATH_IMAGE010
is a vector of dimension D.
4.如权利要求3所述的一种融合词汇及音素发音特征的情感语音合成方法,其特征在于,所述步骤三具体包括如下步骤:4. the emotion speech synthesis method of a kind of fusion vocabulary and phoneme pronunciation feature as claimed in claim 3, is characterized in that, described step 3 specifically comprises the steps: 步骤S4,利用所述生成的
Figure 892494DEST_PATH_IMAGE006
和生成的分词文本
Figure DEST_PATH_IMAGE011
计算每个分词的发音时长,得到分词-时长文本
Figure DEST_PATH_IMAGE012
Step S4, using the generated
Figure 892494DEST_PATH_IMAGE006
and the resulting segmented text
Figure DEST_PATH_IMAGE011
Calculate the pronunciation duration of each participle to get the participle-duration text
Figure DEST_PATH_IMAGE012
;
步骤S5,通过得到的分词-时长文本
Figure DEST_PATH_IMAGE013
计算分词的语速信息,并将语速归为5类,分别为:慢、较慢、一般、较快、快,从而得到分词文本对应的语速类别标签
Figure DEST_PATH_IMAGE014
Step S5, through the obtained word segmentation-duration text
Figure DEST_PATH_IMAGE013
Calculate the speech speed information of the word segmentation, and classify the speech speed into 5 categories, namely: slow, slow, normal, faster, and fast, so as to obtain the speech rate category label corresponding to the word segmentation text
Figure DEST_PATH_IMAGE014
;
步骤S6,对所述音频
Figure 557699DEST_PATH_IMAGE001
和分词-时长文本
Figure 293574DEST_PATH_IMAGE013
,通过分词持续时间内音频幅值的平方和计算分词的发音能量信息,并将能量信息归为五类,分别为:低、较低、中、较高、高,从而得到分词文本对应的能量标签
Figure DEST_PATH_IMAGE015
Step S6, for the audio
Figure 557699DEST_PATH_IMAGE001
and participle-duration text
Figure 293574DEST_PATH_IMAGE013
, calculate the pronunciation energy information of the word segmentation by the sum of the squares of the audio amplitudes during the word segmentation duration, and classify the energy information into five categories: low, low, medium, high, and high, so as to obtain the energy corresponding to the word segmentation text Label
Figure DEST_PATH_IMAGE015
;
步骤S7,对所述音频
Figure 430157DEST_PATH_IMAGE001
和音素-时长文本
Figure DEST_PATH_IMAGE016
,通过Librosa工具包计算音素发音的基频信息,并将基频信息根据基频高低归为五类,分别为:低、较低、中、较高、高,从而得到音素文本对应的基频标签
Figure DEST_PATH_IMAGE017
Step S7, for the audio
Figure 430157DEST_PATH_IMAGE001
and phoneme-duration text
Figure DEST_PATH_IMAGE016
, calculate the fundamental frequency information of phoneme pronunciation through the Librosa toolkit, and classify the fundamental frequency information into five categories according to the fundamental frequency, namely: low, low, medium, high, high, so as to obtain the fundamental frequency corresponding to the phoneme text Label
Figure DEST_PATH_IMAGE017
.
5.如权利要求4所述的一种融合词汇及音素发音特征的情感语音合成方法,其特征在于,所述步骤四具体包括如下步骤:5. the emotion speech synthesis method of a kind of fusion vocabulary and phoneme pronunciation feature as claimed in claim 4, is characterized in that, described step 4 specifically comprises the steps: 步骤S8,训练分词语速预测网络Net_WordSpeed:将情感标签
Figure DEST_PATH_IMAGE018
和分词语义特征
Figure DEST_PATH_IMAGE019
作为网络输入,语速分类标签
Figure DEST_PATH_IMAGE020
作为网络目标,输入到序列预测深度学习序列预测网络BiLSTM-CRF,然后通过深度学习的网络训练得到分词语速预测网络Net_WordSpeed;
Step S8, train the word speed prediction network Net_WordSpeed:
Figure DEST_PATH_IMAGE018
and participle semantic features
Figure DEST_PATH_IMAGE019
As network input, speech rate classification labels
Figure DEST_PATH_IMAGE020
As a network target, input it to the sequence prediction deep learning sequence prediction network BiLSTM-CRF, and then obtain the word speed prediction network Net_WordSpeed through deep learning network training;
步骤S9,训练分词能量预测网络Net_WordEnergy:将情感标签
Figure DEST_PATH_IMAGE021
和分词语义特征
Figure DEST_PATH_IMAGE022
作为网络输入,语速分类标签
Figure DEST_PATH_IMAGE023
作为网络目标,输入到序列预测深度学习序列预测网络BLSTM-CRF,通过与步骤S4_1同样的处理方法,得到分词能量预测网络Net_WordEnergy;
Step S9, train the word segmentation energy prediction network Net_WordEnergy:
Figure DEST_PATH_IMAGE021
and participle semantic features
Figure DEST_PATH_IMAGE022
As network input, speech rate classification labels
Figure DEST_PATH_IMAGE023
As the network target, input it to the sequence prediction deep learning sequence prediction network BLSTM-CRF, and obtain the word segmentation energy prediction network Net_WordEnergy through the same processing method as step S4_1;
步骤S10,训练音素基频预测网络Net_PhonemeF0:将情感标签
Figure 88584DEST_PATH_IMAGE021
和音素文本
Figure DEST_PATH_IMAGE024
都通过One-Hot转换技术,转换为向量形式后,作为网络输入,基频标签
Figure 285210DEST_PATH_IMAGE017
通过One-Hot转换技术,转换为向量形式后,作为网络目标,输入到序列预测深度学习序列预测网络BLS TM-CRF,通过与步骤S4_1一样的训练方法,得到音素基频预测网络Net_PhonemeF0。
Step S10, train the phoneme fundamental frequency prediction network Net_PhonemeF0:
Figure 88584DEST_PATH_IMAGE021
and phonemic text
Figure DEST_PATH_IMAGE024
All through One-Hot conversion technology, after conversion to vector form, as network input, fundamental frequency label
Figure 285210DEST_PATH_IMAGE017
After converting to vector form through One-Hot conversion technology, as a network target, it is input to the sequence prediction deep learning sequence prediction network BLS TM-CRF, and the phoneme fundamental frequency prediction network Net_PhonemeF0 is obtained through the same training method as step S4_1.
6.如权利要求5所述的一种融合词汇及音素发音特征的情感语音合成方法,其特征在于,所述步骤S8具体包括如下步骤:6. the emotional speech synthesis method of a kind of fusion vocabulary and phoneme pronunciation feature as claimed in claim 5, is characterized in that, described step S8 specifically comprises the steps: 步骤A:情感标签
Figure DEST_PATH_IMAGE025
,通过One-Hot向量转换技术,转换为宽度为7的One-Hot向量,然后通过宽度为D的单层全连接网络,转换为维度为D的标签输入隐含特征
Figure DEST_PATH_IMAGE026
Step A: Sentiment Labeling
Figure DEST_PATH_IMAGE025
, converted to a One-Hot vector with a width of 7 through the One-Hot vector conversion technology, and then converted to a label input latent feature of dimension D through a single-layer fully connected network with a width of D
Figure DEST_PATH_IMAGE026
;
步骤B:将得到的
Figure 944599DEST_PATH_IMAGE008
Figure 150452DEST_PATH_IMAGE026
在第一个维度进行拼接,得到网络输入
Figure DEST_PATH_IMAGE027
,具体的,
Step B: will get
Figure 944599DEST_PATH_IMAGE008
and
Figure 150452DEST_PATH_IMAGE026
Splicing in the first dimension to get the network input
Figure DEST_PATH_IMAGE027
,specific,
Figure DEST_PATH_IMAGE028
Figure DEST_PATH_IMAGE028
步骤C:将分词长度N为的标签,通过One-Hot向量转换技术,转换为宽度为5的One-Hot向量,最终得到维度为N×5的网络标签矩阵
Figure DEST_PATH_IMAGE029
,具体的,
Step C: Convert the label with a word segmentation length of N into a One-Hot vector with a width of 5 through the One-Hot vector conversion technology, and finally obtain a network label matrix with a dimension of N×5
Figure DEST_PATH_IMAGE029
,specific,
Figure DEST_PATH_IMAGE030
Figure DEST_PATH_IMAGE030
其中,
Figure DEST_PATH_IMAGE031
是一个维度为5的向量;
in,
Figure DEST_PATH_IMAGE031
is a vector of dimension 5;
步骤D:将网络输入
Figure DEST_PATH_IMAGE032
和网络标签矩阵
Figure 719843DEST_PATH_IMAGE029
,输入到BLSTM-CRF网络中进行训练,通过网络的自动学习,得到可以预测文本语速的语速预测网络Net_WordSpeed。
Step D: Enter the network
Figure DEST_PATH_IMAGE032
and network label matrix
Figure 719843DEST_PATH_IMAGE029
, input into the BLSTM-CRF network for training, and through the automatic learning of the network, the speech rate prediction network Net_WordSpeed that can predict the speech rate of the text is obtained.
7. The emotional speech synthesis method integrating vocabulary and phoneme pronunciation features according to claim 5, wherein said step five specifically comprises the following steps:

Step S11, obtain the phoneme latent information through the Encoder of Tacotron2: input the corresponding phoneme text into the Encoder network of the Tacotron2 network to obtain the Encoder output features;
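For orientation, a Tacotron2-style encoder is commonly a phoneme embedding followed by three 1-D convolutions and a BiLSTM; the sketch below is a generic approximation along those lines and not the exact encoder used in the patent (all sizes are assumptions).

import torch
import torch.nn as nn

class PhonemeEncoder(nn.Module):
    """Tacotron2-style encoder sketch: embedding -> 3 conv layers -> BiLSTM."""
    def __init__(self, vocab=80, dim=512):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.convs = nn.ModuleList(
            nn.Sequential(nn.Conv1d(dim, dim, kernel_size=5, padding=2),
                          nn.BatchNorm1d(dim), nn.ReLU())
            for _ in range(3))
        self.lstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, phoneme_ids):                  # (batch, time)
        x = self.emb(phoneme_ids).transpose(1, 2)    # (batch, dim, time)
        for conv in self.convs:
            x = conv(x)
        x, _ = self.lstm(x.transpose(1, 2))          # (batch, time, dim)
        return x

enc_out = PhonemeEncoder()(torch.randint(0, 80, (2, 20)))
print(enc_out.shape)                                 # torch.Size([2, 20, 512])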
Step S12, obtain the word-segmentation speech-rate latent information through Net_WordSpeed: input the word-segmentation semantic features into the word-segmentation speech-rate prediction network Net_WordSpeed to obtain the latent features output by its BiLSTM; according to the number of phonemes contained in each segmented word, pad the latent features along the time dimension by copying, obtaining latent features whose length equals the number of phonemes;
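The word-to-phoneme length alignment in steps S12 and S13 can be illustrated with torch.repeat_interleave, which copies each word-level feature as many times as that word has phonemes (the counts below are toy values, not from the patent):

import torch

word_feats = torch.randn(4, 256)                    # BiLSTM latent feature per segmented word
phonemes_per_word = torch.tensor([2, 3, 1, 4])      # toy phoneme counts per word

# Copy each word feature along the time dimension up to phoneme length.
phoneme_aligned = torch.repeat_interleave(word_feats, phonemes_per_word, dim=0)
print(phoneme_aligned.shape)                        # torch.Size([10, 256])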
Step S13, obtain the word-segmentation energy latent information through Net_WordEnergy: input the word-segmentation semantic features into the word-segmentation energy prediction network Net_WordEnergy to obtain the latent features output by its BiLSTM; according to the number of phonemes contained in each segmented word, pad the latent features along the time dimension by copying, obtaining latent features whose length equals the number of phonemes;
Step S14, obtain the phoneme fundamental-frequency latent information through Net_PhonemeF0: input the word-segmentation text into the phoneme fundamental-frequency prediction network Net_PhonemeF0 to obtain the latent features output by its BiLSTM.
8. The emotional speech synthesis method integrating vocabulary and phoneme pronunciation features according to claim 7, wherein said step six specifically comprises the following steps:

Step S15: concatenate the four latent features obtained in steps S11 to S14 to obtain the final input of the Decoder network of Tacotron2;
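A hedged sketch of the Step S15 concatenation; it assumes the four feature streams are already aligned to the same phoneme length T (after the copying in steps S12 and S13) and are joined along the feature dimension before being handed to the decoder. The widths are toy values.

import torch

T = 10                                   # number of phonemes after length padding
enc_out  = torch.randn(T, 512)           # Tacotron2 Encoder output features (S11)
speed_h  = torch.randn(T, 256)           # word-segmentation speech-rate latent features (S12)
energy_h = torch.randn(T, 256)           # word-segmentation energy latent features (S13)
f0_h     = torch.randn(T, 256)           # phoneme fundamental-frequency latent features (S14)

decoder_in = torch.cat([enc_out, speed_h, energy_h, f0_h], dim=-1)
print(decoder_in.shape)                  # torch.Size([10, 1280])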
Step S16: input the concatenated features into the Decoder network of Tacotron2, and then decode and synthesize the final emotional speech through the subsequent structure of the Tacotron2 network.
9. An emotional speech synthesis system integrating vocabulary and phoneme pronunciation features, characterized by comprising:

a text collection module, configured to collect, via http transmission, the text content to be synthesized and its emotion label;

a text preprocessing module, configured to preprocess the collected text and perform word segmentation and phoneme conversion on it, including: successively converting text symbols uniformly into English symbols, converting numeric formats uniformly into Chinese text, segmenting the Chinese text into words, converting the segmented text into a semantic vector representation through a pre-trained Bert model, and converting the text into phoneme text with the pypinyin toolkit; the emotion label is converted through One-Hot conversion into its vector representation, producing data usable for neural network processing;

an emotional speech synthesis module, configured to process the text and emotion information with the designed Tacotron2 network model and synthesize the emotional speech;

a data storage module, configured to store the synthesized emotional speech using a MySQL database;

a synthesized-speech scheduling module, configured to decide whether to synthesize speech with the model or to retrieve already-synthesized speech from the database as output, and to open an http port for outputting the synthesized emotional speech.

10. The emotional speech synthesis system integrating vocabulary and phoneme pronunciation features according to claim 9, wherein the output emotional speech preferentially uses already-synthesized emotional speech and secondarily uses model synthesis, so as to improve the system response speed.
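As a rough illustration of the scheduling behaviour described in claims 9 and 10 (return cached speech first, synthesize with the model only on a miss), the Python sketch below uses an in-memory dict in place of the MySQL store and a placeholder synthesize() function; all names are assumptions, not the patented implementation.

from typing import Dict, Tuple

cache: Dict[Tuple[str, str], bytes] = {}        # stands in for the MySQL store

def synthesize(text: str, emotion: str) -> bytes:
    """Placeholder for the Tacotron2-based synthesis module."""
    return f"<waveform for '{text}' ({emotion})>".encode("utf-8")

def get_speech(text: str, emotion: str) -> bytes:
    """Return cached speech when available; otherwise synthesize and cache it."""
    key = (text, emotion)
    if key in cache:                            # claim 10: prefer already-synthesized speech
        return cache[key]
    audio = synthesize(text, emotion)           # fall back to model synthesis
    cache[key] = audio
    return audio

print(get_speech("你好", "happy")[:20])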
CN202110600732.4A 2021-05-31 2021-05-31 A kind of emotional speech synthesis method and system integrating vocabulary and phoneme pronunciation features Withdrawn - After Issue CN113257225B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110600732.4A CN113257225B (en) 2021-05-31 2021-05-31 A kind of emotional speech synthesis method and system integrating vocabulary and phoneme pronunciation features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110600732.4A CN113257225B (en) 2021-05-31 2021-05-31 A kind of emotional speech synthesis method and system integrating vocabulary and phoneme pronunciation features

Publications (2)

Publication Number Publication Date
CN113257225A CN113257225A (en) 2021-08-13
CN113257225B (en) 2021-11-02

Family

ID=77185459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110600732.4A Withdrawn - After Issue CN113257225B (en) 2021-05-31 2021-05-31 A kind of emotional speech synthesis method and system integrating vocabulary and phoneme pronunciation features

Country Status (1)

Country Link
CN (1) CN113257225B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114974183B (en) * 2022-05-16 2024-12-20 广州虎牙科技有限公司 Singing voice synthesis method, system and computer equipment
CN116469368B (en) * 2023-04-11 2025-12-12 广东九四智能科技有限公司 A speech synthesis method and system that integrates semantic information
CN116564274A (en) * 2023-06-16 2023-08-08 平安科技(深圳)有限公司 Speech synthesis method, speech synthesis device, electronic device and storage medium
CN117711413A (en) * 2023-11-02 2024-03-15 广东广信通信服务有限公司 Voice recognition data processing method, system, device and storage medium
CN119207366B (en) * 2024-11-21 2025-02-11 国科大杭州高等研究院 Fine granularity emphasized controllable emotion voice synthesis method
CN119479702B (en) * 2025-01-08 2025-04-25 成都佳发安泰教育科技股份有限公司 Pronunciation scoring method, pronunciation scoring device, electronic equipment and storage medium
CN119854414B (en) * 2025-03-19 2025-11-18 山东致群信息技术股份有限公司 AI-based telephone answering system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6665644B1 (en) * 1999-08-10 2003-12-16 International Business Machines Corporation Conversational data mining
CN108364632A (en) * 2017-12-22 2018-08-03 东南大学 A kind of Chinese text voice synthetic method having emotion
CN111627420A (en) * 2020-04-21 2020-09-04 升智信息科技(南京)有限公司 Specific-speaker emotion voice synthesis method and device under extremely low resources
CN111696579A (en) * 2020-06-17 2020-09-22 厦门快商通科技股份有限公司 Speech emotion recognition method, device, equipment and computer storage medium
CN112786004A (en) * 2020-12-30 2021-05-11 科大讯飞股份有限公司 Speech synthesis method, electronic device, and storage device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7752043B2 (en) * 2006-09-29 2010-07-06 Verint Americas Inc. Multi-pass speech analytics


Also Published As

Publication number Publication date
CN113257225A (en) 2021-08-13

Similar Documents

Publication Publication Date Title
CN113257225B (en) A kind of emotional speech synthesis method and system integrating vocabulary and phoneme pronunciation features
CN116364055B (en) Speech generation method, device, device and medium based on pre-trained language model
CN110211563B (en) Chinese speech synthesis method, device and storage medium for scenes and emotion
CN112489629B (en) Voice transcription model, method, medium and electronic equipment
CN111667812A (en) Voice synthesis method, device, equipment and storage medium
CN113628609A (en) Automatic audio content generation
CN110390928B (en) Method and system for training speech synthesis model of automatic expansion corpus
CN110297909B (en) Method and device for classifying unlabeled corpora
CN109036371A (en) Audio data generation method and system for speech synthesis
CN114005430B (en) Training method, device, electronic device and storage medium for speech synthesis model
CN117789771A (en) Cross-language end-to-end emotion voice synthesis method and system
CN113539268A (en) End-to-end voice-to-text rare word optimization method
CN111009235A (en) Voice recognition method based on CLDNN + CTC acoustic model
CN107221344A (en) A kind of speech emotional moving method
CN118942443A (en) Speech generation method, virtual human speech generation method and speech generation system
CN111009236A (en) Voice recognition method based on DBLSTM + CTC acoustic model
CN113299272B (en) Speech synthesis model training and speech synthesis method, equipment and storage medium
CN117219046A (en) Interactive voice emotion control method and system
CN117316141A (en) Training methods, audio generation methods, devices and equipment for prosody annotation models
CN117079637A (en) Mongolian emotion voice synthesis method based on condition generation countermeasure network
CN114333900B (en) Method for extracting BNF (BNF) characteristics end to end, network model, training method and training system
CN119763546B (en) Speech synthesis method, system, electronic device and storage medium
CN116129868A (en) Method and system for generating structured photo
CN115130457A (en) Prosody modeling method and modeling system fused with Amdo Tibetan phoneme vectors
CN120599998A (en) A voice cloning method, device and related medium based on emotion enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
AV01 Patent right actively abandoned

Granted publication date: 20211102

Effective date of abandoning: 20251012