CN113257225B - Emotional speech synthesis method and system integrating vocabulary and phoneme pronunciation features - Google Patents
Emotional speech synthesis method and system integrating vocabulary and phoneme pronunciation features
- Publication number
- CN113257225B (application number CN202110600732.4A)
- Authority
- CN
- China
- Prior art keywords
- phoneme
- word segmentation
- text
- network
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn - After Issue
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Machine Translation (AREA)
Abstract
The invention belongs to the field of artificial intelligence and specifically relates to an emotional speech synthesis method and system integrating vocabulary and phoneme pronunciation features. The method collects text and emotion labels through a recording device; preprocesses the text to obtain phonemes and phoneme alignment information and to generate word segments and word-level semantic information; computes the word pronunciation duration, word pronunciation speech rate, word pronunciation energy and phoneme fundamental-frequency information; trains a word speech-rate prediction network, a word energy prediction network and a phoneme fundamental-frequency prediction network; obtains and concatenates the phoneme hidden information, word speech-rate hidden information, word energy hidden information and phoneme fundamental-frequency hidden information; and synthesizes the emotional speech. By integrating the vocabulary and phoneme pronunciation features related to emotional pronunciation into an end-to-end speech synthesis model, the invention makes the synthesized emotional speech more natural.
Description
Technical Field
The invention belongs to the field of artificial intelligence and specifically relates to an emotional speech synthesis method and system integrating vocabulary and phoneme pronunciation features.
Background
Spoken interaction is one of the earliest forms of human communication, so speech has become the main way humans express emotion. With the rise of human-computer interaction, giving conversational robots human-like emotion so that they sound like real people has become an urgent need. The dominant emotion taxonomy is the seven categories proposed by Ekman in the last century: neutral, happy, sad, angry, fearful, disgusted and surprised.
With the rise of deep learning in recent years, speech synthesis technology has matured, and making a machine sound like a professional announcer is already achievable. Making a machine produce emotional speech the way a human does, however, remains very difficult. Mainstream emotional speech synthesis falls into two approaches: a segmental approach based on traditional machine learning such as hidden Markov models, and an end-to-end approach based on deep learning. Speech synthesized with hidden Markov models sounds mechanical and unnatural and is now rarely used; speech synthesized with deep learning is comparatively natural. However, current deep-learning emotional synthesis simply injects the emotion label into the text features, so the quality of the synthesized emotional speech cannot be effectively guaranteed.
In existing technology, emotional information is incorporated in a simple way: the emotion label is merely fused into the text features, without considering how people actually pronounce emotional speech. The model therefore cannot learn the emotional information well, and the synthesized emotional speech sounds stiff and unnatural.
Summary of the Invention
To solve the above technical problems in the prior art, the present invention proposes an emotional speech synthesis method and system integrating vocabulary and phoneme pronunciation features. The specific technical scheme is as follows:
An emotional speech synthesis method integrating vocabulary and phoneme pronunciation features comprises the following steps:
Step 1: collect text and emotion labels through a recording device;
Step 2: preprocess the text, obtain phonemes and phoneme alignment information, and generate word segments and word-level semantic information;
Step 3: compute the word pronunciation duration, word pronunciation speech rate, word pronunciation energy and phoneme fundamental-frequency information;
Step 4: train the word speech-rate prediction network Net_WordSpeed, the word energy prediction network Net_WordEnergy and the phoneme fundamental-frequency prediction network Net_PhonemeF0;
Step 5: obtain the phoneme hidden information through the Tacotron2 Encoder, the word speech-rate hidden information through Net_WordSpeed, the word energy hidden information through Net_WordEnergy, and the phoneme fundamental-frequency hidden information through Net_PhonemeF0;
Step 6: concatenate the phoneme hidden information, word speech-rate hidden information, word energy hidden information and phoneme fundamental-frequency hidden information, and synthesize the emotional speech.
Further, step 1 specifically comprises step S1: through a recording device, collect speech audio covering the seven emotion types neutral, happy, sad, angry, fearful, disgusted and surprised, together with the text corresponding to each utterance and the emotion type corresponding to each utterance.
Further, step 2 specifically comprises the following steps:
Step S2: convert the collected text into the corresponding phoneme text with the pypinyin toolkit, then feed the phoneme text and the collected audio into the speech processing tool HTK to obtain the time alignment information of the text, generating a phoneme-duration text that contains the pronunciation duration of every phoneme;
Step S3: segment the text with the jieba word segmentation tool, i.e. insert word-boundary markers into the original text to generate segmented text, and feed the segmented text into a Chinese pre-trained BERT network with output width D, obtaining a word-segmentation feature matrix of dimension N×D in which each word's feature is a vector of dimension D.
Further, step 3 specifically comprises the following steps:
Step S4: use the generated phoneme-duration text and the generated segmented text to compute the pronunciation duration of each word, obtaining a word-duration text;
Step S5: compute the speech rate of each word from the obtained word-duration text and classify the rates into five categories (slow, slower, normal, faster, fast), obtaining the speech-rate category label corresponding to the segmented text;
Step S6: for the audio and the word-duration text, compute the pronunciation energy of each word as the sum of the squared audio amplitudes within the word's duration, and classify the energy into five categories (low, lower, medium, higher, high), obtaining the energy label corresponding to the segmented text;
Step S7: for the audio and the phoneme-duration text, compute the fundamental frequency of each phoneme's pronunciation with the Librosa toolkit, and classify the fundamental frequencies into five categories (low, lower, medium, higher, high), obtaining the fundamental-frequency label corresponding to the phoneme text.
Further, step 4 specifically comprises the following steps:
Step S8: train the word speech-rate prediction network Net_WordSpeed: take the emotion type and the word-segmentation features as network input and the speech-rate category labels as network targets, feed them into the deep-learning sequence prediction network BiLSTM-CRF, and obtain the word speech-rate prediction network Net_WordSpeed through network training;
Step S9: train the word energy prediction network Net_WordEnergy: take the emotion type and the word-segmentation features as network input and the energy labels as network targets, feed them into the BiLSTM-CRF sequence prediction network, and obtain the word energy prediction network Net_WordEnergy with the same procedure as step S8;
Step S10: train the phoneme fundamental-frequency prediction network Net_PhonemeF0: convert the emotion type and the phoneme text into vector form through One-Hot conversion as network input, convert the fundamental-frequency labels into vector form through One-Hot conversion as network targets, feed them into the BiLSTM-CRF sequence prediction network, and obtain the phoneme fundamental-frequency prediction network Net_PhonemeF0 with the same training procedure as step S8.
Further, step S8 specifically comprises the following steps:
Step A: convert the emotion type into a One-Hot vector of width 7, then pass it through a single fully-connected layer of width D to obtain a label input hidden feature of dimension D;
Step B: concatenate the word-segmentation features and the label hidden feature along the first dimension to obtain the network input;
Step C: convert the labels of the N words into One-Hot vectors of width 5, finally obtaining a network label matrix of dimension N×5 in which each row is a vector of dimension 5;
Step D: feed the network input and the network label matrix into the BiLSTM-CRF network for training; through the network's automatic learning, the speech-rate prediction network Net_WordSpeed that can predict the speech rate of text is obtained.
Further, step 5 specifically comprises the following steps:
Step S11: obtain the phoneme hidden information through the Tacotron2 Encoder: feed the corresponding phoneme text into the Encoder network of Tacotron2 to obtain the Encoder output features;
Step S12: obtain the word speech-rate hidden information through Net_WordSpeed: feed the word-segmentation features into the word speech-rate prediction network Net_WordSpeed to obtain the speech-rate hidden features output by the BiLSTM; according to the number of phonemes contained in each word, pad the speech-rate hidden features along the time dimension by copying, obtaining speech-rate hidden features whose length equals the number of phonemes;
Step S13: obtain the word energy hidden information through Net_WordEnergy: feed the word-segmentation features into the word energy prediction network Net_WordEnergy to obtain the energy hidden features output by the BiLSTM; according to the number of phonemes contained in each word, pad the energy hidden features along the time dimension by copying, obtaining energy hidden features whose length equals the number of phonemes;
Step S14: obtain the phoneme fundamental-frequency hidden information through Net_PhonemeF0: feed the phoneme text into the phoneme fundamental-frequency prediction network Net_PhonemeF0 to obtain the phoneme fundamental-frequency hidden features output by the BiLSTM.
Further, step 6 specifically comprises the following steps:
Step S15: concatenate the phoneme hidden features, the speech-rate hidden features, the energy hidden features and the fundamental-frequency hidden features to obtain the input of the final Tacotron2 Decoder network;
Step S16: feed the concatenated features into the Decoder network of Tacotron2, then decode and synthesize the final emotional speech through the subsequent structure of the Tacotron2 network.
An emotional speech synthesis system integrating vocabulary and phoneme pronunciation features comprises:
a text collection module, which uses HTTP transmission to collect the text content to be synthesized and the emotion label;
a text preprocessing module, which preprocesses the collected text and performs word segmentation and phoneme conversion, including: uniformly converting text punctuation to English symbols, uniformly converting numeric formats to Chinese text, segmenting the Chinese text, converting the segmented text into semantic vector representations with a pre-trained BERT, converting the text into phoneme text with the pypinyin toolkit, and converting the emotion label into a vector representation through One-Hot conversion, thereby generating data that the neural network can process;
an emotional speech synthesis module, which processes the text and emotion information with the designed network model and synthesizes emotional speech;
a data storage module, which stores already-synthesized emotional speech in a MySQL database;
a synthesized-speech scheduling module, which decides whether to synthesize speech with the model or to retrieve already-synthesized speech from the database as the output, and opens an HTTP port for outputting the synthesized emotional speech.
Further, the output preferentially uses already-synthesized emotional speech and resorts to model synthesis only when necessary, improving the system's response speed.
Advantages of the present invention:
1. The emotional speech synthesis method of the present invention controls the emotion of the synthesized speech indirectly by controlling how words are pronounced. Words are the basic units of pronunciation prosody, and people express different emotions by controlling the volume, speech rate and fundamental frequency with which different words are pronounced; synthesizing emotional speech by imitating the way humans express emotion in pronunciation therefore captures the emotion carried by the speech better and makes the synthesized speech more natural;
2. The method uses an independent speech-rate prediction network, energy prediction network and fundamental-frequency prediction network to predict the three elements of emotional pronunciation, so the final output speech can be conveniently adjusted by multiplying the output of each independent network by a simple coefficient;
3. The method uses Tacotron2 as the backbone network, which effectively improves the quality of the final synthesized speech;
4. The emotional speech synthesis system of the present invention provides an emotional speech calling interface, so high-quality emotional speech can be synthesized through a simple HTTP call. For scenarios that require human-computer voice interaction this greatly improves the user experience, for example intelligent telephone customer-service dialogue, intelligent map navigation dialogue, conversational robots in children's education, and dialogue interaction with humanoid robots in banks, airports and similar settings.
Brief Description of the Drawings
Fig. 1 is a structural diagram of the emotional speech synthesis system of the present invention;
Fig. 2 is a flow diagram of the emotional speech synthesis method of the present invention;
Fig. 3 is a diagram of the network structure used by the emotional speech synthesis method of the present invention;
Fig. 4 is a diagram of the network structure of the Tacotron2 speech synthesis system.
Detailed Description of Embodiments
To make the objectives, technical solutions and technical effects of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings.
As shown in Fig. 1, an emotional speech synthesis system integrating vocabulary and phoneme pronunciation features comprises:
a text collection module, which uses HTTP transmission to collect the text content to be synthesized and the emotion label;
a text preprocessing module, which preprocesses the collected text and performs word segmentation and phoneme conversion, including: uniformly converting text punctuation to English symbols, uniformly converting numeric formats to Chinese text, segmenting the Chinese text, converting the segmented text into semantic vector representations with a pre-trained BERT, converting the text into phoneme text with the pypinyin toolkit, and converting the emotion label into a vector representation through One-Hot conversion, thereby generating data that the neural network can process;
an emotional speech synthesis module, which processes the text and emotion information with the designed network model and synthesizes emotional speech;
a data storage module, which stores already-synthesized emotional speech in a MySQL database;
a synthesized-speech scheduling module, which decides whether to synthesize speech with the model or to retrieve already-synthesized speech from the database as the output, and opens an HTTP port for outputting the synthesized emotional speech; the output preferentially uses already-synthesized emotional speech and resorts to model synthesis only when necessary, improving the system's response speed, as sketched below.
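A minimal sketch of this scheduling decision, assuming an illustrative MySQL table synthesized_speech with columns cache_key and audio; the table and column names, the MD5 cache key and the use of the pymysql driver are assumptions for illustration, not part of the patent.

```python
# Sketch of the scheduling module: prefer speech already stored in MySQL, fall back to the model.
import hashlib
import pymysql

def get_emotional_speech(text: str, emotion: str, synthesize_fn,
                         db: pymysql.connections.Connection) -> bytes:
    key = hashlib.md5(f"{emotion}:{text}".encode("utf-8")).hexdigest()
    with db.cursor() as cur:
        cur.execute("SELECT audio FROM synthesized_speech WHERE cache_key=%s", (key,))
        row = cur.fetchone()
        if row:                                   # cache hit: return stored emotional speech
            return row[0]
    audio = synthesize_fn(text, emotion)          # cache miss: run the synthesis model
    with db.cursor() as cur:
        cur.execute("INSERT INTO synthesized_speech (cache_key, audio) VALUES (%s, %s)",
                    (key, audio))
    db.commit()
    return audio
```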
As shown in Figs. 2 to 4, an emotional speech synthesis method integrating vocabulary and phoneme pronunciation features comprises the following steps:
Step S1, collect text and emotion labels: through a recording device, collect speech audio covering the seven emotion types neutral, happy, sad, angry, fearful, disgusted and surprised, together with the text corresponding to each utterance and the emotion type corresponding to each utterance;
Step S2, preprocess the text and obtain phoneme and phoneme alignment information: convert the text collected in step S1 into the corresponding phoneme text with the pypinyin toolkit, then feed the phoneme text and the audio obtained in step S1 into the speech processing tool HTK to obtain the time alignment information of the text, generating a phoneme-duration text that contains the pronunciation duration of every phoneme;
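A minimal sketch of the phoneme conversion in step S2, assuming the usual initial/final split that pypinyin provides; the subsequent HTK forced alignment that produces the per-phoneme durations is not shown.

```python
# Sketch of step S2: convert Chinese text into an initial/final phoneme sequence with pypinyin.
from pypinyin import lazy_pinyin, Style

def text_to_phonemes(text: str) -> list[str]:
    """Split each syllable into an initial and a tone-marked final."""
    initials = lazy_pinyin(text, style=Style.INITIALS, strict=False)
    finals = lazy_pinyin(text, style=Style.FINALS_TONE3, strict=False)
    phonemes = []
    for ini, fin in zip(initials, finals):
        if ini:                       # zero-initial syllables yield an empty initial
            phonemes.append(ini)
        phonemes.append(fin)
    return phonemes

print(text_to_phonemes("今天天气很好"))   # e.g. ['j', 'in1', 't', 'ian1', ...]
```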
Step S3, preprocess the text and generate word segments and word-level semantic information: segment the text with the jieba word segmentation tool, i.e. insert word-boundary markers into the original text to generate segmented text, and feed the segmented text into a Chinese pre-trained BERT network with output width D, obtaining a word-segmentation feature matrix of dimension N×D in which each word's feature is a vector of dimension D;
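A minimal sketch of step S3, assuming the Hugging Face transformers implementation of a Chinese pre-trained BERT ("bert-base-chinese", D = 768) and mean-pooling of each word's sub-token vectors; the patent only specifies that an N×D word-feature matrix is produced, so the pooling choice is an assumption.

```python
# Sketch of step S3: jieba word segmentation followed by per-word BERT semantic vectors.
import jieba
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese").eval()

def word_features(text: str) -> tuple[list[str], torch.Tensor]:
    words = jieba.lcut(text)                               # N segmented words
    vectors = []
    with torch.no_grad():
        for w in words:
            enc = tokenizer(w, return_tensors="pt")
            hidden = bert(**enc).last_hidden_state          # (1, T, D)
            vectors.append(hidden[0, 1:-1].mean(dim=0))     # mean over the word's sub-tokens
    return words, torch.stack(vectors)                      # (N, D) feature matrix

words, feats = word_features("今天天气很好")
print(words, feats.shape)
```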
Step S4, compute the word pronunciation durations: use the phoneme-duration text generated in step S2 and the segmented text generated in step S3 to compute the pronunciation duration of each word, obtaining a word-duration text;
Step S5, compute the word speech-rate information: compute the speech rate of each word from the word-duration text obtained in step S4 and classify the rates into five categories (slow, slower, normal, faster, fast), obtaining the speech-rate category label corresponding to the segmented text;
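A minimal sketch of steps S4 and S5: per-word durations are summed from the aligned phoneme durations, converted to a rate, and binned into the five classes. Using characters per second as the rate measure and corpus quantiles as the class boundaries are assumptions; the patent fixes only the five-way labelling.

```python
# Sketch of steps S4-S5: word durations from phoneme durations, then five-way rate labels.
import numpy as np

def word_durations(phoneme_durations: list[float], phonemes_per_word: list[int]) -> list[float]:
    """Step S4: sum the aligned phoneme durations belonging to each word."""
    durations, k = [], 0
    for n in phonemes_per_word:
        durations.append(sum(phoneme_durations[k:k + n]))
        k += n
    return durations

def rate_labels(words: list[str], durations: list[float], corpus_rates: list[float]) -> list[int]:
    """Step S5: characters-per-second rate, binned into 5 classes (0=slow ... 4=fast)."""
    rates = [len(w) / d for w, d in zip(words, durations)]
    bounds = np.quantile(corpus_rates, [0.2, 0.4, 0.6, 0.8])   # assumed class boundaries
    return [int(np.searchsorted(bounds, r)) for r in rates]
```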
Step S6, compute the word pronunciation energy: for the audio obtained in step S1 and the word-duration text obtained in step S4, compute the pronunciation energy of each word as the sum of the squared audio amplitudes within the word's duration, and classify the energy into five categories (low, lower, medium, higher, high), obtaining the energy label corresponding to the segmented text;
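A minimal sketch of step S6, computing each word's pronunciation energy as the sum of squared sample amplitudes inside the word's aligned time span; the five-way binning by corpus quantiles is an assumption.

```python
# Sketch of step S6: per-word energy and five-way energy labels.
import numpy as np

def word_energy(audio: np.ndarray, sr: int, start_s: float, end_s: float) -> float:
    """Sum of squared sample amplitudes inside one word's aligned span."""
    segment = audio[int(start_s * sr):int(end_s * sr)]
    return float(np.sum(segment.astype(np.float64) ** 2))

def energy_labels(energies: list[float], corpus_energies: list[float]) -> list[int]:
    """Five-way binning (0=low ... 4=high); quantile boundaries are an assumption."""
    bounds = np.quantile(corpus_energies, [0.2, 0.4, 0.6, 0.8])
    return [int(np.searchsorted(bounds, e)) for e in energies]
```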
Step S7, compute the phoneme fundamental-frequency information: for the audio obtained in step S1 and the phoneme-duration text obtained in step S2, compute the fundamental frequency of each phoneme's pronunciation with the Librosa toolkit, and classify the fundamental frequencies into five categories (low, lower, medium, higher, high), obtaining the fundamental-frequency label corresponding to the phoneme text;
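A minimal sketch of step S7 using librosa's pYIN pitch tracker and averaging F0 over each phoneme's aligned span; the patent only names the Librosa toolkit, so the particular tracker and the averaging are assumptions.

```python
# Sketch of step S7: per-phoneme fundamental frequency from the audio and the HTK alignment.
import librosa
import numpy as np

def phoneme_f0(audio: np.ndarray, sr: int, spans: list[tuple[float, float]]) -> list[float]:
    """Mean F0 (Hz) over each phoneme's (start_s, end_s) span; 0.0 if no voiced frames."""
    f0, voiced, _ = librosa.pyin(audio, fmin=librosa.note_to_hz("C2"),
                                 fmax=librosa.note_to_hz("C6"), sr=sr)
    times = librosa.times_like(f0, sr=sr)
    values = []
    for start, end in spans:
        mask = (times >= start) & (times < end) & voiced
        values.append(float(np.mean(f0[mask])) if mask.any() else 0.0)
    return values
```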
Step S8, train the word speech-rate prediction network Net_WordSpeed: take the emotion type obtained in step S1 and the word-segmentation features obtained in step S3 as network input, and the speech-rate category labels obtained in step S5 as network targets, and feed them into the deep-learning sequence prediction network BiLSTM-CRF. A BiLSTM (bidirectional long short-term memory) network is used because it is particularly well suited to sequence tasks such as speech and text signal processing. The word speech-rate prediction network Net_WordSpeed is then obtained through network training. Specifically, this comprises the following steps:
Step A: convert the emotion type into a One-Hot vector of width 7, then pass it through a single fully-connected layer of width D to obtain a label input hidden feature of dimension D;
Step B: concatenate the word-segmentation features obtained in step S3 and the label hidden feature obtained in step A along the first dimension to obtain the network input;
Step C: convert the labels of the N words into One-Hot vectors of width 5, finally obtaining a network label matrix of dimension N×5 in which each row is a vector of dimension 5;
Step D: feed the network input obtained in step B and the network label matrix obtained in step C into the BiLSTM-CRF network for training; through the network's automatic learning, the speech-rate prediction network Net_WordSpeed that can predict the speech rate of text is obtained.
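A minimal sketch of the Net_WordSpeed predictor described in steps A to D (the same structure serves steps S9 and S10 with different inputs and labels), written in PyTorch with the third-party pytorch-crf package supplying the CRF layer; the hidden size and the choice of CRF implementation are assumptions.

```python
# Sketch of steps A-D: emotion one-hot projected to D, prepended to the word features,
# then a BiLSTM-CRF sequence predictor over the five speech-rate classes.
import torch
import torch.nn as nn
from torchcrf import CRF   # from the third-party "pytorch-crf" package (assumption)

class WordSpeedNet(nn.Module):
    """Net_WordSpeed sketch; Net_WordEnergy and Net_PhonemeF0 share the structure."""
    def __init__(self, d: int = 768, hidden: int = 256, num_classes: int = 5):
        super().__init__()
        self.emotion_proj = nn.Linear(7, d)              # step A: one-hot emotion -> dimension D
        self.bilstm = nn.LSTM(d, hidden, batch_first=True, bidirectional=True)
        self.to_tags = nn.Linear(2 * hidden, num_classes)
        self.crf = CRF(num_classes, batch_first=True)

    def forward(self, word_feats, emotion_onehot, tags=None):
        # step B: prepend the projected emotion vector to the (B, N, D) word features
        e = self.emotion_proj(emotion_onehot).unsqueeze(1)
        h, _ = self.bilstm(torch.cat([e, word_feats], dim=1))
        emissions = self.to_tags(h[:, 1:])               # drop the emotion position: one tag per word
        if tags is not None:                             # step D, training: negative CRF log-likelihood
            return -self.crf(emissions, tags)
        return self.crf.decode(emissions), h[:, 1:]      # inference: labels + BiLSTM hidden features
```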
Step S9, train the word energy prediction network Net_WordEnergy: take the emotion type obtained in step S1 and the word-segmentation features obtained in step S3 as network input, and the energy labels obtained in step S6 as network targets, feed them into the BiLSTM-CRF sequence prediction network, and obtain the word energy prediction network Net_WordEnergy with the same procedure as step S8.
Step S10, train the phoneme fundamental-frequency prediction network Net_PhonemeF0: convert the emotion type obtained in step S1 and the phoneme text obtained in step S2 into vector form through One-Hot conversion as network input, convert the fundamental-frequency labels obtained in step S7 into vector form through One-Hot conversion as network targets, feed them into the BiLSTM-CRF sequence prediction network, and obtain the phoneme fundamental-frequency prediction network Net_PhonemeF0 with the same training procedure as step S8.
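A minimal sketch of the One-Hot conversions used as network inputs and targets in steps S8 to S10; the emotion vocabulary ordering shown is illustrative.

```python
# Sketch of the One-Hot conversions: emotion type (width 7) and id sequences (width 5 for
# rate/energy/F0 labels, or the phoneme-inventory size for phoneme ids).
import torch
import torch.nn.functional as F

EMOTIONS = ["neutral", "happy", "sad", "angry", "fearful", "disgusted", "surprised"]

def one_hot_emotion(emotion: str) -> torch.Tensor:
    """Width-7 One-Hot vector for the utterance-level emotion type."""
    return F.one_hot(torch.tensor(EMOTIONS.index(emotion)), num_classes=7).float()

def one_hot_sequence(ids: list[int], num_classes: int) -> torch.Tensor:
    """(N, num_classes) One-Hot matrix for a phoneme-id or label-id sequence."""
    return F.one_hot(torch.tensor(ids), num_classes=num_classes).float()
```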
Step S11, obtain the phoneme hidden information through the Tacotron2 Encoder: feed the phoneme text obtained in step S2 into the Encoder network of Tacotron2 to obtain the Encoder output features;
Step S12, obtain the word speech-rate hidden information through Net_WordSpeed: feed the word-segmentation features obtained in step S3 into the word speech-rate prediction network Net_WordSpeed obtained in step S8 to get the speech-rate hidden features output by the BiLSTM; according to the number of phonemes contained in each word, pad the speech-rate hidden features along the time dimension by copying, obtaining speech-rate hidden features whose length equals the number of phonemes;
Step S13, obtain the word energy hidden information through Net_WordEnergy: feed the word-segmentation features obtained in step S3 into the word energy prediction network Net_WordEnergy obtained in step S9 to get the energy hidden features output by the BiLSTM; according to the number of phonemes contained in each word, pad the energy hidden features along the time dimension by copying, obtaining energy hidden features whose length equals the number of phonemes;
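A minimal sketch of the length padding in steps S12 and S13: each word's BiLSTM hidden feature is repeated once per phoneme the word contains, so the word-level sequence reaches phoneme length and can be combined with the phoneme-level features.

```python
# Sketch of the time-dimension padding by copying (steps S12-S13).
import torch

def expand_to_phonemes(word_hidden: torch.Tensor, phonemes_per_word: list[int]) -> torch.Tensor:
    """(N_words, H) -> (N_phonemes, H): repeat each word's hidden feature once per phoneme."""
    counts = torch.tensor(phonemes_per_word)
    return torch.repeat_interleave(word_hidden, counts, dim=0)
```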
Step S14, obtain the phoneme fundamental-frequency hidden information through Net_PhonemeF0: feed the phoneme text obtained in step S2 into the phoneme fundamental-frequency prediction network Net_PhonemeF0 obtained in step S10 to get the phoneme fundamental-frequency hidden features output by the BiLSTM;
Step S15, concatenate the phoneme hidden information, word speech-rate hidden information, word energy hidden information and phoneme fundamental-frequency hidden information: concatenate the features obtained in steps S11, S12, S13 and S14 to obtain the input of the final Tacotron2 Decoder network;
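A minimal sketch of the concatenation in step S15; joining the four phoneme-length sequences along the feature axis is an assumption about the splicing direction, since the patent only states that the hidden features are concatenated to form the Tacotron2 Decoder input.

```python
# Sketch of step S15: concatenate the four phoneme-length feature sequences from steps S11-S14.
import torch

def decoder_input(encoder_out: torch.Tensor,
                  speed_hidden: torch.Tensor,
                  energy_hidden: torch.Tensor,
                  f0_hidden: torch.Tensor) -> torch.Tensor:
    """All arguments have shape (n_phonemes, feature_dim); the result replaces the plain
    encoder output as the conditioning sequence for the Tacotron2 decoder."""
    return torch.cat([encoder_out, speed_hidden, energy_hidden, f0_hidden], dim=-1)
```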
Step S16, synthesize the emotional speech: feed the concatenated features obtained in step S15 into the Decoder network of Tacotron2, then decode and synthesize the final emotional speech through the subsequent structure of the Tacotron2 network.
In summary, the method provided by this embodiment improves the rationality of emotional speech feature generation by controlling the pronunciation of the text's words, and can improve the quality of the finally synthesized emotional speech.
The above are only preferred embodiments of the present invention and do not limit the present invention in any form. Although the implementation of the present invention has been described in detail above, those skilled in the art can still modify the technical solutions described in the foregoing examples or replace some of their technical features with equivalents. Any modification or equivalent replacement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110600732.4A (CN113257225B) | 2021-05-31 | 2021-05-31 | Emotional speech synthesis method and system integrating vocabulary and phoneme pronunciation features |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110600732.4A (CN113257225B) | 2021-05-31 | 2021-05-31 | Emotional speech synthesis method and system integrating vocabulary and phoneme pronunciation features |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN113257225A CN113257225A (en) | 2021-08-13 |
| CN113257225B true CN113257225B (en) | 2021-11-02 |
Family
ID=77185459
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202110600732.4A (CN113257225B, Withdrawn - After Issue) | 2021-05-31 | 2021-05-31 | Emotional speech synthesis method and system integrating vocabulary and phoneme pronunciation features |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN113257225B (en) |
Families Citing this family (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114974183B (en) * | 2022-05-16 | 2024-12-20 | 广州虎牙科技有限公司 | Singing voice synthesis method, system and computer equipment |
| CN116469368B (en) * | 2023-04-11 | 2025-12-12 | 广东九四智能科技有限公司 | A speech synthesis method and system that integrates semantic information |
| CN116564274A (en) * | 2023-06-16 | 2023-08-08 | 平安科技(深圳)有限公司 | Speech synthesis method, speech synthesis device, electronic device and storage medium |
| CN117711413A (en) * | 2023-11-02 | 2024-03-15 | 广东广信通信服务有限公司 | Voice recognition data processing method, system, device and storage medium |
| CN119207366B (en) * | 2024-11-21 | 2025-02-11 | 国科大杭州高等研究院 | Fine granularity emphasized controllable emotion voice synthesis method |
| CN119479702B (en) * | 2025-01-08 | 2025-04-25 | 成都佳发安泰教育科技股份有限公司 | Pronunciation scoring method, pronunciation scoring device, electronic equipment and storage medium |
| CN119854414B (en) * | 2025-03-19 | 2025-11-18 | 山东致群信息技术股份有限公司 | AI-based telephone answering system |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6665644B1 (en) * | 1999-08-10 | 2003-12-16 | International Business Machines Corporation | Conversational data mining |
| CN108364632A (en) * | 2017-12-22 | 2018-08-03 | 东南大学 | A kind of Chinese text voice synthetic method having emotion |
| CN111627420A (en) * | 2020-04-21 | 2020-09-04 | 升智信息科技(南京)有限公司 | Specific-speaker emotion voice synthesis method and device under extremely low resources |
| CN111696579A (en) * | 2020-06-17 | 2020-09-22 | 厦门快商通科技股份有限公司 | Speech emotion recognition method, device, equipment and computer storage medium |
| CN112786004A (en) * | 2020-12-30 | 2021-05-11 | 科大讯飞股份有限公司 | Speech synthesis method, electronic device, and storage device |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7752043B2 (en) * | 2006-09-29 | 2010-07-06 | Verint Americas Inc. | Multi-pass speech analytics |
- 2021-05-31: application CN202110600732.4A filed in China (CN); granted as CN113257225B; status: not active, Withdrawn - After Issue
Also Published As
| Publication number | Publication date |
|---|---|
| CN113257225A (en) | 2021-08-13 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN113257225B (en) | A kind of emotional speech synthesis method and system integrating vocabulary and phoneme pronunciation features | |
| CN116364055B (en) | Speech generation method, device, device and medium based on pre-trained language model | |
| CN110211563B (en) | Chinese speech synthesis method, device and storage medium for scenes and emotion | |
| CN112489629B (en) | Voice transcription model, method, medium and electronic equipment | |
| CN111667812A (en) | Voice synthesis method, device, equipment and storage medium | |
| CN113628609A (en) | Automatic audio content generation | |
| CN110390928B (en) | Method and system for training speech synthesis model of automatic expansion corpus | |
| CN110297909B (en) | Method and device for classifying unlabeled corpora | |
| CN109036371A (en) | Audio data generation method and system for speech synthesis | |
| CN114005430B (en) | Training method, device, electronic device and storage medium for speech synthesis model | |
| CN117789771A (en) | Cross-language end-to-end emotion voice synthesis method and system | |
| CN113539268A (en) | End-to-end voice-to-text rare word optimization method | |
| CN111009235A (en) | Voice recognition method based on CLDNN + CTC acoustic model | |
| CN107221344A (en) | A kind of speech emotional moving method | |
| CN118942443A (en) | Speech generation method, virtual human speech generation method and speech generation system | |
| CN111009236A (en) | Voice recognition method based on DBLSTM + CTC acoustic model | |
| CN113299272B (en) | Speech synthesis model training and speech synthesis method, equipment and storage medium | |
| CN117219046A (en) | Interactive voice emotion control method and system | |
| CN117316141A (en) | Training methods, audio generation methods, devices and equipment for prosody annotation models | |
| CN117079637A (en) | Mongolian emotion voice synthesis method based on condition generation countermeasure network | |
| CN114333900B (en) | Method for extracting BNF (BNF) characteristics end to end, network model, training method and training system | |
| CN119763546B (en) | Speech synthesis method, system, electronic device and storage medium | |
| CN116129868A (en) | Method and system for generating structured photo | |
| CN115130457A (en) | Prosody modeling method and modeling system fused with Amdo Tibetan phoneme vectors | |
| CN120599998A (en) | A voice cloning method, device and related medium based on emotion enhancement |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |
| | AV01 | Patent right actively abandoned | Granted publication date: 20211102; Effective date of abandoning: 20251012 |