
TWI840949B - Multi-speaker and multi-emotion speech synthesis system, method and computer readable medium - Google Patents

Multi-speaker and multi-emotion speech synthesis system, method and computer readable medium

Info

Publication number
TWI840949B
Authority
TW
Taiwan
Prior art keywords
emotion
embedded
speaker
vector
neural network
Prior art date
Application number
TW111134964A
Other languages
Chinese (zh)
Other versions
TW202414382A (en)
Inventor
王文俊
潘振銘
廖元甫
Original Assignee
中華電信股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中華電信股份有限公司
Priority to TW111134964A
Publication of TW202414382A
Application granted
Publication of TWI840949B

Landscapes

  • Machine Translation (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a multi-speaker and multi-emotion speech synthesis system, method and computer readable medium. A speaker coding module generates an embedded speaker vector according to a multi-speaker speech reference signal, an emotion coding module generates an embedded emotion vector according to a multi-emotion speech reference signal, and a coding unit of a speech synthesis module generates an embedded linguistic vector according to a grapheme or phoneme sequence. An integration unit of the speech synthesis module then integrates the embedded speaker vector, the embedded emotion vector and the embedded linguistic vector into an embedded control vector, and uses this embedded control vector to control a decoding unit of the speech synthesis module to synthesize a speech spectrum.

Description

Multi-speaker and multi-emotion speech synthesis system, method and computer-readable medium

The present invention relates to speech synthesis technology, and more particularly to a multi-speaker and multi-emotion speech synthesis system, method and computer-readable medium.

Speech synthesis technology is already widely used in electronic devices and human-machine interfaces, for example Apple's iPhone smartphone and voice assistant Siri, Amazon's Echo smart speaker and voice assistant Alexa, and Google's smart speaker Google Home and voice assistant Google Assistant.

Although speech synthesis technology can use a human-machine interface to respond to or prompt users of electronic devices with relevant information, conventional speech synthesis can typically only produce neutral speech and rarely considers synthesizing emotional speech, so it cannot carry on efficient interactive dialogue with users in an emotional voice.

Furthermore, existing speech synthesis technology cannot generate an embedded speaker vector, an embedded emotion vector, and an embedded context vector from a multi-speaker speech reference signal, a multi-emotion speech reference signal, and a text or phonetic-symbol sequence, respectively; it cannot produce a personalized and emotional speech spectrum and/or synthesized speech for individual needs; it cannot use a transformation function (such as a vector concatenation transformation function or the transformation function of a learnable nonlinear neural network) to integrate the embedded speaker vector, the embedded emotion vector, and the embedded context vector into an embedded control vector that controls a speech synthesis module to synthesize a personalized and emotional speech spectrum; it cannot use neural networks to improve the quality of the synthesized speech; and it cannot have a vocoder module generate personalized and emotional synthesized speech from the speech spectrum (such as a Mel spectrogram or log-Mel spectrogram) synthesized by the speech synthesis module.

Therefore, how to provide an innovative speech synthesis technology that solves any of the above problems or provides the related functions/services has become a major research topic for those skilled in the art.

The present invention provides an innovative multi-speaker and multi-emotion speech synthesis system, method, and computer-readable medium that can generate an embedded speaker vector, an embedded emotion vector, and an embedded context vector from a multi-speaker speech reference signal, a multi-emotion speech reference signal, and a text or phonetic-symbol sequence, respectively; can produce a personalized and emotional speech spectrum and/or synthesized speech for individual needs; can integrate the embedded speaker vector, the embedded emotion vector, and the embedded context vector into an embedded control vector according to a transformation function (such as a vector concatenation transformation function or the transformation function of a learnable nonlinear neural network) to control the speech synthesis module to synthesize a personalized and emotional speech spectrum; can use neural networks to improve the quality of the synthesized speech; and can have a vocoder module generate personalized and emotional synthesized speech from the speech spectrum (such as a Mel spectrogram or log-Mel spectrogram) synthesized by the speech synthesis module.

The multi-speaker and multi-emotion speech synthesis system of the present invention comprises: a speaker encoding module that generates a corresponding embedded speaker vector from at least one multi-speaker speech reference signal; an emotion encoding module that generates a corresponding embedded emotion vector from at least one multi-emotion speech reference signal; and a speech synthesis module having an encoding unit, an integration unit, and a decoding unit, wherein the encoding unit of the speech synthesis module generates a corresponding embedded context vector from a text or phonetic-symbol sequence, the integration unit of the speech synthesis module integrates the embedded speaker vector generated by the speaker encoding module from the multi-speaker speech reference signal, the embedded emotion vector generated by the emotion encoding module from the multi-emotion speech reference signal, and the embedded context vector generated by the encoding unit from the text or phonetic-symbol sequence into an embedded control vector, and the integration unit then uses this embedded control vector to control the decoding unit of the speech synthesis module to synthesize a speech spectrum.

The multi-speaker and multi-emotion speech synthesis method of the present invention comprises: generating, by a speaker encoding module, a corresponding embedded speaker vector from at least one multi-speaker speech reference signal; generating, by an emotion encoding module, a corresponding embedded emotion vector from at least one multi-emotion speech reference signal; generating, by an encoding unit of a speech synthesis module, a corresponding embedded context vector from a text or phonetic-symbol sequence; integrating, by an integration unit of the speech synthesis module, the embedded speaker vector, the embedded emotion vector, and the embedded context vector into an embedded control vector; and using, by the integration unit, this embedded control vector to control a decoding unit of the speech synthesis module to synthesize a speech spectrum.

The computer-readable medium of the present invention is used in a computing device or computer and stores instructions for executing the above multi-speaker and multi-emotion speech synthesis method.

To make the above features and advantages of the present invention more comprehensible, embodiments are described in detail below with reference to the accompanying drawings. Additional features and advantages of the present invention are partly set forth in the following description, partly apparent from that description, or may be learned by practicing the present invention. It should be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not intended to limit the scope claimed by the present invention.

1: Multi-speaker and multi-emotion speech synthesis system
10: Speaker encoding module
11: Neural network for embedded speaker vector extraction
12: Frame level
13: Input features
14: First statistics pooling layer
15: Utterance level
16: Speaker tag
17: Softmax classifier
18: Loss function
19: First database
20: Emotion encoding module
21: Neural network for embedded emotion vector extraction
22: Frame level
23: Input features
24: Second statistics pooling layer
25: Utterance level
26: Emotion tag
27: Softmax classifier
28: Loss function
29: Second database
30: Speech synthesis module
31: Encoding unit
32: Integration unit
33: Attention unit
34: Decoding unit
40: Vocoder module
A1: Multi-speaker speech reference signal
A2: Embedded speaker vector
B1: Multi-emotion speech reference signal
B2: Embedded emotion vector
C1: Text or phonetic-symbol sequence
C2: Embedded context vector
C3: Embedded control vector
C4: Speech spectrum
C5: Synthesized speech
S1 to S2: Steps

FIG. 1 is a schematic diagram of the architecture of the multi-speaker and multi-emotion speech synthesis system of the present invention.

FIG. 2 is a schematic flowchart of the multi-speaker and multi-emotion speech synthesis method of the present invention.

FIG. 3 is a schematic diagram of an embodiment of the embedded speaker vector analysis method in the multi-speaker and multi-emotion speech synthesis system and method of the present invention.

FIG. 4 is a schematic diagram of an embodiment of the embedded emotion vector analysis method in the multi-speaker and multi-emotion speech synthesis system and method of the present invention.

The following describes implementations of the present invention by way of specific embodiments. Those skilled in the art can understand other advantages and effects of the present invention from the content disclosed in this specification, and the present invention may also be implemented or applied through other specific, equivalent embodiments.

FIG. 1 is a schematic diagram of the architecture of the multi-speaker and multi-emotion speech synthesis system 1 of the present invention, FIG. 2 is a schematic flowchart of the multi-speaker and multi-emotion speech synthesis method of the present invention, FIG. 3 is a schematic diagram of an embodiment of the embedded speaker vector analysis method in the system 1 and its method, and FIG. 4 is a schematic diagram of an embodiment of the embedded emotion vector analysis method in the system 1 and its method.

As shown in FIG. 1, the multi-speaker and multi-emotion speech synthesis system 1 may include at least one speaker encoding module 10, at least one emotion encoding module 20, at least one speech synthesis module 30, and at least one vocoder module 40 that are connected to or in communication with one another. In one embodiment, the speech synthesis module 30 may be connected to or communicate with the speaker encoding module 10, the emotion encoding module 20, and the vocoder module 40, respectively. The speech synthesis module 30 may have at least one encoding unit 31, at least one integration unit 32, at least one attention unit 33, and at least one decoding unit 34; in one embodiment, the encoding unit 31 may be connected to or communicate with the integration unit 32, the attention unit 33, and the decoding unit 34 in sequence.

In one embodiment, the speaker encoding module 10 may be a speaker encoder, a speaker encoding chip, a speaker encoding circuit, speaker encoding software, a speaker encoding program, or the like; the emotion encoding module 20 may be an emotion encoder, an emotion encoding chip, an emotion encoding circuit, emotion encoding software, an emotion encoding program, or the like; the speech synthesis module 30 may be a speech synthesizer, a speech synthesis chip, a speech synthesis circuit, speech synthesis software, a speech synthesis program, or the like; and the vocoder module 40 may be a vocoder, a vocoder chip, a vocoder circuit, vocoder software, a vocoder program, or the like. The encoding unit 31 may be an encoder, an encoding chip, an encoding circuit, encoding software, an encoding program, or the like; the integration unit 32 may be integration software, an integration program, or the like; the attention unit 33 may be attention software, an attention program, or the like; and the decoding unit 34 may be a decoder, a decoding chip, a decoding circuit, decoding software, a decoding program, or the like.

In one embodiment, "connected or in communication" in the present invention means connected or communicating with each other in a wired manner (such as a wired network) or a wireless manner (such as a wireless network), "at least one" means one or more (such as one, two, or three or more), and "a plurality" means two or more (such as two, three, four, five, or ten or more). However, the present invention is not limited to the above embodiments.

The multi-speaker and multi-emotion speech synthesis system 1 and its method generate a personalized and emotional speech spectrum C4 and/or synthesized speech C5 (see FIG. 1) according to the text input by the user and the designated speaker tag 16 (see FIG. 3) and emotion tag 26 (see FIG. 4), for example:

[Zhang San]{Angry} I don't think so!

[Li Si]{Disgust} Can this thing even be used?

Furthermore, the multi-speaker and multi-emotion speech synthesis system 1 and its method can also change emotion within the same sentence, for example:

[Wang Wu]{Neutral} Judging only by appearance, {Sad} can that be accurate?

[Zhao Liu]{Neutral} Every year around this time... {Happy} the Christmas tree comes out.

In the above, [Zhang San], [Li Si], [Wang Wu], and [Zhao Liu] are speaker tags 16, and {Angry}, {Disgust}, {Neutral}, {Sad}, and {Happy} are emotion tags 26.
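For illustration only, the following sketch (not part of the patent; the exact tag syntax and the parser are assumptions) shows one way such bracketed speaker tags and braced emotion tags could be split into per-segment synthesis requests:

```python
import re

# Hypothetical parser for "[speaker]{emotion}text" input as in the examples above;
# the emotion may change in the middle of a sentence.
SEGMENT = re.compile(r"\[(?P<speaker>[^\]]+)\]|\{(?P<emotion>[^}]+)\}|(?P<text>[^\[{]+)")

def parse_tagged_line(line):
    """Split one tagged input line into (speaker, emotion, text) segments."""
    speaker, emotion, segments = None, None, []
    for m in SEGMENT.finditer(line):
        if m.group("speaker"):
            speaker = m.group("speaker")
        elif m.group("emotion"):
            emotion = m.group("emotion")
        else:
            text = m.group("text").strip()
            if text:
                segments.append((speaker, emotion, text))
    return segments

print(parse_tagged_line("[Wang Wu]{Neutral}Judging only by appearance, {Sad}can that be accurate?"))
# [('Wang Wu', 'Neutral', 'Judging only by appearance,'), ('Wang Wu', 'Sad', 'can that be accurate?')]
```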

A feature of the multi-speaker and multi-emotion speech synthesis system 1 and its method is to build a speaker encoding module 10 that produces a plurality of (different) embedded speaker vectors A2 representing the voice characteristics of a plurality of (different) speakers, and to build an emotion encoding module 20 that produces a plurality of (different) embedded emotion vectors B2 representing a plurality of (different) emotion types.

For example, the technical features of the present invention may include having the speaker encoding module 10 build a distribution model of embedded speaker vectors A2 to represent speakers' voice characteristics, and having the emotion encoding module 20 build a distribution model of embedded emotion vectors B2 to represent a plurality of (different) emotion types, thereby building a multi-speaker and multi-emotion speech synthesis system 1 guided by both the embedded speaker vector A2 and the embedded emotion vector B2. In other words, the multi-speaker and multi-emotion speech synthesis system 1 can build a speaker encoding module 10 that produces a plurality of (different) embedded speaker vectors A2 and an emotion encoding module 20 that produces a plurality of (different) embedded emotion vectors B2. Accordingly, the multi-speaker and multi-emotion speech synthesis system 1 built by the present invention can complete speech synthesis processing according to the text input by the user once the speaker and emotion conditions have been designated.

As shown in FIG. 1, the speech synthesis module 30 may use end-to-end speech synthesis technology (such as Tacotron 2), an encoder-decoder architecture (such as an architecture combining the encoding unit 31 with the decoding unit 34), and/or the attention mechanism of the attention unit 33, so that the decoding unit of the speech synthesis module 30 automatically synthesizes a personalized and emotional speech spectrum C4.

The operating principle of the speech synthesis module 30 can be briefly described as follows: the encoding unit 31 of the speech synthesis module 30 first performs text analysis on the input grapheme or phoneme sequence C1 to automatically generate the embedded context vector (embedding linguistic vector) C2 corresponding to the grapheme or phoneme sequence C1; the decoding unit 34 of the speech synthesis module 30 then, based on the embedded context vector C2 (such as context feature parameters) and through the weighting mechanism of the attention unit 33, automatically aligns the input grapheme or phoneme sequence C1 with the speech spectrum C4 output by the decoding unit 34, where the speech spectrum C4 may be, for example, a Mel spectrogram or a log-Mel spectrogram.
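As a concrete illustration of the spectrum the decoding unit is described as producing, the sketch below (an assumption for illustration, not the patent's implementation) computes a log-Mel spectrogram with librosa using typical TTS settings:

```python
import librosa
import numpy as np

def log_mel_spectrogram(wav_path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    """Compute a log-Mel spectrogram of shape (n_mels, frames) from a waveform file."""
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)
```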

As shown in FIG. 1 and FIG. 2, the multi-speaker and multi-emotion speech synthesis method mainly includes the following steps.

In step S1, the speaker encoding module 10 generates the corresponding embedded speaker vector A2 from at least one multi-speaker speech reference signal A1, the emotion encoding module 20 generates the corresponding embedded emotion vector B2 from at least one multi-emotion speech reference signal B1, and the encoding unit 31 of the speech synthesis module 30 generates the corresponding embedded context vector C2 from the text or phonetic-symbol sequence C1.

Then, in step S2, the integration unit 32 of the speech synthesis module 30 integrates the embedded speaker vector A2 generated by the speaker encoding module 10 from the multi-speaker speech reference signal A1, the embedded emotion vector B2 generated by the emotion encoding module 20 from the multi-emotion speech reference signal B1, and the embedded context vector C2 generated by the encoding unit 31 from the text or phonetic-symbol sequence C1 into an embedded control vector C3, and the integration unit 32 then uses the embedded control vector C3, formed by integrating the embedded speaker vector A2, the embedded emotion vector B2, and the embedded context vector C2, to control the decoding unit 34 of the speech synthesis module 30 to synthesize the speech spectrum C4.

The multi-speaker and multi-emotion speech synthesis system 1 and its method can use neural networks to improve the quality of the synthesized speech C5, and can continuously develop, pursue, or exhibit personalized and emotional (diversified) speech synthesis functions. To achieve personalized and emotional speech synthesis, the system 1 and its method can use a multi-speaker and multi-emotion speech synthesis training corpus to build the speaker encoding module 10 and the emotion encoding module 20, respectively, so that the subsequent speech synthesis module 30 and vocoder module 40 can automatically synthesize the personalized and emotional speech spectrum C4 and synthesized speech C5, respectively.

In detail, the technical content of the multi-speaker and multi-emotion speech synthesis system 1 and its method can be described in the following three parts: [1] the first part includes the procedure, shown in FIG. 1, for controlling the multi-speaker and multi-emotion speech synthesis system 1 with the embedded speaker vector A2 and the embedded emotion vector B2; [2] the second part includes [2-1] the embedded speaker vector analysis method shown in FIG. 3 and [2-2] the embedded emotion vector analysis method shown in FIG. 4; and [3] the third part includes the method by which the integration unit 32 of the speech synthesis module 30 shown in FIG. 1 integrates the embedded speaker vector A2, the embedded emotion vector B2, and the embedded context vector C2, as well as the training method of the related neural networks (models).

[1] Procedure (processing mechanism) for controlling the multi-speaker and multi-emotion speech synthesis system 1 with the embedded speaker vector A2 and the embedded emotion vector B2. As shown in FIG. 1, the speaker encoding module 10, the emotion encoding module 20, the speech synthesis module 30, and the vocoder module 40 must be built by training on a multi-speaker and multi-emotion speech synthesis training corpus containing text, speaker tags 16, and emotion tags 26, and the procedure for controlling the system 1 with the embedded speaker vector A2 and the embedded emotion vector B2 may include the following procedures P11 to P16.

Procedure P11: the speaker encoding module 10 generates/derives the corresponding embedded speaker vector A2 from at least one input multi-speaker speech reference signal A1 (such as a multi-speaker speech reference waveform).

Procedure P12: the emotion encoding module 20 generates/derives the corresponding embedded emotion vector B2 from at least one input multi-emotion speech reference signal B1 (such as a multi-emotion speech reference waveform).

Procedure P13: the encoding unit 31 of the speech synthesis module 30 generates/derives the corresponding embedded context vector C2 from the input text or phonetic-symbol sequence C1.

Procedure P14: the integration unit 32 of the speech synthesis module 30 integrates the embedded speaker vector A2 generated by the speaker encoding module 10, the embedded emotion vector B2 generated by the emotion encoding module 20, and the embedded context vector C2 generated by the speech synthesis module 30 to obtain the embedded control vector C3.

Procedure P15: the integration unit 32 of the speech synthesis module 30 uses the embedded control vector C3, formed by integrating the embedded speaker vector A2, the embedded emotion vector B2, and the embedded context vector C2, to automatically control the attention unit 33 and the decoding unit 34 of the speech synthesis module 30 to synthesize a personalized and emotional speech spectrum C4 (such as a Mel spectrogram or log-Mel spectrogram).

Procedure P16: the vocoder module 40 uses the attention mechanism provided by the attention unit 33 of the speech synthesis module 30 and the speech spectrum C4 (such as a Mel spectrogram or log-Mel spectrogram) synthesized by the decoding unit 34 to automatically generate personalized and emotional synthesized speech C5 (such as a synthesized speech signal). A sketch of this overall flow is given below.
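The following sketch strings procedures P11 through P16 together; the module interfaces (embed, encoder, integrate, decode) are hypothetical names used only for illustration, not the patent's actual APIs:

```python
# Hypothetical end-to-end inference flow for procedures P11-P16.
def synthesize(text, speaker_ref_wav, emotion_ref_wav,
               speaker_encoder, emotion_encoder, tts, vocoder):
    spk_vec = speaker_encoder.embed(speaker_ref_wav)      # P11: embedded speaker vector A2
    emo_vec = emotion_encoder.embed(emotion_ref_wav)      # P12: embedded emotion vector B2
    ling_vec = tts.encoder(text)                          # P13: embedded context vector C2
    ctrl_vec = tts.integrate(spk_vec, emo_vec, ling_vec)  # P14: embedded control vector C3
    mel = tts.decode(ctrl_vec)                            # P15: personalized, emotional spectrum C4
    return vocoder(mel)                                   # P16: synthesized speech C5
```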

[2-1] Embedded speaker vector analysis method (processing procedure): as shown in FIG. 1 and FIG. 3, the processing procedure of the embedded speaker vector analysis method may include the following procedures P21 to P24.

Procedure P21: the speaker encoding module 10 uses a training corpus with speaker tags 16 to train a speaker recognition neural network having a first statistics pooling layer 14. The speaker recognition neural network is divided into two parts, a frame level 12 and an utterance level 15, so that the first statistics pooling layer 14 converts the frame-level 12 information of the network into utterance-level 15 information. The speaker encoding module 10 may choose to use Mel-frequency cepstral coefficients (MFCC) as the input features 13 of the speaker recognition neural network, and may also define the loss function 18 of the speaker recognition neural network based on a softmax classifier 17.

Procedure P22: after the output layer of the trained speaker recognition neural network is removed, the speaker encoding module 10 may use the output of the first statistics pooling layer 14 of the speaker recognition neural network as the embedded speaker vector A2 of each input utterance, so that the trained speaker recognition neural network becomes the neural network 11 for embedded speaker vector extraction. A sketch of such a network is shown below.
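A minimal PyTorch sketch of such a speaker recognition network with a statistics pooling layer follows; the layer sizes, the simple linear frame-level layers, and other details are assumptions made for illustration, not the patent's design:

```python
import torch
import torch.nn as nn

class StatisticsPooling(nn.Module):
    """Convert frame-level features [batch, frames, dim] into one utterance-level vector."""
    def forward(self, frame_feats):
        mean = frame_feats.mean(dim=1)
        std = frame_feats.std(dim=1)
        return torch.cat([mean, std], dim=1)

class SpeakerNet(nn.Module):
    def __init__(self, n_mfcc=40, hidden=512, n_speakers=100):
        super().__init__()
        self.frame_layers = nn.Sequential(                      # frame level (12)
            nn.Linear(n_mfcc, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.pool = StatisticsPooling()                          # first statistics pooling layer (14)
        self.output_layer = nn.Linear(2 * hidden, n_speakers)    # utterance level (15); removed after training

    def forward(self, mfcc):                                     # mfcc: [batch, frames, n_mfcc]
        emb = self.pool(self.frame_layers(mfcc))                 # used as the embedded speaker vector (A2)
        return self.output_layer(emb), emb                       # logits for the softmax-based loss, embedding
```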

Procedure P23: the speaker encoding module 10 uses the neural network 11 for embedded speaker vector extraction to extract the embedded speaker vector A2 of each input utterance.

Procedure P24: the speaker encoding module 10 builds a first database 19 according to the speaker tag 16 of each input utterance and the corresponding embedded speaker vector A2 provided by the neural network 11 for embedded speaker vector extraction. For example, the first database 19 may store a plurality of (different) speaker tags 16 and the corresponding plurality of (different) embedded speaker vectors A2; the speaker tags 16 may include a first speaker tag, a second speaker tag, up to an M-th speaker tag, and the embedded speaker vectors A2 may include a first embedded speaker vector, a second embedded speaker vector, up to an M-th embedded speaker vector, where M is a positive integer greater than 2 (such as 3, 4, 5, or more).
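One way the first database could be populated is sketched below; averaging the per-utterance vectors into a single representative vector per speaker tag, and the extractor interface, are assumptions made for illustration:

```python
import numpy as np

def build_speaker_db(utterances, extractor):
    """utterances: iterable of (speaker_tag, wav_path); extractor.embed returns an embedded speaker vector."""
    per_speaker = {}
    for tag, wav in utterances:
        per_speaker.setdefault(tag, []).append(extractor.embed(wav))
    # one representative embedded speaker vector per speaker tag
    return {tag: np.mean(vectors, axis=0) for tag, vectors in per_speaker.items()}
```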

[2-2] Embedded emotion vector analysis method (processing procedure): as shown in FIG. 1 and FIG. 4, the processing procedure of the embedded emotion vector analysis method may include the following procedures P31 to P34.

Procedure P31: the emotion encoding module 20 uses a training corpus with emotion tags 26 to train an utterance emotion recognition neural network having a second statistics pooling layer 24. The utterance emotion recognition neural network is divided into two parts, a frame level 22 and an utterance level 25, so that the second statistics pooling layer 24 converts the frame-level information of the network into utterance-level information. The emotion encoding module 20 may choose to use Mel-frequency cepstral coefficients (MFCC) together with features that reflect emotional changes, such as pitch and energy (volume), as the input features 23 of the utterance emotion recognition neural network, and may also define the loss function 28 of the utterance emotion recognition neural network based on a softmax classifier 27.
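A sketch of extracting these input features with librosa is shown below; the sampling rate, frame settings, and the choice of the pYIN pitch estimator and RMS energy are assumptions about one reasonable realization:

```python
import librosa
import numpy as np

def emotion_features(wav_path, sr=16000, hop=160, n_mfcc=40):
    """Return frame-level features [frames, n_mfcc + 2]: MFCC plus pitch and energy."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop)
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz('C2'),
                            fmax=librosa.note_to_hz('C7'),
                            sr=sr, hop_length=hop)            # pitch contour
    energy = librosa.feature.rms(y=y, hop_length=hop)         # volume (RMS energy)
    n = min(mfcc.shape[1], len(f0), energy.shape[1])
    f0 = np.nan_to_num(f0[:n])                                # unvoiced frames -> 0
    return np.vstack([mfcc[:, :n], f0, energy[:, :n]]).T
```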

Procedure P32: after the output layer of the trained utterance emotion recognition neural network is removed, the emotion encoding module 20 may use the output of the second statistics pooling layer 24 of the network as the embedded emotion vector B2 of each input utterance, so that the trained utterance emotion recognition neural network becomes the neural network 21 for embedded emotion vector extraction.

Procedure P33: the emotion encoding module 20 uses the neural network 21 for embedded emotion vector extraction to extract the embedded emotion vector B2 of each input utterance.

Procedure P34: the emotion encoding module 20 builds a second database 29 according to the emotion tag 26 of each input utterance and the corresponding embedded emotion vector B2 provided by the neural network 21 for embedded emotion vector extraction. For example, the second database 29 may store a plurality of (different) emotion tags 26 and the corresponding plurality of (different) embedded emotion vectors B2; the emotion tags 26 may include a first emotion tag, a second emotion tag, up to an N-th emotion tag, and the embedded emotion vectors B2 may include a first embedded emotion vector, a second embedded emotion vector, up to an N-th embedded emotion vector, where N is a positive integer greater than 2 (such as 3, 4, 5, or more).

[3] Method by which the integration unit 32 of the speech synthesis module 30 integrates the embedded speaker vector A2, the embedded emotion vector B2, and the embedded context vector C2.

As shown in FIG. 1 and in formulas (1) and (2) below, let F(.) denote the integration unit 32 of the speech synthesis module 30. The integration unit 32 (F(.)) can define a corresponding transformation function according to the chosen approach (such as a vector concatenation transformation function or the transformation function of a learnable nonlinear neural network), so that the integration unit 32 (F(.)) automatically integrates the embedded speaker vector A2, the embedded emotion vector B2, and the embedded context vector C2 into one embedded control vector C3 (such as a fixed-length embedded control vector C3) according to the selected transformation function. That is, the transformation function of the integration unit 32 (F(.)) may be chosen as the vector concatenation transformation function of formula (1) or the learnable nonlinear neural-network transformation function of formula (2):

C3 = F(e_s, e_e, e_l) = Concat(ω_s·e_s, ω_e·e_e, ω_l·e_l)   (1)

C3 = F(e_s, e_e, e_l) = NN(ω_s·e_s, ω_e·e_e, ω_l·e_l)   (2)

In formulas (1) and (2), F(.) denotes the integration unit 32; e_s, e_e, and e_l denote the embedded speaker vector A2, the embedded emotion vector B2, and the embedded context vector C2, respectively; ω_s, ω_e, and ω_l denote the adjustable weights of the embedded speaker vector A2, the embedded emotion vector B2, and the embedded context vector C2, respectively; Concat(.) denotes vector concatenation; and NN(.) denotes a learnable nonlinear neural network.

When the integration unit 32 adopts a learnable nonlinear neural network, that learnable nonlinear neural network can be further integrated with the attention unit 33 and the decoding unit 34 (such as a decoding network) of the speech synthesis module 30 to obtain an integrated neural network, and this integrated neural network can be trained using the embedded control vector C3 formed by integrating the embedded speaker vector A2, the embedded emotion vector B2, and the embedded context vector C2.
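The two variants of F(.) in formulas (1) and (2) can be sketched as follows; this is an illustrative sketch only, and the dimensions and the two-layer network used for the learnable variant are assumptions rather than the patent's specific design:

```python
import torch
import torch.nn as nn

def concat_integration(spk, emo, ling, w_s=1.0, w_e=1.0, w_l=1.0):
    """Formula (1): weighted vector concatenation into one embedded control vector."""
    return torch.cat([w_s * spk, w_e * emo, w_l * ling], dim=-1)

class LearnableIntegration(nn.Module):
    """Formula (2): a learnable nonlinear network, trainable jointly with the attention and decoding units."""
    def __init__(self, spk_dim, emo_dim, ling_dim, ctrl_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(spk_dim + emo_dim + ling_dim, ctrl_dim), nn.Tanh(),
            nn.Linear(ctrl_dim, ctrl_dim))

    def forward(self, spk, emo, ling):
        return self.net(torch.cat([spk, emo, ling], dim=-1))
```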

Then, the decoding unit 34 of the speech synthesis module 30 can input the generated speech spectrum C4 (such as a Mel spectrogram or log-Mel spectrogram) to the vocoder module 40, so that the vocoder module 40 synthesizes personalized (e.g., speaker characteristics) and emotional (e.g., speaker emotion) synthesized speech C5 from the speech spectrum C4. That is, the vocoder module 40 may select one of a plurality of neural networks, such as WaveNet, WaveGlow, WaveGAN, and HiFi-GAN, to synthesize the speech spectrum C4 into the personalized and emotional synthesized speech C5, according to considerations such as the acceptable quality of the synthesized speech C5 and/or the system execution speed. Note that WaveNet, WaveGlow, WaveGAN, and HiFi-GAN are usually referred to by their English names and are rarely expressed in Chinese.
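The neural vocoders named above are typically loaded as pretrained models; as a dependency-light stand-in for illustration only (an assumption, not one of the vocoders the patent names), the following sketch inverts a Mel spectrogram with librosa's Griffin-Lim based inversion, which is noticeably lower in quality:

```python
import librosa

def mel_to_waveform(log_mel_db, sr=22050, n_fft=1024, hop_length=256):
    """Invert a log-Mel spectrogram (in dB) back to a waveform with Griffin-Lim."""
    mel_power = librosa.db_to_power(log_mel_db)
    return librosa.feature.inverse.mel_to_audio(
        mel_power, sr=sr, n_fft=n_fft, hop_length=hop_length)
```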

Once built, the multi-speaker and multi-emotion speech synthesis system 1 and its method can generate personalized and emotional synthesized speech C5 according to the text input by the user and the designated speaker tag 16 and emotion tag 26; they can also look up the embedded speaker vector A2 corresponding to the speaker tag 16 in the first database 19 shown in FIG. 3 and the embedded emotion vector B2 corresponding to the emotion tag 26 in the second database 29 shown in FIG. 4, and then input the embedded speaker vector A2 and the embedded emotion vector B2 into the speech synthesis module 30 for speech synthesis processing.

For example, the related processing procedures of the multi-speaker and multi-emotion speech synthesis system 1 and its method may include the following.

[1] The user first collects, through the multi-speaker and multi-emotion speech synthesis system 1, a multi-speaker and multi-emotion speech synthesis training corpus with text, speaker tags 16, and emotion tags 26. This may be done by (1) recruiting speakers and designing the recording scripts and recording procedure, or (2) using publicly available corpora. For (1), the recruited speakers can record the corpus with a plurality of (different) emotions according to different needs. For (2), corpora such as television series or podcasts may be used.

Taking English as an example, the Multimodal Emotion Lines Dataset (MELD; https://affective-meld.github.io/) and the Multimodal Signal Processing-Podcast corpus (MSP-Podcast; https://ecs.utdallas.edu/research/researchlabs/msp-lab/MSP-Podcast.html) are both annotated multi-speaker and multi-emotion speech synthesis training corpora.

[2] The speaker tags 16 in the above multi-speaker and multi-emotion speech synthesis training corpus are used to train the speaker encoding module 10 shown in FIG. 1, i.e., the neural network 11 for embedded speaker vector extraction shown in FIG. 3, so that the speaker encoding module 10 uses the neural network 11 to extract the embedded speaker vector A2 of each input utterance. The neural network 11 for embedded speaker vector extraction is trained in the x-vector fashion, and the speaker encoding module 10 may choose to use Mel-frequency cepstral coefficients (MFCC) as its input features 13. A minimal training-loop sketch follows.
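A minimal sketch of that classification training, assuming a data loader that yields batches of MFCC features and integer speaker IDs (both assumptions made for illustration):

```python
import torch
import torch.nn as nn

def train_speaker_net(model, loader, epochs=10, lr=1e-3):
    """Train the speaker network with a softmax/cross-entropy loss (x-vector style)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()                  # softmax classifier + loss function
    for _ in range(epochs):
        for mfcc, speaker_id in loader:              # mfcc: [batch, frames, n_mfcc]
            logits, _ = model(mfcc)
            loss = loss_fn(logits, speaker_id)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```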

[3] The emotion tags 26 in the above multi-speaker and multi-emotion speech synthesis training corpus are used to train the emotion encoding module 20 shown in FIG. 1, i.e., the neural network 21 for embedded emotion vector extraction shown in FIG. 4, so that the emotion encoding module 20 uses the neural network 21 to extract the embedded emotion vector B2 of each input utterance. The neural network 21 for embedded emotion vector extraction is trained in the x-vector fashion, and the emotion encoding module 20 may choose to use Mel-frequency cepstral coefficients (MFCC) together with features that reflect emotional changes, such as pitch and energy (volume), as its input features 23.

[4] All the text, speaker tags 16, and emotion tags 26 of the above multi-speaker and multi-emotion speech synthesis training corpus, together with the trained speaker encoding module 10 (e.g., the neural network 11 for embedded speaker vector extraction) and emotion encoding module 20 (e.g., the neural network 21 for embedded emotion vector extraction), are used to train the speech synthesis module 30, including its front-end encoding unit 31 and integration unit 32 (F(.)), its intermediate attention unit 33, and its back-end decoding unit 34, so that the speech synthesis module 30 produces the corresponding speech spectrum C4 (such as a Mel spectrogram or log-Mel spectrogram).

The architecture of the encoding unit 31, attention unit 33, and decoding unit 34 of the speech synthesis module 30 may use end-to-end speech synthesis technology (such as Tacotron 2), and the integration unit 32 (F(.)) may define a corresponding transformation function according to the chosen approach (such as a vector concatenation transformation function or the transformation function of a learnable nonlinear neural network), so that the integration unit 32 (F(.)) automatically integrates the embedded speaker vector A2, the embedded emotion vector B2, and the embedded context vector C2 into one embedded control vector C3 (such as a fixed-length embedded control vector C3) according to the selected transformation function. The transformation function of the integration unit 32 (F(.)) may be chosen as the vector concatenation transformation function of formula (1) above or the learnable nonlinear neural-network transformation function of formula (2) above.

When the integration unit 32 adopts a learnable nonlinear neural network, that learnable nonlinear neural network can be further integrated with the attention unit 33 and the decoding unit 34 (such as a decoding network) of the speech synthesis module 30 to obtain an integrated neural network, and this integrated neural network can be trained using the embedded control vector C3 formed by integrating the embedded speaker vector A2, the embedded emotion vector B2, and the embedded context vector C2.

Then, the decoding unit 34 of the speech synthesis module 30 can input the generated speech spectrum C4 (such as a Mel spectrogram or log-Mel spectrogram) to the vocoder module 40, so that the vocoder module 40 synthesizes personalized (e.g., speaker characteristics) and emotional (e.g., speaker emotion) synthesized speech C5 from the speech spectrum C4. That is, the vocoder module 40 may select one of a plurality of neural networks, such as WaveNet, WaveGlow, WaveGAN, and HiFi-GAN, to synthesize the speech spectrum C4 into the personalized and emotional synthesized speech C5, according to considerations such as the acceptable quality of the synthesized speech C5 and/or the system execution speed.

[5] Once built, the multi-speaker and multi-emotion speech synthesis system 1 and its method can generate personalized and emotional synthesized speech C5 according to the text input by the user and the designated speaker tag 16 and emotion tag 26; they can also look up the embedded speaker vector A2 corresponding to the speaker tag 16 in the first database 19 shown in FIG. 3 and the embedded emotion vector B2 corresponding to the emotion tag 26 in the second database 29 shown in FIG. 4, and then input the embedded speaker vector A2 and the embedded emotion vector B2 into the speech synthesis module 30 for speech synthesis processing.

In addition, the present invention also provides a computer-readable medium for the multi-speaker and multi-emotion speech synthesis method, which is used in a computing device or computer having a processor and/or memory. The computer-readable medium stores instructions, and the computing device or computer can execute the computer-readable medium through the processor and/or memory to perform the above-described operations. For example, the processor may be a microprocessor, a central processing unit (CPU), a graphics processing unit (GPU), or the like, and the memory may be random-access memory (RAM), a memory card, a hard disk (such as a cloud/network drive), a database, or the like, but is not limited thereto.

In summary, the multi-speaker and multi-emotion speech synthesis system, method, and computer-readable medium of the present invention have at least the following features, advantages, or technical effects.

1. The speaker encoding module of the present invention can automatically generate/derive an embedded speaker vector representing a speaker's voice characteristics from multi-speaker speech reference signals, the emotion encoding module can automatically generate/derive an embedded emotion vector representing the emotion type from multi-emotion speech reference signals, and the encoding unit of the speech synthesis module can automatically generate/derive an embedded context vector from a text or phonetic-symbol sequence.

2. The speech synthesis module or integration unit of the present invention can integrate the embedded speaker vector, the embedded emotion vector, and the embedded context vector into an embedded control vector, and use the embedded control vector to automatically control the speech synthesis module (such as the attention unit and the decoding unit) to synthesize a personalized and emotional speech spectrum.

3. The present invention can produce a personalized and emotional speech spectrum and/or synthesized speech for individual needs, and can make the synthesized speech exhibit the voice characteristics of a plurality of (different) speakers according to different needs.

4. The present invention can build the speaker encoding module, the emotion encoding module, the speech synthesis module, and the vocoder module, can use neural networks to improve the quality of the synthesized speech, and can continuously develop, pursue, or exhibit personalized and emotional (diversified) speech synthesis functions.

5. The present invention can use a multi-speaker and multi-emotion speech synthesis training corpus to build the speaker encoding module and the emotion encoding module, respectively, so that the subsequent speech synthesis module and vocoder module automatically synthesize a personalized and emotional speech spectrum and/or synthesized speech.

6. The speech synthesis module of the present invention can use end-to-end speech synthesis technology (such as Tacotron 2), an encoder-decoder architecture (such as an architecture combining the encoding unit with the decoding unit), and/or the attention mechanism of the attention unit, so that the speech synthesis module automatically synthesizes a personalized and emotional speech spectrum.

7. The speech synthesis module of the present invention can perform text analysis on the text or phonetic-symbol sequence to automatically generate/derive an embedded context vector, and then use the embedded context vector and the weighting mechanism of the attention unit to automatically align the input text or phonetic-symbol sequence with the output speech spectrum.

8. The speech synthesis module or integration unit of the present invention can define a corresponding transformation function according to the chosen approach (such as a vector concatenation transformation function or the transformation function of a learnable nonlinear neural network), so as to automatically integrate the embedded speaker vector, the embedded emotion vector, and the embedded context vector into one embedded control vector according to the selected transformation function.

9. The speaker encoding module of the present invention can use a training corpus with speaker tags to train a speaker recognition neural network (a neural network for embedded speaker vector extraction) with a statistics pooling layer, so that the statistics pooling layer automatically converts the frame-level information of the speaker recognition neural network into utterance-level information.

10. The emotion encoding module of the present invention can use a training corpus with emotion tags to train an utterance emotion recognition neural network (a neural network for embedded emotion vector extraction) with a statistics pooling layer, so that the statistics pooling layer automatically converts the frame-level information of the utterance emotion recognition neural network into utterance-level information.

十一、本發明之聲碼模組能利用語音合成模組之注意力單元與解碼單元所合成之語音頻譜(如梅爾頻譜或對數梅爾頻譜),以利自動產生 個人化與情緒化之合成語音。 11. The voice coding module of the present invention can utilize the speech spectrum (such as Mel spectrum or logarithmic Mel spectrum) synthesized by the attention unit and decoding unit of the speech synthesis module to automatically generate personalized and emotional synthesized speech.
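As a concrete illustration of the alignment described in items 6 and 7, the following is a minimal sketch of content-based attention. It is not the patent's exact mechanism (Tacotron2-style systems typically use location-sensitive attention); the dot-product scoring and the dimensions are assumptions made only for this example.

```python
# Minimal content-based attention sketch (NumPy). It shows how attention
# weights align encoder steps (text/phoneme positions) with one decoder
# step (one output spectrum frame).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(decoder_state, encoder_outputs):
    """decoder_state: (d,); encoder_outputs: (T_in, d).
    Returns the context vector fed to the decoder and the alignment
    weights over the input sequence."""
    scores = encoder_outputs @ decoder_state   # (T_in,) similarity scores
    weights = softmax(scores)                  # alignment for this frame
    context = weights @ encoder_outputs        # (d,) weighted sum
    return context, weights

# Toy usage: 12 input symbols, 64-dim encoder states, one decoder step.
rng = np.random.default_rng(0)
enc = rng.normal(size=(12, 64))
dec = rng.normal(size=(64,))
ctx, align = attend(dec, enc)
print(ctx.shape, align.shape, align.sum())  # (64,) (12,) ~1.0
```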
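The two integration choices named in item 8 can be sketched as follows. The embedding sizes (D_SPK, D_EMO, D_CTX, D_CTRL) are illustrative assumptions, not values from the patent: either plain vector concatenation, or a small learnable nonlinear network that maps the concatenated embeddings to the embedded control vector.

```python
# Sketch of the two integration options for building the embedded control
# vector from the speaker, emotion and context embeddings (PyTorch).
import torch
import torch.nn as nn

D_SPK, D_EMO, D_CTX, D_CTRL = 128, 64, 256, 256  # assumed dimensions

def integrate_by_concat(spk, emo, ctx):
    """Vector-concatenation transformation: control = [speaker ; emotion ; context]."""
    return torch.cat([spk, emo, ctx], dim=-1)

class LearnableIntegration(nn.Module):
    """Learnable nonlinear transformation: a small MLP maps the
    concatenated embeddings to a fixed-size control vector."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(D_SPK + D_EMO + D_CTX, D_CTRL),
            nn.Tanh(),
            nn.Linear(D_CTRL, D_CTRL),
        )

    def forward(self, spk, emo, ctx):
        return self.net(torch.cat([spk, emo, ctx], dim=-1))

# Toy usage with a batch of one utterance.
spk = torch.randn(1, D_SPK)   # embedded speaker vector
emo = torch.randn(1, D_EMO)   # embedded emotion vector
ctx = torch.randn(1, D_CTX)   # embedded context vector (one encoder step)
ctrl_a = integrate_by_concat(spk, emo, ctx)     # shape (1, 448)
ctrl_b = LearnableIntegration()(spk, emo, ctx)  # shape (1, 256)
print(ctrl_a.shape, ctrl_b.shape)
```

A system following items 6 to 8 would feed the resulting control vector to the attention and decoding stage; the sketch only shows the integration step itself.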
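Items 9 and 10 both rely on a statistical pooling layer to turn frame-level activations into a single sentence-level vector. Below is a minimal sketch assuming mean-and-standard-deviation pooling, a common x-vector-style choice; the patent does not spell out which statistics are pooled.

```python
# Statistical pooling sketch: frame-level features are reduced to one
# sentence-level vector by concatenating the mean and standard deviation
# over time (PyTorch). Shapes are illustrative assumptions.
import torch

def statistical_pooling(frame_feats, eps=1e-5):
    """frame_feats: (batch, T_frames, d) frame-level hidden activations.
    Returns (batch, 2*d) sentence-level statistics [mean ; std]."""
    mean = frame_feats.mean(dim=1)
    std = frame_feats.var(dim=1, unbiased=False).clamp(min=eps).sqrt()
    return torch.cat([mean, std], dim=-1)

# Toy usage: one utterance, 300 frames, 256-dim frame-level features.
x = torch.randn(1, 300, 256)
utt = statistical_pooling(x)
print(utt.shape)  # torch.Size([1, 512]) -- fed to the sentence-level layers
```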
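Item 11's vocoder module is described with neural vocoders (WaveNet, WaveGlow, WaveGAN, HiFi-GAN). As a rough stand-in that shows the same spectrum-to-waveform interface without a trained model, the sketch below inverts a Mel spectrogram with librosa's Griffin-Lim-based mel_to_audio; audio quality is far below a neural vocoder, and all parameters (sample rate, FFT size, hop length, mel bands) are assumptions for the example.

```python
# Spectrum-to-waveform stand-in: invert a Mel spectrogram to audio with
# Griffin-Lim (librosa), standing in for the neural vocoders named above.
import librosa
import soundfile as sf

sr, n_fft, hop = 22050, 1024, 256

# Pretend "mel" came from the decoding unit; here we just analyze a test tone.
wave_in = librosa.tone(440.0, sr=sr, duration=1.0)
mel = librosa.feature.melspectrogram(y=wave_in, sr=sr, n_fft=n_fft,
                                     hop_length=hop, n_mels=80)

# Invert the Mel spectrogram back to a waveform (Griffin-Lim under the hood).
wave_out = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=n_fft,
                                                hop_length=hop, n_iter=32)
sf.write("synth_demo.wav", wave_out, sr)
print(wave_out.shape)
```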

The above embodiments merely illustrate the principles, features and effects of the present invention and are not intended to limit its practicable scope; anyone skilled in the art may modify and vary the above embodiments without departing from the spirit and scope of the present invention. Any equivalent changes and modifications accomplished using the disclosure of the present invention shall still be covered by the appended claims. Accordingly, the scope of protection of the present invention shall be as set forth in the appended claims.

1: Multi-speaker and multi-emotion speech synthesis system

10: Speaker encoding module

20: Emotion encoding module

30: Speech synthesis module

31: Encoding unit

32: Integration unit

33: Attention unit

34: Decoding unit

40: Vocoder module

A1: Multi-speaker speech reference signal

A2: Embedded speaker vector

B1: Multi-emotion speech reference signal

B2: Embedded emotion vector

C1: Text or phonetic symbol sequence

C2: Embedded context vector

C3: Embedded control vector

C4: Speech spectrum

C5: Synthesized speech

Claims (19)

1. A multi-speaker and multi-emotion speech synthesis system, comprising: a speaker encoding module, which uses Mel-frequency cepstral coefficients as input features of an embedded speaker vector extraction neural network, so that the speaker encoding module generates a corresponding embedded speaker vector according to at least one input speech reference signal of multiple speakers; an emotion encoding module, which uses the Mel-frequency cepstral coefficients together with pitch and volume features that reflect emotional variation as input features of an embedded emotion vector extraction neural network, so that the emotion encoding module generates a corresponding embedded emotion vector according to at least one input multi-emotion speech reference signal; and a speech synthesis module having an encoding unit, an integration unit and a decoding unit, wherein the encoding unit of the speech synthesis module performs text analysis on an input text or phonetic symbol sequence to generate an embedded context vector corresponding to the text or phonetic symbol sequence, the integration unit of the speech synthesis module integrates, through a conversion function, the embedded speaker vector generated by the speaker encoding module from the multi-speaker speech reference signal, the embedded emotion vector generated by the emotion encoding module from the multi-emotion speech reference signal, and the embedded context vector generated by the encoding unit from the text or phonetic symbol sequence into an embedded control vector, and the integration unit then uses the embedded control vector, which integrates the embedded speaker vector, the embedded emotion vector and the embedded context vector, to control the decoding unit of the speech synthesis module to synthesize a speech spectrum using end-to-end speech synthesis technology, an encoder-decoder architecture or an attention mechanism.

2. The speech synthesis system of claim 1, wherein the speaker encoding module further uses speaker-labeled training corpora to train a speaker identification neural network having a statistical pooling layer, the speaker identification neural network being divided into a frame level and a sentence level, so that the statistical pooling layer of the speaker identification neural network converts frame-level information of the speaker identification neural network into sentence-level information.

3. The speech synthesis system of claim 1, wherein the speaker encoding module further uses the Mel-frequency cepstral coefficients as input features of a speaker identification neural network and defines a classifier-based loss function for the speaker identification neural network, so that after the output layer of the speaker identification neural network is removed, the speaker encoding module takes the output of the statistical pooling layer of the speaker identification neural network as the embedded speaker vector of each input sentence, whereby the speaker identification neural network becomes the embedded speaker vector extraction neural network.

4. The speech synthesis system of claim 1, wherein the speaker encoding module further uses the embedded speaker vector extraction neural network to extract the embedded speaker vector of each input sentence, so that the speaker encoding module builds a database from the speaker label of each input sentence and the corresponding embedded speaker vector provided by the embedded speaker vector extraction neural network.

5. The speech synthesis system of claim 1, wherein the emotion encoding module further uses emotion-labeled training corpora to train a sentence emotion recognition neural network having a statistical pooling layer, the sentence emotion recognition neural network being divided into a frame level and a sentence level, so that the statistical pooling layer of the sentence emotion recognition neural network converts frame-level information of the sentence emotion recognition neural network into sentence-level information.

6. The speech synthesis system of claim 1, wherein the emotion encoding module further uses the Mel-frequency cepstral coefficients together with pitch and volume features that reflect emotional variation as input features of a sentence emotion recognition neural network and defines a classifier-based loss function for the sentence emotion recognition neural network, so that after the output layer of the sentence emotion recognition neural network is removed, the emotion encoding module takes the output of the statistical pooling layer of the sentence emotion recognition neural network as the embedded emotion vector of each input sentence, whereby the sentence emotion recognition neural network becomes the embedded emotion vector extraction neural network.

7. The speech synthesis system of claim 1, wherein the emotion encoding module further uses the embedded emotion vector extraction neural network to extract the embedded emotion vector of each input sentence, so that the emotion encoding module builds a database from the emotion label of each input sentence and the corresponding embedded emotion vector provided by the embedded emotion vector extraction neural network.

8. The speech synthesis system of claim 1, wherein the speech synthesis module further has an attention unit, and the decoding unit of the speech synthesis module aligns the input text or phonetic symbol sequence with the speech spectrum output by the decoding unit according to the embedded context vector and the weighting mechanism of the attention unit.

9. The speech synthesis system of claim 1, further comprising a vocoder module, the speech synthesis module further having an attention unit, wherein the vocoder module generates personalized and emotional synthesized speech using the attention mechanism provided by the attention unit of the speech synthesis module and the speech spectrum synthesized by the decoding unit, wherein the speech spectrum synthesized by the decoding unit is a Mel spectrum or a logarithmic Mel spectrum, and the vocoder module selects a neural network such as WaveNet, WaveGlow, WaveGAN or HiFi-GAN, according to different considerations of the acceptable sound quality of the synthesized speech or the system execution speed, to synthesize the speech spectrum into the personalized and emotional synthesized speech.

10. A multi-speaker and multi-emotion speech synthesis method, comprising: using, by a speaker encoding module, Mel-frequency cepstral coefficients as input features of an embedded speaker vector extraction neural network, so that the speaker encoding module generates a corresponding embedded speaker vector according to at least one input speech reference signal of multiple speakers; using, by an emotion encoding module, the Mel-frequency cepstral coefficients together with pitch and volume features that reflect emotional variation as input features of an embedded emotion vector extraction neural network, so that the emotion encoding module generates a corresponding embedded emotion vector according to at least one input multi-emotion speech reference signal; performing, by an encoding unit of a speech synthesis module, text analysis on an input text or phonetic symbol sequence to generate an embedded context vector corresponding to the text or phonetic symbol sequence; and integrating, by an integration unit of the speech synthesis module and through a conversion function, the embedded speaker vector generated by the speaker encoding module from the multi-speaker speech reference signal, the embedded emotion vector generated by the emotion encoding module from the multi-emotion speech reference signal, and the embedded context vector generated by the encoding unit from the text or phonetic symbol sequence into an embedded control vector, and then using, by the integration unit of the speech synthesis module, the embedded control vector, which integrates the embedded speaker vector, the embedded emotion vector and the embedded context vector, to control a decoding unit of the speech synthesis module to synthesize a speech spectrum using end-to-end speech synthesis technology, an encoder-decoder architecture or an attention mechanism.

11. The speech synthesis method of claim 10, further comprising using, by the speaker encoding module, speaker-labeled training corpora to train a speaker identification neural network having a statistical pooling layer, the speaker identification neural network being divided into a frame level and a sentence level, so that the statistical pooling layer of the speaker identification neural network converts frame-level information of the speaker identification neural network into sentence-level information.

12. The speech synthesis method of claim 10, further comprising using, by the speaker encoding module, the Mel-frequency cepstral coefficients as input features of a speaker identification neural network and defining a classifier-based loss function for the speaker identification neural network, so that after the output layer of the speaker identification neural network is removed, the speaker encoding module takes the output of the statistical pooling layer of the speaker identification neural network as the embedded speaker vector of each input sentence, whereby the speaker identification neural network becomes the embedded speaker vector extraction neural network.

13. The speech synthesis method of claim 10, further comprising extracting, by the speaker encoding module, the embedded speaker vector of each input sentence with the embedded speaker vector extraction neural network, so that the speaker encoding module builds a database from the speaker label of each input sentence and the corresponding embedded speaker vector provided by the embedded speaker vector extraction neural network.

14. The speech synthesis method of claim 10, further comprising using, by the emotion encoding module, emotion-labeled training corpora to train a sentence emotion recognition neural network having a statistical pooling layer, the sentence emotion recognition neural network being divided into a frame level and a sentence level, so that the statistical pooling layer of the sentence emotion recognition neural network converts frame-level information of the sentence emotion recognition neural network into sentence-level information.

15. The speech synthesis method of claim 10, further comprising using, by the emotion encoding module, the Mel-frequency cepstral coefficients together with pitch and volume features that reflect emotional variation as input features of a sentence emotion recognition neural network and defining a classifier-based loss function for the sentence emotion recognition neural network, so that after the output layer of the sentence emotion recognition neural network is removed, the emotion encoding module takes the output of the statistical pooling layer of the sentence emotion recognition neural network as the embedded emotion vector of each input sentence, whereby the sentence emotion recognition neural network becomes the embedded emotion vector extraction neural network.

16. The speech synthesis method of claim 10, further comprising extracting, by the emotion encoding module, the embedded emotion vector of each input sentence with the embedded emotion vector extraction neural network, so that the emotion encoding module builds a database from the emotion label of each input sentence and the corresponding embedded emotion vector provided by the embedded emotion vector extraction neural network.

17. The speech synthesis method of claim 10, further comprising aligning, by the decoding unit of the speech synthesis module, the input text or phonetic symbol sequence with the speech spectrum output by the decoding unit according to the embedded context vector and the weighting mechanism of an attention unit of the speech synthesis module.

18. The speech synthesis method of claim 10, further comprising generating, by a vocoder module, personalized and emotional synthesized speech using the attention mechanism provided by an attention unit of the speech synthesis module and the speech spectrum synthesized by the decoding unit, wherein the speech spectrum synthesized by the decoding unit is a Mel spectrum or a logarithmic Mel spectrum, and the vocoder module selects a neural network such as WaveNet, WaveGlow, WaveGAN or HiFi-GAN, according to different considerations of the acceptable sound quality of the synthesized speech or the system execution speed, to synthesize the speech spectrum into the personalized and emotional synthesized speech.

19. A computer-readable medium for use in a computing device or a computer, the computer-readable medium storing instructions for executing the multi-speaker and multi-emotion speech synthesis method of any one of claims 10 to 18.
TW111134964A 2022-09-15 2022-09-15 Multi-speaker and multi-emotion speech synthesis system, method and computer readable medium TWI840949B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW111134964A TWI840949B (en) 2022-09-15 2022-09-15 Multi-speaker and multi-emotion speech synthesis system, method and computer readable medium

Publications (2)

Publication Number Publication Date
TW202414382A TW202414382A (en) 2024-04-01
TWI840949B 2024-05-01

Family

ID=91622359

Family Applications (1)

Application Number Title Priority Date Filing Date
TW111134964A TWI840949B (en) 2022-09-15 2022-09-15 Multi-speaker and multi-emotion speech synthesis system, method and computer readable medium

Country Status (1)

Country Link
TW (1) TWI840949B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW324097B (en) * 1994-04-11 1998-01-01 Hal Trust L L C Phonology-based automatic speech recognition computer system in which a spoken word is recognized by finding the best match in lexicon to the symbolic representation of the speech signal
US20070112585A1 (en) * 2003-08-01 2007-05-17 Breiter Hans C Cognition analysis
CN107210034A (en) * 2015-02-03 2017-09-26 杜比实验室特许公司 Optional Meeting Summary
CN110379409A (en) * 2019-06-14 2019-10-25 平安科技(深圳)有限公司 Phoneme synthesizing method, system, terminal device and readable storage medium storing program for executing
