TWI840949B - Multi-speaker and multi-emotion speech synthesis system, method and computer readable medium - Google Patents
Multi-speaker and multi-emotion speech synthesis system, method and computer readable medium
- Publication number
- TWI840949B (application number TW111134964A)
- Authority
- TW
- Taiwan
- Prior art keywords
- emotion
- embedded
- speaker
- vector
- neural network
- Prior art date
Links
- 230000015572 biosynthetic process Effects 0.000 title claims abstract description 147
- 238000003786 synthesis reaction Methods 0.000 title claims abstract description 145
- 238000000034 method Methods 0.000 title claims abstract description 54
- 239000013598 vector Substances 0.000 claims abstract description 257
- 230000008451 emotion Effects 0.000 claims abstract description 179
- 238000001228 spectrum Methods 0.000 claims abstract description 68
- 230000010354 integration Effects 0.000 claims abstract description 37
- 238000013528 artificial neural network Methods 0.000 claims description 116
- 230000002996 emotional effect Effects 0.000 claims description 37
- 238000000605 extraction Methods 0.000 claims description 31
- 230000008909 emotion recognition Effects 0.000 claims description 29
- 238000011176 pooling Methods 0.000 claims description 24
- 238000012549 training Methods 0.000 claims description 23
- 238000001308 synthesis method Methods 0.000 claims description 15
- 238000005516 engineering process Methods 0.000 claims description 13
- 238000004458 analytical method Methods 0.000 claims description 12
- 230000007246 mechanism Effects 0.000 claims description 12
- 238000006243 chemical reaction Methods 0.000 claims description 8
- 230000008859 change Effects 0.000 claims description 3
- 230000002194 synthesizing effect Effects 0.000 claims description 2
- 230000006870 function Effects 0.000 description 36
- 230000009466 transformation Effects 0.000 description 22
- 238000012545 processing Methods 0.000 description 14
- 238000010586 diagram Methods 0.000 description 8
- 230000001537 neural effect Effects 0.000 description 4
- 230000007935 neutral effect Effects 0.000 description 4
- 230000000694 effects Effects 0.000 description 3
- 238000004891 communication Methods 0.000 description 2
- 238000010276 construction Methods 0.000 description 2
- 230000008569 process Effects 0.000 description 2
- 238000011160 research Methods 0.000 description 2
- 230000002452 interceptive effect Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000010187 selection method Methods 0.000 description 1
Landscapes
- Machine Translation (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
Description
The present invention relates to speech synthesis technology, and more particularly to a multi-speaker and multi-emotion speech synthesis system, method and computer-readable medium.
Speech synthesis technology is widely used in electronic devices and human-machine interfaces, such as Apple's iPhone smartphone and the voice assistant Siri, Amazon's Echo smart speaker and the voice assistant Alexa, and Google's smart speaker Google Home and the voice assistant Google Assistant.
Although speech synthesis technology can respond to or prompt users of electronic devices with relevant information through a human-machine interface, general speech synthesis technology can only synthesize neutral speech and rarely considers synthesizing emotional speech, so it cannot conduct efficient interactive dialogue with users in an emotional voice.
Furthermore, existing speech synthesis technology cannot generate an embedded speaker vector, an embedded emotion vector, and an embedded context vector from a multi-speaker speech reference signal, a multi-emotion speech reference signal, and a text or phonetic-symbol sequence, respectively; cannot generate a personalized and emotional speech spectrum and/or synthesized speech for personalized needs; cannot integrate the embedded speaker vector, the embedded emotion vector, and the embedded context vector into an embedded control vector according to a transformation function (such as a vector concatenation transformation function or the transformation function of a learnable nonlinear neural network) to control a speech synthesis module to synthesize a personalized and emotional speech spectrum; cannot use neural networks to improve the quality of synthesized speech; and cannot have a vocoder module generate personalized and emotional synthesized speech from the speech spectrum (such as a Mel spectrogram or log-Mel spectrogram) synthesized by the speech synthesis module.
Therefore, how to provide an innovative speech synthesis technology that solves any of the above problems or provides the related functions/services has become a major research topic for those skilled in the art.
The present invention provides an innovative multi-speaker and multi-emotion speech synthesis system, method and computer-readable medium, which can generate an embedded speaker vector, an embedded emotion vector, and an embedded context vector from a multi-speaker speech reference signal, a multi-emotion speech reference signal, and a text or phonetic-symbol sequence, respectively; can generate a personalized and emotional speech spectrum and/or synthesized speech for personalized needs; can integrate the embedded speaker vector, the embedded emotion vector, and the embedded context vector into an embedded control vector according to a transformation function (such as a vector concatenation transformation function or the transformation function of a learnable nonlinear neural network) to control the speech synthesis module to synthesize a personalized and emotional speech spectrum; can use neural networks to improve the quality of synthesized speech; and can have the vocoder module generate personalized and emotional synthesized speech from the speech spectrum (such as a Mel spectrogram or log-Mel spectrogram) synthesized by the speech synthesis module.
The multi-speaker and multi-emotion speech synthesis system of the present invention comprises: a speaker encoding module that generates a corresponding embedded speaker vector from at least one multi-speaker speech reference signal; an emotion encoding module that generates a corresponding embedded emotion vector from at least one multi-emotion speech reference signal; and a speech synthesis module having an encoding unit, an integration unit, and a decoding unit, wherein the encoding unit of the speech synthesis module generates a corresponding embedded context vector from a text or phonetic-symbol sequence, the integration unit of the speech synthesis module integrates the embedded speaker vector generated by the speaker encoding module from the multi-speaker speech reference signal, the embedded emotion vector generated by the emotion encoding module from the multi-emotion speech reference signal, and the embedded context vector generated by the encoding unit from the text or phonetic-symbol sequence into an embedded control vector, and the integration unit of the speech synthesis module then uses the embedded control vector, integrated from the embedded speaker vector, the embedded emotion vector, and the embedded context vector, to control the decoding unit of the speech synthesis module to synthesize a speech spectrum.
The multi-speaker and multi-emotion speech synthesis method of the present invention comprises: generating, by a speaker encoding module, a corresponding embedded speaker vector from at least one multi-speaker speech reference signal, generating, by an emotion encoding module, a corresponding embedded emotion vector from at least one multi-emotion speech reference signal, and generating, by an encoding unit of a speech synthesis module, a corresponding embedded context vector from a text or phonetic-symbol sequence; and integrating, by an integration unit of the speech synthesis module, the embedded speaker vector generated by the speaker encoding module from the multi-speaker speech reference signal, the embedded emotion vector generated by the emotion encoding module from the multi-emotion speech reference signal, and the embedded context vector generated by the encoding unit from the text or phonetic-symbol sequence into an embedded control vector, and then using, by the integration unit of the speech synthesis module, the embedded control vector integrated from the embedded speaker vector, the embedded emotion vector, and the embedded context vector to control a decoding unit of the speech synthesis module to synthesize a speech spectrum.
The computer-readable medium of the present invention is applied to a computing device or a computer and stores instructions for executing the above multi-speaker and multi-emotion speech synthesis method.
To make the above features and advantages of the present invention more comprehensible, embodiments are described in detail below with reference to the accompanying drawings. Additional features and advantages of the present invention are set forth in part in the following description, and in part will be apparent from the description or may be learned by practice of the invention. It should be understood that both the foregoing general description and the following detailed description are exemplary and explanatory, and are not intended to limit the claimed scope of the present invention.
1: Multi-speaker and multi-emotion speech synthesis system
10: Speaker encoding module
11: Neural network for embedded speaker vector extraction
12: Frame level
13: Input features
14: First statistics pooling layer
15: Utterance level
16: Speaker label
17: Softmax classifier
18: Loss function
19: First database
20: Emotion encoding module
21: Neural network for embedded emotion vector extraction
22: Frame level
23: Input features
24: Second statistics pooling layer
25: Utterance level
26: Emotion label
27: Softmax classifier
28: Loss function
29: Second database
30: Speech synthesis module
31: Encoding unit
32: Integration unit
33: Attention unit
34: Decoding unit
40: Vocoder module
A1: Multi-speaker speech reference signal
A2: Embedded speaker vector
B1: Multi-emotion speech reference signal
B2: Embedded emotion vector
C1: Text or phonetic-symbol sequence
C2: Embedded context vector
C3: Embedded control vector
C4: Speech spectrum
C5: Synthesized speech
S1 to S2: Steps
FIG. 1 is a schematic diagram of the architecture of the multi-speaker and multi-emotion speech synthesis system of the present invention.
FIG. 2 is a flow diagram of the multi-speaker and multi-emotion speech synthesis method of the present invention.
FIG. 3 is a schematic diagram of an embodiment of the embedded speaker vector analysis method in the multi-speaker and multi-emotion speech synthesis system and method of the present invention.
FIG. 4 is a schematic diagram of an embodiment of the embedded emotion vector analysis method in the multi-speaker and multi-emotion speech synthesis system and method of the present invention.
The embodiments of the present invention are described below by way of specific examples. Those skilled in the art can understand other advantages and effects of the present invention from the content disclosed in this specification, and can also implement or apply the invention through other, different but equivalent embodiments.
FIG. 1 is a schematic diagram of the architecture of the multi-speaker and multi-emotion speech synthesis system 1 of the present invention, FIG. 2 is a flow diagram of the multi-speaker and multi-emotion speech synthesis method of the present invention, FIG. 3 is a schematic diagram of an embodiment of the embedded speaker vector analysis method in the multi-speaker and multi-emotion speech synthesis system 1 and method of the present invention, and FIG. 4 is a schematic diagram of an embodiment of the embedded emotion vector analysis method in the multi-speaker and multi-emotion speech synthesis system 1 and method of the present invention.
As shown in FIG. 1, the multi-speaker and multi-emotion speech synthesis system 1 may include at least one speaker encoding module 10, at least one emotion encoding module 20, at least one speech synthesis module 30, and at least one vocoder module 40 that are connected or in communication with one another; in one embodiment, the speech synthesis module 30 is connected to or communicates with the speaker encoding module 10, the emotion encoding module 20, and the vocoder module 40, respectively. The speech synthesis module 30 may have at least one encoding unit 31, at least one integration unit 32, at least one attention unit 33, and at least one decoding unit 34; in one embodiment, the encoding unit 31 is connected to or communicates with the integration unit 32, the attention unit 33, and the decoding unit 34 in sequence.
In one embodiment, the speaker encoding module 10 may be a speaker encoder, a speaker encoding chip, circuit, software, or program; the emotion encoding module 20 may be an emotion encoder, an emotion encoding chip, circuit, software, or program; the speech synthesis module 30 may be a speech synthesizer, a speech synthesis chip, circuit, software, or program; and the vocoder module 40 may be a vocoder, a vocoder chip, circuit, software, or program. The encoding unit 31 may be an encoder, an encoding chip, circuit, software, or program; the integration unit 32 may be integration software or an integration program; the attention unit 33 may be attention software or an attention program; and the decoding unit 34 may be a decoder, a decoding chip, circuit, software, or program.
In one embodiment, "connected or in communication" in the present invention means interconnected or communicating in a wired manner (e.g., a wired network) or a wireless manner (e.g., a wireless network), "at least one" means one or more (e.g., one, two, three, or more), and "a plurality" means two or more (e.g., two, three, four, five, or ten or more). However, the present invention is not limited to the above embodiments.
The multi-speaker and multi-emotion speech synthesis system 1 and its method generate a personalized and emotional speech spectrum C4 and/or synthesized speech C5 (see FIG. 1) from the text input by the user and the specified speaker label 16 (see FIG. 3) and emotion label 26 (see FIG. 4), for example: [Zhang San]{angry} I don't think so! [Li Si]{disgust} Can this thing even be used? Furthermore, the multi-speaker and multi-emotion speech synthesis system 1 and its method can also change emotion within a single sentence, for example: [Wang Wu]{neutral} Judging only by appearances, {sad} can that be accurate? [Zhao Liu]{neutral} Every year around this time..... {happy} the Christmas trees come out. In the foregoing, [Zhang San], [Li Si], [Wang Wu], and [Zhao Liu] are speaker labels 16, while {angry}, {disgust}, {neutral}, {sad}, and {happy} are emotion labels 26.
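As an illustration of how such labelled input might be handled in practice, the following sketch parses a "[speaker]{emotion}" tagged string into per-segment speaker and emotion labels. The tag syntax follows the examples above, while the function name and the returned segment structure are assumptions of this sketch rather than part of the patent.

```python
# Minimal parser for "[speaker]{emotion}text" tagged input (illustrative only).
import re

TAG = re.compile(r"(?:\[(?P<speaker>[^\]]+)\])?\{(?P<emotion>[^}]+)\}")

def parse_tagged_text(line: str):
    """Split '[speaker]{emotion}text...' into (speaker, emotion, text) segments."""
    segments, speaker = [], None
    matches = list(TAG.finditer(line))
    for i, m in enumerate(matches):
        speaker = m.group("speaker") or speaker      # speaker persists until changed
        emotion = m.group("emotion")
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(line)
        text = line[start:end].strip()
        if text:
            segments.append((speaker, emotion, text))
    return segments

print(parse_tagged_text("[Wang Wu]{neutral}Judging only by looks, {sad}can that be right?"))
# [('Wang Wu', 'neutral', 'Judging only by looks,'), ('Wang Wu', 'sad', 'can that be right?')]
```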
A distinguishing feature of the multi-speaker and multi-emotion speech synthesis system 1 and its method is the construction of the speaker encoding module 10 to produce a plurality of (different) embedded speaker vectors A2 representing the voice characteristics of a plurality of (different) speakers, and the construction of the emotion encoding module 20 to produce a plurality of (different) embedded emotion vectors B2 representing a plurality of (different) emotion categories.
For example, the technical features of the present invention may include building, by the speaker encoding module 10, a distribution model of the embedded speaker vectors A2 to represent the speakers' voice characteristics, and building, by the emotion encoding module 20, a distribution model of the embedded emotion vectors B2 to represent a plurality of (different) emotion categories, so as to construct a multi-speaker and multi-emotion speech synthesis system 1 guided by both the embedded speaker vector A2 and the embedded emotion vector B2. That is, the multi-speaker and multi-emotion speech synthesis system 1 can construct a speaker encoding module 10 that produces a plurality of (different) embedded speaker vectors A2 and an emotion encoding module 20 that produces a plurality of (different) embedded emotion vectors B2. Accordingly, the multi-speaker and multi-emotion speech synthesis system 1 constructed by the present invention can complete speech synthesis processing based on the text input by the user and the specified speaker and emotion conditions.
As shown in FIG. 1, the speech synthesis module 30 may use end-to-end speech synthesis technology (such as Tacotron 2), an encoder-decoder architecture (such as the encoding unit 31 combined with the decoding unit 34), and/or the attention mechanism of the attention unit 33, so that the decoding unit of the speech synthesis module 30 automatically synthesizes a personalized and emotional speech spectrum C4.
The operation of the speech synthesis module 30 can be summarized as follows: the encoding unit 31 of the speech synthesis module 30 first performs text analysis on the input text or phonetic-symbol (grapheme or phoneme) sequence C1 to automatically produce the corresponding embedded context vector (embedding linguistic vector) C2; the decoding unit 34 of the speech synthesis module 30 then, based on the embedded context vector C2 (e.g., context feature parameters) and through the weighting mechanism of the attention unit 33, automatically aligns the input text or phonetic-symbol sequence C1 with the speech spectrum C4 output by the decoding unit 34, where the speech spectrum C4 may be, for example, a Mel spectrogram or a log-Mel spectrogram.
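For concreteness, the sketch below shows one possible, simplified form of the encoder side of such a Tacotron 2-style synthesizer: a phoneme-ID sequence is mapped to embedded context (linguistic) vectors C2 that an attention-based decoder would later align with output mel-spectrogram frames. The layer choices and dimensions are illustrative assumptions, not the patent's specification.

```python
# Illustrative text encoder producing embedded context vectors C2 (sizes assumed).
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, n_symbols=100, emb_dim=512, hidden=256):
        super().__init__()
        self.embedding = nn.Embedding(n_symbols, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, symbol_ids):                 # (batch, T_text)
        x = self.embedding(symbol_ids)             # (batch, T_text, emb_dim)
        context, _ = self.rnn(x)                   # (batch, T_text, 2*hidden)
        return context                             # embedded context vectors C2

encoder = TextEncoder()
ids = torch.randint(0, 100, (1, 12))               # a toy phoneme-ID sequence
print(encoder(ids).shape)                          # torch.Size([1, 12, 512])
```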
As shown in FIG. 1 and FIG. 2, the multi-speaker and multi-emotion speech synthesis method mainly includes the following steps.
In step S1, the speaker encoding module 10 generates the corresponding embedded speaker vector A2 from at least one multi-speaker speech reference signal A1, the emotion encoding module 20 generates the corresponding embedded emotion vector B2 from at least one multi-emotion speech reference signal B1, and the encoding unit 31 of the speech synthesis module 30 generates the corresponding embedded context vector C2 from the text or phonetic-symbol sequence C1.
Then, in step S2, the integration unit 32 of the speech synthesis module 30 integrates the embedded speaker vector A2 generated by the speaker encoding module 10 from the multi-speaker speech reference signal A1, the embedded emotion vector B2 generated by the emotion encoding module 20 from the multi-emotion speech reference signal B1, and the embedded context vector C2 generated by the encoding unit 31 from the text or phonetic-symbol sequence C1 into the embedded control vector C3, and the integration unit 32 of the speech synthesis module 30 then uses the embedded control vector C3, integrated from the embedded speaker vector A2, the embedded emotion vector B2, and the embedded context vector C2, to control the decoding unit 34 of the speech synthesis module 30 to synthesize the speech spectrum C4.
The multi-speaker and multi-emotion speech synthesis system 1 and its method can use neural networks to improve the quality of the synthesized speech C5, and can continue to develop, pursue, or present personalized and emotional (diversified) speech synthesis functions. To achieve personalized and emotional speech synthesis, the multi-speaker and multi-emotion speech synthesis system 1 and its method can use a multi-speaker and multi-emotion speech synthesis training corpus to build the speaker encoding module 10 and the emotion encoding module 20, respectively, so that the subsequent speech synthesis module 30 and vocoder module 40 can automatically synthesize the personalized and emotional speech spectrum C4 and synthesized speech C5, respectively.
In detail, the technical content of the multi-speaker and multi-emotion speech synthesis system 1 and its method can be described in the following three parts: [1] the first part covers the processing procedure, shown in FIG. 1, in which the embedded speaker vector A2 and the embedded emotion vector B2 control the multi-speaker and multi-emotion speech synthesis system 1; [2] the second part covers [2-1] the embedded speaker vector analysis method shown in FIG. 3 and [2-2] the embedded emotion vector analysis method shown in FIG. 4; and [3] the third part covers the method by which the integration unit 32 of the speech synthesis module 30 shown in FIG. 1 integrates the embedded speaker vector A2, the embedded emotion vector B2, and the embedded context vector C2, as well as the training methods of the related neural networks (models).
[1] Processing procedure (processing mechanism) for controlling the multi-speaker and multi-emotion speech synthesis system 1 with the embedded speaker vector A2 and the embedded emotion vector B2. As shown in FIG. 1, the speaker encoding module 10, the emotion encoding module 20, the speech synthesis module 30, and the vocoder module 40 must be trained and built using a multi-speaker and multi-emotion speech synthesis training corpus with text, speaker labels 16, and emotion labels 26, and the procedure for controlling the multi-speaker and multi-emotion speech synthesis system 1 with the embedded speaker vector A2 and the embedded emotion vector B2 includes the following procedures P11 to P16.
Procedure P11: the speaker encoding module 10 generates/derives the corresponding embedded speaker vector A2 from at least one input multi-speaker speech reference signal A1 (e.g., a multi-speaker speech reference waveform).
Procedure P12: the emotion encoding module 20 generates/derives the corresponding embedded emotion vector B2 from at least one input multi-emotion speech reference signal B1 (e.g., a multi-emotion speech reference waveform).
Procedure P13: the encoding unit 31 of the speech synthesis module 30 generates/derives the corresponding embedded context vector C2 from the input text or phonetic-symbol sequence C1.
Procedure P14: the integration unit 32 of the speech synthesis module 30 integrates the embedded speaker vector A2 produced by the speaker encoding module 10, the embedded emotion vector B2 produced by the emotion encoding module 20, and the embedded context vector C2 produced by the speech synthesis module 30 to obtain the embedded control vector C3.
Procedure P15: the integration unit 32 of the speech synthesis module 30 uses the embedded control vector C3, integrated from the embedded speaker vector A2, the embedded emotion vector B2, and the embedded context vector C2, to automatically control the attention unit 33 and the decoding unit 34 of the speech synthesis module 30 to synthesize the personalized and emotional speech spectrum C4 (e.g., a Mel spectrogram or log-Mel spectrogram).
Procedure P16: the vocoder module 40 uses the attention mechanism provided by the attention unit 33 of the speech synthesis module 30 and the speech spectrum C4 (e.g., a Mel spectrogram or log-Mel spectrogram) synthesized by the decoding unit 34 to automatically produce the personalized and emotional synthesized speech C5 (e.g., a synthesized speech signal).
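Procedures P11 to P16 can be read as a single inference pipeline; the sketch below expresses that flow with placeholder module objects whose call signatures are assumed for illustration only and are not prescribed by the patent.

```python
# High-level sketch of procedures P11-P16 as one inference function (names assumed).
def synthesize(speaker_encoder, emotion_encoder, synthesizer, vocoder,
               speaker_ref_wav, emotion_ref_wav, symbol_ids):
    spk_vec = speaker_encoder(speaker_ref_wav)        # P11: embedded speaker vector A2
    emo_vec = emotion_encoder(emotion_ref_wav)        # P12: embedded emotion vector B2
    ctx_vec = synthesizer.encode(symbol_ids)          # P13: embedded context vectors C2
    ctrl_vec = synthesizer.integrate(spk_vec, emo_vec, ctx_vec)  # P14: control vector C3
    mel = synthesizer.decode(ctrl_vec)                # P15: personalized, emotional mel spectrogram C4
    return vocoder(mel)                               # P16: synthesized speech C5
```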
[2-1] Embedded speaker vector analysis method (processing procedure): as shown in FIG. 1 and FIG. 3, the processing procedure of the embedded speaker vector analysis method may include the following procedures P21 to P24.
Procedure P21: the speaker encoding module 10 uses a training corpus with speaker labels 16 to train a speaker recognition neural network having a first statistics pooling layer 14. The speaker recognition neural network is divided into two parts, a frame level 12 and an utterance level 15, and the first statistics pooling layer 14 converts the frame-level 12 information of the speaker recognition neural network into utterance-level 15 information. The speaker encoding module 10 may choose Mel-frequency cepstral coefficients (MFCC) as the input features 13 of the speaker recognition neural network, and may also define the loss function 18 of the speaker recognition neural network based on a softmax classifier 17.
Procedure P22: after the output layer of the trained speaker recognition neural network is removed, the speaker encoding module 10 takes the output of the first statistics pooling layer 14 of the speaker recognition neural network as the embedded speaker vector A2 of each input utterance, so that the trained speaker recognition neural network becomes the neural network 11 for embedded speaker vector extraction.
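A minimal sketch of the speaker recognition network described in procedures P21 and P22 is given below: frame-level layers, a statistics pooling layer that converts frame-level information into utterance-level information, and an utterance-level classifier head trained with a softmax-based loss. The layer types and sizes are assumptions of this sketch, not the patent's specification.

```python
# Illustrative x-vector-style speaker recognition network (sizes assumed).
import torch
import torch.nn as nn

class SpeakerRecognitionNet(nn.Module):
    def __init__(self, n_mfcc=40, hidden=512, n_speakers=100):
        super().__init__()
        self.frame_level = nn.Sequential(                 # frame level 12
            nn.Conv1d(n_mfcc, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.utterance_level = nn.Sequential(             # utterance level 15 (removed after training)
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_speakers),                # softmax classifier 17
        )

    def forward(self, mfcc):                              # mfcc: (batch, n_mfcc, T)
        h = self.frame_level(mfcc)                        # (batch, hidden, T)
        # first statistics pooling layer 14: mean and std over time -> utterance level
        embedding = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)  # embedded speaker vector A2
        logits = self.utterance_level(embedding)          # trained with a softmax-based loss 18
        return embedding, logits
```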
Procedure P23: the speaker encoding module 10 uses the neural network 11 for embedded speaker vector extraction to extract the embedded speaker vector A2 of each input utterance.
Procedure P24: the speaker encoding module 10 builds a first database 19 from the speaker label 16 of each input utterance and the corresponding embedded speaker vector A2 provided by the neural network 11 for embedded speaker vector extraction. For example, the first database 19 may store a plurality of (different) speaker labels 16 and the corresponding plurality of (different) embedded speaker vectors A2; the speaker labels 16 may include a first speaker label, a second speaker label, up to an M-th speaker label, the embedded speaker vectors A2 may include a first embedded speaker vector, a second embedded speaker vector, up to an M-th embedded speaker vector, and M is a positive integer greater than 2 (e.g., 3, 4, 5 or more).
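Procedures P23 and P24 (and the analogous procedures P33 and P34 below) amount to extracting an embedding for every labelled utterance and storing a vector per label. The sketch below assumes an encoder that returns (embedding, logits) as in the previous sketch and averages the embeddings per label; the averaging step is an assumption, since the patent only states that labels and their corresponding vectors are stored.

```python
# Illustrative construction of a label -> embedding database (first database 19
# for speaker labels 16, or second database 29 for emotion labels 26).
import torch

@torch.no_grad()
def build_embedding_database(encoder, labelled_utterances):
    """labelled_utterances: iterable of (label, feature_tensor) pairs."""
    db = {}
    for label, feats in labelled_utterances:
        embedding, _ = encoder(feats.unsqueeze(0))     # ignore the classifier logits
        db.setdefault(label, []).append(embedding.squeeze(0))
    # one representative vector per label (assumed: mean over that label's utterances)
    return {label: torch.stack(vecs).mean(dim=0) for label, vecs in db.items()}
```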
[2-2] Embedded emotion vector analysis method (processing procedure): as shown in FIG. 1 and FIG. 4, the processing procedure of the embedded emotion vector analysis method may include the following procedures P31 to P34.
Procedure P31: the emotion encoding module 20 uses a training corpus with emotion labels 26 to train an utterance emotion recognition neural network having a second statistics pooling layer 24. The utterance emotion recognition neural network is divided into two parts, a frame level 22 and an utterance level 25, and the second statistics pooling layer 24 converts its frame-level information into utterance-level information. The emotion encoding module 20 may choose Mel-frequency cepstral coefficients (MFCC), together with features such as pitch and energy that reflect emotional variation, as the input features 23 of the utterance emotion recognition neural network, and may also define the loss function 28 of the utterance emotion recognition neural network based on a softmax classifier 27.
Procedure P32: after the output layer of the trained utterance emotion recognition neural network is removed, the emotion encoding module 20 takes the output of the second statistics pooling layer 24 of the utterance emotion recognition neural network as the embedded emotion vector B2 of each input utterance, so that the trained utterance emotion recognition neural network becomes the neural network 21 for embedded emotion vector extraction.
Procedure P33: the emotion encoding module 20 uses the neural network 21 for embedded emotion vector extraction to extract the embedded emotion vector B2 of each input utterance.
Procedure P34: the emotion encoding module 20 builds a second database 29 from the emotion label 26 of each input utterance and the corresponding embedded emotion vector B2 provided by the neural network 21 for embedded emotion vector extraction. For example, the second database 29 may store a plurality of (different) emotion labels 26 and the corresponding plurality of (different) embedded emotion vectors B2; the emotion labels 26 may include a first emotion label, a second emotion label, up to an N-th emotion label, the embedded emotion vectors B2 may include a first embedded emotion vector, a second embedded emotion vector, up to an N-th embedded emotion vector, and N is a positive integer greater than 2 (e.g., 3, 4, 5 or more).
[3] Method by which the integration unit 32 of the speech synthesis module 30 integrates the embedded speaker vector A2, the embedded emotion vector B2, and the embedded context vector C2.
As shown in FIG. 1 and in formulas (1) and (2) below, F(.) denotes the integration unit 32 of the speech synthesis module 30. The integration unit 32 (i.e., F(.)) can define a corresponding transformation function according to the selected method (e.g., a vector concatenation transformation function or the transformation function of a learnable nonlinear neural network), so that the integration unit 32 automatically integrates the embedded speaker vector A2, the embedded emotion vector B2, and the embedded context vector C2 into one embedded control vector C3 (e.g., a fixed-length embedded control vector C3) according to the selected transformation function. That is, the transformation function of the integration unit 32 may be the vector concatenation transformation function of formula (1), or the learnable nonlinear neural network transformation function of formula (2):
F(v_s, v_e, v_l) = Concat(ω_s·v_s, ω_e·v_e, ω_l·v_l)  (1)
F(v_s, v_e, v_l) = NN(Concat(ω_s·v_s, ω_e·v_e, ω_l·v_l))  (2)
In formulas (1) and (2) above, F(.) denotes the integration unit 32; v_s, v_e, and v_l denote the embedded speaker vector A2, the embedded emotion vector B2, and the embedded context vector C2, respectively; and ω_s, ω_e, and ω_l denote the adjustable weights of the embedded speaker vector A2, the embedded emotion vector B2, and the embedded context vector C2, respectively.
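The two options for F(.) can be sketched as follows; the weight values, layer sizes, and the exact placement of the nonlinearity are assumptions of this sketch.

```python
# Illustrative implementations of the integration unit 32: weighted concatenation
# (formula (1)) and a learnable nonlinear transform (formula (2)). Sizes assumed.
import torch
import torch.nn as nn

def concat_integration(spk, emo, ctx, w_s=1.0, w_e=1.0, w_l=1.0):
    """Formula (1): concatenate the weighted embeddings into control vector C3."""
    return torch.cat([w_s * spk, w_e * emo, w_l * ctx], dim=-1)

class LearnableIntegration(nn.Module):
    """Formula (2): a learnable nonlinear transform producing a fixed-length C3."""
    def __init__(self, spk_dim=192, emo_dim=192, ctx_dim=512, out_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(spk_dim + emo_dim + ctx_dim, out_dim), nn.Tanh(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, spk, emo, ctx):
        return self.net(torch.cat([spk, emo, ctx], dim=-1))
```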
When the integration unit 32 adopts a learnable nonlinear neural network, that learnable nonlinear neural network can be further integrated with the attention unit 33 and the decoding unit 34 (e.g., the decoding network) of the speech synthesis module 30 to obtain an integrated neural network, and this integrated neural network can be trained using the embedded control vector C3 integrated from the embedded speaker vector A2, the embedded emotion vector B2, and the embedded context vector C2.
The decoding unit 34 of the speech synthesis module 30 can then input the produced speech spectrum C4 (e.g., a Mel spectrogram or log-Mel spectrogram) to the vocoder module 40, so that the vocoder module 40 uses the speech spectrum C4 to synthesize the synthesized speech C5 that is personalized (e.g., speaker characteristics) and emotional (e.g., speaker emotion). That is, depending on considerations such as the acceptable quality of the synthesized speech C5 and/or the system execution speed, the vocoder module 40 may select one of a plurality of neural networks such as WaveNet, WaveGlow, WaveGAN, and HiFi-GAN to synthesize the speech spectrum C4 into the personalized (e.g., speaker characteristics) and emotional (e.g., speaker emotion) synthesized speech C5.
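The patent leaves the vocoder choice open (WaveNet, WaveGlow, WaveGAN, HiFi-GAN). The sketch below only illustrates the mel-spectrogram-to-waveform interface of the vocoder module 40, using a Griffin-Lim inversion from librosa as a simple stand-in; in practice, one of the named neural vocoders would replace this call.

```python
# Illustrative mel-spectrogram -> waveform interface (Griffin-Lim as a stand-in).
import numpy as np
import librosa

def mel_to_waveform(mel_spectrogram: np.ndarray, sr: int = 22050,
                    n_fft: int = 1024, hop_length: int = 256) -> np.ndarray:
    """mel_spectrogram: (n_mels, T) power mel spectrogram."""
    return librosa.feature.inverse.mel_to_audio(
        mel_spectrogram, sr=sr, n_fft=n_fft, hop_length=hop_length)
```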
Once built, the multi-speaker and multi-emotion speech synthesis system 1 and its method can generate personalized and emotional synthesized speech C5 from the text input by the user and the specified speaker label 16 and emotion label 26; they can also look up the embedded speaker vector A2 corresponding to the speaker label 16 in the first database 19 shown in FIG. 3 and the embedded emotion vector B2 corresponding to the emotion label 26 in the second database 29 shown in FIG. 4, and then input the embedded speaker vector A2 and the embedded emotion vector B2 into the speech synthesis module 30 for speech synthesis processing.
For example, the related processing procedures of the multi-speaker and multi-emotion speech synthesis system 1 and its method may include the following.
[1] Through the multi-speaker and multi-emotion speech synthesis system 1, the user first collects a multi-speaker and multi-emotion speech synthesis training corpus with text, speaker labels 16, and emotion labels 26, either by (1) recruiting speakers and designing the recording script and recording procedure, or by (2) using publicly available corpora. For (1), the recruited speakers can record the corpus in a plurality of (different) emotions according to different needs. For (2), corpora such as television series or podcasts may be used.
Taking English as an example, the Multimodal Emotion Lines Dataset (MELD; https://affective-meld.github.io/) and the Multimodal Signal Processing-Podcast corpus (MSP-Podcast; https://ecs.utdallas.edu/research/researchlabs/msp-lab/MSP-Podcast.html) are both annotated multi-speaker and multi-emotion speech synthesis training corpora.
[2] The speaker labels 16 in the above multi-speaker and multi-emotion speech synthesis training corpus are used to train the speaker encoding module 10 shown in FIG. 1, i.e., the neural network 11 for embedded speaker vector extraction shown in FIG. 3, so that the speaker encoding module 10 uses the neural network 11 to extract the embedded speaker vector A2 of each input utterance. The neural network 11 for embedded speaker vector extraction is trained in the x-vector manner, and the speaker encoding module 10 may choose Mel-frequency cepstral coefficients (MFCC) as the input features 13 of the neural network 11.
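A sketch of extracting the MFCC input features 13 with librosa is shown below; the sampling rate and frame parameters are illustrative assumptions.

```python
# Illustrative MFCC extraction for the speaker encoder (parameters assumed).
import librosa

def speaker_features(wav_path: str, n_mfcc: int = 40):
    y, sr = librosa.load(wav_path, sr=16000)
    # (n_mfcc, T) frame-level features fed to the embedded-speaker-vector network
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, n_fft=400, hop_length=160)
```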
[3] The emotion labels 26 in the above multi-speaker and multi-emotion speech synthesis training corpus are used to train the emotion encoding module 20 shown in FIG. 1, i.e., the neural network 21 for embedded emotion vector extraction shown in FIG. 4, so that the emotion encoding module 20 uses the neural network 21 to extract the embedded emotion vector B2 of each input utterance. The neural network 21 for embedded emotion vector extraction is trained in the x-vector manner, and the emotion encoding module 20 may choose Mel-frequency cepstral coefficients (MFCC), together with features such as pitch and energy that reflect emotional variation, as the input features 23 of the neural network 21.
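Correspondingly, the emotion-encoder input features 23 can be sketched as MFCCs stacked with frame-level pitch and energy contours; the frame parameters and the simple stacking are illustrative assumptions.

```python
# Illustrative MFCC + pitch + energy features for the emotion encoder (parameters assumed).
import numpy as np
import librosa

def emotion_features(wav_path: str, n_mfcc: int = 40):
    y, sr = librosa.load(wav_path, sr=16000)
    hop = 160
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, n_fft=400, hop_length=hop)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr, frame_length=1024, hop_length=hop)
    energy = librosa.feature.rms(y=y, frame_length=400, hop_length=hop)[0]
    frames = min(mfcc.shape[1], len(f0), len(energy))   # align frame counts
    return np.vstack([mfcc[:, :frames], f0[:frames], energy[:frames]])
```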
[4] All of the text, speaker labels 16, and emotion labels 26 of the above multi-speaker and multi-emotion speech synthesis training corpus, together with the trained speaker encoding module 10 (i.e., the neural network 11 for embedded speaker vector extraction) and the trained emotion encoding module 20 (i.e., the neural network 21 for embedded emotion vector extraction), are used to train the speech synthesis module 30, including the front-stage encoding unit 31 and integration unit 32 (i.e., F(.)), the middle-stage attention unit 33, and the back-stage decoding unit 34, so that the speech synthesis module 30 produces the corresponding speech spectrum C4 (e.g., a Mel spectrogram or log-Mel spectrogram).
The architecture of the encoding unit 31, the attention unit 33, and the decoding unit 34 of the speech synthesis module 30 may use end-to-end speech synthesis technology (such as Tacotron 2), and the integration unit 32 (i.e., F(.)) can define the corresponding transformation function according to the selected method (e.g., a vector concatenation transformation function or the transformation function of a learnable nonlinear neural network), so that the integration unit 32 automatically integrates the embedded speaker vector A2, the embedded emotion vector B2, and the embedded context vector C2 into one embedded control vector C3 (e.g., a fixed-length embedded control vector C3) according to the selected transformation function. The transformation function of the integration unit 32 may be the vector concatenation transformation function of formula (1) above, or the learnable nonlinear neural network transformation function of formula (2) above.
When the integration unit 32 adopts a learnable nonlinear neural network, that learnable nonlinear neural network can be further integrated with the attention unit 33 and the decoding unit 34 (e.g., the decoding network) of the speech synthesis module 30 to obtain an integrated neural network, and this integrated neural network can be trained using the embedded control vector C3 integrated from the embedded speaker vector A2, the embedded emotion vector B2, and the embedded context vector C2.
The decoding unit 34 of the speech synthesis module 30 can then input the produced speech spectrum C4 (e.g., a Mel spectrogram or log-Mel spectrogram) to the vocoder module 40, so that the vocoder module 40 uses the speech spectrum C4 to synthesize the synthesized speech C5 that is personalized (e.g., speaker characteristics) and emotional (e.g., speaker emotion). That is, depending on considerations such as the acceptable quality of the synthesized speech C5 and/or the system execution speed, the vocoder module 40 may select one of a plurality of neural networks such as WaveNet, WaveGlow, WaveGAN, and HiFi-GAN to synthesize the speech spectrum C4 into the personalized (e.g., speaker characteristics) and emotional (e.g., speaker emotion) synthesized speech C5.
[5] Once built, the multi-speaker and multi-emotion speech synthesis system 1 and its method can generate personalized and emotional synthesized speech C5 from the text input by the user and the specified speaker label 16 and emotion label 26; they can also look up the embedded speaker vector A2 corresponding to the speaker label 16 in the first database 19 shown in FIG. 3 and the embedded emotion vector B2 corresponding to the emotion label 26 in the second database 29 shown in FIG. 4, and then input the embedded speaker vector A2 and the embedded emotion vector B2 into the speech synthesis module 30 for speech synthesis processing.
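Combining the parsed labels, the two databases, and the synthesis pipeline sketched earlier, step [5] can be illustrated as follows; all object names and call signatures are assumptions of this sketch.

```python
# Illustrative inference driver: resolve labels via the databases, then synthesize.
def synthesize_from_labels(text_segments, speaker_db, emotion_db,
                           synthesizer, vocoder, text_frontend):
    waveforms = []
    for speaker_label, emotion_label, text in text_segments:
        spk_vec = speaker_db[speaker_label]                   # embedded speaker vector A2
        emo_vec = emotion_db[emotion_label]                   # embedded emotion vector B2
        ctx_vec = synthesizer.encode(text_frontend(text))     # embedded context vectors C2
        ctrl_vec = synthesizer.integrate(spk_vec, emo_vec, ctx_vec)  # control vector C3
        mel = synthesizer.decode(ctrl_vec)                    # speech spectrum C4
        waveforms.append(vocoder(mel))                        # synthesized speech C5
    return waveforms
```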
In addition, the present invention also provides a computer-readable medium for the multi-speaker and multi-emotion speech synthesis method, which is applied to a computing device or a computer having a processor and/or a memory. The computer-readable medium stores instructions, and the computing device or computer executes the computer-readable medium through the processor and/or memory to perform the above content. For example, the processor may be a microprocessor, a central processing unit (CPU), a graphics processing unit (GPU), etc., and the memory may be a random access memory (RAM), a memory card, a hard disk (e.g., a cloud/network drive), a database, etc., but is not limited thereto.
In summary, the multi-speaker and multi-emotion speech synthesis system, method, and computer-readable medium of the present invention have at least the following features, advantages, or technical effects.
1. The speaker encoding module of the present invention can automatically generate/derive embedded speaker vectors from multi-speaker speech reference signals to represent the speakers' voice characteristics, the emotion encoding module can automatically generate/derive embedded emotion vectors from multi-emotion speech reference signals to represent the emotion categories, and the encoding unit of the speech synthesis module can automatically generate/derive embedded context vectors from text or phonetic-symbol sequences.
2. The speech synthesis module or integration unit of the present invention can integrate the embedded speaker vector, the embedded emotion vector, and the embedded context vector into an embedded control vector, and use the embedded control vector to automatically control the speech synthesis module (e.g., the attention unit and the decoding unit) to synthesize a personalized and emotional speech spectrum.
3. The present invention can generate a personalized and emotional speech spectrum and/or synthesized speech for personalized needs, and can also make the synthesized speech exhibit the voice characteristics of a plurality of (different) speakers according to different needs.
4. The present invention can build the speaker encoding module, the emotion encoding module, the speech synthesis module, and the vocoder module, use neural networks to improve the quality of the synthesized speech, and continue to develop, pursue, or present personalized and emotional (diversified) speech synthesis functions.
5. The present invention can use a multi-speaker and multi-emotion speech synthesis training corpus to build the speaker encoding module and the emotion encoding module, respectively, so that the subsequent speech synthesis module and vocoder module can automatically synthesize the personalized and emotional speech spectrum and/or synthesized speech, respectively.
6. The speech synthesis module of the present invention can use end-to-end speech synthesis technology (such as Tacotron 2), an encoder-decoder architecture (e.g., the encoding unit combined with the decoding unit), and/or the attention mechanism of the attention unit, so that the speech synthesis module automatically synthesizes a personalized and emotional speech spectrum.
7. The speech synthesis module of the present invention can perform text analysis on the text or phonetic-symbol sequence to automatically generate/derive the embedded context vector, and then use the embedded context vector and the weighting mechanism of the attention unit to automatically align the input text or phonetic-symbol sequence with the output speech spectrum.
8. The speech synthesis module or integration unit of the present invention can define a corresponding transformation function according to the selected method (e.g., a vector concatenation transformation function or the transformation function of a learnable nonlinear neural network), so that the embedded speaker vector, the embedded emotion vector, and the embedded context vector are automatically integrated into one embedded control vector according to the selected transformation function.
9. The speaker encoding module of the present invention can use a training corpus with speaker labels to train a speaker recognition neural network (the neural network for embedded speaker vector extraction) having a statistics pooling layer, so that the statistics pooling layer automatically converts the frame-level information of the speaker recognition neural network into utterance-level information.
10. The emotion encoding module of the present invention can use a training corpus with emotion labels to train an utterance emotion recognition neural network (the neural network for embedded emotion vector extraction) having a statistics pooling layer, so that the statistics pooling layer automatically converts the frame-level information of the utterance emotion recognition neural network into utterance-level information.
11. The vocoder module of the present invention can use the speech spectrum (e.g., a Mel spectrogram or log-Mel spectrogram) synthesized by the attention unit and decoding unit of the speech synthesis module to automatically produce personalized and emotional synthesized speech.
The above embodiments merely illustrate the principles, features, and effects of the present invention and are not intended to limit its practicable scope; anyone skilled in the art may modify and alter the above embodiments without departing from the spirit and scope of the present invention. Any equivalent change or modification accomplished using the disclosure of the present invention shall still be covered by the scope of the claims. Therefore, the scope of protection of the present invention shall be as set forth in the claims.
1: Multi-speaker and multi-emotion speech synthesis system
10: Speaker encoding module
20: Emotion encoding module
30: Speech synthesis module
31: Encoding unit
32: Integration unit
33: Attention unit
34: Decoding unit
40: Vocoder module
A1: Multi-speaker speech reference signal
A2: Embedded speaker vector
B1: Multi-emotion speech reference signal
B2: Embedded emotion vector
C1: Text or phonetic-symbol sequence
C2: Embedded context vector
C3: Embedded control vector
C4: Speech spectrum
C5: Synthesized speech
Claims (19)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW111134964A TWI840949B (en) | 2022-09-15 | 2022-09-15 | Multi-speaker and multi-emotion speech synthesis system, method and computer readable medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| TW202414382A TW202414382A (en) | 2024-04-01 |
| TWI840949B true TWI840949B (en) | 2024-05-01 |
Family
ID=91622359
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| TW111134964A TWI840949B (en) | 2022-09-15 | 2022-09-15 | Multi-speaker and multi-emotion speech synthesis system, method and computer readable medium |
Country Status (1)
| Country | Link |
|---|---|
| TW (1) | TWI840949B (en) |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| TW324097B (en) * | 1994-04-11 | 1998-01-01 | Hal Trust L L C | Phonology-based automatic speech recognition computer system in which a spoken word is recognized by finding the best match in lexicon to the symbolic representation of the speech signal |
| US20070112585A1 (en) * | 2003-08-01 | 2007-05-17 | Breiter Hans C | Cognition analysis |
| CN107210034A (en) * | 2015-02-03 | 2017-09-26 | 杜比实验室特许公司 | Optional Meeting Summary |
| CN110379409A (en) * | 2019-06-14 | 2019-10-25 | 平安科技(深圳)有限公司 | Phoneme synthesizing method, system, terminal device and readable storage medium storing program for executing |
-
2022
- 2022-09-15 TW TW111134964A patent/TWI840949B/en active
Also Published As
| Publication number | Publication date |
|---|---|
| TW202414382A (en) | 2024-04-01 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN107195296B (en) | Voice recognition method, device, terminal and system | |
| CN110246488B (en) | Speech conversion method and device for semi-optimized CycleGAN model | |
| CN102779508B (en) | Sound bank generates Apparatus for () and method therefor, speech synthesis system and method thereof | |
| CN111433847B (en) | Voice conversion method and training method, intelligent device and storage medium | |
| Wu et al. | Audio classification using attention-augmented convolutional neural network | |
| CN115101046B (en) | A method and device for synthesizing speech of a specific speaker | |
| CN110534089A (en) | A Chinese Speech Synthesis Method Based on Phoneme and Prosodic Structure | |
| CN112927674A (en) | Voice style migration method and device, readable medium and electronic equipment | |
| CN108615525B (en) | Voice recognition method and device | |
| CN110473523A (en) | A kind of audio recognition method, device, storage medium and terminal | |
| CN112786004A (en) | Speech synthesis method, electronic device, and storage device | |
| CN110827801A (en) | A kind of automatic speech recognition method and system based on artificial intelligence | |
| CN111276120A (en) | Speech synthesis method, apparatus and computer-readable storage medium | |
| CN109065032A (en) | A kind of external corpus audio recognition method based on depth convolutional neural networks | |
| CN101281518A (en) | Speech translation device and method | |
| CN108231062A (en) | A kind of voice translation method and device | |
| CN113539232B (en) | Voice synthesis method based on lesson-admiring voice data set | |
| CN114283786A (en) | Speech recognition method, device and computer readable storage medium | |
| CN114512121B (en) | Speech synthesis methods, model training methods and devices | |
| Liu et al. | Feature fusion of speech emotion recognition based on deep learning | |
| CN112802446A (en) | Audio synthesis method and device, electronic equipment and computer-readable storage medium | |
| CN107221344A (en) | A kind of speech emotional moving method | |
| CN119360818A (en) | Speech generation method, device, computer equipment and medium based on artificial intelligence | |
| CN112489634A (en) | Language acoustic model training method and device, electronic equipment and computer medium | |
| CN112735404A (en) | Ironic detection method, system, terminal device and storage medium |