TWI867419B - Method and system of generating audio - Google Patents
Method and system of generating audio
- Publication number
- TWI867419B
- Authority
- TW
- Taiwan
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
Abstract
Description
The present disclosure relates to speech-generation technology, and more particularly to a method of generating speech and a system of generating speech.

When producing an audiobook, the characters in the audiobook should be dubbed with suitable voiceprints to improve its quality. However, the voiceprints in a preset database may not be sufficient to cover every character in the audiobook. How to design a solution to this problem is therefore an important topic in the field.

An embodiment of the present invention provides a method of generating speech. The method includes: extracting character information; matching the character information against a plurality of preset sound parameters; when the character information does not match the preset sound parameters, selecting from the preset sound parameters at least one original sound parameter corresponding to the character information and generating, based on the at least one original sound parameter, at least one newly added sound parameter corresponding to the character information; and generating corresponding speech based on the at least one newly added sound parameter.

An embodiment of the present invention provides a system of generating speech. The system includes a storage unit and a processor. The storage unit stores a plurality of preset sound parameters. The processor is coupled to the storage unit and accesses the preset sound parameters from the storage unit. When character information does not match the preset sound parameters, the processor generates, based on at least one original sound parameter among the preset sound parameters, at least one newly added sound parameter corresponding to the character information, and generates speech based on the at least one newly added sound parameter.

Herein, when an element is described as "connected" or "coupled", it may mean "electrically connected" or "electrically coupled". "Connected" or "coupled" may also indicate that two or more elements operate or interact with each other. In addition, although terms such as "first" and "second" are used herein to describe different elements, they merely distinguish elements or operations described with the same technical term. Unless the context clearly indicates otherwise, these terms neither refer to nor imply an order or sequence, nor are they intended to limit the present disclosure.

Unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meanings as commonly understood by a person of ordinary skill in the art to which this disclosure belongs. It will be further understood that terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the relevant art and this disclosure, and should not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
FIG. 1 is a schematic diagram of a system 100 according to some embodiments of the present disclosure. In some embodiments, the system 100 generates speech based on preset sound parameters stored in a storage element. As shown in FIG. 1, the system 100 includes a storage unit 110, a processor 120, an input interface 130, and a speaker 150. The processor 120 is coupled to the storage unit 110, the input interface 130, and the speaker 150.

In operation, the processor 120 receives a text 140 preset in the system 100 or received from an external source, or character information CR1 entered through the input interface 130, and receives preset sound parameters DS1–DS4 from the storage unit 110. When the character information CR1 matches none of the preset sound parameters DS1–DS4, the processor 120 uses the preset sound parameters DS1–DS4 to generate a newly added sound parameter matching the character information CR1 (for example, the newly added sound parameter VD1 described below), generates speech AB2 based on the newly added sound parameter, and plays the speech AB2 through the speaker 150.

In some embodiments, the text 140 may represent a story or a script, but the present disclosure is not limited thereto; in some embodiments, the text 140 may also contain only one or a few sentences. In various embodiments, the text 140 may be pre-stored in any storage medium, for example in the storage unit 110, a flash drive, a hard disk, a portable hard drive, the cloud, or any other storage medium; the present disclosure is not limited in this regard.

In some approaches, dubbing a text requires hiring as many voice actors as there are characters in the text, each with matching vocal characteristics, which makes the cost high.

In contrast, in embodiments of the present invention, the processor 120 can generate, according to the character information CR1, newly added sound parameters matching the character information CR1 based on the preset sound parameters DS1–DS4. In this way, the text 140 can be dubbed using only the preset sound parameters DS1–DS4, at a lower cost.

As shown in FIG. 1, in some embodiments, the storage unit 110 has a database 111. The database 111 stores the preset sound parameters DS1–DS4. In various embodiments, the database 111 may contain any number of preset sound parameters; FIG. 1 is merely illustrative.

As shown in FIG. 1, in some embodiments, the processor 120 includes a text extraction module 121, a sound parameter augmentation module 122, and a text-to-speech conversion module 123. In various embodiments, the processor 120 may include other modules that perform the operations of the text extraction module 121, the sound parameter augmentation module 122, and the text-to-speech conversion module 123 described below; the embodiment shown in FIG. 1 is merely illustrative and is not intended to limit the present invention. The above modules may also be implemented by other systems, hardware, software, or a combination thereof.

In some embodiments, the text extraction module 121 extracts the character information CR1 and text content TX1 from the text 140, and provides the character information CR1 and the text content TX1 to the sound parameter augmentation module 122 and the text-to-speech conversion module 123, respectively.

In other embodiments, the input interface 130 receives input data FT1, extracts the character information CR1 from the input data FT1, and provides the character information CR1 to the sound parameter augmentation module 122. In some embodiments, the input data FT1 may contain user-defined feature parameters, such as feature parameters corresponding to various age and/or gender characteristics.

In some embodiments, the sound parameter augmentation module 122 accesses the preset sound parameters DS1–DS4 from the storage unit 110 and matches the character information CR1 against the preset sound parameters DS1–DS4. For example, the sound parameter augmentation module 122 compares the age and/or gender characteristics of the character information CR1 with those of the preset sound parameters DS1–DS4, and, based on the comparison result, assigns to the character information CR1 the one of the preset sound parameters DS1–DS4 that matches it. In some embodiments, the sound parameter augmentation module 122 performs the matching using speaker embedding techniques.
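As a rough illustration of this matching step — not the patented implementation — the character information can be compared against the presets as points in a small feature space, returning "no match" when every preset lies too far away. The preset values, the 2-D feature layout, and the distance threshold below are all invented for the sketch; a real system would compare higher-dimensional speaker embeddings.

```python
import math

# Hypothetical preset sound parameters as (age feature, gender feature) points.
PRESETS = {
    "DS1": (20.0, 0.0),
    "DS2": (40.0, 0.0),
    "DS3": (60.0, 0.0),
    "DS4": (80.0, 0.0),
}

def match_character(age, gender, max_dist=5.0):
    """Return the name of the nearest preset, or None when no preset is
    within max_dist (the 'character info does not match' branch)."""
    best_name, best_dist = None, float("inf")
    for name, (p_age, p_gender) in PRESETS.items():
        # Weight the gender axis heavily so a gender mismatch dominates.
        dist = math.hypot(age - p_age, (gender - p_gender) * 100.0)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= max_dist else None
```

Under these toy values, a 20-year-old character matches DS1 directly, while a 30-year-old character matches nothing and would fall through to the augmentation path described below.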
When the character information CR1 matches one of the preset sound parameters DS1–DS4, the sound parameter augmentation module 122 selects the corresponding preset sound parameter as an original sound parameter DS and transmits the original sound parameter DS to the text-to-speech conversion module 123, so that the text-to-speech conversion module 123 generates corresponding speech AB1 based on the original sound parameter DS and the text content TX1.

When the character information CR1 matches none of the preset sound parameters DS1–DS4, the sound parameter augmentation module 122 selects from the preset sound parameters DS1–DS4 a sound parameter corresponding to the character information CR1 as an original sound parameter DX, and generates, based on the original sound parameter DX, a newly added sound parameter VD1 corresponding to the character information CR1. The text-to-speech conversion module 123 then generates corresponding new speech AB2 based on the newly added sound parameter VD1 and the text content TX1. In some embodiments, the speech AB1 and/or AB2 corresponds to the audiobook represented by the text 140, and the speaker 150 plays the speech AB1 and/or AB2.

In some embodiments, the character information CR1 may contain various feature parameters representing, for example, age and/or gender characteristics. When the feature parameters of the character information CR1 match none of the preset sound parameters DS1–DS4, the sound parameter augmentation module 122 generates the corresponding newly added sound parameter VD1 based on the original sound parameter DX and the feature parameters of the character information CR1, so that the corresponding new speech AB2 can be generated and can express the character information CR1 that originally had no matching preset, as described below.

FIG. 2 is a schematic diagram of the preset sound parameters DS1–DS4 in a vector space 200 according to some embodiments of the present disclosure. In some embodiments, different positions in the vector space 200 correspond to different sound parameters, for example different sound frequencies. As shown in FIG. 2, the preset sound parameters DS1–DS4 and the newly added sound parameters NS1, NS2, NZ1, NN1, and NN2 are represented in the vector space 200. Specific embodiments are described below.

In some embodiments, when the number of characters corresponding to the character information CR1 is greater than the number of original sound parameters DX, the sound parameter augmentation module 122 generates newly added sound parameters such that the number of newly added sound parameters plus the number of original sound parameters DX equals the number of characters.

For example, suppose the preset sound parameters DS1–DS4 have age characteristics of 20, 40, 60, and 80 years, respectively, and the character information CR1 corresponds to three 20-year-old characters. In this case, the sound parameter augmentation module 122 selects the preset sound parameter DS1, which matches 20 years of age, as the original sound parameter DX, and generates from the preset sound parameter DS1 two newly added sound parameters NS1 and NS2 corresponding to 20 years of age, where the newly added sound parameters NS1 and NS2 and the preset sound parameter DS1 all differ from one another. The sound parameter augmentation module 122 then assigns the newly added sound parameters NS1 and NS2 and the preset sound parameter DS1 to the three 20-year-old characters, respectively, so that the text-to-speech conversion module 123 generates the speech AB2 based on the newly added sound parameters NS1 and NS2, the preset sound parameter DS1, and the text content TX1. The speech AB2 can thus voice three distinct 20-year-old characters.

In some embodiments, the sound parameter augmentation module 122 adjusts the preset sound parameter DS1 to generate the newly added sound parameters NS1 and NS2. In the vector space 200, the distance between each of the newly added sound parameters NS1 and NS2 and the preset sound parameter DS1 is smaller than the distance between each of them and the other preset sound parameters (for example, the preset sound parameters DS2–DS4). The characteristics of each of the newly added sound parameters NS1 and NS2 are therefore close to those of the preset sound parameter DS1; for example, the preset sound parameter DS1 and the newly added sound parameters NS1 and NS2 all share the same age characteristic. In some embodiments, the aforementioned distance may be the distance between any two feature points (sound parameters) among the newly added sound parameters NS1 and NS2 and the preset sound parameters DS1–DS4 in the feature space, that is, the distance between the feature vectors of any two sound parameters.
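The "stay close to DS1" constraint above can be sketched as a simple perturbation scheme — an illustrative stand-in, not necessarily the disclosure's algorithm: jitter the base vector and accept only candidates that remain nearer to it than to any other preset. The 2-D vectors are invented placeholders for the points in vector space 200.

```python
import random

def augment_near(base, others, n, scale=1.0, seed=0):
    """Generate n distinct vectors by small perturbations of `base`, each
    kept closer (squared distance) to `base` than to any vector in `others`."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        cand = [x + rng.uniform(-scale, scale) for x in base]
        d_base = sum((c - x) ** 2 for c, x in zip(cand, base))
        d_other = min(sum((c - o) ** 2 for c, o in zip(cand, vec)) for vec in others)
        if d_base < d_other and cand != base and cand not in out:
            out.append(cand)
    return out

# Hypothetical 2-D stand-ins for the preset parameters in vector space 200.
DS1, DS2, DS3, DS4 = [20.0, 1.0], [40.0, 1.2], [60.0, 0.8], [80.0, 0.9]
NS1, NS2 = augment_near(DS1, [DS2, DS3, DS4], n=2)
```

Because the perturbation scale is small relative to the spacing between presets, NS1 and NS2 inherit DS1's age characteristic while still being distinct voices.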
In some embodiments, when a first age characteristic of the character information CR1 matches none of the preset sound parameters DS1–DS4, the sound parameter augmentation module 122 selects, according to the first age characteristic, a first original sound parameter and a second original sound parameter from the preset sound parameters DS1–DS4 as the original sound parameters DX, and generates from them a newly added sound parameter matching the first age characteristic. The first original sound parameter has a second age characteristic, the second original sound parameter has a third age characteristic, and the first age characteristic lies between the second and third age characteristics.

For example, suppose the preset sound parameters DS1–DS4 have age characteristics of 20, 40, 60, and 80 years, respectively, and the character information CR1 corresponds to a 30-year-old character. Compared with 60 and 80 years, 20 and 40 years are closer to 30 years. Accordingly, the sound parameter augmentation module 122 selects the preset sound parameters DS1 and DS2 as the first and second original sound parameters, and generates a newly added sound parameter NZ1 from the preset sound parameters DS1 and DS2. For example, the newly added sound parameter NZ1 has an age characteristic of 30 years.

As shown in FIG. 2, in the vector space 200 the newly added sound parameter NZ1 lies between the preset sound parameters DS1 and DS2. In some embodiments, the sound parameter augmentation module 122 computes the newly added sound parameter NZ1 by interpolating between the preset sound parameters DS1 and DS2.
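The interpolation step can be sketched in a few lines; the two-component vectors below (an age feature and a pitch-like feature) are invented purely for illustration:

```python
def interpolate(p1, p2, t=0.5):
    """Linear interpolation between two sound-parameter vectors."""
    return [(1 - t) * a + t * b for a, b in zip(p1, p2)]

DS1 = [20.0, 110.0]  # hypothetical (age feature, pitch-like feature)
DS2 = [40.0, 100.0]
NZ1 = interpolate(DS1, DS2)  # midpoint -> [30.0, 105.0], a 30-year-old voice
```

Choosing t other than 0.5 would bias the new parameter toward one preset, e.g. t = 0.25 for a character closer to 25 years old.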
In some embodiments, when the character information CR1 corresponds to multiple characters having the first age characteristic, the sound parameter augmentation module 122 further generates additional newly added sound parameters from the newly added sound parameter, so that the number of newly added sound parameters equals the number of characters corresponding to the character information CR1.

For example, suppose the preset sound parameters DS1–DS4 have age characteristics of 20, 40, 60, and 80 years, respectively, and the character information CR1 corresponds to three 30-year-old characters. After generating the newly added sound parameter NZ1 from the preset sound parameters DS1 and DS2, the sound parameter augmentation module 122 further generates newly added sound parameters NN1 and NN2 from the newly added sound parameter NZ1, where the newly added sound parameters NZ1, NN1, and NN2 differ from one another and each of NN1 and NN2 has an age characteristic of 30 years. The sound parameter augmentation module 122 then assigns the newly added sound parameters NZ1, NN1, and NN2 to the three 30-year-old characters, respectively.

In some embodiments, the sound parameter augmentation module 122 adjusts the newly added sound parameter NZ1 to generate the newly added sound parameters NN1 and NN2. In the vector space 200, the distance between each of the newly added sound parameters NN1 and NN2 and the newly added sound parameter NZ1 is smaller than the distance between each of them and the preset sound parameters (for example, the preset sound parameters DS1–DS4). The characteristics of each of the newly added sound parameters NN1 and NN2 are therefore close to those of the newly added sound parameter NZ1; for example, the newly added sound parameters NZ1, NN1, and NN2 all share the same age characteristic.

In some embodiments, the processor 120 stores the newly added sound parameters NZ1, NN1, and NN2 in the database 111 of the storage unit 110.

In some embodiments, both the algorithm that generates the newly added sound parameters NS1 and NS2 from the preset sound parameter DS1 and the algorithm that generates the newly added sound parameters NN1 and NN2 from the newly added sound parameter NZ1 may be implemented with various oversampling techniques, for example the synthetic minority oversampling technique (SMOTE), the borderline synthetic minority oversampling technique (borderline-SMOTE), and/or oversampling techniques based on generative adversarial networks (GANs).
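The core idea of SMOTE-style oversampling is easy to sketch. This is a simplified toy version: real SMOTE restricts the interpolation partner to the k nearest minority-class neighbors, and the borderline and GAN variants are out of scope here. The sample vectors are invented placeholders.

```python
import random

def smote_like(samples, n_new, seed=0):
    """Each synthetic vector lies on the segment between two randomly chosen
    existing vectors: new = a + t * (b - a), with t drawn from [0, 1)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_new):
        a, b = rng.sample(samples, 2)
        t = rng.random()
        out.append([x + t * (y - x) for x, y in zip(a, b)])
    return out

# Hypothetical preset parameter vectors (age feature, secondary feature).
presets = [[20.0, 1.0], [40.0, 1.1], [60.0, 0.9], [80.0, 1.0]]
new_params = smote_like(presets, n_new=3)
```

Because every synthetic vector is a convex combination of two existing ones, the new parameters always stay inside the region the presets span, which keeps the generated voices plausible.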
FIG. 3 is a flow chart of a method 300 for the system 100 according to some embodiments of the present disclosure. The operations of the method 300 are described below with reference to the elements of the system 100, but embodiments of the present invention are not limited thereto; in various embodiments, the method 300 may be performed by systems other than the system 100. As shown in FIG. 3, the method 300 includes operations S31–S36, and the order of the operations S31–S36 is not limiting.

In operation S31, the processor 120 extracts the character information CR1 from the text 140. For example, in the embodiment of FIG. 1, the text extraction module 121 in the processor 120 extracts the character information CR1 from the text 140 for the subsequent generation of sound parameters corresponding to the character information CR1. In other examples, the processor 120 may also receive the character information CR1 through the input interface 130.

In operation S32, the processor 120 obtains the preset sound parameters DS1–DS4. For example, in the embodiment of FIG. 1, the processor 120 reads the preset sound parameters DS1–DS4 from the storage unit 110 into temporary memory for use in subsequent operations.

In operation S33, the processor 120 determines whether the character information CR1 matches the preset sound parameters DS1–DS4. For example, in the embodiment of FIG. 1, the sound parameter augmentation module 122 in the processor 120 determines whether the character information CR1 matches the preset sound parameters DS1–DS4 obtained in operation S32.

In some embodiments, when the character information CR1 matches the preset sound parameters DS1–DS4, the processor 120 proceeds from operation S33 to operation S34. When the character information CR1 does not match the preset sound parameters DS1–DS4, the processor 120 proceeds from operation S33 to operation S35.

In operation S34, the processor 120 sets the one of the preset sound parameters DS1–DS4 that matches the character information CR1 in operation S33 as the original sound parameter DS, and generates the corresponding speech AB1 based on the original sound parameter DS. For example, referring to FIG. 1, the sound parameter augmentation module 122 takes the matching one of the preset sound parameters DS1–DS4 as the original sound parameter DS and transmits it to the text-to-speech conversion module 123 to generate the corresponding speech AB1.

In operation S35, the processor 120 selects from the preset sound parameters DS1–DS4 the original sound parameter DX corresponding to the character information CR1 to generate the newly added sound parameter VD1. For example, referring to FIG. 1, the sound parameter augmentation module 122 generates the newly added sound parameter VD1 based on the character information CR1 and the original sound parameter DX, so that the number and characteristics of the newly added sound parameter VD1 match the character information CR1, and further transmits the newly added sound parameter VD1 to the text-to-speech conversion module 123.

In operation S36, the processor 120 generates the corresponding speech AB2 based on the newly added sound parameter VD1. For example, in the embodiment of FIG. 1, the text-to-speech conversion module 123 in the processor 120 generates the corresponding speech AB2 based on the newly added sound parameter VD1 and the text content TX1 for playback by the speaker 150.
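The branching in operations S33–S36 can be condensed into a short sketch. The stub functions below stand in for the matching, augmentation, and text-to-speech modules and are purely illustrative:

```python
def generate_speech(character, presets, text, match, augment, tts):
    """S33: try to match character info against the presets; S34: use the
    matching preset directly; S35: otherwise create a newly added parameter;
    S36: synthesize speech from the chosen parameter and the text content."""
    params = match(character, presets)        # S33
    if params is None:                        # no preset fits -> S35
        params = augment(character, presets)
    return tts(text, params)                  # S34 / S36

# Toy stand-ins to exercise both branches of the flow chart.
presets = {20: "DS1", 40: "DS2"}
match = lambda c, p: p.get(c["age"])
augment = lambda c, p: "NEW(%d)" % c["age"]
tts = lambda text, params: params + ":" + text

hit = generate_speech({"age": 20}, presets, "hello", match, augment, tts)
miss = generate_speech({"age": 30}, presets, "hello", match, augment, tts)
```

Here `hit` takes the S34 path (an existing preset is reused) and `miss` takes the S35 path (a new parameter is synthesized before conversion).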
In this way, when no matching preset sound parameter can be obtained from the storage unit for given character information, sound parameters matching each piece of character information can still be generated, and distinct voices can be produced even when multiple characters share the same character information, giving listeners a better experience.

Although the present disclosure has been described above by way of embodiments, they are not intended to limit the present disclosure. Any person having ordinary skill in the art may make various changes and refinements without departing from the spirit and scope of the present disclosure; the scope of protection of the present disclosure shall therefore be defined by the appended claims.
Reference numerals: 100: system; 110: storage unit; 120: processor; 130: input interface; 140: text; 150: speaker; 111: database; DS1–DS4: preset sound parameters; DS, DX: original sound parameters; 121: text extraction module; 122: sound parameter augmentation module; 123: text-to-speech conversion module; CR1: character information; TX1: text content; FT1: input data; AB1, AB2: speech; 200: vector space; VD1, NS1, NS2, NZ1, NN1, NN2: newly added sound parameters; 300: method; S31–S36: operations

FIG. 1 is a schematic diagram of a system according to some embodiments of the present disclosure. FIG. 2 is a schematic diagram of the vector space corresponding to the preset sound parameters shown in FIG. 1, according to some embodiments of the present disclosure. FIG. 3 is a flow chart of a method corresponding to the system shown in FIG. 1, according to some embodiments of the present disclosure.

Domestic deposit information (in order of depositary institution, date, and number): none. Foreign deposit information (in order of deposit country, institution, date, and number): none.
Claims (10)
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW112102632A | 2023-01-19 | 2023-01-19 | Method and system of generating audio |
| CN202311545863.2A CN118366474A (en) | 2023-01-19 | 2023-11-20 | Method and system for generating speech |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| TW202431247A | 2024-08-01 |
| TWI867419B | 2024-12-21 |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101359473A (en) * | 2007-07-30 | 2009-02-04 | 国际商业机器公司 | Auto speech conversion method and apparatus |
| TW201830354A (en) * | 2017-02-14 | 2018-08-16 | 香港商富成人工智能有限公司 | Interactive and adaptive training and learning management system using face tracking and emotion detection with associated methods |
| US20180240500A1 (en) * | 2006-07-06 | 2018-08-23 | Sundaysky Ltd. | Automatic generation of video from structured content |
| CN113129895A (en) * | 2021-04-20 | 2021-07-16 | 上海仙剑文化传媒股份有限公司 | Voice detection processing system |
| CN113628609A (en) * | 2020-05-09 | 2021-11-09 | 微软技术许可有限责任公司 | Automatic audio content generation |
Also Published As
| Publication number | Publication date |
|---|---|
| CN118366474A (en) | 2024-07-19 |
| TW202431247A (en) | 2024-08-01 |