
TWI867419B - Method and system of generating audio - Google Patents


Info

Publication number
TWI867419B
Authority
TW
Taiwan
Prior art keywords
parameter
sound
parameters
character information
preset
Prior art date
Application number
TW112102632A
Other languages
Chinese (zh)
Other versions
TW202431247A (en)
Inventor
張偉毅
張立人
王儷螢
Original Assignee
宏正自動科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 宏正自動科技股份有限公司
Priority to TW112102632A
Priority to CN202311545863.2A
Publication of TW202431247A
Application granted
Publication of TWI867419B


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 characterised by the type of extracted parameters
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 characterised by the analysis technique
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 specially adapted for particular use

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)
  • Stereophonic System (AREA)

Abstract

A method includes: extracting character information; matching the character information with preset voice parameters; when the character information does not match the preset voice parameters, selecting at least one original voice parameter corresponding to the character information from the preset voice parameters, and generating at least one new voice parameter according to the at least one original voice parameter; and generating corresponding audio according to the at least one new voice parameter.

Description

Method and system for generating speech

The present disclosure relates to a technique for generating speech, and more particularly, to a method and a system for generating speech.

When producing an audiobook, the characters in it need to be dubbed with suitable voiceprints to achieve good quality. However, the voiceprints in a preset database may not be sufficient to cover all of the characters in the audiobook. How to address this problem is therefore an important topic in the field.

An embodiment of the present invention provides a method for generating speech. The method includes: extracting character information; matching the character information with a plurality of preset sound parameters; when the character information does not match the preset sound parameters, selecting from the preset sound parameters at least one original sound parameter corresponding to the character information, and generating, according to the at least one original sound parameter, at least one new sound parameter corresponding to the character information; and generating corresponding speech according to the at least one new sound parameter.

An embodiment of the present invention further provides a system for generating speech. The system includes a storage unit and a processor. The storage unit stores a plurality of preset sound parameters. The processor, coupled to the storage unit, accesses the preset sound parameters from the storage unit; when character information does not match the preset sound parameters, the processor generates, according to at least one original sound parameter among the preset sound parameters, at least one new sound parameter corresponding to the character information, and generates speech according to the at least one new sound parameter.

Herein, when an element is referred to as being "connected" or "coupled", this may mean "electrically connected" or "electrically coupled". "Connected" or "coupled" may also indicate that two or more elements operate or interact with each other. Moreover, although terms such as "first" and "second" are used herein to describe different elements, these terms only distinguish elements or operations described with the same technical term. Unless the context clearly indicates otherwise, the terms neither denote nor imply an order or sequence, nor do they limit the present disclosure.

Unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by those of ordinary skill in the art to which this disclosure belongs. It will further be understood that terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the relevant art and this disclosure, and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

FIG. 1 is a schematic diagram of a system 100 according to some embodiments of the present disclosure. In some embodiments, the system 100 generates speech according to preset sound parameters stored in a storage element. As shown in FIG. 1, the system 100 includes a storage unit 110, a processor 120, an input interface 130, and a speaker 150. The processor 120 is coupled to the storage unit 110, the input interface 130, and the speaker 150.

In operation, the processor 120 receives a text 140 preset in the system 100, receives the text 140 from an external source, or receives character information CR1 input through the input interface 130, and receives the preset sound parameters DS1~DS4 from the storage unit 110. When the character information CR1 does not match any of the preset sound parameters DS1~DS4, the processor 120 generates, from the preset sound parameters DS1~DS4, a new sound parameter matching the character information CR1 (for example, the new sound parameter VD1 described below), generates speech AB2 according to the new sound parameter, and plays the speech AB2 through the speaker 150.

In some embodiments, the text 140 may be information representing a story or a script, but the present invention is not limited thereto; in some embodiments, the text 140 may also consist of only one or a few sentences. In various embodiments, the text 140 may be pre-stored in various storage media, such as the storage unit 110, a flash drive, a hard drive, a portable hard drive, the cloud, or any other storage medium; the present disclosure is not limited in this respect.

In some approaches, dubbing a text requires hiring as many voice actors as there are characters in the text, each with matching characteristics, which makes the cost high.

In contrast, in embodiments of the present invention, the processor 120 can generate, according to the character information CR1 and based on the preset sound parameters DS1~DS4, new sound parameters that match the character information CR1. In this way, the text 140 can be dubbed using only the preset sound parameters DS1~DS4, which keeps the cost low.

As shown in FIG. 1, in some embodiments, the storage unit 110 has a database 111, which stores the preset sound parameters DS1~DS4. In various embodiments, the database 111 may contain any number of preset sound parameters; FIG. 1 is merely illustrative.

As shown in FIG. 1, in some embodiments, the processor 120 includes a text extraction module 121, a sound parameter expansion module 122, and a text-to-speech conversion module 123. In various embodiments, the processor 120 may include modules other than these three, and the operations of the text extraction module 121, the sound parameter expansion module 122, and the text-to-speech conversion module 123 described below may be performed by those other modules; the embodiment shown in FIG. 1 is therefore merely illustrative and does not limit the present invention. The above modules may also be implemented by other systems, hardware, software, or a combination thereof.

In some embodiments, the text extraction module 121 extracts the character information CR1 and the text content TX1 from the text 140, and provides them to the sound parameter expansion module 122 and the text-to-speech conversion module 123, respectively.

In other embodiments, the input interface 130 receives input data FT1, extracts the character information CR1 from the input data FT1, and provides the character information CR1 to the sound parameter expansion module 122. In some embodiments, the input data FT1 may include user-defined characteristic parameters, such as parameters corresponding to various age and/or gender characteristics.

In some embodiments, the sound parameter expansion module 122 accesses the preset sound parameters DS1~DS4 from the storage unit 110 and matches the character information CR1 against the preset sound parameters DS1~DS4. For example, the sound parameter expansion module 122 compares the age and/or gender characteristics of the character information CR1 with those of the preset sound parameters DS1~DS4 and, according to the comparison result, assigns to the character information CR1 the one of the preset sound parameters DS1~DS4 that matches it. In some embodiments, the sound parameter expansion module 122 performs this matching using speaker embedding techniques.
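The patent does not specify how the speaker-embedding comparison is computed. As a minimal sketch, assuming each preset sound parameter is represented by a plain embedding vector and that a cosine-similarity threshold stands in for the unspecified matching rule (the threshold value and the `match_character` name are illustrative, not from the source):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two speaker-embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def match_character(character_embedding, preset_params, threshold=0.85):
    """Return the name of the best-matching preset parameter, or None
    when no preset is similar enough to the character's embedding
    (the "mismatch" branch that triggers parameter synthesis)."""
    best_name, best_score = None, -1.0
    for name, emb in preset_params.items():
        score = cosine_similarity(character_embedding, emb)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```

A `None` result corresponds to the case where the sound parameter expansion module must synthesize a new parameter instead of reusing a preset.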

When the character information CR1 matches one of the preset sound parameters DS1~DS4, the sound parameter expansion module 122 selects the corresponding preset sound parameter as the original sound parameter DS and transmits it to the text-to-speech conversion module 123, which generates the corresponding speech AB1 according to the original sound parameter DS and the text content TX1.

When the character information CR1 does not match any of the preset sound parameters DS1~DS4, the sound parameter expansion module 122 selects, from the preset sound parameters DS1~DS4, the sound parameter corresponding to the character information CR1 as the original sound parameter DX, and generates the new sound parameter VD1 corresponding to the character information CR1 according to the original sound parameter DX. The text-to-speech conversion module 123 then generates the corresponding new speech AB2 according to the new sound parameter VD1 and the text content TX1. In some embodiments, the speech AB1 and/or AB2 corresponds to the audiobook information represented by the text 140, and the speaker 150 plays the speech AB1 and/or AB2.

In some embodiments, the character information CR1 may contain various characteristic parameters, representing, for example, age and/or gender characteristics. When the characteristic parameters of the character information CR1 do not match any of the preset sound parameters DS1~DS4, the sound parameter expansion module 122 generates the corresponding new sound parameter VD1 according to the original sound parameter DX and the characteristic parameters of the character information CR1, so that the corresponding new speech AB2 can then be generated, and the new speech AB2 can express the character information CR1 that the presets could not originally cover, as described below.

FIG. 2 is a schematic diagram of the preset sound parameters DS1~DS4 in a vector space 200 according to some embodiments of the present disclosure. In some embodiments, different positions in the vector space 200 correspond to different sound parameters, such as different sound frequencies. As shown in FIG. 2, the preset sound parameters DS1~DS4 and the new sound parameters NS1, NS2, NZ1, NN1, and NN2 are represented in the vector space 200. Specific embodiments are described below.

In some embodiments, when the number of characters corresponding to the character information CR1 is greater than the number of original sound parameters DX, the sound parameter expansion module 122 generates new sound parameters such that the number of new sound parameters plus the number of original sound parameters DX equals the number of characters.

For example, suppose the preset sound parameters DS1~DS4 have age characteristics of twenty, forty, sixty, and eighty years old, respectively, and the character information CR1 corresponds to three twenty-year-old characters. The sound parameter expansion module 122 then selects, according to the character information CR1, the preset sound parameter DS1 matching twenty years old as the original sound parameter DX, and generates, according to the preset sound parameter DS1, two new sound parameters NS1 and NS2 corresponding to twenty years old, where NS1, NS2, and the preset sound parameter DS1 all differ from one another. The sound parameter expansion module 122 then assigns the new sound parameters NS1 and NS2 and the preset sound parameter DS1 to the three twenty-year-old characters, so that the text-to-speech conversion module 123 generates the speech AB2 according to NS1, NS2, DS1, and the text content TX1, allowing the speech AB2 to sound like three distinct twenty-year-old characters.

In some embodiments, the sound parameter expansion module 122 adjusts the preset sound parameter DS1 to produce the new sound parameters NS1 and NS2. In the vector space 200, each of NS1 and NS2 is closer to the preset sound parameter DS1 than to the other preset sound parameters (for example, DS2~DS4), so the characteristics of each of NS1 and NS2 are close to those of DS1. For example, DS1, NS1, and NS2 may all share the same age characteristic. In some embodiments, the aforementioned distance may be the distance between any two feature points (sound parameters) among NS1, NS2, and DS1~DS4 in the feature space, that is, the distance between the feature vectors of any two sound parameters.
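The adjustment described above can be sketched as small random perturbations of the base parameter vector, so that each synthesized vector stays nearer to its base than to any other preset. This is only one possible realization (the Gaussian noise model and the `augment_near` name are assumptions, not from the source):

```python
import random

def augment_near(base, count, scale=0.05, seed=0):
    """Create `count` new parameter vectors by adding small Gaussian
    perturbations to `base`.  With a small `scale`, each new vector
    lies close to `base` in the vector space, so it keeps the base's
    characteristics (e.g. the same age) while still being distinct."""
    rng = random.Random(seed)
    return [[x + rng.gauss(0.0, scale) for x in base] for _ in range(count)]
```

Generating two vectors from a twenty-year-old preset, as in the three-character example above, would be `augment_near(ds1, 2)`.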

In some embodiments, when a first age characteristic of the character information CR1 does not match any of the preset sound parameters DS1~DS4, the sound parameter expansion module 122 selects, according to the first age characteristic, a first original sound parameter and a second original sound parameter from the preset sound parameters DS1~DS4 as the original sound parameters DX, and generates a new sound parameter matching the first age characteristic according to the first and second original sound parameters. The first original sound parameter has a second age characteristic, the second original sound parameter has a third age characteristic, and the first age characteristic lies between the second and third age characteristics.

For example, suppose the preset sound parameters DS1~DS4 have age characteristics of twenty, forty, sixty, and eighty years old, respectively, and the character information CR1 corresponds to a thirty-year-old character. Twenty and forty are closer to thirty than sixty and eighty are. Accordingly, the sound parameter expansion module 122 selects the preset sound parameters DS1 and DS2 as the first and second original sound parameters and generates the new sound parameter NZ1 from them. For example, the new sound parameter NZ1 has an age characteristic of thirty years old.

As shown in FIG. 2, in the vector space 200 the new sound parameter NZ1 lies between the preset sound parameters DS1 and DS2. In some embodiments, the sound parameter expansion module 122 generates the new sound parameter NZ1 by interpolating between the preset sound parameters DS1 and DS2.

In some embodiments, when the character information CR1 corresponds to multiple characters having the first age characteristic, the sound parameter expansion module 122 further generates additional new sound parameters from the new sound parameter, so that the number of new sound parameters equals the number of characters corresponding to the character information CR1.

For example, suppose the preset sound parameters DS1~DS4 have age characteristics of twenty, forty, sixty, and eighty years old, respectively, and the character information CR1 corresponds to three thirty-year-old characters. After generating the new sound parameter NZ1 from the preset sound parameters DS1 and DS2, the sound parameter expansion module 122 further generates the new sound parameters NN1 and NN2 from NZ1, where NZ1, NN1, and NN2 differ from one another and each of NN1 and NN2 has an age characteristic of thirty years old. The sound parameter expansion module 122 then assigns NZ1, NN1, and NN2 to the three thirty-year-old characters.

In some embodiments, the sound parameter expansion module 122 adjusts the new sound parameter NZ1 to produce the new sound parameters NN1 and NN2. In the vector space 200, each of NN1 and NN2 is closer to NZ1 than to the preset sound parameters (for example, DS1~DS4), so the characteristics of each of NN1 and NN2 are close to those of NZ1. For example, NZ1, NN1, and NN2 may all share the same age characteristic.

In some embodiments, the processor 120 stores the new sound parameters NZ1, NN1, and NN2 in the database 111 of the storage unit 110.

In some embodiments, the algorithm that generates the new sound parameters NS1 and NS2 from the preset sound parameter DS1, and the algorithm that generates the new sound parameters NN1 and NN2 from the new sound parameter NZ1, may each be implemented with various oversampling techniques, for example the synthetic minority oversampling technique (SMOTE), borderline-SMOTE, and/or oversampling based on a generative adversarial network (GAN).
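The core idea shared by SMOTE-style oversamplers is to synthesize new points on line segments between existing samples. The following is a deliberately simplified sketch of that idea, not the full SMOTE algorithm (which selects among k nearest neighbours rather than arbitrary pairs); the function name is illustrative:

```python
import random

def smote_like(samples, n_new, seed=0):
    """SMOTE-style synthesis: repeatedly pick two existing parameter
    vectors and create a new vector at a random position on the
    segment between them, so each synthetic point stays inside the
    region spanned by the originals."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_new):
        a, b = rng.sample(samples, 2)   # two distinct existing vectors
        lam = rng.random()              # position along the segment
        out.append([ai + lam * (bi - ai) for ai, bi in zip(a, b)])
    return out
```

A production system would more likely use a library implementation (e.g. imbalanced-learn's SMOTE) or, as the text notes, a GAN-based generator.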

FIG. 3 is a flowchart of a method 300 for the system 100 according to some embodiments of the present disclosure. The operations of the method 300 are described below with reference to the components of the system 100, but embodiments of the present invention are not limited thereto; in various embodiments, the method 300 may be performed by systems other than the system 100. As shown in FIG. 3, the method 300 includes operations S31~S36, and the order of the operations S31~S36 is not limited.

In operation S31, the processor 120 extracts the character information CR1 from the text 140. For example, in the embodiment of FIG. 1, the text extraction module 121 in the processor 120 extracts the character information CR1 from the text 140 for the subsequent generation of sound parameters corresponding to the character information CR1. Alternatively, the processor 120 may receive the character information CR1 through the input interface 130.

In operation S32, the processor 120 obtains the preset sound parameters DS1~DS4. For example, in the embodiment of FIG. 1, the processor 120 reads the preset sound parameters DS1~DS4 from the storage unit 110 into a temporary memory for use in subsequent operations.

In operation S33, the processor 120 determines whether the character information CR1 matches the preset sound parameters DS1~DS4. For example, in the embodiment of FIG. 1, the sound parameter expansion module 122 in the processor 120 determines whether the character information CR1 matches the preset sound parameters DS1~DS4 obtained in operation S32.

In some embodiments, when the character information CR1 matches the preset sound parameters DS1~DS4, the processor 120 proceeds from operation S33 to operation S34; when the character information CR1 does not match the preset sound parameters DS1~DS4, the processor 120 proceeds from operation S33 to operation S35.

In operation S34, the processor 120 sets the one of the preset sound parameters DS1~DS4 that matched the character information CR1 in operation S33 as the original sound parameter DS, and then generates the corresponding speech AB1 according to the original sound parameter DS. For example, referring to FIG. 1, the sound parameter expansion module 122 takes the matching one of the preset sound parameters DS1~DS4 as the original sound parameter DS and transmits it to the text-to-speech conversion module 123 to generate the corresponding speech AB1.

In operation S35, the processor 120 selects, from the preset sound parameters DS1~DS4, the original sound parameter DX corresponding to the character information CR1 in order to generate the new sound parameter VD1. For example, referring to FIG. 1, the sound parameter expansion module 122 generates the new sound parameter VD1 according to the character information CR1 and the original sound parameter DX, so that the number and characteristics of the new sound parameter VD1 match the character information CR1, and further transmits the new sound parameter VD1 to the text-to-speech conversion module 123.

In operation S36, the processor 120 generates the corresponding speech AB2 according to the new sound parameter VD1. For example, in the embodiment of FIG. 1, the text-to-speech conversion module 123 in the processor 120 generates the corresponding speech AB2 according to the new sound parameter VD1 and the text content TX1, which the speaker 150 then plays.
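The branch structure of the method (match found: use the preset; no match: synthesize a parameter, then run text-to-speech) can be sketched end to end. This is a minimal illustration assuming age is the only characteristic, matching means an exact age hit, synthesis is interpolation between the two nearest presets, and the requested age lies within the preset range; `generate_speech` and the `tts` callback are hypothetical names:

```python
def generate_speech(character_age, presets, tts):
    """Match a character's age against preset parameter vectors;
    on an exact hit use the preset, otherwise interpolate between
    the two nearest presets, then hand the parameters to a
    text-to-speech callback."""
    if character_age in presets:
        params = presets[character_age]            # match: reuse preset
    else:
        ages = sorted(presets)
        lo = max(a for a in ages if a < character_age)
        hi = min(a for a in ages if a > character_age)
        t = (character_age - lo) / (hi - lo)       # position between presets
        params = [(1 - t) * p + t * q
                  for p, q in zip(presets[lo], presets[hi])]
    return tts(params)
```

With presets at ages 20 and 40, a request for a 30-year-old voice yields the midpoint parameter vector, mirroring the NZ1 example of FIG. 2.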

In this way, when no matching preset sound parameter can be obtained from the storage unit for given character information, sound parameters matching that character information can still be generated, and even characters with identical character information can be given distinct voices, providing listeners a better experience.

Although the present disclosure has been disclosed above by way of embodiments, these embodiments are not intended to limit the present disclosure. Anyone with ordinary skill in the relevant art may make minor changes and refinements without departing from the spirit and scope of the present disclosure; the scope of protection of the present disclosure shall therefore be defined by the appended claims.

100: System; 110: Storage unit; 120: Processor; 130: Input interface; 140: Text; 150: Speaker; 111: Database; DS1~DS4: Preset sound parameters; DS, DX: Original sound parameters; 121: Text extraction module; 122: Sound parameter expansion module; 123: Text-to-speech conversion module; CR1: Character information; TX1: Text content; FT1: Input data; AB1, AB2: Speech; 200: Vector space; VD1, NS1, NS2, NZ1, NN1, NN2: New sound parameters; 300: Method; S31~S36: Operations

FIG. 1 is a schematic diagram of a system according to some embodiments of the present disclosure. FIG. 2 is a schematic diagram of a vector space corresponding to the preset sound parameters shown in FIG. 1 according to some embodiments of the present disclosure. FIG. 3 is a flowchart of a method corresponding to the system shown in FIG. 1 according to some embodiments of the present disclosure.

Domestic deposit information (please note in the order of depository institution, date, and number): None. Foreign deposit information (please note in the order of depository country, institution, date, and number): None.

Claims (10)

1. A method of generating speech, comprising: extracting character information from a text; comparing the character information with a plurality of preset sound parameters to match the character information with the preset sound parameters; determining, according to a result of the comparison, whether the character information matches the preset sound parameters; when the character information does not match the preset sound parameters, selecting at least one original sound parameter corresponding to the character information from the preset sound parameters, and performing an operation on the at least one original sound parameter by an algorithm to generate at least one newly added sound parameter, the at least one newly added sound parameter corresponding to the character information; and transmitting the at least one newly added sound parameter to a text-to-speech module, and generating, by the text-to-speech module, a corresponding speech according to the at least one newly added sound parameter and a text content of the text. 2. The method of claim 1, further comprising: when at least one characteristic parameter of the character information does not match any of the preset sound parameters, generating, according to the at least one original sound parameter, the at least one newly added sound parameter matching the at least one characteristic parameter.
3. The method of claim 1, further comprising: when a number of characters corresponding to the character information is greater than a number of parameters of the at least one original sound parameter, generating the at least one newly added sound parameter, wherein a sum of a number of the at least one newly added sound parameter and the number of parameters is equal to the number of characters. 4. The method of claim 1, wherein selecting the at least one original sound parameter comprises: when a first age characteristic of the character information does not match each of the preset sound parameters, selecting a first original sound parameter and a second original sound parameter from the preset sound parameters according to the first age characteristic, wherein the first original sound parameter has a second age characteristic and the second original sound parameter has a third age characteristic, and the first age characteristic is between the second age characteristic and the third age characteristic. 5. The method of claim 1, wherein extracting the character information comprises: extracting the character information from the pre-stored text; or extracting the character information from data received from an input interface.
6. A system for generating speech, comprising: a storage unit storing a plurality of preset sound parameters; and a processor coupled to the storage unit, wherein the processor accesses the preset sound parameters via the storage unit, determines, according to a result of comparing character information with the preset sound parameters, whether the character information matches the preset sound parameters, and, when the character information does not match the preset sound parameters, performs an operation on at least one original sound parameter among the preset sound parameters by an algorithm to generate at least one newly added sound parameter corresponding to the character information, and generates a speech according to the at least one newly added sound parameter. 7. The system of claim 6, wherein when the character information does not match the preset sound parameters, the at least one original sound parameter comprises at least a first original sound parameter and a second original sound parameter different from each other, and the processor generates the at least one newly added sound parameter according to the first original sound parameter and the second original sound parameter. 8. The system of claim 7, wherein a number of the at least one newly added sound parameter generated by the processor is equal to a number of characters corresponding to the character information.
9. The system of claim 6, wherein the processor compares at least one characteristic parameter of the character information with the preset sound parameters, and when the at least one characteristic parameter does not match any of the preset sound parameters, the processor generates, according to the preset sound parameters, the at least one newly added sound parameter matching the at least one characteristic parameter. 10. The system of claim 6, wherein the processor receives the character information through an input interface, or extracts the character information from a text stored in the storage unit.
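Claims 3 and 8 constrain how many parameters are generated: when a text involves more characters than there are usable original parameters, enough new parameters are produced that the originals plus the additions cover every character. A hypothetical sketch of that bookkeeping (the perturbation rule below stands in for the unspecified augmentation algorithm and is purely illustrative):

```python
# Hypothetical sketch of the count constraint in claims 3 and 8: the number of
# newly added parameters plus the number of original parameters equals the
# number of characters. Deriving a new parameter by a small offset from an
# original one is an illustrative stand-in, not the patent's algorithm.

def balance_parameters(original_params, character_count):
    """Return a parameter list whose length equals character_count."""
    if character_count <= len(original_params):
        return original_params[:character_count]
    needed = character_count - len(original_params)
    new_params = []
    for i in range(needed):
        base = original_params[i % len(original_params)]
        # Derive a distinct newly added parameter from an original parameter.
        new_params.append([x + 0.01 * (i + 1) for x in base])
    return original_params + new_params

params = balance_parameters([[0.2, 0.8], [0.6, 0.4]], 4)
```

With two original parameters and four characters, two new parameters are added, so each character receives its own distinct voice parameter.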
TW112102632A 2023-01-19 2023-01-19 Method and system of generating audio TWI867419B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
TW112102632A TWI867419B (en) 2023-01-19 2023-01-19 Method and system of generating audio
CN202311545863.2A CN118366474A (en) 2023-01-19 2023-11-20 Method and system for generating speech

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW112102632A TWI867419B (en) 2023-01-19 2023-01-19 Method and system of generating audio

Publications (2)

Publication Number Publication Date
TW202431247A TW202431247A (en) 2024-08-01
TWI867419B true TWI867419B (en) 2024-12-21

Family

ID=91882222

Family Applications (1)

Application Number Title Priority Date Filing Date
TW112102632A TWI867419B (en) 2023-01-19 2023-01-19 Method and system of generating audio

Country Status (2)

Country Link
CN (1) CN118366474A (en)
TW (1) TWI867419B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101359473A (en) * 2007-07-30 2009-02-04 国际商业机器公司 Auto speech conversion method and apparatus
TW201830354A (en) * 2017-02-14 2018-08-16 香港商富成人工智能有限公司 Interactive and adaptive training and learning management system using face tracking and emotion detection with associated methods
US20180240500A1 (en) * 2006-07-06 2018-08-23 Sundaysky Ltd. Automatic generation of video from structured content
CN113129895A (en) * 2021-04-20 2021-07-16 上海仙剑文化传媒股份有限公司 Voice detection processing system
CN113628609A (en) * 2020-05-09 2021-11-09 微软技术许可有限责任公司 Automatic audio content generation


Also Published As

Publication number Publication date
CN118366474A (en) 2024-07-19
TW202431247A (en) 2024-08-01

Similar Documents

Publication Publication Date Title
US20160140952A1 (en) Method For Adding Realism To Synthetic Speech
US10224061B2 (en) Voice signal component forecasting
US20180315438A1 (en) Voice data compensation with machine learning
CN111465982B (en) Signal processing device and method, training device and method, and program
WO2015098306A1 (en) Response control device and control program
CN110728976A (en) Method, device and system for voice recognition
CN113724686A (en) Method and device for editing audio, electronic equipment and storage medium
CN112164407B (en) Tone color conversion method and device
CN111968678B (en) Audio data processing method, device, equipment and readable storage medium
WO2025200819A1 (en) Speech signal processing method and related device
CN113256262A (en) Automatic generation method and system of conference summary, storage medium and electronic equipment
CN115393484A (en) Method and device for generating virtual image animation, electronic equipment and storage medium
CN115831088A (en) Voice clone model generation method and device and electronic equipment
CN113886640A (en) Digital human generation method, apparatus, device and medium
EP3113175A1 (en) Method for converting text to individual speech, and apparatus for converting text to individual speech
US20080316888A1 (en) Device Method and System for Communication Session Storage
TWI867419B (en) Method and system of generating audio
CN118841008A (en) Audio generation method, device, computer equipment and storage medium
KR20230103242A (en) Pitch and voice conversion system using end-to-end speech synthesis model
CN117316185A (en) An audio and video generation method, device, equipment and storage medium
US8781835B2 (en) Methods and apparatuses for facilitating speech synthesis
CN107888963A (en) Method and system based on rhythm synthetic video
CN111916080A (en) Speech recognition resource selection method, device, computer equipment and storage medium
CN113051902B (en) Voice data desensitization method, electronic device and computer-readable storage medium
US20060224385A1 (en) Text-to-speech conversion in electronic device field