TWI858505B

TWI858505B - Voice data generation method and system and computer program product

Info

Publication number: TWI858505B
Application number: TW112101307A
Authority: TW
Inventors: 梁家瑞
Original assignee: 聯經數位股份有限公司
Priority date: 2023-01-12
Filing date: 2023-01-12
Publication date: 2024-10-11
Also published as: TW202429441A

Abstract

一種由語音資料產生系統實施的語音資料產生方法，包含：(A)根據一文字資料所包含的一語句文字部分計算一對應於該語句文字部分的字元數量；(B)至少根據該字元數量決定一對應於該語句文字部分的句末停頓時長；(C)至少根據該語句文字部分及該句末停頓時長產生一對應於該文字資料且用於以聲音形式被輸出的語音資料，其中，該語音資料包含一用於以語音指示出該語句文字部分的語句語音部分，以及一接續在該語句語音部分之後、不指示出任何語句文字部分且持續時間長度與該句末停頓時長相符的句末停頓部分。A method for generating voice data implemented by a voice data generating system comprises: (A) calculating the number of characters corresponding to a sentence text portion according to a sentence text portion contained in a text data; (B) determining a sentence end pause duration corresponding to the sentence text portion according to at least the number of characters; and (C) generating voice data corresponding to the text data and used for being output in the form of sound according to at least the sentence text portion and the sentence end pause duration, wherein the voice data comprises a sentence voice portion for indicating the sentence text portion by voice, and a sentence end pause portion following the sentence voice portion, not indicating any sentence text portion and lasting for a time length that matches the sentence end pause duration.

Description

Voice data generation method and system and computer program product

本發明是有關於一種資料產生方法，特別是指一種適合被應用在電腦語音輸出的語音資料產生方法。本發明還有關於適合被應用在電腦語音輸出的一種語音資料產生系統，以及一種電腦程式產品。The present invention relates to a data generation method, in particular to a voice data generation method suitable for being applied to computer voice output. The present invention also relates to a voice data generation system suitable for being applied to computer voice output, and a computer program product.

將文字轉換成電腦語音已經是現有技術中的常見功能，但若要讓電腦語音聽起來更為自然，則始終是一個極具挑戰性的目標。所以，如何使電腦語音更加接近真人的說話方式，便成為本案所欲探討的議題。Converting text into computer speech is a common function in existing technologies, but making computer speech sound more natural is always a very challenging goal. Therefore, how to make computer speech closer to the way real people speak has become the topic to be explored in this case.

為了使電腦語音更加接近真人的說話方式，本發明的其中一目的，便在於提供一種語音資料產生方法。In order to make computer speech closer to the way real people speak, one of the purposes of the present invention is to provide a method for generating speech data.

本發明語音資料產生方法由一語音資料產生系統對一文字資料實施；該語音資料產生方法包含：(A)根據該文字資料所包含的一語句文字部分計算一對應於該語句文字部分的字元數量；(B)至少根據該字元數量決定一對應於該語句文字部分的句末停頓時長；(C)至少根據該語句文字部分及該句末停頓時長產生一對應於該文字資料且用於以聲音形式被輸出的語音資料，其中，該語音資料包含一用於以語音指示出該語句文字部分的語句語音部分，以及一接續在該語句語音部分之後、不指示出任何語句文字部分且持續時間長度與該句末停頓時長相符的句末停頓部分。The voice data generation method of the present invention is implemented by a voice data generation system on a text data; the voice data generation method comprises: (A) calculating a number of characters corresponding to a sentence text part according to a sentence text part contained in the text data; (B) determining a sentence end pause duration corresponding to the sentence text part at least according to the number of characters; (C) generating a voice data corresponding to the text data and used for being output in the form of sound according to at least the sentence text part and the sentence end pause duration, wherein the voice data comprises a sentence voice part used for indicating the sentence text part by voice, and a sentence end pause part which follows the sentence voice part, does not indicate any sentence text part and lasts for a time length that matches the sentence end pause duration.

在本發明語音資料產生方法的一些實施態樣中，該語音資料產生系統儲存有一語速參數，且該語速參數相關於該語音資料以聲音形式被輸出時的語音速度，在步驟(B)中，該語音資料產生系統是根據該字元數量及該語速參數計算一對應於該語句文字部分的預計耗費時長，再至少根據該預計耗費時長決定該句末停頓時長。In some embodiments of the speech data generation method of the present invention, the speech data generation system stores a speech speed parameter, and the speech speed parameter is related to the speech speed when the speech data is output in the form of sound. In step (B), the speech data generation system calculates an expected time corresponding to the text part of the sentence based on the number of characters and the speech speed parameter, and then determines the pause length at the end of the sentence based on at least the expected time.

在本發明語音資料產生方法的一些實施態樣中，該語音資料產生系統還儲存有一預設可用時長參數。在步驟(B)中，在該語句文字部分為該文字資料之一段落中的第一個語句文字部分的情況下，該語音資料產生系統決定該句末停頓時長的方式，是根據一剩餘時間值來決定該句末停頓時長，其中，該剩餘時間值是該預設可用時長參數與該預計耗費時長之間的差。In some embodiments of the voice data generation method of the present invention, the voice data generation system further stores a preset available time parameter. In step (B), when the sentence text portion is the first sentence text portion in a paragraph of the text data, the voice data generation system determines the pause duration at the end of the sentence based on a remaining time value, wherein the remaining time value is the difference between the preset available time parameter and the expected consuming time.

在本發明語音資料產生方法的一些實施態樣中，該語音資料產生系統還儲存有一第一停頓時長參數，以及一大於該第一停頓時長參數的第二停頓時長參數。在步驟(B)中，在該語句文字部分為該文字資料之該段落中的第一個語句文字部分的情況下，該語音資料產生系統決定該句末停頓時長的方式包含：若該剩餘時間值大於等於一門檻值，將該句末停頓時長設定為該第一停頓時長參數的數值；若該剩餘時間值小於該門檻值，將該句末停頓時長設定為該第二停頓時長參數的數值。In some embodiments of the voice data generation method of the present invention, the voice data generation system further stores a first pause duration parameter and a second pause duration parameter greater than the first pause duration parameter. In step (B), when the sentence text portion is the first sentence text portion in the paragraph of the text data, the voice data generation system determines the pause duration at the end of the sentence in the following manner: if the remaining time value is greater than or equal to a threshold value, the pause duration at the end of the sentence is set to the value of the first pause duration parameter; if the remaining time value is less than the threshold value, the pause duration at the end of the sentence is set to the value of the second pause duration parameter.

在本發明語音資料產生方法的一些實施態樣中，該語音資料產生系統還儲存有一預設可用時長參數。在步驟(B)中，在該語句文字部分為該文字資料之一段落中的第N個語句文字部分的情況下(N為大於1的整數)，該語音資料產生系統決定該句末停頓時長的方式，是根據一剩餘時間值來決定該句末停頓時長，其中，該剩餘時間值相關於該預設可用時長參數，且還相關於該語句文字部分之該段落的第(N-1)個語句文字部分所對應的另一剩餘時間值及另一句末停頓時長。In some embodiments of the voice data generation method of the present invention, the voice data generation system further stores a preset available time parameter. In step (B), when the sentence text portion is the Nth sentence text portion in a paragraph of the text data (N is an integer greater than 1), the voice data generation system determines the end-of-sentence pause duration by determining the end-of-sentence pause duration based on a remaining time value, wherein the remaining time value is related to the preset available time parameter and is also related to another remaining time value and another end-of-sentence pause duration corresponding to the (N-1)th sentence text portion of the paragraph of the sentence text portion.

在本發明語音資料產生方法的一些實施態樣中，該語音資料產生系統還儲存有一最短停頓時長參數、一大於該最短停頓時長參數的第一停頓時長參數，以及一大於該第一停頓時長參數的第二停頓時長參數。在步驟(B)中，在該語句文字部分為該文字資料之該段落中的第N個語句文字部分的情況下，該語音資料產生系統決定該句末停頓時長的方式包含：若該剩餘時間值大於等於一為正數的第一門檻值，將該句末停頓時長設定為該最短停頓時長參數的數值；若該剩餘時間值小於該第一門檻值且大於等於一為負數的第二門檻值，將該句末停頓時長設定為該第一停頓時長參數的數值；若該剩餘時間值小於該第二門檻值，將該句末停頓時長設定為該第二停頓時長參數的數值。In some implementations of the voice data generating method of the present invention, the voice data generating system further stores a minimum pause duration parameter, a first pause duration parameter greater than the shortest pause duration parameter, and a second pause duration parameter greater than the first pause duration parameter. In step (B), when the sentence text portion is the Nth sentence text portion in the paragraph of the text data, the voice data generation system determines the pause duration at the end of the sentence in the following manner: if the remaining time value is greater than or equal to a first threshold value which is a positive number, the pause duration at the end of the sentence is set to the value of the shortest pause duration parameter; if the remaining time value is less than the first threshold value and greater than or equal to a second threshold value which is a negative number, the pause duration at the end of the sentence is set to the value of the first pause duration parameter; if the remaining time value is less than the second threshold value, the pause duration at the end of the sentence is set to the value of the second pause duration parameter.

在本發明語音資料產生方法的一些實施態樣中，在步驟(B)中，在該剩餘時間值小於一為負數的門檻值的情況下，該語音資料產生系統還將該語句文字部分拆分成多個具有順序性的語句片段，並將該句末停頓時長設定為一停頓時長參數的數值，以及決定一句中停頓時長。在步驟(C)中，在該剩餘時間值小於該門檻值的情況下，該語句語音部分包括多個具有順序性且分別用於以語音指示出該等語句片段的語音片段，以及M個不指示出任何語句片段且持續時間長度與該句中停頓時長相符的句中停頓部分，其中，M為大於等於1的整數，且該M個句中停頓部分是分別接續在該等語音片段中的前M個語音片段之後。In some implementations of the voice data generation method of the present invention, in step (B), when the remaining time value is less than a negative threshold value, the voice data generation system also splits the text portion of the sentence into multiple sequential sentence segments, sets the pause duration at the end of the sentence to the value of a pause duration parameter, and determines the pause duration in a sentence. In step (C), when the remaining time value is less than the threshold value, the speech voice portion of the sentence includes a plurality of speech segments that are sequential and are respectively used to indicate the sentence segments by voice, and M mid-sentence pause portions that do not indicate any sentence segment and whose duration matches the duration of the mid-sentence pause, wherein M is an integer greater than or equal to 1, and the M mid-sentence pause portions are respectively subsequent to the first M speech segments in the speech segments.

在本發明語音資料產生方法的一些實施態樣中，該語音資料產生方法還包含一位於步驟(A)之前的：(D)根據該文字資料所包含的一或多個特定符號從該文字資料中定義出該語句文字部分。In some embodiments of the speech data generation method of the present invention, the speech data generation method further includes a step before step (A): (D) defining the sentence text portion from the text data based on one or more specific symbols contained in the text data.

本發明的另一目的，在於提供一種有助於使電腦語音更加接近真人說話方式的語音資料產生系統。Another object of the present invention is to provide a voice data generation system that helps make computer voice closer to the way real people speak.

本發明語音資料產生系統包含一儲存單元及一與該儲存單元電連接的處理單元，且該處理單元用於：根據一文字資料所包含的一語句文字部分計算一對應於該語句文字部分的字元數量；至少根據該字元數量決定一對應於該語句文字部分的句末停頓時長；至少根據該語句文字部分及該句末停頓時長產生一對應於該文字資料且用於以聲音形式被輸出的語音資料，其中，該語音資料包含一用於以語音指示出該語句文字部分的語句語音部分，以及一接續在該語句語音部分之後、不指示出任何語句文字部分且持續時間長度與該句末停頓時長相符的句末停頓部分。The speech data generation system of the present invention comprises a storage unit and a processing unit electrically connected to the storage unit, and the processing unit is used to: calculate the number of characters corresponding to a sentence text part according to a sentence text part contained in a text data; determine a sentence end pause length corresponding to the sentence text part at least according to the number of characters; generate a speech data corresponding to the text data and used to be output in the form of sound according to at least the sentence text part and the sentence end pause length, wherein the speech data comprises a sentence speech part used to indicate the sentence text part by voice, and a sentence end pause part that follows the sentence speech part, does not indicate any sentence text part and lasts for a time length that matches the sentence end pause length.

在本發明語音資料產生系統的一些實施態樣中，該儲存單元儲存有一語速參數，且該語速參數相關於該語音資料以聲音形式被輸出時的語音速度。該處理單元是用於根據該字元數量及該語速參數計算一對應於該語句文字部分的預計耗費時長，再至少根據該預計耗費時長決定該句末停頓時長。In some embodiments of the speech data generation system of the present invention, the storage unit stores a speech speed parameter, and the speech speed parameter is related to the speech speed when the speech data is output in the form of sound. The processing unit is used to calculate an expected time corresponding to the text part of the sentence according to the number of characters and the speech speed parameter, and then determine the pause length at the end of the sentence at least according to the expected time.

在本發明語音資料產生系統的一些實施態樣中，該儲存單元還儲存有一預設可用時長參數。在該語句文字部分為該文字資料之一段落中的第一個語句文字部分的情況下，該處理單元決定該句末停頓時長的方式，是根據一剩餘時間值來決定該句末停頓時長，其中，該剩餘時間值是該預設可用時長參數與該預計耗費時長之間的差。In some embodiments of the speech data generation system of the present invention, the storage unit further stores a preset available time parameter. When the sentence text portion is the first sentence text portion in a paragraph of the text data, the processing unit determines the pause duration at the end of the sentence by a remaining time value, wherein the remaining time value is the difference between the preset available time parameter and the estimated consuming time.

在本發明語音資料產生系統的一些實施態樣中，該儲存單元還儲存有一第一停頓時長參數，以及一大於該第一停頓時長參數的第二停頓時長參數。在該語句文字部分為該文字資料之該段落中的第一個語句文字部分的情況下，該處理單元決定該句末停頓時長的方式包含：若該剩餘時間值大於等於一門檻值，將該句末停頓時長設定為該第一停頓時長參數的數值；若該剩餘時間值小於該門檻值，將該句末停頓時長設定為該第二停頓時長參數的數值。In some embodiments of the speech data generation system of the present invention, the storage unit further stores a first pause duration parameter and a second pause duration parameter greater than the first pause duration parameter. When the sentence text portion is the first sentence text portion in the paragraph of the text data, the processing unit determines the pause duration at the end of the sentence in a manner including: if the remaining time value is greater than or equal to a threshold value, setting the pause duration at the end of the sentence to the value of the first pause duration parameter; if the remaining time value is less than the threshold value, setting the pause duration at the end of the sentence to the value of the second pause duration parameter.

在本發明語音資料產生系統的一些實施態樣中，該儲存單元還儲存有一預設可用時長參數。在該語句文字部分為該文字資料之一段落中的第N個語句文字部分的情況下(N為大於1的整數)，該處理單元決定該句末停頓時長的方式，是根據一剩餘時間值來決定該句末停頓時長，其中，該剩餘時間值相關於該預設可用時長參數，且還相關於該語句文字部分之該段落的第(N-1)個語句文字部分所對應的另一剩餘時間值及另一句末停頓時長。In some implementations of the speech data generation system of the present invention, the storage unit further stores a preset available time parameter. When the sentence text portion is the Nth sentence text portion in a paragraph of the text data (N is an integer greater than 1), the processing unit determines the end-of-sentence pause duration by determining the end-of-sentence pause duration based on a remaining time value, wherein the remaining time value is related to the preset available time parameter and is also related to another remaining time value and another end-of-sentence pause duration corresponding to the (N-1)th sentence text portion of the paragraph of the sentence text portion.

在本發明語音資料產生系統的一些實施態樣中，該儲存單元還儲存有一最短停頓時長參數、一大於該最短停頓時長參數的第一停頓時長參數，以及一大於該第一停頓時長參數的第二停頓時長參數。在該語句文字部分為該文字資料之該段落中的第N個語句文字部分的情況下(N為大於1的整數)，該處理單元決定該句末停頓時長的方式包含：若該剩餘時間值大於等於一為正數的第一門檻值，將該句末停頓時長設定為該最短停頓時長參數的數值；若該剩餘時間值小於該第一門檻值且大於等於一為負數的第二門檻值，將該句末停頓時長設定為該第一停頓時長參數的數值；若該剩餘時間值小於該第二門檻值，將該句末停頓時長設定為該第二停頓時長參數的數值。In some implementations of the voice data generating system of the present invention, the storage unit further stores a shortest pause duration parameter, a first pause duration parameter greater than the shortest pause duration parameter, and a second pause duration parameter greater than the first pause duration parameter. In the case where the sentence text portion is the Nth sentence text portion in the paragraph of the text data (N is an integer greater than 1), the processing unit determines the pause duration at the end of the sentence in the following manner: if the remaining time value is greater than or equal to a first threshold value which is a positive number, the pause duration at the end of the sentence is set to the value of the shortest pause duration parameter; if the remaining time value is less than the first threshold value and greater than or equal to a second threshold value which is a negative number, the pause duration at the end of the sentence is set to the value of the first pause duration parameter; if the remaining time value is less than the second threshold value, the pause duration at the end of the sentence is set to the value of the second pause duration parameter.

在本發明語音資料產生系統的一些實施態樣中，該處理單元還用於在該剩餘時間值小於一為負數的門檻值的情況下，將該語句文字部分拆分成多個具有順序性的語句片段，並將該句末停頓時長設定為一停頓時長參數的數值，以及決定一句中停頓時長。在該剩餘時間值小於該門檻值的情況下，該語句語音部分包括多個具有順序性且分別用於以語音指示出該等語句片段的語音片段，以及M個不指示出任何語句片段且持續時間長度與該句中停頓時長相符的句中停頓部分，其中，M為大於等於1的整數，且該M個句中停頓部分是分別接續在該等語音片段中的前M個語音片段之後。In some implementations of the speech data generation system of the present invention, the processing unit is also used to split the text portion of the sentence into multiple sequential sentence segments when the remaining time value is less than a negative threshold value, set the pause duration at the end of the sentence to the value of a pause duration parameter, and determine the pause duration in a sentence. When the remaining time value is less than the threshold value, the speech voice portion of the sentence includes a plurality of speech segments that are sequential and are respectively used to indicate the sentence segments by voice, and M mid-sentence pause portions that do not indicate any sentence segment and whose duration matches the duration of the mid-sentence pause, wherein M is an integer greater than or equal to 1, and the M mid-sentence pause portions are respectively subsequent to the first M speech segments in the speech segments.

在本發明語音資料產生系統的一些實施態樣中，該處理單元還用於在計算該字元數量之前根據該文字資料所包含的一或多個特定符號從該文字資料中定義出該語句文字部分。In some implementations of the speech data generation system of the present invention, the processing unit is further used to define the sentence text part from the text data according to one or more specific symbols contained in the text data before calculating the number of characters.

本發明的再一目的，在於提供一種有助於使電腦語音更加接近真人說話方式的電腦程式產品。Another object of the present invention is to provide a computer program product that helps make computer voice closer to the way real people speak.

本發明電腦程式產品包含一應用程式，其中，該應用程式適用於被一電子裝置載入並運行，而使該電子裝置對一文字資料實施如前述任一實施態樣中所述的語音資料產生方法。The computer program product of the present invention includes an application program, wherein the application program is suitable for being loaded and run by an electronic device, so that the electronic device implements the voice data generation method described in any of the above-mentioned implementation forms for a text data.

本發明之功效在於：該語音資料產生系統能至少根據該文字資料之該語句文字部分的字元數量來決定對應於該語句文字部分的句末停頓時長，從而設定該語音資料中接續在該語句語音部分之後的句末停頓部分的持續時間長度，藉此，該語音資料產生系統所產生的該語音資料能以語句的長短為依據，從而模擬真人說話時因換氣所導致之不同長度的停頓，所以，該語音資料產生系統有助於使電腦語音更加接近真人的說話方式。The effect of the present invention is that the voice data generation system can determine the length of the sentence-end pause corresponding to the text part of the sentence at least according to the number of characters in the text part of the sentence of the text data, thereby setting the duration of the sentence-end pause part following the voice part of the sentence in the voice data. In this way, the voice data generated by the voice data generation system can be based on the length of the sentence, thereby simulating the pauses of different lengths caused by breathing when a real person speaks. Therefore, the voice data generation system helps to make computer voice closer to the way a real person speaks.

在本發明被詳細描述之前應當注意：若未特別定義，則本專利說明書中所述的「電連接」泛指多個電子設備/裝置/元件之間透過導電材料彼此相連而實現的「有線電連接」，以及透過無線通訊技術進行單/雙向無線信號傳輸的「無線電連接」。並且，本專利說明書中所述的「電連接」也泛指多個電子設備/裝置/元件之間彼此直接相連而形成的「直接電連接」，以及多個電子設備/裝置/元件之間還透過其他電子設備/裝置/元件彼此間接相連而形成的「間接電連接」。Before the invention is described in detail, it should be noted that, if not specifically defined, the "electrical connection" described in this patent specification generally refers to "wired electrical connection" achieved by connecting multiple electronic devices/devices/components to each other through conductive materials, and "radio connection" for unidirectional/bidirectional wireless signal transmission through wireless communication technology. In addition, the "electrical connection" described in this patent specification also generally refers to "direct electrical connection" formed by direct connection between multiple electronic devices/devices/components, and "indirect electrical connection" formed by indirect connection between multiple electronic devices/devices/components through other electronic devices/devices/components.

參閱圖1，本發明語音資料產生系統1的一實施例適用於與多個使用端5（圖1僅示出其中一者）透過網路電連接。其中，每一使用端5是一台能供使用者操作的手機、平板電腦、筆記型電腦或者桌上型電腦，並且，為了便於描述，以下僅利用圖1所示出的該使用端5對本實施例的運作方式進行示例說明。Referring to FIG. 1 , an embodiment of the voice data generating system 1 of the present invention is suitable for being electrically connected to a plurality of user terminals 5 (only one of which is shown in FIG. 1 ) via a network. Each user terminal 5 is a mobile phone, tablet computer, laptop computer or desktop computer that can be operated by a user, and for ease of description, the operation of the present embodiment is described below using only the user terminal 5 shown in FIG. 1 as an example.

在本實施例中，該語音資料產生系統1被實施為一台伺服設備，且該語音資料產生系統1包含一處理單元11，以及一電連接該處理單元11的儲存單元12。其中，該處理單元11為一具有資料運算及處理功能的中央處理器，且該處理單元11適用於透過網路與該使用端5電連接以進行通訊。該儲存單元12是一用於儲存數位資料的資料儲存裝置（例如硬碟）。In this embodiment, the voice data generation system 1 is implemented as a server device, and the voice data generation system 1 includes a processing unit 11 and a storage unit 12 electrically connected to the processing unit 11. The processing unit 11 is a central processor with data calculation and processing functions, and the processing unit 11 is suitable for being electrically connected to the user end 5 through a network for communication. The storage unit 12 is a data storage device (such as a hard disk) for storing digital data.

然而，在另一種實施例中，該語音資料產生系統1亦可被實施為多台彼此電連接的伺服設備，該處理單元11是被實施為該等伺服設備所分別具有之多個中央處理器的組合，而該儲存單元12則是被實施為該等伺服設備所分別具有之多個儲存裝置的組合。此外，在又一種實施例中，該語音資料產生系統1是一台能供使用者操作的電子裝置，且可被實施為手機、平板電腦、筆記型電腦或者桌上型電腦。所以，應當理解的是，該語音資料產生系統1在硬體方面的實際實施態樣並不以本實施例為限。However, in another embodiment, the voice data generation system 1 can also be implemented as a plurality of servo devices electrically connected to each other, the processing unit 11 is implemented as a combination of a plurality of central processors respectively possessed by the servo devices, and the storage unit 12 is implemented as a combination of a plurality of storage devices respectively possessed by the servo devices. In addition, in another embodiment, the voice data generation system 1 is an electronic device that can be operated by a user, and can be implemented as a mobile phone, a tablet computer, a laptop computer, or a desktop computer. Therefore, it should be understood that the actual implementation of the voice data generation system 1 in terms of hardware is not limited to this embodiment.

在本實施例中，該儲存單元12儲存有一語速參數P1、一預設可用時長參數P2，以及多個停頓時長參數P3，其中，本專利說明書所述的「時長」是代表「時間長度」。該語速參數P1、該預設可用時長參數P2以及該等停頓時長參數P3是用於供該處理單元11據以產生能以電腦語音被播放出（例如被該使用端5所播放出）的語音資料，所述語音資料的相關細節將於後續說明。In this embodiment, the storage unit 12 stores a speech rate parameter P1, a preset available time parameter P2, and a plurality of pause time parameters P3, wherein the "time" mentioned in this patent specification represents "time length". The speech rate parameter P1, the preset available time parameter P2, and the pause time parameters P3 are used by the processing unit 11 to generate voice data that can be played out as computer voice (for example, played out by the user terminal 5), and the relevant details of the voice data will be described later.

該語速參數P1是用來設定該處理單元11所產生之語音資料本身的文字播放速度。在本實施例中，該語速參數P1代表的單位是「秒／字元」，換言之，該語速參數P1是表示單一個字元被播放所需耗費的時間，但並不以此為限。舉例來說，若該語速參數P1的數值是「0.2」，代表所述的語音資料被以電腦語音播放時，其中的每一個字元平均需耗時0.2秒來被播放，所以，假設所述語音資料中的一個示例的語句文字部分是由十個字元所構成（例如「今天的天氣是晴時多雲」），則此語句文字部分便需耗時2秒才能被以電腦語音播放完畢。The speech rate parameter P1 is used to set the text playback speed of the voice data itself generated by the processing unit 11. In the present embodiment, the unit represented by the speech rate parameter P1 is "seconds/character". In other words, the speech rate parameter P1 represents the time required for a single character to be played, but is not limited to this. For example, if the value of the speech rate parameter P1 is "0.2", it means that when the voice data is played as computer voice, each character in it takes an average of 0.2 seconds to be played. Therefore, assuming that an example of a sentence text part in the voice data is composed of ten characters (for example, "Today's weather is sunny and cloudy"), this sentence text part will take 2 seconds to be played as computer voice.

該預設可用時長參數P2是用來表示所述語音資料在被播放的過程中，多個字元被連續播放的理想時間長度上限。更具體地說，該預設可用時長參數P2可以被理解成一個用來模擬真人肺活量程度的參數。舉例來說，若該預設可用時長參數P2被設定成3.5秒，則依據該預設可用時長參數P2被產生之語音資料所呈現的說話方式，便是在模擬一個能夠輕鬆地一口氣說話3.5秒、但超過3.5秒後便會漸感吃力的真人。The default available time parameter P2 is used to indicate the upper limit of the ideal time length for which multiple characters are played continuously during the playback of the voice data. More specifically, the default available time parameter P2 can be understood as a parameter used to simulate the lung capacity of a real person. For example, if the default available time parameter P2 is set to 3.5 seconds, the speaking style presented by the voice data generated according to the default available time parameter P2 is to simulate a real person who can speak easily for 3.5 seconds in one breath, but will gradually feel tired after more than 3.5 seconds.

在本實施例中，該等停頓時長參數P3的數量為三個，且該三個停頓時長參數P3分別為一最短停頓時長參數P30、一大於該最短停頓時長參數P30的第一停頓時長參數P31，以及一大於該第一停頓時長參數P31的第二停頓時長參數P32。具體來說，該最短停頓時長參數P30是用來模擬真人在說話過程中在語句之間稍微停頓的時間長度，該第一停頓時長參數P31是用來模擬真人在說話過程中因小口換氣而停頓的時間長度，而該第二停頓時長參數P32則是用來模擬真人在說話過程中因大口換氣而停頓的時間長度。示例性地，該最短停頓時長參數P30例如為0.2秒，該第一停頓時長參數P31例如為0.4秒，該第二停頓時長參數P32則例如為0.8秒，但並不以此為限。In this embodiment, the number of the pause duration parameters P3 is three, and the three pause duration parameters P3 are respectively a shortest pause duration parameter P30, a first pause duration parameter P31 greater than the shortest pause duration parameter P30, and a second pause duration parameter P32 greater than the first pause duration parameter P31. Specifically, the shortest pause time parameter P30 is used to simulate the length of time a real person pauses slightly between sentences during speaking, the first pause time parameter P31 is used to simulate the length of time a real person pauses due to taking a small breath during speaking, and the second pause time parameter P32 is used to simulate the length of time a real person pauses due to taking a large breath during speaking. Exemplarily, the shortest pause time parameter P30 is, for example, 0.2 seconds, the first pause time parameter P31 is, for example, 0.4 seconds, and the second pause time parameter P32 is, for example, 0.8 seconds, but it is not limited thereto.

補充說明的是，該語速參數P1、該預設可用時長參數P2及該等停頓時長參數P3的數值可依據欲透過語音資料所呈現之說話方式的不同而被自由調整。舉例來說，若要模擬成年男性的說話方式，該預設可用時長參數P2可被設定成3.5秒，而若要模擬幼童的說話方式，該預設可用時長參數P2則可被設定成1.9秒，但不以此為限。根據所欲呈現之說話方式的不同，該語速參數P1、該預設可用時長參數P2及該等停頓時長參數P3皆可被自由設定成大於0的各種數值，所以，該語速參數P1、該預設可用時長參數P2及該等停頓時長參數P3的實際數值並不以前述所舉之例為限。It is to be noted that the values of the speech rate parameter P1, the default available time parameter P2 and the pause time parameters P3 can be freely adjusted according to the different speaking styles to be presented through the voice data. For example, if the speaking style of an adult male is to be simulated, the default available time parameter P2 can be set to 3.5 seconds, and if the speaking style of a child is to be simulated, the default available time parameter P2 can be set to 1.9 seconds, but the present invention is not limited thereto. Depending on the desired speaking style, the speech rate parameter P1, the default available time parameter P2 and the pause time parameters P3 can all be freely set to various values greater than 0. Therefore, the actual values of the speech rate parameter P1, the default available time parameter P2 and the pause time parameters P3 are not limited to the above examples.

同時參閱圖1及圖2（由圖2A及2B組成），以下示例性地說明本實施例的該語音資料產生系統1如何對一文字資料實施一語音資料產生方法。其中，該文字資料可以是由該處理單元11從該使用端5所接收（亦即由該使用端5傳送至該處理單元11），或者，該文字資料也可以是被預先儲存於該儲存單元12，而被該處理單元11所讀取。由於該處理單元11獲得該文字資料的方式並不影響該語音資料產生方法的實施，故本實施例不限定該文字資料的來源。Referring to FIG. 1 and FIG. 2 (composed of FIG. 2A and FIG. 2B ), the following is an exemplary description of how the voice data generation system 1 of the present embodiment implements a voice data generation method for a text data. The text data may be received by the processing unit 11 from the user end 5 (i.e., transmitted by the user end 5 to the processing unit 11), or the text data may be pre-stored in the storage unit 12 and read by the processing unit 11. Since the way in which the processing unit 11 obtains the text data does not affect the implementation of the voice data generation method, the present embodiment does not limit the source of the text data.

為了便於描述與理解，在此先假設該文字資料的內容為：「學生時期老師們都會為我們精心策辦活動，萬聖節是一年之中最期待的節日之一，爸爸媽媽也都絞盡腦汁配合學校活動，為自己小孩打扮，有的時候是自己提出想要扮什麼，但更多的時候是不情願，迎合父母口味打扮成他們想要的樣子，即便如此，到了學校與同學一起慶祝，一起上街要糖，也是非常快樂的！」。補充說明的是，上述所示的該文字資料僅有單一個段落，但該語音資料產生方法也能被應用在具有多個段落的文字資料上。For the sake of ease of description and understanding, let's assume that the content of the text data is: "When we were students, our teachers would carefully plan activities for us. Halloween is one of the most anticipated holidays of the year. Parents would rack their brains to coordinate with school activities and dress up their children. Sometimes they would ask for what they want to dress up as, but more often they would be reluctant and dress up to cater to their parents' tastes. Even so, it is very happy to celebrate with classmates at school and go to the streets to ask for candy together!" It should be noted that the text data shown above has only a single paragraph, but the voice data generation method can also be applied to text data with multiple paragraphs.

首先，在步驟S1中，該處理單元11根據該預設可用時長參數P2的數值（例如「3.5」）決定出多個門檻值，以及多個補充時間值。First, in step S1, the processing unit 11 determines a plurality of threshold values and a plurality of supplementary time values according to the value of the preset available time parameter P2 (eg, "3.5").

在本實施例中，該等門檻值例如包含一為正數的第一門檻值、一為負數的第二門檻值，以及一小於該第二門檻值而亦為負數的第三門檻值。另一方面，該等補充時間值則例如包含一對應於該第一停頓時長參數P31的第一補充時間值，以及一對應於該第二停頓時長參數P32的第二補充時間值。In this embodiment, the threshold values include, for example, a first threshold value that is a positive number, a second threshold value that is a negative number, and a third threshold value that is less than the second threshold value and is also a negative number. On the other hand, the supplementary time values include, for example, a first supplementary time value corresponding to the first pause duration parameter P31, and a second supplementary time value corresponding to the second pause duration parameter P32.

具體舉例來說，在本實施例中，該第一門檻值是該預設可用時長參數P2之數值（例如「3.5」）的0.5倍（例如「1.75」），該第二門檻值是該預設可用時長參數P2之數值的-0.5倍（例如「-1.75」），而該第三門檻值則是該預設可用時長參數P2之數值的-1倍（例如「-3.5」）。另一方面，在本實施例中，該第一補充時間值是該預設可用時長參數P2之數值的0.5倍（例如「1.75」），而該第二補充時間值則與該預設可用時長參數P2的數值相符（例如「3.5」）。補充說明的是，該等門檻值及該等補充時間值皆是由該處理單元11根據該預設可用時長參數P2的數值以預設好的倍率所計算出，但是，其每一者與該預設可用時長參數P2之間的倍率關係可被自由設定與調整，所以，該等門檻值及該等補充時間值各自的實際態樣當然不以前述所舉之例為限。Specifically, in the present embodiment, the first threshold value is 0.5 times (e.g., "1.75") the value of the default available time parameter P2 (e.g., "3.5"), the second threshold value is -0.5 times (e.g., "-1.75") the value of the default available time parameter P2, and the third threshold value is -1 times (e.g., "-3.5") the value of the default available time parameter P2. On the other hand, in the present embodiment, the first supplementary time value is 0.5 times (e.g., "1.75") the value of the default available time parameter P2, and the second supplementary time value matches the value of the default available time parameter P2 (e.g., "3.5"). It should be noted that the threshold values and the supplementary time values are calculated by the processing unit 11 according to the value of the preset available time parameter P2 with a preset multiplier. However, the multiplier relationship between each of them and the preset available time parameter P2 can be freely set and adjusted. Therefore, the actual states of the threshold values and the supplementary time values are certainly not limited to the examples mentioned above.

在該處理單元11決定出該等門檻值及該等補充時間值後，流程進行至步驟S2。After the processing unit 11 determines the threshold values and the replenishment time values, the process proceeds to step S2.

在步驟S2中，該處理單元11根據該文字資料所包含的一或多個特定符號，從該文字資料中定義出多個具有順序性的語句文字部分。更詳細地說，每一語句文字部分在本實施例中是由多個能被電腦語音播放出的字元所構成，而且，所述的特定符號是被預先定義好的多種特定標點符號，包含但不限於逗號、句號、頓號、空格符號、分號、問號、驚嘆號、引號及冒號等用來表示停頓語氣的標點符號。In step S2, the processing unit 11 defines a plurality of sequential sentence text parts from the text data according to one or more specific symbols contained in the text data. More specifically, each sentence text part is composed of a plurality of characters that can be played by computer voice in this embodiment, and the specific symbols are a plurality of predefined specific punctuation marks, including but not limited to commas, periods, hyphens, space marks, semicolons, question marks, exclamation marks, quotation marks, colons, etc., which are used to indicate pauses.

以前述所示的該文字資料舉例來說，該處理單元11所定義出的第一個語句文字部分會是「學生時期老師們都會為我們精心策辦活動」，第二個語句文字部分會是「萬聖節是一年之中最期待的節日之一」，第三個語句文字部分則會是「爸爸媽媽也都絞盡腦汁配合學校活動」，餘下以此類推。Taking the text data shown above as an example, the first sentence text part defined by the processing unit 11 would be "When we were students, our teachers would carefully plan activities for us", the second sentence text part would be "Halloween is one of the most anticipated holidays of the year", the third sentence text part would be "Mom and Dad also racked their brains to cooperate with school activities", and so on.

在該處理單元11定義出該文字資料所包含的該等語句文字部分之後，流程進行至步驟S3。After the processing unit 11 defines the sentence text parts contained in the text data, the process proceeds to step S3.

在步驟S3中，對於所述的第一個語句文字部分（後稱「第一語句文字部分」），該處理單元11根據該第一語句文字部分中的所有字元，計算一對應於該第一語句文字部分的字元數量。並且，為了便於描述，在此將該第一語句文字部分所對應的該字元數量作為一第一字元數量。以前述所示的該文字資料為例，該第一語句文字部分是「學生時期老師們都會為我們精心策辦活動」，因此，該第一字元數量會是「18」，亦即該第一語句文字部分總共由18個字元所組成。In step S3, for the first sentence text part (hereinafter referred to as "the first sentence text part"), the processing unit 11 calculates the number of characters corresponding to the first sentence text part according to all the characters in the first sentence text part. And, for the convenience of description, the number of characters corresponding to the first sentence text part is taken as a first character number. Taking the text data shown above as an example, the first sentence text part is "When we were students, our teachers would carefully plan activities for us", so the first character number would be "18", that is, the first sentence text part is composed of 18 characters in total.

在該處理單元11計算出該第一字元數量之後，流程進行至步驟S4。After the processing unit 11 calculates the first character quantity, the process proceeds to step S4.

在步驟S4中，對於該第一語句文字部分，該處理單元11根據該第一字元數量及該語速參數P1，計算一對應於該第一語句文字部分的預計耗費時長。並且，為了便於描述，在此將該第一語句文字部分所對應的該預計耗費時長作為一第一預計耗費時長。更具體地說，該第一預計耗費時長是代表該第一語句文字部分被以電腦語音播放所需耗費的時間長度。而且，在本實施例中，該處理單元11是將該第一字元數量（例如為「18」）與該語速參數P1（例如為「0.2」）相乘以計算出該第一預計耗費時長（例如「3.6」），但並不以此為限。In step S4, for the first sentence text portion, the processing unit 11 calculates an expected time corresponding to the first sentence text portion according to the first character number and the speech rate parameter P1. And, for the convenience of description, the expected time corresponding to the first sentence text portion is referred to as a first expected time. More specifically, the first expected time represents the length of time required for the first sentence text portion to be played as computer voice. Moreover, in the present embodiment, the processing unit 11 multiplies the first character number (for example, "18") and the speech rate parameter P1 (for example, "0.2") to calculate the first expected time (for example, "3.6"), but is not limited thereto.

在該處理單元11計算出該第一預計耗費時長之後，流程進行至步驟S5。After the processing unit 11 calculates the first estimated time duration, the process proceeds to step S5.

在步驟S5中，對於該第一語句文字部分，該處理單元11根據該第一預計耗費時長及該預設可用時長參數P2，計算一對應於該第一語句文字部分的剩餘時間值。並且，為了便於描述，在此將該第一語句文字部分所對應的該剩餘時間值作為一第一剩餘時間值。更具體地說，在本實施例中，該處理單元11是將該預設可用時長參數P2的數值（例如「3.5」）減去該第一預計耗費時長（例如步驟S4中舉例的「3.6」）以計算出該第一剩餘時間值（例如「-0.1」），換言之，該第一剩餘時間值是該預設可用時長參數P2與該第一預計耗費時長之間的差。In step S5, for the first sentence text portion, the processing unit 11 calculates a remaining time value corresponding to the first sentence text portion according to the first expected consuming time and the preset available time parameter P2. And, for the convenience of description, the remaining time value corresponding to the first sentence text portion is referred to as a first remaining time value. More specifically, in this embodiment, the processing unit 11 calculates the first remaining time value (e.g., "-0.1") by subtracting the first expected consuming time (e.g., "3.6" cited in step S4) from the value of the preset available time parameter P2 (e.g., "3.5"); in other words, the first remaining time value is the difference between the preset available time parameter P2 and the first expected consuming time.

在本實施例中，該第一剩餘時間值可以被理解成在模擬一個真人以該語速參數P1所表示的語速一口氣說完該第一語句文字部分之後的肺活量情形，但並不以此為限。以本實施例而言，若該第一剩餘時間值大於等於0，可以理解成有如一個真人能夠輕鬆地一口氣說完該第一語句文字部分，反之，若該第一剩餘時間值小於0（即為負數），則可以理解成有如一個真人在一口氣說完該第一語句文字部分後會因肺活量不足而感到吃力，而且，在該剩餘時間值小於0的情況下，其絕對值的大小可以被理解成吃力的程度，亦即絕對值愈大，吃力程度愈高。In this embodiment, the first remaining time value can be understood as the lung capacity of a real person after speaking the first sentence text part in one breath at the speech speed represented by the speech speed parameter P1, but it is not limited to this. In this embodiment, if the first remaining time value is greater than or equal to 0, it can be understood as a real person can easily speak the first sentence text part in one breath. On the contrary, if the first remaining time value is less than 0 (i.e., a negative number), it can be understood as a real person will feel tired due to insufficient lung capacity after speaking the first sentence text part in one breath. Moreover, when the remaining time value is less than 0, the size of its absolute value can be understood as the degree of effort, that is, the larger the absolute value, the higher the degree of effort.

在該處理單元11計算出該第一剩餘時間值之後，流程進行至步驟S6。After the processing unit 11 calculates the first remaining time value, the process proceeds to step S6.

在步驟S6中，該處理單元11根據該第一剩餘時間值、該第二門檻值以及該第三門檻值來決定出一對應於該第一語句文字部分的句末停頓時長。並且，為了便於描述，在此將該第一語句文字部分所對應的該句末停頓時長作為一第一句末停頓時長。In step S6, the processing unit 11 determines a sentence end pause duration corresponding to the first sentence text portion according to the first remaining time value, the second threshold value, and the third threshold value. And, for ease of description, the sentence end pause duration corresponding to the first sentence text portion is referred to as a first sentence end pause duration.

該第一句末停頓時長是表示電腦語音在將該第一語句文字部分播放完畢後，在繼續播放下一個語句文字部分前所要停頓的時間長度。而且，在本實施例中，該處理單元11決定該第一句末停頓時長的方式，是將該第一剩餘時間值與該第二門檻值（例如步驟S1中舉例的「-1.75」）以及該第三門檻值（例如步驟S1中舉例的「-3.5」）進行比對，並根據比對結果決定該第一句末停頓時長，但並不以此為限。The first sentence end pause duration indicates the length of time that the computer voice pauses after playing the first sentence text portion and before continuing to play the next sentence text portion. Moreover, in this embodiment, the processing unit 11 determines the first sentence end pause duration by comparing the first remaining time value with the second threshold value (e.g., "-1.75" as exemplified in step S1) and the third threshold value (e.g., "-3.5" as exemplified in step S1), and determining the first sentence end pause duration based on the comparison result, but the present invention is not limited thereto.

具體舉例來說，若該處理單元11判斷出該第一剩餘時間值大於等於該第二門檻值，則該處理單元11將該第一句末停頓時長設定為該第一停頓時長參數P31的數值（例如「0.4」）。另一方面，若該處理單元11判斷出該第一剩餘時間值小於該第二門檻值且大於等於該第三門檻值，則該處理單元11將該第一句末停頓時長設定為該第二停頓時長參數P32的數值（例如「0.8」）。再一方面，若該處理單元11判斷出該第一剩餘時間值小於該第三門檻值，則該處理單元11除了將該第一句末停頓時長設定為該第二停頓時長參數P32的數值之外，還將該第一語句文字部分拆分成多個具有順序性且能共同構成該第一語句文字部分的語句片段，以及進一步決定出一對應於該第一語句文字部分的句中停頓時長。更明確地說，對於該第一語句文字部分，該處理單元11只有在該第一剩餘時間值小於該第三門檻值的情況下，才會將該第一語句文字部分拆分成該等語句片段並決定出該句中停頓時長，而且，該處理單元11在本實施例中例如是將該句中停頓時長設定為該第一停頓時長參數P31的數值（例如「0.4」），但並不以此為限。Specifically, if the processing unit 11 determines that the first remaining time value is greater than or equal to the second threshold value, the processing unit 11 sets the first end-of-sentence pause duration to the value of the first pause duration parameter P31 (e.g., "0.4"). On the other hand, if the processing unit 11 determines that the first remaining time value is less than the second threshold value and greater than or equal to the third threshold value, the processing unit 11 sets the first end-of-sentence pause duration to the value of the second pause duration parameter P32 (e.g., "0.8"). On the other hand, if the processing unit 11 determines that the first remaining time value is less than the third threshold value, the processing unit 11 not only sets the pause duration at the end of the first sentence to the value of the second pause duration parameter P32, but also splits the first sentence text portion into a plurality of sentence segments that are sequential and can together constitute the first sentence text portion, and further determines a mid-sentence pause duration corresponding to the first sentence text portion. More specifically, for the first sentence text portion, the processing unit 11 will split the first sentence text portion into the sentence segments and determine the pause duration in the sentence only when the first remaining time value is less than the third threshold value. Moreover, in this embodiment, the processing unit 11 sets the pause duration in the sentence to the value of the first pause duration parameter P31 (e.g., "0.4"), but is not limited to this.

在該處理單元11決定出該第一句末停頓時長之後，流程進行至步驟S7。After the processing unit 11 determines the pause duration at the end of the first sentence, the process proceeds to step S7.

在步驟S7中，對於所述的第二個語句文字部分（後稱「第二語句文字部分」），該處理單元11根據該第二語句文字部分中的所有字元，計算一對應於該第二語句文字部分的字元數量。並且，為了便於描述，在此將該第二語句文字部分所對應的該字元數量作為一第二字元數量。以前述所示的該文字資料為例，該第二語句文字部分是「萬聖節是一年之中最期待的節日之一」，因此，該第二字元數量會是「16」，亦即該第二語句文字部分總共由16個字元所組成。In step S7, for the second sentence text part (hereinafter referred to as "the second sentence text part"), the processing unit 11 calculates the number of characters corresponding to the second sentence text part according to all the characters in the second sentence text part. And, for the convenience of description, the number of characters corresponding to the second sentence text part is taken as a second character number. Taking the text data shown above as an example, the second sentence text part is "Halloween is one of the most anticipated festivals in a year", therefore, the second character number will be "16", that is, the second sentence text part is composed of 16 characters in total.

在該處理單元11計算出該第二字元數量之後，流程進行至步驟S8。After the processing unit 11 calculates the second character quantity, the process proceeds to step S8.

在步驟S8中，對於該第二語句文字部分，該處理單元11根據該第二字元數量及該語速參數P1，計算一對應於該第二語句文字部分的預計耗費時長。並且，為了便於描述，在此將該第二語句文字部分所對應的該預計耗費時長作為一第二預計耗費時長。與該第一預計耗費時長類似，該第二預計耗費時長是代表該第二語句文字部分被以電腦語音播放所需耗費的時間長度。而且，在本實施例中，該處理單元11是將該第二字元數量（例如為「16」）與該語速參數P1（例如為「0.2」）相乘以計算出該第二預計耗費時長（例如「3.2」），但並不以此為限。In step S8, for the second sentence text portion, the processing unit 11 calculates an expected time corresponding to the second sentence text portion according to the second character number and the speech rate parameter P1. And, for the convenience of description, the expected time corresponding to the second sentence text portion is referred to as a second expected time. Similar to the first expected time, the second expected time represents the length of time required for the second sentence text portion to be played as computer voice. Moreover, in the present embodiment, the processing unit 11 multiplies the second character number (for example, "16") and the speech rate parameter P1 (for example, "0.2") to calculate the second expected time (for example, "3.2"), but is not limited thereto.

在該處理單元11計算出該第二預計耗費時長之後，流程進行至步驟S9。After the processing unit 11 calculates the second estimated time duration, the process proceeds to step S9.

在步驟S9中，對於該第二語句文字部分，該處理單元11根據該第二預計耗費時長、該第二語句文字部分的上一個語句文字部分（亦即該第一語句文字部分）所對應的該剩餘時間值（亦即該第一剩餘時間值），以及該等補充時間值中與上一個語句文字部分所對應之句末停頓時長（亦即該第一句末停頓時長）相關的其中一個補充時間值，計算一對應於該第二語句文字部分的剩餘時間值。並且，為了便於描述，在此將該第二語句文字部分所對應的該剩餘時間值作為一第二剩餘時間值。類似於該第一剩餘時間值，該第二剩餘時間值可以被理解成在模擬一個真人以該語速參數P1所表示的語速一口氣說完該第二語句文字部分之後的肺活量情形。In step S9, for the second sentence text part, the processing unit 11 calculates a remaining time value corresponding to the second sentence text part according to the second estimated consuming time, the remaining time value corresponding to the previous sentence text part (i.e., the first sentence text part) of the second sentence text part (i.e., the first remaining time value), and one of the supplementary time values related to the end-of-sentence pause duration corresponding to the previous sentence text part (i.e., the first end-of-sentence pause duration). And, for the convenience of description, the remaining time value corresponding to the second sentence text part is taken as a second remaining time value. Similar to the first remaining time value, the second remaining time value can be understood as the lung capacity of a real person after simulating the speaking speed of the second sentence in one breath at the speaking speed parameter P1.

在本實施例中，該處理單元11計算該第二剩餘時間值的方式，是將該第一剩餘時間值（例如步驟S5中舉例的「-0.1」）累加與該第一句末停頓時長相關的補充時間值（例如該第一補充時間值，其數值例如為步驟S1中舉例的「1.75」），再減去該第二預計耗費時長（例如前述舉例的「3.2」），從而計算出該第二剩餘時間值（例如「-1.55」）。並且，若上一個語句文字部分所對應的該句末停頓時長是被設定成該第一停頓時長參數P31的數值（例如步驟S6中舉例的「0.4」），則該處理單元11計算該第二剩餘時間值時，便是將上一個語句文字部分所對應的該剩餘時間值累加與該第一停頓時長參數P31對應的該第一補充時間值（例如步驟S1中舉例的「1.75」），再減去該第二預計耗費時長。而若上一個語句文字部分所對應的該句末停頓時長是被設定成該第二停頓時長參數P32的數值（例如步驟S6中舉例的「0.8」），則該處理單元11計算該第二剩餘時間值時，便是將上一個語句文字部分所對應的該剩餘時間值累加與該第二停頓時長參數P32對應的該第二補充時間值（例如步驟S1中舉例的「3.5」），再減去該第二預計耗費時長。In this embodiment, the processing unit 11 calculates the second remaining time value by adding the first remaining time value (e.g., "-0.1" as exemplified in step S5) to the supplementary time value related to the pause duration at the end of the first sentence (e.g., the first supplementary time value, whose value is, for example, "1.75" as exemplified in step S1), and then subtracting the second expected duration (e.g., "3.2" as exemplified above), thereby calculating the second remaining time value (e.g., "-1.55"). Furthermore, if the pause duration at the end of the sentence corresponding to the previous text portion is set to the value of the first pause duration parameter P31 (for example, "0.4" as cited in step S6), then when the processing unit 11 calculates the second remaining time value, it is the remaining time value corresponding to the previous text portion plus the first supplementary time value corresponding to the first pause duration parameter P31 (for example, "1.75" as cited in step S1), and then subtracts the second expected time. If the pause duration at the end of the sentence corresponding to the previous text portion is set to the value of the second pause duration parameter P32 (for example, "0.8" as cited in step S6), then when the processing unit 11 calculates the second remaining time value, it adds the remaining time value corresponding to the previous text portion to the second supplementary time value corresponding to the second pause duration parameter P32 (for example, "3.5" as cited in step S1), and then subtracts the second expected time.

更明確地說，該處理單元11要利用哪一個補充時間值來計算該第二剩餘時間值，取決於上一個句末停頓時長是被設定成該第一停頓時長參數P31的數值，還是被設定成該第二停頓時長參數P32的數值。其中，每一補充時間值可被理解成是模擬真人在說完一句話後藉由換氣而恢復的肺活量。所以，若上一個句末停頓時長是被設定成相對較小的第一停頓時長參數P31的數值，可理解成是在模擬真人僅用相對較短的時間進行換氣，在此情況下，該處理單元11便會利用相對較小的第一補充時間值來計算該第二剩餘時間值，藉此模擬真人在說出下一句話之前，只能恢復相對較少的肺活量。反之，若上一個句末停頓時長是被設定成相對較大的第二停頓時長參數P32的數值，可理解成是在模擬真人用相對較長的時間進行換氣，在此情況下，該處理單元11便會利用相對較大的第二補充時間值來計算該第二剩餘時間值，藉此模擬真人在說出下一句話之前，能夠恢復相對較多的肺活量。More specifically, which supplementary time value the processing unit 11 uses to calculate the second remaining time value depends on whether the previous end-of-sentence pause duration is set to the value of the first pause duration parameter P31 or the value of the second pause duration parameter P32. Each supplementary time value can be understood as simulating the lung capacity restored by a real person through breathing after speaking a sentence. Therefore, if the pause duration at the end of the previous sentence is set to a relatively small value of the first pause duration parameter P31, it can be understood as simulating a real person using only a relatively short time to breathe. In this case, the processing unit 11 will use the relatively small first supplement time value to calculate the second remaining time value, thereby simulating that a real person can only recover a relatively small amount of lung capacity before speaking the next sentence. On the contrary, if the pause duration at the end of the previous sentence is set to a relatively large value of the second pause duration parameter P32, it can be understood as simulating a real person taking a relatively long time to breathe. In this case, the processing unit 11 will use the relatively large second supplement time value to calculate the second remaining time value, thereby simulating a real person to recover relatively more lung capacity before speaking the next sentence.

在該處理單元11計算出該第二剩餘時間值之後，流程進行至步驟S10。After the processing unit 11 calculates the second remaining time value, the process proceeds to step S10.

在步驟S10中，該處理單元11根據該第二剩餘時間值、該第一門檻值、該第二門檻值以及該第三門檻值來決定出一對應於該第二語句文字部分的句末停頓時長。並且，為了便於描述，在此將該第二語句文字部分所對應的該句末停頓時長作為一第二句末停頓時長。In step S10, the processing unit 11 determines a sentence end pause duration corresponding to the second sentence text portion according to the second remaining time value, the first threshold value, the second threshold value, and the third threshold value. And, for ease of description, the sentence end pause duration corresponding to the second sentence text portion is referred to as a second sentence end pause duration.

該第二句末停頓時長是表示電腦語音在將該第二語句文字部分播放完畢後，在繼續播放下一個語句文字部分前所要停頓的時間長度。而且，在本實施例中，該處理單元11決定該第二句末停頓時長的方式，是將該第二剩餘時間值與該第一門檻值（例如步驟S1中舉例的「1.75」）、該第二門檻值（例如步驟S1中舉例的「-1.75」）以及該第三門檻值（例如步驟S1中舉例的「-3.5」）進行比對，並根據比對結果決定該第二句末停頓時長，但並不以此為限。The second end-of-sentence pause duration indicates the length of time that the computer voice pauses after playing the second sentence text portion and before continuing to play the next sentence text portion. Moreover, in this embodiment, the processing unit 11 determines the second end-of-sentence pause duration by comparing the second remaining time value with the first threshold value (e.g., "1.75" as exemplified in step S1), the second threshold value (e.g., "-1.75" as exemplified in step S1), and the third threshold value (e.g., "-3.5" as exemplified in step S1), and determining the second end-of-sentence pause duration based on the comparison result, but the present invention is not limited thereto.

更具體地說，在一方面，若該處理單元11判斷出該第二剩餘時間值大於等於該第一門檻值（例如大於等於1.75），則該處理單元11將該第二句末停頓時長設定為該最短停頓時長參數P30的數值（例如「0.2」）。More specifically, on the one hand, if the processing unit 11 determines that the second remaining time value is greater than or equal to the first threshold value (for example, greater than or equal to 1.75), the processing unit 11 sets the second end-of-sentence pause duration to the value of the shortest pause duration parameter P30 (for example, "0.2").

另一方面，若該處理單元11判斷出該第二剩餘時間值小於該第一門檻值且大於等於該第二門檻值（例如小於1.75且大於等於-1.75），則該處理單元11將該第二句末停頓時長設定為該第一停頓時長參數P31的數值（例如「0.4」）。On the other hand, if the processing unit 11 determines that the second remaining time value is less than the first threshold value and greater than or equal to the second threshold value (for example, less than 1.75 and greater than or equal to -1.75), the processing unit 11 sets the second end-of-sentence pause duration to the value of the first pause duration parameter P31 (for example, "0.4").

再一方面，若該處理單元11判斷出該第二剩餘時間值小於該第二門檻值且大於等於該第三門檻值（例如小於-1.75且大於等於-3.5），則該處理單元11將該第二句末停頓時長設定為該第二停頓時長參數P32的數值（例如「0.8」）。On the other hand, if the processing unit 11 determines that the second remaining time value is less than the second threshold value and greater than or equal to the third threshold value (for example, less than -1.75 and greater than or equal to -3.5), the processing unit 11 sets the second end-of-sentence pause duration to the value of the second pause duration parameter P32 (for example, "0.8").

又一方面，若該處理單元11判斷出該第二剩餘時間值小於該第三門檻值（例如小於-3.5），則該處理單元11除了將該第二句末停頓時長設定為該第二停頓時長參數P32的數值之外，還將該第二語句文字部分拆分成多個具有順序性且能共同構成該第二語句文字部分的語句片段，以及進一步決定出另一對應於該第二語句文字部分的句中停頓時長。而且，對於該第二語句文字部分，該處理單元11只有在該第二剩餘時間值小於該第三門檻值的情況下，才會將該第二語句文字部分拆分成該等語句片段並決定出該句中停頓時長，而且，該處理單元11在本實施例中例如是將該句中停頓時長設定為該第一停頓時長參數P31的數值（例如「0.4」），但並不以此為限。On the other hand, if the processing unit 11 determines that the second remaining time value is less than the third threshold value (for example, less than -3.5), the processing unit 11 not only sets the pause duration at the end of the second sentence to the value of the second pause duration parameter P32, but also splits the second sentence text portion into a plurality of sentence segments that are sequential and can together constitute the second sentence text portion, and further determines another mid-sentence pause duration corresponding to the second sentence text portion. Moreover, for the second sentence text portion, the processing unit 11 will split the second sentence text portion into the sentence segments and determine the pause duration in the sentence only when the second remaining time value is less than the third threshold value. Moreover, in this embodiment, the processing unit 11 sets the pause duration in the sentence to the value of the first pause duration parameter P31 (for example, "0.4"), but is not limited to this.

在該處理單元11決定出該第二句末停頓時長之後，流程進行至步驟S11。After the processing unit 11 determines the pause duration at the end of the second sentence, the process proceeds to step S11.

在步驟S11中，對於該文字資料的第三個語句文字部分至第倒數第二個語句文字部分的每一者，該處理單元11計算出其每一語句文字部分所對應的字元數量、預計耗費時長、剩餘時間值以及句末停頓時長。該處理單元11在本步驟中對於每一字元數量、每一預計耗費時長、每一剩餘時間值及每一句末停頓時長的計算方式分別與步驟S7、步驟S8、步驟S9及步驟S10所述的計算方式大致相同，故在此不逐一重述。In step S11, for each of the third sentence text part to the second to last sentence text part of the text data, the processing unit 11 calculates the number of characters, the estimated time consumption, the remaining time value, and the pause time at the end of the sentence corresponding to each sentence text part. The calculation method of the processing unit 11 for each number of characters, each estimated time consumption, each remaining time value, and each pause time at the end of the sentence in this step is substantially the same as the calculation method described in step S7, step S8, step S9, and step S10, and therefore will not be repeated here one by one.

在該處理單元11計算出第三個語句文字部分至第倒數第二個語句文字部分之每一者所對應的句末停頓時長之後，流程進行至步驟S12。After the processing unit 11 calculates the sentence end pause duration corresponding to each of the third sentence text portion to the second to last sentence text portion, the process proceeds to step S12.

在步驟S12中，對於該文字資料的最後一個語句文字部分（在本實施例中是被作為一第十一語句文字部分，亦即前述所示之該文字資料中的「也是非常快樂的」），該處理單元11計算出一對應於該第十一語句文字部分的剩餘時間值（在本實施例中是被作為一第十一剩餘時間值）。其中，該處理單元11對於該第十一剩餘時間值的計算方式概與步驟S7至步驟S9中計算該第二剩餘時間值的方式相同，故在此不再重述。In step S12, for the last sentence text portion of the text data (in this embodiment, it is regarded as an eleventh sentence text portion, that is, "also very happy" in the text data shown above), the processing unit 11 calculates a remaining time value corresponding to the eleventh sentence text portion (in this embodiment, it is regarded as an eleventh remaining time value). The processing unit 11 calculates the eleventh remaining time value in the same manner as the second remaining time value calculated in steps S7 to S9, so it will not be repeated here.

在該處理單元11計算出該第十一語句文字部分所對應的該第十一剩餘時間值之後，流程進行至步驟S13。After the processing unit 11 calculates the eleventh remaining time value corresponding to the eleventh sentence text portion, the process proceeds to step S13.

在步驟S13中，該處理單元11決定一對應於該第十一語句文字部分的句末停頓時長（在本實施例是被作為一第十一句末停頓時長）。並且，該處理單元11還將該第十一剩餘時間值與該第三門檻值（例如步驟S1中舉例的「-3.5」）進行比對，並根據比對結果決定是否要將該第十一語句文字部分進行拆分。In step S13, the processing unit 11 determines a sentence-end pause duration corresponding to the eleventh sentence text portion (in this embodiment, it is taken as an eleventh sentence-end pause duration). In addition, the processing unit 11 compares the eleventh remaining time value with the third threshold value (e.g., "-3.5" as exemplified in step S1), and determines whether to split the eleventh sentence text portion according to the comparison result.

在本實施例中，該處理單元11是將最後一個語句文字部分所對應的該句末停頓時長直接設定為該第二停頓時長參數P32的數值（例如「0.8」），換言之，在本實施例中，該第十一句末停頓時長的數值並不涉及該第十一語句文字部分的字元數量與該第十一剩餘時間值。然而，在其他實施例中，該處理單元11對於最後一個語句文字部分所對應之句末停頓時長的決定方式，也可以是與決定該第二句末停頓時長的方式大致相同，而並不以本實施例為限。In this embodiment, the processing unit 11 directly sets the sentence-end pause duration corresponding to the last sentence text portion to the value of the second pause duration parameter P32 (e.g., "0.8"). In other words, in this embodiment, the value of the eleventh sentence-end pause duration does not involve the number of characters in the eleventh sentence text portion and the eleventh remaining time value. However, in other embodiments, the processing unit 11 may determine the sentence-end pause duration corresponding to the last sentence text portion in a manner that is substantially the same as that for determining the second sentence-end pause duration, and is not limited to this embodiment.

此外，若該處理單元11判斷出該第十一剩餘時間值小於該第三門檻值（例如小於-3.5），該處理單元11將該第十一語句文字部分拆分成多個具有順序性且能共同構成該第十一語句文字部分的語句片段，以及決定出一對應於該第十一語句文字部分的句中停頓時長，而且，在本實施例中，該處理單元11例如是直接將該第一停頓時長參數P31的數值（例如「0.4」）作為該第十一語句文字部分所對應的句中停頓時長，但並不以此為限。反之，若該處理單元11判斷出該第十一剩餘時間值並未小於該第三門檻值，則該處理單元11不會將該第十一語句文字部分拆分，且也不會對該第十一語句文字部分設定句中停頓時長。In addition, if the processing unit 11 determines that the eleventh remaining time value is less than the third threshold value (for example, less than -3.5), the processing unit 11 splits the eleventh sentence text part into a plurality of sentence segments that are sequential and can together constitute the eleventh sentence text part, and determines a sentence pause length corresponding to the eleventh sentence text part. Moreover, in the present embodiment, the processing unit 11, for example, directly uses the value of the first pause length parameter P31 (for example, "0.4") as the sentence pause length corresponding to the eleventh sentence text part, but is not limited to this. On the contrary, if the processing unit 11 determines that the eleventh remaining time value is not less than the third threshold value, the processing unit 11 will not split the eleventh sentence text portion, and will not set the mid-sentence pause length for the eleventh sentence text portion.

在該處理單元11設定該第十一語句文字部分所對應的該句末停頓時長之後，流程進行至步驟S14。After the processing unit 11 sets the pause time at the end of the sentence corresponding to the text portion of the eleventh sentence, the process proceeds to step S14.

在步驟S14中，該處理單元11根據所有該等語句文字部分，以及該等語句文字部分所分別對應於的該等句末停頓時長，產生一對應於該文字資料且用於以電腦語音形式被播放出的語音資料。其中，該語音資料包含多個分別對應於該等語句文字部分的語句語音部分，以及多個分別對應於該等語句語音部分的句末停頓部分，而且，該等句末停頓部分除了分別對應於該等語句語音部分之外，還分別對應於該文字資料的該等語句文字部分。In step S14, the processing unit 11 generates a voice data corresponding to the text data and used to be played in the form of computer voice according to all the sentence text parts and the sentence end pause durations corresponding to the sentence text parts. The voice data includes a plurality of sentence voice parts corresponding to the sentence text parts, and a plurality of sentence end pause parts corresponding to the sentence voice parts. In addition, the sentence end pause parts correspond to the sentence text parts of the text data in addition to the sentence voice parts.

更詳細地說，每一語句語音部分是用於在該語音資料被播放的過程中，以電腦語音呈現出該語句語音部分本身所對應的該語句文字部分。另一方面，對於每一句末停頓部分，該句末停頓部分是接續在其本身所對應的該語句語音部分之後，並且是該語音資料中的一段靜音或者帶有換氣效果音的部分，換句話說，該句末停頓部分不會以電腦語音指示出該等語句文字部分的任何一者。進一步地，該句末停頓部分的持續時間長度，是相符於其所對應之該語句文字部分所對應的該句末停頓時長，舉例來說，若該第一語句文字部分所對應的句末停頓時長（即該第一句末停頓時長）為0.4秒，則該語音資料中對應於該第一語句文字部分的該句末停頓部分的持續時間長度便會為0.4秒。In more detail, each sentence voice part is used to present the sentence text part corresponding to the sentence voice part itself in the process of playing the voice data. On the other hand, for each sentence end pause part, the sentence end pause part is continuous with the sentence voice part corresponding to itself, and is a section of silence or a part with breathing effect in the voice data. In other words, the sentence end pause part will not indicate any of the sentence text parts in computer voice. Furthermore, the duration of the pause at the end of the sentence is consistent with the duration of the pause at the end of the sentence corresponding to the text portion of the sentence to which it corresponds. For example, if the duration of the pause at the end of the sentence corresponding to the text portion of the first sentence (i.e., the duration of the pause at the end of the first sentence) is 0.4 seconds, then the duration of the pause at the end of the sentence corresponding to the text portion of the first sentence in the voice data will be 0.4 seconds.

進一步地，假設該處理單元11有將其中一或多個語句文字部分拆分成多個語句片段，並對該（等）語句文字部分設定對應的句中停頓時長，則對於有被拆分成多個語句片段的每一語句文字部分，該語音資料中與該語句文字部分對應的該語句語音部分便會包括多個具有順序性的語音片段，以及M個句中停頓部分，且M為大於等於1的整數。更具體地說，該等語音片段是分別對應於該語句文字部分的該等語句片段，而用於以電腦語音分別指示出該等語句片段的字元內容。另一方面，該M個句中停頓部分的數量是該等語音片段的數量減一，而且，該M個句中停頓部分是分別接續在該等語音片段中的前M個語音片段之後。進一步地，對於每一句中停頓部分，該句中停頓部分是該語句語音部分中的一段靜音或者帶有換氣效果音的部分，換句話說，該句中停頓部分不會以電腦語音指示出該語句語音部分的任何一個語句片段。而且，該句中停頓部分的持續時間長度與該語句文字部分所對應的該句中停頓時長相符。具體舉一例來說，假設該處理單元11將該文字資料中的第一語句文字部分拆分成兩個語句片段，並將該第一語句文字部分所對應的該句中停頓時長設定為0.4秒，則在該語音資料中，對應於該第一語句文字部分的該語句語音部分便會包括兩個對應的語音片段，以及單一個介於該兩語音片段之間且持續時間長度為0.4秒的句中停頓部分。補充說明的是，將單一個語句文字部分拆分成多個語句片段可利用現有技術達成，故在此不詳述其細節。Furthermore, assuming that the processing unit 11 splits one or more of the sentence text parts into multiple sentence segments, and sets corresponding mid-sentence pause durations for the sentence text parts, then for each sentence text part that is split into multiple sentence segments, the sentence voice part corresponding to the sentence text part in the voice data will include multiple sequential voice segments and M mid-sentence pause parts, and M is an integer greater than or equal to 1. More specifically, the voice segments are the sentence segments that respectively correspond to the sentence text part, and are used to indicate the character contents of the sentence segments respectively by computer voice. On the other hand, the number of the M mid-sentence pauses is the number of the speech segments minus one, and the M mid-sentence pauses are respectively connected to the first M speech segments in the speech segments. Further, for each mid-sentence pause, the mid-sentence pause is a section of silence or a section with a breathing effect in the speech portion of the sentence. In other words, the mid-sentence pause will not indicate any sentence segment of the speech portion of the sentence in computer voice. Moreover, the duration of the mid-sentence pause is consistent with the duration of the mid-sentence pause corresponding to the text portion of the sentence. For example, assuming that the processing unit 11 splits the first sentence text portion in the text data into two sentence segments, and sets the pause duration corresponding to the first sentence text portion to 0.4 seconds, then in the voice data, the sentence voice portion corresponding to the first sentence text portion will include two corresponding voice segments, and a single pause portion between the two voice segments with a duration of 0.4 seconds. It should be noted that splitting a single sentence text portion into multiple sentence segments can be achieved using existing technologies, so the details are not described in detail here.

延續上述所示的該文字資料，以下示出對應於該文字資料之該語音資料的兩種示例性態樣。其中，該語音資料的每一語句語音部分是將其所對應的語句文字部分加上大括號「{}」來表示，每一句末停頓部分則是以半型的底線符號「_」來表示。此外，對於包括有語音片段及句中停頓部分的語句語音部分，每一語音片段是將其所對應的多個字元加上中括號「[]」來表示，而每一句中停頓部分則是以半形的井字號「#」來表示。Continuing with the text data shown above, two exemplary forms of the voice data corresponding to the text data are shown below. Among them, each sentence voice part of the voice data is represented by adding curly brackets "{}" to the corresponding sentence text part, and each end-of-sentence pause part is represented by a half-width underscore symbol "_". In addition, for the sentence voice part including a voice segment and a pause part in the sentence, each voice segment is represented by adding a plurality of characters corresponding to it to a middle bracket "[]", and each pause part in the sentence is represented by a half-width pound sign "#".

首先，假設該處理單元11並未將該文字資料中的任何一個語句文字部分拆分成多個語音片段。在此情況下，該語音資料的第一種示例性態樣為：「{學生時期老師們都會為我們精心策辦活動}_{萬聖節是一年之中最期待的節日之一}_{爸爸媽媽也都絞盡腦汁配合學校活動}_{為自己小孩打扮}_{有的時候是自己提出想要扮什麼}_{但更多的時候是不情願}_{迎合父母口味打扮成他們想要的樣子}_{即便如此}_{到了學校與同學一起慶祝}_{一起上街要糖}_{也是非常快樂的}_」。First, assume that the processing unit 11 does not split any of the text parts of the text data into multiple voice segments. In this case, the first exemplary state of the voice data is: "{When we were students, teachers would carefully plan activities for us}_{Halloween is one of the most anticipated holidays of the year}_{Moms and dads also racked their brains to cooperate with school activities}_{dress up their children}_{sometimes they would suggest what they want to dress up as}_{but more often they would be reluctant}_{dress up to their parents to suit their tastes}_{even so}_{celebrate with classmates at school}_{go to the streets to ask for candy together}_{and are very happy}_".

接著，假設該處理單元11有對於該文字資料中的第一、二、三、五、六、七、九個語句文字部分進行拆分並設定句中停頓時長。在此情況下，該語音資料的第二種示例性態樣為：「{[學生時期老師們都會]#[為我們精心策辦活動]}_{[萬聖節是一年之中]#[最期待的節日之一]}_{[爸爸媽媽]#[也都絞盡腦汁配合學校活動]}_{為自己小孩打扮}_{[有的時候]#[是自己提出想要扮什麼]}_{[但更多的時候]#[是不情願]}_{[迎合父母口味]#[打扮成他們想要的樣子]}_{即便如此}_{[到了學校]#[與同學一起慶祝]}_{一起上街要糖}_{也是非常快樂的}_」。Next, it is assumed that the processing unit 11 splits the first, second, third, fifth, sixth, seventh, and ninth sentence text parts in the text data and sets the pause duration in the sentence. In this case, the second exemplary state of the voice data is: "{[When we were students, teachers would]#[carefully plan activities for us]}_{[Halloween is]#[one of the most anticipated holidays of the year]}_{[Moms and dads]#[also rack their brains to coordinate school activities]}_{dress up their children}_{[Sometimes]#[they come up with what they want to dress up as]}_{[but more often]#[they are reluctant]}_{[cater to their parents' tastes]#[dress up what they want]}_{even so}_{[when we go to school]#[celebrate with classmates]}_{go to the streets to ask for candy together}_{it's still very happy}_".

在該處理單元11產生該對應於該文字資料的該語音資料之後，流程進行至步驟S15。After the processing unit 11 generates the voice data corresponding to the text data, the process proceeds to step S15.

在步驟S15中，該處理單元11將該語音資料傳送至該使用端5，以供該使用端5將該語音資料以電腦語音的形式播放，以供使用者收聽。補充說明的是，在該語音資料產生系統1被實施為電子裝置的實施例中，該語音資料也可以是由該語音資料產生系統1本身進行播放。In step S15, the processing unit 11 transmits the voice data to the user terminal 5, so that the user terminal 5 plays the voice data in the form of computer voice for the user to listen to. It should be noted that in the embodiment where the voice data generating system 1 is implemented as an electronic device, the voice data can also be played by the voice data generating system 1 itself.

延續前述之該語音資料的第一種示例性態樣，藉由該語音資料所包含的該等語句語音部分及該等句末停頓部分，假設該第一句末停頓時長被設定為0.4秒，該第二句末停頓時長被設定為0.8秒，則在該語音資料被該使用端5播放的過程中，該使用端5在播放完「學生時期老師們都會為我們精心策辦活動」的第一個語句語音部分後，會先停頓0.4秒（亦即播放第一個句末停頓部分），才接著播放「萬聖節是一年之中最期待的節日之一」的第二個語句語音部分，並且，該使用端5在播放完第二個語句語音部分後，會先停頓0.8秒（亦即播放第二個句末停頓部分），才接著播放下一個語句語音部分，餘下以此類推。藉此，該語音資料能夠模擬真人說話時，根據說出的句子長短而產生不同長度之換氣停頓的情形。Continuing the first exemplary aspect of the voice data described above, based on the voice parts of the sentences and the pause parts at the end of the sentences contained in the voice data, assuming that the pause duration at the end of the first sentence is set to 0.4 seconds and the pause duration at the end of the second sentence is set to 0.8 seconds, when the voice data is played by the user terminal 5, the user terminal 5 plays the video "Teachers will carefully plan activities for us during our student years". After the first speech part, there will be a pause of 0.4 seconds (i.e., the pause at the end of the first sentence is played), and then the second speech part of "Halloween is one of the most anticipated festivals of the year" will be played. Moreover, after the user terminal 5 finishes playing the second speech part, there will be a pause of 0.8 seconds (i.e., the pause at the end of the second sentence is played), and then the next speech part will be played, and so on. In this way, the voice data can simulate the situation of breathing pauses of different lengths when a real person speaks according to the length of the sentence spoken.

以上即為本實施例之語音資料產生系統1如何實施該語音資料產生方法的示例說明。The above is an example of how the voice data generating system 1 of this embodiment implements the voice data generating method.

補充說明的是，在其他的實施例中，對於每一語句文字部分，該處理單元11也可以是在計算出該語句文字部分所對應的該字元數量之後，直接將該第一字元數量乘以一預設權重值來決定該語句文字部分所對應的該句末停頓時長。或者，該處理單元11也可以是在計算出該語句文字部分所對應的該預計耗費時長之後，直接將該預計耗費時長乘以一預定權重值來決定該語句文字部分所對應的該句末停頓時長。所以，該處理單元11對於每一語句文字部分決定其對應之句末停頓時長的方式並不以本實施例為限。It should be noted that in other embodiments, for each sentence text part, the processing unit 11 may also directly multiply the first character number by a preset weight value after calculating the character number corresponding to the sentence text part to determine the sentence end pause length corresponding to the sentence text part. Alternatively, the processing unit 11 may also directly multiply the expected time consumption corresponding to the sentence text part by a preset weight value after calculating the expected time consumption corresponding to the sentence text part. Therefore, the way in which the processing unit 11 determines the sentence end pause length corresponding to each sentence text part is not limited to this embodiment.

應當理解的是，本實施例的步驟S1至步驟S15及圖2的流程圖僅是用於示例說明本發明語音資料產生方法的其中一種可實施方式。應當理解的是，即便將步驟S1至步驟S15進行合併、拆分或順序調整，若合併、拆分或順序調整之後的流程與本實施例相比係以實質相同的方式達成實質相同的功效，便仍屬於本發明語音資料產生方法的可實施態樣，因此，本實施例的步驟S1至步驟S15及圖2的流程圖並非用於限制本發明的可實施範圍。It should be understood that steps S1 to S15 of the present embodiment and the flowchart of FIG2 are only used to illustrate one of the practicable ways of the voice data generating method of the present invention. It should be understood that even if steps S1 to S15 are merged, split or adjusted in sequence, if the process after merging, splitting or adjusting in sequence achieves substantially the same effect as the present embodiment in substantially the same way, it still belongs to the practicable aspects of the voice data generating method of the present invention. Therefore, steps S1 to S15 of the present embodiment and the flowchart of FIG2 are not used to limit the practicable scope of the present invention.

本發明還提供了一種電腦程式產品的一實施例。該電腦程式產品能被儲存於電腦可讀取紀錄媒體（例如硬碟、隨身碟及記憶卡等），並且包含一應用程式。該應用程式包括圖1所示的該語速參數P1、該預設可用時長參數P2及該等停頓時長參數P3，且能被一電子裝置（例如手機、平板電腦、筆記型電腦及桌上型電腦等）載入並運行。並且，當該應用程式被該電子裝置載入並運行時，該應用程式能使該電子裝置被作為本發明所提供的語音資料產生系統（例如圖1所示的該語音資料產生系統1），而對一文字資料實施本發明所提供的語音資料產生方法。The present invention also provides an embodiment of a computer program product. The computer program product can be stored in a computer-readable recording medium (such as a hard disk, a flash drive, and a memory card, etc.), and includes an application program. The application program includes the speech rate parameter P1, the preset available time parameter P2, and the pause time parameters P3 shown in FIG1, and can be loaded and run by an electronic device (such as a mobile phone, a tablet computer, a laptop computer, and a desktop computer, etc.). Moreover, when the application program is loaded and run by the electronic device, the application program can enable the electronic device to be used as a voice data generation system provided by the present invention (such as the voice data generation system 1 shown in FIG1), and implement the voice data generation method provided by the present invention on a text data.

綜上所述，藉由對該文字資料實施該語音資料產生方法，該語音資料產生系統1能至少根據該文字資料之每一語句文字部分的字元數量來決定對應於該語句文字部分的句末停頓時長，從而設定該語音資料中接續在每一語句語音部分之後的句末停頓部分的持續時間長度，藉此，該語音資料產生系統1所產生的該語音資料能以語句的長短為依據，從而模擬真人說話時因換氣所導致之不同長度的停頓，所以，該語音資料產生系統1有助於使電腦語音更加接近真人的說話方式，而確實能達成本發明之目的。In summary, by implementing the voice data generation method on the text data, the voice data generation system 1 can determine the length of the sentence-end pause corresponding to the text portion of the sentence at least according to the number of characters in each sentence text portion of the text data, thereby setting the duration of the sentence-end pause portion following the voice portion of each sentence in the voice data. In this way, the voice data generated by the voice data generation system 1 can be based on the length of the sentence, thereby simulating pauses of different lengths caused by breathing when a real person speaks. Therefore, the voice data generation system 1 helps to make computer voice closer to the way a real person speaks, and can indeed achieve the purpose of the present invention.

惟以上所述者，僅為本發明之實施例而已，當不能以此限定本發明實施之範圍，凡是依本發明申請專利範圍及專利說明書內容所作之簡單的等效變化與修飾，皆仍屬本發明專利涵蓋之範圍內。However, the above is only an example of the implementation of the present invention, and it should not be used to limit the scope of the implementation of the present invention. All simple equivalent changes and modifications made according to the scope of the patent application of the present invention and the content of the patent specification are still within the scope of the patent of the present invention.

1··········· 語音資料產生系統 11········· 處理單元 12········· 儲存單元 P1········· 語速參數 P2········· 預設可用時長參數 P3········· 停頓時長參數 P30········ 最短停頓時長參數 P31········ 第一停頓時長參數 P32········ 第二停頓時長參數 5··········· 使用端 S1~S15·········· 步驟 1··········· Voice data generation system 11········· Processing unit 12········· Storage unit P1········· Speech rate parameter P2········· Default available time parameter P3········· Pause time parameter P30········ Shortest pause time parameter P31········ First pause time parameter P32········ Second pause time parameter 5············ User end S1~S15·········· Steps

本發明之其他的特徵及功效，將於參照圖式的實施方式中清楚地呈現，其中：圖1是一方塊示意圖，示例性地表示本發明語音資料產生系統的一實施例，以及一適用於與該實施例配合的使用端；及圖2（由圖2A及2B組成）是一流程圖，用於示例性地說明該實施例如何對一文字資料實施一語音資料產生方法。 Other features and functions of the present invention will be clearly presented in the implementation method with reference to the drawings, wherein: FIG. 1 is a block diagram, exemplarily showing an implementation of the voice data generation system of the present invention, and a user terminal suitable for use with the implementation; and FIG. 2 (composed of FIG. 2A and FIG. 2B) is a flow chart, which is used to exemplarily illustrate how the implementation implements a voice data generation method for a text data.

S1~S15·········· 步驟S1~S15·········· Steps

Claims

A method for generating speech data is implemented by a speech data generating system on a text data, wherein the speech data generating system stores a speech speed parameter and a preset available time parameter, and the speech speed parameter is related to the speech speed when the data is output in the form of sound; the method for generating speech data comprises: (A) calculating the number of characters corresponding to a sentence text portion according to a sentence text portion contained in the text data; (B) calculating an expected time corresponding to the sentence text portion according to the number of characters and the speech speed parameter, and then determining a sentence end pause length corresponding to the sentence text portion at least according to the expected time, wherein the sentence text portion is a paragraph of the text data. In the case of the first sentence text part in the text, the voice data generation system determines the sentence end pause duration in a manner that the sentence end pause duration is determined according to a remaining time value, and the remaining time value is the difference between the preset available time parameter and the expected time duration; and (C) generating a voice data corresponding to the text data and used to be output in the form of sound based on at least the sentence text part and the sentence end pause duration, wherein the voice data includes a sentence voice part for indicating the sentence text part by voice, and a sentence end pause part that follows the sentence voice part, does not indicate any sentence text part, and lasts for a time length that matches the sentence end pause duration.

The voice data generation method as described in claim 1, wherein: the voice data generation system also stores a first pause duration parameter and a second pause duration parameter greater than the first pause duration parameter; In step (B), when the sentence text portion is the first sentence text portion in the paragraph of the text data, the voice data generation system determines the pause duration at the end of the sentence in the following manner: if the remaining time value is greater than or equal to a threshold value, the pause duration at the end of the sentence is set to the value of the first pause duration parameter; and if the remaining time value is less than the threshold value, the pause duration at the end of the sentence is set to the value of the second pause duration parameter.

A method for generating speech data is implemented by a speech data generating system on a text data. The speech data generating system stores a speech speed parameter and a preset available time parameter, and the speech speed parameter is related to the speech speed when the data is output in the form of sound. The method for generating speech data comprises: (A) calculating the number of characters corresponding to a sentence text part according to a sentence text part included in the text data; (B) calculating an expected time corresponding to the sentence text part according to the number of characters and the speech speed parameter, and determining a sentence end pause time corresponding to the sentence text part at least according to the expected time, wherein, when the sentence text part is the Nth sentence text part in a paragraph of the text data (N is an integer greater than 1), The voice data generation system determines the end-of-sentence pause duration in a manner that determines the end-of-sentence pause duration based on a remaining time value, and the remaining time value is related to the preset available time parameter, and is also related to another remaining time value and another end-of-sentence pause duration corresponding to the (N-1)th sentence text portion of the paragraph of the sentence text portion; and (C) generating a voice data corresponding to the text data and used to be output in the form of sound based on at least the sentence text portion and the end-of-sentence pause duration, wherein the voice data includes a sentence voice portion for indicating the sentence text portion by voice, and a sentence end-of-sentence pause portion that follows the sentence voice portion, does not indicate any sentence text portion, and lasts for a time length that matches the end-of-sentence pause duration.

The method for generating speech data as described in claim 3, wherein: the speech data generating system further stores a minimum pause duration parameter, a first pause duration parameter greater than the minimum pause duration parameter, and a second pause duration parameter greater than the first pause duration parameter; in step (B), when the sentence text portion is the Nth sentence text portion in the paragraph of the text data, the speech data generating system determines the sentence end pause duration The method includes: if the remaining time value is greater than or equal to a first threshold value which is a positive number, the pause duration at the end of the sentence is set to the value of the shortest pause duration parameter; if the remaining time value is less than the first threshold value and greater than or equal to a second threshold value which is a negative number, the pause duration at the end of the sentence is set to the value of the first pause duration parameter; and if the remaining time value is less than the second threshold value, the pause duration at the end of the sentence is set to the value of the second pause duration parameter.

The method for generating speech data as claimed in any one of claims 1 and 3, wherein: in step (B), when the remaining time value is less than a negative threshold value, the speech data generating system further splits the text portion of the sentence into a plurality of sequential sentence segments, sets the pause duration at the end of the sentence to the value of a pause duration parameter, and determines the pause duration in the sentence; and in step (C), when the remaining time value is less than a negative threshold value, the speech data generating system further splits the text portion of the sentence into a plurality of sequential sentence segments, sets the pause duration at the end of the sentence to the value of a pause duration parameter, and determines the pause duration in the sentence; When the time value is less than the threshold value, the speech part of the sentence includes a plurality of speech segments that are sequential and are used to indicate the sentence segments by voice, and M mid-sentence pause portions that do not indicate any sentence segment and whose duration is consistent with the duration of the mid-sentence pause, where M is an integer greater than or equal to 1, and the M mid-sentence pause portions are respectively connected to the first M speech segments in the speech segments.

The speech data generation method as described in any one of claim items 1 and 3 further includes a step before step (A): (D) defining the text portion of the sentence from the text data according to one or more specific symbols contained in the text data.

A speech data generation system comprises: a storage unit storing a speech speed parameter and a preset available time parameter, wherein the speech speed parameter is related to the speech speed when the data is output in the form of sound; and a processing unit electrically connected to the storage unit and configured to implement the speech data generation method as described in any one of claims 1 to 6 for a text data.

A computer program product includes an application program, wherein the application program includes a speech rate parameter and a preset available time parameter, and the speech rate parameter is related to the speech speed when outputting data in the form of sound, and the application program is suitable for being loaded and run by an electronic device, so that the electronic device implements the speech data generation method described in any one of claim items 1 to 6 for a text data.