
TWI299855B - Detection method for voice activity endpoint - Google Patents

Detection method for voice activity endpoint

Info

Publication number
TWI299855B
TWI299855B TW95131216A
Authority
TW
Taiwan
Prior art keywords
zero
voice
energy
threshold
speech
Prior art date
Application number
TW95131216A
Other languages
Chinese (zh)
Other versions
TW200811833A (en)
Inventor
Chung Po Liao
Original Assignee
Inventec Besta Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inventec Besta Co Ltd filed Critical Inventec Besta Co Ltd
Priority to TW95131216A priority Critical patent/TWI299855B/en
Publication of TW200811833A publication Critical patent/TW200811833A/en
Application granted granted Critical
Publication of TWI299855B publication Critical patent/TWI299855B/en

Links

Landscapes

  • Electrically Operated Instructional Devices (AREA)
  • Telephonic Communication Services (AREA)

Description

…input parameters of the linear regression method; and obtaining at least one active speech start point and at least one active speech end point from the active speech and the inactive speech according to the energy threshold and the zero-crossing rate threshold.

VII. Designated representative figure:
(1) The designated representative figure of this application is Fig. 3.
(2) Brief description of the reference symbols in the representative figure: step (a) to step (e).

VIII. If the application contains a chemical formula, the formula that best characterizes the invention: (none).

IX. Description of the Invention:

[Technical Field]

The present invention relates to a speech recognition detection method, and more particularly to an active speech endpoint detection method for improving the accuracy with which active voice is recognized.

[Prior Art]

Although a digitized speech signal could in principle be used directly for recognition, the amount of data is too large, the processing time too long, and the efficiency too poor for the raw speech to be stored in full as standard reference samples. Features must instead be extracted according to the characteristics of the digitized speech signal, so that suitable feature parameters are available for comparison and recognition; representing the speech signal by feature parameters also reduces the amount of data and increases efficiency. A typical flow for speaker-independent Mandarin speech recognition, shown in Fig. 1, comprises the following steps:

Step (1): Speech signal input processing. After the speech signal is input, each signal to be analyzed is segmented with digital signal processing techniques, cutting the speech portion into a number of frames for the next step.

Step (2): Pre-processing of the speech signal. The main function of the pre-processing is endpoint detection, which determines the start and end points of a stretch of speech.

Step (3): Feature parameter extraction, usually mel cepstral parameters (the mel is a unit of measure of the perceived pitch or frequency of a tone). After the time-domain signal is converted to a spectrum, tools such as filters extract the spectral values at the mel-scale frequencies, and taking the logarithm yields the desired parameters (see the sketch after this list for the Hz-to-mel conversion).

Step (4): Recognition with the hidden Markov model (HMM) method. After endpoint detection and framing, feature vectors are extracted from the input speech file and compared against trained hidden Markov models, computing how probable it is that the input was generated by a given sequence of hidden Markov models, thereby completing the recognition.
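The mel scale referred to in step (3) is commonly defined by the logarithmic mapping sketched below. The patent does not state which variant of the formula its front end uses, so the constants here are the conventional ones and should be read as an assumption rather than the author's exact implementation. (All code sketches in this document use Python.)

```python
import math

def hz_to_mel(f_hz: float) -> float:
    # Conventional mel-scale mapping; by construction, 1000 Hz maps to
    # roughly 1000 mel. The constants 2595 and 700 are the usual choice.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

print(round(hz_to_mel(1000.0)))  # -> 1000 (approximately)
```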
With the judgment methods currently in use, frames of the input signal are still sometimes misclassified as active voice (the audible speech of a conversation) or inactive voice (the pauses, silence, or background noise within a conversation). When such misjudgments occur, feature extraction operates on target speech that mixes active and inactive voice, and the accuracy of recognition drops. How to segment the extent of active speech accurately is therefore a key problem in speech recognition.

[Summary of the Invention]

It is accordingly an object of the present invention to provide an active speech endpoint detection method suitable for speech recognition, which updates an energy threshold and a zero-crossing rate threshold from the energy and zero-crossing rate of the input frames and, together with a multiple linear regression procedure and further decision rules, improves the accuracy with which the start and end points of active speech are determined.

In accordance with the above object, the active speech endpoint detection method comprises: (a) receiving at least one continuous speech signal and extracting a plurality of frames from it; (b) computing the energies of the frames and deriving an energy threshold from them; (c) computing the zero-crossing rates of the frames and deriving a zero-crossing rate threshold from them; (d) applying a linear regression method, with the energies and zero-crossing rates as its input parameters, to judge whether each frame is active speech or inactive speech; and (e) obtaining, according to the energy threshold and the zero-crossing rate threshold, at least one active speech start point and at least one active speech end point from the active and inactive speech.

[Detailed Description of the Embodiments]

The presently preferred embodiments are discussed in detail below. It should be understood, however, that the invention provides many applicable inventive concepts that can be embodied in a wide variety of specific contexts. The specific embodiments discussed merely illustrate particular ways of using the invention and do not limit its scope.

Voice activity detection determines whether genuine human speech is present, and in recent years it has been widely used in communications to save power. In recognition it serves as a pre-processing stage whose quality strongly affects the result: accurate voice activity detection reduces the influence of noise and raises the recognition rate. Traditional voice activity detection relies on information such as energy or zero-crossing rate alone; the present invention additionally introduces a multiple linear regression procedure and further decision rules, so that recognition can proceed smoothly where prior techniques, drawing on too few speech parameters, lose accuracy. The method is described below with the experimental figures of an embodiment and the accompanying drawings.

The active speech endpoint detection method comprises the following steps.

Step (a): Receive at least one continuous speech signal and extract a plurality of frames from it. Speech is a time-varying signal, but observation of real speech shows that it changes only slowly over short intervals. Speech signal processing therefore usually adopts the short-time stationarity assumption: a fixed number of samples is taken as one frame, the speech signal is cut into a number of frames, and the features of each frame are observed and exploited.
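The following is a minimal sketch of the framing in step (a). The non-overlapping layout and the 22.5 ms frame length (the value used in the experiments reported below) are illustrative assumptions; the method itself fixes only that each frame contains a constant number of samples.

```python
import numpy as np

def split_into_frames(signal: np.ndarray, sample_rate: int,
                      frame_ms: float = 22.5) -> np.ndarray:
    """Cut a 1-D speech signal into consecutive fixed-length frames.

    Non-overlapping frames and the 22.5 ms length are assumptions; the
    patent requires only a fixed number of samples per frame (the
    short-time stationarity assumption).
    """
    frame_len = int(sample_rate * frame_ms / 1000.0)  # samples per frame
    n_frames = len(signal) // frame_len               # discard the tail remainder
    return signal[:n_frames * frame_len].reshape(n_frames, frame_len)

# Example: 1.5 s of 8 kHz audio yields 66 frames of 180 samples each.
frames = split_into_frames(np.zeros(12000), sample_rate=8000)
print(frames.shape)  # (66, 180)
```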
Step (b): Compute the energies of the frames and derive an energy threshold from them. As shown in Fig. 2, which illustrates speech segmentation and endpoint detection, a small window is taken from the start of the speech buffer (that is, the frames of the continuous speech), and the energy accumulated within that window over one time interval is computed, where the time interval is the time from one extracted frame to the next. After the energies of all frames have been computed, a relative energy threshold is derived from them and compared with a pre-estimated energy minimum, the larger of the two being taken as the energy threshold. Here the pre-estimated energy minimum is obtained by measuring a stretch of silence in a quiet environment, to serve as an estimated floor, while the relative energy threshold is 1/32 of the maximum energy over all frames.
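A sketch of step (b) under the rules just stated. The sum-of-squares energy measure is an assumption, since the patent does not pin down the exact energy formula; the 1/32 ratio and the comparison with a pre-measured minimum are as specified above.

```python
import numpy as np

def frame_energies(frames: np.ndarray) -> np.ndarray:
    # Short-time energy per frame; sum of squared samples is assumed here.
    return np.sum(frames.astype(np.float64) ** 2, axis=1)

def energy_threshold(energies: np.ndarray, estimated_min: float) -> float:
    # Step (b): the relative threshold is 1/32 of the maximum frame energy;
    # the final threshold is the larger of that and an energy floor measured
    # beforehand in a quiet environment.
    relative = energies.max() / 32.0
    return max(relative, estimated_min)
```

A caller would measure `estimated_min` once on a recording of silence and reuse it across utterances, which is how the pre-estimated minimum is described above.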
After step (b), step (c) is performed: compute the zero-crossing rates of the frames and derive a zero-crossing rate threshold from them. In this embodiment the zero-crossing rate threshold is obtained by comparing a preset value with a relative zero-crossing rate derived from the frames, the smaller of the two being taken as the threshold. The preset value follows the literature [Shaughnessy '87, p. 125], which puts the voiced/unvoiced boundary at a zero-crossing rate of 3000 crossings/s; the relative zero-crossing rate is the average zero-crossing rate of the frames whose energy lies below the energy threshold.

Step (d): Apply a linear regression method, with the energies and zero-crossing rates of the frames as its input parameters, to judge whether each frame is active speech or inactive speech. In this embodiment the method is an application of multiple linear regression, derived from the field of regression analysis, which finds the relationship between two or more variables and thereby predicts the trend of data from a group of variables; here the energies and the zero-crossing rates serve as the two input variables of the regression.
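A sketch of steps (c) and (d). The zero-crossing count uses the usual sign-change definition, which the patent does not spell out, and the classification rule at the end is schematic: the text says only that a multiple linear regression over energy and zero-crossing rate judges each frame, so the fitted weights, the bias, and the 0.5 cutoff are assumptions.

```python
import numpy as np

def zero_crossing_rates(frames: np.ndarray, sample_rate: int) -> np.ndarray:
    # Zero crossings per second for each frame, so the values are directly
    # comparable with the 3000 crossings/s boundary cited in step (c).
    sign_changes = np.abs(np.diff(np.sign(frames), axis=1)) > 0
    frame_seconds = frames.shape[1] / sample_rate
    return sign_changes.sum(axis=1) / frame_seconds

def zcr_threshold(zcrs, energies, e_thr, preset=3000.0):
    # Step (c): the smaller of the preset voiced/unvoiced boundary
    # (3000 crossings/s, after [Shaughnessy '87]) and the average ZCR of
    # frames whose energy lies below the energy threshold.
    low = zcrs[energies < e_thr]
    relative = float(low.mean()) if low.size else preset
    return min(preset, relative)

def classify_frames(energies, zcrs, weights, bias, cutoff=0.5):
    # Step (d), schematically: a linear model on (energy, ZCR) scores each
    # frame, and frames above the cutoff are treated as active speech.
    x = np.column_stack([energies, zcrs])
    return x @ np.asarray(weights) + bias > cutoff
```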
Step (e): Obtain, according to the energy threshold and the zero-crossing rate threshold, at least one active speech start point and at least one active speech end point from the active and inactive speech. Both thresholds are used together because the nasals and breathy sounds of speech carry little energy and are easily misjudged as inactive voice and deleted, which leads to recognition errors; adding the zero-crossing rate test distinguishes consonants from inactive speech. During inactive speech only background noise is present and the zero-crossing rate of the silence is low, whereas consonant signals have an appreciable zero-crossing rate, so a suitable threshold separates inactive speech from consonants.

In step (e), when a selected frame is active speech and the energies of the selected frame and the following frame both exceed the energy threshold, the zero-crossing rates of the two frames preceding the selected frame are examined: if either exceeds the zero-crossing rate threshold, the active speech start point is moved one or two frames earlier than the selected frame; otherwise the selected frame itself is the active speech start point.

When a selected frame is inactive speech, the active speech start point has already been found, and the energies of the selected frame and the following five frames are all below the energy threshold, the zero-crossing rates of the two frames following the selected frame are examined: if either exceeds the zero-crossing rate threshold, the active speech end point is moved one or two frames later than the selected frame; otherwise the selected frame itself is the active speech end point.

Five consecutive frames are required because the frame energies sometimes fall below the threshold without the speech truly ending: during continuous speech a talker pauses briefly, so the extracted frames break off without genuine silence. The method therefore stipulates that once the frame energies drop from above the threshold to below it, five consecutive such frames must pass before the active speech is truly considered finished.
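A sketch of the step-(e) rules just described. The patent leaves open whether the endpoint shifts by one or by two frames when the neighboring zero-crossing rates exceed the threshold; the sketch resolves this, as an assumption, by shifting one frame per high-ZCR neighbor, and the fallback when no end is found is likewise assumed.

```python
def find_endpoints(active, energies, zcrs, e_thr, z_thr):
    """Apply the start- and end-point rules of step (e) to per-frame data.

    `active` holds the per-frame regression decisions from step (d).
    Returns (start, end) frame indices, or None if no start is found.
    """
    n = len(active)
    start = None
    for i in range(2, n - 1):
        # Start rule: an active frame whose energy, and the next frame's,
        # exceed the energy threshold.
        if active[i] and energies[i] > e_thr and energies[i + 1] > e_thr:
            # Pull the start earlier over preceding high-ZCR (consonant)
            # frames; one frame per qualifying neighbor is an assumed
            # resolution of the patent's "one or two frames".
            hot = sum(1 for j in (i - 1, i - 2) if zcrs[j] > z_thr)
            start = i - hot
            break
    if start is None:
        return None
    for i in range(start + 1, n - 5):
        # End rule: an inactive frame followed by five more frames whose
        # energies all stay below the threshold (the five-frame guard
        # against brief pauses inside an utterance).
        if not active[i] and all(energies[i + k] < e_thr for k in range(6)):
            # Push the end later over trailing high-ZCR (consonant) frames.
            hot = sum(1 for j in (i + 1, i + 2) if zcrs[j] > z_thr)
            return start, i + hot
    return start, n - 1  # assumed fallback: speech runs to the buffer end
```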
The following experimental figures were obtained by applying the above detection flow to active speech endpoints. The experimental corpus is taken from the late-February and March 2003 issues of the "Let's Talk in English" teaching material: 25 speech files in all, each sampled at 8 kHz, quantized at 16 bits per sample, mono, with an average length of about one and a half minutes and a frame length of 22.5 ms. The corpus consists mostly of conversation between people, which makes it well suited as a database for voice activity detection. The first 20 files, about 28.5 minutes in total, served for training, and the last 5 files, about 7.5 minutes in total, for testing.

Each experiment compares the voice activity decisions derived from the input parameters against the correct voice activity labels. Three error rates are computed: the total error rate E_total, the rate E_n_a at which inactive frames are judged active, and the rate E_a_n at which active frames are judged inactive. The results are compared with the VAD of G.729 in Table 1 below.
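The scoring itself is easy to reconstruct from per-frame labels; a sketch follows. Expressing the three rates as percentages of all frames is an assumption, though it is consistent with the tables below, where E_total equals E_n_a plus E_a_n in every row.

```python
import numpy as np

def vad_error_rates(predicted: np.ndarray, reference: np.ndarray):
    # predicted / reference: boolean per-frame labels, True = active speech.
    # E_n_a counts inactive frames judged active; E_a_n counts active frames
    # judged inactive; both as percentages of all frames (assumed convention).
    n = len(reference)
    e_n_a = np.count_nonzero(~reference & predicted) / n * 100.0
    e_a_n = np.count_nonzero(reference & ~predicted) / n * 100.0
    return e_n_a + e_a_n, e_n_a, e_a_n
```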

Table 1 — error rates (%) before weight adjustment:

    VAD type                                 E_total    E_n_a      E_a_n
    Multiple linear regression (training)    11.54      6.6563     4.8837
    G.729 (training)                         22.243     21.619     0.62432
    Multiple linear regression (test)        16.808     13.903     2.9049
    G.729 (test)                             27.945     25.052     2.8938

Table 1 shows that, for the total error rate and for inactive speech judged active, multiple linear regression outperforms G.729 on both the training and the test corpus. For active speech judged inactive, however, multiple linear regression does worse on the training corpus, and this kind of error has the greater effect on recognition, because judging active speech as inactive often causes consonants to be dropped, which in turn causes recognition errors. It is therefore desirable to reduce it while letting the total error rate rise only slightly.

Adjusting the weights of the multiple linear regression input variables shifts the balance of these error rates. Correcting the energy weight downward lowers the rate at which active frames are judged inactive, at the cost of more inactive frames being judged active; correcting the zero-crossing rate weight upward has a similar effect. Here the energy weight was changed while the zero-crossing rate weight was held fixed, the active-judged-inactive error on the training corpus being deliberately brought close to 1%; the regression coefficients finally trained were 2.3089, 0.47486, and 0.50885.

Table 2 — error rates (%) after weight adjustment:

    VAD type                                          E_total    E_n_a      E_a_n
    Weighted multiple linear regression (training)    12.826     11.835     0.99187
    G.729 (training)                                  22.243     21.619     0.62432
    Weighted multiple linear regression (test)        20.011     19.511     0.4999
    G.729 (test)                                      27.945     25.052     2.8938

The re-test results in Table 2 show that the weighted multiple linear regression method outperforms the VAD of G.729 in every case, while the error of judging active speech as inactive is held at about the 1% level.

Although the present invention has been disclosed above in preferred embodiments, they are not intended to limit it; anyone skilled in the art may make various changes and refinements without departing from the spirit and scope of the invention, whose protected scope is therefore defined by the appended claims.

[Brief Description of the Drawings]

To make the above and other objects, features, advantages and embodiments of the present invention more readily understandable, the accompanying drawings are described as follows:

Fig. 1 shows the flow of speaker-independent Mandarin speech recognition.
Fig. 2 is a schematic diagram of speech segmentation and endpoint detection.
Fig. 3 is a flowchart of the active speech endpoint detection method for speech recognition.

[Description of the Main Reference Symbols]

Step (1) to Step (4)
Step (a) to Step (e)

Claims (1)

1299855 十、申請專利範園: 1 ·種活動語音端點之销測方法,包含下列步驟: (a) 接收至少一連續語音,並自該連續語音擷取複 數段音框; (b) 计异該些音框之能量,並根據該些能量取得一 能量門限值;1299855 X. Applying for a patent garden: 1 · A method for measuring the voice endpoint of an activity, comprising the following steps: (a) receiving at least one continuous voice, and extracting a plurality of voice frames from the continuous voice; (b) The energy of the sound boxes, and an energy threshold is obtained according to the energy; …(c)分別計算該些音框之越零率,並根據該些越零 率取得一越零率門限值; 及 (d)使用一線性迴歸演繹法,並以該些能量及該些 ,零率作為該線性迴歸演繹法之輸人參數,用以判& 母-該些音框是否為-活動語音或—非活動語音;以 、(〇根據該能量門限值及該越零率門限值,自該些(c) separately calculating the zero-crossing rate of the sound boxes, and obtaining a zero-zero rate threshold based on the zero-crossing rates; and (d) using a linear regression deductive method, and using the energies and the The zero rate is used as the input parameter of the linear regression deduction method to determine whether the parent box is the active voice or the inactive voice; and (based on the energy threshold and the zero crossing rate threshold) Value, from these ,語音及該些非活動語音中取得至少—活動語; 起點及至少一活動語音終點。 2. Λ申Γ專利範圍第1項之活動語音端點偵測方 更包核對—難能錄小值及一 對應該些月b 1之相對能量門限值,以兩者之中 值作為該能量門限值。 …中該預估能量最小值係為-於安靜無聲之環 15 1299855 、 9.如申請專利範圍第1項之活動語音端點偵測方 • 法,其中該步驟(e)更包含當所選音框為一非活動語 音,且已取得該活動語音起點,且該所選音框及後 複數段音框之能量皆小於該能量門限值,並判斷該 所選音框及後複數段音框之越零率是否大於該越零 率門限值,若有大於則活動語音終點由該所選音框 往後移動複數段音框,若無大於則該所選音框係為 # 一活動語音終點。 17, at least one of the speech and the inactive speech, the activity term; the starting point and the at least one active speech end point. 2. 活动 Λ Γ Γ Γ Γ 语音 语音 语音 语音 语音 语音 语音 语音 语音 语音 语音 语音 语音 语音 语音 语音 语音 语音 语音 语音 语音 语音 语音 语音 语音 语音 语音 语音 语音 语音 语音 语音 语音 语音 语音 语音 语音 语音 语音Threshold. The minimum estimated energy in the system is - in the quiet silent ring 15 1299855, 9. The active speech endpoint detection method in the first application of the patent scope, wherein the step (e) further includes The sound box is an inactive voice, and the active voice start point has been obtained, and the energy of the selected sound box and the subsequent plurality of sound boxes are less than the energy threshold, and the selected sound box and the subsequent plurality of sound boxes are determined. Whether the zero-crossing rate is greater than the zero-crossing rate threshold, if there is more than, the active speech end point moves the multi-segment sound box backward from the selected sound box, and if not, the selected sound frame is ##active speech end point . 17
TW95131216A 2006-08-24 2006-08-24 Detection method for voice activity endpoint TWI299855B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW95131216A TWI299855B (en) 2006-08-24 2006-08-24 Detection method for voice activity endpoint

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW95131216A TWI299855B (en) 2006-08-24 2006-08-24 Detection method for voice activity endpoint

Publications (2)

Publication Number Publication Date
TW200811833A TW200811833A (en) 2008-03-01
TWI299855B true TWI299855B (en) 2008-08-11

Family

ID=44767866

Family Applications (1)

Application Number Title Priority Date Filing Date
TW95131216A TWI299855B (en) 2006-08-24 2006-08-24 Detection method for voice activity endpoint

Country Status (1)

Country Link
TW (1) TWI299855B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8655655B2 (en) 2010-12-03 2014-02-18 Industrial Technology Research Institute Sound event detecting module for a sound event recognition system and method thereof
US9330676B2 (en) 2012-11-15 2016-05-03 Wistron Corporation Determining whether speech interference occurs based on time interval between speech instructions and status of the speech instructions
TWI659409B (en) * 2017-02-13 2019-05-11 大陸商芋頭科技(杭州)有限公司 Speech endpoint detection method and speech recognition method

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106847270B (en) * 2016-12-09 2020-08-18 华南理工大学 A dual-threshold place name voice endpoint detection method
US10460749B1 (en) * 2018-06-28 2019-10-29 Nuvoton Technology Corporation Voice activity detection using vocal tract area information

Also Published As

Publication number Publication date
TW200811833A (en) 2008-03-01

Similar Documents

Publication Publication Date Title
CN104835498B (en) Method for recognizing sound-groove based on polymorphic type assemblage characteristic parameter
Deshmukh et al. Use of temporal information: Detection of periodicity, aperiodicity, and pitch in speech
JP6303971B2 (en) Speaker change detection device, speaker change detection method, and computer program for speaker change detection
CN112133277B (en) Sample generation method and device
CN108922541B (en) Multi-dimensional feature parameter voiceprint recognition method based on DTW and GMM models
CN106023986B (en) A Speech Recognition Method Based on Sound Effect Pattern Detection
CN108682432B (en) Voice emotion recognition device
JP5050698B2 (en) Voice processing apparatus and program
KR20100036893A (en) Speaker cognition device using voice signal analysis and method thereof
CN102222498A (en) Voice judging system, voice judging method and program for voice judgment
CN108091340B (en) Voiceprint recognition method, voiceprint recognition system, and computer-readable storage medium
CN105895079B (en) Voice data processing method and device
TWI299855B (en) Detection method for voice activity endpoint
Khan et al. Hindi speaking person identification using zero crossing rate
CN108986844B (en) A Speech Endpoint Detection Method Based on Speaker's Speech Features
Badenhorst et al. Quality measurements for mobile data collection in the developing world.
Nandwana et al. A new front-end for classification of non-speech sounds: a study on human whistle
US7650281B1 (en) Method of comparing voice signals that reduces false alarms
Abushariah et al. Voice based automatic person identification system using vector quantization
JP2002189487A (en) Voice recognition device and voice recognition method
CN112786071A (en) Data annotation method for voice segments of voice interaction scene
Speights et al. Computer-assisted syllable analysis of continuous speech as a measure of child speech disorder
JP2006154212A (en) Voice evaluation method and evaluation apparatus
JP2004317822A (en) Feeling analysis/display device
CN107039046B (en) Voice sound effect mode detection method based on feature fusion

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees