
TWI520131B - Speech Recognition System Based on Joint Time-Frequency Domain and Its Method - Google Patents


Info

Publication number
TWI520131B
Authority
TW
Taiwan
Prior art keywords
speech
time
sound
frequency domain
characteristic
Prior art date
Application number
TW102136684A
Other languages
Chinese (zh)
Other versions
TW201514977A (en)
Inventor
Tai Shih Chi
Chung Chien Hsu
Tse En Lin
Jian Hueng Chen
Yi Cheng Chen
Original Assignee
Chunghwa Telecom Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chunghwa Telecom Co Ltd filed Critical Chunghwa Telecom Co Ltd
Priority to TW102136684A priority Critical patent/TWI520131B/en
Publication of TW201514977A publication Critical patent/TW201514977A/en
Application granted granted Critical
Publication of TWI520131B publication Critical patent/TWI520131B/en


Landscapes

  • Telephone Function (AREA)

Description

Speech recognition system based on joint time-frequency domain features and method thereof

The present invention is a speech and non-speech recognition system and method, and in particular a speech/non-speech discrimination system and method based on joint time-domain and frequency-domain modulation energy.

With the rapid development of mobile communication devices, voice input has become a popular solution for providing a more convenient operating interface and enabling smart electronic products to offer better services.

When a mobile device is used outdoors or in a noisy environment, background noise can cause the speech recognition device to make errors, degrading service quality. Conventional techniques for suppressing background noise therefore include: (1) filters designed from time-domain features of the signal, such as frame energy and zero-crossing rate; (2) filters designed from frequency-domain features of the signal, targeting specific frequency bands; and (3) filters combining one or both of the above, with machine learning used to improve noise robustness.

However, environmental noise is highly unpredictable, so filters designed from only partial time- or frequency-domain features cannot accurately distinguish speech from noise in complex or low signal-to-noise-ratio environments, which degrades the quality of voice services. Filters optimized by machine learning, for their part, require a lengthy training process and substantial resources, so that approach has also not seen wide practical deployment.

To solve the technical problems of the prior art, one object of the present invention is to analyze and recognize the specific structure of speech, thereby solving the problem of speech recognition amid diverse background noise.

To achieve the above object, the present invention provides a system for recognizing speech and non-speech, comprising a sound conversion module, a feature analysis module, an extraction module, and a decision module. First, the sound conversion module converts the input audio signal into a two-dimensional time-frequency image and passes it to the feature analysis module. The feature analysis module analyzes the image to obtain a plurality of sound features and passes them to the extraction module. The extraction module then extracts a speech-characteristic identification value from the sound features and passes it to the decision module. Finally, the decision module compares the speech-characteristic identification value against a speech threshold to separate the speech and non-speech portions of the audio signal.

To achieve the above object, the present invention further provides a method for recognizing speech and non-speech, comprising the following steps: converting the input audio signal into a two-dimensional time-frequency image via the sound conversion module; analyzing the image with the feature analysis module to obtain a plurality of sound features; extracting a speech-characteristic identification value from the sound features with the extraction module; and finally comparing the speech-characteristic identification value against a built-in speech threshold with the decision module to distinguish the speech and non-speech signals within the audio signal.

Because a conventional filter can only be designed for a specific type of noise, its filtering performance and range are limited. In contrast, the speech/non-speech recognition system and method of the present invention analyze the specific structure of speech itself, so that the desired signal can still be extracted under complex background noise, providing high-quality speech recognition service.

1‧‧‧Speech/non-speech recognition system

3‧‧‧Sound conversion module

5‧‧‧Feature analysis module

7‧‧‧Extraction module

9‧‧‧Decision module

20‧‧‧Audio signal

22‧‧‧Two-dimensional time-frequency image

24‧‧‧Sound features

26‧‧‧Speech-characteristic identification value

28‧‧‧Output signal

Figure 1 is an architectural diagram of the speech/non-speech recognition system of the present invention.

Figures 2 and 3 are schematic diagrams of the two-dimensional filters used in the invention.

Figures 4 and 5 are schematic diagrams of the sound features obtained after the audio signal is extracted by the two-dimensional filters.

Figure 6 is a flow chart of the speech/non-speech recognition method of the present invention.

Specific embodiments are described below to illustrate the invention; they are not intended to limit the scope of protection sought.

An embodiment of the speech/non-speech recognition system is disclosed first. Referring to Figure 1, an architectural diagram of the system, the speech/non-speech recognition system 1 mainly comprises a sound conversion module 3, a feature analysis module 5, an extraction module 7, and a decision module 9.

The sound conversion module 3 converts the received audio signal 20 into a two-dimensional time-frequency image 22 and passes it to the feature analysis module 5 connected to the sound conversion module 3. The feature analysis module 5 then analyzes the two-dimensional time-frequency image 22 to obtain a plurality of sound features 24, which it passes to the extraction module 7 connected to the feature analysis module 5. The extraction module 7 performs extraction on the sound features 24 and sends the resulting speech-characteristic identification value 26 to the decision module 9 connected to the extraction module 7. The decision module 9 then compares the speech-characteristic identification value 26 against a speech threshold to distinguish speech signals from non-speech signals.

In addition, after receiving the audio signal 20, the sound conversion module 3 may further divide it into a plurality of frames, apply a short-time Fourier transform to each frame, and recombine the results to form the two-dimensional time-frequency image 22. This image 22 is an auditory spectrogram simulating the cortical output of the human auditory system.
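The frame-splitting and short-time Fourier transform described above can be sketched in a few lines. This is a generic illustration, not the patent's implementation; the frame length, hop size, and window choice are all assumptions:

```python
import numpy as np

def to_time_frequency_image(signal, frame_len=256, hop=128):
    """Split a 1-D audio signal into frames and apply a short-time
    Fourier transform, yielding a 2-D time-frequency magnitude image
    (rows = frequency bins, columns = frames)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Magnitude spectrum of each frame; transpose so time runs along x.
    return np.abs(np.fft.rfft(frames, axis=1)).T

# A 100 Hz tone sampled at 8 kHz: energy concentrates in one low bin.
fs = 8000
t = np.arange(fs) / fs
image = to_time_frequency_image(np.sin(2 * np.pi * 100 * t))
print(image.shape)  # (129, 61)
```

The resulting matrix plays the role of the image 22 handed to the feature analysis stage.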

Referring now to Figures 2 and 3, which illustrate speech modulation directionality based on the two-dimensional time-frequency image: because speech signals exhibit specific harmonicity and frequency-modulation directionality, observing the audio signal along the time and frequency axes clearly reveals the direction of its frequency modulation. In Figure 2, the frequency of the input signal decreases over time within a given interval, so its modulation direction is downward; its time-axis envelope change rate is 4 Hz and its frequency-axis envelope change rate is 2 ms. In Figure 3, the frequency of the input signal increases over time, so its modulation direction is upward; its time-axis envelope change rate is 8 Hz and its frequency-axis envelope change rate is 4 ms. From the two-dimensional time-frequency image 22, the modulation direction of the audio signal 20 can thus be determined and passed to the subsequent modules for analysis.

To obtain further analysis parameters, the feature analysis module 5 uses a bank of two-dimensional time-frequency impulse-response band-pass filters to produce the plurality of joint time-domain and frequency-domain sound features 24 from the two-dimensional time-frequency image 22. The filter bank takes the time-axis envelope change rate and the frequency-axis envelope change rate as design parameters to generate a plurality of band-pass filters. The number of band-pass filters is determined by the product of the number of time-axis envelope change rates, the number of frequency-axis envelope change rates, and the number of time-frequency impulse-response directions. These filters are divided by speech and non-speech characteristics into two groups, speech-characteristic filters and non-speech-characteristic filters, which process speech and non-speech signals respectively. The sound features 24 mainly comprise the parameters of time, frequency, time-axis envelope change rate, and frequency-axis envelope change rate.
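A sketch of the filter-bank sizing rule and of one Gabor-like two-dimensional impulse response follows. The rate and scale grids, the filter shape, and all constants are illustrative assumptions; the patent specifies only that the filter count is the product of the three design dimensions:

```python
import numpy as np

# Hypothetical design grid (values assumed, not from the patent):
temporal_rates = [4, 8, 16, 32]      # time-axis envelope rates, Hz
spectral_scales = [2, 4, 5]          # frequency-axis envelope rates
directions = [+1, -1]                # upward / downward sweep
n_filters = len(temporal_rates) * len(spectral_scales) * len(directions)
print(n_filters)  # 24

def spectro_temporal_filter(rate, scale, direction, n_t=32, n_f=32):
    """A Gabor-like 2-D impulse response tuned to one
    (rate, scale, direction) triple; the direction sign sets the
    orientation of the sweep in the time-frequency plane."""
    t = np.linspace(-1, 1, n_t)
    f = np.linspace(-1, 1, n_f)
    T, F = np.meshgrid(t, f)
    envelope = np.exp(-(T ** 2 + F ** 2) * 4)       # localizing window
    carrier = np.cos(2 * np.pi * (rate * T / 8 + direction * scale * F / 4))
    return envelope * carrier

h = spectro_temporal_filter(4, 2, +1)
print(h.shape)  # (32, 32)
```

Convolving the time-frequency image with each such filter yields one modulation-energy channel per (rate, scale, direction) combination.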

For further clarity, refer to Figures 4 and 5, both schematic diagrams of the sound features 24. When an audio signal 20 containing both speech and unpredictable wind noise is processed in turn by the sound conversion module 3 and the feature analysis module 5, the sound features 24 shown in Figures 4 and 5 are obtained. Figure 4 shows the modulation-energy distribution of the speech signal over the time-axis and frequency-axis envelope change rates; Figure 5 shows the corresponding distribution for the non-speech signal. By observing the difference between the two energy distributions, the present invention can clearly separate speech from non-speech signals.

In Figure 4, most of the speech modulation energy falls in the region where the time-axis envelope change rate is 4 Hz and the frequency-axis envelope change rate is 5 ms; in Figure 5, most of the non-speech modulation energy falls where the time-axis envelope change rate is 32 Hz and the frequency-axis envelope change rate is 2 ms. The desired signal can therefore be extracted simply by selecting these specific regions.

Next, the extraction module 7 passes the features through the speech-characteristic and non-speech-characteristic filters to obtain, for each, the modulation energy over the time- and frequency-axis change-rate distributions, then multiplies the speech-characteristic and non-speech-characteristic modulation energies by their respective weights (for example, the weight of the speech-characteristic modulation energy may be set to 1 and that of the non-speech-characteristic modulation energy to -1) to obtain a speech-characteristic weighted score and a non-speech-characteristic weighted score.
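The weighting step can be illustrated as follows, using the example weights +1 and -1 mentioned above; the energy values and the threshold are invented purely for illustration:

```python
import numpy as np

# Modulation energies from the speech-characteristic filters get
# weight +1, those from the non-speech-characteristic filters get
# weight -1, and their sum forms the identification value compared
# against a threshold. All numbers below are illustrative.
speech_energy = np.array([0.9, 0.7, 0.8])     # from speech filters
nonspeech_energy = np.array([0.2, 0.1, 0.3])  # from non-speech filters

score = 1.0 * speech_energy.sum() + (-1.0) * nonspeech_energy.sum()
print(round(score, 2))  # 1.8

speech_threshold = 0.0  # assumed threshold value
label = "speech" if score > speech_threshold else "non-speech"
print(label)  # speech
```

With this sign convention, speech-dominated frames push the score positive and noise-dominated frames push it negative, so a single threshold separates the two classes.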

Finally, the decision module 9 compares the speech and non-speech weighted scores against the speech threshold to determine whether the audio signal 20 is a speech signal or a non-speech signal, achieving the purpose of speech and non-speech recognition.

An embodiment of the speech/non-speech recognition method is disclosed next. Referring to Figure 6, a flow chart of the method, the method comprises the following steps. Step 601: convert the audio signal into a two-dimensional time-frequency image via the sound conversion module. Step 603: analyze the two-dimensional time-frequency image with the feature analysis module to obtain a plurality of sound features. Step 605: extract a speech-characteristic identification value from the sound features with the extraction module. Step 607: compare the speech-characteristic identification value against the speech threshold with the decision module to distinguish the speech portion from the non-speech portion of the audio signal.
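The four steps can be sketched as one pipeline. Every function body here is a placeholder standing in for the corresponding module, not the patent's actual processing:

```python
import numpy as np

def convert(signal):                 # step 601: signal -> T-F image
    frames = signal.reshape(-1, 64) * np.hanning(64)
    return np.abs(np.fft.rfft(frames, axis=1)).T

def analyze(image):                  # step 603: image -> sound features
    return {"time_rate": image.mean(axis=0), "freq_rate": image.mean(axis=1)}

def extract(features):               # step 605: features -> identification value
    return float(features["time_rate"].mean() - features["freq_rate"].std())

def decide(value, threshold=0.0):    # step 607: compare with threshold
    return "speech" if value > threshold else "non-speech"

signal = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 640))
result = decide(extract(analyze(convert(signal))))
print(result)
```

The point of the sketch is the module chaining, conversion, analysis, extraction, decision, rather than the placeholder arithmetic inside each stage.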

In this embodiment, the audio signal obtained in step 601 may first be divided by the sound conversion module into a plurality of frames, which are transformed by a short-time Fourier transform into the two-dimensional time-frequency image and passed to the feature analysis module for processing. The image so produced is an auditory spectrogram simulating the cortical output of the human auditory system.

The feature analysis module of step 603 is a bank of two-dimensional time-frequency impulse-response band-pass filters that performs speech-structure analysis on the joint time- and frequency-domain energy change rates of the two-dimensional image, produces a plurality of sound features, and passes them to the extraction module for the next stage of processing. In practice the filter bank is composed of a plurality of band-pass filters, whose number is the product of the number of time-axis envelope change rates, frequency-axis envelope change rates, and time-frequency impulse-response directions. The filter bank is divided by speech and non-speech characteristics into a plurality of speech-characteristic filters and a plurality of non-speech-characteristic filters, which process speech and non-speech signals respectively. The sound features comprise the parameters of time, frequency, time-axis envelope change rate, and frequency-axis envelope change rate.

Step 605 then passes the sound features through the speech-characteristic filters of the filter bank and multiplies by the speech-characteristic filter weights to obtain a speech-characteristic weighted score, which forms one part of the speech-characteristic identification value; likewise, step 605 passes the sound features through the non-speech-characteristic filters and multiplies by the non-speech-characteristic filter weights to obtain a non-speech-characteristic weighted score, which forms the other part of the identification value.

Finally, step 607 compares the speech-characteristic weighted score and the non-speech-characteristic weighted score against the speech threshold as the basis for judging whether the audio signal is a speech or a non-speech signal.

The detailed description above is a specific account of one feasible embodiment of the invention; it is not intended to limit the patent scope of the invention, and any equivalent implementation or modification that does not depart from the spirit of the invention shall be included within the patent scope of this case.


Claims (21)

1. A system for recognizing speech and non-speech, comprising: a sound conversion module that converts an audio signal into a two-dimensional time-frequency image; a feature analysis module, connected to the sound conversion module, that analyzes the two-dimensional time-frequency image to obtain a plurality of sound features, the sound features comprising speech signals and non-speech signals, wherein the modulation energy of the speech signal falls in the region where the time-axis envelope change rate = 4 Hz and the frequency-axis envelope change rate = 5 ms, and the modulation energy of the non-speech signal falls in the region where the time-axis envelope change rate = 32 Hz and the frequency-axis envelope change rate = 2 ms; an extraction module, connected to the feature analysis module, that extracts a speech-characteristic identification value from the plurality of sound features; and a decision module, connected to the extraction module, that compares the speech-characteristic identification value with a speech threshold to distinguish the speech portion from the non-speech portion of the audio signal.

2. The system of claim 1, wherein the sound conversion module divides the audio signal into a plurality of frames and applies a short-time Fourier transform to each frame to produce the two-dimensional time-frequency image.

3. The system of claim 1, wherein the two-dimensional time-frequency image in the sound conversion module is an auditory spectrogram simulating the cortical output of the human auditory system.

4. The system of claim 1, wherein the feature analysis module uses a two-dimensional time-frequency impulse-response band-pass filter bank to produce the plurality of joint time-domain and frequency-domain sound features from the two-dimensional time-frequency image.

5. The system of claim 4, wherein the two-dimensional time-frequency impulse-response band-pass filter bank is composed of a plurality of band-pass filters.

6. The system of claim 5, wherein the number of the band-pass filters equals the product of the number of time-axis envelope change rates, the number of frequency-axis envelope change rates, and the number of time-frequency impulse-response directions.

7. The system of claim 4, wherein the plurality of sound features are time, frequency, the time-axis envelope change rate, and the frequency-axis envelope change rate.

8. The system of claim 6, wherein the plurality of band-pass filters are divided by speech and non-speech characteristics into a plurality of speech-characteristic filters and a plurality of non-speech-characteristic filters.

9. The system of claim 8, wherein the extraction module obtains speech-characteristic and non-speech-characteristic modulation energy values through the speech-characteristic filters and the non-speech-characteristic filters respectively, and multiplies the speech-characteristic modulation energy values by the weights of the speech-characteristic filters and the non-speech-characteristic modulation energy values by the weights of the non-speech-characteristic filters, thereby obtaining a speech-characteristic weighted score and a non-speech-characteristic weighted score.

10. The system of claim 9, wherein the decision module compares the speech-characteristic weighted score and the non-speech-characteristic weighted score against the speech threshold to determine whether the audio signal is the speech signal or the non-speech signal.

11. A method for recognizing speech and non-speech, comprising the following steps: step 1: converting the audio signal into the two-dimensional time-frequency image via the sound conversion module; step 2: analyzing the two-dimensional time-frequency image with the feature analysis module to obtain the plurality of sound features, the sound features comprising speech signals and non-speech signals, wherein the modulation energy of the speech signal falls in the region where the time-axis envelope change rate = 4 Hz and the frequency-axis envelope change rate = 5 ms, and the modulation energy of the non-speech signal falls in the region where the time-axis envelope change rate = 32 Hz and the frequency-axis envelope change rate = 2 ms; step 3: extracting the speech-characteristic identification value from the plurality of sound features with the extraction module; step 4: comparing the speech-characteristic identification value with the speech threshold using the decision module to distinguish the speech portion from the non-speech portion of the audio signal.

12. The method of claim 11, wherein the audio signal is divided by the sound conversion module into a plurality of frames, each frame is transformed by a short-time Fourier transform to produce the two-dimensional time-frequency image, and the image is passed to the feature analysis module for processing.

13. The method of claim 11, wherein the two-dimensional time-frequency image in the sound conversion module is an auditory spectrogram simulating the cortical output of the human auditory system.

14. The method of claim 11, wherein the feature analysis module is a two-dimensional time-frequency impulse-response band-pass filter bank that performs speech-structure analysis on the joint time-domain and frequency-domain energy change rates of the two-dimensional time-frequency image to produce the plurality of sound features, which are passed to the extraction module for processing.

15. The method of claim 14, wherein the two-dimensional time-frequency impulse-response band-pass filter bank is composed of the plurality of band-pass filters.

16. The method of claim 14, wherein the plurality of sound features are time, frequency, the time-axis envelope change rate, and the frequency-axis envelope change rate.

17. The method of claim 15, wherein the number of the plurality of band-pass filters is obtained by multiplying the number of time-axis envelope change rates, the number of frequency-axis envelope change rates, and the number of time-frequency impulse-response directions.

18. The method of claim 15, wherein the plurality of band-pass filters are divided by speech and non-speech characteristics into a plurality of speech-characteristic filters and a plurality of non-speech-characteristic filters.

19. The method of claim 18, wherein the speech-characteristic identification value includes a speech-characteristic weighted score obtained by passing the plurality of sound features through the speech-characteristic filters of the plurality of band-pass filters and multiplying by the weights of the speech-characteristic filters.

20. The method of claim 19, wherein a non-speech-characteristic weighted score is obtained by passing the plurality of sound features through the non-speech-characteristic filters of the plurality of band-pass filters and multiplying by the weights of the non-speech-characteristic filters.

21. The method of claim 20, wherein the speech-characteristic weighted score and the non-speech-characteristic weighted score are compared against the speech threshold as the basis for determining whether the audio signal is the speech signal or the non-speech signal.
TW102136684A 2013-10-11 2013-10-11 Speech Recognition System Based on Joint Time - Frequency Domain and Its Method TWI520131B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW102136684A TWI520131B (en) 2013-10-11 2013-10-11 Speech Recognition System Based on Joint Time - Frequency Domain and Its Method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW102136684A TWI520131B (en) 2013-10-11 2013-10-11 Speech Recognition System Based on Joint Time - Frequency Domain and Its Method

Publications (2)

Publication Number Publication Date
TW201514977A TW201514977A (en) 2015-04-16
TWI520131B true TWI520131B (en) 2016-02-01

Family

ID=53437707

Family Applications (1)

Application Number Title Priority Date Filing Date
TW102136684A TWI520131B (en) 2013-10-11 2013-10-11 Speech Recognition System Based on Joint Time - Frequency Domain and Its Method

Country Status (1)

Country Link
TW (1) TWI520131B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI689865B (en) * 2017-04-28 2020-04-01 塞席爾商元鼎音訊股份有限公司 Smart voice system, method of adjusting output voice and computer readable memory medium
TWI768676B (en) * 2021-01-25 2022-06-21 瑞昱半導體股份有限公司 Audio processing method and audio processing device, and associated non-transitory computer-readable medium

Also Published As

Publication number Publication date
TW201514977A (en) 2015-04-16

Similar Documents

Publication Publication Date Title
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN113823293B (en) A speaker recognition method and system based on speech enhancement
CN110827837A (en) Whale activity audio classification method based on deep learning
CN106504763A (en) Multi-target Speech Enhancement Method Based on Microphone Array Based on Blind Source Separation and Spectral Subtraction
CN102483926B (en) System and method for noise reduction in processing speech signals by targeting speech and disregarding noise
CN103646649A (en) High-efficiency voice detecting method
CN108831440A (en) A kind of vocal print noise-reduction method and system based on machine learning and deep learning
US9997168B2 (en) Method and apparatus for signal extraction of audio signal
US20210287674A1 (en) Voice recognition for imposter rejection in wearable devices
CN103985390A (en) Method for extracting phonetic feature parameters based on gammatone relevant images
CN116416996A (en) A multi-modal speech recognition system and method based on millimeter wave radar
CN105825857A (en) Voiceprint-recognition-based method for assisting deaf patient in determining sound type
CN106548786A (en) A kind of detection method and system of voice data
CN103258537A (en) Method utilizing characteristic combination to identify speech emotions and device thereof
Murugaiya et al. Probability enhanced entropy (PEE) novel feature for improved bird sound classification
TWI520131B (en) Speech Recognition System Based on Joint Time - Frequency Domain and Its Method
WO2018095167A1 (en) Voiceprint identification method and voiceprint identification system
Pour et al. Gammatonegram based speaker identification
CN105916090A (en) Hearing aid system based on intelligent speech recognition technology
Fathima et al. Gammatone cepstral coefficient for speaker Identification
CN112908347A (en) Noise detection method and terminal
CN108538290A (en) Intelligent household control method based on audio signal detection
CN120544610B (en) Power cable fault discharge sound recognition method and system based on multi-feature fusion
CN115064182A (en) Fan fault feature identification method of self-adaptive Mel filter in strong noise environment
CN118155608B (en) Miniature microphone voice recognition system for multi-noise environment

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees