TWI396186B - Speech enhancement technique based on blind source separation for far-field noisy speech recognition - Google Patents
- Publication number
- TWI396186B, TW98138353A
- Authority
- TW
- Taiwan
- Prior art keywords
- signal
- speech recognition
- noise
- calculation
- speech
- Prior art date
Landscapes
- Circuit For Audible Band Transducer (AREA)
- Telephonic Communication Services (AREA)
Description
The present invention is a far-field noisy speech recognition system built by combining independent component analysis with subspace speech enhancement. For related applications in the field of speech processing, such as noise cancellation, speech enhancement, and speech recognition, it proposes a noisy-speech recognition system architecture that integrates noise filtering of noisy speech with speech enhancement and recognition processing.
For speech separation and recognition, the present invention proposes a Blind Source Separation (BSS) speech enhancement technique to improve the far-field speech recognition rate: the speech signal and the noise signal in the mixed signal are separated individually, so that good recognition performance is obtained even in a noisy environment.
Among currently common speech recognition techniques and related Republic of China patents of recent years, the "System and method for selecting audio content using speech recognition" invented by Shen Jialin et al. in 2007 stores text content and speech information in a source database, and its recognition module performs speech recognition with an acoustic model; the model comparison methods are the Hidden Markov Model (HMM), Neural Networks, and Dynamic Time Warping (DTW). However, the recognition module of that technique is suitable only for close-range speech recognition, and a noisy environment also severely degrades its recognition rate.
In addition, in 2009, the "Voice-activated speech recognition controller" invented by Chen Maolin comprises a wireless transmitter and a wireless receiver, the transmitter and receiver being an RF transmitter and an RF receiver; the wireless transmitter is connected to the speech recognition component and the wireless receiver to the control component. In its recognition technique, the voice message undergoes pre-emphasis, windowing, and linear prediction coefficient processing, followed by cepstral coefficient processing; the voice message is then processed by the speech recognition component to determine whether it matches the voice command. That invention is likewise applicable only to speech recognition in a quiet environment and does not address processing in noisy environments. In the same year, the "Speech recognition device, system and method" invented by Zhang Jianyang et al. uses several databases of different levels in the electronic data processing device of the speech recognition system to store different speech data. The invention comprises a receiving module, a conversion module, a segmentation module, and a recognition module: the received speech signal is first converted into a digital signal, the digital signal is segmented according to segmentation rules into a plurality of chronologically ordered speech data, and matching speech data are retrieved from the database. That technique likewise provides no speech processing for noisy environments.
Surveying the speech recognition systems mentioned above, they are mostly close-range recognizers; that is, the distance between the speaker and the microphone must not be too great, because as that distance grows, noise and distance effects degrade recognition, and the close-range requirement also reduces convenience. Moreover, a noisy environment visibly impairs the speech recognition systems of the above patents, so how to remove unwanted noise signals in a real environment is also important technical content of the present invention.
The present invention focuses on deploying a voice control system for a ubiquitous environment, setting up two systems for two different noise environments, shown as the flows of the second and fourth figures. The second figure uses a microphone mixer to build the ubiquitous-environment computation: when microphones are deployed over a wide area for recording, a multi-channel recording card would be very costly, so the present invention proposes collecting sound with a microphone mixer that combines the signals of multiple microphones into a single signal. This reduces the computational load (only a single signal needs processing) and lowers the deployment cost. The subspace enhancement method then handles the noise, and the result is finally sent to the HTK (Hidden Markov Model Toolkit) speech recognizer.
In addition, for environments with stronger interference sources, the architecture of the fourth figure uses two-stage speech enhancement to remove the interfering signal: blind source separation combined with the subspace enhancement method removes the stronger interference source and raises the recognition rate of the speech recognizer.
As for the features of this technique, the system collects sound with a microphone array composed of two microphones, greatly reducing deployment cost and signal-processing complexity while effectively mitigating far-field noise interference, so that the enhanced speech signal is suitable for speech recognition. The present invention can therefore be applied to digital home voice-control devices, making home life more digital and convenient.
Compared with general speech recognition systems, because this system collects sound with a two-microphone array, it effectively lowers deployment cost, reduces computation, and performs speech enhancement against the noise interference of far-field use. Compared with general close-range speech recognition systems, it is not only more convenient but also more robust, effectively improving the convenience and accuracy of current conventional speech recognition.
In addition, the appended paper, "Far-field noisy speech recognition based on blind source separation speech enhancement", is attached to the present invention. The attachment was published on September 1-2, 2009 on the website of the Association for Computational Linguistics and Chinese Language Processing at http://www.aclclp.org.tw/rocling/rocling2009_c.php, and its disclosure is incorporated herein in its entirety.
This system proposes a blind source separation speech enhancement technique to handle the interference caused by a far-field noisy environment. The technical flow of the far-field noisy speech recognition system of the present invention is shown in the first figure and is divided into two main stages: a speech enhancement stage and a speech recognition stage.
Speech enhancement stage: sound is first collected through a two-channel microphone array; the captured two-channel signal is transmitted to the PC through the Line-In interface and sampled by the sound card, and the two-channel signal is then preliminarily separated into noise and speech signals by blind source separation, as shown in the second figure.
Here we use FastICA as the blind source separation algorithm. Preprocessing consists of two steps: centering and whitening. Centering subtracts the mean from the mixed signal, which simplifies the subsequent solution of the de-mixing matrix; it is computed as follows, where x denotes the mixed signal: x ← x − E{x}  (1)
The second preprocessing step is whitening the data. Its purpose is to make the transformed data mutually uncorrelated with unit variance; assuming the transformed data is z, its covariance matrix becomes the identity matrix. Whitening therefore consists of finding a whitening matrix V and linearly transforming the received signal x so that its covariance matrix is the identity matrix.
z = Vx,  E{zz^T} = I  (2)
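The centering and whitening steps above can be sketched as follows (a minimal NumPy sketch; the function names are illustrative, not part of the patent, and the whitening matrix is built as V = E D^(-1/2) E^T from the eigendecomposition of the covariance):

```python
import numpy as np

def center(x):
    """Centering, Eq. (1): subtract the per-channel mean of the mixed signal x (channels x samples)."""
    return x - x.mean(axis=1, keepdims=True)

def whiten(x):
    """Whitening, Eq. (2): find V so that z = Vx has E{zz^T} = I."""
    cov = np.cov(x)                              # covariance matrix of the channels
    d, E = np.linalg.eigh(cov)                   # cov = E diag(d) E^T
    V = E @ np.diag(1.0 / np.sqrt(d)) @ E.T      # V = E D^{-1/2} E^T
    return V @ x, V
```

After these two steps the transformed channels are uncorrelated with unit variance, which is what simplifies solving for the de-mixing matrix later.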
The FastICA core algorithm consists of an objective function and an optimization algorithm. For the objective function we choose negentropy, and for optimization we use Newton's method to reach the convergence value quickly. Entropy is defined, for discrete or continuous signals respectively, by the following formulas: H(y) = −Σ P(y) log P(y)  (3)
H(y) = −∫ f(y) log f(y) dy  (4)
For the speech signal, when the signal y is Gaussian its entropy is maximal, so for computational convenience we use negentropy as the criterion. As shown in formula (5), y_gauss is a Gaussian signal with the same covariance matrix as y; thus when y is Gaussian, the negentropy is zero. To simplify the computation, we reduce formula (5) to formula (6).
J(y) = H(y_gauss) − H(y)  (5)
J(y) ≈ [E{G(y)} − E{G(v)}]²  (6)
where G is a contrast function and the signal v is a Gaussian signal with zero mean and unit variance. Here we set E{G(y)} = E{G(W^T x)}, where W is the de-mixing matrix and x is the mixed signal, so formula (6) can be rewritten as formula (7): J(W) ≈ [E{G(W^T x)} − E{G(v)}]²  (7). When E{G(W^T x)} is extremal, the speech signal with the highest non-Gaussianity can be found, and the de-mixing matrix W is finally solved by Newton iteration.
W ← E{x G(W^T x)} − E{G'(W^T x)} W  (8)
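The Newton update of Eq. (8) can be sketched for a single de-mixing vector on whitened data (a hedged sketch, not the patent's implementation: it assumes the common tanh-based contrast and normalizes w after each update, as standard one-unit FastICA does):

```python
import numpy as np

def fastica_one_unit(z, n_iter=200, tol=1e-8, seed=0):
    """One-unit FastICA Newton iteration on whitened data z (channels x samples).

    Mirrors Eq. (8): w <- E{z g(w^T z)} - E{g'(w^T z)} w, then normalize,
    with g(u) = tanh(u) as the contrast nonlinearity (an assumption here).
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=z.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        u = w @ z
        g, g_prime = np.tanh(u), 1.0 - np.tanh(u) ** 2
        w_new = (z * g).mean(axis=1) - g_prime.mean() * w
        w_new /= np.linalg.norm(w_new)
        if abs(abs(w_new @ w) - 1.0) < tol:      # converged up to sign
            return w_new
        w = w_new
    return w
```

Applied after centering and whitening, the recovered component y = wᵀz is the direction of maximal non-Gaussianity, which in this system corresponds to the speech signal.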
The speech signal separated by FastICA still carries residual noise, so we use the signal subspace speech enhancement method to handle the residual noise, as shown in the third figure.
Suppose the original mixed signal z is the original speech-signal subspace y plus another noise-signal subspace n, i.e. z = y + n. We must find a filter F such that filtering the mixed signal yields a clean signal y' = Fz. Comparing the filtered signal with the original speech signal gives the filter error: the residual is y' − y = (F − I)y + Fn, and its error value is computed as δ = δ_y + δ_n.
Here δ_y denotes the distortion of the speech signal removed by the filter, and δ_n denotes the distortion produced by the noise that was not filtered out. To optimize the filter in the signal subspace, the degree of speech distortion should be minimized for the speech part, while for the noise part the residual noise only needs to be suppressed to the point where it no longer affects the recognition result, rather than requiring that no residual noise remain at all. Finally, eigen-decomposition is used to separate the speech signal from the background noise and perform noise removal.
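A minimal sketch of such an eigen-domain subspace filter follows (this is an illustrative simplification assuming white noise of known variance, not the patent's exact filter: eigenvalues below the noise floor are discarded and the rest receive a Wiener-like gain, trading speech distortion against residual noise via mu):

```python
import numpy as np

def subspace_enhance(y_frames, noise_var, mu=1.0):
    """Eigen-domain subspace filtering sketch for noisy frames (dim x n_frames).

    Builds the filter F of the description: directions whose eigenvalue falls to
    the noise floor get gain ~0 (noise subspace), signal directions are attenuated
    by a Wiener-like gain controlled by mu.
    """
    R_y = np.cov(y_frames)                           # noisy covariance
    lam, U = np.linalg.eigh(R_y)
    lam_clean = np.maximum(lam - noise_var, 0.0)     # estimated clean eigenvalues
    gain = lam_clean / (lam_clean + mu * noise_var)  # 0 in noise-only directions
    F = U @ np.diag(gain) @ U.T
    return F @ y_frames
```

Larger mu suppresses more residual noise at the cost of more speech distortion, matching the trade-off described above.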
The speech signal after subspace enhancement must further undergo voice activity detection (VAD). Endpoint detection is a very important step before speech recognition because it removes superfluous invalid segments that would otherwise severely degrade the recognition rate. Here, endpoint detection extracts the valid segments of the speech signal mainly through the short-time energy E = Σ x(n)² and the zero-crossing rate Z = (1/2) Σ |sgn(x(n)) − sgn(x(n−1))|, removing the silent segments so that the speech signal is suitable for recognition, as shown in the fourth figure.
Here x(n) denotes the amplitude of the n-th sample and x(n−1) the previous sample, so the zero-crossing rate is the number of sign changes between consecutive samples. Once the correct speech signal has been extracted, recognition can begin.
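The energy and zero-crossing computation per frame, and a simple energy-thresholded VAD, can be sketched as follows (frame length, hop, and the threshold rule are illustrative assumptions, not values from the patent):

```python
import numpy as np

def frame_energy_zcr(signal, frame_len=256, hop=128):
    """Short-time energy and zero-crossing rate per frame (the two VAD features)."""
    energies, zcrs = [], []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energies.append(float(np.sum(frame ** 2)))
        # fraction of consecutive-sample pairs whose sign differs
        zcrs.append(float(np.mean(np.abs(np.diff(np.sign(frame))) > 0)))
    return np.array(energies), np.array(zcrs)

def simple_vad(signal, frame_len=256, hop=128, e_thresh=None):
    """Flag frames as speech when their energy exceeds a threshold (sketch only)."""
    e, _ = frame_energy_zcr(signal, frame_len, hop)
    if e_thresh is None:
        e_thresh = 0.5 * e.mean()
    return e > e_thresh
```

A practical detector would combine the energy decision with the zero-crossing rate to catch low-energy unvoiced sounds; the sketch keeps only the energy gate for brevity.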
Speech recognition stage: finally, in the speech recognition part, we use HTK to train the speech models, extract speech features, and build the speech recognizer, as shown in the fifth figure. Experimental results of the far-field noisy speech recognition system of the present invention are shown in the sixth, seventh, and eighth figures.
Figure 1: Technical flow chart of the far-field noisy speech recognition system of the present invention
Figure 2: Technical flow chart of the FastICA method of the present invention
Figure 3: Schematic diagram of the subspace speech enhancement technique of the present invention
Figure 4: Flow chart of the endpoint detection technique of the present invention
Figure 5: Flow chart of the speech recognition technique of the present invention
Figure 6: Application interface of the far-field noisy speech recognition system of the present invention
Figure 7: SNR and segmental SNR performance of the far-field noisy speech recognition system of the present invention under various background-noise environments
Figure 8: Recognition-rate performance of the far-field noisy speech recognition system of the present invention under various background-noise environments
Appendix: submitted paper of the present invention: Far-field noisy speech recognition based on blind source separation speech enhancement
Claims (1)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW98138353A TWI396186B (en) | 2009-11-12 | 2009-11-12 | Speech enhancement technique based on blind source separation for far-field noisy speech recognition |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW98138353A TWI396186B (en) | 2009-11-12 | 2009-11-12 | Speech enhancement technique based on blind source separation for far-field noisy speech recognition |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| TW201117193A TW201117193A (en) | 2011-05-16 |
| TWI396186B true TWI396186B (en) | 2013-05-11 |
Family
ID=44935180
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| TW98138353A TWI396186B (en) | 2009-11-12 | 2009-11-12 | Speech enhancement technique based on blind source separation for far-field noisy speech recognition |
Country Status (1)
| Country | Link |
|---|---|
| TW (1) | TWI396186B (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11741343B2 (en) | 2019-11-07 | 2023-08-29 | National Central University | Source separation method, apparatus, and non-transitory computer-readable medium |
Families Citing this family (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP2560161A1 (en) | 2011-08-17 | 2013-02-20 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Optimal mixing matrices and usage of decorrelators in spatial audio processing |
| TWI473077B (en) * | 2012-05-15 | 2015-02-11 | Univ Nat Central | Blind source separation system |
| US12120492B2 (en) * | 2022-07-28 | 2024-10-15 | Mediatek Inc. | Non-coherent noise reduction for audio enhancement on mobile device |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| TW200917231A (en) * | 2007-10-03 | 2009-04-16 | Univ Nat Cheng Kung | Enhancement system for wide space voice signal |
- 2009-11-12: TW TW98138353A patent/TWI396186B/en not_active IP Right Cessation
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| TW200917231A (en) * | 2007-10-03 | 2009-04-16 | Univ Nat Cheng Kung | Enhancement system for wide space voice signal |
Non-Patent Citations (2)
| Title |
|---|
| B.N. Gover, J.G. Ryan, and M.R. Stinson, "Microphone array measurement system for analysis of directional and spatial variations of sound fields," J. Acoust. Soc. Am., 112, 1980-1991 (2002). * |
| Yan Li, P. Wen and D. Powers, "Methods for the blind signal separation problem," in Proc. IEEE Int. Conf. Neural Network, Signal Processing, Nanjing China, Dec. 2003, pp. 1386-1389. * |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11741343B2 (en) | 2019-11-07 | 2023-08-29 | National Central University | Source separation method, apparatus, and non-transitory computer-readable medium |
Also Published As
| Publication number | Publication date |
|---|---|
| TW201117193A (en) | 2011-05-16 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Bahmaninezhad et al. | A comprehensive study of speech separation: spectrogram vs waveform separation | |
| CN108172238B (en) | Speech enhancement algorithm based on multiple convolutional neural networks in speech recognition system | |
| US12230259B2 (en) | Array geometry agnostic multi-channel personalized speech enhancement | |
| Zhang et al. | X-TaSNet: Robust and accurate time-domain speaker extraction network | |
| CN113257270B (en) | Multi-channel voice enhancement method based on reference microphone optimization | |
| CN103065629A (en) | Speech recognition system of humanoid robot | |
| CN101404160A (en) | Voice denoising method based on audio recognition | |
| CN108564956B (en) | Voiceprint recognition method and device, server and storage medium | |
| CN103310789A (en) | Sound event recognition method based on optimized parallel model combination | |
| CN111312275B (en) | An Online Sound Source Separation Enhancement System Based on Subband Decomposition | |
| Liu et al. | Simple pooling front-ends for efficient audio classification | |
| TWI396186B (en) | Speech enhancement technique based on blind source separation for far-field noisy speech recognition | |
| Wang et al. | Enhanced Spectral Features for Distortion-Independent Acoustic Modeling. | |
| CN118486323A (en) | Audio noise reduction method and device, electronic equipment and storage medium | |
| Wang et al. | Spatial-aware speaker diarization for multi-channel multi-party meeting | |
| Couvreur et al. | Automatic noise recognition in urban environments based on artificial neural networks and hidden Markov models | |
| CN107103913A (en) | A kind of audio recognition method based on power spectrum Gabor characteristic sequence recursive models | |
| Seltzer | Bridging the gap: Towards a unified framework for hands-free speech recognition using microphone arrays | |
| KR101802444B1 (en) | Robust speech recognition apparatus and method for Bayesian feature enhancement using independent vector analysis and reverberation parameter reestimation | |
| CN120048268B (en) | Adaptive VAD parameter adjusting method and system based on voiceprint recognition | |
| Venkatesan et al. | Binaural classification-based speech segregation and robust speaker recognition system | |
| Martinek et al. | Noise reduction in industry based on virtual instrumentation | |
| CN101083078A (en) | Strong robustness speech separating method | |
| JP2012155301A (en) | State recognition type speech recognition method | |
| Venkatesan et al. | Analysis of monaural and binaural statistical properties for the estimation of distance of a target speaker |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| MM4A | Annulment or lapse of patent due to non-payment of fees |