
TW201101852A - Sound source direction detecting method and apparatus thereof

Info

Publication number: TW201101852A
Application number: TW98121567A
Authority: TW
Inventors: Hung-Yan Gu; Shan-Siang Yang
Original assignee: Univ Nat Taiwan Science Tech
Priority date: 2009-06-26
Filing date: 2009-06-26
Publication date: 2011-01-01

Abstract

A method for detecting the direction of a sound source is provided. Sound signals from the sound source are received via a microphone array and digitized separately for each microphone. Voice activity is detected in each digital signal stream by calculating spectral entropy and signal-to-noise ratio, so as to extract the signal frames that carry speech. When a speech frame is detected, the time delay of arrival (TDOA) between each pair of microphones is estimated using the principle of the generalized cross-correlation function. The direction of the sound source, expressed as an azimuth and an elevation, is then computed from the estimated TDOA vector.

Description

VI. Description of the Invention

[Technical Field]
The present invention relates to a sound source detection device, and more particularly to a sound source detection device suitable for detecting the direction of a human voice.

[Prior Art]
Sound source direction detection has a wide range of applications, such as sonar detection, wireless communication, and video conferencing systems. In recent years, as robotics has developed rapidly, research on sound source direction detection for robot audition has also increased. A sound source direction detection system can also be used to strengthen human-robot interaction (HRI), or be combined with an automatic speech recognition (ASR) system or a speech enhancement system to improve the performance of a speech processing system.

In the literature on sound source direction detection, the proposed techniques fall roughly into two categories. The first takes the data received by the microphone array and processes it with beamforming or subspace theory to obtain the angle of the sound source. The second estimates the time delay of arrival (TDOA) and then uses the geometric relationship between the sound source and the microphones to estimate the angle of the sound source.

[Summary of the Invention]
The present invention provides a sound source detection method and a device thereof, which detect the direction of a sound source through voice activity detection, time delay estimation, and direction angle calculation.

The present invention provides a sound source detection method suitable for detecting the direction of a sound source. First, a speech signal from the sound source is received via a microphone array to obtain a plurality of input signals, wherein the microphone array includes a plurality of microphones. Next, the input signal received by each microphone is amplified and filtered. Each filtered input signal is then converted into a digital signal. Voice activity detection is then performed by calculating the spectral entropy and the signal-to-noise ratio of each digital signal. When speech frames are detected, an approximation of the generalized cross-correlation function is used to calculate the time delay estimate between each pair of microphones, and the estimates form a TDOA vector. Finally, direction calculation is performed according to the TDOA vector to obtain a horizontal azimuth and an elevation angle of the sound source.

The present invention also provides a sound source detection device suitable for detecting the direction of a sound source. The device includes a microphone array and a determination circuit. The microphone array includes a first microphone, a second microphone, and a third microphone for receiving a speech signal from the sound source, so as to obtain a first input signal corresponding to the first microphone, a second input signal corresponding to the second microphone, and a third input signal corresponding to the third microphone. The determination circuit first detects the speech frames (signal frames that contain speech) of the first input signal, of the second input signal, and of the third input signal. When speech frames are detected in the first, second, and third input signals at the same time, the determination circuit estimates the TDOA values between each pair of the first, second, and third input signals, and then calculates a horizontal azimuth and an elevation angle of the sound source accordingly.

[Embodiments]
To make the above and other objects, features, and advantages of the present invention clearer, preferred embodiments are described in detail below with reference to the accompanying drawings.

FIG. 1 shows a sound source detection device 100 according to an embodiment of the present invention, used to detect the direction of a sound source 140. The sound source detection device 100 includes a microphone array 110, a signal conversion circuit 120, and a determination circuit 130. The signal conversion circuit 120 includes an amplification and filtering unit 122 and an analog-to-digital conversion unit 124. The determination circuit 130 includes a voice activity detection unit 132, a time delay estimation unit 134, and a direction calculation unit 136. From the speech signal Ss emitted by the sound source 140, the sound source detection device 100 obtains the three-dimensional direction of the sound source 140, that is, the horizontal azimuth θ and the elevation angle φ of the position of the sound source 140.

FIG. 2 is a schematic diagram of the structure of the microphone array 110 in FIG. 1. The microphone array 110 consists of three microphones 210, 220, and 230, arranged as an equilateral triangle to form a planar array. The microphones 210, 220, and 230 are all omnidirectional. Referring to FIG. 1 and FIG. 2 together, when the speech signal Ss travels from the sound source 140 to the sound source detection device 100, the microphones 210, 220, and 230 each generate a corresponding input signal. For example, on receiving the speech signal Ss, the microphone 210 provides the input signal Sin1 to the signal conversion circuit 120, the microphone 220 provides the input signal Sin2, and the microphone 230 provides the input signal Sin3. Because the distances between the sound source 140 and the microphones 210, 220, and 230 differ, the speech signal Ss does not reach the three microphones at the same time. For example, if the microphone 210 is nearest to the sound source 140 and the microphone 230 is farthest from it, the microphone 210 receives the speech signal Ss first and the microphone 230 receives it last.

FIG. 3 is a schematic diagram of the signal conversion circuit 120 in FIG. 1. As shown in FIG. 3, the amplification and filtering unit 122 includes amplifiers 302, 304, and 306 and filters 312, 314, and 316, and the analog-to-digital conversion unit 124 includes three analog-to-digital converters 322, 324, and 326. The filters 312, 314, and 316 are low-pass filters. Referring to FIG. 1 and FIG. 3 together, the input signals Sin1, Sin2, and Sin3 provided by the microphones of the microphone array 110 are each amplified and filtered by the corresponding amplifier and filter. For example, the input signal Sin1 is amplified by the amplifier 302 and then filtered by the filter 312 to give the filtered signal Sf1. Likewise, the input signal Sin2 is amplified and filtered by the amplifier 304 and the filter 314 to give the filtered signal Sf2, and the input signal Sin3 is amplified and filtered by the amplifier 306 and the filter 316 to give the filtered signal Sf3. The analog-to-digital converters 322, 324, and 326 then convert the filtered signals Sf1, Sf2, and Sf3 into the digital signals D1, D2, and D3, respectively, and send them to the determination circuit 130.

Referring back to FIG. 1, the voice activity detection unit 132 performs voice activity detection (VAD) on the digital signals D1, D2, and D3 to determine whether each of their signal frames is a speech signal. Voice activity detection is a speech signal processing method that determines whether a signal frame contains spoken speech. Only frames judged to contain speech are passed on to further processing, such as speech recognition or speech enhancement. The spectral entropy of a speech signal is comparatively low, that is, the energy of a speech signal is concentrated in a limited number of frequency bands, and this property can be used to determine whether an input signal is speech or non-speech.

First, the voice activity detection unit 132 divides each of the digital signals D1, D2, and D3 into a plurality of frames. For example, if one frame is 23 ms long and 230 ms of the digital signal D1 have been received, the digital signal D1 can be divided into 10 frames. Each frame is first converted to the frequency domain by a fast Fourier transform (FFT). A probability value is then defined for each frequency band, and the spectral entropy of the frame is calculated from these probabilities. The low-frequency components of a speech signal are strong (concentrated roughly below 3 kHz), and the magnitude of its spectrum shows pronounced peaks and valleys. The spectrum of a noise signal, by contrast, varies less and is flatter. Therefore, by comparing the spectral entropy of each frame with a specific entropy reference value, the voice activity detection unit 132 can determine whether the frame is a speech frame or a non-speech frame. When a frame is judged to be a speech frame, the voice activity detection unit 132 further calculates the signal-to-noise ratio (SNR) of the frame to avoid misjudgment and improve the accuracy of the voice activity decision.
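The entropy-based frame decision described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the use of a naive DFT in place of an FFT, the normalization of the entropy to the range 0 to 1, and the threshold value 0.6 are all assumptions made for the example, and the SNR confirmation step is omitted.

```python
import cmath
import math
import random


def power_spectrum(frame):
    """Power in DFT bins 1 .. N/2-1 of a real frame (naive DFT, DC bin excluded)."""
    n = len(frame)
    spectrum = []
    for k in range(1, n // 2):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def spectral_entropy(frame):
    """Entropy of the per-band probabilities, scaled so that a flat spectrum gives 1."""
    power = power_spectrum(frame)
    total = sum(power)
    if total == 0.0:
        return 1.0  # assumption: treat a silent frame as maximally flat (non-speech)
    probs = [p / total for p in power]  # probability value of each band
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


def is_speech_frame(frame, entropy_threshold=0.6):
    """Peaky spectrum (low entropy) is treated as speech; the threshold is illustrative."""
    return spectral_entropy(frame) < entropy_threshold


if __name__ == "__main__":
    random.seed(0)
    n = 128
    # Two sinusoids stand in for a frame with pronounced spectral peaks.
    peaky = [math.sin(2 * math.pi * 5 * t / n) + math.sin(2 * math.pi * 12 * t / n)
             for t in range(n)]
    flat = [random.gauss(0.0, 1.0) for _ in range(n)]  # white noise: flat spectrum
    print("peaky frame:", spectral_entropy(peaky), is_speech_frame(peaky))
    print("noise frame:", spectral_entropy(flat), is_speech_frame(flat))
```

In a fuller version, a frame that passes this test would additionally be checked against an SNR criterion before being accepted as speech, as the description states.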
FIG. 4 is a schematic diagram, according to an embodiment of the present invention, of estimating the time delay of arrival (TDOA) between two microphones using the principle of the generalized cross-correlation function. In actual implementation, this theoretical formulation is replaced by an approximation. In FIG. 4, the signals x1 and x2 are the signals received by two microphones after amplification and filtering. For example, x1 and x2 may be any two of the output signals Sf1, Sf2, and Sf3 of the amplification and filtering unit in FIG. 3. First, the signals x1 and x2 pass through the linear time-invariant filters 410 and 420, respectively, to give the signals y1 and y2. The multiplier 440 then multiplies the signals y1 and y3 to obtain the signal M, where the signal y3 is produced from the signal y2 by the delay unit 430. Next, the integrator 450 integrates the signal M over the range of time shifts, and the peak detector 460 finds the maximum of the generalized cross-correlation function. The time shift at which this maximum occurs is the time delay estimate D̂ obtained with the generalized cross-correlation of FIG. 4. The relationship between the cross-correlation function R_{x1x2}(τ) and the cross power spectrum G_{x1x2}(f) is:

$$R_{x_1x_2}(\tau)=\int_{-\infty}^{\infty} G_{x_1x_2}(f)\,e^{j2\pi f\tau}\,df \qquad (1)$$

From FIG. 4, the relationship between the cross power spectrum G_{y1y2}(f) of the signals y1 and y2 and G_{x1x2}(f) is:

$$G_{y_1y_2}(f)=H_1(f)\,H_2^{*}(f)\,G_{x_1x_2}(f) \qquad (2)$$

Here * denotes the complex conjugate, and H_1(f) and H_2(f) are the frequency responses of the filters 410 and 420, respectively. The generalized cross-correlation function between the signals x1 and x2 can thus be defined as equation (3):

$$R^{(g)}_{y_1y_2}(\tau)=\int_{-\infty}^{\infty}\psi_g(f)\,G_{x_1x_2}(f)\,e^{j2\pi f\tau}\,df \qquad (3)$$

where $\psi_g(f)=H_1(f)\,H_2^{*}(f)$. Because the signals x1 and x2 are observed only for a finite time, only an estimate $\hat{G}_{x_1x_2}(f)$ of $G_{x_1x_2}(f)$ can be obtained. Equation (3) is therefore rewritten as equation (4):

$$\hat{R}^{(g)}_{y_1y_2}(\tau)=\int_{-\infty}^{\infty}\psi_g(f)\,\hat{G}_{x_1x_2}(f)\,e^{j2\pi f\tau}\,df \qquad (4)$$

In addition, to obtain an accurate time delay estimate, a suitable weighting function $\psi_g(f)$ must be chosen so that the value of the correlation function at the true delay differs markedly from its values at other τ. In FIG. 4, the frequency response of the phase transform (PHAT), shown in equation (5), makes the cross-correlation function between the signals x1 and x2 an impulse function whose maximum occurs at the time delay:

$$\psi_p(f)=\frac{1}{\left|G_{x_1x_2}(f)\right|} \qquad (5)$$

Substituting equation (5) into equation (4) gives equation (6):

$$\hat{R}^{(p)}_{y_1y_2}(\tau)=\int_{-\infty}^{\infty}\frac{\hat{G}_{x_1x_2}(f)}{\left|G_{x_1x_2}(f)\right|}\,e^{j2\pi f\tau}\,df \qquad (6)$$

When the voice activity detection unit 132 detects speech frames in the digital signals D1, D2, and D3, the time delay estimation unit 134 performs a fast Fourier transform on the digital signal frame of each microphone input to obtain the spectrum in each frequency band, and from these spectra estimates the time delay between each pair of the microphones 210, 220, and 230. In practice, following equation (6), the FFT spectra $X_1(f)$ and $X_2(f)$ of the frame of D1 (the digital signal of x1) and of the frame of D2 (the digital signal of x2) are obtained first. The product $X_1(f)X_2^{*}(f)$ is then used to approximate $\hat{G}_{x_1x_2}(f)$ in equation (6), and $\left|X_1(f)X_2^{*}(f)\right|$ is used to approximate $\left|G_{x_1x_2}(f)\right|$. An inverse fast Fourier transform of the resulting discrete spectrum gives the discrete values of $\hat{R}^{(p)}_{y_1y_2}(\tau)$. Finally, within the plausible range of τ, the value of τ with the largest correlation value is taken as the time delay estimate. The time delay estimation unit 134 thus estimates the time delay by using this approximation of the generalized cross-correlation function.
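The approximation procedure just described (cross spectrum, PHAT normalization, inverse transform, peak search over a plausible lag range) can be sketched as below. This is an illustrative sketch under stated assumptions: the function names are hypothetical, a naive DFT stands in for the FFT so that the example needs no external library, the frames are zero-padded to avoid circular wrap-around, and the sign convention (a positive lag means the first frame arrives later than the second) is a choice made for the example.

```python
import cmath
import math
import random


def dft(x, inverse=False):
    """Naive O(N^2) DFT, used here in place of an FFT for self-containment."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat_delay(frame1, frame2, max_lag):
    """Lag in samples (within +/- max_lag) that maximizes the PHAT-weighted correlation.

    A positive result means frame1 is a delayed copy of frame2.
    """
    n = len(frame1) + len(frame2)  # zero-pad so the correlation is linear, not circular
    x1 = dft(list(frame1) + [0.0] * (n - len(frame1)))
    x2 = dft(list(frame2) + [0.0] * (n - len(frame2)))
    weighted = []
    for a, b in zip(x1, x2):
        cross = a * b.conjugate()  # X1(f) X2*(f) approximates the cross power spectrum
        mag = abs(cross)           # |X1(f) X2*(f)| approximates its magnitude
        weighted.append(cross / mag if mag > 1e-12 else 0.0)  # PHAT weighting
    corr = [v.real for v in dft(weighted, inverse=True)]
    # Negative lags are stored at the end of the inverse transform (index lag % n).
    return max(range(-max_lag, max_lag + 1), key=lambda lag: corr[lag % n])


if __name__ == "__main__":
    random.seed(0)
    # Broadband test signal, zero-tailed so that shifting it loses no samples.
    base = [random.gauss(0.0, 1.0) for _ in range(48)] + [0.0] * 16
    delayed = [0.0] * 3 + base[:-3]
    print("estimated lag in samples:", gcc_phat_delay(delayed, base, max_lag=10))
```

Dividing the returned lag by the sampling rate converts it to seconds. The `max_lag` bound plays the role of the plausible range of τ, which in a real array would follow from the microphone spacing and the speed of sound.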
In the present invention, the input signals of each pair of microphones yield one time delay estimate. For the microphones 210, 220, and 230 in FIG. 2, three time delay estimates are obtained in total, and together they form a time delay estimate vector. The plausible range of a time delay value is determined by the distance between the two microphones and by the speed at which sound travels in air. The largest time delay is therefore obtained when the sound source lies on the straight line through the two microphones. Moreover, the larger the time delay τ obtained from the correlation function, the larger the gap between the angles that correspond to two adjacent discrete values of τ. To improve the precision of the time delay estimate, parabolic interpolation is used, and the value of τ at the vertex of the parabola is taken in order to reduce the angular error.

FIG. 5 shows the positions of a sound source 530 and a microphone pair (microphones 510 and 520). Let the three-dimensional coordinates of the sound source 530 be $(x_s, y_s, z_s)$, and let those of the microphones 510 and 520 be $(x_1, y_1, z_1)$ and $(x_2, y_2, z_2)$, respectively. The times $t_1$ and $t_2$ that the sound wave takes to travel from the sound source 530 to the microphones 510 and 520 follow from equations (7) and (8):

$$t_1=\frac{\sqrt{(x_1-x_s)^2+(y_1-y_s)^2+(z_1-z_s)^2}}{c} \qquad (7)$$

$$t_2=\frac{\sqrt{(x_2-x_s)^2+(y_2-y_s)^2+(z_2-z_s)^2}}{c} \qquad (8)$$

where c is the speed at which sound travels in air. The theoretical time delay between the microphone 510 and the microphone 520 is therefore the difference between $t_1$ and $t_2$. One theoretical time delay can be calculated for each microphone pair, and these theoretical values form a theoretical time delay vector. Using the above equations, the direction calculation unit 136 calculates a theoretical time delay vector for each candidate combination of horizontal angle and elevation angle, which gives a set of theoretical vectors. The direction calculation unit 136 then calculates the geometric distance between the time delay estimate vector from the time delay estimation unit 134 and every theoretical time delay vector, and finds the theoretical vector with the smallest distance. The azimuth θ and the elevation angle φ of the sound source are those of the candidate direction that produced this vector.
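The parabolic peak refinement and the nearest-vector direction search described above can be sketched as follows. Several values here are assumptions introduced only for illustration and do not come from the patent: the 0.2 m triangle side, the fixed 2 m source distance used to generate candidate positions, the 5 degree search grid, the speed of sound of 343 m/s, the pair ordering (0,1), (0,2), (1,2), and the sign convention t_i - t_j. All function names are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def parabolic_peak_offset(left, peak, right):
    """Offset (in samples) of the vertex of the parabola through three correlation
    values taken at lags -1, 0, +1 around the discrete peak."""
    denom = left - 2.0 * peak + right
    if denom == 0.0:
        return 0.0  # the three points are collinear; keep the discrete peak
    return 0.5 * (left - right) / denom


def mic_positions(side=0.2):
    """Three microphones at the corners of an equilateral triangle in the plane z = 0."""
    r = side / math.sqrt(3.0)  # circumradius of the triangle
    return [(r * math.cos(math.radians(a)), r * math.sin(math.radians(a)), 0.0)
            for a in (90.0, 210.0, 330.0)]


def source_position(azimuth_deg, elevation_deg, distance=2.0):
    """Candidate source position for a given azimuth and elevation (assumed distance)."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    return (distance * math.cos(el) * math.cos(az),
            distance * math.cos(el) * math.sin(az),
            distance * math.sin(el))


def theoretical_delays(source, mics):
    """Pairwise travel-time differences, built from equations (7) and (8)."""
    times = [math.dist(source, m) / SPEED_OF_SOUND for m in mics]
    return [times[i] - times[j] for i, j in ((0, 1), (0, 2), (1, 2))]


def locate(estimated_delays, mics, step_deg=5):
    """Return the (azimuth, elevation) whose theoretical delay vector is nearest
    in Euclidean distance to the estimated delay vector."""
    best = None
    for az in range(0, 360, step_deg):
        for el in range(0, 91, step_deg):  # a planar array cannot tell above from below
            theory = theoretical_delays(source_position(az, el), mics)
            dist = math.dist(estimated_delays, theory)
            if best is None or dist < best[0]:
                best = (dist, az, el)
    return best[1], best[2]


if __name__ == "__main__":
    mics = mic_positions()
    measured = theoretical_delays(source_position(40, 30), mics)  # stand-in for estimates
    print("estimated direction (azimuth, elevation):", locate(measured, mics))
```

In this sketch the fractional offset from `parabolic_peak_offset` would be added to the integer lag of each microphone pair before the delays are converted to seconds and passed to `locate`. Because the theoretical vectors depend only on the array geometry, they could also be computed once and stored as a lookup table.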
As described above, the sound source detection device of the present invention can detect the three-dimensional direction of a sound source using an equilateral-triangle array of three microphones. The present invention uses voice activity detection based on entropy measurement and time delay estimation based on the principle of the generalized cross-correlation function, which reduces the estimation error of the horizontal azimuth and the elevation angle of the sound source.

Although the present invention has been disclosed above by way of preferred embodiments, they are not intended to limit its scope. Anyone skilled in the art may make minor changes and refinements without departing from the spirit and scope of the present invention, so the scope of protection of the present invention is defined by the appended claims.

[Brief Description of the Drawings]
FIG. 1 shows a sound source detection device according to an embodiment of the present invention, used to detect the direction of a sound source.
FIG. 2 is a schematic diagram of the microphone array in FIG. 1.
FIG. 3 is a schematic diagram of the signal conversion circuit in FIG. 1.
FIG. 4 is a schematic diagram of estimating the time delay between two microphones using the principle of the generalized cross-correlation function, according to an embodiment of the present invention.
FIG. 5 shows the positions of a sound source and a microphone pair.

[Description of Main Reference Numerals]
100 ~ sound source detection device;
110 ~ microphone array;
120 ~ signal conversion circuit;
122 ~ amplification and filtering unit;
124 ~ analog-to-digital conversion unit;
130 ~ determination circuit;
132 ~ voice activity detection unit;
134 ~ time delay estimation unit;
136 ~ direction calculation unit;
140, 530 ~ sound source;
210, 220, 230, 510, 520 ~ microphones;
302, 304, 306 ~ amplifiers;
312, 314, 316 ~ filters;
322, 324, 326 ~ analog-to-digital converters;
410, 420 ~ linear time-invariant filters;
430 ~ delay unit;
440 ~ multiplier;
450 ~ integrator;
460 ~ peak detector;
D1, D2, D3 ~ digital signals;
D̂ and its three pairwise values ~ time delay estimates;
Sf1, Sf2, Sf3 ~ filtered signals;
Sin1, Sin2, Sin3, x1, x2 ~ input signals;
Ss ~ speech signal;
y1, y2, y3, M ~ signals;
θ ~ azimuth; and
φ ~ elevation angle.

Claims

VII. Scope of the Patent Application:

1. A sound source detection method for detecting the direction of a sound source, comprising:
receiving, via a microphone array, a speech signal from the sound source to obtain a plurality of input signals, wherein the microphone array comprises a plurality of microphones and each of the microphones provides one of the input signals;
amplifying and filtering each of the input signals;
converting each filtered input signal into a digital signal;
performing voice activity detection on the digital signals to detect speech frames;
when the speech frames are detected, performing time delay estimation to obtain a time delay estimate vector; and
performing direction calculation according to the time delay estimate vector to obtain a horizontal azimuth and an elevation angle of the sound source.

2. The sound source detection method of claim 1, wherein the step of performing voice activity detection on the digital signals comprises: dividing each of the digital signals into a plurality of frames; and detecting the speech frames according to the spectral entropy of the frames.

3. The sound source detection method of claim 2, wherein the step of performing voice activity detection on the digital signals further comprises: calculating a signal-to-noise ratio of the frames to confirm the speech frames.

4. The sound source detection method of claim 1, wherein the step of performing time delay estimation comprises: performing a fast Fourier transform on each of the digital signals to obtain a generalized cross-correlation function between the microphones; and estimating the time delay estimate vector according to the generalized cross-correlation function between the microphones.

5. The sound source detection method of claim 1, wherein the step of performing direction calculation further comprises: calculating a difference between the time delay estimate vector and a specific vector to obtain the horizontal azimuth and the elevation angle.

6. The sound source detection method of claim 1, wherein the microphone array comprises a first microphone, a second microphone, and a third microphone, and the first, second, and third microphones are arranged as a planar array in the shape of an equilateral triangle.

7. The sound source detection method of claim 6, wherein the first, second, and third microphones are omnidirectional microphones.

8. A sound source detection device for detecting the direction of a sound source, comprising:
a microphone array comprising a first microphone, a second microphone, and a third microphone for receiving a speech signal from the sound source, so as to obtain a first input signal corresponding to the first microphone, a second input signal corresponding to the second microphone, and a third input signal corresponding to the third microphone; and
a determination circuit for detecting a speech frame of the first input signal, a speech frame of the second input signal, and a speech frame of the third input signal, and, when the speech frames of the first, second, and third input signals are detected at the same time, calculating correlation function values between the first, second, and third input signals to obtain a horizontal azimuth and an elevation angle of the sound source.

9. The sound source detection device of claim 8, wherein the microphone array is a planar array in which the first, second, and third microphones are arranged as an equilateral triangle.

10. The sound source detection device of claim 9, wherein the first, second, and third microphones are omnidirectional microphones.

11. The sound source detection device of claim 8, further comprising:
a signal conversion circuit coupled between the microphone array and the determination circuit, comprising:
an amplification and filtering unit for amplifying the first, second, and third input signals and low-pass filtering the amplified first, second, and third input signals; and
an analog-to-digital conversion unit for converting the filtered first, second, and third input signals into a first digital signal, a second digital signal, and a third digital signal, respectively.

12. The sound source detection device of claim 11, wherein the determination circuit comprises:
a voice activity detection unit for dividing each of the first, second, and third digital signals into a plurality of frames and detecting the speech frames in the first, second, and third input signals according to spectral entropy.

13. The sound source detection device of claim 12, wherein the voice activity detection unit further confirms, according to a signal-to-noise ratio, that the frames in the first, second, and third input signals are speech frames.

14. The sound source detection device of claim 12, wherein the determination circuit further comprises:
a time delay estimation unit, wherein, when the voice activity detection unit detects that the frames in the first, second, and third input signals are speech frames, the time delay estimation unit performs a fast Fourier transform on the frames of the first, second, and third digital signals, uses the Fourier spectra of the input signals to obtain the cross power spectrum between each two signals and thereby an approximation of the correlation function between each two of the first, second, and third microphones, and estimates, according to the approximation of the correlation function, a time delay value vector of the arrival of the speech signal at each two of the first, second, and third microphones.

15. The sound source detection device of claim 14, wherein the correlation function is a generalized cross-correlation function between two of the input signals.

16. The sound source detection device of claim 14, wherein the determination circuit further comprises:
a direction calculation unit for obtaining the horizontal azimuth and the elevation angle according to the time delay estimate vector and a specific vector.