200916812
IX. Description of the Invention:

[Technical Field of the Invention]
The present invention provides a speech direction recognizer, and in particular a speech direction recognizer that uses a fuzzy inference architecture. It can be applied to various robots or computer equipment, effectively determines the direction of the user (the speaker), and reduces system complexity.

[Prior Art]
Vision is a mature part of multi-sensor systems, whereas hearing is another research direction still under development. Because a robot's vision system is a passive sensor, the robot cannot provide service when the person to be served is outside the vision system's field of view.

Many methods are commonly used for sound-source direction determination, such as the delay-and-sum beamformer and time-domain cross-correlation, whose purpose is to obtain the angle between the sound and the microphones. These are the methods used in most research, but they are complex and not easy to implement.
Scholars abroad have proposed an auditory system architecture based on neural networks, intended to give the auditory system a learning capability and thereby improve localization performance. At the present stage, however, the error of this method is still considerable, and like the methods above it is limited by environmental conditions and by the choice of test corpus: it can only make determinations for known sounds in a known environment.

Related prior patents include "Microphone array sound pickup method and system combined with positioning technology" by Xu Tianming (invention patent number illegible in the source); "System and method for automatically adjusting the signal capture angle of a capture device" by Su Wenli, Invention No. 1274519; "Method and apparatus for adjusting transfer characteristics to two listening positions in a space," filed by Philips (飛利浦電泡廠), Invention No. 117239; and "Speech localization system" by Hu Zhusheng et al., Invention No. 1262433. These prior patents disclose related sound-source direction determination and are incorporated herein by reference.

[Summary of the Invention]
An object of the present invention is to provide a speech direction recognizer using a fuzzy inference architecture, built around a fuzzy speaker localization system (Fuzzy Speech Localization Recognition, FSLR). When the user (speaker) makes a sound and the microphone arrays receive the signal, the FSLR system effectively determines the speaker's direction and in turn controls the robot to turn and move so that it faces and approaches the user who is speaking.

Another object of the present invention is to provide a speech direction recognizer using a fuzzy inference architecture that performs active sensing, which not only compensates for the passivity of the vision system but also improves the sense of interaction between the robot and the user.

To achieve these objects, the speech direction recognizer of the present invention comprises a plurality of regions, each containing a microphone array of several microphones that receives the audio energy of a speaker. The recognizer outputs the average voltages of at least the two microphone arrays with the largest energy: the average voltage of the array with the largest energy is a first voltage, whose azimuth is a first angle, and the average voltage of the array with the second-largest energy is a second voltage, whose azimuth is a second angle. The first and second voltages serve as the two input signals of a fuzzy system. Through center-of-gravity defuzzification, with the inputs mapped to their membership functions, a predetermined rule base gives the relationship between the signals and the first and second angles, from which a compensation angle is calculated. When the first angle is smaller than the second angle, the first angle plus the compensation angle is the angle between the speech direction recognizer and the speaker; when the first angle is larger than the second angle, the first angle minus the compensation angle is the angle between the speech direction recognizer and the speaker.
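The selection of the two strongest arrays and the add/subtract rule can be sketched in Python as follows. This is a minimal sketch, not the patent's implementation: the function name `estimate_direction`, the dictionary layout, and the sample voltages are hypothetical, and the compensation angle is passed in as a plain number instead of being computed by the fuzzy system. The rule is applied literally as stated. The text does not say how a pair such as 0° and 270° is handled, where the speaker lies across the 360° wrap, so the sketch only normalizes the result to the range 0° to 360°.

```python
def estimate_direction(avg_voltage_by_angle, compensation_deg):
    """Apply the add/subtract rule to the two strongest microphone arrays.

    avg_voltage_by_angle: dict mapping an array's azimuth in degrees to its
        average voltage (hypothetical layout, e.g. {0: ..., 90: ..., ...}).
    compensation_deg: compensation angle, assumed to come from the fuzzy system.
    """
    if len(avg_voltage_by_angle) < 2:
        raise ValueError("need at least two microphone arrays")

    # Rank azimuths by average voltage, strongest first.
    ranked = sorted(avg_voltage_by_angle,
                    key=avg_voltage_by_angle.get, reverse=True)
    theta1, theta2 = ranked[0], ranked[1]

    # Rule from the text: add when theta1 < theta2, subtract when theta1 > theta2.
    if theta1 < theta2:
        total = theta1 + compensation_deg
    else:
        total = theta1 - compensation_deg

    # Keep the result in [0, 360); wrap-around pairs get no special treatment.
    return total % 360


# Hypothetical readings: the 90-degree array is loudest, the 180-degree one second.
readings = {0: 0.12, 90: 0.85, 180: 0.40, 270: 0.05}
print(estimate_direction(readings, 20.0))  # 90 < 180, so the rule adds: 90 + 20
```

Passing the compensation angle as an argument keeps this step independent of how the fuzzy system is built, so either part can be changed without touching the other.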
Based on this angle, a motor can be driven to turn the robot to face the speaker, and the robot can work with other sensing systems to interact with the user in more varied ways.

To make the above objects, features, and advantages of the present invention clearer, a preferred embodiment is described in detail below with reference to the accompanying drawing.

[Embodiments]
Please refer to Figure 1, a schematic diagram of the microphone array configuration of an embodiment of the present invention. The configuration includes a plurality of regions (for example, regions A, B, C, and D), each of which receives the audio energy of its corresponding planar direction:

A first region 10 (region A) contains a plurality of (for example, four) microphones 12 (A1, A2, A3, A4), spaced at equal intervals to form a planar microphone array that receives audio energy from the 0° planar direction.

A second region 14 (region B) contains a plurality of (for example, four) microphones 16 (B1, B2, B3, B4), spaced at equal intervals to form a planar microphone array that receives audio energy from the 90° planar direction.

A third region 18 (region C) contains a plurality of (for example, four) microphones 20 (C1, C2, C3, C4), spaced at equal intervals to form a planar microphone array that receives audio energy from the 180° planar direction.

A fourth region 22 (region D) contains a plurality of (for example, four) microphones 24 (D1, D2, D3, D4), spaced at equal intervals to form a planar microphone array that receives audio energy from the 270° planar direction.

A robot 26 is located in the space bounded by the first to fourth regions and receives the speaker's audio. Because each region has several microphones at equal intervals, the audio of each region is received evenly and is not affected by angle.
Although a stretch of background noise is captured before speech begins, and another after speech ends, this noise generally stays below the set threshold and can therefore be filtered out by the threshold. Endpoint detection, preliminary determination of the speaker's direction, and speaker localization are described in detail below.

(1) Endpoint detection: For real-time processing, and to reduce the complexity and amount of computation, the implementation captures one audio segment per fixed interval (printed as "G5sec" in the source, apparently 0.5 sec). Each segment is one frame and is analyzed directly without being cut into further frames. An energy decision is made on every piece of data. When energy above the threshold first appears, the speaker is taken to have begun speaking to the robot [the rest of this sentence is illegible in the source]; this is the final position from which the speaker spoke.

To reduce the influence of environmental reflections on speaker localization without adding computational difficulty, the invention analyzes the first valid energy value in each frame that exceeds the threshold. Preliminary tests in an ordinary indoor space, with the speaker at distances of [illegible] meters, showed that the amplitude of voiced segments is roughly [illegible] or more after the signal is amplified [factor illegible]. The practical design therefore uses [illegible] as the threshold separating voiced segments from unvoiced segments.

(2) Preliminary determination of speaker direction: Once endpoint detection has separated out the voiced portion, the magnitude of the voltage amplitude indicates the strength of the signal energy.
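Before going further, the threshold test of step (1), which produces the voiced segments used here, can be sketched as follows. The sketch assumes each captured frame is already available as a list of amplified voltage samples and that "energy" can be approximated by the magnitude of a sample. The threshold is a parameter because its value is not legible in the text, and the names `first_voiced_sample` and `is_voiced` and all sample numbers are illustrative only.

```python
def first_voiced_sample(frame, threshold):
    """Return (index, value) of the first sample whose magnitude exceeds
    the threshold, or None when the whole frame stays at or below it."""
    for index, value in enumerate(frame):
        if abs(value) > threshold:
            return index, value
    return None


def is_voiced(frame, threshold):
    """A frame counts as voiced if any of its samples exceeds the threshold."""
    return first_voiced_sample(frame, threshold) is not None


# Hypothetical frames of amplified voltage samples (one capture each).
THRESHOLD = 0.1  # placeholder value, not taken from the source
noise_frame = [0.01, -0.02, 0.03, -0.01]
speech_frame = [0.02, -0.04, 0.35, -0.50, 0.20]

print(is_voiced(noise_frame, THRESHOLD))
print(first_voiced_sample(speech_frame, THRESHOLD))
```

Returning the first sample above the threshold, instead of scanning the whole frame for its peak, mirrors the stated intent of analyzing the earliest arrival and so limiting the influence of later reflections.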
Because each microphone array comprises four audio channels, averaging the audio energy acquired by the channels of one array at the same instant gives that array's average energy. Sound energy attenuates with distance as the wave propagates, so the array nearer the source acquires the larger signal energy, that is, the larger wave amplitude. The invention therefore compares the four average energy amplitudes and takes the two largest; the angles of these two arrays are regarded as the possible directions of the sound source. The array with the larger energy (the largest of the four averages) is taken as the reference direction, and its angle as the basic reference angle θ_basic.

(3) Speaker localization: To determine the speaker's direction accurately, the invention establishes a fuzzy speaker localization recognition system (Fuzzy Speaker Localization Recognition, FSLR system) to further improve recognition. The system takes as inputs the average voltages of the two planar microphone arrays with the larger energy. Let the average voltage of the array with the largest energy be a first voltage V1, with basic reference azimuth a first angle θ1, and let the average voltage of the array with the second-largest energy be a second voltage V2, with basic reference azimuth a second angle θ2. The azimuth θ1 of the array with the largest energy serves as the basic angle θ_basic. The two energies (V1, V2) are then used as the two input signals of a fuzzy system; through center-of-gravity defuzzification, with the inputs mapped to their membership functions, a predetermined rule base gives the relationship between the signals and these two directions.
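One way to picture a two-input fuzzy system of this kind is the Mamdani-style sketch below, which maps V1 and V2 to a compensation angle by center-of-gravity defuzzification. The text does not disclose the membership functions, the rule base, or the output range, so everything concrete here is an assumption made for illustration: the triangular sets, the four rules, the 0° to 45° output universe, the scaling of voltages to the range 0 to 1, and the names `tri` and `fuzzy_compensation`.

```python
def tri(x, a, b, c):
    """Triangular membership function with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)


# Hypothetical fuzzy sets. Inputs are voltages pre-scaled to [0, 1];
# the output universe is a compensation angle from 0 to 45 degrees.
IN_SETS = {"low": (-1.0, 0.0, 1.0), "high": (0.0, 1.0, 2.0)}
OUT_SETS = {"small": (-22.5, 0.0, 22.5),
            "medium": (0.0, 22.5, 45.0),
            "large": (22.5, 45.0, 67.5)}

# Hypothetical rule base: (V1 label, V2 label) -> output label.
RULES = {("high", "low"): "small",
         ("low", "low"): "medium",
         ("high", "high"): "large",
         ("low", "high"): "large"}


def fuzzy_compensation(v1, v2, step=0.5):
    """Mamdani inference (min for AND, max for aggregation) followed by
    center-of-gravity defuzzification over a sampled output universe."""
    v1 = min(max(v1, 0.0), 1.0)
    v2 = min(max(v2, 0.0), 1.0)
    mu1 = {name: tri(v1, *abc) for name, abc in IN_SETS.items()}
    mu2 = {name: tri(v2, *abc) for name, abc in IN_SETS.items()}

    numerator = denominator = 0.0
    for i in range(int(45.0 / step) + 1):
        y = i * step
        # Clip each rule's output set by its firing strength, then take the max.
        mu = max(min(mu1[a], mu2[b], tri(y, *OUT_SETS[out]))
                 for (a, b), out in RULES.items())
        numerator += y * mu
        denominator += mu
    return numerator / denominator if denominator else 0.0


# A dominant first array gives a small correction; two equal arrays a larger one.
print(fuzzy_compensation(1.0, 0.0))
print(fuzzy_compensation(1.0, 1.0))
```

The placeholder rules only encode the intuition that a second voltage close to the first pulls the estimate toward the midpoint between the two arrays; a real rule base would be tuned to the microphones and the room.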
A compensation angle θ_exact is then calculated. Let θ_total be the angle between the robot and the speaker; the speaker's direction may then lie at θ_total = θ_basic ± θ_exact. The angle θ_total is passed to a field programmable gate array (FPGA), a complex programmable logic device (CPLD), or another processing device, which drives a motor to turn the robot to face the speaker. The robot can then work with other sensing systems to interact with the user in more varied ways, fulfilling the purpose of a service robot.

In short, the information above determines which array the speaker is nearer to or farther from, yielding a compensation angle θ_exact (the output of the FSLR). When θ1 < θ2, the basic angle θ_basic plus the compensation angle θ_exact is the angle between the speech direction recognizer (which can, for example, be mounted on a robot) and the speaker; conversely, when θ1 > θ2, θ_basic minus θ_exact is the angle θ_total between the recognizer and the speaker.

The invention sets out the design concept of a sound-source direction determination system. Signals that have undergone only simple computation are evaluated across the four microphone arrays [part of this passage is illegible in the source] to obtain the azimuth of the speaker's location, and the relationship between azimuth angles conveys the spatial relationship between the sound and the system [partly illegible]. Applied to a service robot, the invention can greatly improve the [illegible] and stability between the robot and its user.

The fuzzy speaker localization system of the present invention has the following advantages:
(i) It is applicable in any environment.
(ii) Users need no prior training to use it.
(iii) The system computes quickly and simply and has a wide range of uses.
(iv) Combined with the rectangular microphone array system designed by the inventors, it can raise the recognition rate.
(v) The parameters of the fuzzy system are easy to modify, giving high compatibility with different environments and settings.
(vi) It is applicable to all kinds of sounds and is not limited to any particular sound-producing element.

The present invention should therefore satisfy the requirements of the Patent Act regarding [illegible], novelty, and inventive step. Although the foregoing description and drawing disclose a preferred embodiment of the present invention, they present an embodiment only. Any additions, modifications, and substitutions applied to the preferred embodiment still fall within the scope defined by the claims of the present invention. Accordingly, all aspects of the embodiment disclosed herein are to be regarded as illustrating the invention and not as limiting it. The scope of the invention is defined by the appended claims and covers their legal equivalents, and is not limited by the foregoing description.

[Brief Description of the Drawings]
Figure 1 is a schematic diagram of the microphone array configuration of an embodiment of the present invention.

[Description of Main Reference Numerals]
10 First region
12 Microphones
14 Second region
16 Microphones
18 Third region
20 Microphones
22 Fourth region
24 Microphones
26 Robot