JP2010232862A

JP2010232862A - Audio processing apparatus, audio processing method, and program

Info

Publication number: JP2010232862A
Application number: JP2009076984A
Authority: JP
Inventors: Koji Fujimura; 浩司藤村
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2009-03-26
Filing date: 2009-03-26
Publication date: 2010-10-14
Also published as: WO2010109711A1

Abstract

[Problem] To provide an audio processing apparatus, an audio processing method, and a program that suitably reduce noise in sound signals input from microphones in a plurality of environments.
[Solution] An audio processing apparatus 100 includes: a position pattern detection unit 102 that detects an index of the relative position between a sound source and a plurality of microphones; a processing determination unit 103 that determines, based on the relative position index, the audio processing to apply to the sound signal input from each of the microphones; and a signal processing unit 104 that executes the determined audio processing on the sound signals.
[Selected drawing] FIG. 1

Description

The present invention relates to an audio processing apparatus, and more particularly to an audio processing apparatus that obtains a target sound with a high SNR by classifying the positions of a sound source and microphones into N patterns and performing the processing that corresponds to each position pattern.

Techniques are conventionally known that collect sound with a plurality of microphones, called a microphone array, and apply signal processing to the collected signals in order to estimate the direction of a target sound source, or to suppress noise and extract the signal from the target sound source with a high SNR.

For example, Non-Patent Document 1 describes a method in which a target sound is received by a microphone array, the differences in the arrival time of the target sound at each microphone are corrected in the signals received by the microphones, and the signals are then added together. Taking this so-called delay-and-sum yields a signal in which the target sound is emphasized. The invention disclosed in Non-Patent Document 1 assumes that every microphone receives a signal in which the target sound and noise are mixed.

Another known method that uses a plurality of microphones employs two microphones: one collects noise, and the other collects the target sound mixed with noise. Subtracting the output of the noise-collecting microphone from the signal of the microphone that collects the target sound reduces the noise and extracts the target sound more clearly.

As one example, JP 2004-226656 A (Patent Document 1) discloses techniques such as a speaker distance detection device that uses two microphones, calculates the distance between the speaker's lips and a preselected reference microphone from the difference in signal level between the reference microphone and the other microphone, and adjusts, according to that distance, the amount subtracted when the signal of the other microphone is subtracted from the signal of the reference microphone.

The invention disclosed in Patent Document 1 assumes that one microphone receives a signal in which the target sound and noise are mixed, while the other microphone receives only noise or, even if the target sound is mixed in, relatively little of it.

Patent Document 1: JP 2004-226656 A

Non-Patent Document 1: J. L. Flanagan, J. D. Johnston, R. Zahn, and G. W. Elko, "Computer-steered microphone arrays for sound transduction in large rooms," J. Acoust. Soc. Am., vol. 78, no. 5, pp. 1508-1518, 1985

However, when sound signals generated with a plurality of microphones are processed, applying the same processing in an environment where the target sound and noise are mixed at every microphone and in an environment where the target sound and noise enter one microphone while mainly noise enters the others may fail to process the target sound suitably. The inventions disclosed in Non-Patent Document 1 and Patent Document 1 do not take this into account.

The present invention has been made in view of the above to solve these problems, and its object is to suitably reduce noise in sound signals input from microphones in a plurality of environments.

To solve the problems described above and achieve the object, one aspect of the present invention includes: a position pattern detection unit that detects an index of the relative position between a sound source and a plurality of microphones; a processing determination unit that determines, based on the relative position index, the audio processing to apply to the sound signal input from each of the plurality of microphones; and a signal processing unit that executes the determined audio processing on the sound signals.

According to the present invention, noise in sound signals input from microphones can be suitably reduced in a plurality of environments.

A block diagram showing the audio processing apparatus according to the first embodiment.
A flowchart showing the processing in the audio processing apparatus according to the first embodiment.
A diagram showing a position pattern (part 1).
A diagram showing a position pattern (part 2).
A diagram showing a position pattern (part 3).
A diagram showing an example of the audio processing apparatus when the position pattern is classified as (2).
A diagram showing an example of the audio processing apparatus when the position pattern is classified as (1).
A block diagram showing the audio processing apparatus according to the second embodiment.
A flowchart showing the operation of the audio processing apparatus according to the second embodiment.
A block diagram showing the audio processing apparatus according to the third embodiment.
A flowchart showing the operation of the audio processing apparatus according to the third embodiment.
A diagram explaining an example in which an angle sensor is provided in a mobile phone.
An explanatory diagram showing the hardware configuration of the audio processing apparatus according to the present embodiments.

(First Embodiment)
FIG. 1 is a block diagram showing the audio processing apparatus according to the first embodiment. The audio processing apparatus 100 according to the first embodiment collates the input sound signals against sound source position patterns held in advance and executes the audio processing that corresponds to each position pattern. The audio processing apparatus 100 includes a sound input unit 101, a position pattern detection unit 102, a processing determination unit 103, a signal processing unit 104, and a pattern database (hereinafter "pattern DB") 109.

The sound input unit 101 converts the sounds input from the plurality of microphones into digitized sound signals and detects the start and end of speech. The position pattern detection unit 102 detects, from the sound signals, an index of the position pattern of the sound source and the microphones. The processing determination unit 103 determines the processing to execute on the sound signals by collating the position pattern index against the position patterns held in advance. The signal processing unit 104 performs processing according to the determination of the processing determination unit 103.

The pattern DB 109 holds information on the position patterns of the plurality of microphones and the sound source. A position pattern represents the relative position of the sound source with respect to the plurality of microphones. For each position pattern, the pattern DB 109 stores the associated index of the pattern of sound signals input from the plurality of microphones. The patterns stored in the pattern DB 109 are retrieved by the processing determination unit 103 and collated with the index detected by the position pattern detection unit 102.

FIG. 2 is a flowchart showing the processing in the audio processing apparatus 100 according to the first embodiment. The example described here obtains the position pattern of the sound source and the microphones using two microphones, referred to as microphone 1 and microphone 2.

In step S101, the sound input unit 101 uses an AD converter to convert the sound input to the microphones from an analog signal into a digital signal.

In step S102, speech section detection, using for example the number of zero crossings, is performed to detect the start and end of the speech to be noise-processed. This speech section detection uses the outputs of microphone 1 and microphone 2.

More specifically, the number of zero crossings is calculated for the sound signal acquired by microphone 1 and for the sound signal acquired by microphone 2, and once a speech section is judged to have been detected at either microphone, the sound from that detection point onward is treated as speech.

The start positions detected by microphone 1 and microphone 2 are held as S1 and S2, respectively. The processing of FIG. 2 ends after the end of speech has been confirmed in the output of the microphone whose speech start was detected latest. The section detection method is not limited to this one, and various section detection schemes can be applied. For example, a section detection method specific to multiple microphones may be used.
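The zero-crossing detection of step S102 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the text only says that the number of zero crossings is used, so the frame length, the amplitude floor, the crossing threshold, and all function names are hypothetical.

```python
def count_zero_crossings(frame, amp_floor):
    """Count sign changes between samples whose magnitude is at least amp_floor."""
    count = 0
    prev_sign = 0
    for s in frame:
        if abs(s) < amp_floor:
            continue  # ignore low-level samples so that faint noise is not counted
        sign = 1 if s > 0 else -1
        if prev_sign != 0 and sign != prev_sign:
            count += 1
        prev_sign = sign
    return count


def detect_start(signal, frame_len=80, amp_floor=0.05, min_crossings=4):
    """Return the sample index of the first frame judged to contain speech, or None."""
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        if count_zero_crossings(frame, amp_floor) >= min_crossings:
            return start
    return None


def detect_starts(x1, x2, **kwargs):
    """Return (S1, S2, earliest) for the signals of microphone 1 and microphone 2."""
    s1 = detect_start(x1, **kwargs)
    s2 = detect_start(x2, **kwargs)
    found = [s for s in (s1, s2) if s is not None]
    earliest = min(found) if found else None
    return s1, s2, earliest
```

The third return value corresponds to the point from which sound is treated as speech once either microphone has detected a speech section.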

In step S103, the position pattern detection unit 102 uses the speech signals detected by the sound input unit 101 to detect indices of the position pattern of the sound source and the microphones. The indices used are, for example, the difference in speech arrival time between the microphones and the signal level ratio.

More specifically, when microphone 1 is taken as the reference, the larger the speech arrival time difference to microphone 2, the closer the sound source is to microphone 1. Likewise, the higher the signal level at microphone 1 compared with microphone 2, the closer the sound source is to microphone 1.

These two indices are calculated from the initial speech section of the target speech, that is, the speech within a fixed interval after speech is detected. The difference in speech arrival time between the microphones is calculated by cross-correlation. Let time 0 be the moment at which the earliest speech start is detected among the microphones. Let x1 be the speech signal input to microphone 1, let the correlation calculation section of x1 run from time ts to time te (S1 < ts < te), and let x1' be the waveform in that section normalized by its power. Let x2 be the speech signal input to microphone 2. The end time T of the section [0, T] used to find the arrival time difference is set so that at least the interval from ts to td is available after the start detection time of the microphone whose speech start was detected latest. For example, when S1 < S2, T is set by the following Expression (1).

In Expression (1), the speech arrival time difference td can be expressed by Expression (5) using the following Expressions (2) to (4).

This arrival time difference td serves as one of the indices for determining the position pattern of the sound source and the microphones. With two microphones, once the arrival time difference td from microphone 1 to microphone 2 is obtained, reversing its sign gives the arrival time difference from microphone 2 to microphone 1.
The signal level ratio dd between microphone 1 and microphone 2 can be obtained from the following expression using the td obtained above.

The signal level ratio dd of Expression (6) serves as another index for determining the position pattern of the sound source and the microphones. The position pattern determination indices are not limited to those described above, and various other criteria can be applied, for example the maximum value of the correlation calculated earlier. If the maximum correlation value is higher than a certain reference, the position pattern can be inferred to be one in which the sound source is equidistant from the two microphones; if it is lower than the reference, the sound source is near one microphone and far from the other. The maximum correlation value r_max is calculated by the following Expression (7).
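The three indices td, dd, and r_max can be sketched as below. Expressions (1) to (7) are not reproduced in this text, so the sketch substitutes common definitions and should not be read as the patent's formulas: a normalized cross-correlation searched over a lag range, an RMS level ratio over the aligned segments, and the correlation peak. The function names and the `max_lag` parameter are hypothetical, and time is counted in samples.

```python
import math


def normalize_power(segment):
    """Scale a segment to unit energy (the x1' of the text); all-zero input is returned as is."""
    energy = sum(s * s for s in segment)
    if energy == 0.0:
        return list(segment)
    scale = 1.0 / math.sqrt(energy)
    return [s * scale for s in segment]


def position_indices(x1, x2, ts, te, max_lag):
    """Return (td, dd, r_max) for two sample lists, with 0 <= ts < te <= len(x1).

    td    : lag in samples at which x2 best matches x1[ts:te];
            positive when the sound reaches microphone 2 later.
    dd    : RMS level ratio microphone 1 / microphone 2 over the aligned segments.
    r_max : peak of the normalized cross-correlation.
    """
    ref = normalize_power(x1[ts:te])
    best_lag, r_max = 0, -2.0
    for lag in range(-max_lag, max_lag + 1):
        lo, hi = ts + lag, te + lag
        if lo < 0 or hi > len(x2):
            continue  # the shifted segment would fall outside x2
        seg = normalize_power(x2[lo:hi])
        r = sum(a * b for a, b in zip(ref, seg))
        if r > r_max:
            best_lag, r_max = lag, r
    seg1 = x1[ts:te]
    seg2 = x2[ts + best_lag:te + best_lag]
    rms1 = math.sqrt(sum(s * s for s in seg1) / len(seg1))
    rms2 = math.sqrt(sum(s * s for s in seg2) / len(seg2))
    dd = rms1 / rms2 if rms2 > 0.0 else float("inf")
    return best_lag, dd, r_max
```

With this sign convention, a positive td and a dd above 1 both point to a source nearer microphone 1, which matches the qualitative behavior described for step S103.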

In step S104, the processing determination unit 103 uses the position pattern determination indices calculated by the position pattern detection unit 102 to determine which of the following three position patterns (1) to (3) applies. FIGS. 3 to 5 show the three position patterns.

(1) The sound source is close to microphone 1 (FIG. 3).
(2) The sound source is close to microphone 2 (FIG. 4).
(3) The sound source is close to neither microphone (FIG. 5).

Let the arrival time difference determination threshold t_thre and the signal level difference determination thresholds dd_thre1 and dd_thre2 be constants (where t_thre > 0 and dd_thre1 > dd_thre2 > 0). When td > 0 and the following Expressions (8) and (9) hold, the position pattern is classified as (1).

When td <= 0 and the following Expressions (10) and (11) hold, the position pattern is classified as (2).

If the position pattern is classified as neither (1) nor (2), it is classified as (3).
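The three-way decision of step S104 can be sketched as follows. Expressions (8) to (11) are not reproduced in this text, so their form here is an assumption: the magnitude of td is compared against t_thre, and the level ratio dd against dd_thre1 (microphone 1 louder) or dd_thre2 (microphone 2 louder). The function name is hypothetical.

```python
def classify_two_mics(td, dd, t_thre, dd_thre1, dd_thre2):
    """Return 1, 2 or 3 for position patterns (1), (2) and (3)."""
    if not (t_thre > 0 and dd_thre1 > dd_thre2 > 0):
        raise ValueError("need t_thre > 0 and dd_thre1 > dd_thre2 > 0")
    if td > 0:
        # Assumed form of Expressions (8) and (9).
        if td > t_thre and dd > dd_thre1:
            return 1
    else:
        # Assumed form of Expressions (10) and (11).
        if -td > t_thre and dd < dd_thre2:
            return 2
    return 3
```

Requiring both the time and the level condition means that an ambiguous measurement falls through to pattern (3), where both microphones are used together.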

In step S105, the signal processing unit 104 performs the processing predetermined for the classified position pattern. FIGS. 6 and 7 show the operation of the signal processing unit 104 when the signal processing is switched. FIG. 6 shows an example of the audio processing apparatus when the position pattern is classified as (2), and FIG. 7 shows an example when it is classified as (1).

The switching of the signal processing is described below.

When the position pattern is classified as (1), the sound input to microphone 1 is treated as the target speech and the sound input to microphone 2 is treated as noise. Specifically, with α a constant (α ≥ 0), the output speech o of the audio processing apparatus 100 can be expressed by the following Expression (12) using the delay time td calculated above.

In this case, the signals may instead be converted into the frequency domain and spectral subtraction performed. Alternatively, an adaptive filter of the kind often used in echo cancellers can remove the noise component from x1 with x2 as the reference signal.
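The time-domain subtraction for position pattern (1) can be sketched as below. Expression (12) is not reproduced in this text; the sketch assumes the form o[n] = x1[n] - α · x2[n + td], with td in samples. The function name and the zero padding at the edges are choices made for this illustration.

```python
def subtract_reference(target, reference, delay, alpha):
    """Return target[n] - alpha * reference[n + delay] for every sample n.

    delay is the arrival time difference in samples from the target
    microphone to the reference microphone. Reference samples that fall
    outside the signal are treated as zero.
    """
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    out = []
    for n, t in enumerate(target):
        m = n + delay
        r = reference[m] if 0 <= m < len(reference) else 0.0
        out.append(t - alpha * r)
    return out
```

Under the same assumption, swapping the roles of the microphones, as in `subtract_reference(x2, x1, -td, alpha)`, gives the processing in which microphone 2 carries the target speech.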

When the position pattern is classified as (2), the sound input to microphone 2 is treated as the target speech and the sound input to microphone 1 is treated as noise. The processing is the same as that for position pattern (1) with microphone 1 and microphone 2 interchanged. The output speech o is then expressed by the following Expression (13).

Other processing is also conceivable when the position pattern is classified as one in which the sound source is close to a particular microphone. For example, α may be made a function of the maximum correlation value to adjust the subtraction amount. In that case, the value of α can be controlled by the following Expression (14), a linear function with constants a and b.

Expressing α as in Expression (14) makes the subtraction amount small when the maximum correlation value is high and large when it is low.
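A linear control of α by the maximum correlation value can be sketched as follows. The text states only that Expression (14) is a linear function with constants a and b; the form α = a · r_max + b, the function name, and the clipping at zero (added here so that α ≥ 0 always holds) are assumptions of this sketch.

```python
def subtraction_weight(r_max, a, b):
    """Return alpha = a * r_max + b, clipped at zero so that alpha stays non-negative."""
    return max(0.0, a * r_max + b)
```

Choosing a negative slope a gives the behavior described in the text: a high maximum correlation yields a small subtraction amount, and a low one yields a large subtraction amount.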

When the position pattern is classified as (3), delay-and-sum array processing is performed using the speech input to both microphone 1 and microphone 2. With the delay-and-sum array, the output speech o is expressed by the following Expression (15).
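A two-microphone delay-and-sum can be sketched as below. Expression (15) is not reproduced in this text; the sketch assumes the output is the average of x1[n] and the time-aligned x2[n + td], with td in samples. The 0.5 scaling, the zero padding, and the function name are assumptions.

```python
def delay_and_sum(x1, x2, td):
    """Average x1[n] and x2[n + td]; samples outside x2 are treated as zero."""
    out = []
    for n, s1 in enumerate(x1):
        m = n + td
        s2 = x2[m] if 0 <= m < len(x2) else 0.0
        out.append(0.5 * (s1 + s2))
    return out
```

Aligning the signals before summing makes the target speech add coherently, while noise that is uncorrelated between the microphones is attenuated by the averaging.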

The array processing applied here is not limited to the above method. For example, applying Griffiths-Jim type array processing forms a noise null at a given angle with the two microphones, which makes it possible to extract the target speech o with a high SNR for speech within that range.
Returning to FIG. 2, in step S106, the sound input unit 101 detects the end of the speech, and the audio processing ends.

An embodiment of the present invention has been described above using two microphones as an example. However, two microphones are not essential to implementing the invention, and the invention can be extended to three or more microphones. With three microphones, referred to as microphone 1, microphone 2, and microphone 3, the following seven position patterns are prepared.

(1') The sound source is close to microphone 1.
(2') The sound source is close to microphone 2.
(3') The sound source is close to microphone 3.
(4') The sound source is close to microphones 1 and 2.
(5') The sound source is close to microphones 2 and 3.
(6') The sound source is close to microphones 1 and 3.
(7') The sound source is close to none of the microphones.

Which position pattern an input sound signal is classified into is determined using the arrival time differences and signal level differences described above. More specifically, the arrival time difference from microphone 1 to microphone 2 is denoted td_12 and calculated by Expressions (2) to (5). The arrival time differences between the other microphones, td_13, td_21, td_23, td_31, and td_32, are calculated in the same way. The signal level difference of microphone 1 relative to microphone 2 is denoted dd_12 and calculated by Expression (6). The signal level differences between the other microphones, dd_13, dd_21, dd_23, dd_31, and dd_32, are calculated in the same way.

Here, the microphone n1 closest to the sound source has a positive arrival time difference to each of the other two microphones. The microphone n2 second closest to the sound source has a positive arrival time difference to the one remaining microphone and a negative one to microphone n1. The microphone n3 farthest from the sound source has a negative arrival time difference to each of the other two microphones. Using this property, it is first determined which microphones are microphone n1, microphone n2, and microphone n3.

Let td_n1n2 be the arrival time difference between microphone n1 and microphone n2 and td_thre1 its threshold, and let dd_n1n2 be the signal level difference between microphone n1 and microphone n2 and dd_thre1 its threshold. When the following Expressions (16) and (17) are satisfied, the position pattern is classified as (1') if microphone 1 is microphone n1, (2') if microphone 2 is microphone n1, and (3') if microphone 3 is microphone n1.

Next, let td_n1n2 be the arrival time difference between microphone n1 and microphone n2 and td_thre1 its threshold, td_n2n3 the arrival time difference between microphone n2 and microphone n3 and td_thre2 its threshold, dd_n1n2 the signal level difference between microphone n1 and microphone n2 and dd_thre1 its threshold, and dd_n2n3 the signal level difference between microphone n2 and microphone n3 and dd_thre2 its threshold. When the following Expressions (18) to (21) are satisfied, the position pattern is classified as (4') if microphone 3 is microphone n3, (5') if microphone 1 is microphone n3, and (6') if microphone 2 is microphone n3.

If the position pattern is classified into none of (1') to (6'), the sound source is regarded as far from all the microphones and the position pattern is classified as (7').
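The three-microphone classification can be sketched as follows. Expressions (16) to (21) are not reproduced in this text, so the threshold tests below are assumptions: n1 is taken to be alone near the source when it clearly leads n2 in both time and level, and n1 and n2 are taken to be jointly near when they are similar to each other and n2 clearly leads n3. The dictionary layout and all function names are hypothetical.

```python
def pairwise(values, combine):
    """Build {(i, j): combine(values[i], values[j])} for all ordered pairs i != j."""
    return {(i, j): combine(values[i], values[j])
            for i in values for j in values if i != j}


def rank_by_arrival(td):
    """Order microphones 1..3 from nearest (n1) to farthest (n3).

    td[(i, j)] is the arrival time difference from microphone i to
    microphone j, positive when the sound reaches microphone i first.
    A microphone is ranked by how many other microphones it leads.
    """
    mics = (1, 2, 3)
    leads = {i: sum(1 for j in mics if j != i and td[(i, j)] > 0) for i in mics}
    return sorted(mics, key=lambda i: -leads[i])


def classify_three_mics(td, dd, td_thre1, td_thre2, dd_thre1, dd_thre2):
    """Return a pattern label from "1'" to "7'" for three microphones."""
    n1, n2, n3 = rank_by_arrival(td)
    # Assumed form of Expressions (16) and (17): n1 clearly leads n2.
    if td[(n1, n2)] > td_thre1 and dd[(n1, n2)] > dd_thre1:
        return {1: "1'", 2: "2'", 3: "3'"}[n1]
    # Assumed form of Expressions (18) to (21): n1 and n2 are similar,
    # and n2 clearly leads n3.
    if (td[(n1, n2)] <= td_thre1 and dd[(n1, n2)] <= dd_thre1
            and td[(n2, n3)] > td_thre2 and dd[(n2, n3)] > dd_thre2):
        return {3: "4'", 1: "5'", 2: "6'"}[n3]
    return "7'"
```

For experimentation, `pairwise` can build the inputs from per-microphone arrival times and levels, for example `pairwise({1: 0, 2: 5, 3: 6}, lambda a, b: b - a)` for the arrival time differences.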

After this classification, the processing is switched according to the pattern. More specifically, in the cases of (1'), (2'), and (3'), the processing of Expression (22) is performed, which subtracts noise from the target sound of the microphone close to the sound source.

Here, α1 and α2 are constants, with α1 ≥ 0 and α2 ≥ 0.

In the cases of (4'), (5'), and (6'), the processing of Expression (23) is performed. The speech from the two microphones close to the sound source is thereby enhanced by a delay-and-sum array, and the output of the microphone farthest from the sound source is used for noise subtraction.

In the case of (7'), the processing of Expression (24) is performed, so that the speech is enhanced by a delay-and-sum array using all the microphones.

In this way, the method extends readily to three microphones.
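The three kinds of processing described for patterns (1') to (7') share one shape, which can be sketched as below: average the time-aligned signals of the microphones near the source and subtract weighted signals of the remaining microphones. Expressions (22) to (24) are not reproduced in this text, so this unified form, the averaging, and the function names are assumptions of the sketch.

```python
def shift(signal, delay):
    """Return y with y[n] = signal[n + delay]; samples outside the signal are zero."""
    return [signal[n + delay] if 0 <= n + delay < len(signal) else 0.0
            for n in range(len(signal))]


def enhance(near, far, alphas):
    """Average the time-aligned near signals and subtract weighted far signals.

    near   : signals from the microphones close to the sound source
    far    : signals from the remaining microphones (may be empty)
    alphas : one non-negative weight per far signal
    """
    if len(far) != len(alphas) or any(a < 0 for a in alphas):
        raise ValueError("need one non-negative alpha per far signal")
    out = []
    for n in range(len(near[0])):
        target = sum(sig[n] for sig in near) / len(near)
        noise = sum(a * sig[n] for a, sig in zip(alphas, far))
        out.append(target - noise)
    return out
```

Under these assumptions, patterns (1') to (3') use one near signal and two far signals weighted by α1 and α2, patterns (4') to (6') use two near signals and one far signal, and pattern (7') uses all three as near signals with no subtraction. Each signal would first be aligned with `shift` using the corresponding arrival time difference.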

Alternatively, three or more microphones may be used to estimate the sound source position in three-dimensional space. Once the sound source position has been estimated, the distance from each microphone to the sound source can be calculated. The microphone-to-source distances obtained in this way are denoted ld_1, ld_2, and ld_3.

With the distance threshold ld_thre as a constant, the position pattern is classified as (1') when the following Expression (25) is satisfied.

Classification into position patterns (2') to (7') can be realized in the same way.

(Second Embodiment)
FIG. 8 is a block diagram showing the audio processing apparatus according to the second embodiment. The audio processing apparatus 100a according to the second embodiment selects and performs, on the sound signals, the processing that corresponds to the sound source position acquired by a position sensor. The audio processing apparatus 100a includes a sound input unit 101, a position pattern detection unit 102a, a processing determination unit 103a, a signal processing unit 104, and a pattern DB 109a.

The sound input unit 101 detects the start and end of speech from the input sound. The position pattern detection unit 102a detects an index of the position pattern of the sound source and the microphones from the signal of the position sensor. The processing determination unit 103a determines the processing to execute by collating the position pattern index against the position patterns held in advance. The signal processing unit 104 performs processing according to the determination of the processing determination unit 103a.

The pattern DB 109a holds the position patterns of the sound source and the microphones. For each position pattern of the relative position of the sound source and the microphones, the pattern DB 109a associates an index of the signal input from the position sensor. The position patterns stored in the pattern DB 109a are read out by the position pattern detection unit 102a and collated with the input from the position sensor.

FIG. 9 is a flowchart showing the operation of the audio processing apparatus according to the second embodiment. The description uses an example in which two microphones (microphone 1 and microphone 2) are used to process target speech. Exactly two microphones are not essential; the embodiment can be implemented with any number of microphones from two upward. Nor is it essential that the target sound be speech. The operation of this embodiment is the same as that of the first embodiment except for the operations of the position sensor, the position pattern detection unit 102a, and the processing determination unit 103a, and the description of the parts that operate as in the first embodiment is omitted.

In step S203, the measurement result output by a distance sensor attached near each microphone is used as the position pattern determination index. Specifically, each distance sensor is, for example, an infrared sensor that can measure the distance from its microphone to the target object corresponding to the sound source, and the distance from each microphone to the sound source is measured. Two microphones are used, and the distances from microphone 1 and microphone 2 to the sound source are denoted ld_1 and ld_2, respectively.

In step S204, the process determining unit 103a uses the position pattern determination index calculated by the position pattern detection unit 102a to classify the current state into one of three position patterns. The three patterns are as follows.
(1A) The sound source is close to microphone 1.
(2A) The sound source is close to microphone 2.
(3A) The sound source is not close to either microphone.

Here, with the distance threshold ld_thre taken as a constant, the position pattern is classified as (1A) when expression (26) holds and as (2A) when expression (27) holds.

If neither expression holds, the position pattern is classified as (3A). The processing in steps S205 and S206, which follows the position pattern classification, is the same as in steps S105 and S106 of FIG. 2, so its description is omitted here.
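Expressions (26) and (27) are not reproduced in this text, so the following Python sketch only illustrates the step S204 classification under an assumed form: (1A) when ld_1 < ld_thre and (2A) when ld_2 < ld_thre. The function name `classify_by_distance`, the unit of distance, and the rule that the nearer microphone wins when both distances fall below the threshold are assumptions and are not taken from the embodiment.

```python
def classify_by_distance(ld_1, ld_2, ld_thre):
    """Classify the position pattern from two distance-sensor readings.

    ld_1, ld_2: distances from microphone 1 and microphone 2 to the source.
    ld_thre:    distance threshold (constant), in the same unit as ld_1, ld_2.
    Returns "1A", "2A", or "3A".
    """
    near_1 = ld_1 < ld_thre  # assumed form of expression (26)
    near_2 = ld_2 < ld_thre  # assumed form of expression (27)

    # Assumption: if both microphones are within the threshold,
    # the nearer one determines the pattern.
    if near_1 and (not near_2 or ld_1 <= ld_2):
        return "1A"
    if near_2:
        return "2A"
    return "3A"


if __name__ == "__main__":
    # Hypothetical readings in metres with a 5 cm threshold.
    print(classify_by_distance(0.03, 0.20, 0.05))
    print(classify_by_distance(0.20, 0.30, 0.05))
```

The returned label would then select the processing for that pattern in step S205.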

(Third embodiment)
FIG. 10 is a block diagram showing a speech processing apparatus according to the third embodiment. The speech processing apparatus 100b according to the third embodiment detects the position pattern of the sound source based on both the input from a position sensor and the sound signal, and executes the audio processing corresponding to each position pattern.

The speech processing apparatus 100b includes a sound input unit 101, a position pattern detection unit 102b, a process determining unit 103b, a signal processing unit 104, and a pattern DB 109b.
The sound input unit 101 converts the input sound from the microphones into a digitized sound signal and detects the start and end of speech. The position pattern detection unit 102b detects an index of the position pattern of the sound source and the microphones from the input from the position sensor and from the speech. The process determining unit 103b determines the processing to be executed by matching the position pattern index against position patterns held in advance. The signal processing unit 104 performs the processing determined by the process determining unit 103b.

The pattern DB 109b holds position patterns of the microphones and the sound source. In the pattern DB 109b, each position pattern, that is, each relative position of the microphones and the sound source, is associated with a combination of an index of the signal input from the position sensor and an index of the sound signal. The position pattern detection unit 102b reads the patterns stored in the pattern DB 109b and matches them against the sound signal acquired by the sound input unit 101 and the input from the position sensor.

FIG. 11 is a flowchart showing the operation of the speech processing apparatus according to the third embodiment. The description uses an example in which target speech is processed with two microphones (microphone 1 and microphone 2). Using exactly two microphones is not essential; any number of microphones of two or more suffices. Nor is it essential that the target sound be speech. The operation of this embodiment is the same as that of the second embodiment except for the operations of the position pattern detection unit 102b and the process determining unit 103b, so the description of the parts that operate in the same way is omitted here.

The speech processing apparatus 100b may use, for example, a distance sensor as the position sensor. In step S303, the position pattern detection unit 102b acquires the measurement result of the distance sensor and the speech information as the position pattern determination index.

More specifically, an infrared sensor or the like is used as the position sensor to measure the distance from the apparatus to the sound source. Two microphones are used to acquire the sound signals, and the distance from the speech processing apparatus 100b to the sound source obtained with the sensor is denoted ld. As in the first embodiment, the speech arrival time difference td and the signal level ratio dd are also obtained.
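As an illustration of how the two acoustic indices might be obtained, the following Python sketch estimates td as the lag that maximizes the cross-correlation of the two microphone signals and dd as the ratio of their RMS levels. The computation used in the first embodiment is not restated in this section; the cross-correlation estimate, the RMS ratio, the sign convention (td > 0 when the sound reaches microphone 1 first), and the names `arrival_time_difference` and `level_ratio` are all assumptions.

```python
import math


def arrival_time_difference(x1, x2, sample_rate, max_lag):
    """Estimate td in seconds; positive when x2 lags x1 (mic 1 hears it first)."""
    n = min(len(x1), len(x2))
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Cross-correlation of x1 with x2 shifted by `lag` samples.
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += x1[i] * x2[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag / sample_rate


def level_ratio(x1, x2):
    """Return dd as RMS(x1) / RMS(x2); infinity if x2 is silent."""
    rms1 = math.sqrt(sum(v * v for v in x1) / len(x1))
    rms2 = math.sqrt(sum(v * v for v in x2) / len(x2))
    return math.inf if rms2 == 0 else rms1 / rms2


if __name__ == "__main__":
    # Hypothetical frame: mic 2 receives the same pulse 3 samples later
    # at half the amplitude.
    mic1 = [0, 0, 1, 2, 1, 0, 0, 0, 0, 0]
    mic2 = [0, 0, 0, 0, 0, 0.5, 1, 0.5, 0, 0]
    print(arrival_time_difference(mic1, mic2, 8000, 5))
    print(level_ratio(mic1, mic2))
```

With this convention, a source near microphone 1 gives td > 0 and dd > 1, and a source near microphone 2 gives td < 0 and dd < 1.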

In step S304, the process determining unit 103b uses the position pattern determination index calculated by the position pattern detection unit 102b to classify the current state into one of three position patterns. The three patterns are as follows.
(1B) The sound source is close to microphone 1.
(2B) The sound source is close to microphone 2.
(3B) The sound source is not close to either microphone.

The arrival time difference determination threshold t_thre, the signal level difference determination thresholds dd_thre1 and dd_thre2, and the distance determination threshold ld_thre are constants (where t_thre > 0, dd_thre1 > dd_thre2 > 0, and ld_thre > 0). When td > 0 and all of the conditions in expression (28) hold, the position pattern is classified as (1B).

When td <= 0 and all of the conditions in expression (29) hold, the position pattern is classified as (2B).

A position pattern that is neither (1B) nor (2B) is classified as (3B). For each of these three position patterns, the same processing as for patterns (1), (2), and (3) of the first embodiment is performed.
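Expressions (28) and (29) are not reproduced in this text. The Python sketch below keeps the branch on the sign of td that the description states, and fills in the conditions with an assumed form: |td| > t_thre, dd > dd_thre1 for (1B) or dd < dd_thre2 for (2B), and ld < ld_thre. These inequalities and the function name `classify_with_distance` are assumptions chosen to be consistent with the stated constraints on the thresholds, not the expressions of the embodiment.

```python
def classify_with_distance(td, dd, ld, t_thre, dd_thre1, dd_thre2, ld_thre):
    """Classify the position pattern from td, dd, and the sensor distance ld.

    Requires t_thre > 0, dd_thre1 > dd_thre2 > 0, ld_thre > 0.
    Returns "1B", "2B", or "3B".
    """
    if not (t_thre > 0 and dd_thre1 > dd_thre2 > 0 and ld_thre > 0):
        raise ValueError("thresholds violate the stated constraints")

    close = ld < ld_thre  # assumed distance condition
    if td > 0:
        # Assumed form of expression (28): all three conditions must hold.
        if td > t_thre and dd > dd_thre1 and close:
            return "1B"
    else:
        # Assumed form of expression (29): all three conditions must hold.
        if -td > t_thre and dd < dd_thre2 and close:
            return "2B"
    return "3B"


if __name__ == "__main__":
    # Hypothetical values: td in seconds, dd as a level ratio, ld in metres.
    print(classify_with_distance(4e-4, 2.5, 0.04, 2e-4, 2.0, 0.5, 0.10))
    print(classify_with_distance(4e-4, 2.5, 0.50, 2e-4, 2.0, 0.5, 0.10))
```

Requiring every condition at once means that a single disagreeing index, for example a large sensor distance, sends the frame to (3B).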

The output of an angle sensor can also be used as the position pattern determination index. FIG. 12 is a diagram illustrating an example in which an angle sensor is provided in a mobile phone. In the example of FIG. 12, the mobile phone is held horizontally during operation and vertically during a call. In such a device, the angle is detected with an angle sensor attached to the device body. An example of the detected angle θ is shown in FIG. 12. The angle θ is defined, for example, as 0 degrees when the line segment connecting the two microphones is parallel to the ground. As in the first embodiment, the speech arrival time difference td and the signal level ratio dd are also obtained.

In the example of FIG. 12, the position pattern is classified into (1B), (2B), or (3B) by expressions (30) and (31), where the arrival time difference determination threshold t_thre, the signal level difference determination thresholds dd_thre1 and dd_thre2, and the angle determination threshold θ_thre are constants (where t_thre > 0, dd_thre1 > dd_thre2 > 0, and θ_thre ≧ 0).

When td > 0 and expression (30) holds, the position pattern is classified as (1B).

When td <= 0 and expression (31) holds, the position pattern is classified as (2B).

If the position pattern is neither (1B) nor (2B), it is classified as (3B).
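Expressions (30) and (31) are likewise not reproduced in this text. As a rough sketch of the angle-sensor variant, the Python function below assumes that the close-talking patterns require the device to be tilted at least θ_thre away from the horizontal (θ = 0), in addition to the assumed td and dd conditions. The inequality on θ, the use of degrees, and the name `classify_with_angle` are assumptions.

```python
def classify_with_angle(td, dd, theta_deg, t_thre, dd_thre1, dd_thre2,
                        theta_thre_deg):
    """Classify the position pattern from td, dd, and the device angle.

    theta_deg is 0 when the line through the two microphones is parallel
    to the ground. Returns "1B", "2B", or "3B".
    """
    # Assumption: a call posture means the tilt reaches the angle threshold.
    tilted = abs(theta_deg) >= theta_thre_deg
    if td > 0:
        # Assumed form of expression (30).
        if td > t_thre and dd > dd_thre1 and tilted:
            return "1B"
    elif -td > t_thre and dd < dd_thre2 and tilted:
        # Assumed form of expression (31).
        return "2B"
    return "3B"


if __name__ == "__main__":
    # Hypothetical values: phone held upright (80 degrees) versus flat.
    print(classify_with_angle(4e-4, 2.5, 80.0, 2e-4, 2.0, 0.5, 45.0))
    print(classify_with_angle(4e-4, 2.5, 5.0, 2e-4, 2.0, 0.5, 45.0))
```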

(Minimum configuration for implementation by a computer or the like)
Next, the hardware configuration of the speech processing apparatus according to the present embodiment will be described with reference to FIG. 13. FIG. 13 is an explanatory diagram showing the hardware configuration of the speech processing apparatus according to the present embodiment.

The speech processing apparatus according to the present embodiment includes a control device such as a CPU (Central Processing Unit) 51, storage devices such as a ROM (Read Only Memory) 52 and a RAM (Random Access Memory) 53, a communication I/F 54 that connects to a network and performs communication, and a bus 61 that connects these units.

The program executed by the speech processing apparatus according to the present embodiment is provided by being incorporated in advance in the ROM 52 or the like.

The program executed by the speech processing apparatus according to the present embodiment may instead be provided as a file in an installable or executable format recorded on a computer-readable recording medium such as a CD-ROM (Compact Disk Read Only Memory), a flexible disk (FD), a CD-R (Compact Disk Recordable), or a DVD (Digital Versatile Disk).

Furthermore, the program executed by the speech processing apparatus according to the present embodiment may be stored on a computer connected to a network such as the Internet and provided by being downloaded via the network. The program may also be provided or distributed via a network such as the Internet.

The program executed by the speech processing apparatus according to the present embodiment has a module configuration that includes the units described above. As actual hardware, the CPU 51 reads the program from the ROM 52 and executes it, whereby each unit is loaded onto, and generated on, the main storage device.

The present invention is not limited to the above embodiments as they are, and in the implementation stage the constituent elements can be modified and embodied without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the constituent elements disclosed in the above embodiments. For example, some constituent elements may be deleted from all the constituent elements shown in an embodiment. Furthermore, constituent elements of different embodiments may be combined as appropriate.

As described above, the speech processing apparatus according to the embodiments of the present invention is useful for noise removal and is particularly suitable for processing sound signals input from a microphone array.

1, 2, 3 Microphone
100, 100a, 100b Speech processing apparatus
101 Sound input unit
102, 102a, 102b Position pattern detection unit
103, 103a, 103b Process determining unit
104 Signal processing unit

Claims

A speech processing apparatus comprising:
a position pattern detection unit that detects an index of relative position between a sound source and a plurality of microphones;
a process determining unit that determines, based on the index of the relative position, audio processing for a sound signal input from each of the plurality of microphones; and
a signal processing unit that executes the determined audio processing on the sound signal.

The speech processing apparatus according to claim 1, wherein the relative position index includes a difference in arrival time of sound signals input from the plurality of microphones and a difference in level of the sound signals.

The speech processing apparatus according to claim 1, wherein the relative position index includes a distance measured by a distance sensor provided at a predetermined position with respect to each of the plurality of microphones.

The speech processing apparatus according to claim 1, wherein the relative position index includes an inclination of the microphone measured by an angle sensor provided at a predetermined position with respect to each of the plurality of microphones.

The speech processing apparatus according to claim 1, wherein the process determining unit determines to perform processing that gives a greater weight to a sound signal input from a microphone whose distance from the sound source is smaller than a predetermined value than to a sound signal input from a microphone whose distance from the sound source is equal to or greater than the predetermined value.

The speech processing apparatus according to any one of claims 1 to 4, wherein the process determining unit determines to perform audio processing that takes a delay-and-sum of the sound signals input from the plurality of microphones when the distances of the plurality of microphones from the sound source are equal to or greater than a predetermined value.

A program for causing a computer to function as:
a position pattern detection unit that detects an index of relative position between a sound source and a plurality of microphones;
a process determining unit that determines, based on the index of the relative position, audio processing for a sound signal input from each of the plurality of microphones; and
a signal processing unit that executes the determined audio processing on the sound signal.

A speech processing method comprising:
a position pattern detection step in which a position pattern detection unit detects an index of relative position between a sound source and a plurality of microphones;
a process determining step in which a process determining unit determines, based on the index of the relative position, audio processing for a sound signal input from each of the plurality of microphones; and
a signal processing step in which a signal processing unit executes the determined audio processing on the sound signal.