JP2006243664A

JP2006243664A - Signal separation device, signal separation method, signal separation program, and recording medium

Info

Publication number: JP2006243664A
Application number: JP2005062957A
Authority: JP
Inventors: Shoko Araki (荒木 章子); Shoji Makino (牧野 昭二); Hiroshi Sawada (澤田 宏); Ryo Mukai (向井 良)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: NTT Inc
Priority date: 2005-03-07
Filing date: 2005-03-07
Publication date: 2006-09-14

Abstract

PROBLEM TO BE SOLVED: To separate and extract a target signal with little distortion from a mixed signal in which signals emitted from a plurality of signal sources are mixed.
SOLUTION: First, in each frame set with a shift of 1/S (S > 2) times the length T of a window function, the mixed signal is multiplied by the window function, and each result is converted into a time-frequency domain signal. Next, feature quantities extracted from these time-frequency domain signals are clustered to generate clusters, and the clusters are used to generate a mask for each time-frequency point. Then, using the masks and the time-frequency domain signals, a time-frequency domain separated signal is extracted at each time-frequency point, the time-frequency domain separated signals are converted into time-domain separated signals, and the time-domain separated signals corresponding to the respective frames are added together.
SELECTED DRAWING: Figure 2

Description

The present invention relates to the technical field of signal processing, and more particularly to a technique for estimating, separating, and extracting a target signal when it is observed with noise or other signals superimposed on it.

Beamformers (beamforming) are widely known as a signal extraction technique that uses a plurality of sensors (see, for example, Non-Patent Document 1). However, a beamformer requires information such as the direction of the target signal and the time intervals in which the target signal is inactive, and the accuracy of signal extraction falls when this information cannot be given (or estimated) accurately. Another technique is blind signal separation (BSS) based on independent component analysis (ICA) (see, for example, Non-Patent Document 2). BSS has the advantage of not requiring the information that a beamformer needs. However, both beamformers and BSS can extract signals accurately only when the number of sensors M is equal to or greater than the number of signals N (the number of target signals plus the number of noise sources), that is, when N ≤ M.

On the other hand, when the number of sensors M is smaller than the number of signals N (N > M), a method based on time-frequency masks can be used (see, for example, Non-Patent Document 3). A time-frequency mask is a mask defined for each time-frequency point; binary masks taking the value 1 or 0 are widely used. This method makes signal extraction possible even when N > M.
Non-Patent Document 1: B. D. Van Veen and K. M. Buckley, "Beamforming: a versatile approach to spatial filtering," IEEE ASSP Magazine, pp. 2-24, April 1988.
Non-Patent Document 2: A. Hyvärinen, J. Karhunen, and E. Oja, "Independent Component Analysis," John Wiley & Sons, 2001, ISBN 0-471-40540.
Non-Patent Document 3: O. Yilmaz and S. Rickard, "Blind separation of speech mixtures via time-frequency masking," IEEE Trans. on SP, vol. 52, no. 7, pp. 1830-1847, 2004.

However, because processing with a time-frequency mask is nonlinear, a signal extracted by this method suffers from nonlinear distortion. For acoustic signals in particular, this distortion is called musical noise and is perceived as audible, unpleasant noise.
The processing with a time-frequency mask is performed in the time-frequency domain, and the short-time Fourier transform (STFT) used to convert a signal into that domain often uses a window shift of half the window length T (a T/2 shift). The time-frequency mask is then estimated for, and applied to, the time-frequency representation of the observed signal obtained with this T/2 shift. In other words, the nonlinear masking is performed once every half window length, and the extracted signal rises or falls abruptly at that granularity. This is a major cause of the distortion. Here, T is an integer (preferably even) equal to the number of samples used for the short-time Fourier transform. The actual duration of the window function is T/f_s seconds, where f_s is the sampling frequency.

The present invention has been made in view of these points, and its object is to provide a technique that can separate and extract a target signal with little distortion.

To solve the above problem, the present invention first multiplies the mixed signal by a window function in each frame (window) set with a shift of 1/S (S > 2) times the window length T, and converts each result into a time-frequency domain signal. Feature quantities extracted from the time-frequency domain signals are then clustered to generate clusters, and the cluster information is used to generate a mask for each time-frequency point. Further, using these masks and the time-frequency domain signals, a time-frequency domain separated signal is extracted at each time-frequency point, the time-frequency domain separated signals are converted into time-domain separated signals, and the time-domain separated signals corresponding to the respective frames are added together (overlap-add).

In the present invention, each frame is set with a shift of 1/S (S > 2) times the window length T. That is, whereas frames have conventionally been set with a shift of T/2 when obtaining the time-frequency representation of the observed signals, the present invention sets them with a finer window shift (fine shift). Using such a fine window shift (short shift length) increases both the extent of overlap between frames on the time axis and the number of frames that overlap. In each frame, the result of multiplying the mixed signal by the window function is converted into a time-frequency domain signal, time-frequency domain separated signals are extracted from these, and the separated signals are converted back into the time domain and added together. Because the frames overlap densely on the time axis, the separated signals converted into the time domain also overlap densely. Adding them together averages the separated signals of the individual frames at each discrete time, which smooths the separated signal. As a result, abrupt rises and falls in the separated signal are reduced, and so is the nonlinear distortion.

In the present invention, the window function is preferably a function for which the sum of the window-function components contained in the time-domain separated signals of the respective frames is constant at every discrete time. This prevents the intensity of the combined signal from fluctuating and allows a high-quality separated signal to be reproduced.
Furthermore, T and S are preferably powers of 2. This makes it possible to use the fast Fourier transform (FFT) for the conversion into the time-frequency domain, which speeds up the processing.
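The constant-sum property of the window can be checked numerically. The following is a minimal sketch, not taken from the patent text: it assumes a periodic Hann window as the example window (this section does not name a specific window), and the function names `periodic_hann` and `overlap_sum` are hypothetical. It sums copies of the window shifted by T/S samples and confirms that the sum is flat away from the signal edges.

```python
import math

def periodic_hann(T):
    """Periodic Hann window of length T (an assumed example window)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * r / T) for r in range(T)]

def overlap_sum(window, S, num_frames):
    """Sum copies of `window`, each shifted by T/S samples from the previous one."""
    T = len(window)
    R = T // S  # window shift in samples
    total = [0.0] * (R * (num_frames - 1) + T)
    for m in range(num_frames):
        for r in range(T):
            total[m * R + r] += window[r]
    return total

T, S = 16, 4  # both powers of 2, as the text prefers
w = periodic_hann(T)
total = overlap_sum(w, S, num_frames=12)

# Skip the first and last T samples, where fewer than S frames overlap.
interior = total[T:-T]
print(min(interior), max(interior))
```

For this particular window and any integer S ≥ 2 that divides T, the interior sum equals S/2, so a single constant gain correction is enough after the frames are added together.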

As described above, the present invention makes it possible to separate and extract a target signal with little distortion even when N > M.

Embodiments of the present invention are described below with reference to the drawings.
<Principle>
First, the principle of this embodiment is described.
In blind signal separation, original signals emitted from a plurality of signal sources are mixed and observed by a plurality of sensors, and separated signals that estimate the original signals are extracted from the observed signals alone.
[Model of the mixed (observed) signals]
Let N be the number of signal sources and M the number of sensors. In this embodiment, N > M is assumed. Let s_i be the signal emitted from the i-th signal source i (i = 1, ..., N), and let h_ji be the impulse response from signal source i to the j-th sensor j (j = 1, ..., M). The signal x_j observed at sensor j is then modeled as the convolutive mixture of the original signals s_i and the impulse responses h_ji:

x_j(n) = \sum_{i=1}^{N} \sum_{p} h_{ji}(p)\, s_i(n-p) …(1)

Here, n in Equation (1) denotes the sampling time and p is the summation (sweep) variable. All signals are assumed to be sampled at a certain sampling frequency and represented discretely.
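As a concrete illustration of the convolutive mixing model, the sketch below mixes N = 3 toy sources onto M = 2 sensors. It is only an example under stated assumptions: the function name `convolutive_mix`, the impulse responses, and the source values are all made up for illustration, and signals are treated as zero before n = 0.

```python
def convolutive_mix(sources, impulse_responses):
    """Return x_j(n) = sum_i sum_p h_ji(p) * s_i(n - p) for every sensor j.

    sources[i] is the sample list of source i (all the same length).
    impulse_responses[j][i] is the tap list h_ji from source i to sensor j.
    """
    n_samples = len(sources[0])
    mixed = []
    for h_j in impulse_responses:          # one row of responses per sensor j
        x_j = [0.0] * n_samples
        for s_i, h_ji in zip(sources, h_j):
            for n in range(n_samples):
                for p, tap in enumerate(h_ji):
                    if n - p >= 0:         # samples before n = 0 are taken as zero
                        x_j[n] += tap * s_i[n - p]
        mixed.append(x_j)
    return mixed

# Three sources (N = 3), two sensors (M = 2): the N > M case of the text.
sources = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
impulse_responses = [
    [[1.0, 0.5], [1.0], [0.5]],        # h_11, h_12, h_13
    [[0.5], [1.0, 0.25], [1.0]],       # h_21, h_22, h_23
]
mixed = convolutive_mix(sources, impulse_responses)
print(mixed)
```

Each sensor signal contains delayed and scaled copies of all three sources, which is why two sensors alone cannot undo the mixture with a linear filter.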

[Time-frequency domain representation]
Blind signal separation extracts the original signals from observed signals modeled by such a convolutive mixture, but the convolutive mixing problem is cumbersome to handle directly. It is therefore effective to apply the short-time Fourier transform (STFT) to Equation (1) and treat the problem after converting the observed signals into the time-frequency domain.
In the time-frequency domain, Equation (1) can be approximated as a simple (instantaneous) mixture at each frequency:

X_j(f,m) \approx \sum_{i=1}^{N} H_{ji}(f)\, S_i(f,m) …(2)

Here, f is the frequency and takes the discrete values f = 0, f_s/T, ..., f_s(T−1)/T, where f_s is the sampling frequency and T is the number of samples used for the short-time Fourier transform (an integer, preferably even). m is the discrete time of the window function used in the STFT (an integer corresponding to each frame). H_ji(f) is the frequency response from signal source i to sensor j, and S_i(f,m) is the short-time discrete Fourier transform of the original signal s_i. Written with matrices and vectors, Equation (2) becomes
X(f,m) = H(f)・S(f,m) …(3)
Here, H(f) is the (M × N) matrix whose ji-th element is the frequency response H_ji(f) from signal source i to sensor j (hereinafter called the mixing matrix), and S(f,m) = [S_1(f,m), ..., S_N(f,m)]^T and X(f,m) = [X_1(f,m), ..., X_M(f,m)]^T are vectors whose elements are the STFT results of the original signals s_i and the observed signals x_j, respectively. [*]^T denotes the transpose of [*].
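To make the matrix form of the per-frequency mixture concrete, the following sketch evaluates it at a single time-frequency point with M = 2 sensors and N = 3 sources. The numeric values and the helper name `mix_bin` are illustrative assumptions only; the mixing matrix is stored with one row per sensor and one column per source so that the product yields one observation per sensor.

```python
def mix_bin(H, S):
    """Compute X = H * S at one time-frequency point.

    H is a list of M rows, each holding N complex frequency responses H_ji(f).
    S is a list of N complex source coefficients S_i(f, m).
    """
    return [sum(h_ji * s_i for h_ji, s_i in zip(row, S)) for row in H]

H = [
    [1 + 0j, 0.5j, -1 + 0j],   # responses from sources 1..3 to sensor 1
    [0.5 + 0j, 1 + 0j, 1j],    # responses from sources 1..3 to sensor 2
]
S = [1 + 1j, 2 + 0j, 0j]       # source 3 is silent at this time-frequency point

X = mix_bin(H, S)
print(X)
```

When only one source is active at a given time-frequency point, the observation vector is just that source's column of the mixing matrix scaled by the source coefficient, which is the sparseness property that time-frequency masking relies on.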

[Key points of this embodiment]
As described above, the nonlinear distortion arises from abrupt rises and falls in the separated signal. To reduce them, this embodiment (a) shortens the window shift used when converting the observed signals into time-frequency domain signals, (b) uses the overlap-add method when converting time-frequency domain signals back into time-domain signals, and (c) uses a window function suited to this overlap-add.
Shortening the window shift increases both the extent of overlap between frames on the time axis and the number of overlapping frames. Using the overlap-add method when returning to the time domain then averages the separated signals of the individual frames at each discrete time, which smooths the separated signal. As a result, abrupt rises and falls in the separated signal are reduced, and so is the nonlinear distortion. Using a window function suited to this overlap-add yields a high-quality separated signal.

<Specific example of this embodiment>
Next, a specific example of this embodiment is described.
[Hardware configuration]
FIG. 1 is a block diagram illustrating the hardware configuration of the signal separation device 1 of this embodiment.
As illustrated in FIG. 1, the signal separation device 1 of this example includes a CPU (Central Processing Unit) 10, an input unit 20, an output unit 30, an auxiliary storage device 40, a RAM (Random Access Memory) 50, a ROM (Read Only Memory) 60, and a bus 70.

The CPU 10 of this example includes a control unit 11, an arithmetic unit 12, and a register 13, and executes various arithmetic operations according to the programs read into the register 13. The input unit 20 of this example is an input port for receiving data, a keyboard, a mouse, or the like, and the output unit 30 is an output port for outputting data, a display, or the like. The auxiliary storage device 40 is, for example, a hard disk, an MO (Magneto-Optical disc), or a semiconductor memory, and has a signal separation program area 41, which stores the signal separation program for executing the signal separation processing of this embodiment, and a data area 42, which stores various data such as the time-domain mixed signals observed by the sensors. The RAM 50 is, for example, an SRAM (Static Random Access Memory) or a DRAM (Dynamic Random Access Memory), and has a signal separation program area 51, into which the signal separation program is written, and a data area 52, into which various data are written. The bus 70 of this example connects the CPU 10, the input unit 20, the output unit 30, the auxiliary storage device 40, the RAM 50, and the ROM 60 so that they can communicate with one another.

<Cooperation between hardware and software>
Following the loaded OS (Operating System) program, the CPU 10 of this example writes the signal separation program stored in the signal separation program area 41 of the auxiliary storage device 40 into the signal separation program area 51 of the RAM 50. Similarly, the CPU 10 writes various data, such as the time-domain mixed signals stored in the data area 42 of the auxiliary storage device 40, into the data area 52 of the RAM 50. The CPU 10 also stores in the register 13 the addresses on the RAM 50 at which the signal separation program and the data were written. The control unit 11 of the CPU 10 then reads these addresses from the register 13 in sequence, reads the program and data from the areas of the RAM 50 that the addresses indicate, has the arithmetic unit 12 execute the operations specified by the program in sequence, and stores the results in the register 13.

FIG. 2 is an example block diagram of the signal separation device 1 that is configured when the signal separation program is loaded into the CPU 10 in this way. FIG. 3(a) is a block diagram illustrating the details of the time-frequency domain conversion unit 130 in FIG. 2, and FIG. 3(b) is a block diagram showing the configuration of the time domain conversion unit 160 and the overlap-add unit 170 in FIG. 2. In these figures, solid arrows indicate the actual flow of data and dashed arrows indicate the theoretical flow of information. To simplify the description, the flow of data into and out of the control unit 180 is omitted.

As illustrated in FIG. 2, the signal separation device 1 of this embodiment includes: a memory 110; a time-frequency domain conversion unit 130 that multiplies the mixed signal by the window function in each frame set with a shift of 1/S (S > 2) times the window length T and converts each result into a time-frequency domain signal; a time-frequency mask estimation unit 140 that generates a mask for each time-frequency point; a time-frequency domain signal extraction unit 150 that uses the masks and the time-frequency domain signals to extract a time-frequency domain separated signal at each time-frequency point; a time domain conversion unit 160 that converts the time-frequency domain separated signals into time-domain separated signals; an overlap-add unit 170 that adds together the time-domain separated signals corresponding to the respective frames; and a control unit 180 that controls the signal separation device 1 as a whole.

The time-frequency mask estimation unit 140 includes a clustering unit 141, which clusters feature quantities extracted from the time-frequency domain signals to generate clusters, and a mask generation unit 142, which uses the cluster information to generate a mask for each time-frequency point. The memory 110 has storage areas 111 to 121, and the control unit 180 has a temporary memory 181. The memory 110 and the temporary memory 181 of this example are any one of, or a combination of, the data area 42 of the auxiliary storage device 40, the data area 52 of the RAM 50, and the register 13.

As illustrated in FIG. 3(a), the time-frequency domain conversion unit 130 includes a window function generation unit 131, a shift adjustment unit 132 that adjusts the shift amount S, counters 133 to 136, a T/S shift unit 137 that extracts the mixed signal at discrete time r + m·T/S (r is an integer), a multiplication unit 138 that multiplies the mixed signal at discrete time r + m·T/S by the value of the window function at discrete time r, and a discrete Fourier transform unit 139 that applies the discrete Fourier transform to the result of the multiplication unit 138.
As illustrated in FIG. 3(b), the time domain conversion unit 160 converts time-frequency domain signals into time-domain signals by the inverse discrete Fourier transform, and the overlap-add unit 170 adds together the plurality of time-domain signals input to it and outputs the result.

<Processing>
Next, the processing of the signal separation device 1 of this embodiment is described. Each of the following processes is performed under the control of the control unit 180, and unless otherwise stated, intermediate data are read from and written to the temporary memory 181 as they are used.
[Overall processing]
FIG. 4 is a flowchart for explaining the overall processing of the signal separation device 1 of this embodiment. The overall processing is described below along this flowchart.
The input to the signal separation device 1 is the observed signal vector x(n) = [x_1(n), ..., x_M(n)]^T, whose elements are the time-domain mixed signals x_j(n) (j = 1, ..., M) observed by the M sensors j. In this example, these time-domain mixed signals x_j(n) are stored in storage area 111 of the memory 110 in association with the corresponding sensor j and sampling time n. The length T of the window function to be used is stored in storage area 112 of the memory 110.

When signal separation starts, the observed signal vector x(n) = [x_1(n), ..., x_M(n)]^T is first input to the time-frequency domain conversion unit 130, which converts it by the short-time Fourier transform (STFT) into the time-frequency domain observed signal vector X(f,m) = [X_1(f,m), ..., X_M(f,m)]^T. In this example, the time-frequency domain conversion unit 130 first reads the mixed signals x_j from storage area 111 of the memory 110. It then multiplies the mixed signal x_j(n) by the window function w(n) in each frame set with a shift of 1/S (S > 2) times the length T of w(n), converts each result into a time-frequency domain signal X_j(f,m), and stores these in storage area 115 of the memory 110 (step S1). Here, m is an integer corresponding to each frame. The processing of the time-frequency domain conversion unit 130 is described in detail later.

Next, the observed signal vector X(f,m) = [X_1(f,m), ..., X_M(f,m)]^T is input to the time-frequency mask estimation unit 140, which uses it to estimate, for each time-frequency point, the masks M(f,m) = [M_1(f,m), ..., M_K(f,m)]^T for separating and extracting the signals. Here, M_k(f,m) (k = 1, ..., K) is the mask for extracting the k-th separated signal, and K is the number of signals to be separated and extracted (K ≤ N). In this example, the clustering unit 141 first reads the time-frequency domain signals X_j(f,m) (j ∈ {1, ..., M}) from storage area 115 of the memory 110, extracts feature quantities θ(f,m) from them, and stores the features in storage area 116 of the memory 110 (step S2). The clustering unit 141 then reads the feature quantities θ(f,m) from storage area 116 of the memory 110, clusters them to generate clusters, and stores the information \tilde{θ}_k (k = 1, ..., K) that identifies each cluster in storage area 117 of the memory 110 (step S3).
Next, the mask generation unit 142 generates, for each time-frequency point, the mask M_k(f,m) for extracting the k-th separated signal, and stores these masks in storage area 118 of the memory 110 (step S4). The processing of the time-frequency mask estimation unit 140 is described in detail later.
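The step from cluster information to masks can be pictured with a small sketch. This section does not specify the feature quantity or the mask rule, so everything below is an assumption made for illustration: the feature θ(f,m) is treated as a single real number, each cluster is summarized by one representative value, and a binary mask is set to 1 for the cluster whose representative is nearest. The name `binary_masks` is hypothetical.

```python
def binary_masks(features, centroids):
    """Build one binary mask per cluster from scalar features.

    features[f][m] is an assumed scalar feature at frequency index f, frame m.
    centroids[k] is an assumed representative feature value of cluster k.
    Returns masks[k][f][m], equal to 1 where cluster k is the nearest, else 0.
    """
    K = len(centroids)
    masks = [[[0] * len(row) for row in features] for _ in range(K)]
    for f, row in enumerate(features):
        for m, theta in enumerate(row):
            nearest = min(range(K), key=lambda k: abs(theta - centroids[k]))
            masks[nearest][f][m] = 1
    return masks

# Two frequency indices, three frames, K = 3 clusters (toy values).
features = [
    [-0.9, 0.1, 1.1],
    [0.0, -1.0, 0.95],
]
centroids = [-1.0, 0.0, 1.0]
masks = binary_masks(features, centroids)
for k, mask in enumerate(masks):
    print(k, mask)
```

With this nearest-cluster rule every time-frequency point belongs to exactly one mask. A practical system might instead leave ambiguous points unassigned, which this sketch does not attempt.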

Next, the time-frequency domain signal extraction unit 150 calculates the time-frequency domain separated signals
Y(f,m) = M(f,m) X_J(f,m)
Here, Y(f,m) denotes the separated signal vector Y(f,m) = [Y_1(f,m), ..., Y_K(f,m)]^T, whose elements are the time-frequency domain separated signals Y_k(f,m). X_J(f,m) is one of the time-frequency domain signals X_j(f,m), with J ∈ {1, ..., M}, and M(f,m) = [M_1(f,m), ..., M_K(f,m)]^T. In this example, the time-frequency domain signal extraction unit 150 first reads the masks M_k(f,m) from storage area 118 of the memory 110 and one of the time-frequency domain signals, X_J(f,m), from storage area 115. Using these, it computes, for each time-frequency point,
Y_k(f,m) = M_k(f,m) X_J(f,m)
to extract the time-frequency domain separated signals Y_k(f,m), and stores them in storage area 119 of the memory 110 (step S5).
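The mask multiplication of step S5 amounts to an element-wise product over the time-frequency grid. The sketch below shows this for K = 2 toy binary masks applied to the spectrogram of one chosen sensor J. The function name `extract_separated` and all numeric values are illustrative assumptions.

```python
def extract_separated(masks, X_J):
    """Return Y[k][f][m] = masks[k][f][m] * X_J[f][m] for every mask k."""
    return [
        [
            [mask_k[f][m] * X_J[f][m] for m in range(len(X_J[f]))]
            for f in range(len(X_J))
        ]
        for mask_k in masks
    ]

# Spectrogram of the chosen sensor J: two frequency indices, two frames.
X_J = [
    [1 + 1j, 2 + 0j],
    [0 - 1j, 3 + 3j],
]
# Two complementary binary masks (toy values).
masks = [
    [[1, 0], [0, 1]],
    [[0, 1], [1, 0]],
]
Y = extract_separated(masks, X_J)
print(Y[0])
print(Y[1])
```

Because the masks switch each point fully on or off, neighbouring frames of a separated signal can differ abruptly, which is the source of the nonlinear distortion that the fine window shift is meant to smooth out.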

An example has been described here in which the time-frequency domain separated signals Y_k(f,m) are extracted by computing Y_k(f,m) = M_k(f,m) X_J(f,m), but they may instead be extracted by a method that combines the masks M_k(f,m) with another signal separation and extraction technique. Examples include a method that combines time-frequency masks with independent component analysis (ICA) (see, for example, S. Araki, S. Makino, A. Blin, R. Mukai and H. Sawada, "Blind separation of more speech than sensors with less distortion by combining sparseness and ICA," Proc. IWAENC2003, pp. 271-274, 2003) and a method that combines time-frequency masks with a beamformer (see, for example, N. Roman and D. Wang, "Binaural sound segregation for multisource reverberant environments," Proc. ICASSP2004, pp. 373-386, 2004).

次に、時間領域変換部１６０において、メモリ１１０の記憶領域１１９から時間周波数領域の分離信号Y_k(f,m),Y_k(f,m+1),Y_k(f,m+2),...,Y_k(f,m+S−1)を読み出し、短時間フーリエ逆変換（STIFT）によってこれらを時間領域の分離信号

に変換し、各時間領域の分離信号y_k ^m(n)をメモリ１１０の記憶領域１２０に格納する（図３（ｂ），ステップＳ６）。尚、vは虚数単位であり、g=n−m・R、R=T/Sである。また、時間領域変換部１６０の処理の詳細については後述する。
最後に、重畳加算部１７０において、メモリ１１０の記憶領域１２０から各サンプリング時刻ｎに対応する時間領域の分離信号y_k ^m(n),y_k ^m+1(n),y_k ^m+2(n),...,y_k ^m+S-1(n)を読み出して加算合成し（図３（ｂ））、時間領域の分離信号y_k(n)を算出し、記憶領域１２１に格納する（ステップＳ７）。尚、重畳加算部１７０の処理の詳細については後述する。 Next, in the time domain transforming unit 160, the separated signals Y _k (f, m), Y _k (f, m + 1), Y _k (f, m + 2) in the time frequency domain from the storage area 119 of the memory 110. , ..., Y _k (f, m + S−1) are read out, and these are separated in the time domain by short-time inverse Fourier transform (STIFT).

Each time-domain separated signal y_k^m(n) is then stored in the storage area 120 of the memory 110 (FIG. 3(b), step S6). Here, v is the imaginary unit, g = n − m·R, and R = T/S. Details of the processing of the time domain transforming unit 160 are described later.
Finally, the superposition adding unit 170 reads, from the storage area 120 of the memory 110, the time-domain separated signals y_k^m(n), y_k^{m+1}(n), y_k^{m+2}(n), ..., y_k^{m+S−1}(n) corresponding to each sampling time n, adds and synthesizes them (FIG. 3(b)) to calculate the time-domain separated signal y_k(n), and stores it in the storage area 121 (step S7). Details of the processing of the superposition adding unit 170 are described later.

［時間周波数領域変換部１３０の処理の詳細］
時間周波数領域変換部１３０は、観測された時間領域の混合信号x_j(n)から長さＴの窓関数w(n)にて信号を切り出し、それに対して離散フーリエ変換を行うことで時間周波数領域の信号Ｘ_j(f,m)を算出する。そして、この窓関数を従来（T/2シフト）よりも細かいT/S（S>2）シフトでずらしながら、順次、時間周波数領域の信号Ｘ_j(f,m)を算出していく。以下、この時間周波数領域変換部１３０による処理の詳細を説明する。 [Details of Processing of Time Frequency Domain Transformer 130]
The time-frequency domain transforming unit 130 cuts out a segment of the observed time-domain mixed signal x_j(n) with a window function w(n) of length T and applies a discrete Fourier transform to it to calculate the time-frequency domain signal X_j(f, m). It then calculates the signals X_j(f, m) successively while shifting the window function by T/S (S > 2), a finer shift than the conventional T/2 shift. The processing by the time-frequency domain transforming unit 130 is described in detail below.

最初に時間周波数領域変換部１３０が行う前処理について説明する。
まず、時間周波数領域変換部１３０は、窓関数生成部１３１においてメモリ１１０の記憶領域１１２から長さＴを読み出し、長さＴの窓関数w(n)を生成してメモリ１１０の記憶領域１１３に格納する。尚、本形態に適した窓関数w(n)については後述する。また、窓関数生成部１３１を設けず、使用する窓関数w(n)を予めメモリ１１０の記憶領域１１３に格納しておいてもよい。 First, preprocessing performed by the time-frequency domain transform unit 130 will be described.
First, in the time-frequency domain transforming unit 130, the window function generation unit 131 reads the length T from the storage area 112 of the memory 110, generates a window function w(n) of length T, and stores it in the storage area 113 of the memory 110. Window functions w(n) suitable for this embodiment are described later. Alternatively, the window function generation unit 131 may be omitted and the window function w(n) to be used stored in the storage area 113 of the memory 110 in advance.

次に、シフト調整部１３２において、シフト率1/S（S>2）を決定し、このパラメータＳをメモリ１１０の記憶領域１１２に格納する。ここで、シフト調整部１３２は、分離信号に求められる品質（非線型歪みをどこまで許すか）と、システムに許される計算時間（Ｓが大きいほど計算時間大）とを比較考慮し、適切なパラメータＳを決めて出力する。具体的には、例えば、分離信号に求められる品質とシステムに許される計算時間とをキーとしてパラメータＳが特定される表（lookup table）をメモリ１１０に格納しておき、シフト調整部１３２がこの表を参照しつつ、別途与えられた分離信号に求められる品質とシステムに許される計算時間とに適したパラメータＳを決定する。また、シフト調整部１３２を設けず、人間が適切なパラメータＳを選択し、これをメモリ１１０の記憶領域１１２に格納しておくこととしてもよい。尚、上述した長さＴ及びパラメータＳは、２のべき乗であることが望ましい。これにより、後述する離散フーリエ変換部１３９における処理に高速フーリエ変換を利用することが可能となり、処理を高速化できるからである。 Next, the shift adjustment unit 132 determines the shift rate 1 / S (S> 2) and stores the parameter S in the storage area 112 of the memory 110. Here, the shift adjustment unit 132 compares and considers the quality required for the separated signal (how much nonlinear distortion is allowed) and the calculation time allowed for the system (the larger the S, the longer the calculation time), and the appropriate parameter. Determine S and output. Specifically, for example, a table (lookup table) in which the parameter S is specified is stored in the memory 110 using the quality required for the separated signal and the calculation time allowed for the system as keys, and the shift adjustment unit 132 sets the table. With reference to the table, a parameter S suitable for the quality required for the separately given separated signal and the calculation time allowed for the system is determined. Alternatively, the shift adjustment unit 132 may not be provided, and a human may select an appropriate parameter S and store it in the storage area 112 of the memory 110. Note that the above-described length T and parameter S are preferably powers of 2. This is because fast Fourier transform can be used for processing in the discrete Fourier transform unit 139 described later, and the processing speed can be increased.

次に、本形態の時間周波数領域変換部１３０による時間領域から時間周波数領域への変換処理について説明する。
時間周波数領域変換部１３０は、

の演算によって、時間領域の混合信号x_jを時間周波数領域の信号Ｘ_j(f,m)に変換する。尚、Ｒは窓シフト長を示しR=T/Sを満たす。またｖは虚数単位（v²=−１）を示す。以下、この処理をフローチャートを用いて説明する。 Next, the conversion process from the time domain to the time frequency domain by the time frequency domain conversion unit 130 of this embodiment will be described.
The time-frequency domain transforming unit 130 performs the operation

to convert the time-domain mixed signal x_j into the time-frequency domain signal X_j(f, m). R is the window shift length and satisfies R = T/S, and v is the imaginary unit (v² = −1). This process is described below with reference to a flowchart.

図５は、この時間周波数領域変換部１３０による時間領域から時間周波数領域への変換処理を説明するためのフローチャートである。
まず、時間周波数領域変換部１３０のカウンタ１３３においてｊを１に初期化し（ステップＳ１１）、カウンタ１３４においてｍを０に初期化し（ステップＳ１２）、カウンタ１３６においてｆを０に初期化し（ステップＳ１３）、カウンタ１３５においてｒを０に初期化し、制御部１８０において0を代入したＸを一時メモリ１８１に格納する（ステップＳ１４）。 FIG. 5 is a flowchart for explaining the conversion process from the time domain to the time frequency domain by the time frequency domain conversion unit 130.
First, in the time-frequency domain transforming unit 130, the counter 133 initializes j to 1 (step S11), the counter 134 initializes m to 0 (step S12), the counter 136 initializes f to 0 (step S13), the counter 135 initializes r to 0, and the control unit 180 stores X, set to 0, in the temporary memory 181 (step S14).

次に、Ｔ／Ｓシフト部１３７において、メモリ１１０の記憶領域１１２からＳ及びＴを読み出し、カウンタ１３４からｍを受け取り、カウンタ１３５からｒを受け取り、
r+m・R（R=T/S）
の演算を行い、その演算結果を一時メモリ１８１に格納する（ステップＳ１５）。次に、Ｔ／Ｓシフト部１３７において、一時メモリ１８１からr+m・Rを読み出し、メモリ１１０の記憶領域１１１から時間領域の混合信号x_j(r+m・R)を抽出し、乗算部１３８に送る（ステップＳ１６）。乗算部１３８は、さらにメモリ１１０の記憶領域１１３から窓関数w(n)を読み出し、カウンタ１３５からｒを受け取り、
w(r)・x_j(r+m・R)
を演算し、これをメモリ１１０の記憶領域１１４に格納する（ステップＳ１７）。 Next, the T / S shift unit 137 reads S and T from the storage area 112 of the memory 110, receives m from the counter 134, receives r from the counter 135,
and computes
r + m·R (R = T/S),
storing the result in the temporary memory 181 (step S15). Next, the T/S shift unit 137 reads r + m·R from the temporary memory 181, extracts the time-domain mixed signal x_j(r + m·R) from the storage area 111 of the memory 110, and sends it to the multiplication unit 138 (step S16). The multiplication unit 138 further reads the window function w(n) from the storage area 113 of the memory 110, receives r from the counter 135, computes
w(r)·x_j(r + m·R),
and stores the result in the storage area 114 of the memory 110 (step S17).

次に、離散フーリエ変換部１３９において、メモリ１１０の記憶領域１１４からw(r)・x_j(r+m・R)を読み出し、カウンタ１３５からｒを受け取り、カウンタ１３６からｆを受け取り、一時メモリ１８１からＸを読み出し、
X+ w(r)・x_j(r+m・R)・e^-v2πft
の演算を行い、その演算結果を新たなＸとして一時メモリ１８１に格納する（ステップＳ１８）。 Next, the discrete Fourier transform unit 139 reads w(r)·x_j(r + m·R) from the storage area 114 of the memory 110, receives r from the counter 135, receives f from the counter 136, reads X from the temporary memory 181, computes
X + w(r)·x_j(r + m·R)·e^-v2πft
and stores the result in the temporary memory 181 as the new X (step S18).

次に、制御部１８０において、カウンタ１３５のｒがT-1であるか否かを判断する（ステップＳ１９）。ここで、r=T-1でなければ、カウンタ１３５においてr+1を新たなｒとし（rをカウントアップし）、ステップＳ１５の処理に戻る（ステップＳ２０）。一方、r=T-1であれば、離散フーリエ変換部１３９において、一時メモリ１８１に格納されている最新のXを時間周波数領域の信号X_j(f,m)として、メモリ１１０の記憶領域１１５に格納する（ステップＳ２１）。 Next, the control unit 180 determines whether r of the counter 135 is T-1 (step S19). Here, if r = T−1 is not satisfied, r + 1 is set as a new r in the counter 135 (r is counted up), and the process returns to step S15 (step S20). On the other hand, if r = T−1, the discrete Fourier transform unit 139 uses the latest X stored in the temporary memory 181 as the signal X _j (f, m) in the time frequency domain, and the storage area 115 of the memory 110. (Step S21).

次に、制御部１８０において、メモリ１１０の記憶領域１１２からＴを読み出し、カウンタ１３６のｆが(T−1)f_s/Tであるか否かを判断する（ステップＳ２２）。ここで、f=(T−1)f_s/Tでなければ、カウンタ１３６において、f+f_s/Tを新たなfとし（カウントアップし）、ステップＳ１４に戻る（ステップＳ２３）。一方、f=(T−1)f_s/Tならば、制御部１８０において、カウンタ１３４のｍがｍ_max（ｍの最大値）であるか否かを判断する（ステップＳ２４）。尚、この例の場合ｍ_maxの値は予め定めておくものとする。ここで、ｍ=ｍ_maxでなければ、カウンタ１３４においてｍ+1を新たなｍとし（ｍをカウントアップし）、ステップＳ１３の処理に戻る（ステップＳ２５）。一方、ｍ=ｍ_maxであれば、制御部１８０において、カウンタ１３３のｊがMであるか否かを判断する（ステップＳ２６）。ここで、ｊ=Mでなければ、カウンタ１３３においてj+1を新たなjとし（カウントアップし）、ステップＳ１２に戻る。一方、ｊ=Mであれば、ステップＳ１の処理を終了する。 Next, the control unit 180 reads T from the storage area 112 of the memory 110 and determines whether f of the counter 136 is (T−1)f_s/T (step S22). If f = (T−1)f_s/T does not hold, the counter 136 sets f + f_s/T as the new f (counts up), and the process returns to step S14 (step S23). If f = (T−1)f_s/T, the control unit 180 determines whether m of the counter 134 is m_max (the maximum value of m) (step S24). In this example, the value of m_max is determined in advance. If m = m_max does not hold, the counter 134 sets m + 1 as the new m (counts up m), and the process returns to step S13 (step S25). If m = m_max, the control unit 180 determines whether j of the counter 133 is M (step S26). If j = M does not hold, the counter 133 sets j + 1 as the new j (counts up), and the process returns to step S12. If j = M, the processing of step S1 ends.
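The loop structure of steps S11 to S26 amounts to a short-time Fourier transform with a window shift of R = T/S. The following Python sketch illustrates it for one sensor signal under several assumptions that are not taken from the source: frequencies are indexed by an integer bin b (standing in for f = b·f_s/T), the kernel is written as exp(−v2πbr/T) with the inner loop variable r, and the number of frames is derived from the signal length instead of a predetermined m_max. The function names are hypothetical.

```python
import cmath
import math


def hanning(T):
    """Hanning window w(n) = 0.5 - 0.5*cos(2*pi*n/T), n = 0..T-1."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / T) for n in range(T)]


def stft_fine_shift(x, T, S, window):
    """Return X[m][b] = sum_r w(r) * x(r + m*R) * exp(-v*2*pi*b*r/T), R = T/S."""
    R = T // S                              # window shift length
    n_frames = (len(x) - T) // R + 1        # assumption: only complete frames
    X = []
    for m in range(n_frames):               # frame loop (steps S24, S25)
        # Windowed segment w(r) * x(r + m*R) (steps S15 to S17).
        frame = [window[r] * x[r + m * R] for r in range(T)]
        bins = []
        for b in range(T):                  # frequency loop (steps S22, S23)
            acc = 0j                        # X initialised to 0 (step S14)
            for r in range(T):              # accumulation over r (steps S18 to S20)
                acc += frame[r] * cmath.exp(-2j * math.pi * b * r / T)
            bins.append(acc)
        X.append(bins)
    return X


# Hypothetical example: constant input, T = 8, S = 4 (shift R = 2).
X = stft_fine_shift([1.0] * 16, T=8, S=4, window=hanning(8))
```

With T a power of two, the inner two loops could be replaced by a fast Fourier transform, as the text notes. The direct sum is kept here only to mirror the flowchart.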

［時間周波数マスク推定部１４０の処理の詳細］
次に、時間周波数マスク推定部１４０の処理の詳細について説明する。
本形態の手法では、信号のスパース性を仮定する。ここでスパースとは、信号が殆どのサンプリング時刻ｎにおいて０であることを指す。信号のスパース性は、例えば音声信号で確認される。信号のスパース性を仮定することで、複数の信号が存在していても、各時間周波数ポイント(f,m)では互いに重なって観測される確率が低いことを仮定できる。よって、各時間周波数ポイント(f,m)の各センサにおける観測信号は、その時間周波数ポイント(f,m)でアクティブな信号S_i(f,m)から各センサまでの周波数応答H_i(f)=[H_1i(f),...,H_Mi(f)]^Tを反映して観測される。従って、観測信号ベクトルは、その反映する周波数応答H_i(f)によってクラスタリングすることができる。そして、それぞれのクラスに属するメンバの時間周波数に対応する観測信号X(f,m)のみを抽出する時間周波数毎のマスクM(f,m)を用いることで各信号を分離抽出できる。以下に、この時間周波数毎のマスクを生成する手法を例示する。 [Details of processing of time-frequency mask estimation unit 140]
Next, details of the processing of the time-frequency mask estimation unit 140 will be described.
In the method of this embodiment, signal sparsity is assumed. Here, sparse means that the signal is 0 at most sampling times n. Sparsity is observed in, for example, speech signals. Assuming sparsity, even when multiple signals are present, the probability that they overlap at any given time-frequency point (f, m) can be assumed to be low. The observed signal at each sensor at a time-frequency point (f, m) therefore reflects the frequency response H_i(f) = [H_1i(f), ..., H_Mi(f)]^T from the signal S_i(f, m) active at that point to each sensor. Consequently, the observed signal vectors can be clustered according to the frequency response H_i(f) they reflect. Each signal can then be separated and extracted by using a mask M(f, m) for each time-frequency point that extracts only the observed signals X(f, m) at the time-frequency points of the members belonging to each class. A method for generating such a mask is illustrated below.

まず、時間周波数マスク推定部１４０のクラスタリング部１４１において、メモリ１１０の記憶領域１１５から時間周波数領域の信号X_j(f,m)を読み出し、これらを用いてクラスタリングに用いる特徴量を抽出する。この例の場合、クラスタリング部１４１において、メモリ１１０の記憶領域１１５から時間周波数領域の信号X_j(f,m)を順次読み出し、２つのセンサ（各センサｊと基準となるセンサＪ'）における観測信号X_j(f,m)，X_J'(f,m)間の位相差

を算出し、それから信号の推定到来方向（Direction of Arrival : DOA）

を算出し、これを特徴量としてメモリ１１０の記憶領域１１６に格納する（ステップＳ２）。尚、Ｊ'はセンサｊの何れか１つ（Ｊ'∈{1,...,M}）である。また、dはセンサｊとセンサＪ'との距離、ｃは信号速度を示す。 First, the clustering unit 141 of the time-frequency mask estimation unit 140 reads the time-frequency domain signal X _j (f, m) from the storage area 115 of the memory 110, and uses these to extract the feature quantity used for clustering. In the case of this example, the clustering unit 141 sequentially reads out the signal X _j (f, m) in the time frequency domain from the storage area 115 of the memory 110 and observes the two sensors (each sensor j and the reference sensor J ′). Phase difference between signals X _j (f, m) and X _{J '} (f, m)

is calculated, and from it the estimated direction of arrival (DOA) of the signal

is calculated and stored in the storage area 116 of the memory 110 as the feature quantity (step S2). J′ is any one of the sensors (J′ ∈ {1, ..., M}), d is the distance between sensor j and sensor J′, and c is the signal propagation speed.
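The feature extraction of step S2 can be illustrated as follows. The equations for the phase difference and the DOA did not survive in this text, so the sketch assumes forms commonly used for two-sensor DOA estimation: φ(f, m) = ∠(X_j(f, m)/X_J′(f, m)) and θ(f, m) = arccos(φ(f, m)·c / (2πf·d)). These forms, the sign convention, the default c = 340 m/s, and the function names are assumptions, not quotations from the source.

```python
import cmath
import math


def phase_difference(x_j, x_ref):
    """Phase of X_j relative to X_J' (assumed form of the phase difference)."""
    # Multiplying by the conjugate gives the same angle as dividing,
    # but avoids a division by zero when x_ref is 0.
    return cmath.phase(x_j * x_ref.conjugate())


def estimate_doa(x_j, x_ref, f_hz, d, c=340.0):
    """Estimated direction of arrival in radians (assumed formula).

    f_hz must be non-zero; d is the sensor spacing in metres and
    c the propagation speed in metres per second.
    """
    phi = phase_difference(x_j, x_ref)
    cos_theta = phi * c / (2 * math.pi * f_hz * d)
    cos_theta = max(-1.0, min(1.0, cos_theta))   # guard against noise
    return math.acos(cos_theta)


# Hypothetical check: a source at 60 degrees, f = 1000 Hz, d = 4 cm.
true_theta = math.pi / 3
f_hz, d, c = 1000.0, 0.04, 340.0
x_ref = 1 + 0j
x_j = x_ref * cmath.exp(1j * 2 * math.pi * f_hz * d * math.cos(true_theta) / c)
theta = estimate_doa(x_j, x_ref, f_hz, d, c)
```

The phase difference is only unambiguous while 2πf·d/c stays below π, which is why a small sensor spacing is used in this example.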

次にクラスタリング部１４１において、メモリ１１０の記憶領域１１６から信号の推定到来方向θ(f,m)（「特徴量」に相当）を読み出し、これらを例えばk-means法（例えば、”R. O. Duda, P. E. Hart, and D. G. Stork,(尾上守夫監訳）, パターン認識, John Wiley & Sons, 新技術コミュニケーションズ，ISBN 0-471-05669-3, 2003.”参照）等を用いてクラスタリングし、クラスタを生成する。そして、クラスタリング部１４１は、生成した各クラスタの平均値θ_k ^〜（k={1,...,K}）（「各クラスタを特定するための情報」に相当）を求め、これらをメモリ１１０の記憶領域１１７に格納する（ステップＳ３）。次に、マスク生成部１４２において、メモリ１１０の記憶領域１１７から各クラスタの平均値θ_k ^〜を読み出し、

という時間周波数毎のマスクを生成し、これらをメモリ１１０の記憶領域１１８に格納する（ステップＳ４）。このマスクM_k(f,m)は、θ_k ^〜−Δ≦θ(f,m)≦θ_k ^〜＋Δの範囲に属する信号の推定到来方向θ(f,m)に対して１の値をとり、それ以外の範囲に属する信号の推定到来方向θ(f,m)に対して０の値をとるバイナリマスクである。ここで、ΔはマスクM_k(f,m)を用いて抽出できる分離信号の範囲を与える。すなわち、Δを十分小さくすると良い分離抽出性能が得られるが非線型歪みが大きくなる。また、Δを大きくすると、非線型歪みは減少するが分離性能が劣化する。 Next, the clustering unit 141 reads the estimated directions of arrival θ(f, m) (corresponding to the "feature quantity") from the storage area 116 of the memory 110 and clusters them using, for example, the k-means method (see, for example, R. O. Duda, P. E. Hart, and D. G. Stork (translation supervised by Morio Onoe), Pattern Recognition, John Wiley & Sons, New Technology Communications, ISBN 0-471-05669-3, 2003) to generate clusters. The clustering unit 141 then obtains the mean value θ_k^~ (k = 1, ..., K) of each generated cluster (corresponding to the "information for identifying each cluster") and stores these values in the storage area 117 of the memory 110 (step S3). Next, the mask generation unit 142 reads the mean value θ_k^~ of each cluster from the storage area 117 of the memory 110, generates the mask for each time-frequency point

and stores these masks in the storage area 118 of the memory 110 (step S4). This mask M_k(f, m) is a binary mask that takes the value 1 for estimated directions of arrival θ(f, m) within the range θ_k^~ − Δ ≤ θ(f, m) ≤ θ_k^~ + Δ and the value 0 otherwise. Here, Δ sets the range of the separated signal that can be extracted with the mask M_k(f, m). If Δ is made sufficiently small, good separation and extraction performance is obtained but nonlinear distortion increases; if Δ is made larger, nonlinear distortion decreases but separation performance degrades.
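Steps S3 and S4 can be sketched as a one-dimensional k-means over the DOA values followed by the binary mask rule described above. This is a minimal illustration: the source names k-means only as one possible clustering method, and the function names, the fixed iteration count, the initial centroids, and the sample DOA values (in degrees) are all assumptions.

```python
def kmeans_1d(values, initial_centroids, n_iter=20):
    """Plain 1-D k-means; returns the cluster means (the theta_k^~ values)."""
    centroids = list(initial_centroids)
    for _ in range(n_iter):
        groups = [[] for _ in centroids]
        for v in values:
            # Assign each value to its nearest centroid.
            nearest = min(range(len(centroids)),
                          key=lambda k: abs(v - centroids[k]))
            groups[nearest].append(v)
        # Recompute each mean; keep the old centroid if a cluster is empty.
        centroids = [sum(g) / len(g) if g else c
                     for g, c in zip(groups, centroids)]
    return centroids


def binary_mask(theta, centroid, delta):
    """M_k(f, m) = 1 if centroid - delta <= theta(f, m) <= centroid + delta, else 0."""
    return [[1 if centroid - delta <= t <= centroid + delta else 0 for t in row]
            for row in theta]


# Hypothetical DOA estimates in degrees; rows are frequency bins, columns frames.
theta = [[30.0, 31.0, 90.0],
         [89.0, 150.0, 29.0]]
flat = [t for row in theta for t in row]
centroids = kmeans_1d(flat, [20.0, 100.0, 160.0])
masks = [binary_mask(theta, c, delta=5.0) for c in centroids]
```

Choosing a smaller `delta` leaves fewer time-frequency points in each mask, which corresponds to the trade-off between separation performance and nonlinear distortion noted above.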

尚、ここでは、信号の推定到来方向を特徴量としてクラスタリングを行ったが、例えば、２つのセンサ（各センサｊと基準となるセンサＪ'）における観測信号X_j(f,m)，X_J'(f,m)間の位相差（式（４））自身や、両者のゲイン比

を特徴量としてクラスタリングを行うこととしてもよい。また、観測信号ベクトルX(f,m)をクラスタリングすることもできる。 Here, clustering is performed using the estimated direction of arrival of the signal as the feature quantity. Alternatively, clustering may use, for example, the phase difference itself (equation (4)) between the observed signals X_j(f, m) and X_J′(f, m) at two sensors (each sensor j and the reference sensor J′), or the gain ratio of the two

as the feature quantity. The observed signal vectors X(f, m) themselves may also be clustered.

［時間領域変換部１６０の処理の詳細］
次に、時間領域変換部１６０の処理の詳細について説明する。
時間領域変換部１６０は、離散フーリエ逆変換により時間周波数領域の分離信号Y_k(f,m)を時間領域の分離信号y_k ^m(n)に変換する。この例の場合、時間領域変換部１６０において、メモリ１１０の記憶領域１１２からＳ，Ｔを読み出し、記憶領域１１９からｋ毎にＳフレーム分の分離信号Y_k(f,m),Y_k(f,m+1),Y_k(f,m+2),...,Y_k(f,m+S−1)を読み出す。そして、時間領域変換部１６０は、ｋ毎に各フレームmに対して以下のような時間領域の分離信号y_k ^m(n)を算出して、メモリ１１０の記憶領域１２０に格納する（ステップＳ６）。尚、以下においてvは虚数単位であり、g=n−m・R、R=T/Sである。

[Details of processing of time domain conversion unit 160]
Next, details of the processing of the time domain conversion unit 160 will be described.
The time domain transforming unit 160 transforms the time-frequency domain separated signal Y_k(f, m) into the time-domain separated signal y_k^m(n) by inverse discrete Fourier transform. In this example, the time domain transforming unit 160 reads S and T from the storage area 112 of the memory 110 and, for each k, reads S frames of separated signals Y_k(f, m), Y_k(f, m+1), Y_k(f, m+2), ..., Y_k(f, m+S−1) from the storage area 119. Then, for each k and each frame m, it calculates the time-domain separated signal y_k^m(n) given below and stores it in the storage area 120 of the memory 110 (step S6). In the following, v is the imaginary unit, g = n − m·R, and R = T/S.

図６は、時間周波数領域の分離信号Y_k(f,m)から時間領域の分離信号y_k ^m(n)が生成され、さらにそれらが加算合成される様子を例示した概念図である。
この図に示すように、この時間領域変換部１６０の処理により、各フレーム{m, m+1,m+2,...,m+S−1}の時間周波数領域の分離信号Y_k(f,m),Y_k(f,m+1),Y_k(f,m+2),...,Y_k(f,m+S−1)は、それぞれn=m・Rから長さTだけ値を持ち、あとは０であるような時間領域の分離信号y_k ^m(n)に変換される。 FIG. 6 is a conceptual diagram illustrating a state in which a time-domain separated signal y _k ^m (n) is generated from a time-frequency domain separated signal Y _k (f, m) and further added and synthesized.
As shown in this figure, the processing of the time domain transforming unit 160 converts the time-frequency domain separated signals Y_k(f, m), Y_k(f, m+1), Y_k(f, m+2), ..., Y_k(f, m+S−1) of the frames {m, m+1, m+2, ..., m+S−1} into time-domain separated signals y_k^m(n), each of which takes values over a length T starting at n = m·R and is 0 elsewhere.

尚、式（６）の右辺（m・R≦ｎ≦m・R＋Ｔ−１）は、

の関係を満たす。すなわち、ここで算出された時間領域の分離信号y_k ^m(n)は、本来の時間領域の分離信号y_k(n)に窓関数成分w(n−m・R)を乗じたものとなっている。これは、時間周波数領域変換部１３０の乗算部１３８において混合信号に窓関数w(n)を乗じていたことに起因するものである。 The right side of equation (6) (for m·R ≤ n ≤ m·R + T − 1) satisfies the following relationship.

In other words, the time-domain separated signal y_k^m(n) calculated here is the original time-domain separated signal y_k(n) multiplied by the window function component w(n − m·R). This is because the multiplication unit 138 of the time-frequency domain transforming unit 130 multiplied the mixed signal by the window function w(n).

［重畳加算部１７０の処理の詳細］
次に、重畳加算部１７０の処理の詳細について説明する。
この例の場合、重畳加算部１７０において、メモリ１１０の記憶領域１２０から各フレーム{m, m+1,m+2,...,m+S−1}の時間領域の分離信号y_k ^m(n),y_k ^m+1(n),y_k ^m+2(n),...,y_k ^m+S-1(n)を読み出し、これらを以下の式に従って加算合成し、その合成結果を時間領域の分離信号y_k(n)としてメモリ１１０の記憶領域１２１に格納する（ステップ７）。尚、この例のＣは窓関数w(n)とシフト率1／Ｓによって決められた定数である（後述）。

この処理により、図６に示すような加算合成された分離信号y_k(n)が生成される。尚、ここでは分離信号y_k(n)の順序は規定されない。すなわちy_k(n)が原信号s_k(n)の推定値とは限らない。 [Details of processing of superposition adding unit 170]
Next, the details of the processing of the superposition adding unit 170 will be described.
In this example, the superposition adding unit 170 reads the time-domain separated signals y_k^m(n), y_k^{m+1}(n), y_k^{m+2}(n), ..., y_k^{m+S−1}(n) of the frames {m, m+1, m+2, ..., m+S−1} from the storage area 120 of the memory 110, adds and synthesizes them according to the following equation, and stores the result in the storage area 121 of the memory 110 as the time-domain separated signal y_k(n) (step S7). In this example, C is a constant determined by the window function w(n) and the shift rate 1/S (described later).

As a result of this processing, the separated combined signal y _k (n) as shown in FIG. 6 is generated. Here, the order of the separated signals y _k (n) is not defined. That is, y _k (n) is not necessarily an estimated value of the original signal s _k (n).
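Steps S6 and S7 can be illustrated with a round trip: analysis with a T/S shift, a per-frame inverse DFT, and overlap-add scaled by 1/C. Equations (6) and (8) are not legible in this text, so the sketch assumes y_k^m(n) = (1/T) Σ_b Y_k(b, m)·exp(v2πbg/T) with g = n − m·R inside the frame (0 elsewhere) and y_k(n) = (1/C) Σ_m y_k^m(n). The all-pass case (mask of ones), the integer bin index b, and the function names are further assumptions made for illustration.

```python
import cmath
import math


def hanning(T):
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / T) for n in range(T)]


def analyze(x, T, S, w):
    """Windowed DFT of each frame with shift R = T/S (direct sum, no FFT)."""
    R = T // S
    frames = []
    for m in range((len(x) - T) // R + 1):
        seg = [w[r] * x[r + m * R] for r in range(T)]
        frames.append([sum(seg[r] * cmath.exp(-2j * math.pi * b * r / T)
                           for r in range(T)) for b in range(T)])
    return frames


def frame_to_time(Y_m, T):
    """Inverse DFT of one frame; returns T real samples indexed by g = n - m*R."""
    return [sum(Y_m[b] * cmath.exp(2j * math.pi * b * g / T)
                for b in range(T)).real / T for g in range(T)]


def overlap_add(Y, T, S, C, length):
    """y(n) = (1/C) * sum over frames m of y^m(n) (assumed form of the synthesis)."""
    R = T // S
    y = [0.0] * length
    for m, Y_m in enumerate(Y):
        seg = frame_to_time(Y_m, T)
        for g in range(T):
            y[m * R + g] += seg[g] / C
    return y


# Hypothetical round trip with no masking: T = 16, S = 4, Hanning window, C = 0.5*S.
T, S = 16, 4
x = [math.sin(0.3 * n) for n in range(64)]
Y = analyze(x, T, S, hanning(T))
y = overlap_add(Y, T, S, C=0.5 * S, length=len(x))
```

Away from the first and last T samples, where fewer than S frames overlap, `y` reproduces `x`, which is the behaviour the choice C = 0.5S for the Hanning window is meant to give.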

［本形態に適した窓関数w(n)］
次に、本形態に適した窓関数w(n)について説明する。
上述した通り、時間周波数領域変換部１３０の乗算部１３８において混合信号に窓関数w(n)を乗じている関係上、時間領域変換部１６０で算出される時間領域の分離信号y_k ^m(n)は、本来の時間領域の分離信号y_k(n)に窓関数成分w(n−m・R)（式（７））を乗じたものとなっている。よって、式（８）に従って時間領域の分離信号y_k ^m(n)を加算合成した際における窓関数成分w(n−m・R)の加算合成分が歪みの原因とならないよう適切な窓関数w(n)を選択しなければならない。具体的には、各フレームｍに対応する時間領域の分離信号y_k ^m(n)が有する窓関数成分w(n−m・R)の加算合成値が時間軸上の所定の範囲において一定となる窓関数w(n)を用いることが望ましい。すなわち、窓関数w(n)は、

を満たすものであることが望ましい。 [Window function w (n) suitable for this form]
Next, the window function w (n) suitable for this embodiment will be described.
As described above, because the multiplication unit 138 of the time-frequency domain transforming unit 130 multiplies the mixed signal by the window function w(n), the time-domain separated signal y_k^m(n) calculated by the time domain transforming unit 160 is the original time-domain separated signal y_k(n) multiplied by the window function component w(n − m·R) (equation (7)). An appropriate window function w(n) must therefore be chosen so that the summed window function components w(n − m·R) do not cause distortion when the time-domain separated signals y_k^m(n) are added and synthesized according to equation (8). Specifically, it is desirable to use a window function w(n) for which the sum of the window function components w(n − m·R) contained in the time-domain separated signals y_k^m(n) of the frames m is constant over a predetermined range on the time axis. That is, it is desirable that the window function w(n) satisfy the following relation.

このような条件を満たす窓関数としては以下のものを例示できる。
ハニング窓（Hanning window）
w(n)=0.5−0.5・cos(2πn/T） (n=0,...,T−1）
ハミング窓（Hamming window）
w(n)=0.54−0.46・cos(2πn/T） (n=0,...,T−1）
バートレット（三角）窓（Bartlett window）

Examples of window functions that satisfy this condition include the following.
Hanning window
w(n) = 0.5 − 0.5·cos(2πn/T) (n = 0, ..., T−1)
Hamming window
w(n) = 0.54 − 0.46·cos(2πn/T) (n = 0, ..., T−1)
Bartlett (triangular) window

図７（ａ）〜（ｃ）は、これらのハニング窓（Hanning window）、ハミング窓（Hamming window）及びバートレット（三角）窓（Bartlett window）をそれぞれ式（９）に従って加算合成した様子を示した図である。この図に示すように各窓関数の窓関数波形２０１，２１１，２２１を式（９）に従って加算合成した加算合成波形２０２，２１２，２２２は、n=500〜1300程度の範囲において１となっている。 FIGS. 7(a) to 7(c) show the Hanning window, the Hamming window, and the Bartlett (triangular) window, respectively, added and synthesized according to equation (9). As shown in the figure, the summed waveforms 202, 212, and 222, obtained by adding and synthesizing the window function waveforms 201, 211, and 221 of the respective window functions according to equation (9), equal 1 in the range of approximately n = 500 to 1300.

尚、上述したハニング窓やバートレット窓を用いる場合、式（８）における定数Ｃは、0.5Sであることが望ましく、ハミング窓を用いる場合には、0.54Sであることが望ましい。また、一般化してハニング窓やバートレット窓をｂ倍した窓関数を用いる場合には、定数Ｃは、0.5S・bであることが望ましく、ハミング窓を用いる場合には、0.54S・bであることが望ましい。さらに、式（８）で算出される分離信号y_k(n)の出力レベルが問題にならないのであれば、定数Ｃをこれ以外の値（例えば、窓関数w(n)やシフト率1／Ｓに依存しない値）としてもよい。 When the Hanning window or Bartlett window described above is used, the constant C in Equation (8) is preferably 0.5S, and when a Hamming window is used, it is preferably 0.54S. Further, when a window function obtained by generalizing a Hanning window or Bartlett window is multiplied by b, the constant C is preferably 0.5 S · b, and when a Hamming window is used, it is 0.54 S · b. It is desirable. Further, if the output level of the separated signal y _k (n) calculated by the equation (8) is not a problem, the constant C is set to a value other than this (for example, the window function w (n) or the shift ratio 1 / S). It may be a value independent of.
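The values C = 0.5S and C = 0.54S can be checked numerically: for a shift of R = T/S, the sum of the shifted window components w(n − m·R) over the S frames covering a sample n is constant. The sketch below uses the Hanning and Hamming formulas given above; the helper name `overlap_sum` and the choice T = 512, S = 8 are assumptions for illustration.

```python
import math


def hanning(T):
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / T) for n in range(T)]


def hamming(T):
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / T) for n in range(T)]


def overlap_sum(w, S, n):
    """Sum of w(n - m*R) over the S frames whose support contains sample n.

    Valid for samples far enough from the signal edges that all S frames exist.
    """
    T = len(w)
    R = T // S
    # The offsets inside the overlapping frames are (n mod R) + k*R, k = 0..S-1.
    return sum(w[n % R + k * R] for k in range(S))


T, S = 512, 8
R = T // S
hann_sums = [overlap_sum(hanning(T), S, n) for n in range(R)]
hamm_sums = [overlap_sum(hamming(T), S, n) for n in range(R)]
```

Every entry of `hann_sums` is 0.5·S = 4 and every entry of `hamm_sums` is 0.54·S = 4.32 up to rounding, so dividing by C in the overlap-add step restores unit gain.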

＜本形態の特徴＞
［聴感上の特徴］
以上説明した通り、時間周波数毎のマスクM_k(f,m)による非線形処理によって時間周波数領域の分離信号Y_k(f,m)が抽出され、時間領域変換部１６０において長さＴの時間領域の信号y_k ^m(n)に変換され、式（８）によって加算合成して分離信号y_k(n)を算出する。ここでは、滑らかに分離信号y_k(n)が抽出されるかを定性的に確かめるため重畳加算されたマスクの形状を示す。これは、式（５）において各周波数ｆ，各離散時刻ｍで決めた時間周波数毎のマスクM_k(f,m)を式（８）と同様に重畳加算したものであり、次のように表されるものである。

<Features of this embodiment>
[Hearing characteristics]
As described above, the time-frequency domain separated signal Y_k(f, m) is extracted by nonlinear processing with the mask M_k(f, m) for each time-frequency point, converted by the time domain transforming unit 160 into the time-domain signal y_k^m(n) of length T, and added and synthesized by equation (8) to calculate the separated signal y_k(n). To confirm qualitatively whether the separated signal y_k(n) is extracted smoothly, the shape of the superimposed and added mask is shown here. This is the mask M_k(f, m) for each time-frequency point, determined in equation (5) at each frequency f and each discrete time m, superimposed and added in the same way as in equation (8), and is expressed as follows.

図８は、縦軸をM_f(n)とし横軸をサンプリング時刻ｎとしたグラフである。ここで、図８（ａ）は、S=2（粗いシフト）の場合におけるM_f(n)のグラフであり、図８（ｂ）は、S=8（細かいシフト）の場合におけるM_f(n)のグラフである。また、T=512（サンプリング周波数=8000Hzでの実験のため、64msに相当）であり、S=2の場合のシフト量R=T/Sは256（サンプリング周波数=8000Hzでの実験のため、32msに相当）であり、S=8の場合のシフト量R=T/Sは64（サンプリング周波数=8000Hzでの実験のため、8msに相当）である。また、図８のM_f(n)は、図９の実験条件において信号源Ｓ_１から発せられた信号を抽出するマスクM_k(f,m)に式（１０）を適用したものである。 FIG. 8 is a graph in which the vertical axis represents M_f(n) and the horizontal axis represents the sampling time n. FIG. 8(a) is a graph of M_f(n) for S = 2 (coarse shift), and FIG. 8(b) is a graph of M_f(n) for S = 8 (fine shift). Here, T = 512 (64 ms, since the experiment used a sampling frequency of 8000 Hz); the shift amount R = T/S is 256 for S = 2 (32 ms) and 64 for S = 8 (8 ms). M_f(n) in FIG. 8 is obtained by applying equation (10) to the mask M_k(f, m) that extracts the signal emitted from the signal source S_1 under the experimental conditions of FIG. 9.

図８から、S=2のときにM_f(n)には32ms(>22ms)のギャップがあることが分かる（図８（ａ）の位置（Ａ）や位置（Ｃ））。ここで、時間方向の音響信号の平滑さについて、「ヒトは、22ms以上のギャップ（連続信号の中の急な立ち上がり／立ち下がりを伴う無音区間)を認知できる」とされている（例えば、"B. C. J. Moore, An introduction to the psychology of hearing, 3rd Ed., Academic Press, 1989."参照）。よって、このギャップがmusical noiseの一因となっていることは十分考えられる。
一方、S=8のときは、ギャップは見られるものの（図８（ｂ）の位置（Ｂ））その長さは8msであり、ヒトには認知されない。さらに、図８（ｂ）の位置（Ｄ）付近では、M_f(n)の値は徐々に変化している。そのため、位置（Ｄ）付近のギャップはmusical noiseの原因となりにくい。すなわち、シフト長を短くすることによりmusical noiseの発生を低減させることができると考えられる。 From FIG. 8, it can be seen that there is a gap of 32 ms (> 22 ms) in M _f (n) when S = 2 (position (A) and position (C) in FIG. 8A). Here, with regard to the smoothness of the acoustic signal in the time direction, it is said that “a human can recognize a gap of 22 ms or more (a silent section with a sudden rise / fall in a continuous signal)” (for example, “ BCJ Moore, An introduction to the psychology of hearing, 3rd Ed., Academic Press, 1989. "). Therefore, it is quite possible that this gap contributes to musical noise.
On the other hand, when S = 8, although a gap is seen (position (B) in FIG. 8B), its length is 8 ms and is not recognized by humans. Furthermore, the value of M _f (n) gradually changes near the position (D) in FIG. Therefore, the gap near the position (D) is unlikely to cause musical noise. That is, it is considered that the generation of musical noise can be reduced by shortening the shift length.

また、M_f(n)の振幅の変化を見た場合、S=2（粗いシフト）の場合は一段が0.5と大きいのに比べ、S=8（細かいシフト）の場合は一段が0.125と小さい。これは、M_k(f,m)∈｛1，0｝であるので一段の大きさが必ず1/Sとなることに起因する。すなわち、細かいシフトを用いた場合、M_f(n)の振幅が急激に変化することはない。よって本形態で用いる細かいシフトと加算合成とは、musical noiseの一因である分離信号の振幅の急激な変化を起こさないことが分かる。
以上のようにS=2の場合に比べて有利な効果を有するのはS=4の場合についても同様である。 Also, looking at the change in amplitude of M_f(n), one step is as large as 0.5 for S = 2 (coarse shift), whereas one step is as small as 0.125 for S = 8 (fine shift). This is because M_k(f, m) ∈ {1, 0}, so the size of one step is always 1/S. That is, when a fine shift is used, the amplitude of M_f(n) does not change abruptly. The fine shift and additive synthesis used in this embodiment therefore avoid abrupt changes in the amplitude of the separated signal, which are one cause of musical noise.
As with S = 8, the case of S = 4 likewise has these advantages over the case of S = 2.
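The numbers quoted in this discussion follow from simple arithmetic on T, S, and the sampling frequency. The short Python sketch below reproduces them; the function name `shift_properties` and the dictionary keys are hypothetical, and the 22 ms figure is the gap-detection threshold cited above.

```python
def shift_properties(T, S, fs_hz, gap_threshold_ms=22.0):
    """Shift length, its duration, the mask step 1/S, and whether the shift
    duration reaches the cited gap-detection threshold."""
    R = T // S                            # shift length in samples, R = T/S
    shift_ms = 1000.0 * R / fs_hz         # shift length in milliseconds
    return {
        "R": R,
        "shift_ms": shift_ms,
        "step": 1.0 / S,                  # size of one step of M_f(n)
        "gap_audible": shift_ms >= gap_threshold_ms,
    }


coarse = shift_properties(T=512, S=2, fs_hz=8000)
medium = shift_properties(T=512, S=4, fs_hz=8000)
fine = shift_properties(T=512, S=8, fs_hz=8000)
```

For S = 2 the shift lasts 32 ms, above the 22 ms threshold, while S = 4 (16 ms) and S = 8 (8 ms) both fall below it, consistent with the statement that S = 4 also improves on S = 2.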

［実験結果（分離性能による客観評価およびＭＯＳによる主観評価］
図１０は、本形態においてシフト量Ｓを変化させた場合の分離信号y_k(n)の性能比較を示した表である。ここでSIR（信号対干渉信号比）は分離性能を、SDR（信号対歪み比）は信号の歪みの程度を、MOS（Mean Opinion Score）は聴感上の品質を表す。特にMOSは、実際に20名の被験者が抽出された出力信号(音)を聞いて点数をつけたものの平均値である。どの評価値についでも、値が大きい方が性能が良いことを示す。また、図１０（ａ）は残響のない環境における結果を示しており、図１０（ｂ）は残響時間RT₆₀=130msのときの結果を示している。 [Experimental results (objective evaluation by separation performance and subjective evaluation by MOS]
FIG. 10 is a table showing a performance comparison of the separated signal y _k (n) when the shift amount S is changed in the present embodiment. Here, SIR (signal-to-interference signal ratio) represents separation performance, SDR (signal-to-distortion ratio) represents the degree of signal distortion, and MOS (Mean Opinion Score) represents auditory quality. In particular, MOS is an average value of points obtained by listening to output signals (sounds) actually extracted by 20 subjects. For any evaluation value, a larger value indicates better performance. FIG. 10A shows the result in an environment without reverberation, and FIG. 10B shows the result when the reverberation time RT ₆₀ = 130 ms.

原信号としては、3人の話者（男性2名・女性1名）による音声信号を用いた（図９）。図１０（ａ）では、残響のない環境における２つの無指向性マイクでの観測信号を模擬した。図１０（ｂ）では、130msの残響のある部屋で観測したインパルス応答を原信号に畳み込み、式（１）に従って観測信号を作成した。また、ＤＯＡでクラスタリングを行い、窓関数としてはハニング窓w(n)=0.5−0.5・cos(2πn/T） (n=0,...,T−1）を用い、式（８）におけるCは0.5Sとした。
図１０の結果より、シフト長を短くする（=Sを大きくする）ことで、分離性能（SIR）を落とすことなく、SDR及び'MOSの値を向上させることが可能であることが分かる。ここで、MOSだけでなくSDRの値も上がっている。よって本形態の手法では、音響信号musical noiseの除去のみでなく、一般の信号の非線型歪みの除去にも有効であると期待される。尚、MOS値については、分散分析により有意性が認められた。 As the original signal, speech signals from three speakers (two men and one woman) were used (Fig. 9). In FIG. 10A, an observation signal with two omnidirectional microphones in an environment without reverberation was simulated. In FIG. 10 (b), an impulse response observed in a room with 130 ms of reverberation was convolved with the original signal, and an observation signal was created according to equation (1). Also, clustering is performed with DOA, and Hanning window w (n) = 0.5−0.5 · cos (2πn / T) (n = 0,..., T−1) is used as the window function, and in equation (8) C was set to 0.5S.
From the results in FIG. 10, it can be seen that shortening the shift length (that is, increasing S) improves the SDR and MOS values without degrading the separation performance (SIR). Note that the SDR improves as well as the MOS. The method of this embodiment is therefore expected to be effective not only for removing musical noise from acoustic signals but also for removing nonlinear distortion from signals in general. The significance of the MOS values was confirmed by analysis of variance.

さらに、シフトを細かくすることでSIRも向上している。これは、細かい窓シフト長（ファインシフト）を用いることで各周波数でのサンプル点が多くなり、クラスタリングの精度が上がるためと考えられる。これもファインシフトの効果の1つである。
図１１は、その他の信号分離抽出手法を用いたときの窓シフト量による性能を比較した結果である。ここで図１１（ａ）は、前述したマスクM_k(f,m)と独立成分分析（ICA）とを組み合わせた手法を用いたときの性能比較を示しており、図１１（ｂ）は、前述したマスクM_k(f,m)とビームフォーマとを組み合わせた手法を用いたときの性能比較を示している。尚、共に前述の残響のある環境下での結果を示している。ここでもシフトを細かくする（Ｓを大きくする）ことで、分離性能（SIR）を落とすことなく、SDRの値を向上できていることが分かる。尚、本形態により聴感上良い音が得られることを＜URL＞http://www.kecl.ntt.co.jp/icl/signal/araki/fineshift.htmlのホームページ上に示す。 Furthermore, SIR is improved by making the shift finer. This is presumably because the use of a fine window shift length (fine shift) increases the number of sample points at each frequency and improves the accuracy of clustering. This is one of the effects of fine shift.
FIG. 11 compares performance for different window shift amounts when other signal separation and extraction methods are used. FIG. 11(a) shows the comparison for the method described above that combines the mask M_k(f, m) with independent component analysis (ICA), and FIG. 11(b) shows the comparison for the method described above that combines the mask M_k(f, m) with a beamformer. Both show results in the reverberant environment described above. Here too, making the shift finer (increasing S) improves the SDR without degrading the separation performance (SIR). Audio examples showing that this embodiment yields perceptually good sound are available at http://www.kecl.ntt.co.jp/icl/signal/araki/fineshift.html.

The present invention is not limited to the embodiment described above. For example, the various processes shown in FIG. 4, FIG. 5, and elsewhere need not be executed only in the chronological order described; their order may be changed as necessary. Each of these processes may also be executed in parallel or individually, depending on the processing capability of the apparatus that executes them or as required.

In this embodiment, a binary mask as shown in Expression (5), in which the transition from the high-level value ("1" in the above example) to the low-level value ("0" in the above example) is discontinuous, is used as the mask. Instead, however, a smoothly shaped mask in which the transition from the high-level value to the low-level value is continuous may be used (see, for example, S. Araki, S. Makino, H. Sawada and R. Mukai, "Underdetermined Blind Speech Separation with Directivity Pattern based Continuous Mask and ICA," EUSIPCO2004, pp. 1991-1994, Sept. 2004). Here, a "smoothly shaped mask" is a mask that attempts to extract smoothly, rather than cutting spatially with 1/0, the signal estimated to have arrived at the sensors from a physically limited range of directions; it is, so to speak, a spatially smooth mask. In contrast, the present embodiment described above has the effect of suppressing abrupt changes of the signal in the time direction, and can be said to generate a temporally smooth mask. Combining a spatially smooth mask with the configuration of this embodiment can therefore be expected to reduce the distortion of the separated signals even further.
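The difference between a discontinuous and a continuous transition can be sketched as follows. This is only an illustration: the feature is assumed to be an estimated arrival direction in degrees, the names `binary_mask`, `smooth_mask`, `center`, `width`, and `slope` are hypothetical, and the logistic roll-off is a stand-in, not the directivity-pattern-based mask of the cited paper.

```python
import numpy as np


def binary_mask(theta, center, width):
    """Return 1 where |theta - center| <= width and 0 elsewhere (discontinuous)."""
    theta = np.asarray(theta, dtype=float)
    return (np.abs(theta - center) <= width).astype(float)


def smooth_mask(theta, center, width, slope):
    """Logistic roll-off around the same boundary (continuous transition).

    The value is close to 1 well inside the pass range, exactly 0.5 at the
    boundary |theta - center| == width, and decays toward 0 outside it.
    """
    theta = np.asarray(theta, dtype=float)
    d = np.abs(theta - center)
    return 1.0 / (1.0 + np.exp((d - width) / slope))


if __name__ == "__main__":
    # Hypothetical direction estimates (degrees) for a few time-frequency points.
    theta = np.array([0.0, 10.0, 20.0, 30.0, 60.0])
    print(binary_mask(theta, center=0.0, width=20.0))
    print(smooth_mask(theta, center=0.0, width=20.0, slope=5.0))
```

With the binary mask, a point just outside the boundary is discarded entirely; with the smooth mask, it is only attenuated, which is the behaviour the passage above describes as spatially smooth.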

As the signal separation method of this embodiment, a method that combines such a smoothly shaped mask with another signal separation and extraction method, for example independent component analysis (ICA), may also be used (see, for example, S. Araki, S. Makino, H. Sawada and R. Mukai, "Underdetermined Blind Speech Separation with Directivity Pattern based Continuous Mask and ICA," EUSIPCO2004, pp. 1991-1994, Sept. 2004).

Although N > M is assumed in this embodiment, the present invention may also be applied when N ≦ M. Needless to say, other modifications may be made as appropriate without departing from the spirit of the present invention.

The signal separation program describing the above processing can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, such as a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or a magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The signal separation program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the signal separation program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers over a network.

As another form of executing the signal separation program, a computer may read the program directly from a portable recording medium and execute processing according to it. The computer may also execute processing according to the received program successively, each time the program is transferred to it from the server computer. Further, the above processing may be executed by a so-called ASP (Application Service Provider) type service, in which the signal separation program is not transferred from the server computer to the computer and the processing functions are realized solely through execution instructions and retrieval of results. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the apparatus is configured by executing a predetermined program on a computer; however, at least part of the processing may be implemented in hardware.

The present invention makes it possible to separate and extract a target signal accurately even in an environment containing various kinds of noise, jamming signals, and interfering signals. For example, when the present invention is applied to the audio field, a speech recognition system with a high recognition rate can be built by separating and extracting the target speech, even when the input microphone of the speech recognizer is located away from the speaker and picks up sounds other than the target speaker's voice.

FIG. 1 is a block diagram illustrating the hardware configuration of the signal separation device.
FIG. 2 is a block diagram of the signal separation device configured by loading the signal separation program into the CPU.
FIG. 3(a) is a block diagram illustrating details of the time-frequency domain transform unit in FIG. 2. FIG. 3(b) is a block diagram showing the configuration of the time domain transform unit and the superposition addition unit in FIG. 2.
FIG. 4 is a flowchart for explaining the overall processing of the signal separation device.
FIG. 5 is a flowchart for explaining the transformation from the time domain to the time-frequency domain performed by the time-frequency domain transform unit.
FIG. 6 is a conceptual diagram illustrating how the time domain separated signals y_k^m(n) are generated from the time-frequency domain separated signals Y_k(f,m) and then added and combined.
FIG. 7(a) shows Hanning windows added and combined according to Expression (9). FIG. 7(b) shows Hamming windows added and combined according to Expression (9). FIG. 7(c) shows Bartlett (triangular) windows added and combined according to Expression (9).
FIG. 8 shows graphs with M_f(n) on the vertical axis and the sampling time n on the horizontal axis. FIG. 8(a) is the graph of M_f(n) for S = 2 (coarse shift). FIG. 8(b) is the graph of M_f(n) for S = 8 (fine shift).
FIG. 9 is a diagram showing the experimental conditions.
FIG. 10 is a table showing a performance comparison of the separated signals y_k(n) when the shift amount S is varied.
FIG. 11(a) is a table showing a performance comparison for the method combining the mask M_k(f,m) with independent component analysis (ICA). FIG. 11(b) is a table showing a performance comparison for the method combining the above-described mask M_k(f,m) with a beamformer.

Explanation of symbols

1 Signal separation device
130 Time-frequency domain transform unit
140 Time-frequency mask estimation unit
150 Time-frequency domain signal extraction unit
160 Time domain transform unit
170 Superposition addition unit

Claims

A signal separation device that separates a mixed signal, in which signals emitted from a plurality of signal sources are mixed, into the respective signals, the device comprising:
time-frequency domain transforming means for multiplying the mixed signal by a window function in each of frames set at a shift of 1/S times (S > 2) the length T of the window function, and transforming each result of the multiplication into a time-frequency domain signal;
clustering means for clustering feature quantities extracted from the time-frequency domain signals to generate clusters;
mask generating means for generating a mask for each time-frequency point using information on the clusters;
time-frequency domain signal extraction means for extracting a time-frequency domain separated signal for each time-frequency point using the mask and the time-frequency domain signals;
time domain transforming means for transforming the time-frequency domain separated signals into time domain separated signals; and
superposition addition means for adding and combining the time domain separated signals corresponding to the respective frames.

The signal separation device according to claim 1,
wherein the window function is a function for which the sum of the window function components of the time domain separated signals corresponding to the respective frames is constant over a predetermined range on the time axis.
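The constant-sum property recited here can be checked numerically. The sketch below is an illustration under stated assumptions, not the expression used in the embodiment: it uses a periodic Hann window, and the helper name `overlap_sum`, the values T = 512 and S = 8, and the frame count of 32 are all hypothetical.

```python
import numpy as np


def overlap_sum(window, S, n_frames):
    """Add n_frames copies of `window`, each shifted by len(window) // S samples."""
    T = len(window)
    shift = T // S
    total = np.zeros(shift * (n_frames - 1) + T)
    for m in range(n_frames):
        total[m * shift : m * shift + T] += window
    return total


T, S = 512, 8
n = np.arange(T)
# Periodic Hann window (assumed here; written out explicitly on purpose).
hann = 0.5 - 0.5 * np.cos(2.0 * np.pi * n / T)

total = overlap_sum(hann, S, n_frames=32)
# Ignore the first and last T samples, where fewer than S frames overlap.
interior = total[T:-T]
print(interior.min(), interior.max())
```

Away from the edges the two printed values agree, so dividing the added signal by that constant removes the effect of the window. The constant depends on the window and on S, which is one reason a sketch like this is useful when changing the shift.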

The signal separation device according to claim 1,
wherein T and S are powers of 2.

The signal separation device according to claim 1,
wherein the time-frequency domain transforming means comprises:
T/S shift means for extracting the mixed signal at discrete times r + m · T/S (r and m being integers);
multiplication means for multiplying the mixed signal at the discrete time r + m · T/S by the value of the window function at the discrete time r; and
discrete Fourier transform means for applying a discrete Fourier transform to the result of the multiplication by the multiplication means.
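The three operations recited above (extraction at r + m · T/S, multiplication by the window value at r, and a discrete Fourier transform) can be sketched in a few lines. This is a minimal illustration, assuming a single-channel signal, complete frames only, and the hypothetical function name `stft_fine_shift`; it is not the embodiment's implementation.

```python
import numpy as np


def stft_fine_shift(x, window, S):
    """Return X[m, f]: the DFT of frame m, where frames are shifted by T // S."""
    x = np.asarray(x, dtype=float)
    T = len(window)
    if T % S != 0:
        raise ValueError("window length T must be divisible by S")
    if len(x) < T:
        raise ValueError("signal is shorter than one window")
    shift = T // S
    n_frames = (len(x) - T) // shift + 1
    r = np.arange(T)  # discrete time within a frame
    # Frame m takes x at times r + m*T/S and multiplies by the window at r.
    frames = np.stack([x[r + m * shift] * window for m in range(n_frames)])
    return np.fft.fft(frames, axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(4096)          # hypothetical test signal
    T = 512
    w = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(T) / T)
    X = stft_fine_shift(x, w, S=8)
    print(X.shape)                          # (number of frames, T)
```

When T and S are both powers of 2, the shift T/S is an integer and the frame length suits a radix-2 FFT.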

A signal separation method for separating a mixed signal, in which signals emitted from a plurality of signal sources are mixed, into the respective signals, the method comprising:
a step in which time-frequency domain transforming means, to which the mixed signal is input, multiplies the mixed signal by a window function in each of frames set at a shift of 1/S times (S > 2) the length T of the window function, and transforms each result of the multiplication into a time-frequency domain signal;
a step in which clustering means clusters feature quantities extracted from the time-frequency domain signals to generate clusters;
a step in which mask generating means generates a mask for each time-frequency point using information on the clusters;
a step in which time-frequency domain signal extraction means extracts a time-frequency domain separated signal for each time-frequency point using the mask and the time-frequency domain signals;
a step in which time domain transforming means transforms the time-frequency domain separated signals into time domain separated signals; and
a step in which superposition addition means adds and combines the time domain separated signals corresponding to the respective frames.

A signal separation program for causing a computer to function as the signal separation device according to claim 1.

A computer-readable recording medium storing the signal separation program according to claim 6.