JP2018128500A

JP2018128500A - Formation device, formation method and formation program

Info

Publication number: JP2018128500A
Application number: JP2017019449A
Authority: JP
Inventors: Takuya Higuchi (樋口 卓哉); Keisuke Kinoshita (木下 慶介); Tomohiro Nakatani (中谷 智広)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: NTT Inc
Priority date: 2017-02-06
Filing date: 2017-02-06
Publication date: 2018-08-16
Anticipated expiration: 2037-02-06
Also published as: JP6711765B2

Abstract

[Problem] To perform speech enhancement that is optimal for speech recognition according to the environment. [Solution] An acquisition unit 15a acquires observation signals at a plurality of points, the signals including an acoustic signal of a target speech to be recognized and an acoustic signal of noise other than the target speech. A speech enhancement unit 15d uses a predetermined beamformer w_f, which forms a beam of the acoustic signal for each frequency, to calculate an acoustic signal of enhanced speech in which the target speech in the observation signals is emphasized. A speech recognition unit 15e estimates a phoneme probability distribution of the calculated enhanced speech and assigns to the enhanced speech a reference label indicating a phoneme. An optimization unit 15f optimizes the beamformer w_f so as to minimize the difference between the reference label and the phoneme probability distribution of the enhanced speech. [Selected drawing] FIG. 1

Description

The present invention relates to a forming apparatus, a forming method, and a forming program.

Conventionally, techniques have been disclosed for calculating, before speech recognition is performed, a beamformer that forms a beam in which noise is suppressed and speech is enhanced (see Non-Patent Documents 1 and 2). A technique has also been disclosed for estimating speech enhancement parameters according to the environment so that speech recognition can be performed more clearly (see Non-Patent Document 3).

Non-Patent Document 1: T. Higuchi, N. Ito, T. Yoshioka, T. Nakatani, "Robust MVDR Beamforming Using Time-Frequency Masks for Online/Offline ASR in Noise", 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2016, pp. 5210-5214
Non-Patent Document 2: L. J. Griffiths, C. W. Jim, "An Alternative Approach to Linearly Constrained Adaptive Beamforming", IEEE Transactions on Antennas and Propagation, vol. AP-30, no. 1, January 1982, pp. 27-34
Non-Patent Document 3: T. Higuchi, T. Yoshioka, T. Nakatani, "Optimization of Speech Enhancement Front-end with Speech Recognition-level Criterion", Interspeech 2016, 2016, pp. 3808-3812

However, because the conventional techniques perform speech enhancement and speech recognition separately, the speech enhancement is not necessarily optimal for speech recognition. Moreover, even with the time-frequency-mask-based speech enhancement technique described in Non-Patent Document 3, the improvement in the speech recognition rate was smaller than the improvement obtained with a beamformer.

The present invention has been made in view of the above, and an object thereof is to perform speech enhancement that is optimal for speech recognition according to the environment.

To solve the above problem and achieve the object, a forming apparatus according to the present invention includes: an acquisition unit that acquires observation signals at a plurality of points, the observation signals including an acoustic signal of a target speech to be recognized and an acoustic signal of noise other than the target speech; a speech enhancement unit that uses a predetermined beamformer for forming a beam of the acoustic signal for each frequency to calculate an acoustic signal of enhanced speech in which the acoustic signal of the target speech in the observation signals is emphasized; a speech recognition unit that estimates a phoneme probability distribution of the calculated enhanced speech and assigns to the enhanced speech a reference label indicating a phoneme; and an optimization unit that optimizes the beamformer so as to minimize the difference between the reference label and the phoneme probability distribution of the enhanced speech.

According to the present invention, speech enhancement that is optimal for speech recognition can be performed according to the environment.

FIG. 1 is a schematic diagram showing the configuration of a forming apparatus according to an embodiment of the present invention.
FIG. 2 is an explanatory diagram of the processing of the estimation unit of the embodiment.
FIG. 3 is an explanatory diagram of the processing of the optimization unit of the embodiment.
FIG. 4 is a flowchart showing the forming process procedure of the embodiment.
FIG. 5 is a diagram illustrating a computer that executes the forming program.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings. The present invention is not limited by this embodiment. In the drawings, identical parts are denoted by the same reference numerals.

[Configuration of forming apparatus]
First, the schematic configuration of the forming apparatus according to the embodiment will be described with reference to FIG. 1. As shown in FIG. 1, the forming apparatus 1 is implemented on a general-purpose computer such as a workstation or a personal computer, and includes an input unit 11, an output unit 12, a communication control unit 13, a storage unit 14, and a control unit 15. The forming apparatus 1 executes the forming process described later to form a beam in which the target speech is enhanced optimally for speech recognition.

The input unit 11 is implemented with input devices such as a keyboard and a mouse, and inputs various instructions to the control unit 15 in response to operations by an operator. The output unit 12 is implemented by a display device such as a liquid crystal display, a printing device such as a printer, an information communication device, a speaker, or the like. For example, after the forming process described later is executed, the output unit 12 outputs the enhanced speech, the speech recognition result, and the like to the operator.

The communication control unit 13 is implemented by a NIC (Network Interface Card) or the like, and controls communication between the control unit 15 and an external device such as a server over a telecommunication line such as a LAN (Local Area Network) or the Internet.

The storage unit 14 is implemented by a semiconductor memory element such as a RAM (Random Access Memory) or a flash memory, or by a storage device such as a hard disk or an optical disk. The storage unit 14 stores, in advance or temporarily for each process, a processing program that operates the forming apparatus 1 and data used during execution of the processing program. The storage unit 14 may also be configured to communicate with the control unit 15 via the communication control unit 13.

In the control unit 15, an arithmetic processing device such as a CPU (Central Processing Unit) executes a processing program stored in memory, whereby the control unit 15 functions as an acquisition unit 15a, a time-frequency analysis unit 15b, an estimation unit 15c, a speech enhancement unit 15d, a speech recognition unit 15e, and an optimization unit 15f, as illustrated in FIG. 1.

The acquisition unit 15a acquires observation signals at a plurality of points, the signals including an acoustic signal of a target speech to be recognized and an acoustic signal of noise other than the target speech. Specifically, in a situation where an acoustic signal from the sound source of a single target speech to be recognized is mixed with an acoustic signal of background noise, the acquisition unit 15a acquires a multi-channel observation signal consisting of M observation signals recorded by microphones installed at M different locations.

The time-frequency analysis unit 15b applies short-time signal analysis, such as the short-time Fourier transform, to the M observation signals acquired by the acquisition unit 15a, and extracts the observation signal for each frequency within the same short-time segment of a predetermined length (hereinafter also referred to as a time-frequency point). The time-frequency analysis unit 15b then uses the extracted observation signals at each time-frequency point to generate an observation vector, which is an M-dimensional column vector.
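
As a minimal sketch of this step, the following pure-Python code stacks per-channel short-time spectra into observation vectors. It assumes a Hann window and a naive discrete Fourier transform, neither of which is specified in the text; the names `stft` and `observation_vectors`, the frame length, and the hop size are hypothetical.

```python
import cmath
import math


def stft(signal, frame_len, hop):
    """Naive short-time Fourier transform; returns spec[t][f], f = 0..frame_len//2."""
    # Periodic Hann window (an assumption; the text does not name a window).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        spec = [sum(seg[n] * cmath.exp(-2j * math.pi * f * n / frame_len)
                    for n in range(frame_len))
                for f in range(frame_len // 2 + 1)]
        frames.append(spec)
    return frames


def observation_vectors(channels, frame_len=8, hop=4):
    """Return y[f][t], a list of M complex values (one per microphone)."""
    specs = [stft(ch, frame_len, hop) for ch in channels]
    n_frames = len(specs[0])
    n_freqs = frame_len // 2 + 1
    return [[[specs[m][t][f] for m in range(len(channels))]
             for t in range(n_frames)]
            for f in range(n_freqs)]
```

Each `y[f][t]` plays the role of the M-dimensional observation vector at one time-frequency point. A production system would more likely use an FFT library than this direct summation.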

Here, because the target speech is sparse, it is assumed that there exist time-frequency points that contain only noise and no target speech (see Non-Patent Document 1). In that case, the observation vector y_{f,t} can be expressed by the following equation (1) or equation (2). Here, t is an integer from 1 to T and denotes the time index, and f is an integer from 0 to F and denotes the frequency index.

A steering vector is a vector whose components are the transfer characteristics from the sound source of the target speech or of the noise to each microphone, and it contains spatial information about the sound source.

The estimation unit 15c separately estimates, among the combinations of per-frequency signals in the same short-time segment of a predetermined length of the observation signal, the probability distribution of the combinations that do not contain the acoustic signal of the target speech. It thereby estimates, for each frequency, a steering vector containing the spatial information of the sound source of the target speech, and calculates the beamformer using the steering vector.

Specifically, the estimation unit 15c first acquires the observation vector y_{f,t} from the time-frequency analysis unit 15b as the combination of per-frequency signals in the same short-time segment of a predetermined length of the observation signal. Next, the estimation unit 15c classifies the observation vectors into a cluster of target speech plus noise and a cluster of noise only, which contains no target speech, and estimates the spatial correlation matrix corresponding to each cluster. The estimation unit 15c then uses these to estimate the spatial correlation matrix of the target speech. From this spatial correlation matrix, the steering vector containing the spatial information of the sound source of the target speech is derived.

The processing of the estimation unit 15c will now be described with reference to FIG. 2. As shown in FIG. 2, the estimation unit 15c includes a parameter estimation unit 151, a mask estimation unit 152, a spatial correlation matrix calculation unit 153, a steering vector calculation unit 154, and a beamformer estimation unit 155.

First, as shown in the following equation (3), the probability distribution of the observation vector y_{f,t} can be modeled as a mixture of the probability distribution of the target-speech-plus-noise cluster (hereinafter also referred to as the posterior probability) and the posterior probability of the noise-only cluster.

In this case, the parameter estimation unit 151 estimates each parameter of equation (3) (hereinafter referred to as the distribution parameters). In doing so, the parameter estimation unit 151 uses the likelihood function shown in the following equation (4) as the objective function.

That is, the parameter estimation unit 151 obtains, as the distribution parameters of the mixture distribution that approximates the distribution of the observation vectors, the distribution parameters that locally maximize the objective function of equation (4).

To apply the EM (Expectation-Maximization) algorithm, the parameter estimation unit 151 defines a Q function representing the conditional expectation of the log-likelihood function, as shown in the following equation (5).

The auxiliary parameter in equation (5) corresponds to a mask that represents the degree to which an observation vector belongs to each cluster, and it can be calculated in the E (expectation) step as shown in the following equation (6).

The update equations for the distribution parameters are derived in the M (maximization) step by setting the partial derivative of the Q function of equation (5) with respect to each parameter to zero, which yields the following equations (7) and (8).

The parameter estimation unit 151 updates the distribution parameters in the M step using equations (7) and (8). The mask estimation unit 152 calculates the auxiliary parameters in the E step using equation (6) with the updated distribution parameters. The estimation unit 15c repeats this update of the distribution parameters and calculation of the auxiliary parameters iteratively. In this way, the parameter estimation unit 151 estimates the distribution parameters that locally maximize the objective function of equation (4), and the mask estimation unit 152 estimates the auxiliary parameters, that is, the masks.
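
Equation (6) is not reproduced in this text. As a hedged illustration of what an E step of this kind typically computes, the sketch below normalizes per-cluster likelihoods into masks using Bayes' rule. The function name `e_step`, the presence of mixture weights, and the numeric values are assumptions made for illustration, not the patent's own formulation.

```python
def e_step(mixture_weights, likelihoods):
    """Mask (posterior) of each cluster at one time-frequency point.

    Assumed form: lambda_k = alpha_k * p_k / sum_j(alpha_j * p_j).
    Equal mixture weights reduce this to p_k / sum_j(p_j).
    """
    joint = [a * p for a, p in zip(mixture_weights, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]


# Two clusters: target speech plus noise, and noise only (hypothetical values).
masks = e_step([0.5, 0.5], [3.0, 1.0])
```

The masks sum to one at each time-frequency point, so the second entry can be read as the degree to which the point belongs to the noise-only cluster.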

Here, multiplying the observation signal by the auxiliary parameter λ^(n)_{f,t} corresponding to the noise-only cluster yields the noise-only observation signal. The noise-only spatial correlation matrix can therefore be obtained by the following equation (9).

The spatial correlation matrix calculation unit 153 can then obtain the spatial correlation matrix of the target speech by subtracting the noise-only spatial correlation matrix from the spatial correlation matrix of the observation signal, as shown in the following equation (10).
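
The mask-weighted correlation and the subtraction described above can be sketched as follows for a single frequency bin. Equations (9) and (10) are not reproduced in this text, so the normalization (dividing by the sum of the weights) is an assumption, and the function names are hypothetical.

```python
def weighted_correlation(ys, weights):
    """Weighted average of y y^H over frames for one frequency bin."""
    m = len(ys[0])
    total = sum(weights)
    acc = [[0j] * m for _ in range(m)]
    for y, w in zip(ys, weights):
        for i in range(m):
            for j in range(m):
                acc[i][j] += w * y[i] * y[j].conjugate()
    return [[v / total for v in row] for row in acc]


def target_correlation(ys, noise_mask):
    """Observation correlation matrix minus noise-only correlation matrix."""
    r_obs = weighted_correlation(ys, [1.0] * len(ys))
    r_noise = weighted_correlation(ys, noise_mask)
    m = len(r_obs)
    return [[r_obs[i][j] - r_noise[i][j] for j in range(m)] for i in range(m)]
```

Here `ys` is the list of observation vectors over time for one frequency and `noise_mask` holds the corresponding noise-only masks.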

Next, the steering vector calculation unit 154 performs eigenvalue decomposition of the spatial correlation matrix of the target speech and derives the eigenvector corresponding to the first eigenvalue as the steering vector of the target speech.
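
A minimal sketch of extracting that eigenvector is shown below. It assumes that the "first eigenvalue" means the largest one, and it swaps in power iteration for a full eigenvalue decomposition to stay within the standard library. Power iteration converges only when the dominant eigenvalue is strictly largest in magnitude and the starting vector is not orthogonal to its eigenvector. The function name is hypothetical.

```python
import math


def principal_eigenvector(matrix, iterations=200):
    """Approximate the unit-norm eigenvector of the largest eigenvalue."""
    m = len(matrix)
    v = [1 + 0j] * m  # arbitrary starting vector
    for _ in range(iterations):
        # Multiply by the matrix, then renormalize to unit length.
        v = [sum(matrix[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(abs(x) ** 2 for x in v))
        v = [x / norm for x in v]
    return v
```

In practice a library routine for Hermitian eigenvalue decomposition would usually be preferred over this loop.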

The beamformer estimation unit 155 uses the estimated steering vector of the target speech to calculate the beamformer w_f, which forms a beam that enhances the target speech. Specifically, the beamformer estimation unit 155 calculates the beamformer w_f by minimizing the objective function shown in the following equation (12) under the condition shown in the following equation (11) (see Non-Patent Document 2).

The beamformer w_f calculated here can form a beam that suppresses noise by attenuating the power of noise signals arriving from other directions without attenuating the power of the acoustic signal in the direction of the steering vector, which contains the spatial information of the sound source of the target speech.
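
Equations (11) and (12) are not reproduced in this text. One common formulation that matches the description, given here as an assumption rather than as the patent's own equations, is the minimum-variance distortionless-response (MVDR) problem. The symbols h_f (steering vector) and R_f (a spatial correlation matrix) are introduced only for illustration.

```latex
\min_{\mathbf{w}_f} \; \mathbf{w}_f^{\mathsf{H}} \mathbf{R}_f \mathbf{w}_f
\quad \text{subject to} \quad
\mathbf{w}_f^{\mathsf{H}} \mathbf{h}_f = 1,
\qquad \text{with solution} \qquad
\mathbf{w}_f = \frac{\mathbf{R}_f^{-1} \mathbf{h}_f}
                    {\mathbf{h}_f^{\mathsf{H}} \mathbf{R}_f^{-1} \mathbf{h}_f}
```

The constraint keeps unit gain in the steering-vector direction, while the minimization reduces the output power contributed by all other directions.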

Returning to the description of FIG. 1, the speech enhancement unit 15d uses a predetermined beamformer w_f for forming a beam of the acoustic signal for each frequency f to calculate an acoustic signal of enhanced speech in which the acoustic signal of the target speech in the observation signals is emphasized. Specifically, the speech enhancement unit 15d uses the beamformer w_f calculated from the steering vector of the target speech as the initial value and forms the beam of enhanced speech by taking the inner product of the observation vector and the beamformer w_f, as shown in the following equation (13).
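
The inner product can be sketched as follows for one frequency bin. Equation (13) is not reproduced in this text, so the convention used here (conjugating the beamformer, i.e. w^H y) is an assumption, and the function name `enhance` is hypothetical.

```python
def enhance(w, ys):
    """Return the enhanced signal s[t] = w^H y_t for one frequency bin."""
    return [sum(wm.conjugate() * ym for wm, ym in zip(w, y)) for y in ys]


# Two microphones, one frame: averaging the channels cancels the
# opposite imaginary parts (hypothetical values).
enhanced = enhance([0.5 + 0j, 0.5 + 0j], [[1 + 1j, 1 - 1j]])
```

Applying `enhance` to every frequency bin yields the enhanced spectrum that is passed to the speech recognition unit.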

The speech recognition unit 15e estimates the phoneme probability distribution of the calculated enhanced speech and assigns to the enhanced speech a reference label indicating a phoneme.

In the following description, the observation signals recorded by the M microphones are expressed as shown in the following equation (14).

The observation vector y_{f,t} is expressed as in the following equation (15), using the signal features Y_{m,f,t} at each time-frequency point obtained by applying short-time signal analysis such as the short-time discrete Fourier transform or the short-time discrete cosine transform.

In this case, the speech recognition unit 15e performs the operation expressed by the following equation (16) to obtain the phoneme probability distribution at each time of the enhanced speech obtained by equation (13) (hereinafter also referred to as the phoneme posterior probability or the phoneme state posterior probability).

Here, the enhanced speech is expressed by the vector shown in the following equation (17), composed of the enhanced speech at each frequency.

Specifically, the speech recognition unit 15e receives the enhanced speech of equation (17) from the speech enhancement unit 15d, repeats linear and nonlinear operations multiple times using initial values of parameters learned in advance, and outputs the phoneme posterior probability at each time, expressed by the following equation (18).

The speech recognition unit 15e also assigns a binary reference label indicating the phoneme of the enhanced speech at each time, as expressed by the following equation (19).

The optimization unit 15f optimizes the beamformer w_f so as to minimize the difference between the reference label and the phoneme probability distribution of the enhanced speech. That is, the optimization unit 15f treats the network composed of the speech enhancement unit 15d and the speech recognition unit 15e as a single network that takes observation vectors as input and outputs the phoneme state posterior probabilities of the enhanced speech, and optimizes its output.

Specifically, the optimization unit 15f takes as the objective function the cross entropy, defined as shown in the following equation (20), between the phoneme posterior probability at each time expressed by equation (18) and the phoneme reference label at each time expressed by equation (19), and minimizes this objective function.
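
As a minimal sketch of such an objective, the function below computes the cross entropy between binary (one-hot) reference labels and phoneme posteriors, summed over time. Equation (20) is not reproduced in this text, so the exact form (natural logarithm, no averaging) is an assumption, and the function name is hypothetical.

```python
import math


def cross_entropy(posteriors, labels):
    """Sum over time of -log(posterior of the labeled phoneme).

    posteriors[t][k]: estimated probability of phoneme k at time t.
    labels[t][k]: 1 for the reference phoneme at time t, otherwise 0.
    """
    return -sum(lab * math.log(p)
                for post, label in zip(posteriors, labels)
                for p, lab in zip(post, label)
                if lab)
```

The value is zero only when every labeled phoneme receives probability one, and it grows as the posteriors move away from the reference labels.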

Applying the steepest descent method, the update equation for the beamformer w_f is expressed as the following equation (21).

In this case, as shown in the following equation (22), the gradient of the objective function can be calculated by applying the chain rule of differentiation to rewrite it in the form A × B.

The part A of equation (22) can be calculated using a well-known method based on backpropagation, as applied to parameter estimation in neural networks. The part B of equation (22) can be calculated by the following equation (23), based on equation (13).
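
Equations (21) to (23) are not reproduced in this text. Under illustrative assumptions (the objective is written J, the enhanced speech is \hat{s}_{f,t} = w_f^H y_{f,t}, and the gradient is taken with respect to the complex conjugate of w_f), the update and its chain-rule factorization might take a form like the following. The notation is introduced here and is not the patent's own.

```latex
\mathbf{w}_f \leftarrow \mathbf{w}_f - \alpha \, \frac{\partial J}{\partial \mathbf{w}_f^{*}},
\qquad
\frac{\partial J}{\partial \mathbf{w}_f^{*}}
  = \sum_{t} \underbrace{\frac{\partial J}{\partial \hat{s}_{f,t}}}_{A}
             \, \underbrace{\frac{\partial \hat{s}_{f,t}}{\partial \mathbf{w}_f^{*}}}_{B},
\qquad
\frac{\partial \hat{s}_{f,t}}{\partial \mathbf{w}_f^{*}} = \mathbf{y}_{f,t}
```

In this reading, the factor A is what backpropagation through the recognizer delivers at its input, and the factor B follows directly from the linear form of the beamformer output.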

The processing of the optimization unit 15f will now be described with reference to FIG. 3. As shown in FIG. 3, the optimization unit 15f includes a parameter initialization unit 156, a gradient calculation unit 157, a parameter update unit 158, and a convergence determination unit 159.

The parameter initialization unit 156 sets the initial values of the various parameters used in the processing within the optimization unit 15f. For example, the parameter initialization unit 156 passes the beamformer w_f calculated by the estimation unit 15c to the gradient calculation unit 157 as the initial value. Alternatively, the parameter initialization unit 156 may pass a unit vector to the gradient calculation unit 157 as the initial value of the beamformer w_f. The parameter initialization unit 156 also passes the learning rate α used in equation (21) to the parameter update unit 158.

The gradient calculation unit 157 calculates the gradient shown in equation (22) and passes the received initial value of the beamformer w_f and the calculated gradient to the parameter update unit 158. The parameter update unit 158 receives the learning rate α, the beamformer w_f, and the gradient, calculates the updated value of the beamformer w_f using equation (21), and passes the updated beamformer w_f to the convergence determination unit 159.

The convergence determination unit 159 determines whether a predetermined convergence condition is satisfied. Examples of the convergence condition are that the number of iterations of the update equation (21) has reached a predetermined count, or that the objective function of equation (20) has converged. If the convergence condition is not satisfied, the convergence determination unit 159 passes the beamformer w_f received from the parameter update unit 158 to the gradient calculation unit 157. The gradient calculation unit 157 repeats the above processing with the beamformer w_f received from the convergence determination unit 159 as the initial value. In this way, the parameter update unit 158 keeps calculating updated values of the beamformer w_f until the predetermined convergence condition is satisfied.
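
The interplay of gradient calculation, parameter update, and convergence determination can be sketched as a simple loop. The function name `optimize`, its parameters, and the tolerance-based stopping test are hypothetical. In the apparatus the gradient and objective would come from the speech recognition network, whereas here they are supplied as plain callables.

```python
def optimize(w, grad_fn, loss_fn, alpha=0.1, max_iter=100, tol=1e-9):
    """Steepest descent on a complex vector w.

    Stops when the iteration count reaches max_iter or when the
    objective changes by less than tol between iterations.
    Returns the final vector and the number of iterations performed.
    """
    prev = loss_fn(w)
    iteration = 0
    for iteration in range(1, max_iter + 1):
        g = grad_fn(w)                                  # gradient calculation
        w = [wm - alpha * gm for wm, gm in zip(w, g)]   # parameter update
        cur = loss_fn(w)
        if abs(prev - cur) < tol:                       # convergence check
            break
        prev = cur
    return w, iteration
```

For example, with a toy quadratic objective standing in for the cross entropy, `optimize([0j, 0j], grad, loss, alpha=0.5)` moves the vector toward the minimizer and stops well before the iteration limit.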

Satisfying the predetermined convergence condition means that a beamformer w_f updated optimally for speech recognition has been derived. In this case, the convergence determination unit 159 passes the updated value of the beamformer w_f to the speech enhancement unit 15d, either directly or via the estimation unit 15c.

The speech enhancement unit 15d, having received the beamformer w_f updated optimally for speech recognition, uses it to form a beam of enhanced speech that is optimal for speech recognition, as shown in equation (13), and the output unit 12, implemented by a speaker or the like, outputs the enhanced speech.

[Forming process]
Next, the forming process of the forming apparatus 1 will be described with reference to FIG. 4. FIG. 4 is a flowchart showing the forming process procedure of the forming apparatus 1. The flowchart of FIG. 4 starts, for example, when an operation input instructing the start of the process is received.

First, the acquisition unit 15a acquires multi-channel observation signals recorded by microphones installed at a plurality of points, the signals including an acoustic signal of a target speech to be recognized and an acoustic signal of noise other than the target speech (step S1).

Next, using the observation vectors that the time-frequency analysis unit 15b generates from the observation signals, the estimation unit 15c estimates the steering vector containing the spatial information of the sound source of the target speech (step S2). The estimation unit 15c also uses the estimated steering vector to calculate the beamformer w_f that forms a beam of enhanced speech emphasizing the acoustic signal in the direction of the steering vector.

The speech enhancement unit 15d calculates the acoustic signal of the enhanced speech using the beamformer w_f calculated from the estimated steering vector (step S3).

Next, the speech recognition unit 15e performs speech recognition on the calculated enhanced speech (step S4). That is, the speech recognition unit 15e estimates the phoneme probability distribution of the enhanced speech and assigns to the enhanced speech a reference label indicating a phoneme.

The optimization unit 15f optimizes the beamformer w_f so as to minimize the difference between the reference label and the phoneme probability distribution of the enhanced speech. That is, the optimization unit 15f optimizes the enhanced speech by deriving the beamformer w_f that optimizes speech recognition of the enhanced speech (step S5).

The output unit 12 then outputs the optimized enhanced speech (step S6). This completes the series of forming processes.

As described above, in the forming apparatus 1 of the embodiment, the acquisition unit 15a acquires observation signals at a plurality of points, the signals including an acoustic signal of a target speech to be recognized and an acoustic signal of noise other than the target speech. The speech enhancement unit 15d uses a predetermined beamformer w_f for forming a beam of the acoustic signal for each frequency to calculate an acoustic signal of enhanced speech in which the acoustic signal of the target speech in the observation signals is emphasized. The speech recognition unit 15e estimates the phoneme probability distribution of the calculated enhanced speech and assigns to the enhanced speech a reference label indicating a phoneme. The optimization unit 15f optimizes the beamformer w_f so as to minimize the difference between the reference label and the phoneme probability distribution of the enhanced speech.

As a result, the forming apparatus 1 can form a beam that suppresses noise and enhances the target speech in a way that is optimal for speech recognition. Speech enhancement that is optimal for speech recognition can therefore be performed according to the environment. For example, highly accurate speech recognition becomes possible when operating or searching on a smartphone in noisy conditions, or when automatically transcribing conversations and lectures.

Note that the estimation unit 15c estimates, for each frequency, a steering vector containing spatial information about the sound source of the target speech. It does so by separately estimating, among the combinations of signals for each frequency f in the same short-time segment t of a predetermined length of the observation signals, the probability distribution of those combinations that do not include the acoustic signal of the target speech. The estimation unit 15c then calculates the beamformer w_f using the estimated steering vector. In this way, a beamformer w_f that forms a noise-suppressing beam can be calculated as the initial value of the beamformer w_f used in the processing of the optimization unit 15f.
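One common way to turn a steering vector into a noise-suppressing beamformer is the MVDR form w = R_n^{-1} h / (h^H R_n^{-1} h), where R_n is a noise spatial covariance matrix and h is the steering vector. The sketch below assumes that form purely for illustration; this passage does not state which formula the estimation unit 15c uses, and the names `mvdr_weights`, `noise_cov`, and `steering` are hypothetical.

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    """MVDR weights for one frequency bin: R_n^{-1} h / (h^H R_n^{-1} h)."""
    rinv_h = np.linalg.solve(noise_cov, steering)
    return rinv_h / (np.conj(steering) @ rinv_h)

M = 6  # number of microphones (toy value)
rng = np.random.default_rng(0)

# Random Hermitian positive-definite matrix standing in for the noise covariance.
a = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
noise_cov = a @ a.conj().T + M * np.eye(M)

# Random vector standing in for the estimated steering vector.
steering = rng.standard_normal(M) + 1j * rng.standard_normal(M)

w = mvdr_weights(noise_cov, steering)
response = np.conj(w) @ steering  # w^H h
```

Under this assumed form, the response toward the steering vector is exactly 1 (the distortionless constraint), while output noise power is minimized.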

The optimization unit 15f does not have to update all of the beamformers w_f for every frequency f. Depending on conditions such as the background noise, for example, only the beamformers for some of the frequencies f may be updated. Alternatively, among the components of the beamformer w_f for each frequency f, only some components of the vector may be updated. This reduces the processing load of the forming process.
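Restricting the update to some frequencies or some vector components can be sketched with boolean masks over a weight array of shape (F, M). The function `masked_update`, its parameters, and the plain gradient-step form are assumptions for illustration; the actual update equation is defined elsewhere in the document.

```python
import numpy as np

def masked_update(w, grad, alpha, freq_mask=None, comp_mask=None):
    """One gradient step on w (shape (F, M)), applied only to selected entries."""
    mask = np.ones(w.shape, dtype=bool)
    if freq_mask is not None:
        # Select whole frequency bins (rows).
        mask &= np.asarray(freq_mask, dtype=bool)[:, None]
    if comp_mask is not None:
        # Select vector components (columns), i.e. microphone channels.
        mask &= np.asarray(comp_mask, dtype=bool)[None, :]
    return np.where(mask, w - alpha * grad, w)

F, M = 4, 6
w = np.ones((F, M), dtype=complex)
grad = np.full((F, M), 0.5 + 0.0j)   # placeholder gradient values
freq_mask = [True, False, True, False]
w_new = masked_update(w, grad, alpha=0.1, freq_mask=freq_mask)
```

Rows excluded by `freq_mask` are returned unchanged, so in a real system the gradient for those bins would not need to be computed at all.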

[Example]
An experiment was conducted using the forming apparatus 1 according to the above embodiment. In environments with background noise, such as inside a bus or in a cafe, a single speaker read sentences aloud toward a tablet, and the speech was recorded with M = 6 microphones attached to the tablet. The learning rate α was set to 6 × 10^3. The initial value of the beamformer w_f was the value obtained by maximizing the likelihood function shown in equation (4) above. The update equation for the beamformer w_f shown in equation (21) above was iterated 30 times.
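The experimental settings (learning rate α = 6 × 10^3, 30 iterations) can be placed in a generic iterative update loop as follows. This is a sketch under stated assumptions: the loop uses a plain step w ← w − α·g, and `toy_grad` is a stand-in quadratic objective chosen so that the example runs. It is not the recognizer-based objective or the update equation (21) of the document.

```python
import numpy as np

ALPHA = 6e3   # learning rate reported for the experiment
N_ITER = 30   # number of update iterations reported for the experiment

def optimize(w0, grad_fn, alpha=ALPHA, n_iter=N_ITER):
    """Repeat w <- w - alpha * grad_fn(w) for n_iter iterations."""
    w = np.array(w0, dtype=complex)
    for _ in range(n_iter):
        w = w - alpha * grad_fn(w)
    return w

# Stand-in objective E(w) = (c / 2) * ||w - target||^2 (hypothetical).
c = 1e-4
target = np.array([1.0 + 1.0j, -0.5j, 2.0])

def toy_grad(w):
    """Descent direction of the stand-in objective."""
    return c * (w - target)

w_final = optimize(np.zeros(3), toy_grad)
```

In this toy setting each step shrinks the distance to `target` by a factor of 0.4, so 30 iterations are more than enough to converge. Whether a learning rate of this magnitude is appropriate depends on the scale of the real gradient.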

In this experiment, the word error rate was 16.80% when speech recognition was performed without the forming apparatus 1. In contrast, the word error rate was 9.06% when speech recognition was performed on the enhanced speech obtained with the initial value of the beamformer w_f, before processing by the optimization unit 15f. When speech recognition was performed on the enhanced speech obtained with the beamformer updated by the optimization unit 15f, the word error rate was 8.89%. The effect of the forming process by the forming apparatus 1 of the present embodiment was thus confirmed.
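The reported error rates can be compared as relative reductions. The short calculation below uses only the three figures given above; the helper name `relative_reduction` is introduced here for illustration.

```python
def relative_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before

# Word error rates (%) reported in the experiment.
wer_none, wer_init, wer_opt = 16.80, 9.06, 8.89

red_init = relative_reduction(wer_none, wer_init)         # initial beamformer vs. none
red_opt_vs_init = relative_reduction(wer_init, wer_opt)   # optimized vs. initial
red_opt = relative_reduction(wer_none, wer_opt)           # optimized vs. none
```

The initial beamformer accounts for a relative reduction of roughly 46.1%, and the optimization adds a further relative reduction of roughly 1.9% over the initial beamformer, for about 47.1% overall.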

[Program]
A program can also be created in which the processing executed by the forming apparatus 1 according to the above embodiment is written in a computer-executable language. In one embodiment, the forming apparatus 1 can be implemented by installing, on a desired computer, a forming program that executes the above forming process as packaged software or online software. For example, by causing an information processing apparatus to execute the forming program, the information processing apparatus can be made to function as the forming apparatus 1. The information processing apparatus referred to here includes desktop and notebook personal computers. It also includes mobile communication terminals such as smartphones, mobile phones, and PHS (Personal Handyphone System) devices, as well as slate terminals such as PDAs (Personal Digital Assistants). The forming apparatus 1 can also be implemented as a server apparatus that treats a terminal device used by a user as a client and provides the client with a service related to the above forming process. For example, the forming apparatus 1 is implemented as a server apparatus that provides a forming process service that takes observation signals as input and outputs enhanced speech. In this case, the forming apparatus 1 may be implemented as a Web server, or as a cloud that provides the service related to the above forming process by outsourcing. An example of a computer that executes a forming program realizing the same functions as the forming apparatus 1 is described below.

As shown in FIG. 5, the computer 1000 that executes the forming program includes, for example, a memory 1010, a CPU 1020, a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.

The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1031. The disk drive interface 1040 is connected to a disk drive 1041. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1041. A mouse 1051 and a keyboard 1052, for example, are connected to the serial port interface 1050. A display 1061, for example, is connected to the video adapter 1060.

As shown in FIG. 5, the hard disk drive 1031 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. Each table described in the above embodiment is stored in, for example, the hard disk drive 1031 or the memory 1010.

The forming program is stored in the hard disk drive 1031 as, for example, the program module 1093 in which instructions to be executed by the computer 1000 are written. Specifically, the program module 1093, in which each process executed by the forming apparatus 1 described in the above embodiment is written, is stored in the hard disk drive 1031.

Data used for information processing by the forming program is stored as the program data 1094 in, for example, the hard disk drive 1031. The CPU 1020 reads the program module 1093 and the program data 1094 stored in the hard disk drive 1031 into the RAM 1012 as necessary and executes each of the procedures described above.

Note that the program module 1093 and the program data 1094 of the forming program are not limited to being stored in the hard disk drive 1031. For example, they may be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1041 or the like. Alternatively, they may be stored in another computer connected via a network such as a LAN (Local Area Network) or a WAN (Wide Area Network) and read by the CPU 1020 via the network interface 1070.

An embodiment to which the invention made by the present inventors is applied has been described above, but the present invention is not limited by the description and drawings that form part of the disclosure of the present invention according to this embodiment. That is, all other embodiments, examples, operational techniques, and the like made by those skilled in the art on the basis of this embodiment fall within the scope of the present invention.

[Description of Symbols]
1 Forming apparatus
11 Input unit
12 Output unit
13 Communication control unit
14 Storage unit
15 Control unit
15a Acquisition unit
15b Time-frequency analysis unit
15c Estimation unit
15d Speech enhancement unit
15e Speech recognition unit
15f Optimization unit

Claims

A forming apparatus comprising:
an acquisition unit that acquires observation signals at a plurality of points, the observation signals including an acoustic signal of target speech to be recognized and an acoustic signal of noise other than the target speech;
a speech enhancement unit that calculates, using a predetermined beamformer for forming a beam of acoustic signals for each frequency, an acoustic signal of enhanced speech in which the acoustic signal of the target speech in the observation signals is enhanced;
a speech recognition unit that estimates a phoneme probability distribution of the calculated enhanced speech and assigns a reference label indicating a phoneme to the enhanced speech; and
an optimization unit that optimizes the beamformer so as to minimize a difference between the reference label and the phoneme probability distribution of the enhanced speech.

The forming apparatus according to claim 1, further comprising an estimation unit that estimates, for each frequency, a steering vector including spatial information of a sound source of the target speech by separately estimating, among combinations of signals for each frequency in the same short-time segment of a predetermined length of the observation signals, a probability distribution of the combinations of signals that do not include the acoustic signal of the target speech, and that calculates a beamformer using the steering vector,
wherein the speech enhancement unit calculates the acoustic signal of the enhanced speech, in which the acoustic signal of the target speech in the observation signals is enhanced, using the calculated beamformer as an initial value.

The forming apparatus according to claim 1, wherein the optimization unit optimizes the beamformer for only some of the frequencies or for only some of the components of the vector.

A forming method executed by a forming apparatus, the forming method comprising:
an acquisition step of acquiring observation signals at a plurality of points, the observation signals including an acoustic signal of target speech to be recognized and an acoustic signal of noise other than the target speech;
a speech enhancement step of calculating, using a predetermined beamformer for forming a beam of acoustic signals for each frequency, an acoustic signal of enhanced speech in which the acoustic signal of the target speech in the observation signals is enhanced;
a speech recognition step of estimating a phoneme probability distribution of the calculated enhanced speech and assigning a reference label indicating a phoneme to the enhanced speech; and
an optimization step of optimizing the beamformer so as to minimize a difference between the reference label and the phoneme probability distribution of the enhanced speech.

A forming program for causing a computer to execute:
an acquisition step of acquiring observation signals at a plurality of points, the observation signals including an acoustic signal of target speech to be recognized and an acoustic signal of noise other than the target speech;
a speech enhancement step of calculating, using a predetermined beamformer for forming a beam of acoustic signals for each frequency, an acoustic signal of enhanced speech in which the acoustic signal of the target speech in the observation signals is enhanced;
a speech recognition step of estimating a phoneme probability distribution of the calculated enhanced speech and assigning a reference label indicating a phoneme to the enhanced speech; and
an optimization step of optimizing the beamformer so as to minimize a difference between the reference label and the phoneme probability distribution of the enhanced speech.