JP2020012928A

JP2020012928A - Noise resistant voice recognition device, noise resistant voice recognition method, and computer program

Info

Publication number: JP2020012928A
Application number: JP2018133977A
Authority: JP
Inventors: Masakiyo Fujimoto (藤本 雅清); Hisashi Kawai (河井 恒)
Original assignee: National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology
Priority date: 2018-07-17
Filing date: 2018-07-17
Publication date: 2020-01-23
Anticipated expiration: 2038-07-17
Also published as: JP7231181B2; WO2020017226A1

Abstract

Problem: To provide a noise-tolerant speech recognition device, a noise-tolerant speech recognition method, and a computer program that achieve high speech recognition accuracy even when only a single-channel speech signal is available. Solution: A speech recognition device 180 includes a speech enhancement circuit 202 that receives an acoustic signal 112, in which a noise signal is superimposed on the target speech signal, and outputs an enhanced speech signal 203 in which the speech signal has been emphasized by a predetermined speech enhancement technique; an expanded feature extraction unit 200 that receives both the enhanced speech signal 203 and the acoustic signal 112 and extracts features from each; and a speech recognition unit 204 that converts the utterance content of the speech signal into text using these features and an acoustic model 206 built for them. Selected drawing: FIG. 3

Description

The present invention relates to speech recognition, and more particularly to a noise-tolerant speech recognition device and method, and a computer program, that enable highly accurate speech recognition even for speech picked up by a single microphone.

In recent years, with the growth of computing power and advances in computer science, the range of speech recognition applications has expanded greatly. Beyond the fields where speech recognition was already in use, it has been built into home appliances, and the number of users of products such as smart speakers, which use speech recognition to provide functions that were not previously available, is growing rapidly. As a result, the settings in which speech recognition is used have also become more diverse.

What matters most for speech recognition, however, is its accuracy. As the settings in which speech recognition is used become more diverse, noise becomes more prevalent and more varied, and it becomes difficult to keep recognition accuracy consistently high. Speech recognition that stays accurate in the presence of noise (noise-tolerant speech recognition) is therefore becoming increasingly important.

Conventionally, two broad types of technique have been used for noise-tolerant speech recognition:

- Speech enhancement (noise removal)
- Noise-added training

Speech enhancement is a technique that improves speech recognition accuracy by removing noise from the speech signal to be recognized. Typically, speech enhancement is applied to the speech signal from a microphone, and speech recognition is then performed on the result.

Conventional speech enhancement techniques include the spectral subtraction method described in Non-Patent Document 1, the MMSE-STSA (minimum mean square error short-time spectral amplitude) estimator described in Non-Patent Document 2, the method based on vector Taylor series expansion described in Non-Patent Document 3, and the denoising autoencoder described in Non-Patent Document 4 (all listed below).

Each of these methods performs speech enhancement on an acoustic signal obtained from a single microphone, as preprocessing for speech recognition.
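As a rough illustration of the first of these techniques, the sketch below applies magnitude-domain spectral subtraction to a list of short-time magnitude spectra. It is a simplified sketch, not the full method of Non-Patent Document 1: the function name, the assumption that the first `noise_frames` frames contain only noise, and the spectral floor `floor` are illustrative choices that do not come from this document.

```python
def spectral_subtraction(noisy_mag, noise_frames=2, floor=0.01):
    """Subtract an estimated noise magnitude from each frame.

    noisy_mag: list of frames, each a list of magnitude values per frequency bin.
    noise_frames: number of leading frames assumed (hypothetically) to be speech-free.
    floor: fraction of the noisy magnitude kept as a lower bound, to avoid negatives.
    """
    n_bins = len(noisy_mag[0])
    # Noise estimate per bin: mean magnitude over the leading frames.
    noise = [
        sum(frame[k] for frame in noisy_mag[:noise_frames]) / noise_frames
        for k in range(n_bins)
    ]
    enhanced = []
    for frame in noisy_mag:
        enhanced.append(
            [max(frame[k] - noise[k], floor * frame[k]) for k in range(n_bins)]
        )
    return enhanced


# Two noise-only frames followed by one frame with energy in the first bin.
frames = [[1.0, 1.0], [1.0, 1.0], [5.0, 1.0]]
enhanced = spectral_subtraction(frames)
print(enhanced[2])
```

The flooring step is one place where the distortion discussed later in this document can arise: bins in which the noise estimate exceeds the observed magnitude are clamped rather than recovered.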

FIG. 1 shows a schematic configuration of a conventional speech recognition device 100. Referring to FIG. 1, the speech recognition device 100 includes: a speech enhancement unit 114 that receives a speech signal 112, which is noise-superimposed speech represented by a waveform 110 and output by a microphone (not shown), performs speech enhancement by any of the methods described above, and outputs an enhanced speech signal 116; a feature extraction unit 118 that extracts predetermined features from the enhanced speech signal 116; and a speech recognition unit 120 that performs speech recognition on these features and outputs a text 122 corresponding to the speech represented by the waveform 110. The speech recognition unit 120 can be, for example, the one disclosed in Patent Document 1.

The speech recognition device 100 further includes an acoustic model 124, a pronunciation dictionary 126, and a language model 128, which the speech recognition unit 120 uses when performing speech recognition. The acoustic model 124 estimates the corresponding phoneme from the features input from the feature extraction unit 118. The pronunciation dictionary 126 is used to obtain the words corresponding to the phoneme sequence estimated by the acoustic model 124. The language model 128 is used to compute the probability of each candidate utterance in the recognition result, where each candidate is a word sequence estimated using the pronunciation dictionary 126.

FIG. 2 shows a schematic configuration of the acoustic model 124. As FIG. 2 shows, the acoustic model 124 is a so-called deep neural network. It includes an input layer 150 that receives the features, an output layer 162 that outputs information identifying the phoneme estimated from those features, and a plurality of hidden layers 152, 154, 156, 158, and 160 arranged in order between the input layer 150 and the output layer 162. The configuration and training method of the acoustic model 124 are well known, so their details are not repeated here. The acoustic model 124 is trained on clean speech that contains no noise. The information identifying the estimated phoneme can take the form of, for example, a probability vector over the elements of the phoneme set. In the rest of this specification, for brevity, outputting information that identifies a phoneme is referred to simply as "outputting a phoneme".
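The structure of FIG. 2 can be sketched as a plain feed-forward pass: five hidden layers followed by an output layer that yields a probability vector over phonemes. This is a minimal sketch with untrained random weights. The layer widths, the ReLU activation, the softmax output, and all function names are assumptions made for illustration; the document only states the number of hidden layers and that the output can be a probability vector.

```python
import math
import random


def dense(x, weights, bias):
    """Affine layer: one output value per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(v):
    return [max(0.0, a) for a in v]


def softmax(v):
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def build_mlp(rng, sizes):
    """Random (untrained) weights for consecutive layer sizes."""
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def phoneme_posteriors(layers, features):
    x = features
    for weights, bias in layers[:-1]:   # hidden layers 152 to 160
        x = relu(dense(x, weights, bias))
    weights, bias = layers[-1]          # output layer 162
    return softmax(dense(x, weights, bias))


# Hypothetical sizes, chosen only to keep the example small.
FEATURE_DIM, HIDDEN_DIM, NUM_PHONEMES = 8, 16, 5
rng = random.Random(0)
model = build_mlp(rng, [FEATURE_DIM] + [HIDDEN_DIM] * 5 + [NUM_PHONEMES])
posteriors = phoneme_posteriors(model, [0.5] * FEATURE_DIM)
print(len(posteriors), round(sum(posteriors), 6))
```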

Noise-added training, by contrast, aims to improve recognition accuracy on noisy speech by training a deep neural network acoustic model on speech signals that contain noise. In this case no preprocessing is applied to the speech signal, but the input to speech recognition is still a single speech signal.

In recent years, multi-channel speech enhancement using signals from a multi-channel microphone (a microphone array), rather than speech enhancement on a signal from a single-channel microphone, has come to be widely used as preprocessing for speech recognition. Smart speakers are a good example. They are developed and sold by many companies and are spreading rapidly, particularly in the United States.

Because a microphone array allows noise to be removed using spatial information about the sound source as well, speech enhancement can be performed with high accuracy and low distortion.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2017-219769

Non-Patent Document 1: S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction", IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, no. 2, pp. 113-120, Apr. 1979.
Non-Patent Document 2: Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean square error short-time spectral amplitude estimator", IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109-1121, Dec. 1984.
Non-Patent Document 3: P. J. Moreno, B. Raj, and R. M. Stern, "A vector Taylor series approach for environment-independent speech recognition", in Proceedings of ICASSP '96, vol. II, pp. 733-736, May 1996.
Non-Patent Document 4: X. Lu, Y. Tsao, S. Matsuda, and C. Hori, "Speech enhancement based on deep denoising autoencoder", in Proceedings of Interspeech '13, pp. 436-440, Aug. 2013.
Non-Patent Document 5: J. Barker, R. Marxer, E. Vincent, and S. Watanabe, "The third 'CHiME' speech separation and recognition challenge: Analysis and outcomes", Computer Speech and Language, vol. 46, pp. 605-626, Nov. 2017.

However, using multi-channel speech signals requires special devices: a microphone array and a multi-channel microphone amplifier. It also increases the amount of processing and data transfer for the speech signal. As a result, the approach cannot be applied to devices such as smartphones, which have only one microphone and limited processing capacity.

Smartphones therefore use one of the single-channel speech enhancement processes described above, but in that case speech distortion increases substantially and speech recognition accuracy degrades markedly.

An object of the present invention is therefore to provide an acoustic model and a speech recognition device that can achieve high speech recognition accuracy even when only a single-channel speech signal is available, and a computer program for them.

A noise-tolerant speech recognition device according to a first aspect of the present invention includes: a speech enhancement circuit that receives an acoustic signal in which a noise signal is superimposed on a target speech signal, and outputs an enhanced speech signal in which the speech signal is emphasized; and a speech recognition unit that receives the enhanced speech signal and the acoustic signal and converts the utterance content of the speech signal into text.

Preferably, the speech enhancement circuit includes a first speech enhancement unit that performs a first type of speech enhancement on the acoustic signal and outputs a first enhanced speech signal, and a second speech enhancement unit that performs a second type of speech enhancement, different from the first type, on the acoustic signal and outputs a second enhanced speech signal. The speech recognition unit receives the first and second enhanced speech signals and the acoustic signal, and converts the utterance content of the speech signal into text.

More preferably, the speech recognition unit includes: first feature extraction means for extracting a first feature from the acoustic signal; second feature extraction means for extracting a second feature from the enhanced speech signal; feature selection means for selecting or discarding each of the second features according to the first feature and the second feature; and speech recognition means for converting the utterance content of the speech signal into text using the second features selected by the feature selection means.

Still more preferably, the noise-tolerant speech recognition device further includes acoustic model storage means for storing the acoustic model that the speech recognition means uses for speech recognition. The acoustic model is a deep neural network with a plurality of hidden layers, and includes: a first sub-network that receives the first feature as input; a second sub-network that receives the second feature as input; and a third sub-network that receives the outputs of the first and second sub-networks and outputs the phoneme estimated from the first and second features.

A noise-tolerant speech recognition method according to a second aspect of the present invention includes: a step in which a computer receives, as input, a single-channel acoustic signal in which a noise signal is superimposed on a target speech signal, and outputs an enhanced speech signal in which the speech signal is emphasized; and a speech recognition step in which the computer receives the enhanced speech signal and the acoustic signal and converts the utterance content of the speech signal into text.

A computer program according to a third aspect of the present invention causes a computer to function as any of the noise-tolerant speech recognition devices described above.

The problem solved by the present invention, its configuration, and its advantageous effects will become clearer from the detailed description of the embodiments, read with reference to the accompanying drawings.

FIG. 1 is a block diagram showing a schematic configuration of a speech recognition device that performs speech recognition after preprocessing a single-channel speech signal with a conventional speech enhancement method.
FIG. 2 is a block diagram showing the configuration of the deep neural network acoustic model used in the speech recognition device shown in FIG. 1.
FIG. 3 is a block diagram showing a schematic configuration of the speech recognition device according to the first embodiment of the present invention.
FIG. 4 is a schematic block diagram showing the configuration of the acoustic model used in the speech recognition device shown in FIG. 3.
FIG. 5 is a block diagram showing the configuration of the acoustic model used in the speech recognition device according to the second embodiment of the present invention.
FIG. 6 is a block diagram showing a schematic configuration of the speech recognition device according to the third embodiment of the present invention.
FIG. 7 is a block diagram showing the configuration of the acoustic model used in the speech recognition device shown in FIG. 6.
FIG. 8 is a block diagram showing a schematic configuration of the acoustic model used in the speech recognition device according to the fourth embodiment of the present invention.
FIG. 9 is a block diagram showing a schematic configuration of the acoustic model used in the speech recognition device according to the fifth embodiment of the present invention.
FIG. 10 is a block diagram showing a schematic configuration of the acoustic model used in the speech recognition device according to the sixth embodiment of the present invention.
FIG. 11 is a block diagram showing a schematic configuration of the acoustic model used in the speech recognition device according to the seventh embodiment of the present invention.
FIG. 12 is a block diagram showing a schematic configuration of the acoustic model used in the speech recognition device according to the eighth embodiment of the present invention.
FIG. 13 is a diagram explaining the function of the gate layer in the acoustic models according to the fifth to eighth embodiments of the present invention.
FIG. 14 is a table comparing the word error rates of the conventional technology and of the speech recognition devices according to the first to eighth embodiments of the present invention.
FIG. 15 is a hardware block diagram of a typical computer that implements the speech recognition device according to the present invention.

In the following description and drawings, identical components are given identical reference numbers. Detailed descriptions of them are therefore not repeated.

[First Embodiment]
FIG. 3 is a block diagram showing a schematic configuration of the speech recognition device 180 according to the first embodiment of the present invention. Referring to FIG. 3, the speech recognition device 180 includes: a speech enhancement unit 202 that applies an existing speech enhancement process to the speech signal 112, which is the noise-superimposed speech output by a microphone for the speech represented by the waveform 110, and outputs an enhanced speech signal 203; an expanded feature extraction unit 200 that receives both the speech signal 112 and the enhanced speech signal 203 and extracts expanded speech features 210 and 212; and a speech recognition unit 204 that receives the features 210 and 212 output by the expanded feature extraction unit 200, performs speech recognition, and outputs the recognized text 208. The speech recognition unit 204 can be the same as the speech recognition unit 120 shown in FIG. 1. However, the features it uses differ from the conventional ones, as described later.

The speech recognition device 180 further includes an acoustic model 206, which the speech recognition unit 204 uses for speech recognition and whose configuration differs from the conventional one shown in FIG. 2, and a pronunciation dictionary 126 and a language model 128 that are the same as those shown in FIG. 1. The acoustic model 206, the pronunciation dictionary 126, and the language model 128 are all stored in a storage device such as a hard disk, described later.

The expanded feature extraction unit 200 includes a feature extraction unit 118, the same as the one shown in FIG. 1, that receives the noise-superimposed speech signal 112 and outputs a feature 210, and a feature extraction unit 220, with the same function as the feature extraction unit 118, that extracts a feature 212 from the enhanced speech signal 203 output by the speech enhancement unit 202. In this embodiment, the feature extraction units 118 and 220 have the same configuration, and the features 210 and 212 are features of the same kind. In general, however, the values of the features 210 and 212 differ because the two units receive different inputs.

Referring to FIG. 4, the acoustic model 206 shown in FIG. 3 includes an input layer 240 that receives both the feature 210 obtained from the noise-superimposed speech and the feature 212 obtained from the enhanced speech signal 203, an output layer 256 that outputs the estimated phoneme, and a plurality of hidden layers 242 to 254 arranged in order between the input layer 240 and the output layer 256. In this embodiment there are seven hidden layers.

The input layer 240 shown in FIG. 4 receives as many inputs as the sum of the numbers of elements of the features 210 and 212, both of which are vectors. In this embodiment, the feature extraction units 118 and 220 that output the features 210 and 212 have the same configuration as the conventional feature extraction unit 118 shown in FIG. 1. The number of feature values the acoustic model 206 receives is therefore twice that of the conventional model shown in FIG. 1. Half of them are obtained from the noise-superimposed speech, and the other half from the enhanced speech.
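The doubling of the input size can be made concrete with a short sketch that stacks, frame by frame, the feature vector from the noise-superimposed speech and the feature vector from the enhanced speech. The function name and the ordering (noisy values first, then enhanced values) are assumptions for illustration; the document specifies only that the input layer receives both vectors.

```python
def stack_frames(noisy_frames, enhanced_frames):
    """Concatenate per-frame feature vectors from the two streams.

    noisy_frames: features 210, one vector per frame (from the noisy signal).
    enhanced_frames: features 212, one vector per frame (from the enhanced signal).
    Returns one vector per frame with twice the original dimension.
    """
    if len(noisy_frames) != len(enhanced_frames):
        raise ValueError("the two streams must have the same number of frames")
    stacked = []
    for noisy, enhanced in zip(noisy_frames, enhanced_frames):
        if len(noisy) != len(enhanced):
            raise ValueError("feature vectors must have the same dimension")
        stacked.append(list(noisy) + list(enhanced))
    return stacked


noisy = [[1, 2], [3, 4]]
enhanced = [[5, 6], [7, 8]]
stacked = stack_frames(noisy, enhanced)
print(stacked)
```

Because both streams come from the same single-channel recording, their frames align one to one, which is what makes simple per-frame concatenation possible.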

The operation of the speech recognition unit 204 is the same as in the speech recognition device 100 shown in FIG. 1, except that it uses the acoustic model 206 in place of the acoustic model 124 shown in FIG. 1, and that the acoustic features it processes include features of the noise-superimposed speech in addition to those of the enhanced speech. A detailed description is therefore not repeated here.

By adopting the acoustic model 206 configured in this way, the speech recognition device 180 according to this embodiment achieved more accurate speech recognition than the conventional device shown in FIG. 1, as described later with reference to FIG. 14.

The acoustic model 206 can be trained by error backpropagation, as with an ordinary deep neural network, by preparing in advance training data consisting of noise-superimposed speech and the text that the speech represents. The same applies to training in each of the embodiments described below.

[Second Embodiment]
FIG. 5 shows the configuration of an acoustic model 280 according to the second embodiment of the present invention. The speech recognition device according to the second embodiment is the same as the speech recognition device 180 according to the first embodiment, except that it uses the acoustic model 280 shown in FIG. 5 in place of the acoustic model 206 shown in FIG. 3.

The acoustic model 280 includes: a sub-network 300 for noise-superimposed speech, which receives the feature 210 of the noise-superimposed speech; a sub-network 302 for enhanced speech, which receives the feature 212 of the enhanced speech; an output-side sub-network 304, which receives the outputs of the sub-networks 300 and 302; and an output layer 306, which receives the output of the output-side sub-network 304 and outputs a phoneme.

The sub-network 300 for noise-superimposed speech includes an input layer 320 connected to receive the feature 210 of the noise-superimposed speech, and a plurality of (three in this embodiment) hidden layers 322, 324, and 326 connected in order between the input layer 320 and the input of the output-side sub-network 304.

The sub-network 302 for enhanced speech includes an input layer 330 connected to receive the feature 212 of the enhanced speech, and a plurality of (three in this embodiment) hidden layers 332, 334, and 336 connected in order between the input layer 330 and the input of the output-side sub-network 304.

The output-side sub-network 304 includes a hidden layer 350 connected to receive the outputs of the sub-network 300 for noise-superimposed speech and the sub-network 302 for enhanced speech, and hidden layers 352, 354, and 356 connected in order between the hidden layer 350 and the output layer 306.

The acoustic model 280 shown in FIG. 5 differs from the acoustic model 206 of the first embodiment as follows. In the acoustic model 206, the input layer 240 receives both the feature 210 of the noise-superimposed speech and the feature 212 of the enhanced speech, and information from both propagates through all of the subsequent hidden layers 242 to 254. In the acoustic model 280, by contrast, only information from the feature 210 of the noise-superimposed speech propagates through the input layer 320 and hidden layers 322 to 326 of the sub-network 300, and only information from the feature 212 of the enhanced speech propagates through the input layer 330 and hidden layers 332 to 336 of the sub-network 302. The two are first combined in the hidden layer 350, and the combined information then propagates through the hidden layers 352 to 356 and the output layer 306.
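The late-fusion structure described here can be sketched as two independent stacks of three hidden layers whose outputs are joined and passed through four further hidden layers and an output layer. This is a minimal sketch with untrained random weights. Joining the two branch outputs by concatenation, the ReLU activation, the softmax output, the layer widths, and all names are assumptions; the document states only which layers receive which information.

```python
import math
import random


def dense(x, weights, bias):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(v):
    return [max(0.0, a) for a in v]


def softmax(v):
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def build_stack(rng, sizes):
    """Random (untrained) weights for consecutive layer sizes."""
    return [
        ([[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)],
         [0.0] * n_out)
        for n_in, n_out in zip(sizes, sizes[1:])
    ]


def run_hidden(layers, x):
    for weights, bias in layers:
        x = relu(dense(x, weights, bias))
    return x


def two_branch_posteriors(model, noisy_feat, enhanced_feat):
    h_noisy = run_hidden(model["noisy"], noisy_feat)           # sub-network 300
    h_enhanced = run_hidden(model["enhanced"], enhanced_feat)  # sub-network 302
    # Assumed fusion: concatenate the branch outputs before hidden layer 350.
    merged = run_hidden(model["merged"], h_noisy + h_enhanced)  # sub-network 304
    weights, bias = model["output"]                             # output layer 306
    return softmax(dense(merged, weights, bias))


D, H, P = 8, 16, 5  # hypothetical feature, hidden, and phoneme-set sizes
rng = random.Random(0)
model = {
    "noisy": build_stack(rng, [D, H, H, H]),         # hidden layers 322, 324, 326
    "enhanced": build_stack(rng, [D, H, H, H]),      # hidden layers 332, 334, 336
    "merged": build_stack(rng, [2 * H, H, H, H, H]), # hidden layers 350 to 356
    "output": build_stack(rng, [H, P])[0],
}
posteriors = two_branch_posteriors(model, [0.5] * D, [0.2] * D)
print(len(posteriors), round(sum(posteriors), 6))
```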

The configuration of the speech recognition device that uses the acoustic model 280 is otherwise the same as that of the speech recognition device 180 of the first embodiment.

The speech recognition device using the acoustic model 280 according to the second embodiment also achieved higher accuracy than the conventional technology, as shown in FIG. 14.

[Third Embodiment]
FIG. 6 shows a block diagram of a speech recognition device 380 according to the third embodiment of the present invention. The speech recognition device 380 includes: speech enhancement units 202, 392, 394, and 396 that apply existing first to fourth speech enhancement processes, respectively, to the speech signal 112 output by a microphone for the speech represented by the waveform 110, and output enhanced speech signals 203, 393, 395, and 397, respectively; an expanded feature extraction unit 390 that receives the speech signal 112 and the enhanced speech signals 203, 393, 395, and 397 and extracts expanded speech features 210, 212, 430, 432, and 434; and a speech recognition unit 402 that receives the features 210, 212, 430, 432, and 434 output by the expanded feature extraction unit 390, performs speech recognition, and outputs the recognized text 400.

音声認識装置３８０はさらに、音声認識部４０２が音声認識の際に用いる音響モデル３９８と、図１に示すものとそれぞれ同じ発音辞書１２６及び言語モデル１２８とを含む。 The speech recognition device 380 further includes an acoustic model 398 used by the speech recognition unit 402 for speech recognition, and the same pronunciation dictionary 126 and language model 128 as those shown in FIG.

拡大特徴抽出部３９０は、雑音が重畳された音声信号１１２を受けて特徴量２１０を抽出するための特徴抽出部１１８と、音声強調部２０２から強調音声信号２０３を受けて第１の強調音声の特徴量２１２を抽出するための特徴抽出部２２０と、音声強調部３９２から強調音声信号３９３を受けて第２の強調音声の特徴量４３０を出力する特徴抽出部４１０と、音声強調部３９４から強調音声信号３９５を受けて第３の強調音声の特徴量４３２を出力する特徴抽出部４１２と、音声強調部３９６から強調音声信号３９７を受けて第４の強調音声の特徴量４３４を出力する特徴抽出部４１４とを含む。 The expanded feature extraction unit 390 includes: a feature extraction unit 118 that receives the noise-superimposed speech signal 112 and extracts the feature amount 210; a feature extraction unit 220 that receives the enhanced speech signal 203 from the speech enhancement unit 202 and extracts the feature amount 212 of the first enhanced speech; a feature extraction unit 410 that receives the enhanced speech signal 393 from the speech enhancement unit 392 and outputs the feature amount 430 of the second enhanced speech; a feature extraction unit 412 that receives the enhanced speech signal 395 from the speech enhancement unit 394 and outputs the feature amount 432 of the third enhanced speech; and a feature extraction unit 414 that receives the enhanced speech signal 397 from the speech enhancement unit 396 and outputs the feature amount 434 of the fourth enhanced speech.
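
The data flow through the expanded feature extraction unit can be sketched as follows. This is a rough illustration only: the log power spectrum used as the feature, the frame length and hop, and the function names are assumptions, since this section does not state which features are extracted. The four "enhancers" are trivial placeholders and are not the methods of Non-Patent Documents 1 to 4.

```python
import numpy as np

def log_power_spectrum(signal, frame_len=400, hop=160):
    """Frame the signal and return log power spectra with shape (frames, bins)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.log(np.abs(np.fft.rfft(frames, axis=1)) ** 2 + 1e-10)

def expanded_features(noisy, enhancers, extract=log_power_spectrum):
    """Return one feature matrix for the noisy signal plus one per enhancer."""
    streams = [noisy] + [enhance(noisy) for enhance in enhancers]
    return [extract(s) for s in streams]

# Placeholder enhancers: stand-ins only, NOT the methods of Non-Patent Documents 1-4.
placeholder_enhancers = [
    lambda s: s * 0.9,
    lambda s: np.clip(s, -0.5, 0.5),
    lambda s: s - s.mean(),
    lambda s: np.convolve(s, np.ones(3) / 3, mode="same"),
]

noisy_signal = np.random.default_rng(0).standard_normal(16000)  # 1 s at an assumed 16 kHz
features = expanded_features(noisy_signal, placeholder_enhancers)
```

The result is a list of five time-aligned feature matrices, one for the noise-superimposed signal and one for each enhanced signal, which is the input the acoustic model expects in this embodiment.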

音声強調部２０２は非特許文献１に開示された手法により音声強調を行う。音声強調部３９２は非特許文献２に開示された手法により音声強調を行う。音声強調部３９４は非特許文献３に開示された手法により音声強調を行う。音声強調部３９６は非特許文献４に開示された手法により音声強調を行う。 The voice enhancement unit 202 performs voice enhancement by the method disclosed in Non-Patent Document 1. The voice enhancement unit 392 performs voice enhancement by the method disclosed in Non-Patent Document 2. The voice enhancement unit 394 performs voice enhancement by the method disclosed in Non-Patent Document 3. The voice enhancement unit 396 performs voice enhancement by the method disclosed in Non-Patent Document 4.

図７に音響モデル３９８を形成する深層ニューラル・ネットワークの構成をブロック図形式で示す。図７を参照して、この音響モデル３９８は、図４に示す第１の実施の形態に係る音響モデル２０６を、４つの強調音声から抽出された特徴量を用いるよう拡張したものである。 FIG. 7 is a block diagram showing the configuration of the deep neural network forming the acoustic model 398. Referring to FIG. 7, the acoustic model 398 is obtained by expanding the acoustic model 206 according to the first embodiment shown in FIG. 4 so as to use feature amounts extracted from four emphasized voices.

音響モデル３９８は、雑音重畳音声の特徴量２１０、第１の強調音声の特徴量２１２、第２の強調音声の特徴量４３０、第３の強調音声の特徴量４３２及び第４の強調音声の特徴量４３４を受ける入力層４５０と、音響モデル３９８が推定した音素を出力する出力層４５４と、入力層４５０と出力層４５４との間に接続された複数の隠れ層からなる中間層４５２とを含む。 The acoustic model 398 includes: an input layer 450 that receives the feature amount 210 of the noise-superimposed speech, the feature amount 212 of the first enhanced speech, the feature amount 430 of the second enhanced speech, the feature amount 432 of the third enhanced speech, and the feature amount 434 of the fourth enhanced speech; an output layer 454 that outputs the phonemes estimated by the acoustic model 398; and an intermediate layer 452 consisting of a plurality of hidden layers connected between the input layer 450 and the output layer 454.

中間層４５２は、入力層４５０の出力に接続された入力を持つ隠れ層４７０と、それぞれの入力が前の層の出力に接続された隠れ層４７２、４７４、４７６、４７８、４８０及び４８２とを含む。隠れ層４８２の出力は出力層４５４の入力に接続されている。 The intermediate layer 452 includes a hidden layer 470 whose input is connected to the output of the input layer 450, and hidden layers 472, 474, 476, 478, 480, and 482, each of whose inputs is connected to the output of the preceding layer. The output of the hidden layer 482 is connected to the input of the output layer 454.

この第３の実施の形態に係る音声認識装置３８０は、第１の実施の形態に係る音声認識装置１８０を４つの音声強調を使用するように拡張したものである。その動作も第１の実施の形態のものと基本的には同一である。 The speech recognition device 380 according to the third embodiment extends the speech recognition device 180 according to the first embodiment so that it uses four speech enhancement processes. Its operation is also basically the same as that of the first embodiment.

この第３の実施の形態でも、従来技術と比較して音声認識の精度を高くすることができた。 Also in the third embodiment, the accuracy of speech recognition can be increased as compared with the related art.

［第４の実施の形態］ [Fourth Embodiment]
第３の実施の形態では、雑音重畳音声の特徴量２１０及び第１〜第４の強調音声の特徴量２１２、４３０、４３２及び４３４がいずれも入力層４５０に入力されており、中間層４５２を構成する全ての隠れ層にこの情報が伝搬されている。しかし本発明はそのような実施の形態には限定されない。 In the third embodiment, the feature amount 210 of the noise-superimposed speech and the feature amounts 212, 430, 432, and 434 of the first to fourth enhanced speech are all input to the input layer 450, and this information propagates to all of the hidden layers that form the intermediate layer 452. However, the present invention is not limited to such an embodiment.

この第４の実施の形態に係る音声認識装置は基本的に図６に示す音声認識装置３８０の構成と同様である。異なる点は、音声認識装置３８０が使用していた音響モデル３９８に代えて図８に示すような構成の音響モデル５００を用いている点である。 The speech recognition device according to the fourth embodiment has basically the same configuration as the speech recognition device 380 shown in FIG. The difference is that an acoustic model 500 having the configuration shown in FIG. 8 is used instead of the acoustic model 398 used by the speech recognition device 380.

図８を参照して、この音響モデル５００は、雑音重畳音声である音声信号１１２の特徴量２１０を受ける第１のサブネットワーク５４０と、第１の強調音声の特徴量２１２を受ける第２のサブネットワーク５４２と、第２の強調音声の特徴量４３０を受ける第３のサブネットワーク５４４と、第３の強調音声の特徴量４３２を受ける第４のサブネットワーク５４６と、第４の強調音声の特徴量４３４を受ける第５のサブネットワーク５４８と、第１のサブネットワーク５４０、第２のサブネットワーク５４２、第３のサブネットワーク５４４、第４のサブネットワーク５４６及び第５のサブネットワーク５４８の出力を受けるように接続された中間サブネットワーク５５０と、中間サブネットワーク５５０の出力に接続された入力を持ち、音響モデル５００の出力である音素の推定結果を出力する出力層５５２とを含む。 Referring to FIG. 8, the acoustic model 500 includes: a first sub-network 540 that receives the feature amount 210 of the speech signal 112, which is the noise-superimposed speech; a second sub-network 542 that receives the feature amount 212 of the first enhanced speech; a third sub-network 544 that receives the feature amount 430 of the second enhanced speech; a fourth sub-network 546 that receives the feature amount 432 of the third enhanced speech; a fifth sub-network 548 that receives the feature amount 434 of the fourth enhanced speech; an intermediate sub-network 550 connected to receive the outputs of the first sub-network 540, the second sub-network 542, the third sub-network 544, the fourth sub-network 546, and the fifth sub-network 548; and an output layer 552 whose input is connected to the output of the intermediate sub-network 550 and which outputs the phoneme estimation result, that is, the output of the acoustic model 500.

第１のサブネットワーク５４０は、雑音重畳音声の特徴量２１０を受ける入力を持つ入力層５７０と、入力層５７０と中間サブネットワーク５５０の入力との間に順番に接続された隠れ層５７２、隠れ層５７４及び隠れ層５７６とを含む。 The first sub-network 540 includes an input layer 570 having an input that receives the feature amount 210 of the noise-superimposed speech, and a hidden layer 572, a hidden layer 574, and a hidden layer 576 connected in sequence between the input layer 570 and the input of the intermediate sub-network 550.

第２のサブネットワーク５４２は、第１の強調音声の特徴量２１２を受ける入力を持つ入力層５８０と、入力層５８０と中間サブネットワーク５５０の入力との間に順番に接続された隠れ層５８２、隠れ層５８４及び隠れ層５８６とを含む。 The second sub-network 542 includes an input layer 580 having an input that receives the feature amount 212 of the first enhanced speech, and a hidden layer 582, a hidden layer 584, and a hidden layer 586 connected in sequence between the input layer 580 and the input of the intermediate sub-network 550.

第３のサブネットワーク５４４は、第２の強調音声の特徴量４３０を受ける入力を持つ入力層５９０と、入力層５９０と中間サブネットワーク５５０の入力との間に順に接続された隠れ層５９２、隠れ層５９４及び隠れ層５９６とを含む。 The third sub-network 544 includes an input layer 590 having an input that receives the feature amount 430 of the second enhanced speech, and a hidden layer 592, a hidden layer 594, and a hidden layer 596 connected in sequence between the input layer 590 and the input of the intermediate sub-network 550.

第４のサブネットワーク５４６は、第３の強調音声の特徴量４３２を受ける入力を持つ入力層６００と、入力層６００と中間サブネットワーク５５０の入力との間に順に接続された隠れ層６０２、隠れ層６０４及び隠れ層６０６とを含む。 The fourth sub-network 546 includes an input layer 600 having an input that receives the feature amount 432 of the third enhanced speech, and a hidden layer 602, a hidden layer 604, and a hidden layer 606 connected in sequence between the input layer 600 and the input of the intermediate sub-network 550.

第５のサブネットワーク５４８は、第４の強調音声の特徴量４３４を受ける入力を持つ入力層６１０と、入力層６１０と中間サブネットワーク５５０の入力との間に順に接続された隠れ層６１２、隠れ層６１４及び隠れ層６１６とを含む。 The fifth sub-network 548 includes an input layer 610 having an input that receives the feature amount 434 of the fourth enhanced speech, and a hidden layer 612, a hidden layer 614, and a hidden layer 616 connected in sequence between the input layer 610 and the input of the intermediate sub-network 550.

中間サブネットワーク５５０は、第１〜第５のサブネットワーク５４０、５４２、５４４、５４６及び５４８の出力を受けるように接続された隠れ層６２０と、隠れ層６２０から出力層５５２までの間に順に接続された隠れ層６２２、隠れ層６２４及び隠れ層６２６とを含む。 The intermediate sub-network 550 includes a hidden layer 620 connected to receive the outputs of the first to fifth sub-networks 540, 542, 544, 546, and 548, and a hidden layer 622, a hidden layer 624, and a hidden layer 626 connected in sequence between the hidden layer 620 and the output layer 552.

この実施の形態に係る音声認識装置の構成も図６に示すものと同様で、図６の音響モデル３９８に代えて図８に示す音響モデル５００を用いる点のみが異なる。 The configuration of the speech recognition apparatus according to this embodiment is the same as that shown in FIG. 6, except that an acoustic model 500 shown in FIG. 8 is used instead of acoustic model 398 in FIG.

第３の実施の形態では、全ての隠れ層が、雑音重畳音声の特徴量２１０、第１〜第４の強調音声の特徴量２１２、４３０、４３２及び４３４を伝搬している。しかし本実施の形態では、雑音重畳音声の特徴量２１０は第１のサブネットワーク５４０の内部を伝搬した後隠れ層６２０に入力される。同様に、第１〜第４の強調音声の特徴量２１２、４３０、４３２及び４３４はそれぞれ第２〜第５のサブネットワーク５４２、５４４、５４６及び５４８のみの中を伝搬した後、隠れ層６２０に入力される。隠れ層６２０から始まる中間サブネットワーク５５０の内部では、全ての特徴量が統合されて順に隠れ層を伝搬し最終的に出力層５５２から音素の推定結果が出力される。 In the third embodiment, every hidden layer propagates the feature amount 210 of the noise-superimposed speech and the feature amounts 212, 430, 432, and 434 of the first to fourth enhanced speech. In the present embodiment, however, the feature amount 210 of the noise-superimposed speech propagates through the first sub-network 540 and is then input to the hidden layer 620. Similarly, the feature amounts 212, 430, 432, and 434 of the first to fourth enhanced speech propagate only through the second to fifth sub-networks 542, 544, 546, and 548, respectively, and are then input to the hidden layer 620. Inside the intermediate sub-network 550, which starts at the hidden layer 620, all of the feature amounts are merged and propagate through the hidden layers in sequence, and the phoneme estimation result is finally output from the output layer 552.

この第４の実施の形態に係る音響モデル５００を用いた音声認識装置でも、従来の音声認識装置より高い精度で音声認識を行うことができた。 The speech recognition device using the acoustic model 500 according to the fourth embodiment was able to perform speech recognition with higher accuracy than the conventional speech recognition device.

［第５の実施の形態］ [Fifth Embodiment]
図９に、第５の実施の形態に係る音声認識装置で使用される音響モデル６５０の概略構成を示す。図９から分かるように、この音響モデル６５０も深層ニューラル・ネットワークからなる。 FIG. 9 shows a schematic configuration of an acoustic model 650 used in the speech recognition device according to the fifth embodiment. As can be seen from FIG. 9, this acoustic model 650 also consists of a deep neural network.

図９に示す音響モデル６５０は、図４に示す音響モデル２０６において、雑音重畳音声の特徴量２１０と第１の強調音声の特徴量２１２の双方を受ける入力層２４０の前に、第１の強調音声の特徴量２１２を受け、区間［０，１］の重みを乗じて入力層２４０に入力するゲート層６８２を設けたものである。以後、図４に示すものと同様、隠れ層２４２から出力層２５６まで、これら特徴量からの情報はいずれも共通して伝搬される。 The acoustic model 650 shown in FIG. 9 is obtained from the acoustic model 206 shown in FIG. 4 by placing a gate layer 682 in front of the input layer 240, which receives both the feature amount 210 of the noise-superimposed speech and the feature amount 212 of the first enhanced speech. The gate layer 682 receives the feature amount 212 of the first enhanced speech, multiplies it by weights in the interval [0, 1], and inputs the result to the input layer 240. From there on, as in FIG. 4, information from both feature amounts propagates in common from the hidden layer 242 to the output layer 256.

ゲート層６８２も一種の隠れ層ということができるが、その機能は通常の隠れ層と異なる。すなわち、図１３を参照して、ゲート層６８２を一般的にゲート層１１００として表現すると、ゲート層１１００は入力ベクトルｘ_ｔの各要素に対してゲート重みｇ_ｔ＝σ（Ｗｘ_ｔ＋ｂ）を要素ごとに乗じて出力ベクトルｙ_ｔを出力するゲート機能を持つ。ここでベクトルｘ_ｔをＭ次元とすると、ＷはＭ×Ｍ次元の重み行列、ｂはＭ次元のバイアスベクトル、σ（・）は区間［０，１］の値域である任意の活性化関数、を表す。ゲート重みの各要素は前述したとおり区間［０，１］内の値である。これら重み行列Ｗ及びバイアスベクトルｂの各要素はいずれも学習の対象である。学習時には、上記した区間の制約に従うことを除き、重み行列Ｗ及びバイアスベクトルｂの各要素の学習は通常の深層ニューラル・ネットワークと同じ手法を用いて学習できる。以後の説明でも、ゲート層と呼ばれる層はいずれも図１３のゲート層１１００と同じ機能を持ち、いずれのパラメータも上記した区間［０、１］という制約の下、他のパラメータと同様に学習できる。 The gate layer 682 can also be regarded as a kind of hidden layer, but its function differs from that of an ordinary hidden layer. Referring to FIG. 13, and writing the gate layer 682 in general form as a gate layer 1100, the gate layer 1100 has a gating function: it multiplies each element of the input vector x_t by the corresponding element of the gate weight g_t = σ(W x_t + b) and outputs the result as the output vector y_t, that is, y_t = g_t ⊙ x_t, where ⊙ denotes element-wise multiplication. Here, if the vector x_t is M-dimensional, W is an M × M weight matrix, b is an M-dimensional bias vector, and σ(·) is an arbitrary activation function whose range is the interval [0, 1]. As stated above, each element of the gate weight is a value within the interval [0, 1]. The elements of the weight matrix W and the bias vector b are all learned. During training, apart from observing the interval constraint described above, the elements of W and b can be learned with the same methods as for an ordinary deep neural network. In the description below as well, every layer called a gate layer has the same function as the gate layer 1100 of FIG. 13, and all of its parameters can be learned in the same way as the other parameters under the interval [0, 1] constraint described above.
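
The gating function y_t = σ(W x_t + b) ⊙ x_t can be written out as a minimal NumPy sketch. The logistic sigmoid is used here as one possible activation with values between 0 and 1; the text allows any activation with that range. The class name `GateLayer`, the dimension, and the random initial weights are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function; one possible activation with values between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

class GateLayer:
    """Computes y_t = sigmoid(W x_t + b) * x_t with element-wise multiplication."""

    def __init__(self, dim, rng):
        self.W = rng.standard_normal((dim, dim)) * 0.1  # M x M weight matrix (trainable)
        self.b = np.zeros(dim)                          # M-dimensional bias (trainable)

    def gate(self, x):
        # One gate weight per input element, so elements are gated individually.
        return sigmoid(self.W @ x + self.b)

    def __call__(self, x):
        return self.gate(x) * x

rng = np.random.default_rng(0)
layer = GateLayer(dim=4, rng=rng)
x_t = np.array([1.0, -2.0, 0.5, 3.0])
g_t = layer.gate(x_t)  # gate weights
y_t = layer(x_t)       # gated output
```

With a sigmoid the gate weights stay inside the interval by construction, so in this sketch W and b could be trained by ordinary backpropagation without any extra constraint.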

なおこのゲート層は、入力ベクトルの各要素に対して別々にゲート処理を行うことに注意する必要がある。したがって、強調音声の特徴量ごとに、音声認識時に利用するか否かをゲート処理できる。 It should be noted that this gate layer separately performs gate processing on each element of the input vector. Therefore, it is possible to perform a gating process for each feature amount of the emphasized speech as to whether or not to use the feature during speech recognition.

この結果、各特徴量からなる入力ベクトルの要素ごとに、その要素に対する重みに応じて取捨選択がされる。この取捨選択は重み行列Ｗとバイアスベクトルｂと、各入力ベクトルに含まれる各要素の値とにより行われることになる。すなわち、入力される特徴量の値に応じて各要素が取捨選択され、音声認識に使用される。 As a result, for each element of the input vector composed of each feature amount, selection is made according to the weight for that element. This selection is performed based on the weight matrix W, the bias vector b, and the value of each element included in each input vector. That is, each element is selected according to the value of the input feature amount and used for speech recognition.

この第５の実施の形態に係る音響モデル６５０を用いた音声認識装置でも、従来技術と比較して高い精度を達成できた。 The speech recognition device using the acoustic model 650 according to the fifth embodiment also achieved higher accuracy than the conventional technology.

［第６の実施の形態］ [Sixth Embodiment]
図１０に、本発明の第６の実施の形態に係る音声認識装置で使用される音響モデル７５０の概略構成を示す。この実施の形態に係る音声認識装置自体の構成は図３に示すものと同様である。ただし、図３の音響モデル２０６に代えて音響モデル７５０を用いる点が異なる。 FIG. 10 shows a schematic configuration of an acoustic model 750 used in the speech recognition device according to the sixth embodiment of the present invention. The configuration of the speech recognition device itself according to this embodiment is the same as that shown in FIG. 3, except that the acoustic model 750 is used instead of the acoustic model 206 of FIG. 3.

音響モデル７５０は、全体として１つの深層ニューラル・ネットワークを構成する。音響モデル７５０は、雑音重畳音声の特徴量２１０の入力を受ける第１のサブネットワーク７７０と、第１の強調音声の特徴量２１２の入力を受ける第２のサブネットワーク７７２と、第１のサブネットワーク７７０の出力と第２のサブネットワーク７７２の出力とを受けるように接続された、深層ニューラル・ネットワークの一部である第３のサブネットワーク７７４と、第３のサブネットワーク７７４の出力を受けて音響モデル７５０により推定された音素を特定する出力層７７６とを含む。 The acoustic model 750 as a whole constitutes one deep neural network. The acoustic model 750 includes: a first sub-network 770 that receives the feature amount 210 of the noise-superimposed speech; a second sub-network 772 that receives the feature amount 212 of the first enhanced speech; a third sub-network 774, which is part of the deep neural network and is connected to receive the output of the first sub-network 770 and the output of the second sub-network 772; and an output layer 776 that receives the output of the third sub-network 774 and identifies the phoneme estimated by the acoustic model 750.

第１のサブネットワーク７７０は、雑音重畳音声の特徴量２１０を受ける入力層８００と、入力層８００から第３のサブネットワーク７７４の入力までの間に順に接続された隠れ層８０２、隠れ層８０４及び隠れ層８０６とを含む。 The first sub-network 770 includes an input layer 800 that receives the feature amount 210 of the noise-superimposed speech, and a hidden layer 802, a hidden layer 804, and a hidden layer 806 connected in sequence between the input layer 800 and the input of the third sub-network 774.

第２のサブネットワーク７７２は、第１の強調音声の特徴量２１２を受ける入力層８１０と、入力層８１０の後に順に接続された隠れ層８１２、隠れ層８１４及び隠れ層８１６と、隠れ層８１６の出力を受けるように接続され、第５の実施の形態のゲート層６８２と同様の機能を持つゲート層８１８とを含む。 The second sub-network 772 includes an input layer 810 that receives the feature amount 212 of the first emphasized voice, a hidden layer 812, a hidden layer 814, a hidden layer 816 connected in order after the input layer 810, and a hidden layer 816. A gate layer 818 connected to receive an output and having the same function as the gate layer 682 of the fifth embodiment is included.

第３のサブネットワーク７７４は、第１のサブネットワーク７７０の出力及び第２のサブネットワーク７７２の出力を受ける隠れ層８３０と、隠れ層８３０以後、出力層７７６までの間に順に接続された隠れ層８３２、隠れ層８３４及び隠れ層８３６とを含む。 The third sub-network 774 includes a hidden layer 830 receiving the output of the first sub-network 770 and the output of the second sub-network 772, and a hidden layer connected in order between the hidden layer 830 and the output layer 776. 832, a hidden layer 834 and a hidden layer 836.

この音響モデル７５０は、図９に示すものと異なり、雑音重畳音声の特徴量２１０及び第１の強調音声の特徴量２１２は、音響モデル７５０の前半では第１のサブネットワーク７７０と第２のサブネットワーク７７２とに分離されてそれぞれの内部で伝搬される。第１のサブネットワーク７７０の出力はそのまま第３のサブネットワーク７７４に入力されるが、第２のサブネットワーク７７２では、最後の隠れ層８１６の出力に対してゲート層８１８でのゲート処理が実行された後、その結果が隠れ層８３０に入力される。 The acoustic model 750 is different from the acoustic model shown in FIG. 9 in that the feature 210 of the noise-superimposed speech and the feature 212 of the first emphasized speech include the first subnetwork 770 and the second subnetwork 770 in the first half of the acoustic model 750. It is separated into a network 772 and propagated inside each. The output of the first sub-network 770 is directly input to the third sub-network 774, but the second sub-network 772 performs the gate processing in the gate layer 818 on the output of the last hidden layer 816. After that, the result is input to the hidden layer 830.

こうした構成により、第１の強調音声の特徴量２１２を利用した方が有利なときには第１の強調音声の特徴量２１２が有効に利用され、第１の強調音声の特徴量２１２を利用すると不利になるときには第２のサブネットワーク７７２の出力は小さな値となり、結果として音声認識には利用されない。 With such a configuration, when it is advantageous to use the feature amount 212 of the first emphasized voice, the feature amount 212 of the first emphasized voice is effectively used, and when the feature amount 212 of the first emphasized voice is used, it is disadvantageous. When this happens, the output of the second sub-network 772 will have a small value and as a result will not be used for speech recognition.

この第６の実施の形態に係る音響モデル７５０を用いても、従来技術と比較して高い精度で音声認識できた。 Even with the use of the acoustic model 750 according to the sixth embodiment, speech recognition could be performed with higher accuracy than in the related art.

［第７の実施の形態］ [Seventh Embodiment]
図１１は第７の実施の形態に係る音声認識装置で使用される音響モデル８５０の概略構成を示す。図１１からも分かるようにこの音響モデル８５０も深層ニューラル・ネットワークからなる。この第７の実施の形態に係る音声認識装置は、図６に示す音声認識装置３８０と同様である。ただし、図７の音響モデル３９８に代えて音響モデル８５０を使用する点が異なる。 FIG. 11 shows a schematic configuration of an acoustic model 850 used in the speech recognition device according to the seventh embodiment. As can be seen from FIG. 11, this acoustic model 850 also consists of a deep neural network. The speech recognition device according to the seventh embodiment is the same as the speech recognition device 380 shown in FIG. 6, except that the acoustic model 850 is used instead of the acoustic model 398 of FIG. 7.

図１１を参照して、この音響モデル８５０は、図７に示す音響モデル３９８の構成要素に加えて、入力層４５０の前に、第１の強調音声の特徴量２１２を受けて区間［０，１］の重みを乗じて入力層４５０に入力するゲート層８９２と、第２の強調音声の特徴量４３０を受けて区間［０，１］の重みを乗じて入力層４５０に入力するゲート層９０２と、第３の強調音声の特徴量４３２を受けて区間［０，１］の重みを乗じて入力層４５０に入力するゲート層９１２と、第４の強調音声の特徴量４３４を受けて区間［０，１］の重みを乗じて入力層４５０に入力するゲート層９２２とを含む。その他の点ではこの音響モデル８５０は、図７に示す音響モデル３９８と同一である。 Referring to FIG. 11, in addition to the components of the acoustic model 398 shown in FIG. 7, the acoustic model 850 includes the following gate layers in front of the input layer 450: a gate layer 892 that receives the feature amount 212 of the first enhanced speech, multiplies it by weights in the interval [0, 1], and inputs the result to the input layer 450; a gate layer 902 that does the same for the feature amount 430 of the second enhanced speech; a gate layer 912 that does the same for the feature amount 432 of the third enhanced speech; and a gate layer 922 that does the same for the feature amount 434 of the fourth enhanced speech. In all other respects, the acoustic model 850 is identical to the acoustic model 398 shown in FIG. 7.

この音響モデル８５０では、第１〜第４の強調音声の特徴量２１２、４３０、４３２及び４３４のいずれに対してもゲート層８９２、９０２、９１２及び９２２の機能により、音声認識時に有利となるような特徴量については有効に利用し、そうでない特徴量については利用しないようにできる。その結果、この音響モデル８５０を用いた音声認識でも精度を高くできる。 In the acoustic model 850, the functions of the gate layers 892, 902, 912, and 922 are advantageous for any of the first to fourth emphasized speech features 212, 430, 432, and 434 during speech recognition. Effective feature amounts can be used effectively, and other feature amounts can be prevented from being used. As a result, the accuracy can be improved even in speech recognition using the acoustic model 850.
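
How the four gate layers feed the shared input layer can be sketched as follows. This is an illustrative assumption-laden sketch: the feature dimension, the sigmoid activation, the random gate parameters, and the function names are hypothetical. What follows the text is only the arrangement: each enhanced stream passes through its own gate, the noise-superimposed stream is not gated, and all streams are then presented together to the input layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_gate(dim, rng):
    """Illustrative gate parameters (W, b); in practice these would be trained."""
    return rng.standard_normal((dim, dim)) * 0.1, np.zeros(dim)

def apply_gate(x, params):
    W, b = params
    return sigmoid(W @ x + b) * x  # element-wise gating of one stream

def gated_input(x_noisy, x_enhanced, gates):
    """Gate each enhanced stream, keep the noisy stream as is, then concatenate."""
    gated = [apply_gate(x, g) for x, g in zip(x_enhanced, gates)]
    return np.concatenate([x_noisy] + gated)

rng = np.random.default_rng(0)
FEAT = 40  # hypothetical feature dimension per stream
gates = [make_gate(FEAT, rng) for _ in range(4)]            # one gate per enhanced stream
x_noisy = rng.standard_normal(FEAT)
x_enhanced = [rng.standard_normal(FEAT) for _ in range(4)]
net_input = gated_input(x_noisy, x_enhanced, gates)         # vector fed to the input layer
```

Because each stream has its own gate parameters, the network can learn to attenuate one enhancement method for a given input while passing another through almost unchanged.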

実際、後述するようにこの実施の形態の音響モデル８５０を用いた音声認識装置では、従来の技術よりも高い精度で音声認識を行うことができた。 In fact, as described later, the speech recognition apparatus using the acoustic model 850 according to the present embodiment was able to perform speech recognition with higher accuracy than the conventional technique.

［第８の実施の形態］ [Eighth Embodiment]
図１２に、本発明の第８の実施の形態に係る音声認識装置で使用される音響モデル９５０の概略構成を示す。音響モデル９５０もまた他の実施の形態に係る音響モデルと同様、深層ニューラル・ネットワークからなる。 FIG. 12 shows a schematic configuration of an acoustic model 950 used in the speech recognition device according to the eighth embodiment of the present invention. Like the acoustic models according to the other embodiments, the acoustic model 950 also consists of a deep neural network.

音響モデル９５０は、雑音重畳音声の特徴量２１０を受ける第１の入力サブネットワーク９６０と、第１の強調音声の特徴量２１２を受ける第２の入力サブネットワーク９６２と、第２の強調音声の特徴量４３０を受ける第３の入力サブネットワーク９６４と、第３の強調音声の特徴量４３２を受ける第４の入力サブネットワーク９６６と、第４の強調音声の特徴量４３４を受ける第５の入力サブネットワーク９６８と、第１〜第５の入力サブネットワーク９６０、９６２、９６４、９６６及び９６８の出力を受ける中間サブネットワーク９７０と、中間サブネットワーク９７０の出力を受けて音響モデル９５０が推定する音素を出力する出力層９７２とを含む。 The acoustic model 950 includes a first input sub-network 960 that receives the feature 210 of the noise-superimposed speech, a second input sub-network 962 that receives the feature 212 of the first emphasized speech, and a feature of the second emphasized speech. A third input sub-network 964 receiving the quantity 430, a fourth input sub-network 966 receiving the third emphasized speech feature 432, and a fifth input sub-network receiving the fourth emphasized speech feature 434. 968, an intermediate sub-network 970 receiving the outputs of the first to fifth input sub-networks 960, 962, 964, 966, and 968, and receiving the output of the intermediate sub-network 970 and outputting the phonemes estimated by the acoustic model 950. And an output layer 972.

第１の入力サブネットワーク９６０は、雑音重畳音声の特徴量２１０を受ける入力層９８０と、入力層９８０から中間サブネットワーク９７０までの間に順に接続された隠れ層９８２、隠れ層９８４及び隠れ層９８６とを含む。 The first input sub-network 960 includes an input layer 980 that receives the feature amount 210 of the noise-superimposed speech, and a hidden layer 982, a hidden layer 984, and a hidden layer 986 connected in sequence between the input layer 980 and the intermediate sub-network 970.

第２の入力サブネットワーク９６２は、第１の強調音声の特徴量２１２を受ける入力層９９０と、入力層９９０の後に順に接続される隠れ層９９２、隠れ層９９４及び隠れ層９９６と、隠れ層９９６の出力と中間サブネットワーク９７０の入力との間に挿入されたゲート層９９８とを含む。 The second input sub-network 962 includes an input layer 990 that receives the feature amount 212 of the first enhanced speech, a hidden layer 992, a hidden layer 994, and a hidden layer 996 connected in sequence after the input layer 990, and a gate layer 998 inserted between the output of the hidden layer 996 and the input of the intermediate sub-network 970.

第３の入力サブネットワーク９６４は、第２の強調音声の特徴量４３０を受ける入力層１０００と、入力層１０００の後に順に接続された隠れ層１００２、隠れ層１００４及び隠れ層１００６と、隠れ層１００６の出力と中間サブネットワーク９７０の入力との間に挿入されたゲート層１００８とを含む。 The third input sub-network 964 includes an input layer 1000 that receives the feature amount 430 of the second enhanced speech, a hidden layer 1002, a hidden layer 1004, and a hidden layer 1006 connected in sequence after the input layer 1000, and a gate layer 1008 inserted between the output of the hidden layer 1006 and the input of the intermediate sub-network 970.

第４の入力サブネットワーク９６６は、第３の強調音声の特徴量４３２を受ける入力層１０１０と、入力層１０１０の後に順に接続された隠れ層１０１２、隠れ層１０１４及び隠れ層１０１６と、隠れ層１０１６の出力と中間サブネットワーク９７０の入力との間に挿入されたゲート層１０１８とを含む。 The fourth input sub-network 966 includes an input layer 1010 that receives the feature amount 432 of the third enhanced speech, a hidden layer 1012, a hidden layer 1014, and a hidden layer 1016 connected in sequence after the input layer 1010, and a gate layer 1018 inserted between the output of the hidden layer 1016 and the input of the intermediate sub-network 970.

第５の入力サブネットワーク９６８は、第４の強調音声の特徴量４３４を受ける入力層１０２０と、入力層１０２０の後に順に接続された隠れ層１０２２、隠れ層１０２４及び隠れ層１０２６と、隠れ層１０２６の出力と中間サブネットワーク９７０の入力との間に挿入されたゲート層１０２８とを含む。 The fifth input sub-network 968 includes an input layer 1020 that receives the feature amount 434 of the fourth enhanced speech, a hidden layer 1022, a hidden layer 1024, and a hidden layer 1026 connected in sequence after the input layer 1020, and a gate layer 1028 inserted between the output of the hidden layer 1026 and the input of the intermediate sub-network 970.

中間サブネットワーク９７０は、第１の入力サブネットワーク９６０並びに第２〜第５の入力サブネットワーク９６２、９６４、９６６及び９６８の出力を受ける隠れ層１０３０と、隠れ層１０３０と出力層９７２との間に順に接続された隠れ層１０３２、隠れ層１０３４及び隠れ層１０３６とを含む。 The intermediate sub-network 970 includes a hidden layer 1030 that receives the outputs of the first input sub-network 960 and the second to fifth input sub-networks 962, 964, 966, and 968, and a hidden layer 1030 and an output layer 972 between the hidden layer 1030 and the output layer 972. It includes a hidden layer 1032, a hidden layer 1034, and a hidden layer 1036 connected in order.

この音響モデル９５０を用いた音声認識装置の動作も、音響モデルとして音響モデル９５０を使用することを除き、図６に示す音声認識装置３８０と同様である。 The operation of the speech recognition device using the acoustic model 950 is the same as that of the speech recognition device 380 shown in FIG. 6, except that the acoustic model 950 is used as the acoustic model.

この実施の形態では、第１〜第４の音声強調により得られた特徴量の各要素の各々について、区間［０、１］の値をとる係数で重み付けをして音素を推定できる。音声強調ごとに、かつその特徴量ごとに、音声認識に有利な特徴については有効に利用し、不利な特徴については使用しないようにできる。その結果、音声認識の精度を高くできる。 In this embodiment, it is possible to estimate a phoneme by weighting each element of the feature amount obtained by the first to fourth speech enhancements with a coefficient that takes a value in the section [0, 1]. For each voice emphasis and for each feature amount, it is possible to effectively use features that are advantageous for speech recognition and not to use disadvantageous features. As a result, the accuracy of voice recognition can be increased.

後述のように、この実施の形態では、従来技術での精度はもちろん、上記した第１〜第７の実施の形態のいずれよりも高い精度を実現することができた。 As described later, this embodiment achieved higher accuracy not only than the prior art but also than any of the first to seventh embodiments described above.

［実験結果］ [Experimental Results]
図１４に、上記各実施の形態について行った実験結果（単語誤り率）を表形式で示す。この実験では、非特許文献５に記載されたCHiME3（タブレットを用いた屋外で収録した音声）を認識対象として使用した。この実験で使用した音声強調処理は以下のとおりである。 FIG. 14 shows, in table form, the results (word error rates) of experiments performed on the above embodiments. In these experiments, CHiME3 (speech recorded outdoors using a tablet), described in Non-Patent Document 5, was used as the recognition target. The speech enhancement processes used in the experiments are as follows.

・音声強調１：非特許文献１に開示された技術 ・Speech enhancement 1: the technique disclosed in Non-Patent Document 1
・音声強調２：非特許文献２に開示された技術 ・Speech enhancement 2: the technique disclosed in Non-Patent Document 2
・音声強調３：非特許文献３に開示された技術 ・Speech enhancement 3: the technique disclosed in Non-Patent Document 3
・音声強調４：非特許文献４に開示された技術 ・Speech enhancement 4: the technique disclosed in Non-Patent Document 4

第１、第２、第５及び第６の実施の形態に関する実験では、例えば図３に示す音声強調部２０２として上記音声強調１〜４をそれぞれ採用して各実施の形態の音響モデルを使用して音声認識精度を測定し、第３、第４、第７及び第８の実施の形態に関する実験では、図６に示す音声強調部２０２、３９２、３９４及び３９６として上記音声強調１〜４をそれぞれ採用し、各実施の形態の音響モデルを使用して音声認識精度を測定した。 In the experiments on the first, second, fifth, and sixth embodiments, each of speech enhancements 1 to 4 above was used in turn as, for example, the speech enhancement unit 202 shown in FIG. 3, and speech recognition accuracy was measured with the acoustic model of each embodiment. In the experiments on the third, fourth, seventh, and eighth embodiments, speech enhancements 1 to 4 were used as the speech enhancement units 202, 392, 394, and 396 shown in FIG. 6, respectively, and speech recognition accuracy was measured with the acoustic model of each embodiment.

なお、図１４には示していないが、従来の音声認識装置で音声強調なしで同じデータに対する音声認識を行った場合の単語誤り率は２２．６４％であった。 Although not shown in FIG. 14, the word error rate when speech recognition was performed on the same data without speech enhancement using a conventional speech recognition device was 22.64%.
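
For reference, the word error rate is conventionally computed as the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The sketch below shows that standard definition; it is not the scoring tool used in the experiments, and the function name and example sentences are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] holds the edit distance between ref[:i] and hyp[:j].
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete every remaining reference word
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # match or substitution
    return dist[len(ref)][len(hyp)] / len(ref)

# One deletion ("the") and one substitution ("please" -> "peas") over 5 reference words.
wer = word_error_rate("turn the volume down please", "turn volume down peas")
```

In this made-up example the rate is 2 / 5 = 0.4, that is, 40%; a figure such as 22.64% is read the same way, as errors per hundred reference words.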

図１４から明らかなように、本発明の第１〜第８の実施の形態によれば、従来技術の音声強調を用いた場合よりも単語誤り率が低かった。すなわち音声認識の精度は高かった。従来の音声認識で音声強調なしの場合と比較しても、大部分の場合で精度はより高かった。特に第２の実施の形態ではいずれの音声強調を使用しても高い精度を実現できた。また第４の実施の形態及び第８の実施の形態では精度は非常に高く、特に第８の実施の形態では他の実施の形態と比較しても一段と高い精度を実現できた。 As is clear from FIG. 14, according to the first to eighth embodiments of the present invention, the word error rate was lower than in the case where the conventional speech enhancement was used. That is, the accuracy of voice recognition was high. In most cases, the accuracy was higher than in conventional speech recognition without speech enhancement. In particular, in the second embodiment, high accuracy was realized by using any of the voice enhancements. In the fourth embodiment and the eighth embodiment, the accuracy is very high. In the eighth embodiment, in particular, higher accuracy can be realized as compared with the other embodiments.

［コンピュータによる実現］ [Realization by Computer]
上記した各実施の形態に係る音声認識装置の各機能部は、それぞれコンピュータハードウェアと、そのハードウェア上でＣＰＵ（中央演算処理装置）及びＧＰＵ（Graphics Processing Unit）により実行されるプログラムとにより実現できる。図１５に上記各音声認識装置を実現するコンピュータハードウェアを示す。ＧＰＵは通常は画像処理を行うために使用されるが、このようにＧＰＵを画像処理ではなく通常の演算処理に使用する技術をＧＰＧＰＵ（General-purpose computing on graphics processing units）と呼ぶ。ＧＰＵは同種の複数の演算を同時並列的に実行できる。一方、ニューラル・ネットワークの場合、特に学習時には演算が大量に必要になるが、それらは同時に超並列的に実行可能である。したがって、音声認識装置とそこに用いられる音響モデルを構成するニューラル・ネットワークの訓練と推論にはＧＰＵを備えたコンピュータが適している。なお、学習が終わった音響モデルを用いて音声認識を行う場合、十分高速なＣＰＵを搭載したコンピュータであれば、必ずしもＧＰＵを搭載していなくてもよい。 Each functional unit of the speech recognition device according to each of the above embodiments can be realized by computer hardware and by programs executed on that hardware by a CPU (Central Processing Unit) and a GPU (Graphics Processing Unit). FIG. 15 shows computer hardware that realizes each of the above speech recognition devices. A GPU is normally used for image processing; the technique of using a GPU for general computation rather than image processing is called GPGPU (General-purpose computing on graphics processing units). A GPU can execute many operations of the same kind simultaneously and in parallel. A neural network, meanwhile, requires a large amount of computation, especially during training, and those computations can be executed in a massively parallel fashion. A computer equipped with a GPU is therefore well suited to the training and inference of the neural networks that make up the speech recognition device and the acoustic model it uses. When speech recognition is performed with an acoustic model that has already been trained, a computer with a sufficiently fast CPU does not necessarily need a GPU.

図１５を参照して、このコンピュータシステム１１３０は、メモリポート１１５２及びＤＶＤ（Digital Versatile Disk）ドライブ１１５０を有するコンピュータ１１４０と、キーボード１１４６と、マウス１１４８と、モニタ１１４２とを含む。 Referring to FIG. 15, the computer system 1130 includes a computer 1140 having a memory port 1152 and a DVD (Digital Versatile Disk) drive 1150, a keyboard 1146, a mouse 1148, and a monitor 1142.

コンピュータ１１４０はさらに、ＣＰＵ１１５６及びＧＰＵ１１５８と、これら並びにメモリポート１１５２及びＤＶＤドライブ１１５０に接続されたバス１１６６と、ブートプログラム等を記憶する読出専用メモリであるＲＯＭ１１６０と、バス１１６６に接続され、プログラム命令、システムプログラム及び作業データ等を記憶するコンピュータ読出可能な記憶媒体であるランダムアクセスメモリ（ＲＡＭ）１１６２と、コンピュータ読出可能な不揮発性記憶媒体であるハードディスク１１５４を含む。コンピュータ１１４０はさらに、いずれもバス１１６６に接続され、ネットワーク１１６８への接続を提供するネットワークインターフェイス（Ｉ／Ｆ）１１４４と、外部との音声信号の入出力を行うための音声Ｉ／Ｆ１１７０とを含む。 The computer 1140 is further connected to a CPU 1156 and a GPU 1158, a bus 1166 connected to these, a memory port 1152 and a DVD drive 1150, a ROM 1160 which is a read-only memory for storing a boot program and the like, and connected to a bus 1166, It includes a random access memory (RAM) 1162 which is a computer readable storage medium for storing a system program, work data, and the like, and a hard disk 1154 which is a computer readable nonvolatile storage medium. The computer 1140 further includes a network interface (I / F) 1144 which is connected to the bus 1166 and provides a connection to the network 1168, and an audio I / F 1170 for inputting and outputting audio signals to and from the outside. .

A program that causes the computer system 1130 to function as the functional units of each speech recognition device according to the above embodiments, and as the storage device for the acoustic model, is stored on a DVD 1172 or a removable memory 1164, both of which are computer-readable storage media, loaded into the DVD drive 1150 or the memory port 1152, and is then transferred to the hard disk 1154. Alternatively, the program may be transmitted to the computer 1140 over the network 1168 and stored on the hard disk 1154. The program is loaded into the RAM 1162 when it is executed. The program may also be loaded into the RAM 1162 directly from the DVD 1172, from the removable memory 1164, or via the network 1168. The data required for the above processing is stored at predetermined addresses in the hard disk 1154, the RAM 1162, or registers in the CPU 1156 or the GPU 1158, is processed by the CPU 1156 or the GPU 1158, and is stored at addresses designated by the program. The parameters of the acoustic model whose training has finally been completed are stored, together with the program that implements the training and inference algorithms of the acoustic model, on the hard disk 1154, for example, or on the DVD 1172 or the removable memory 1164 via the DVD drive 1150 or the memory port 1152, respectively. Alternatively, they are transmitted to another computer or storage device connected via the network I/F 1144.

This program includes an instruction sequence consisting of a plurality of instructions that cause the computer 1140 to function as each device and system according to the above embodiments. Numerical computation in those devices and systems is performed using the CPU 1156 and the GPU 1158. The CPU 1156 alone may be used, but using the GPU 1158 is faster. Some of the basic functions needed to make the computer 1140 perform these operations are provided by the operating system running on the computer 1140, by third-party programs, or by various dynamically linkable programming toolkits or program libraries installed on the computer 1140. The program itself therefore need not include all the functions required to realize the speech recognition device of this embodiment. It need only include those instructions that realize the functions of the system, device, or method described above by dynamically calling, at run time and in a controlled manner that yields the desired result, the appropriate functions or the appropriate programs in a programming toolkit or program library. Of course, the speech recognition device described above may also be realized by loading into the computer a program in which all the necessary functions have been incorporated by static linking.

[Modification]
The third, fourth, seventh, and eighth embodiments use four types of speech enhancement processing. However, the present invention is not limited to such embodiments. Two, three, or five or more types of speech enhancement processing may be used.
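As a minimal sketch of what a variable number of speech enhancement processes means for feature extraction, the following Python fragment builds one feature stream for the noisy input plus one stream per enhancement process. Everything named here is hypothetical and chosen only for illustration: `moving_average` and `amplitude_gate` are placeholder enhancers, not the enhancement methods of the embodiments, and `frame_log_energy` is a stand-in for the actual feature extraction.

```python
import math
from typing import Callable, List, Sequence

Signal = List[float]
Enhancer = Callable[[Sequence[float]], Signal]


def moving_average(signal: Sequence[float], width: int = 3) -> Signal:
    """Placeholder enhancer (assumption): smooth with a short moving average."""
    half = width // 2
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


def amplitude_gate(signal: Sequence[float], threshold: float = 0.05) -> Signal:
    """Placeholder enhancer (assumption): zero samples below a threshold."""
    return [x if abs(x) >= threshold else 0.0 for x in signal]


def frame_log_energy(signal: Sequence[float], frame_len: int = 4) -> List[float]:
    """Stand-in feature extractor: log energy of non-overlapping frames."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        feats.append(math.log(sum(x * x for x in frame) + 1e-10))
    return feats


def extract_feature_streams(
    noisy: Sequence[float], enhancers: Sequence[Enhancer]
) -> List[List[float]]:
    """Return one feature stream for the noisy signal plus one per enhancer."""
    streams = [frame_log_energy(noisy)]
    for enhance in enhancers:
        streams.append(frame_log_energy(enhance(noisy)))
    return streams
```

Passing a list of two, three, four, or more enhancers changes only the number of returned streams; the noisy-signal stream is always present, which mirrors the idea that the recognizer receives the unenhanced signal alongside the enhanced ones.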

In the above embodiments, the deep neural network constituting the acoustic model has seven hidden layers in total, and in the third, fourth, seventh, and eighth embodiments three hidden layers are used in the first half of the deep neural network and four in the second half. However, the present invention is not limited to such embodiments. The number of hidden layers may be six or fewer, or eight or more. Likewise, when an acoustic model is constructed according to the third, fourth, seventh, and eighth embodiments, there is no need at all for the first and second halves to have three and four hidden layers, respectively. That said, in the experiments described above, the best results were in fact obtained with three layers in the first half and four in the second half.
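The split between first-half and second-half hidden layers can be illustrated with a small, dependency-free Python sketch. It is not the acoustic model of the embodiments: the class name `SplitAcousticModel`, the ReLU activation, the layer width, the random initialization, and the number of output classes are all assumptions made for illustration. The only points taken from the text are that each input stream passes through its own first-half hidden layers, that the merged result passes through the second-half hidden layers, and that the two layer counts (3 and 4 by default) are free parameters.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)


def make_layers(in_dim: int, out_dim: int, n_layers: int, rng: random.Random) -> List[Layer]:
    """Create n_layers fully connected layers with small random weights."""
    layers = []
    dim = in_dim
    for _ in range(n_layers):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(out_dim)]
        layers.append((weights, [0.0] * out_dim))
        dim = out_dim
    return layers


def dense(layer: Layer, x: Vector) -> Vector:
    """Affine transform of x by one layer."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def run_hidden(layers: List[Layer], x: Vector) -> Vector:
    """Apply each hidden layer followed by ReLU (the activation is an assumption)."""
    for layer in layers:
        x = [max(0.0, v) for v in dense(layer, x)]
    return x


def softmax(z: Vector) -> Vector:
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


class SplitAcousticModel:
    """Hypothetical model: per-stream front layers, shared back layers."""

    def __init__(self, feat_dim: int, n_streams: int, n_front: int = 3,
                 n_back: int = 4, hidden_dim: int = 8, n_classes: int = 5,
                 seed: int = 0) -> None:
        if n_front < 1 or n_back < 1:
            raise ValueError("each half needs at least one hidden layer")
        rng = random.Random(seed)
        # One stack of first-half hidden layers per input feature stream.
        self.front = [make_layers(feat_dim, hidden_dim, n_front, rng)
                      for _ in range(n_streams)]
        # Second-half hidden layers operate on the concatenated front outputs.
        self.back = make_layers(hidden_dim * n_streams, hidden_dim, n_back, rng)
        self.out = make_layers(hidden_dim, n_classes, 1, rng)[0]

    def forward(self, streams: List[Vector]) -> Vector:
        merged: Vector = []
        for layers, x in zip(self.front, streams):
            merged.extend(run_hidden(layers, x))
        return softmax(dense(self.out, run_hidden(self.back, merged)))
```

With the defaults, `SplitAcousticModel(feat_dim=4, n_streams=2)` has 3 + 4 = 7 hidden layers on any path from input to output; passing, for example, `n_front=2, n_back=6` gives a model with eight hidden layers without any other change.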

In the above embodiments, the present invention was applied to a single-channel speech signal. However, the present invention is not limited to such embodiments and can also be applied to multi-channel speech signals.

The embodiments disclosed herein are merely examples, and the present invention is not limited to the embodiments described above. The scope of the present invention is indicated by each of the claims, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

100, 180, 380 Speech recognition device
110 Waveform
112 Speech signal
114, 202, 392, 394, 396 Speech enhancement unit
116, 203, 393, 395, 397 Enhanced speech signal
118, 220, 410, 412, 414 Feature extraction unit
120, 204, 402 Speech recognition unit
122, 208, 400 Text
124, 206, 280, 398, 500, 650, 750, 850, 950 Acoustic model
126 Pronunciation dictionary
128 Language model
200, 390 Enlarged feature extraction unit
210 Features of noise-superimposed speech
212 Features of first enhanced speech
300 Sub-network for noise-superimposed speech
302 Sub-network for enhanced speech
304 Output-side sub-network
430 Features of second enhanced speech
432 Features of third enhanced speech
434 Features of fourth enhanced speech
452 Intermediate layer
530, 1130 Computer system
540, 770 First sub-network
542, 772 Second sub-network
544, 774 Third sub-network
546 Fourth sub-network
548 Fifth sub-network
550, 970 Intermediate sub-network
682, 818, 892, 902, 912, 922, 998, 1008, 1018, 1028, 1100 Gate layer
960 First input sub-network
962 Second input sub-network
964 Third input sub-network
966 Fourth input sub-network
968 Fifth input sub-network

Claims

A noise-tolerant speech recognition device comprising:
a speech enhancement circuit that receives as input an acoustic signal in which a noise signal is superimposed on a speech signal that is the target signal, and outputs an enhanced speech signal in which the speech signal is enhanced; and
a speech recognition unit that receives the enhanced speech signal and the acoustic signal and converts the utterance content of the speech signal into text.

The noise-tolerant speech recognition device according to claim 1, wherein the speech enhancement circuit includes:
a first speech enhancement unit that performs a first type of speech enhancement processing on the acoustic signal and outputs a first enhanced speech signal; and
a second speech enhancement unit that performs, on the acoustic signal, a second type of speech enhancement processing different from the first type and outputs a second enhanced speech signal,
and wherein the speech recognition unit receives the first and second enhanced speech signals and the acoustic signal and converts the utterance content of the speech signal into text.

The noise-tolerant speech recognition device according to claim 1, wherein the speech recognition unit includes:
first feature extraction means for extracting a first feature from the acoustic signal;
second feature extraction means for extracting a second feature from the enhanced speech signal;
feature selection means for selecting each of the values of the second feature in accordance with the values of the first feature and the second feature; and
speech recognition means for converting the utterance content of the speech signal into text using the values of the second feature selected by the feature selection means.

The noise-tolerant speech recognition device according to claim 3, wherein the speech recognition unit further includes an acoustic model storage unit that stores an acoustic model used for speech recognition,
the acoustic model is a deep neural network having a plurality of hidden layers, and
the acoustic model includes:
a first sub-network that receives the first feature as input;
a second sub-network that receives the second feature as input; and
a third sub-network that receives the output of the first sub-network and the output of the second sub-network and outputs phonemes estimated from the first feature and the second feature.

A noise-tolerant speech recognition method comprising:
a step in which a computer receives as input a single-channel acoustic signal in which a noise signal is superimposed on a speech signal that is the target signal, and outputs an enhanced speech signal in which the speech signal is enhanced; and
a speech recognition step in which the computer receives the enhanced speech signal and the acoustic signal and converts the utterance content of the speech signal into text.

A computer program that causes a computer to function as the noise-tolerant speech recognition device according to claim 1.