JP2018141925A

JP2018141925A - Acoustic model learning device and acoustic model learning program

Info

Publication number: JP2018141925A
Application number: JP2017037421A
Authority: JP
Inventors: Hitoshi Ito; Shoe Sato; Akio Kobayashi
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp; NHK Engineering System Inc
Current assignee: Japan Broadcasting Corp; NHK Engineering System Inc
Priority date: 2017-02-28
Filing date: 2017-02-28
Publication date: 2018-09-13
Anticipated expiration: 2037-02-28
Also published as: JP6820764B2

Abstract

PROBLEM TO BE SOLVED: To provide an acoustic model learning device that has the expressive power required for a Japanese acoustic model, a short learning time, and an improved WER. SOLUTION: An acoustic model learning means 100C comprises a deep learning means 111A composed of three BLSTM layers, and a linear mapping means 112B that converts, by a predetermined operation, the dimension of the vector output by the final-layer BLSTM 30c of the deep learning means. The BLSTM 30d makes the memory cell c, which stores information in the time direction, smaller than in the other layers and thereby compresses the number of dimensions of its output from 640 to 320. The linear mapping means 112B sequentially multiplies by a first transformation matrix and a second transformation matrix, obtained by factorizing at rank r = 320 the transformation matrix whose size is given by 640, the number of dimensions of the output of the BLSTM 30c, and 2934, the number of dimensions of the character output vector 7. [Selected drawing] FIG. 7

Description

The present invention relates to an acoustic model learning device and an acoustic model learning program.

In recent years, several end-to-end speech recognition methods using a DNN (Deep Neural Network) have been proposed in the field of speech recognition (Non-Patent Documents 1 and 2). An acoustic model learning device for such methods directly learns the correspondence between speech and characters using a single acoustic model, and thereby converts speech to characters end-to-end without passing through phonemes as an intermediate state. In end-to-end speech recognition, an RNN (Recurrent Neural Network), an LSTM (Long Short-Term Memory), or a BLSTM (Bi-directional LSTM) is sometimes used to store information in the time direction.

A network structure in which the number of units in a specific intermediate layer of a DNN is reduced is called a bottleneck structure, and the bottleneck structure is sometimes used as an input to another DNN (see Non-Patent Document 3). Here, reducing the number of units corresponds to reducing the number of parameters (the number of dimensions) to be determined by learning.

Non-Patent Document 4 describes that, in the field of speech recognition based on an HMM (Hidden Markov Model) using a DNN (DNN-HMM), using a factorized matrix as the transformation matrix of the affine transformation (linear transformation) shortens the learning time without degrading the WER (word error rate).
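As a rough illustration of why a factorized transformation matrix can shorten learning, the sketch below counts the weights of a full m-by-n matrix and of its rank-r factorization into an m-by-r and an r-by-n matrix. The function name is hypothetical, and the sizes 640, 2934, and r = 320 are borrowed from the abstract of this document rather than from Non-Patent Document 4.

```python
def affine_weight_counts(m, n, r):
    """Return (full, factorized) weight counts for an m-by-n matrix
    and its rank-r factorization into (m-by-r) and (r-by-n) factors."""
    full = m * n
    factorized = m * r + r * n
    return full, factorized


# Sizes taken from the abstract: 640 inputs, 2934 outputs, rank 320.
full, factorized = affine_weight_counts(640, 2934, 320)
print(full, factorized)
```

The factorized form has fewer weights only when r is smaller than m*n / (m + n), so the rank must be chosen with the matrix size in mind.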

Non-Patent Document 1: Amodei, D., et al., "Deep Speech 2: End-to-End Speech Recognition in English and Mandarin", the Computing Research Repository (CoRR), arXiv:1512.02595v1 [cs.CL], 8 Dec 2015
Non-Patent Document 2: Miao, Y., et al., "EESEN: End-to-End Speech Recognition using Deep RNN Models and WFST-based Decoding", the Computing Research Repository (CoRR), arXiv:1507.08240v3 [cs.CL], 18 Oct 2015
Non-Patent Document 3: Wollmer, M., et al., "Feature Enhancement by Bidirectional LSTM Networks for Conversational Speech Recognition in Highly Non-Stationary Noise", 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6822-6826 (2013)
Non-Patent Document 4: Sainath, T., et al., "Low-Rank Matrix Factorization for Deep Neural Network Training with High-Dimensional Output Targets", 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6655-6659 (2013)

However, the prior art has the following problems.
Most conventional speech recognition techniques target English speech recognition, in which there are only about 30 candidate output characters. In Japanese, hiragana, katakana, kanji, and so on together give an enormous number of candidates, 2000 or more, so the number of parameters to be determined by learning (the number of dimensions of the vector, when the parameters are regarded as a vector) increases significantly.
Furthermore, in the prior art, the learning time increases as the number of parameters (number of dimensions) to be determined by learning increases. There is also the problem (a generalization problem) that with too many parameters the model represents fine details excessively and, conversely, fails to capture the more general and essential features. On the other hand, with too few parameters, the required number of characters cannot be represented.

Therefore, if the number of parameters to be determined by learning can be appropriately reduced in an acoustic model learning device using a neural network capable of storing information in the time direction, such as the RNN, LSTM, or BLSTM described above, the device is expected to become applicable to Japanese speech recognition as well.

In addition, the speech recognition system studied in Non-Patent Document 4 uses, as its acoustic model, a DNN-HMM that passes through phoneme sequences; it does not address acoustic models used in end-to-end speech recognition.

The present invention has been made in view of the above problems, and its object is to provide an acoustic model learning device and an acoustic model learning program that have the expressive power required for a Japanese acoustic model, a short learning time, and an improved WER.

To solve the above problem, the present invention provides an acoustic model learning device that learns an acoustic model which converts input speech into characters using an end-to-end speech recognition method and outputs the characters, by learning the correspondence between the input speech and the characters output as a result of recognizing that speech, the device comprising: deep learning means that has a neural network with a multilayer structure of three or more layers, receives speech feature quantities continuously, stores, in each layer of the multilayer structure, information in the time direction about the feature quantities, and uses the information in the time direction to output a feature vector representing the predicted probability of which of a plurality of target characters the speech feature quantity corresponds to; and linear mapping means that converts the dimension of the feature vector output by the deep learning means by a predetermined operation, by applying a predetermined transformation matrix to the feature vector output from the final layer of the deep learning means, wherein the acoustic model is learned while compressing the dimension of the feature vector handled in at least one of the operations performed by the deep learning means and the linear mapping means.

The present invention provides the following advantageous effects.
According to the acoustic model learning device of the present invention, compressing the dimension of the vectors handled in the operations reduces the number of parameters to be determined when the acoustic model is learned using an end-to-end speech recognition method.
Furthermore, according to the acoustic model learning device of the present invention, the model has the expressive power required for a Japanese acoustic model, the word error rate (WER) is improved, and the learning time and the number of learning iterations are significantly reduced.

FIG. 1 is a block diagram showing the overall configuration of a Japanese speech recognition device including the Japanese acoustic model learning device according to the present embodiment.
FIG. 2 is a diagram showing an example of a standard network structure with a BLSTM structure among end-to-end acoustic models.
FIG. 3 is a schematic diagram illustrating the network structure of the acoustic model used by the acoustic model learning means according to the first embodiment.
FIG. 4 is a diagram showing an example of the network structure of the acoustic model used by the acoustic model learning means according to the first embodiment.
FIG. 5 is a schematic diagram illustrating the linear transformation part of the network structure of the acoustic model used by the acoustic model learning means according to the second embodiment.
FIG. 6 is a diagram showing an example of the network structure of the acoustic model used by the acoustic model learning means according to the second embodiment.
FIG. 7 is a diagram showing an example of the network structure of the acoustic model used by the acoustic model learning means according to the third embodiment.

Hereinafter, a Japanese acoustic model learning device according to an embodiment of the present invention will be described with reference to the drawings.
[Configuration of Japanese speech recognition device]
The Japanese speech recognition device 1 shown in FIG. 1 includes a Japanese acoustic model learning device 10 and a Japanese language model learning device 20.

The Japanese acoustic model learning device 10 learns an acoustic model that converts input speech into characters end-to-end and outputs them, by learning the correspondence between the input speech and the output characters. In the following, the learning data 2 for creating the Japanese acoustic model is described as pairs of speech 2a and text 2b. The speech 2a and the text 2b represent a large amount of Japanese speech data and a large amount of text, respectively. For example, the audio of broadcast programs for pre-learning can be used as the speech 2a, and a strict transcription of that program audio, or something equivalent to it, can be used as the text 2b.

Here, the Japanese acoustic model learning device 10 includes acoustic model learning means 100 and acoustic model storage means 101.

The acoustic model learning means 100 learns the parameters (weight coefficients and the like) of a model (acoustic model) that outputs which of the labels (that is, which character) a given speech corresponds to, by learning with the pairs of speech 2a and text 2b in the learning data 2 for creating the Japanese acoustic model and with character labels (hereinafter simply called labels), and stores the acoustic model in the acoustic model storage means 101. The labels for Japanese include phonograms (hiragana and katakana), ideograms (kanji), and symbols such as punctuation marks. In the following, a label, including a symbol, may be called simply a character, and a label sequence may be called a character string. The acoustic model learning means 100 is applicable to any end-to-end acoustic model that identifies a sequence of characters, such as the one described in Non-Patent Document 2.

This acoustic model is obtained by modeling acoustic features (mel-frequency cepstral coefficients, filter bank outputs, and the like) extracted in advance from a large amount of speech data, for each set label, using a deep neural network, connectionist temporal classification (CTC), and the like. As long as the output is graphemes including kanji, the likelihood calculation for the acoustic features by the acoustic model may use either a recurrent neural network (RNN) or a long short-term memory (LSTM).
The acoustic model storage means 101 stores the acoustic model generated through learning by the acoustic model learning means 100, and is a general storage medium such as a hard disk.
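A CTC-trained model emits one label per frame, including a special blank symbol, and a common way to read out a label sequence is best-path decoding, which merges consecutive repeats and then drops blanks. The sketch below illustrates only that collapsing rule. The blank symbol `"_"` and the function name are assumptions made for illustration, and this document does not state which decoding procedure is actually used.

```python
def ctc_collapse(frame_labels, blank="_"):
    """Merge consecutive duplicate labels, then remove blank symbols."""
    collapsed = []
    previous = None
    for label in frame_labels:
        # Keep a label only when it starts a new run and is not the blank.
        if label != previous and label != blank:
            collapsed.append(label)
        previous = label
    return "".join(collapsed)


# Hypothetical per-frame labels for a two-character utterance.
print(ctc_collapse(["_", "音", "音", "_", "声", "声", "_", "_"]))
```

A blank between two identical labels keeps them as separate characters, which is how repeated characters survive the collapsing step.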

The above description corresponds to the processing in the pre-learning phase, one of the two phases (pre-learning phase and evaluation phase) in which the acoustic model is used.
In the evaluation phase, after learning has finished, evaluation speech 3 is input to the acoustic model storage means 101 (Japanese acoustic model learning device 10) instead of the learning data 2. The acoustic model learning means 100 then recognizes the evaluation speech 3 using the acoustic model generated by pre-learning and stored in the acoustic model storage means 101, and outputs the corresponding character string.

That is, in the evaluation phase, the acoustic model learning means 100 functions as character string generation means: it converts the input evaluation speech 3 into feature quantities (feature vectors) and generates a character string by sequentially converting the feature quantities into labels (characters) using the acoustic model stored in the acoustic model storage means 101.

In the evaluation phase, when feature quantities (feature vectors) are input instead of the evaluation speech 3, the acoustic model learning means 100 can skip the conversion described above and sequentially convert the input feature quantities into labels using the acoustic model.
Alternatively, character string generation means that performs the processing for the evaluation phase may be provided separately, so that the acoustic model learning means 100 performs only the processing for the pre-learning phase.

The Japanese language model learning device 20 learns a language model that outputs a word string from labels, using a large amount of Japanese text. Here, the Japanese language model learning device 20 includes language model learning means 200 and language model storage means 201.

The language model learning means 200 learns the parameters of a model (language model) that outputs a word string from labels, using the labels and a language model corpus 4, and stores the language model in the language model storage means 201. The language model corpus 4 is a large-scale collection of natural language text, and contains a larger amount of data than the text 2b of the learning data 2 for creating the acoustic model.

The language model storage means 201 stores the language model generated through learning by the language model learning means 200, and is a general storage medium such as a hard disk.
The language model stored in the language model storage means 201 may be any model that, like the model described in Non-Patent Document 2, takes as input the character string containing ideograms obtained by inputting the evaluation speech 3 or its feature quantities to the acoustic model storage means 101, estimates a word string from the relationships between preceding and following words, and outputs the estimated word string. The language model models the occurrence probabilities and the like of output sequences (words and the like) learned in advance from a large amount of text; for example, a general N-gram language model can be used.
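A minimal sketch of the N-gram idea with N = 2: bigram probabilities are estimated from counts of adjacent word pairs. The toy corpus, the function name, the sentence boundary markers, and the absence of smoothing are all simplifications assumed for illustration; a practical language model would be trained on a large corpus and would smooth unseen pairs.

```python
from collections import Counter


def train_bigram(sentences):
    """Return a function p(previous, word) estimated by relative frequency."""
    contexts, pairs = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + list(words) + ["</s>"]
        contexts.update(padded[:-1])                 # every word that has a successor
        pairs.update(zip(padded[:-1], padded[1:]))   # adjacent word pairs

    def prob(previous, word):
        if contexts[previous] == 0:
            return 0.0                               # unseen context, no smoothing
        return pairs[(previous, word)] / contexts[previous]

    return prob


# Toy corpus of pre-segmented sentences (hypothetical data).
corpus = [["今日", "は", "晴れ"], ["今日", "は", "雨"]]
p = train_bigram(corpus)
print(p("は", "晴れ"))
```

A word-string estimator of the kind described above would combine such probabilities over candidate segmentations of the character string; that search step is outside this sketch.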

In the evaluation phase, when the speech 3 or its feature quantities are continuously input to the acoustic model with learned parameters stored in the Japanese acoustic model learning device 10, the corresponding character strings are continuously output and input to the language model storage means 201 (Japanese language model learning device 20). The language model learning means 200 then uses the language model with learned parameters stored in the language model storage means 201 to output, from the input character string, a recognition result 5 (word string) as a natural Japanese sentence.
That is, in the evaluation phase, the language model learning means 200 functions as word string generation means that generates a word string by sequentially converting the input character string into words using the language model stored in the language model storage means 201. Word string generation means that performs the processing for the evaluation phase may instead be provided separately, so that the language model learning means 200 performs only the processing for the pre-learning phase.

[Configuration of Japanese acoustic model learning device 10]
Before describing the network structure of the acoustic model used by the acoustic model learning means 100 of the Japanese acoustic model learning device 10, the network structure of an end-to-end acoustic model is described with reference to FIG. 2. FIG. 2 shows an example of a standard network structure with a BLSTM structure, but the present invention is equally applicable to structures implemented with LSTM and to general RNNs that do not have an LSTM structure.

As shown in FIG. 2, the acoustic model learning means 100R, which learns an acoustic model using this standard network structure, includes deep learning means 111R, linear mapping means 112, and normalization means 113.
The deep learning means 111R consists of a first-layer BLSTM 30a, a second-layer BLSTM 30b, and a third-layer BLSTM 30c. The deep learning means 111R takes speech as input and learns which of the labels the speech corresponds to. A three-layer structure is used here, but the deep learning means 111R may be a neural network with a multilayer structure of four or more layers. The deep learning means 111R receives speech feature quantities continuously, stores in each layer of the multilayer structure information in the time direction about the speech feature quantities, and uses that information to output a feature vector representing the predicted probability of which of the plurality of target characters the speech feature quantity corresponds to. The internal structure of the deep learning means 111R can be defined by parameters. For a BLSTM structure, the parameters are the number of layers and the memory cell. In the LSTM structure, the memory cell is a parameter that determines the number of dimensions of the vector storing information in the time direction; in other words, it represents how far apart on the time axis data is taken into the calculation. Memory cells in the LSTM structure are described in detail in Non-Patent Document 2, so their description is omitted here.

In the acoustic model learning means 100R shown in FIG. 2, the BLSTMs 30a, 30b, and 30c of the layers of the deep learning means 111R all have the same size. Specifically, the BLSTM of each layer outputs a 640-dimensional feature vector. In each of the BLSTMs 30a, 30b, and 30c, the memory cell that stores forward time-direction information and the memory cell that stores backward time-direction information also have the same size (C = 320 for each of the two memory cells), and each outputs a 320-dimensional vector. The value 320 of the memory cell C corresponds to the memory capacity of one memory cell C. The number of dimensions of the feature vector output by the memory cells of each layer changes depending on this value.
The deep learning means 111R takes a 120-dimensional speech feature quantity (feature vector) 6 as input and outputs a 640-dimensional feature vector from the final-layer BLSTM 30c.
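The dimension bookkeeping for FIG. 2 can be sketched as follows. The sketch assumes that each BLSTM layer concatenates its forward and backward outputs, which is consistent with two 320-dimensional outputs giving a 640-dimensional feature vector; the function name is hypothetical.

```python
def blstm_output_dims(input_dim, cells_per_direction):
    """Track feature-vector dimensions through stacked BLSTM layers.

    Each layer is assumed to concatenate a forward and a backward LSTM
    output, so its output dimension is twice the per-direction cell size.
    Returns one (input_dim, output_dim) pair per layer.
    """
    dims = []
    current = input_dim
    for cells in cells_per_direction:
        output = 2 * cells
        dims.append((current, output))
        current = output
    return dims


# Standard structure of FIG. 2: 120-dimensional input, C = 320 in every layer.
print(blstm_output_dims(120, [320, 320, 320]))
```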

The linear mapping means 112 takes as input the acoustic feature quantity (feature vector) expressed in the number of dimensions defined by the parameters of the deep learning means 111 (for a BLSTM structure, the number of layers and the memory cell). With this feature vector as input, the linear mapping means 112 applies a predetermined transformation matrix and thereby converts the dimension of the feature vector output by the deep learning means 111 by a predetermined operation. That is, the linear mapping means 112 converts the dimension of the feature vector output by the BLSTM 30c into the dimension of the character output vector 7. Here, the linear mapping means 112 applies a single affine transformation matrix to the output vector of the BLSTM 30c. Specifically, the linear mapping means 112 multiplies the 640-dimensional feature vector input from the BLSTM 30c by a matrix of 640 rows and 2934 columns (hereinafter written as a 640*2934 matrix; the same notation applies below) and outputs a 2934-dimensional vector. Here, 2934 is the number of Japanese hiragana, katakana, kanji, and symbols to be identified. The vector output by the linear mapping means 112 is input to the normalization means 113.
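The dimension conversion performed by the linear mapping can be sketched as a vector-matrix product. In the sketch below the weights and the input features are random stand-ins rather than learned values, the function name is hypothetical, and any bias term of the affine transformation is omitted because the text describes only the matrix multiplication.

```python
import random


def linear_map(vector, matrix):
    """Multiply a row vector of length m by an m-by-n matrix, giving length n."""
    n = len(matrix[0])
    return [sum(vector[i] * matrix[i][j] for i in range(len(vector)))
            for j in range(n)]


random.seed(0)
m, n = 640, 2934  # BLSTM output dimension and number of target characters
weights = [[random.uniform(-0.1, 0.1) for _ in range(n)] for _ in range(m)]
features = [random.uniform(-1.0, 1.0) for _ in range(m)]

scores = linear_map(features, weights)
print(len(scores))  # one score per target character
```

The single matrix holds 640 × 2934 weights, which is why the size of the character set directly drives the number of parameters in this stage.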

The normalization means 113 normalizes the objective function whose dimension has been adjusted by the linear mapping means 112. Using the Softmax function, the normalization means 113 performs this normalization and outputs the result as the 2934-dimensional character output vector 7. This makes it possible to finally discriminate among the 2934 labels. If the number of outputs to be discriminated in this speech recognition (number of characters = 2934) is changed, the number of parameters (number of dimensions) to be determined by learning changes accordingly.
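The Softmax normalization can be sketched as below. The three-element score vector stands in for the 2934-dimensional output of the linear mapping, and subtracting the maximum before exponentiation is a common numerical-stability measure assumed here, not something stated in this document.

```python
import math


def softmax(scores):
    """Normalize raw scores into probabilities that sum to 1."""
    peak = max(scores)                         # shift by the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])               # toy scores for three labels
best = max(range(len(probs)), key=probs.__getitem__)
print(best, round(sum(probs), 6))
```

The index of the largest probability identifies the most likely label for the frame.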

(First embodiment)
FIG. 3 is a schematic diagram illustrating the network structure of the acoustic model used by the acoustic model learning means according to the first embodiment. Here, the three-layer deep learning means 111R with layers of equal size, described with reference to FIG. 2, is generalized to N layers and denoted deep learning means 111. The deep learning means 111 is assumed to have N layers (N ≥ 3). Each BLSTM of FIG. 2 is illustrated and described as a pair of an Fw-LSTM and a Bw-LSTM. If the number of layers N is fixed, the number of dimensions of the deep learning means 111 depends on the memory cell C.

In the acoustic model learning means 100 (FIG. 1) of the Japanese acoustic model learning device 10 according to the first embodiment, a memory cell C is set in the forward (Fw) LSTM of the first layer of the deep learning means 111 in FIG. 3, and a memory cell C is also set in the backward (Bw) LSTM.
Similarly, in the N-th layer of the deep learning means 111, a memory cell C is set in the forward (Fw) LSTM and also in the backward (Bw) LSTM.
In contrast, in a predetermined n-th layer other than the first layer and the N-th layer of the deep learning means 111, a memory cell c (c < C) is set in the forward (Fw) LSTM and also in the backward (Bw) LSTM.
In the remaining layers, other than the first, n-th, and N-th layers, a memory cell C is set in the forward (Fw) LSTM and also in the backward (Bw) LSTM.

That is, of the N BLSTM layers (pairs of Fw-LSTM and Bw-LSTM) constituting the deep learning means 111, the memory cell c in the predetermined n-th layer, which is neither the first layer nor the N-th layer, is set smaller than the memory cell C of the other layers, as indicated by reference numeral 301.
Consequently, the dimension of the feature vector output by the n-th layer is smaller than that of the feature vectors output by the other layers, which realizes dimensional compression (a bottleneck structure) in the network structure of the acoustic model. This compresses the dimension of the feature vectors handled in the operations of the deep learning means 111. In FIG. 3, the width of each block representing an Fw-LSTM or Bw-LSTM indicates the size of its memory cell.

FIG. 4 is a diagram showing an example of the network structure of the acoustic model used in the acoustic model learning means according to the first embodiment.
As shown in FIG. 4, the acoustic model learning means 100A according to the first embodiment includes, as an example, a deep learning means 111A with the number of layers N set to 3, a linear mapping means 112, and a normalizing means 113. Components identical to those of the acoustic model learning means 100R shown in FIG. 2 are given the same reference numerals, and their description is omitted.
The deep learning means 111A consists of a first-layer BLSTM 30a, a second-layer BLSTM 30d, and a third-layer BLSTM 30c.
The first-layer BLSTM 30a and the final-layer (third-layer) BLSTM 30c each output a 640-dimensional feature vector, and the two memory cells in each of these layers each have C = 320.
In contrast, the second-layer BLSTM 30d outputs a 320-dimensional feature vector, and its two memory cells each have c = 160.
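To make the effect of the bottleneck concrete, the following minimal sketch counts BLSTM parameters for the three-layer structures of FIG. 2 and FIG. 4. It is an illustration under stated assumptions, not the implementation described here: it assumes a standard LSTM cell without peephole connections or projection (four gate blocks, each with input weights, recurrent weights, and a bias), assumes that all three layers of FIG. 2 use C = 320, and takes the 120-dimensional input feature from the evaluation setup reported later in this document. The function names are hypothetical.

```python
def lstm_params(input_dim: int, cell_dim: int) -> int:
    """Parameter count of one LSTM direction.

    Assumption: standard LSTM with 4 gate blocks, each holding input weights,
    recurrent weights, and one bias vector (no peepholes, no projection).
    """
    return 4 * cell_dim * (input_dim + cell_dim + 1)


def blstm_stack_params(input_dim: int, cell_dims: list[int]) -> int:
    """Total parameters of a stack of BLSTM layers (Fw-LSTM + Bw-LSTM pairs)."""
    total = 0
    layer_input = input_dim
    for cell_dim in cell_dims:
        total += 2 * lstm_params(layer_input, cell_dim)  # Fw and Bw
        layer_input = 2 * cell_dim  # Fw and Bw outputs are concatenated
    return total


# Memory cell size per direction in each layer.
standard = blstm_stack_params(120, [320, 320, 320])    # FIG. 2 (assumed sizes)
bottleneck = blstm_stack_params(120, [320, 160, 320])  # FIG. 4

print("standard  :", standard)
print("bottleneck:", bottleneck)
print("ratio     :", round(bottleneck / standard, 3))
```

Under these assumptions, shrinking the second layer reduces parameters twice: in the second layer itself, and in the third layer, whose input narrows from 640 to 320 dimensions.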

The network structure of the acoustic model used in the acoustic model learning means 100A is not limited to the BLSTM structure. The same approach applies to structures realized with LSTM, and to more general RNNs without an LSTM structure, as long as the length that determines how far apart on the time axis data is taken into the calculation can be set.
According to the Japanese acoustic model learning device 10 of the first embodiment, compressing the dimension of the feature vectors handled in the computation of the deep learning means 111A realizes dimensional compression of the network structure of the acoustic model, which reduces the number of parameters to be determined by learning the acoustic model.

(Second Embodiment)
FIG. 5 is a schematic diagram explaining the linear transformation part of the network structure of the acoustic model used in the acoustic model learning means according to the second embodiment. For this explanation, suppose that in the acoustic model learning means 100R of FIG. 2 the feature vector input to the linear mapping means 112 is 4-dimensional and the output vector representing characters is 100-dimensional. In the acoustic model learning means 100R of FIG. 2, the linear mapping means 112 multiplies the input 4-dimensional vector (a 1 × 4 matrix) by a 4 × 100 matrix and outputs a 100-dimensional vector (a 1 × 100 matrix), as shown in FIG. 5(a). In this case, the matrix multiplied with the input 4-dimensional vector has 4 × 100 = 400 elements. The number of elements of this matrix indicates the size of the number of parameters (number of dimensions) to be determined by learning the acoustic model.

The acoustic model learning means 100 (FIG. 1) of the Japanese acoustic model learning device 10 according to the second embodiment realizes dimensional compression of the network structure of the acoustic model by compressing the dimension of the feature vectors handled in the computation of the linear mapping means 112 of the acoustic model learning means 100R shown in FIG. 2. As a concrete example, instead of multiplying by the 4 × 100 matrix shown in FIG. 5(a), the second embodiment sequentially multiplies by the two matrices obtained by decomposing that matrix at rank r = 2, namely a 4 × 2 matrix and a 2 × 100 matrix, as shown in FIG. 5(b). In this case, the total number of matrix elements is 4 × 2 + 2 × 100 = 208, so the number of parameters to be determined by learning the acoustic model is greatly reduced compared with the 400 elements of FIG. 5(a).
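The toy case of FIG. 5 can be checked with a short sketch. It uses plain Python lists and arbitrary integer entries (hypothetical values chosen only for illustration) to show that multiplying sequentially by a 4 × 2 and a 2 × 100 matrix gives the same result as multiplying by their 4 × 100 product, while storing 208 elements instead of 400. The helper names are assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def num_elements(m):
    """Number of entries in a matrix given as a list of rows."""
    return len(m) * len(m[0])


# Arbitrary integer entries, for illustration only.
x = [[1, 2, 3, 4]]                                             # 1 x 4 input
u = [[(i + j) % 3 for j in range(2)] for i in range(4)]        # 4 x 2
v = [[(i * j + 1) % 5 for j in range(100)] for i in range(2)]  # 2 x 100

w = matmul(u, v)                      # equivalent 4 x 100 matrix (rank <= 2)
y_direct = matmul(x, w)               # FIG. 5(a): one multiplication
y_factored = matmul(matmul(x, u), v)  # FIG. 5(b): two sequential multiplications

print(y_direct == y_factored)
print(num_elements(w), num_elements(u) + num_elements(v))
```

The product `w` is built here only to verify the equivalence. In the structure of FIG. 5(b) the two small matrices are themselves the parameters to be learned, and the 4 × 100 matrix is never stored.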

Using the acoustic model learning means 100R shown in FIG. 2, the number of dimensions of the feature vector output by the deep learning means 111R and the number of dimensions of the vector output by the linear mapping means 112 are now described in more general terms. Let $D_L$ be the number of dimensions of the feature vector output by the BLSTM 30c, the final layer of the deep learning means 111R, and let $D_A$ be the number of dimensions of the vector output by the linear mapping means 112. The number of parameters $P_A$ in the linear mapping means 112 is then given by the following Formula (a). In Formula (a), the first term on the right side represents the linear transformation part (the transformation matrix), and the second term represents the translation component (the bias).

$P_A = D_L \times D_A + D_A$ … Formula (a)

When the transformation matrix of the linear mapping means 112 is decomposed at a low rank r, the number of parameters $P_r$ is given by the following Formula (b).

$P_r = D_L \times r + r \times D_A + D_A$ … Formula (b)

Here, when the low rank r satisfies the following Formula (1), $P_A > P_r$ holds, and the matrix decomposition reduces the number of parameters (number of dimensions).

$D_L \times D_A > D_L \times r + r \times D_A$ … Formula (1)
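Formula (1) can be rearranged into an explicit upper bound on r. The short derivation below is added for illustration and uses only the symbols defined above. Because $D_L + D_A > 0$, dividing both sides by it preserves the inequality.

```latex
D_L \times D_A > D_L \times r + r \times D_A
\;\Longleftrightarrow\;
D_L D_A > r \,(D_L + D_A)
\;\Longleftrightarrow\;
r < \frac{D_L D_A}{D_L + D_A}
```

For the toy case of FIG. 5 ($D_L = 4$, $D_A = 100$), the bound is $400 / 104 \approx 3.85$, so r = 2 satisfies Formula (1).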

FIG. 6 is a diagram showing an example of the network structure of the acoustic model used in the acoustic model learning means according to the second embodiment.
As shown in FIG. 6, the acoustic model learning means 100B according to the second embodiment includes a deep learning means 111R, a linear mapping means 112B, and a normalizing means 113. Components identical to those of the acoustic model learning means 100R shown in FIG. 2 are given the same reference numerals, and their description is omitted.
The linear mapping means 112B includes a first linear mapping means 40 and a second linear mapping means 42.
The first linear mapping means 40 multiplies the 640-dimensional feature vector input from the BLSTM 30c, the final (third) layer of the deep learning means 111R, by a 640 × 320 matrix and outputs a 320-dimensional vector.
The second linear mapping means 42 multiplies the 320-dimensional feature vector input from the first linear mapping means 40 by a 320 × 2934 matrix and outputs a 2934-dimensional vector. The vector output by the second linear mapping means 42 is input to the normalizing means 113.

This concrete example is explained by comparing FIG. 6 with FIG. 2.
In the acoustic model learning means 100R shown in FIG. 2, that is, when the linear mapping means 112 does not perform matrix decomposition, the matrix that the linear mapping means 112 multiplies with the input vector has
640 × 2934 = 1,877,760 elements.

In contrast, in the acoustic model learning means 100B according to the second embodiment, that is, when the linear mapping means 112B performs matrix decomposition, the total number of elements of the decomposed matrices is smaller. Specifically, the matrix that the first linear mapping means 40 multiplies with its input vector and the matrix that the second linear mapping means 42 multiplies with its input vector together have
640 × 320 + 320 × 2934 = 1,143,680 elements.
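The arithmetic above follows directly from Formula (a), Formula (b), and Formula (1). The following minimal sketch evaluates them for the dimensions of FIG. 6 and also finds the largest rank that still reduces the parameter count. The function names are hypothetical and introduced only for illustration.

```python
def full_params(d_l: int, d_a: int) -> int:
    """Formula (a): transformation matrix plus bias."""
    return d_l * d_a + d_a


def low_rank_params(d_l: int, d_a: int, r: int) -> int:
    """Formula (b): two factor matrices of rank r plus bias."""
    return d_l * r + r * d_a + d_a


def max_reducing_rank(d_l: int, d_a: int) -> int:
    """Largest integer r satisfying Formula (1), a strict inequality."""
    return (d_l * d_a - 1) // (d_l + d_a)


D_L, D_A, R = 640, 2934, 320

matrix_full = full_params(D_L, D_A) - D_A          # matrix elements only
matrix_low = low_rank_params(D_L, D_A, R) - D_A    # matrix elements only

print("matrix elements, no decomposition:", matrix_full)
print("matrix elements, rank 320        :", matrix_low)
print("reduction ratio                  :", round(1 - matrix_low / matrix_full, 3))
print("largest rank satisfying (1)      :", max_reducing_rank(D_L, D_A))
```

The bias term $D_A$ appears in both Formula (a) and Formula (b), so it cancels in the comparison. That is why Formula (1) involves only the matrix elements.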

Therefore, according to the Japanese acoustic model learning device 10 of the second embodiment, the linear mapping means 112B performs matrix decomposition and thereby compresses the dimension of the feature vectors handled in its computation. This realizes dimensional compression of the network structure of the acoustic model and greatly reduces the number of parameters to be determined by learning the acoustic model.
Moreover, the vector output by the first linear mapping means 40 of the linear mapping means 112B is compressed to 320 dimensions, which is expected to improve generalization ability.

(Third Embodiment)
The network structure of the acoustic model used in the acoustic model learning means according to the third embodiment combines the first and second embodiments. That is, the second layer of the deep learning means 111R shown in FIG. 2 is replaced with the BLSTM 30d to form the deep learning means 111A with a bottleneck structure, and the linear mapping means 112 shown in FIG. 2 is replaced with the linear mapping means 112B, which performs matrix decomposition. Together these realize dimensional compression of the network structure of the acoustic model. FIG. 7 is a diagram showing an example of the network structure of the acoustic model used in the acoustic model learning means according to the third embodiment.

As shown in FIG. 7, the acoustic model learning means 100C according to the third embodiment includes, as an example, a deep learning means 111A with the number of layers N set to 3, a linear mapping means 112B, and a normalizing means 113. In FIG. 7, components identical to those described with reference to FIG. 2, FIG. 4, and FIG. 6 are given the same reference numerals, and further description is omitted.
According to the Japanese acoustic model learning device 10 of the third embodiment, compressing the dimension of the feature vectors handled in the computations of both the deep learning means 111A and the linear mapping means 112B realizes dimensional compression of the network structure of the acoustic model, which reduces the number of parameters to be determined by learning the acoustic model.
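As a summary of how the four structures differ, the following sketch traces the vector dimension from the input features through each stage up to the input of the normalizing means 113. It is an illustration only: the function name is hypothetical, the 120-dimensional input is taken from the evaluation setup reported later in this document, and the assumption that all three layers of FIG. 2 use C = 320 is mine.

```python
def trace_dims(input_dim, cells_per_direction, num_chars, rank=None):
    """Return the vector dimension after each stage of the network."""
    dims = [input_dim]
    for cell in cells_per_direction:
        dims.append(2 * cell)   # BLSTM output: Fw and Bw concatenated
    if rank is not None:
        dims.append(rank)       # first linear mapping means 40
    dims.append(num_chars)      # linear mapping means 112, or second means 42
    return dims


configs = {
    "FIG. 2": trace_dims(120, [320, 320, 320], 2934),            # assumed sizes
    "FIG. 4": trace_dims(120, [320, 160, 320], 2934),            # bottleneck
    "FIG. 6": trace_dims(120, [320, 320, 320], 2934, rank=320),  # decomposition
    "FIG. 7": trace_dims(120, [320, 160, 320], 2934, rank=320),  # both
}

for name, dims in configs.items():
    print(name, " -> ".join(str(d) for d in dims))
```

In the FIG. 7 trace the dimension narrows to 320 twice, once inside the BLSTM stack and once inside the linear mapping, which is the combination the third embodiment describes.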

The embodiments of the present invention have been described above, but the present invention is not limited to them and can be implemented within a scope that does not change its gist. For example, although each embodiment was described as a Japanese acoustic model learning device, it can also be regarded as a Japanese acoustic model learning program written in a general-purpose or special-purpose computer language so as to enable the processing of each component of the device.

To confirm the performance of the Japanese acoustic model learning device according to each embodiment, speech recognition experiments were run on models trained with the network structures corresponding to the embodiments, and the results were compared. The evaluation speech was program audio (32k words = 32,000 words) from the June 2013 broadcasts of the General TV information program "ひるまえほっと" (Hirumae Hotto). For every method, the learning data were 1,023 hours of paired broadcast audio and subtitles, and the input features were 120-dimensional in total: a 40-dimensional filter bank plus delta and delta-delta. The language model had a 200k-word vocabulary and was trained on a total of 620 million words from NHK (registered trademark) manuscripts, subtitles of past programs, and similar sources. The networks used for learning were the standard structure of FIG. 2 and the three structures of FIG. 4, FIG. 6, and FIG. 7, and their learning results were compared. The results are shown in Table 1.

According to Table 1, compared with the standard structure of FIG. 2, every embodiment improved the word error rate (WER) and markedly shortened the learning time and the number of learning iterations.
In detail, WER improved more with the method that decomposes the matrix of the affine transformation, that is, in the second and third embodiments, which compress the dimension in the linear mapping means 112B, and generalization ability was higher. Of these, the model applying only the matrix decomposition of the affine transformation, that is, the second embodiment, improved WER by 20.2% over the method using the standard structure of FIG. 2. This is thought to be because compressing the dimension once to a number of dimensions comparable to the number of kanji readings (= 320) improved the generalization ability of the model.

In addition, the learning time was shortened more by the method that reduces the parameters of the BLSTM part, that is, in the first and third embodiments, which compress the dimension in the deep learning means 111A. Of these, the model adopting both the bottleneck structure and matrix decomposition, that is, the third embodiment, improved the average learning time per learning iteration by 9.3% over the method using the standard structure of FIG. 2. Because the BLSTM dimension reduced in these embodiments affects the time direction, it is thought to have shortened the learning time further than the matrix decomposition of the affine transformation did.

DESCRIPTION OF SYMBOLS
1 Japanese speech recognition apparatus
10 Japanese acoustic model learning device
100, 100A, 100B, 100C Acoustic model learning means
101 Acoustic model storage means
111, 111A, 111R Deep learning means
112, 112B Linear mapping means
113 Normalizing means
30a, 30b, 30c, 30d BLSTM
40 First linear mapping means
42 Second linear mapping means

Claims

An acoustic model learning device that learns an acoustic model which, by learning the correspondence between input speech and the characters to be output as the recognition result of that speech, converts input speech into characters using an end-to-end speech recognition method and outputs the characters, the device comprising:
deep learning means having a multi-layer neural network structure of three or more layers, to which speech features are input continuously, which stores time-direction information about the features in each layer of the multi-layer structure, and which, taking the time-direction information into account, outputs a feature vector representing the probability of which of a plurality of target characters is predicted from the speech features; and
linear mapping means that converts, by a predetermined operation, the dimension of the feature vector output by the deep learning means, by applying a predetermined transformation matrix to the feature vector output from the last layer of the deep learning means,
wherein the acoustic model is learned with the dimension of the feature vector handled in the computation of at least one of the deep learning means and the linear mapping means compressed.

The acoustic model learning device according to claim 1,
wherein, to compress the dimension of the feature vector, the deep learning means
predicts characters from the features of the input speech in a state where the number of dimensions of the vector storing time-direction information in a predetermined layer other than the first layer and the last layer of the multi-layer structure is set smaller than the number of dimensions of the vectors storing time-direction information in the first layer and the last layer.

The acoustic model learning device according to claim 1 or 2,
wherein, with $D_L$ denoting the number of dimensions of the feature vector output from the last layer of the deep learning means and $D_A$ denoting the number of dimensions of the vector output by the linear mapping means,
the linear mapping means, instead of applying the transformation matrix to the feature vector output from the last layer of the deep learning means, compresses the dimension of the feature vector by sequentially applying two matrices obtained by decomposing the transformation matrix at a rank r that satisfies the following Formula (1):
$D_L \times D_A > D_L \times r + r \times D_A$ … Formula (1)

An acoustic model learning program for causing a computer to function as the acoustic model learning device according to any one of claims 1 to 3.