JPH01177600A - Voice recognition error correcting device - Google Patents
Voice recognition error correcting device
- Publication number
- JPH01177600A (application JP63001488A)
- Authority
- JP
- Japan
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
Description
DETAILED DESCRIPTION OF THE INVENTION

[Field of Industrial Application]

The present invention relates to a speech recognition error correction device, and in particular to an improved device that corrects a time series of symbols containing errors obtained as a speech recognition result (for example, a phoneme symbol string produced by phoneme recognition, or a word symbol string produced by word recognition) by taking the preceding and following context within the time series into account.
One known method of correcting errors by taking the preceding and following context in the time series into account is as follows: the occurrence probability (conditional probability) of the central symbol, given fixed symbol strings before and after it, is computed from data of the recognition domain and stored in a table; when a time series containing errors is given, the symbol string is rewritten using the tabulated conditional probabilities so as to maximize the posterior probability. For example, when correction is performed considering three symbols before and three symbols after the center, the conditional probability P is expressed as follows.
(Equation 1) P(s0 | s1 s2 s3 s4 s5 s6 s7)

Here si denotes the i-th symbol, and P denotes the probability that the symbol s0 was misrecognized as the central symbol s4 of the window. The correction result for the central symbol s4 is determined from the window (s1 s2 s3 s4 s5 s6 s7). That is, the correction result ŝ4 is given by

(Equation 2) ŝ4 = argmax over s0 of P(s0 | s1 s2 s3 s4 s5 s6 s7).
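The table-lookup rule of Equations 1 and 2 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the symbols and the toy probability table are invented for the example.

```python
# Prior-art sketch: correct the center symbol of a 7-symbol window by
# looking up a tabulated conditional probability and taking the argmax
# over candidate center symbols s0 (Equation 2).

# prob_table[window] = P(s0 | s1 s2 s3 s4 s5 s6 s7); s4 is the noisy
# center observed in the window. Entries here are invented toy values.
prob_table = {
    ("k", "a", "t", "o", "a", "k", "a"): {"s": 0.7, "o": 0.2, "t": 0.1},
}

def correct_center(window, table):
    """Return the argmax candidate for the center of a 7-symbol window;
    if the context was never tabulated, keep the observed center s4."""
    context = tuple(window)
    if context not in table:
        return context[3]          # no evidence: leave the center as-is
    dist = table[context]
    return max(dist, key=dist.get)
```

With the toy table above, the observed center "o" in the context ("k","a","t","o","a","k","a") would be corrected to "s", the most probable true symbol.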
However, with the above method, widening the context taken into account makes the size of the conditional-probability table grow exponentially, which is impractical. That is, if L is the length of the context taken into account and M is the number of symbol types, then, as is clear from the defining equation of the conditional probability (Equation 1), the table size is

~O(M^L)

(where ~O( ) denotes the order of the size). Moreover, the computational cost of the optimization required to maximize the posterior probability can no longer be ignored. Furthermore, when the surrounding context itself contains many errors, stable error correction becomes difficult.
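The exponential growth ~O(M^L) is easy to make concrete; the symbol count 26 below is an assumed example, not a figure from the patent.

```python
# Why the table method does not scale: with M symbol types and a
# context of length L, the number of distinct contexts -- and hence
# table entries -- grows as M**L.
def table_entries(M, L):
    return M ** L

# e.g. an assumed inventory of 26 phoneme symbols:
sizes = [table_entries(26, L) for L in (3, 5, 7)]
# L=3 already needs 17,576 entries; L=7 needs over 8 billion.
```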
An object of the present invention is to provide a recognition error correction device that avoids the enormous storage requirements of the conditional-probability table described above, which make such a device difficult to realize; that adapts to the error tendencies of the acoustic recognition section by performing supervised learning of error correction using the recognition results of the acoustic recognition section; that greatly reduces the amount of computation, since no optimization calculation is needed at correction time; and that performs stable error correction by sequentially writing the correction results back into the input symbol string, so that correction can rely on a surrounding context containing fewer errors.
If the recognition error correction device according to the present invention is used as a post-processing section of the acoustic recognition section, substantially the same effect is obtained as if the recognition performance of the acoustic recognition section itself had been improved.
The speech recognition error correction device according to the present invention corrects, in speech recognition, the recognition errors contained in a time series of symbols obtained as a recognition result. It comprises: an input buffer section that stores the time series; an input window section that cuts out fixed-length symbol strings from the time series of symbols stored in the input buffer section, sequentially shifting the starting point one symbol at a time from the beginning; a backpropagation network model section that takes as input the fixed-length symbol string obtained as the output of the input window section and that has been trained in advance, with supervision, on symbol strings containing errors so as to output the correct answer for the central symbol; a rewriting section that, whenever the backpropagation network model section outputs a symbol, rewrites the corresponding symbol in the input buffer section with the corrected symbol; a first control section that then shifts the starting point of the input window section, which cuts out the fixed-length symbol string from the input buffer section, by one symbol and causes the backpropagation network model section to perform the correction operation for the next symbol; an output buffer section that stores the symbol string output by the backpropagation network model section; a second control section that, upon detecting that the symbol at the end of the symbol string in the input buffer section has been corrected, writes the contents of the output buffer section back into the input buffer section and causes the correction operation to be repeated; and a correction result output section that outputs the contents of the output buffer section as the correction result once the correction operation has been repeated a fixed number of times.
The basic principle of the present invention is to correct, in speech recognition, the time series of symbols containing misrecognitions obtained as the acoustic recognition result, using a backpropagation network model that has been trained in advance with supervision. The principle of the invention is explained in detail below.
The symbol string obtained as the output of the acoustic recognition section when input speech is recognized contains a number of errors, reflecting the error tendencies of the acoustic recognition section, due to recognition errors that are at present unavoidable. The present invention corrects this error-containing time series of symbols by taking its surrounding context into account, and thereby substantially improves the recognition performance of the acoustic recognition section.
For correction, a backpropagation network model, originally devised as a model for associative memory and pattern recognition, is used. For details of this model, see "Parallel Networks that Learn to Pronounce English Text", T. J. Sejnowski and C. R. Rosenberg, Complex Systems, Vol. 1 (1987), pp. 145-168.
The model is generally composed hierarchically of three types of layers, as shown in Fig. 2, called the input unit layer, the hidden unit layer, and the output unit layer respectively.
Processing units, simply called units, are placed in each layer. Each unit receives inputs from the units of the adjacent layer on the input side and sends its output to the units of the adjacent layer on the output side. The input-output response of each unit is given as follows:

(Equation 4) y_j^(n) = f(x_j^(n) − θ_j^(n))

(Equation 5) f(x) = (1 + e^(−x))^(−1)

Here x is the input to a unit, y is the output of the unit, θ is the threshold held by the unit, the superscript denotes the layer counted from the input layer (n = 1, ..., N), and the subscript is the number identifying the unit within a layer; x_j^(n) is the weighted sum of the outputs of the previous layer's units under the weights on the connections into unit j of the n-th layer, and f(x), as given in (Equation 5), is the nonlinear saturating response function common to all units. In short, each unit takes as input the difference between the weighted sum of the outputs of the units in the adjacent upper layer and a predetermined threshold, and determines its output by this graded threshold logic.
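The single-unit response of Equations 4 and 5 can be sketched as follows; the function name and the plain-list representation of weights are assumptions made for illustration.

```python
import math

def unit_output(inputs, weights, theta):
    """Response of one unit (Equations 4-5): the sigmoid of the
    weighted sum of the previous layer's outputs, minus the unit's
    threshold theta."""
    x = sum(w * y for w, y in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-(x - theta)))
```

With zero weights and zero threshold the weighted sum is 0, so the unit sits at the midpoint 0.5 of its saturating response.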
When data are presented to the input layer of this model, that information (the data) is propagated toward the output layer, being processed successively in each adjacent lower layer along the way. The outputs of the output layer's units then constitute the model's inference result for the given input data.
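The layer-by-layer propagation just described can be sketched as follows; the nested-list representation of a layer as (weights, thresholds) is an assumption for illustration.

```python
import math

def forward(layers, data):
    """Propagate an input vector through a stack of layers. Each layer
    is a pair (weights, thetas), where weights[j] holds the incoming
    weights of unit j and thetas[j] its threshold (Equations 4-5).
    The final layer's outputs are the model's inference result."""
    y = data
    for weights, thetas in layers:
        y = [1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(wj, y)) - t)))
             for wj, t in zip(weights, thetas)]
    return y
```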
In the present invention, the model is constructed so that, when a fixed-length symbol string cut out from an error-containing symbol string is presented to the input layer, the error correction result (inference result) for the central symbol of that fixed-length string is produced at the output layer.
Next, the learning method (backpropagation learning) that determines the connections between units so that the model performs the desired inference operation is described. The training data are either fixed-length symbol strings cut out from the error-containing symbol strings actually output by the acoustic recognition section for input speech, or pseudo-data created by stochastically adding errors to error-free symbol strings under an assumed confusion tendency between symbols. These data are presented to the input layer, the correct answer for the central symbol is presented at the output layer, and backpropagation learning is performed repeatedly. In the backpropagation method, the desired inference result (output data) for the input data is given as a teacher signal, and the connections between units are corrected repeatedly in the direction that reduces the difference (error) between the model's inference result and the teacher signal. In practice this corresponds to finding the inter-unit connections that minimize the error function determined from the model's outputs y_i^(N) at the output layer (the N-th layer) and the desired outputs (answers) ŷ_i for the given input, defined by the following equation:

(Equation 6) E = (1/2) Σ_i (y_i^(N) − ŷ_i)²

Since this function depends on every inter-unit connection through y^(N), the minimization can be carried out with E as the evaluation function. The resulting backpropagation learning algorithm is described in detail in the literature cited above.
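A single gradient step on the squared error of Equation 6 can be sketched for one output unit as follows. This is a minimal sketch of the learning rule, not the full multi-layer algorithm of the cited literature; the learning rate and function names are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_step(w, theta, inputs, target, lr=0.5):
    """One backpropagation-style update for a single sigmoid unit:
    descend the squared error E = (1/2)(y - t)^2 (Equation 6) with
    respect to the weights and the threshold."""
    x = sum(wi * yi for wi, yi in zip(w, inputs))
    y = sigmoid(x - theta)
    delta = (y - target) * y * (1.0 - y)            # dE/dx
    new_w = [wi - lr * delta * yi for wi, yi in zip(w, inputs)]
    new_theta = theta + lr * delta                  # since dE/dtheta = -delta
    return new_w, new_theta, 0.5 * (y - target) ** 2
```

Iterating the step drives the unit's output toward the teacher signal, so the reported error shrinks across iterations.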
When correction is performed using the trained model, fixed-length symbol strings are cut out sequentially from the symbol string output by the acoustic recognition section for the input speech, shifting the starting point one symbol at a time, and are fed to the backpropagation network model. When the model outputs the correction result for the central symbol of the input fixed-length symbol string, the corresponding symbol of the input symbol time series is rewritten with that symbol.
Because of this, the first half of the fixed-length symbol string presented to the model's input unit layer always consists of more reliable symbols that have already been corrected, so error correction by the model is performed more stably.
Since the symbol string corrected by the model in this way may still contain errors that could not be corrected, the entire symbol string once corrected by the model is given to the model again as input for another round of error correction, in order to correct the remaining errors. By repeating this process, a symbol string with progressively fewer errors is obtained.
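The overall correction procedure described above — cut out a window, rewrite the center symbol in place so later windows see corrected context, shift by one symbol, and repeat the whole pass a fixed number of times — can be sketched as follows. The padding symbol, the window length of 7, the pass count, and the stubbing of the trained network by an arbitrary callable `model` are assumptions made for illustration.

```python
def correct_sequence(symbols, model, window=7, passes=3, pad="#"):
    """Sliding-window error correction: `model` is any callable that
    maps a window (tuple of symbols) to a corrected center symbol --
    in the device it would be the trained backpropagation network."""
    half = window // 2
    buf = list(symbols)
    for _ in range(passes):
        padded = [pad] * half + buf + [pad] * half
        for i in range(len(buf)):
            win = tuple(padded[i:i + window])
            padded[i + half] = model(win)   # write the correction back
        buf = padded[half:len(padded) - half]
    return "".join(buf)
```

As a toy usage, a `model` that maps any observed "x" in the center position to "a" turns "axxba" into "aaaba", with the second "x" already seeing a corrected left context when its window is processed.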
Fig. 1 is a block diagram showing one embodiment of a device implementing the present invention. The input buffer section 1 stores the symbol string that is the acoustic recognition result. The input window section 2 cuts out fixed-length symbol strings from the input buffer section 1 in order, shifting the starting point one symbol at a time, and each time the backpropagation network model section 3 outputs an inference result for its input, the corresponding symbol in the input buffer section is rewritten with that output symbol. The output buffer section 4 stores the output of the backpropagation network model section 3, and the first control section 6 causes the backpropagation network to perform the correction operation. When the second control section 7 detects that correction has proceeded to the symbol at the end of the input buffer section 1, it writes the stored contents of the output buffer section 4 back into the input buffer section 1 and causes the correction operation to be performed again; after this process has been repeated a fixed number of times, the contents of the output buffer section 4 are written out to the correction result output section 8.
As described above, according to the present invention it is possible to correct errors in the symbol string output by the acoustic recognition section in a bottom-up manner by exploiting the surrounding context. Furthermore, writing the correction result back into the input symbol string one symbol at a time makes it possible to perform error correction using a more reliable context, and repeatedly re-inputting the model's entire output symbol string for further correction makes it possible to obtain a correction result with few remaining errors.
The effect of the present invention is, in the end, equivalent to improving the recognition performance of the acoustic recognition section, and it enables the speech recognition apparatus as a whole to achieve high accuracy.
The storage capacity required for execution, with L the length of the context taken into account, M the number of symbol types, and H the number of hidden units, is of the order

~O(L · M · H)

which is a great reduction compared with the prior art.
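The contrast between the ~O(L·M·H) storage of the network and the ~O(M^L) table of the prior art can be made concrete; the values M = 26 and H = 100 below are assumed for illustration, not taken from the patent.

```python
# Order-of-magnitude parameter count for the network-based corrector:
# L window positions, each a 1-of-M symbol code, fully connected to
# H hidden units, so roughly L * M * H connections dominate.
def network_order(L, M, H):
    return L * M * H

net_size = network_order(7, 26, 100)   # assumed M=26 symbols, H=100 units
table_size = 26 ** 7                   # the prior-art table for the same L
# The network needs on the order of tens of thousands of parameters
# where the table would need billions of entries.
```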
Fig. 1 is a block diagram showing one embodiment of the present invention. Fig. 2 is a diagram showing the general configuration of the backpropagation network model. 1 is the input buffer section, 2 the input window section, 3 the backpropagation network model section, 4 the output buffer section, 5 the rewriting section, 6 the first control section, 7 the second control section, and 8 the correction result output section.

Agent: Patent Attorney Susumu Uchihara
Claims (1)
A speech recognition error correction device for correcting, in speech recognition, recognition errors contained in a time series of symbols obtained as a recognition result, comprising: an input buffer section that stores said time series; an input window section that cuts out fixed-length symbol strings, sequentially shifting the starting point one symbol at a time from the beginning of the time series of symbols stored in said input buffer section; a backpropagation network model section that takes as input the fixed-length symbol string obtained as the output of said input window section and that has been trained in advance, with supervision, on symbol strings containing errors so as to output the correct answer for the central symbol; a rewriting section that, at the time said backpropagation network model section outputs a symbol, rewrites the corresponding symbol in the input buffer section with the corrected symbol; a first control section that then shifts the starting point of the input window section, which cuts out the fixed-length symbol string from said input buffer section, by one symbol and causes said backpropagation network model section to perform the correction operation for the next symbol; an output buffer section that stores the symbol string output by said backpropagation network model section; a second control section that, upon detecting that the symbol at the end of the symbol string in said input buffer section has been corrected, writes the contents of said output buffer section back into said input buffer section and causes said correction operation to be repeated; and a correction result output section that outputs the contents of the output buffer section as the correction result when said correction operation has been repeated a fixed number of times.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP63001488A JPH01177600A (en) | 1988-01-06 | 1988-01-06 | Voice recognition error correcting device |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| JPH01177600A true JPH01177600A (en) | 1989-07-13 |
| JPH0580000B2 JPH0580000B2 (en) | 1993-11-05 |
Family
ID=11502828
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| JP63001488A Granted JPH01177600A (en) | 1988-01-06 | 1988-01-06 | Voice recognition error correcting device |
Country Status (1)
| Country | Link |
|---|---|
| JP (1) | JPH01177600A (en) |
-
1988
- 1988-01-06 JP JP63001488A patent/JPH01177600A/en active Granted
Cited By (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7020606B1 (en) * | 1997-12-11 | 2006-03-28 | Harman Becker Automotive Systems Gmbh | Voice recognition using a grammar or N-gram procedures |
| US7343288B2 (en) | 2002-05-08 | 2008-03-11 | Sap Ag | Method and system for the processing and storing of voice information and corresponding timeline information |
| US7406413B2 (en) | 2002-05-08 | 2008-07-29 | Sap Aktiengesellschaft | Method and system for the processing of voice data and for the recognition of a language |
| JP2004341518A (en) * | 2003-04-25 | 2004-12-02 | Sony Internatl Europ Gmbh | Speech recognition processing method |
| DE102014201730A1 (en) | 2013-03-26 | 2014-10-02 | Toyota Boshoku Kabushiki Kaisha | INTERNAL COMPONENT FOR A VEHICLE |
| DE102014201730B4 (en) * | 2013-03-26 | 2017-10-19 | Toyota Boshoku Kabushiki Kaisha | INTERNAL COMPONENT FOR A VEHICLE |
| US11096436B2 (en) | 2013-08-29 | 2021-08-24 | Toyota Boshoku Kabushiki Kaisha | Beadings |
| JP2016161313A (en) * | 2015-02-27 | 2016-09-05 | 株式会社日立アドバンストシステムズ | Positioning system |
Also Published As
| Publication number | Publication date |
|---|---|
| JPH0580000B2 (en) | 1993-11-05 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Bahl et al. | Maximum mutual information estimation of hidden Markov model parameters for speech recognition | |
| US5577164A (en) | Incorrect voice command recognition prevention and recovery processing method and apparatus | |
| KR102167719B1 (en) | Method and apparatus for training language model, method and apparatus for recognizing speech | |
| JP6628350B2 (en) | Method for learning recurrent neural network, computer program therefor, and speech recognition device | |
| JPH0355837B2 (en) | ||
| US8494847B2 (en) | Weighting factor learning system and audio recognition system | |
| US7035802B1 (en) | Recognition system using lexical trees | |
| Franco et al. | Context-dependent connectionist probability estimation in a hybrid hidden Markov model-neural net speech recognition system | |
| US12125478B2 (en) | System and method for a natural language understanding system based on iterative intent detection and slot filling neural layers | |
| US11893983B2 (en) | Adding words to a prefix tree for improving speech recognition | |
| US20180293494A1 (en) | Local abbreviation expansion through context correlation | |
| JP2020020872A (en) | Discriminator, learnt model, and learning method | |
| US20220122586A1 (en) | Fast Emit Low-latency Streaming ASR with Sequence-level Emission Regularization | |
| JP2020042257A (en) | Voice recognition method and apparatus | |
| US20130138441A1 (en) | Method and system for generating search network for voice recognition | |
| JPH01177600A (en) | Voice recognition error correcting device | |
| JP2000298663A (en) | Recognition device using neural network and learning method thereof | |
| KR20200120595A (en) | Method and apparatus for training language model, method and apparatus for recognizing speech | |
| US8204738B2 (en) | Removing bias from features containing overlapping embedded grammars in a natural language understanding system | |
| JP2021039220A (en) | Speech recognition device, learning device, speech recognition method, learning method, speech recognition program, and learning program | |
| US20240185839A1 (en) | Modular Training for Flexible Attention Based End-to-End ASR | |
| JPH01177599A (en) | Voice recognition error correcting device | |
| KR102654803B1 (en) | Method for detecting speech-text alignment error of asr training data | |
| KR20210091919A (en) | Methods and Apparatus for learning deep learning models that can handle complex problems | |
| US7206738B2 (en) | Hybrid baseform generation |