JP2003345388A

JP2003345388A - Method, device, and program for voice recognition

Info

Publication number: JP2003345388A
Application number: JP2002149746A
Authority: JP
Inventors: Takafumi Koshinaka; 孝文越仲
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2002-05-23
Filing date: 2002-05-23
Publication date: 2003-12-03
Anticipated expiration: 2022-05-23
Also published as: JP4239479B2

Abstract

PROBLEM TO BE SOLVED: To reduce recognition errors in a speech recognition device.

SOLUTION: A speech signal input means 101 receives a speech signal, and a recognition result candidate generation means 102 outputs N word strings, the word duration sequence corresponding to each of the N word strings, and N scores, using acoustic models and a word dictionary prepared in advance. A duration normalization means 103 normalizes the durations according to the speaking rate by a mapping defined by a linear function, whose parameters are estimated by a normalization constant calculation means 105. A recognition result candidate correction means 104 corrects the score of each recognition result candidate according to the likelihood of the sequence of durations of the candidate word string.

COPYRIGHT: (C)2004, JPO

Description

Detailed Description of the Invention

[0001]

[Field of the Invention] The present invention relates to a speech recognition device, a speech recognition method, and a speech recognition program, and more particularly to a speech recognition device, method, and program that recognize continuously uttered speech and convert it into a symbol string representing the content of the utterance.

[0002]

[Description of the Related Art] Conventionally, speech recognition devices of this kind, and recording media storing speech recognition programs of this kind, have mainly worked as follows. Left-to-right hidden Markov models (HMMs) are prepared as standard patterns of phonemes (vowels a, i, u, ...; consonants k, s, t, ...). These standard patterns are repeatedly matched against each part of the input speech signal to derive the phoneme string that best matches the signal, and the symbol string uniquely determined by that phoneme string (for example, a word string) is output as the recognition result.

[0003] In HMM-based speech recognition, any portion of the speech signal can be a matching target, so a consonant may match well with an extremely long time interval, or a vowel with a short one. In most cases, however, vowels last relatively long and consonants end quickly, so such matches can cause recognition errors.

[0004] In general, the length of time for which a unit such as a vowel, consonant, syllable, or word continues in a speech signal is called its duration. HMM-based speech recognition has no mechanism for controlling duration explicitly, and this has been one of the major drawbacks of HMMs. One remedy is to introduce a duration distribution model, as described in Seiichi Nakagawa, "Speech Recognition by Stochastic Models", IEICE, 1988.

[0005] In this method, a probability distribution model of speech durations is incorporated into the Viterbi search that finds the optimal state transition sequence of the HMM, and duration is controlled through the resulting likelihood calculation. A simpler substitute is a two-step procedure: an ordinary Viterbi search computes a plurality of recognition result candidates (first pass), and then the plausibility of each candidate's sequence of durations is computed from the duration probability distribution model and used to correct the candidate's score (likelihood) (second pass). The probability distribution is, for example, a normal distribution defined by a mean and a variance estimated in advance by training.
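The two-pass procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from the cited literature: the function names `log_normal_pdf` and `rescore`, the `(words, durations, score)` tuple layout, the `weight` parameter, and the toy statistics are all hypothetical. Scores are assumed to be log-likelihoods, so the duration term can simply be added.

```python
import math


def log_normal_pdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def rescore(candidates, stats, weight=1.0):
    """Second pass: add a duration log-likelihood to each first-pass score.

    candidates: list of (words, durations, score) tuples from the first pass.
    stats: dict mapping word -> (mean, variance) of its duration.
    Returns the (words, corrected_score) pair with the highest corrected score.
    """
    rescored = []
    for words, durations, score in candidates:
        duration_ll = sum(
            log_normal_pdf(x, *stats[w]) for w, x in zip(words, durations)
        )
        rescored.append((words, score + weight * duration_ll))
    return max(rescored, key=lambda item: item[1])


# Hypothetical duration statistics in frames: word -> (mean, variance).
stats = {"a": (10.0, 4.0), "b": (20.0, 4.0)}
# Hypothetical first-pass output: the top candidate has implausible durations.
n_best = [
    (["a", "b"], [30.0, 5.0], -1.0),
    (["b", "a"], [20.0, 10.0], -1.5),
]
best_words, best_score = rescore(n_best, stats)
```

In this toy input the acoustically second-ranked candidate wins after the second pass, because its durations sit exactly on the stored means while the first candidate's durations are far from them.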

[0006] FIG. 5 is a block diagram showing an example of an embodiment of a speech recognition device that adopts the latter, two-step method among the conventional techniques described above.

[0007] This example comprises: a speech signal input means 201 that receives a speech signal input from outside; a recognition result candidate generation means 202 that receives the time series of the speech signal from the speech signal input means 201, computes a sequence of features useful for recognition (for example, mel-cepstrum), finds, by Viterbi search or the like, the top N sequences of standard patterns (phonemes) that best match the feature sequence, using an acoustic model (a set of standard patterns) prepared in advance and a grammar (word dictionary) prepared as needed, and outputs the N symbol strings (sequences of characters or words) uniquely determined by the N phoneme strings, the symbol duration sequence corresponding to each of the N symbol strings, and N scores (likelihoods); a recognition result candidate correction means 203 that receives the symbol strings, duration sequences, and scores from the recognition result candidate generation means 202, computes a score representing the plausibility of the durations on the basis of a duration distribution model, and corrects the score by adding to or subtracting from the score issued by the recognition result candidate generation means 202; and a duration statistics storage means 204 that stores the mean and variance values of the durations that the recognition result candidate correction means 203 refers to when computing the score representing the plausibility of the durations.

[0008] The content of the information that the recognition result candidate correction means 203 receives from the recognition result candidate generation means 202, and the procedure for processing it, are now described in a little more detail. Each of the N recognition result candidates generated by the recognition result candidate generation means 202 is a tuple of a symbol string (for example, a sequence of words), a sequence of symbol durations, and a score (likelihood). One such tuple is written (W_{k,1}, W_{k,2}, ..., W_{k,n(k)}; X_{k,1}, X_{k,2}, ..., X_{k,n(k)}; L_k) and is called the k-th candidate. The recognition result candidate correction means 203 receives N such tuples (the 1st through N-th candidates) from the recognition result candidate generation means 202, where k = 1, 2, ..., N.

[0009] It is assumed that L_1 > L_2 > ... > L_N. W_{k,1}, W_{k,2}, ..., W_{k,n(k)} is the symbol string (for example, a sequence of words) that constitutes the k-th recognition result candidate, and X_{k,1}, X_{k,2}, ..., X_{k,n(k)} is the duration sequence in one-to-one correspondence with that symbol string. L_k is a score or likelihood computed from the standpoint of the acoustic model (the degree of match between the input speech signal and the standard patterns) and represents the acoustic plausibility of the k-th candidate. An example of computing L_k is given in Lawrence Rabiner et al. (translation supervised by Sadaoki Furui), "Fundamentals of Speech Recognition (Vol. 2)", NTT Advanced Technology Corporation, 1995, pp. 188-194.
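The candidate tuple (W_k; X_k; L_k) and the ordering L_1 > L_2 > ... > L_N can be held in a small data structure. The sketch below is only one possible representation; the class name `Candidate`, the helper `is_ranked`, and the sample words and numbers are assumptions made for illustration and do not come from the patent.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Candidate:
    """k-th recognition result candidate (W_k; X_k; L_k)."""

    words: Tuple[str, ...]        # W_{k,1}, ..., W_{k,n(k)}
    durations: Tuple[float, ...]  # X_{k,1}, ..., X_{k,n(k)}
    score: float                  # L_k (acoustic score or likelihood)

    def __post_init__(self):
        # Symbols and durations are in one-to-one correspondence.
        if len(self.words) != len(self.durations):
            raise ValueError("words and durations must have the same length")


def is_ranked(candidates: List[Candidate]) -> bool:
    """True when L_1 > L_2 > ... > L_N holds for the given list."""
    return all(a.score > b.score for a, b in zip(candidates, candidates[1:]))


# Hypothetical 2-best list; durations in frames, scores as log-likelihoods.
n_best = [
    Candidate(("tokyo", "e"), (42.0, 11.0), -120.5),
    Candidate(("kyoto", "e"), (40.0, 13.0), -123.0),
]
```

Checking the length invariant at construction time keeps the one-to-one correspondence between symbols and durations from being silently violated further down the pipeline.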

[0010] The recognition result candidate correction means 203 checks the plausibility of each recognition result candidate from the standpoint of duration and corrects the score L_k of the k-th candidate accordingly. That is, if the duration X_{k,i} corresponding to element W_{k,i} of the k-th candidate symbol string deviates from its mean μ(W_{k,i}), the score L_k is revised downward. Specifically, the score is corrected according to the following expression (Equation 1).

[0011]

[Equation 1]

[0012] Here μ(W_{k,i}) and σ(W_{k,i}) are, respectively, the mean and variance of the duration of the symbol (for example, a word) W_{k,i} in the recognition result symbol string. The mean μ(W) and variance σ(W) for every element W are stored in the duration statistics storage means 204. α is a weighting coefficient (a constant) applied to the plausibility of the durations. The recognition result candidate correction means 203 reads the values of μ(W) and σ(W) from the duration statistics storage means 204 as needed. These values are assumed to have been computed and stored in advance by running a Viterbi search over large-scale training data.
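The offline step that fills the duration statistics store can be sketched as below. It assumes that the Viterbi alignment of the training data has already been reduced to (word, duration) pairs; the function name `estimate_duration_stats`, the pair format, the sample data, and the choice of the population variance are assumptions for illustration, since the text does not specify them.

```python
from collections import defaultdict
from statistics import fmean, pvariance


def estimate_duration_stats(aligned):
    """Return {word: (mean, variance)} from (word, duration) pairs.

    `aligned` stands in for the output of a Viterbi alignment over
    training data, which is not reproduced here.
    """
    by_word = defaultdict(list)
    for word, duration in aligned:
        by_word[word].append(duration)
    return {
        word: (fmean(values), pvariance(values))
        for word, values in by_word.items()
    }


# Hypothetical aligned training data; durations in frames.
training_alignment = [
    ("tokyo", 40.0),
    ("tokyo", 44.0),
    ("e", 10.0),
    ("e", 14.0),
]
stats = estimate_duration_stats(training_alignment)
```

A dictionary keyed by word mirrors the role of the storage means: the correction step only needs a lookup of μ(W) and σ(W) for each word it encounters.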

[0013] However, the conventional technique described above loses robustness when the speaking rate is extremely fast or slow, or when sudden noise overlaps the beginning or end of the utterance.

[0014] Normally, the statistics μ(W) and σ(W) stored in the duration statistics storage means 204 are means and variances computed from durations extracted from speech data with a variety of speaking rates. They are therefore correctly estimated statistics for utterances at a standard speaking rate, but not accurate estimates for utterances that depart from the standard. If the score correction of Equation 1 is performed with inaccurate statistics, the score is corrected inappropriately, which causes recognition errors. In other words, the statistics μ(W) and σ(W) need to be adjusted according to variations in speaking rate (speaking rate normalization).

[0015] One known measure for securing robustness against the speaking rate variations described above is the technique of "Information processing apparatus and method, and providing medium" described in Japanese Patent Laid-Open No. 11-311994. This technique secures a degree of robustness against speaking rate variations by defining the "normalized duration" as the ratio of each word's duration to the sum of the durations of all the words.

[0016]

[Problems to Be Solved by the Invention] However, the technique described in the above publication can lose robustness when the durations of some words are severely disturbed, for example by sudden noise overlapping the beginning or end of the utterance. If noise overlaps the signal immediately before the utterance, for instance, only the first word's duration becomes extremely long, and the "sum of the durations of all the words" increases greatly. Normalization then divides by a value larger than it should be, so each word's normalized duration becomes smaller than it should be. As a result, the durations are normalized incorrectly and robustness to speaking rate is impaired. The same problem clearly arises when noise overlaps the signal immediately after the utterance. As is well known, this kind of problem is especially likely when the utterance contains few words.
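The failure mode described above can be reproduced with a few numbers. The sketch below implements ratio normalization as the text defines it (each word duration divided by the sum of all word durations). The function name `ratio_normalize` and the duration values are hypothetical, chosen only to make the effect visible.

```python
def ratio_normalize(durations):
    """Divide each word duration by the total duration of the utterance."""
    total = sum(durations)
    return [d / total for d in durations]


clean = [30.0, 40.0, 30.0]  # hypothetical word durations in frames
noisy = [90.0, 40.0, 30.0]  # leading noise absorbed into the first word

clean_norm = ratio_normalize(clean)
noisy_norm = ratio_normalize(noisy)

# The second and third words were not touched by the noise, yet their
# normalized durations shrink because the denominator grew from 100 to 160.
shrink = [n / c for n, c in zip(noisy_norm[1:], clean_norm[1:])]
```

With only three words, a single corrupted duration changes the denominator by 60 percent, and every unaffected word is scaled by the same factor of 100/160. With more words the same noise would be diluted, which matches the remark that short utterances are the most vulnerable.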

[0017] A further potential problem is that the prior art in the above publication assumes that duration is proportional to speaking rate. In that technique, durations were normalized by multiplying each word's duration by a constant coefficient, namely the reciprocal of the sum of the word durations. Words, however, generally consist of vowels and consonants. Vowel durations vary directly with speaking rate, in a proportional relationship, whereas consonant durations are known to depend little on speaking rate and to stay roughly constant. That is, a duration is generally considered to consist of a linear component that varies with speaking rate and a bias component that does not depend on speaking rate. Appropriate duration normalization is therefore difficult to achieve without combining a step that multiplies by one constant with a step that adds another constant.
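A toy model makes the argument concrete. Assume, purely for illustration, that a word observed at speaking rate r has duration x(r) = v / r + c, where v is a vowel part that scales with the rate and c is a consonant part that does not. Solving for the duration at the standard rate r = 1 gives x(1) = r * x(r) + c * (1 - r), which is a multiplication followed by an addition. The model, the function names, and the millisecond values below are assumptions and are not taken from the patent.

```python
def word_duration(vowel, consonant, rate):
    """Toy model: the vowel part scales with 1/rate, the consonant part is fixed."""
    return vowel / rate + consonant


def to_standard_rate(duration, consonant, rate):
    """Map a duration observed at `rate` back to rate 1 using a * x + b."""
    a = rate                      # multiplicative constant
    b = consonant * (1.0 - rate)  # additive (bias) constant
    return a * duration + b


vowel, consonant = 200.0, 60.0  # hypothetical milliseconds

standard = word_duration(vowel, consonant, rate=1.0)  # 260.0
fast = word_duration(vowel, consonant, rate=2.0)      # 160.0

scaled_only = 2.0 * fast                              # 320.0, overshoots 260.0
affine = to_standard_rate(fast, consonant, rate=2.0)  # 260.0
```

Scaling alone overshoots because it also stretches the consonant part, which never shrank in the first place. The additive constant removes exactly that excess. In this toy model the bias depends on the consonant part of the word, so a single pair of constants for a whole utterance is an approximation.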

[0018] As described above, what is desired is a technique that normalizes durations without losing accuracy even when the durations of some words are severely disturbed, as when noise overlaps the beginning or end of the utterance, and that takes into account both the component that depends on speaking rate and the component that does not, thereby appropriately quantifying the plausibility of the durations and reflecting it in the speech recognition score.

[0019]

[Means for Solving the Problems] As a general tendency, when noise overlaps the beginning or end of an utterance, the duration of the first or last word may increase, but the durations of the words located in the middle of the utterance are hardly affected by the noise. Therefore, if the coefficients for duration normalization are determined so that the deviation of each word's duration from its mean becomes evenly small across all the words in the utterance, then even if the durations of some words are wrong, the durations of the remaining words are also used, the influence of the wrong durations is mitigated, and robust normalization can be expected. Furthermore, by making the normalization a process of multiplying by one constant and adding another, duration normalization consistent with the phenomenology of speaking rate variation can be expected.

[0020] To solve the problems described above, the speech recognition device according to the present invention, and the recording medium storing the speech recognition program according to the present invention, comprise, in addition to the conventional components, a means for normalizing durations and a means for calculating the normalization constants (a proportionality constant and a bias constant) that the normalizing means needs when normalizing durations. The means for calculating the normalization constants uses the recognition result word string, the duration sequence, and duration statistics prepared in advance to calculate the normalization constants by minimizing the squared error, so that the difference between each word's duration and its mean becomes as small as possible.
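The squared-error minimization for the two constants has a closed form, sketched below. The objective used here, the sum over words of w_i * (a * x_i + b - mu_i)^2, is an assumption: the text states only that the constants are found by minimizing the squared error between durations and their means. The optional weighting by 1/variance, the fallback for the degenerate case, and all names and numbers are likewise hypothetical.

```python
def fit_normalization_constants(durations, means, variances=None):
    """Return (a, b) minimizing sum_i w_i * (a * x_i + b - mu_i) ** 2.

    w_i is 1 / variance_i when variances are given, otherwise 1.
    Assumes at least one word.
    """
    if variances:
        weights = [1.0 / v for v in variances]
    else:
        weights = [1.0] * len(durations)
    sw = sum(weights)
    sx = sum(w * x for w, x in zip(weights, durations))
    sm = sum(w * m for w, m in zip(weights, means))
    sxx = sum(w * x * x for w, x in zip(weights, durations))
    sxm = sum(w * x * m for w, x, m in zip(weights, durations, means))
    denom = sw * sxx - sx * sx
    if abs(denom) < 1e-12:
        # Degenerate case (for example a single word): keep the scale
        # at 1 and fit the bias only. This fallback is an assumption.
        return 1.0, (sm - sx) / sw
    a = (sw * sxm - sx * sm) / denom
    b = (sm - a * sx) / sw
    return a, b


def normalize(durations, a, b):
    """Apply the linear mapping a * x + b to every duration."""
    return [a * x + b for x in durations]


# Hypothetical observed durations and stored means for a three-word candidate.
observed = [10.0, 20.0, 30.0]
means = [25.0, 45.0, 65.0]
a, b = fit_normalization_constants(observed, means)
normalized = normalize(observed, a, b)
```

Because every word contributes one term to the objective, a single word with a corrupted duration pulls on the fit but cannot dictate it, unlike a denominator built from the total duration.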

[0021] A first speech recognition device of the present invention comprises: a speech signal input means that receives a speech signal input from outside; a recognition result candidate generation means that receives the speech signal from the speech signal input means, finds, in descending order of score, a plurality of high-scoring word strings that fit the speech signal on the basis of a set of standard patterns prepared in advance and a grammar, and outputs the plurality of word strings, the word duration sequence corresponding to each of the word strings, and the score corresponding to each of the word strings; a duration normalization means that receives the word strings, duration sequences, and scores from the recognition result candidate generation means and normalizes the durations from the word strings and durations using normalization constants; a recognition result candidate correction means that corrects the scores output by the recognition result candidate generation means on the basis of the mean and variance of the durations and outputs them; a normalization constant calculation means that calculates, on the basis of the mean and variance of the durations, the normalization constants that the duration normalization means uses to normalize the durations; and a duration statistics storage means that stores the mean and variance of the durations used when the recognition result candidate correction means calculates scores and when the normalization constant calculation means calculates the normalization constants.
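The data flow among the means listed above can be sketched as a single function with the constant calculation passed in as a pluggable component. This is only a skeleton under stated assumptions: the function name `recognize`, the tuple layout, the per-candidate fitting, the parameter `alpha`, and in particular the penalty form (a variance-weighted squared deviation subtracted from the score) are hypothetical, since Equation 1 is not reproduced in this text.

```python
def recognize(candidates, stats, fit_constants, alpha=0.1):
    """Return (words, corrected_score) of the best candidate after correction.

    candidates: list of (words, durations, score) tuples, standing in for
        the output of the recognition result candidate generation means.
    stats: dict word -> (mean, variance), standing in for the duration
        statistics storage means.
    fit_constants: callable(durations, means) -> (a, b), standing in for
        the normalization constant calculation means.
    """
    best = None
    for words, durations, score in candidates:
        means = [stats[w][0] for w in words]
        variances = [stats[w][1] for w in words]
        # Normalization constant calculation, done per candidate here.
        a, b = fit_constants(durations, means)
        # Duration normalization by the linear mapping a * x + b.
        normalized = [a * x + b for x in durations]
        # Score correction; this penalty form is an assumption.
        penalty = sum(
            (x - m) ** 2 / v for x, m, v in zip(normalized, means, variances)
        )
        corrected = score - alpha * penalty
        if best is None or corrected > best[1]:
            best = (words, corrected)
    return best


# Hypothetical statistics and a speaker talking at half the usual rate.
stats = {"a": (10.0, 4.0), "b": (20.0, 4.0)}
n_best = [
    (("b", "a"), (20.0, 40.0), -1.0),
    (("a", "b"), (20.0, 40.0), -1.2),
]
# A fixed halving stands in for a real least-squares fit in this usage example.
result = recognize(n_best, stats, lambda xs, ms: (0.5, 0.0))
```

Passing the fitting routine as an argument keeps the skeleton independent of how the constants are estimated, which mirrors the separation between the normalization means and the constant calculation means in the text.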

[0022] A second speech recognition device of the present invention comprises: a speech signal input means that receives a speech signal input from outside; a recognition result candidate generation means that receives the speech signal from the speech signal input means, finds, in descending order of score, a plurality of high-scoring word strings that fit the speech signal on the basis of a set of standard patterns prepared in advance and a grammar, and outputs the plurality of word strings, the word duration sequence corresponding to each of the word strings, and the score corresponding to each of the word strings; a duration normalization means that receives the word strings, duration sequences, and scores from the recognition result candidate generation means and normalizes the durations from the word strings and durations using normalization constants; a recognition result candidate correction means that corrects the scores output by the recognition result candidate generation means on the basis of the mean and variance of the durations and outputs them; a normalization constant calculation means that calculates, on the basis of the mean and variance of the durations, a first-type normalization constant that multiplies the duration by a constant and a second-type normalization constant that adds a constant to the duration, as the normalization constants that the duration normalization means uses to normalize the durations; and a duration statistics storage means that stores the mean and variance of the durations used when the recognition result candidate correction means calculates scores and when the normalization constant calculation means calculates the normalization constants.

[0023] A third speech recognition device of the present invention comprises: a speech signal input means that receives a speech signal input from outside; a recognition result candidate generation means that receives the speech signal from the speech signal input means, finds, in descending order of score, a plurality of high-scoring word strings that fit the speech signal on the basis of a set of standard patterns prepared in advance and a grammar, and outputs the plurality of word strings, the word duration sequence corresponding to each of the word strings, and the score corresponding to each of the word strings; a duration normalization means that receives the word strings, duration sequences, and scores from the recognition result candidate generation means and normalizes the durations from the word strings and durations using normalization constants; a recognition result candidate correction means that corrects the scores output by the recognition result candidate generation means on the basis of the mean and variance of the durations and outputs them; a normalization constant calculation means that calculates, on the basis of the mean and variance of the durations, the normalization constants that the duration normalization means uses to normalize the durations, such that the squared error between the duration sequence and the means of the durations is minimized; and a duration statistics storage means that stores the mean and variance of the durations used when the recognition result candidate correction means calculates scores and when the normalization constant calculation means calculates the normalization constants.

[0024] A first speech recognition method of the present invention includes: a first step of receiving a speech signal input from outside; a second step of receiving the speech signal from the first step, finding, in descending order of score, a plurality of high-scoring word strings that fit the speech signal on the basis of a set of standard patterns prepared in advance and a grammar, and outputting the plurality of word strings, the word duration sequence corresponding to each of the word strings, and the score corresponding to each of the word strings; a third step of receiving the word strings, duration sequences, and scores from the second step and normalizing the durations from the word strings and durations using normalization constants; a fourth step of correcting the scores output by the second step on the basis of the mean and variance of the durations and outputting them; a fifth step of calculating, on the basis of the mean and variance of the durations, the normalization constants that the third step uses to normalize the durations; and a sixth step of storing the mean and variance of the durations used when the fourth step calculates scores and when the fifth step calculates the normalization constants.

[0025] A second speech recognition method of the present invention includes: a first step of receiving a speech signal input from outside; a second step of receiving the speech signal from the first step, finding, in descending order of score, a plurality of high-scoring word strings that fit the speech signal on the basis of a set of standard patterns prepared in advance and a grammar, and outputting the plurality of word strings, the word duration sequence corresponding to each of the word strings, and the score corresponding to each of the word strings; a third step of receiving the word strings, duration sequences, and scores from the second step and normalizing the durations from the word strings and durations using normalization constants; a fourth step of correcting the scores output by the second step on the basis of the mean and variance of the durations and outputting them; a fifth step of calculating, on the basis of the mean and variance of the durations, a first-type normalization constant that multiplies the duration by a constant and a second-type normalization constant that adds a constant to the duration, as the normalization constants that the third step uses to normalize the durations; and a sixth step of storing the mean and variance of the durations used when the fourth step calculates scores and when the fifth step calculates the normalization constants.

[0026] A third speech recognition method of the present invention includes: a first step of receiving a speech signal input from outside; a second step of receiving the speech signal from the first step, finding, in descending order of score, a plurality of high-scoring word strings that fit the speech signal on the basis of a set of standard patterns prepared in advance and a grammar, and outputting the plurality of word strings, the word duration sequence corresponding to each of the word strings, and the score corresponding to each of the word strings; a third step of receiving the word strings, duration sequences, and scores from the second step and normalizing the durations from the word strings and durations using normalization constants; a fourth step of correcting the scores output by the second step on the basis of the mean and variance of the durations and outputting them; a fifth step of calculating, on the basis of the mean and variance of the durations, the normalization constants that the third step uses to normalize the durations, such that the squared error between the duration sequence and the means of the durations is minimized; and a sixth step of storing the mean and variance of the durations used when the fourth step calculates scores and when the fifth step calculates the normalization constants.

【００２７】本発明の第１の音声認識プログラムは、外
部から入力される音声信号を受け取る第１の手順と、前
記第１の手順から音声信号を受け取り、あらかじめ準備
された標準パタンのセット、および、文法に基づいて、
音声信号に適合するスコアの高い単語列を上位から順に
複数個求め、複数個の単語列、複数個の単語列それぞれ
に対応する単語継続時間長系列、および、複数個の単語
列それぞれに対応するスコアを出力する第２の手順と、
前記第２の手順から単語列、継続時間長系列およびスコ
アを受け取り、単語列、継続時間長から正規化定数を用
いて継続時間長の正規化を行う第３の手順と、前記第２
の手順の出力したスコアを継続時間長の平均、分散に基
づいて補正し、出力する第４の手順と、前記第３の手順
が継続時間長の正規化のために用いる正規化定数を継続
時間長の平均、分散に基づいて計算する第５の手順と、
前記第４の手順がスコアを計算する際、および、前記第
５の手順が正規化定数を計算する際に使用する継続時間
長の平均、分散を格納する第６の手順と、をコンピュー
タに実行させることを特徴とする。A first voice recognition program of the present invention causes a computer to execute: a first procedure for receiving a voice signal input from the outside; a second procedure for receiving the voice signal from the first procedure, finding, based on a set of standard patterns prepared in advance and a grammar, a plurality of high-scoring word strings that match the voice signal in order from the top, and outputting the plurality of word strings, a word duration sequence corresponding to each of the word strings, and a score corresponding to each of the word strings; a third procedure for receiving the word strings, duration sequences, and scores from the second procedure and normalizing the durations using normalization constants; a fourth procedure for correcting the scores output by the second procedure based on the mean and variance of the durations and outputting them; a fifth procedure for calculating, based on the mean and variance of the durations, the normalization constants used by the third procedure to normalize the durations; and a sixth procedure for storing the mean and variance of the durations used when the fourth procedure calculates the score and when the fifth procedure calculates the normalization constants.

【００２８】本発明の第２の音声認識プログラムは、外
部から入力される音声信号を受け取る第１の手順と、前
記第１の手順から音声信号を受け取り、あらかじめ準備
された標準パタンのセット、および、文法に基づいて、
音声信号に適合するスコアの高い単語列を上位から順に
複数個求め、複数個の単語列、複数個の単語列それぞれ
に対応する単語継続時間長系列、および、複数個の単語
列それぞれに対応するスコアを出力する第２の手順と、
前記第２の手順から単語列、継続時間長系列およびスコ
アを受け取り、単語列、継続時間長から正規化定数を用
いて継続時間長の正規化を行う第３の手順と、前記第２
の手順の出力したスコアを継続時間長の平均、分散に基
づいて補正し、出力する第４の手順と、継続時間長の定
数倍を行う第１種の正規化定数、および、継続時間長の
定数加算を行う第２種の正規化定数を前記第３の手順が
継続時間長の正規化のために用いる正規化定数として継
続時間長の平均、分散に基づいて計算する第５の手順
と、前記第４の手順がスコアを計算する際、および、前
記第５の手順が正規化定数を計算する際に使用する継続
時間長の平均、分散を格納する第６の手順と、をコンピ
ュータに実行させることを特徴とする。A second voice recognition program of the present invention causes a computer to execute: a first procedure for receiving a voice signal input from the outside; a second procedure for receiving the voice signal from the first procedure, finding, based on a set of standard patterns prepared in advance and a grammar, a plurality of high-scoring word strings that match the voice signal in order from the top, and outputting the plurality of word strings, a word duration sequence corresponding to each of the word strings, and a score corresponding to each of the word strings; a third procedure for receiving the word strings, duration sequences, and scores from the second procedure and normalizing the durations using normalization constants; a fourth procedure for correcting the scores output by the second procedure based on the mean and variance of the durations and outputting them; a fifth procedure for calculating, based on the mean and variance of the durations, a first type of normalization constant that multiplies the duration by a constant and a second type of normalization constant that adds a constant to the duration, as the normalization constants used by the third procedure to normalize the durations; and a sixth procedure for storing the mean and variance of the durations used when the fourth procedure calculates the score and when the fifth procedure calculates the normalization constants.

【００２９】本発明の第３の音声認識プログラムは、外
部から入力される音声信号を受け取る第１の手順と、前
記第１の手順から音声信号を受け取り、あらかじめ準備
された標準パタンのセット、および、文法に基づいて、
音声信号に適合するスコアの高い単語列を上位から順に
複数個求め、複数個の単語列、複数個の単語列それぞれ
に対応する単語継続時間長系列、および、複数個の単語
列それぞれに対応するスコアを出力する第２の手順と、
前記第２の手順から単語列、継続時間長系列およびスコ
アを受け取り、単語列、継続時間長から正規化定数を用
いて継続時間長の正規化を行う第３の手順と、前記第２
の手順の出力したスコアを継続時間長の平均、分散に基
づいて補正し、出力する第４の手順と、前記第３の手順
が継続時間長の正規化のために用いる正規化定数を前記
継続時間長系列と継続時間長の平均との間の自乗誤差が
最小となるように継続時間長の平均、分散に基づいて計
算する第５の手順と、前記第４の手順がスコアを計算す
る際、および、前記第５の手順が正規化定数を計算する
際に使用する継続時間長の平均、分散を格納する第６の
手順と、をコンピュータに実行させることを特徴とす
る。A third voice recognition program of the present invention causes a computer to execute: a first procedure for receiving a voice signal input from the outside; a second procedure for receiving the voice signal from the first procedure, finding, based on a set of standard patterns prepared in advance and a grammar, a plurality of high-scoring word strings that match the voice signal in order from the top, and outputting the plurality of word strings, a word duration sequence corresponding to each of the word strings, and a score corresponding to each of the word strings; a third procedure for receiving the word strings, duration sequences, and scores from the second procedure and normalizing the durations using normalization constants; a fourth procedure for correcting the scores output by the second procedure based on the mean and variance of the durations and outputting them; a fifth procedure for calculating, based on the mean and variance of the durations, the normalization constants used by the third procedure so that the squared error between the duration sequence and the mean durations is minimized; and a sixth procedure for storing the mean and variance of the durations used when the fourth procedure calculates the score and when the fifth procedure calculates the normalization constants.

【００３０】[0030]

【発明の実施の形態】次に、本発明の第１の実施の形態
について、図面を参照して詳細に説明する。BEST MODE FOR CARRYING OUT THE INVENTION Next, a first embodiment of the present invention will be described in detail with reference to the drawings.

【００３１】図１は、本発明の第１の実施の形態の構成
を示すブロック図である。FIG. 1 is a block diagram showing the configuration of the first embodiment of the present invention.

【００３２】図１を参照すると、本発明の第１の実施の
形態は、外部から入力される音声信号を受け取る音声信
号入力手段１０１と、音声信号入力手段１０１から音声
信号の時系列を受け取り、認識に有用な特徴量（例えば
メルケプストラム）の系列を計算し、あらかじめ準備さ
れた音響モデル（標準パタンのセット）、および、必要
に応じて準備される文法（単語辞書）から、特徴量系列
にもっともよくマッチする標準パタン（音素）の系列を
ヴィタビ探索などによって上位から順にＮ個求め、求め
られたＮ個の音素列からそれぞれ一意に決まるＮ個の単
語列、Ｎ個の単語列それぞれに対応する単語継続時間長
系列、および、Ｎ個のスコア（尤度）を出力する認識結
果候補生成手段１０２と、認識結果候補生成手段１０２
から単語列、継続時間長系列、スコアを受け取り、単語
列と継続時間長から継続時間長の正規化（数値揃え）を
行う継続時間長正規化手段１０３と、継続時間長分布モ
デルに基づいて継続時間長のもっともらしさを表すスコ
アを計算し、認識結果候補生成手段１０２の出したスコ
アを加減してスコア補正する認識結果候補修正手段１０
４と、継続時間長正規化手段１０３が継続時間長の正規
化を行う際に必要となる正規化定数を計算する正規化定
数計算手段１０５と、認識結果候補修正手段１０４が継
続時間長のもっともらしさを表すスコアを計算する際、
および、正規化定数計算手段１０５が正規化定数を計算
する際に参照する継続時間長の平均や分散の値を格納す
る継続時間長統計量格納手段１０６（たとえば、メモ
リ）とから構成される。Referring to FIG. 1, the first embodiment of the present invention comprises: voice signal input means 101, which receives a voice signal input from the outside; recognition result candidate generation means 102, which receives the time series of the voice signal from the voice signal input means 101, computes a sequence of features useful for recognition (for example, mel-cepstrum), finds the N sequences of standard patterns (phonemes) that best match the feature sequence, in order from the top, by Viterbi search or the like using an acoustic model (a set of standard patterns) prepared in advance and a grammar (word dictionary) prepared as needed, and outputs the N word strings uniquely determined by the N phoneme sequences, the word duration sequence corresponding to each of the N word strings, and N scores (likelihoods); duration length normalizing means 103, which receives the word strings, duration sequences, and scores from the recognition result candidate generation means 102 and normalizes the durations (brings them to a common scale) based on the word strings and durations; recognition result candidate correction means 104, which computes a score expressing the plausibility of the durations based on a duration distribution model and corrects the score produced by the recognition result candidate generation means 102 by adding to or subtracting from it; normalization constant calculation means 105, which calculates the normalization constants that the duration length normalizing means 103 needs in order to normalize the durations; and duration statistics storage means 106 (for example, a memory), which stores the duration means and variances referred to when the recognition result candidate correction means 104 computes the plausibility score and when the normalization constant calculation means 105 calculates the normalization constants.

【００３３】各々の手段はそれぞれ計算機上に記憶され
たプログラムとして動作させることにより実現可能であ
る。Each means can be realized by operating as a program stored in a computer.

【００３４】次に、本発明の第１の実施の形態の動作に
ついて図面を参照して説明する。Next, the operation of the first embodiment of the present invention will be described with reference to the drawings.

【００３５】音声信号入力手段１０１は、マイクなどの
入力デバイスを備え、入力音声信号の時系列データを取
り込み格納する。The audio signal input means 101 is provided with an input device such as a microphone, and fetches and stores time series data of the input audio signal.

【００３６】認識結果候補生成手段１０２は、あらかじ
め適当な音声データとそれに対応する正解単語列を用い
て学習された音響モデル（標準パタンＨＭＭのセット）
と、認識タスクに応じて準備された文法（単語の並びを
規定する辞書）を備え、音声信号入力手段１０１から音
声信号の時系列データを受け取り、これを認識に有用な
特徴量ベクトル（例えばメルケプストラム）の系列に変
換し、前記音響モデルと、前記文法を用いて、特徴量ベ
クトル系列にもっともよくマッチし、かつ文法に合致す
る標準パタン（例えば音素）の系列をヴィタビ探索など
によって上位から順に複数個（ここではＮ個とする）求
め、求められたＮ個の音素列からそれぞれ一意に決まる
Ｎ個の単語列と、Ｎ個の単語列それぞれに対応する単語
継続時間長系列と、Ｎ個のスコア（尤度）を出力する。The recognition result candidate generation means 102 holds an acoustic model (a set of standard-pattern HMMs) trained in advance on suitable speech data and the corresponding correct word strings, and a grammar (a dictionary that defines permissible word sequences) prepared for the recognition task. It receives the time-series data of the voice signal from the voice signal input means 101, converts it into a sequence of feature vectors useful for recognition (for example, mel-cepstrum), and, using the acoustic model and the grammar, finds by Viterbi search or the like the top several (here, N) sequences of standard patterns (for example, phonemes) that best match the feature vector sequence while conforming to the grammar. It then outputs the N word strings uniquely determined by the N phoneme sequences, the word duration sequence corresponding to each of the N word strings, and N scores (likelihoods).

【００３７】以下では、これら単語列、単語継続時間長系列、スコアのＮ個の組を（Ｗ_{ｋ，１}，Ｗ_{ｋ，２}，…，Ｗ_{ｋ，ｎ（ｋ）}；Ｘ_{ｋ，１}，Ｘ_{ｋ，２}，…，Ｘ_{ｋ，ｎ（ｋ）}；Ｌ_ｋ）という記号で表す。ただしｋ＝１，２，…，Ｎであり、単語列、単語継続時間長系列、スコアの各組はスコアＬ_ｋの高い順にソートされているとする。ｎ（ｋ）は第ｋ位候補の単語数である。In the following, these N triples of word string, word duration sequence, and score are written (W_{k,1}, W_{k,2}, ..., W_{k,n(k)}; X_{k,1}, X_{k,2}, ..., X_{k,n(k)}; L_k), where k = 1, 2, ..., N. The triples are assumed to be sorted in descending order of the score L_k, and n(k) is the number of words in the k-th candidate.
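
One way to hold these N triples in code is sketched below. The class name `Candidate`, its field names, and the helper `sort_nbest` are illustrative assumptions, not identifiers from the patent; the example values are those of the first-ranked candidate quoted in paragraphs [0052] and [0063].

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """One N-best entry (W_k; X_k; L_k). All names here are hypothetical."""
    words: List[str]        # W_{k,1}, ..., W_{k,n(k)}
    durations: List[float]  # X_{k,1}, ..., X_{k,n(k)} in msec
    score: float            # L_k

    def __post_init__(self) -> None:
        # Each word must carry exactly one duration.
        if len(self.words) != len(self.durations):
            raise ValueError("words and durations must have the same length")

    @property
    def n(self) -> int:
        """n(k): number of words in this candidate."""
        return len(self.words)


def sort_nbest(candidates: List[Candidate]) -> List[Candidate]:
    """Return the candidates in descending order of score L_k."""
    return sorted(candidates, key=lambda c: c.score, reverse=True)


# First-ranked candidate before correction, as quoted in [0052] and [0063].
first = Candidate(
    words=["にせん", "さん", "の", "よん", "の", "いち"],
    durations=[667, 444, 466, 436, 189, 421],
    score=86.0,
)
print(first.n)  # number of words in the candidate
```

Keeping the words and durations as parallel lists mirrors the paired notation W_{k,i}, X_{k,i}; the length check guards against a candidate whose alignment is incomplete.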

【００３８】継続時間長正規化手段１０３は、認識結果候補生成手段１０２より上位Ｎ個の認識結果候補（Ｗ_{ｋ，１}，Ｗ_{ｋ，２}，…，Ｗ_{ｋ，ｎ（ｋ）}；Ｘ_{ｋ，１}，Ｘ_{ｋ，２}，…，Ｘ_{ｋ，ｎ（ｋ）}；Ｌ_ｋ）、ｋ＝１，２，…，Ｎを受け取り、１候補ずつ順に、発話速度に起因する変動要因が小さくなるように継続時間長を正規化する。The duration length normalizing means 103 receives the top N recognition result candidates (W_{k,1}, W_{k,2}, ..., W_{k,n(k)}; X_{k,1}, X_{k,2}, ..., X_{k,n(k)}; L_k), k = 1, 2, ..., N, from the recognition result candidate generation means 102 and, one candidate at a time, normalizes the durations so that the variation caused by speaking rate is reduced.

【００３９】ここでの正規化は、次式（数式２）のよう
な、２種のパラメータ（正規化定数）κおよびλで規定
される１次関数による写像である。κは、継続時間長の
スケーリング(定数倍)を行う正規化定数であり、λは、
継続時間長のシフト(定数加算)を行う定数正規化定数で
ある。The normalization here is a mapping by a linear function defined by two parameters (normalization constants), κ and λ, as in the following equation (Equation 2). κ is the normalization constant that scales the duration (multiplication by a constant), and λ is the normalization constant that shifts the duration (addition of a constant).

【００４０】[0040]

【数２】 [Equation 2]
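
The equation image itself is not reproduced in this text. Judging from the description of κ and λ above and from the worked example in paragraph [0059] (0.334 × 667 + 92.7 = 315), Equation 2 presumably has the following form. The primed symbol for the normalized duration is notation assumed here, not taken from the patent.

```latex
X'_{k,i} = \kappa \, X_{k,i} + \lambda , \qquad i = 1, 2, \ldots, n(k)
```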

【００４１】上記正規化定数κおよびλの値は、正規化
に先立って、正規化定数計算手段１０５によって決めら
れる。その手順は以下のようである。The values of the normalization constants κ and λ are determined by the normalization constant calculation means 105 prior to normalization. The procedure is as follows.

【００４２】継続時間長正規化手段１０３は、第ｋ位候補の単語列Ｗ_{ｋ，１}，Ｗ_{ｋ，２}，…，Ｗ_{ｋ，ｎ（ｋ）}、および、継続時間長系列Ｘ_{ｋ，１}，Ｘ_{ｋ，２}，…，Ｘ_{ｋ，ｎ（ｋ）}を正規化定数計算手段１０５に送る。一方、継続時間長統計量格納手段１０６には、あらゆる単語に関する継続時間長の平均および分散があらかじめ準備され格納されている。The duration length normalizing means 103 sends the word string W_{k,1}, W_{k,2}, ..., W_{k,n(k)} of the k-th candidate and its duration sequence X_{k,1}, X_{k,2}, ..., X_{k,n(k)} to the normalization constant calculation means 105. Meanwhile, the mean and variance of the duration of every word are prepared in advance and stored in the duration statistics storage means 106.

【００４３】正規化定数計算手段１０５は、継続時間長正規化手段１０３より、上記単語列および継続時間長系列を受け取り、さらに継続時間長統計量格納手段１０６より、単語列Ｗ_{ｋ，１}，Ｗ_{ｋ，２}，…，Ｗ_{ｋ，ｎ（ｋ）}に対応する継続時間長の平均μ（Ｗ_{ｋ，１}），μ（Ｗ_{ｋ，２}），…，μ（Ｗ_{ｋ，ｎ（ｋ）}）と分散σ（Ｗ_{ｋ，１}），σ（Ｗ_{ｋ，２}），…，σ（Ｗ_{ｋ，ｎ（ｋ）}）を読み出す。そして、正規化定数計算手段１０５は正規化定数κおよびλを次式（数式３、数式４）に従って算出する。The normalization constant calculation means 105 receives the word string and duration sequence from the duration length normalizing means 103, and reads from the duration statistics storage means 106 the mean durations μ(W_{k,1}), μ(W_{k,2}), ..., μ(W_{k,n(k)}) and the variances σ(W_{k,1}), σ(W_{k,2}), ..., σ(W_{k,n(k)}) corresponding to the words W_{k,1}, W_{k,2}, ..., W_{k,n(k)}. It then calculates the normalization constants κ and λ according to the following equations (Equations 3 and 4).

【００４４】[0044]

【数３】 [Equation 3]

【００４５】[0045]

【数４】 [Equation 4]
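
Equations 3 and 4 are missing here because they were images. The worked example in paragraphs [0055] and [0056] is consistent with the following weighted least-squares forms, given as a reconstruction rather than the authoritative text. For brevity, X_i, μ_i, and σ_i stand for X_{k,i}, μ(W_{k,i}), and the variance σ(W_{k,i}); all sums run over i = 1, ..., n(k).

```latex
\kappa =
\frac{\left(\sum_i \frac{1}{\sigma_i}\right)\left(\sum_i \frac{X_i \mu_i}{\sigma_i}\right)
      - \left(\sum_i \frac{X_i}{\sigma_i}\right)\left(\sum_i \frac{\mu_i}{\sigma_i}\right)}
     {\left(\sum_i \frac{1}{\sigma_i}\right)\left(\sum_i \frac{X_i^2}{\sigma_i}\right)
      - \left(\sum_i \frac{X_i}{\sigma_i}\right)^2}

\lambda =
\frac{\sum_i \frac{\mu_i}{\sigma_i} - \kappa \sum_i \frac{X_i}{\sigma_i}}
     {\sum_i \frac{1}{\sigma_i}}
```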

【００４６】これらの算出式は、継続時間長系列Ｘ_{ｋ，１}，Ｘ_{ｋ，２}，…，Ｘ_{ｋ，ｎ（ｋ）}と継続時間長の平均の系列μ（Ｗ_{ｋ，１}），μ（Ｗ_{ｋ，２}），…，μ（Ｗ_{ｋ，ｎ（ｋ）}）の間の自乗誤差が最小となるように設定されている。より正確には、継続時間長のもっともらしさを表す量（前記数式１、または、後記数式５の右辺第２項）が最大となるように設計されている。正規化定数計算手段１０５は、これらの算出式にしたがって算出された正規化定数κおよびλを、継続時間長正規化手段１０３に渡し、継続時間長正規化手段１０３は前述した数式２に従って、継続時間長を正規化する。These formulas are set so that the squared error between the duration sequence X_{k,1}, X_{k,2}, ..., X_{k,n(k)} and the sequence of mean durations μ(W_{k,1}), μ(W_{k,2}), ..., μ(W_{k,n(k)}) is minimized. More precisely, they are designed to maximize the quantity that expresses the plausibility of the durations (Equation 1 above, or the second term on the right-hand side of Equation 5 below). The normalization constant calculation means 105 passes the constants κ and λ calculated with these formulas to the duration length normalizing means 103, which then normalizes the durations according to Equation 2.
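
The least-squares statement above can be made concrete as follows. This is a sketch of the derivation under the assumption that each squared deviation is weighted by the reciprocal of the word's variance, as the numerical example in paragraphs [0055] and [0056] suggests. The symbol J is introduced here for illustration, and X_i, μ_i, σ_i abbreviate X_{k,i}, μ(W_{k,i}), σ(W_{k,i}).

```latex
J(\kappa, \lambda) = \sum_{i=1}^{n(k)} \frac{\left(\kappa X_i + \lambda - \mu_i\right)^2}{\sigma_i}

\frac{\partial J}{\partial \kappa} = 0
\;\Longrightarrow\;
\kappa \sum_i \frac{X_i^2}{\sigma_i} + \lambda \sum_i \frac{X_i}{\sigma_i} = \sum_i \frac{X_i \mu_i}{\sigma_i}

\frac{\partial J}{\partial \lambda} = 0
\;\Longrightarrow\;
\kappa \sum_i \frac{X_i}{\sigma_i} + \lambda \sum_i \frac{1}{\sigma_i} = \sum_i \frac{\mu_i}{\sigma_i}
```

Solving this pair of linear equations for κ and λ yields closed-form constants. Because the duration term of the score is a negatively weighted sum of the same squared deviations, minimizing J and maximizing the plausibility of the durations amount to the same thing.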

【００４７】認識結果候補修正手段１０４は、継続時間長正規化手段１０３より、Ｎ個の認識結果候補（Ｗ_{ｋ，１}，Ｗ_{ｋ，２}，…，Ｗ_{ｋ，ｎ（ｋ）}；Ｘ_{ｋ，１}，Ｘ_{ｋ，２}，…，Ｘ_{ｋ，ｎ（ｋ）}；Ｌ_ｋ）、ｋ＝１，２，…，Ｎを受け取り（ここでＸ_{ｋ，ｉ}は正規化済みであることに注意）、また、継続時間長統計量格納手段１０６より単語Ｗ_{ｋ，１}，Ｗ_{ｋ，２}，…，Ｗ_{ｋ，ｎ（ｋ）}に対応する継続時間長の平均および分散を読み出し、このうちのスコアＬ_ｋを、正規化継続時間長のもっともらしさに基づいて修正する。The recognition result candidate correction means 104 receives the N recognition result candidates (W_{k,1}, W_{k,2}, ..., W_{k,n(k)}; X_{k,1}, X_{k,2}, ..., X_{k,n(k)}; L_k), k = 1, 2, ..., N, from the duration length normalizing means 103 (note that the durations X_{k,i} have already been normalized at this point). It also reads from the duration statistics storage means 106 the mean and variance of the durations corresponding to the words W_{k,1}, W_{k,2}, ..., W_{k,n(k)}, and corrects the score L_k based on the plausibility of the normalized durations.

【００４８】その修正の方法は、従来の技術において示
した数式１（後述する数式５）と同一である。さらに認
識結果候補修正手段１０４は、修正されたスコアＬ_ｋの
大きい順にＮ個の認識結果候補をソートし直し、１位の
認識結果候補（あるいは、必要に応じて上位複数個の認
識結果候補）を最終的な認識結果として出力する。The correction method is the same as Equation 1 shown in the description of the prior art (Equation 5 below). The recognition result candidate correction means 104 then re-sorts the N recognition result candidates in descending order of the corrected score L_k and outputs the first-ranked candidate (or, if needed, several top-ranked candidates) as the final recognition result.

【００４９】次に、本発明の第１の実施の形態の動作に
ついてより詳細に図面を参照して説明する。Next, the operation of the first embodiment of the present invention will be described in more detail with reference to the drawings.

【００５０】図２は、単語継続時間長に基づくスコア修
正による認識結果候補の修正の例を示す説明図である。FIG. 2 is an explanatory view showing an example of correction of recognition result candidates by score correction based on the word duration length.

【００５１】図２を参照すると、住所の番地部分などの
認識でよく現れる、任意桁数の数字と「の」が交互に現
れるような発声を認識した場合の認識結果であり、上か
ら順に実際の発話内容、認識結果候補生成手段１０２が
出力する修正前の認識結果候補（第４位候補まで）、認
識結果候補修正手段１０４が出力する修正後の認識結果
候補（同）となっている。各認識結果候補には、そのス
コアＬ_ｋとともに、単語Ｗ_{ｋ，１}，Ｗ_{ｋ，２}，…，Ｗ_{ｋ，ｎ（ｋ）}、および、継続時間長Ｘ_{ｋ，１}，Ｘ_{ｋ，２}，…，Ｘ_{ｋ，ｎ（ｋ）}［ｍｓｅｃ単位］の対を記してある。FIG. 2 shows recognition results for an utterance in which numbers with an arbitrary count of digits alternate with the particle "no", a pattern that often appears when recognizing, for example, the street-number part of an address. From top to bottom it shows the actual utterance, the recognition result candidates before correction output by the recognition result candidate generation means 102 (down to the fourth candidate), and the recognition result candidates after correction output by the recognition result candidate correction means 104 (likewise down to the fourth). Each candidate is shown with its score L_k and with pairs of words W_{k,1}, W_{k,2}, ..., W_{k,n(k)} and durations X_{k,1}, X_{k,2}, ..., X_{k,n(k)} (in msec).

【００５２】たとえば、認識結果候補（修正前）の１位
候補は、単語［継続時間長］が、にせん［６６７］、さ
ん［４４４］、の［４６６］、よん［４３６］、の［１
８９］、いち［４２１］である。For example, the first-ranked candidate before correction consists of the following words, with durations in brackets: nisen (two thousand) [667], san (three) [444], no [466], yon (four) [436], no [189], ichi (one) [421].

【００５３】図３は、単語継続時間長統計量の例を示す
説明図である。図３を参照すると、継続時間長統計量格
納手段１０６が有する単語ごとの継続時間長の平均およ
び分散の値が示されている。この例では、平均および分
散の単位は、ｍｓｅｃ、および、平方ｍｓｅｃである。FIG. 3 is an explanatory diagram showing an example of the word duration statistics. Referring to FIG. 3, there are shown average and variance values of the duration length of each word that the duration statistic storage means 106 has. In this example, the units of mean and variance are msec and square msec.

【００５４】正規化定数計算手段１０５は、数式３およ
び数式４に基づいて、認識結果候補ごとに正規化定数
κ、λを算出する。The normalization constant calculation means 105 calculates the normalization constants κ and λ for each recognition result candidate based on Expressions 3 and 4.

【００５５】たとえば、１位候補においては、κ＝｛（１／１０８９８＋１／５２５４＋１／１５２６＋１／４５３０＋１／１５２６＋１／５１２３）×（６６７×４４２／１０８９８＋４４４×２９４／５２５４＋４６６×１６０／１５２６＋４３６×２８５／４５３０＋１８９×１６０／１５２６＋４２１×３５１／５１２３）−（６６７／１０８９８＋４４４／５２５４＋４６６／１５２６＋４３６／４５３０＋１８９／１５２６＋４２１／５１２３）×（４４２／１０８９８＋２９４／５２５４＋１６０／１５２６＋２８５／４５３０＋１６０／１５２６＋３５１／５１２３）｝／｛（１／１０８９８＋１／５２５４＋１／１５２６＋１／４５３０＋１／１５２６＋１／５１２３）×（６６７^２／１０８９８＋４４４^２／５２５４＋４６６^２／１５２６＋４３６^２／４５３０＋１８９^２／１５２６＋４２１^２／５１２３）−（６６７／１０８９８＋４４４／５２５４＋４６６／１５２６＋４３６／４５３０＋１８９／１５２６＋４２１／５１２３）^２｝＝０．３３４である。For example, for the first-ranked candidate, κ = {(1/10898 + 1/5254 + 1/1526 + 1/4530 + 1/1526 + 1/5123) × (667×442/10898 + 444×294/5254 + 466×160/1526 + 436×285/4530 + 189×160/1526 + 421×351/5123) - (667/10898 + 444/5254 + 466/1526 + 436/4530 + 189/1526 + 421/5123) × (442/10898 + 294/5254 + 160/1526 + 285/4530 + 160/1526 + 351/5123)} / {(1/10898 + 1/5254 + 1/1526 + 1/4530 + 1/1526 + 1/5123) × (667^2/10898 + 444^2/5254 + 466^2/1526 + 436^2/4530 + 189^2/1526 + 421^2/5123) - (667/10898 + 444/5254 + 466/1526 + 436/4530 + 189/1526 + 421/5123)^2} = 0.334.

【００５６】また、λ＝（４４２／１０８９８＋２９４／５２５４＋１６０／１５２６＋２８５／４５３０＋１６０／１５２６＋３５１／５１２３）／（１／１０８９８＋１／５２５４＋１／１５２６＋１／４５３０＋１／１５２６＋１／５１２３）−０．３３４×（６６７／１０８９８＋４４４／５２５４＋４６６／１５２６＋４３６／４５３０＋１８９／１５２６＋４２１／５１２３）／（１／１０８９８＋１／５２５４＋１／１５２６＋１／４５３０＋１／１５２６＋１／５１２３）＝９２．７である。Likewise, λ = (442/10898 + 294/5254 + 160/1526 + 285/4530 + 160/1526 + 351/5123) / (1/10898 + 1/5254 + 1/1526 + 1/4530 + 1/1526 + 1/5123) - 0.334 × (667/10898 + 444/5254 + 466/1526 + 436/4530 + 189/1526 + 421/5123) / (1/10898 + 1/5254 + 1/1526 + 1/4530 + 1/1526 + 1/5123) = 92.7.

【００５７】よって、正規化定数は、１位候補から順に
（κ，λ）＝（０．３３４，９２．７）、（０．９９
４，−２２１）、（０．６１４，４６．３）、（０．６
７８，３１．９）、…となる。Therefore, the normalization constants, in order from the first-ranked candidate, are (κ, λ) = (0.334, 92.7), (0.994, -221), (0.614, 46.3), (0.678, 31.9), and so on.
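
The calculation in paragraphs [0055] and [0056] can be sketched in code as a variance-weighted least-squares fit. The function name `normalization_constants` and its argument names are assumptions made for this illustration; the durations, means, and variances are the figures quoted in the text for the first-ranked candidate.

```python
from typing import Sequence, Tuple


def normalization_constants(
    x: Sequence[float], mu: Sequence[float], var: Sequence[float]
) -> Tuple[float, float]:
    """Fit kappa, lambda so that kappa * x_i + lambda approximates mu_i,
    weighting each squared error by 1 / var_i (hypothetical helper)."""
    s0 = sum(1.0 / v for v in var)                         # sum of 1/sigma
    sx = sum(xi / v for xi, v in zip(x, var))              # sum of X/sigma
    sm = sum(m / v for m, v in zip(mu, var))               # sum of mu/sigma
    sxm = sum(xi * m / v for xi, m, v in zip(x, mu, var))  # sum of X*mu/sigma
    sxx = sum(xi * xi / v for xi, v in zip(x, var))        # sum of X^2/sigma
    denom = s0 * sxx - sx * sx
    if denom == 0.0:
        raise ValueError("durations are all equal; kappa is undetermined")
    kappa = (s0 * sxm - sx * sm) / denom
    lam = (sm - kappa * sx) / s0
    return kappa, lam


# First-ranked candidate: nisen, san, no, yon, no, ichi.
durations = [667, 444, 466, 436, 189, 421]       # X_{1,i} in msec
means = [442, 294, 160, 285, 160, 351]           # mu(W_{1,i}) in msec
variances = [10898, 5254, 1526, 4530, 1526, 5123]  # sigma(W_{1,i}) in msec^2

kappa, lam = normalization_constants(durations, means, variances)
print(f"kappa={kappa:.3f}, lambda={lam:.1f}")
```

With these inputs the result should agree, to rounding, with the values 0.334 and 92.7 quoted above. The explicit check on the denominator covers the degenerate case in which every word has the same duration, where no scale factor can be estimated.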

【００５８】継続時間長正規化手段１０３は、数式２に
基づいて、各認識結果候補の継続時間長を正規化（修
正）する。なお、修正後の継続時間長は、図２の修正後
の認識結果候補に記されている。The duration length normalizing means 103 normalizes (corrects) the duration length of each recognition result candidate based on the mathematical expression 2. The corrected duration time is described in the corrected recognition result candidate in FIG.

【００５９】たとえば、認識結果候補（修正後）の２位候補（修正前の１位候補）の単語「にせん」の継続時間長は、０．３３４×６６７＋９２．７＝３１５である。For example, the duration of the word "nisen" in the second-ranked candidate after correction (the first-ranked candidate before correction) is 0.334 × 667 + 92.7 = 315.
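
Applied to every word of that candidate, the same linear mapping can be sketched as below. The function name `normalize_durations` is an assumption for this illustration, and truncating to whole milliseconds is only a guess at how the figures in the text were rounded.

```python
from typing import List, Sequence


def normalize_durations(
    durations: Sequence[float], kappa: float, lam: float
) -> List[float]:
    """Map each duration X to kappa * X + lambda (hypothetical helper)."""
    return [kappa * x + lam for x in durations]


# Durations of nisen, san, no, yon, no, ichi with (kappa, lambda) = (0.334, 92.7).
normalized = normalize_durations([667, 444, 466, 436, 189, 421], 0.334, 92.7)
whole_msec = [int(v) for v in normalized]  # truncation is an assumed convention
print(whole_msec)
```

Because κ is below 1 and λ is positive for this candidate, long words are shortened proportionally more than short ones, which is the effect of treating speaking rate as a linear map rather than a pure scale factor.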

【００６０】認識結果候補修正手段１０４は、継続時間
長正規化手段１０３によって正規化された継続時間長を
用いて、次式（数式５）に基づき、第ｋ位候補のスコア
Ｌ_ｋを修正する。The recognition result candidate correction means 104 uses the durations normalized by the duration length normalizing means 103 to correct the score L_k of the k-th candidate according to the following equation (Equation 5).

【００６１】[0061]

【数５】 [Equation 5]
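
Equation 5 is also missing as an image. The worked example in paragraph [0063] suggests the form below, in which X'_{k,i} = κX_{k,i} + λ is the normalized duration, σ(W_{k,i}) is the variance, and α is the weight mentioned in paragraph [0062]. Treat it as a reconstruction; the primed notation is assumed here, and the arrow denotes replacing the old score with the corrected one.

```latex
L_k \leftarrow L_k - \alpha \sum_{i=1}^{n(k)}
    \frac{\left(X'_{k,i} - \mu(W_{k,i})\right)^2}{\sigma(W_{k,i})}
```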

【００６２】修正後のスコア値は、図２の修正後の認識
結果候補に記されている。ただしここではα＝１．０で
計算している。The corrected score value is written in the corrected recognition result candidate in FIG. However, here, the calculation is performed with α = 1.0.

【００６３】たとえば、認識結果候補（修正後）の２位候補（修正前の１位候補）のスコア値は、８６．０−｛（３１５−４４２）^２／１０８９８＋（２４０−２９４）^２／５２５４＋（２４８−１６０）^２／１５２６＋（２３８−２８５）^２／４５３０＋（１５５−１６０）^２／１５２６＋（２３３−３５１）^２／５１２３｝＝８６．０−１０．３＝７５．３である。認識結果候補修正手段１０４はさらに、修正後のスコア値に基づいて認識結果候補を再ソートし、順位を付け直す。結果として、修正前に１位、２位であった候補は修正によって大きくスコアが下がり（主として、単語「の」の継続時間長が不自然に大きいことによる）、修正前に４位であった候補も若干スコアが下がり（主として、単語「いっせん」の継続時間長が不自然に小さいことによる）、修正前に３位であった候補（正解）が１位に上がっている。For example, the score of the second-ranked candidate after correction (the first-ranked candidate before correction) is 86.0 - {(315 - 442)^2/10898 + (240 - 294)^2/5254 + (248 - 160)^2/1526 + (238 - 285)^2/4530 + (155 - 160)^2/1526 + (233 - 351)^2/5123} = 86.0 - 10.3 = 75.3. The recognition result candidate correction means 104 then re-sorts the candidates by corrected score and re-ranks them. As a result, the candidates ranked first and second before correction lose a large amount of score (mainly because the duration of the word "no" is unnaturally long), the candidate ranked fourth before correction also loses some score (mainly because the duration of the word "issen" is unnaturally short), and the candidate ranked third before correction, which is the correct answer, rises to first place.
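
The duration term of this score correction can be sketched as follows. The function names `duration_penalty` and `corrected_score` are assumptions for this illustration; the normalized durations, means, and variances are the figures quoted in the text for the candidate that was ranked first before correction.

```python
from typing import Sequence


def duration_penalty(
    norm_durations: Sequence[float],
    means: Sequence[float],
    variances: Sequence[float],
) -> float:
    """Sum of squared deviations from the mean duration, each divided by the
    word's variance (hypothetical helper)."""
    return sum(
        (x - m) ** 2 / v for x, m, v in zip(norm_durations, means, variances)
    )


def corrected_score(
    score: float,
    norm_durations: Sequence[float],
    means: Sequence[float],
    variances: Sequence[float],
    alpha: float = 1.0,
) -> float:
    """Subtract the alpha-weighted duration penalty from the original score."""
    return score - alpha * duration_penalty(norm_durations, means, variances)


# Normalized durations of nisen, san, no, yon, no, ichi and their statistics.
norm = [315, 240, 248, 238, 155, 233]
means = [442, 294, 160, 285, 160, 351]
variances = [10898, 5254, 1526, 4530, 1526, 5123]

penalty = duration_penalty(norm, means, variances)
print(f"penalty={penalty:.1f}")
```

The first occurrence of "no" contributes (248 - 160)^2/1526, roughly half of the total penalty, which matches the explanation that its unnaturally long duration is the main reason this candidate loses score.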

【００６４】ここで、図３で例示した、あらかじめ準備
しておく継続時間長統計量（平均および分散）の算出手
順の一例について説明する。学習サンプルとして、音声
データとそれに対応する正解単語列を用意する。この音
声データに対して、認識結果候補生成手段１０２が使用
しているのと同一の音響モデルを用いて、ヴィタビ探索
による単語の最適アラインメント（各単語の開始時刻と
終了時刻）を求める。Here, an example of the procedure for calculating the duration statistics (average and variance) prepared in advance, which is illustrated in FIG. 3, will be described. As learning samples, voice data and correct word strings corresponding to them are prepared. For this voice data, the same acoustic model as that used by the recognition result candidate generation means 102 is used to obtain the optimum alignment of words (start time and end time of each word) by the Viterbi search.

【００６５】これにより、種々の単語の継続時間長がわ
かるので、単語ごとに集計して、継続時間の平均と分散
を計算すればよい。学習サンプルの規模については、大
きければ大きいほどよいが、１種の単語が数回以上は出
現していることが望ましい。As a result, the durations of various words can be known, so that the averages and variances of the durations can be calculated by aggregating each word. The larger the learning sample, the better. However, it is desirable that one type of word appears several times or more.
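
A minimal sketch of this tallying step is shown below, assuming the Viterbi alignment has already produced (word, start, end) triples. The function name `duration_statistics`, the `min_count` threshold, the use of the population variance, and the toy alignment values are all assumptions made for illustration, not data from the patent.

```python
from collections import defaultdict
from statistics import mean, pvariance
from typing import Dict, Iterable, List, Tuple


def duration_statistics(
    alignments: Iterable[Tuple[str, float, float]], min_count: int = 2
) -> Dict[str, Tuple[float, float]]:
    """Collect per-word (mean, variance) of durations from aligned words.

    alignments: (word, start_msec, end_msec) triples.
    Words seen fewer than min_count times are skipped, since their
    variance cannot be estimated reliably.
    """
    by_word: Dict[str, List[float]] = defaultdict(list)
    for word, start, end in alignments:
        by_word[word].append(end - start)
    stats: Dict[str, Tuple[float, float]] = {}
    for word, durs in by_word.items():
        if len(durs) >= min_count:
            stats[word] = (mean(durs), pvariance(durs))
    return stats


# Made-up alignment of two short utterances (times in msec).
toy_alignments = [
    ("さん", 0.0, 300.0),
    ("の", 300.0, 450.0),
    ("よん", 450.0, 730.0),
    ("の", 0.0, 170.0),
]
stats = duration_statistics(toy_alignments)
print(stats)
```

Skipping words with too few occurrences reflects the remark above that each word should appear at least several times; a real system would likely need a fallback, such as pooled statistics, for rare words.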

【００６６】なお、もしも、正規化定数計算手段１０５
における正規化定数の計算にかかるコストを削減したい
場合は、単語ごとの分散の違いを無視することによっ
て、多くの場合さほど実効性を落とすことなく、前記数
式３および数式４の計算量を減らすことが可能となる。
その計算式は次式（数式６、数式７）である。If the cost of computing the normalization constants in the normalization constant calculation means 105 needs to be reduced, ignoring the word-to-word differences in variance reduces the amount of computation in Equations 3 and 4, in many cases without much loss of effectiveness. The formulas are the following (Equations 6 and 7).

【００６７】[0067]

【数６】 [Equation 6]

【００６８】[0068]

【数７】 [Equation 7]
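
Equations 6 and 7 are not reproduced in this text either. If every word is assumed to share the same variance, the variance cancels out of the least-squares fit described in paragraph [0046], and the constants would reduce to an ordinary linear regression of the mean durations on the observed durations. The forms below are that reduction, offered as a plausible reconstruction rather than the patent's exact equations; n stands for n(k), and the sums run over i = 1, ..., n.

```latex
\kappa =
\frac{n \sum_i X_{k,i}\, \mu(W_{k,i}) - \sum_i X_{k,i} \sum_i \mu(W_{k,i})}
     {n \sum_i X_{k,i}^2 - \left(\sum_i X_{k,i}\right)^2}

\lambda = \frac{1}{n} \left( \sum_i \mu(W_{k,i}) - \kappa \sum_i X_{k,i} \right)
```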

【００６９】以上では、継続時間長を正規化する、すな
わち継続時間長をその平均に近づけるように変換すると
いう手続きを説明したが、逆も可能である。つまり、継
続時間長の平均を、実際に得られた継続時間長に近づけ
るように変換するという手続きでも、実効上同等の処理
が実現できる。その場合には、継続時間長正規化手段１
０３における継続時間長の正規化は行わず、代わりに認
識結果候補修正手段１０４において、正規化された継続
時間長統計量を用いた尤度の補正を行う。その方法は次
式（数式８）となる。The procedure described so far normalizes the durations, that is, transforms each duration so that it approaches its mean. The reverse is also possible: transforming the mean durations so that they approach the durations actually observed achieves effectively the same processing. In that case, the duration length normalizing means 103 does not normalize the durations; instead, the recognition result candidate correction means 104 corrects the likelihood using normalized duration statistics. The method is given by the following equation (Equation 8).

【００７０】[0070]

【数８】 [Equation 8]

【００７１】また、上式で使用される正規化定数の計算
は、次式（数式９、数式１０）となる。Further, the calculation of the normalization constant used in the above equation is given by the following equations (Equations 9 and 10).

【００７２】[0072]

【数９】 [Equation 9]

【００７３】[0073]

【数１０】 [Equation 10]

【００７４】次に、本発明の第２の実施の形態につい
て、図面を参照して説明する。Next, a second embodiment of the present invention will be described with reference to the drawings.

【００７５】本発明の第２の実施の形態は、外部から入
力される音声信号を受け取る第１の手順（音声信号入力
手段１０１相当）と、前記第１の手順から音声信号を受
け取り、あらかじめ準備された標準パタンのセット、お
よび、文法に基づいて、音声信号に適合するスコアの高
い単語列を上位から順に複数個求め、複数個の単語列、
複数個の単語列それぞれに対応する単語継続時間長系
列、および、複数個の単語列それぞれに対応するスコア
を出力する第２の手順（認識結果候補生成手段１０２相
当）と、前記第２の手順から単語列、継続時間長系列お
よびスコアを受け取り、単語列、継続時間長から正規化
定数を用いて継続時間長の正規化を行う第３の手順（継
続時間長正規化手段１０３相当）と、前記第２の手順の
出力したスコアを継続時間長の平均、分散に基づいて補
正し、出力する第４の手順（認識結果候補修正手段１０
４相当）と、前記第３の手順が継続時間長の正規化のた
めに用いる正規化定数を前記継続時間長系列と継続時間
長の平均との間の自乗誤差が最小となるように継続時間
長の平均、分散に基づいて計算する第５の手順（正規化
定数計算手段１０５相当）と、前記第４の手順がスコア
を計算する際、および、前記第５の手順が正規化定数を
計算する際に使用する継続時間長の平均、分散を格納す
る第６の手順（継続時間長統計量格納手段１０６相当）
とを含む方法である。The second embodiment of the present invention is a method comprising: a first procedure for receiving a voice signal input from the outside (corresponding to the voice signal input means 101); a second procedure for receiving the voice signal from the first procedure, finding, based on a set of standard patterns prepared in advance and a grammar, a plurality of high-scoring word strings that match the voice signal in order from the top, and outputting the plurality of word strings, a word duration sequence corresponding to each of the word strings, and a score corresponding to each of the word strings (corresponding to the recognition result candidate generation means 102); a third procedure for receiving the word strings, duration sequences, and scores from the second procedure and normalizing the durations using normalization constants (corresponding to the duration length normalizing means 103); a fourth procedure for correcting the scores output by the second procedure based on the mean and variance of the durations and outputting them (corresponding to the recognition result candidate correction means 104); a fifth procedure for calculating, based on the mean and variance of the durations, the normalization constants used by the third procedure so that the squared error between the duration sequence and the mean durations is minimized (corresponding to the normalization constant calculation means 105); and a sixth procedure for storing the mean and variance of the durations used when the fourth procedure calculates the score and when the fifth procedure calculates the normalization constants (corresponding to the duration statistics storage means 106).

【００７６】次に、本発明の第３の実施の形態につい
て、図面を参照して説明する。Next, a third embodiment of the present invention will be described with reference to the drawings.

【００７７】本発明の第３の実施の形態は、外部から入
力される音声信号を受け取る第１の手順（音声信号入力
手段１０１相当）と、前記第１の手順から音声信号を受
け取り、あらかじめ準備された標準パタンのセット、お
よび、文法に基づいて、音声信号に適合するスコアの高
い単語列を上位から順に複数個求め、複数個の単語列、
複数個の単語列それぞれに対応する単語継続時間長系
列、および、複数個の単語列それぞれに対応するスコア
を出力する第２の手順（認識結果候補生成手段１０２相
当）と、前記第２の手順から単語列、継続時間長系列お
よびスコアを受け取り、単語列、継続時間長から正規化
定数を用いて継続時間長の正規化を行う第３の手順（継
続時間長正規化手段１０３相当）と、前記第２の手順の
出力したスコアを継続時間長の平均、分散に基づいて補
正し、出力する第４の手順（認識結果候補修正手段１０
４相当）と、前記第３の手順が継続時間長の正規化のた
めに用いる正規化定数を前記継続時間長系列と継続時間
長の平均との間の自乗誤差が最小となるように継続時間
長の平均、分散に基づいて計算する第５の手順（正規化
定数計算手段１０５相当）と、前記第４の手順がスコア
を計算する際、および、前記第５の手順が正規化定数を
計算する際に使用する継続時間長の平均、分散を格納す
る第６の手順（継続時間長統計量格納手段１０６相当）
とをコンピュータに実行させるプログラムである。The third embodiment of the present invention is a program that causes a computer to execute: a first procedure for receiving a voice signal input from the outside (corresponding to the voice signal input means 101); a second procedure for receiving the voice signal from the first procedure, finding, based on a set of standard patterns prepared in advance and a grammar, a plurality of high-scoring word strings that match the voice signal in order from the top, and outputting the plurality of word strings, a word duration sequence corresponding to each of the word strings, and a score corresponding to each of the word strings (corresponding to the recognition result candidate generation means 102); a third procedure for receiving the word strings, duration sequences, and scores from the second procedure and normalizing the durations using normalization constants (corresponding to the duration length normalizing means 103); a fourth procedure for correcting the scores output by the second procedure based on the mean and variance of the durations and outputting them (corresponding to the recognition result candidate correction means 104); a fifth procedure for calculating, based on the mean and variance of the durations, the normalization constants used by the third procedure so that the squared error between the duration sequence and the mean durations is minimized (corresponding to the normalization constant calculation means 105); and a sixth procedure for storing the mean and variance of the durations used when the fourth procedure calculates the score and when the fifth procedure calculates the normalization constants (corresponding to the duration statistics storage means 106).

【００７８】次に、本発明の第４の実施の形態につい
て、図面を参照して説明する。Next, a fourth embodiment of the present invention will be described with reference to the drawings.

【００７９】図４は、本発明の第４の実施の形態の構成
を示すブロック図である。FIG. 4 is a block diagram showing the configuration of the fourth embodiment of the present invention.

【００８０】図４を参照すると、本発明の第４の実施の
形態は、音声認識プログラムを記録した記録媒体３０１
を備える。この記録媒体３０１はＣＤ−ＲＯＭ、磁気デ
ィスク、半導体メモリその他のものでも可能であり、ネ
ットワークを介して流通する場合も含む。Referring to FIG. 4, the fourth embodiment of the present invention includes a recording medium 301 on which a voice recognition program is recorded. The recording medium 301 may be a CD-ROM, a magnetic disk, a semiconductor memory, or another medium, and the program may also be distributed over a network.

【００８１】音声認識プログラムは記録媒体３０１から
データ処理装置３０２に読み込まれ、データ処理装置３
０２の動作を制御する。The voice recognition program is read from the recording medium 301 into the data processing device 302 and controls the operation of the data processing device 302.

[0082] In an example of the fourth embodiment of the present invention, the data processing device 302, under the control of the voice recognition program, executes the same processing as that performed by the voice signal input means 101, the recognition result candidate generation means 102, the duration normalization means 103, the recognition result candidate correction means 104, and the normalization constant calculation means 105 of the first embodiment, and outputs a voice recognition result by referring to the duration statistics recording medium 303, which holds information equivalent to that of the duration statistics storage means 106 of the first embodiment.
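
The processing that the program performs on the data processing device 302 can be sketched as a small rescoring routine. The following Python sketch is illustrative only and is not the implementation described in this publication. It assumes that scores are in the log domain, that normalization is a linear map `a * d + b` fitted per candidate by variance-weighted least squares, and that the score correction adds a Gaussian log-likelihood of the normalized durations. The names `Candidate`, `fit_linear_map`, `duration_log_likelihood`, `rescore`, and the `weight` parameter are hypothetical, and the `stats` dictionary stands in for the duration statistics recording medium 303.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    """One recognition result candidate (hypothetical container)."""
    words: list       # word string
    durations: list   # one duration per word, e.g. in frames
    score: float      # recognizer score, assumed to be a log-domain value


def fit_linear_map(durations, means, variances):
    """Fit d' = a*d + b by variance-weighted least squares (assumed criterion)."""
    weights = [1.0 / v for v in variances]
    s_w = sum(weights)
    s_d = sum(w * d for w, d in zip(weights, durations))
    s_m = sum(w * m for w, m in zip(weights, means))
    s_dd = sum(w * d * d for w, d in zip(weights, durations))
    s_dm = sum(w * d * m for w, d, m in zip(weights, durations, means))
    denom = s_w * s_dd - s_d * s_d
    if abs(denom) < 1e-12:
        # Single word or identical durations: fall back to the identity map.
        return 1.0, 0.0
    a = (s_w * s_dm - s_d * s_m) / denom
    b = (s_m - a * s_d) / s_w
    return a, b


def duration_log_likelihood(norm_durations, means, variances):
    """Sum of Gaussian log-densities of the normalized durations (assumed model)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (d - m) ** 2 / v)
        for d, m, v in zip(norm_durations, means, variances)
    )


def rescore(candidates, stats, weight=1.0):
    """Return (corrected_score, candidate) pairs, best first.

    stats maps each word to (mean duration, variance of duration).
    """
    rescored = []
    for cand in candidates:
        means = [stats[w][0] for w in cand.words]
        variances = [stats[w][1] for w in cand.words]
        a, b = fit_linear_map(cand.durations, means, variances)
        normalized = [a * d + b for d in cand.durations]
        corrected = cand.score + weight * duration_log_likelihood(
            normalized, means, variances
        )
        rescored.append((corrected, cand))
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return rescored
```

In this sketch, a candidate whose durations are a uniform rescaling of the stored means (for example, an utterance spoken twice as fast) is mapped back onto those means and receives the largest possible duration likelihood, while a candidate whose durations cannot be aligned with its word means by any linear map is penalized. Fitting two constants per candidate is a simplification; with very short candidates the fit becomes trivially exact.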

[0083]

[Effects of the Invention] The present invention makes more effective use of duration information and, as a result, improves the accuracy of voice recognition.

[0084] The reason is that, in a technique that uses duration information to improve the accuracy of voice recognition, fluctuations in speaking rate can be estimated more accurately even when the durations are disturbed by noise or the like, and the factors causing those fluctuations can be absorbed in finer detail.

[Brief description of drawings]

FIG. 1 is a block diagram showing the configuration of the first embodiment of the present invention.

FIG. 2 is an explanatory diagram showing an example of correcting recognition result candidates by score correction based on word durations.

FIG. 3 is an explanatory diagram showing an example of word duration statistics.

FIG. 4 is a block diagram showing the configuration of the fourth embodiment of the present invention.

FIG. 5 is a block diagram showing the configuration of the prior art.

[Explanation of symbols]

101 Voice signal input means
102 Recognition result candidate generation means
103 Duration normalization means
104 Recognition result candidate correction means
105 Normalization constant calculation means
106 Duration statistics storage means
201 Voice signal input means
202 Recognition result candidate generation means
203 Recognition result candidate correction means
204 Duration statistics storage means
301 Recording medium
302 Data processing device
303 Duration statistics recording medium

Claims

[Claims]

1. A voice recognition device comprising: voice signal input means for receiving a voice signal input from the outside; recognition result candidate generation means for receiving the voice signal from the voice signal input means, obtaining, based on a set of standard patterns prepared in advance and a grammar, a plurality of word strings in descending order of their matching score against the voice signal, and outputting the plurality of word strings, a word duration sequence corresponding to each of the word strings, and a score corresponding to each of the word strings; duration normalization means for receiving the word strings, the duration sequences, and the scores from the recognition result candidate generation means and normalizing the durations, from the word strings and the durations, using a normalization constant; recognition result candidate correction means for correcting the scores output by the recognition result candidate generation means based on the mean and variance of the durations and outputting the corrected scores; normalization constant calculation means for calculating, based on the mean and variance of the durations, the normalization constant used by the duration normalization means to normalize the durations; and duration statistics storage means for storing the mean and variance of the durations used when the recognition result candidate correction means calculates a score and when the normalization constant calculation means calculates the normalization constant.

2. A voice recognition device comprising: voice signal input means for receiving a voice signal input from the outside; recognition result candidate generation means for receiving the voice signal from the voice signal input means, obtaining, based on a set of standard patterns prepared in advance and a grammar, a plurality of word strings in descending order of their matching score against the voice signal, and outputting the plurality of word strings, a word duration sequence corresponding to each of the word strings, and a score corresponding to each of the word strings; duration normalization means for receiving the word strings, the duration sequences, and the scores from the recognition result candidate generation means and normalizing the durations, from the word strings and the durations, using normalization constants; recognition result candidate correction means for correcting the scores output by the recognition result candidate generation means based on the mean and variance of the durations and outputting the corrected scores; normalization constant calculation means for calculating, based on the mean and variance of the durations, a first-type normalization constant for multiplying the durations by a constant and a second-type normalization constant for adding a constant to the durations, as the normalization constants used by the duration normalization means to normalize the durations; and duration statistics storage means for storing the mean and variance of the durations used when the recognition result candidate correction means calculates a score and when the normalization constant calculation means calculates the normalization constants.

3. A voice recognition device comprising: voice signal input means for receiving a voice signal input from the outside; recognition result candidate generation means for receiving the voice signal from the voice signal input means, obtaining, based on a set of standard patterns prepared in advance and a grammar, a plurality of word strings in descending order of their matching score against the voice signal, and outputting the plurality of word strings, a word duration sequence corresponding to each of the word strings, and a score corresponding to each of the word strings; duration normalization means for receiving the word strings, the duration sequences, and the scores from the recognition result candidate generation means and normalizing the durations, from the word strings and the durations, using a normalization constant; recognition result candidate correction means for correcting the scores output by the recognition result candidate generation means based on the mean and variance of the durations and outputting the corrected scores; normalization constant calculation means for calculating, based on the mean and variance of the durations, the normalization constant used by the duration normalization means to normalize the durations, so that the squared error between the duration sequence and the mean durations is minimized; and duration statistics storage means for storing the mean and variance of the durations used when the recognition result candidate correction means calculates a score and when the normalization constant calculation means calculates the normalization constant.

4. A voice recognition method comprising: a first procedure of receiving a voice signal input from the outside; a second procedure of receiving the voice signal from the first procedure, obtaining, based on a set of standard patterns prepared in advance and a grammar, a plurality of word strings in descending order of their matching score against the voice signal, and outputting the plurality of word strings, a word duration sequence corresponding to each of the word strings, and a score corresponding to each of the word strings; a third procedure of receiving the word strings, the duration sequences, and the scores from the second procedure and normalizing the durations, from the word strings and the durations, using a normalization constant; a fourth procedure of correcting the scores output by the second procedure based on the mean and variance of the durations and outputting the corrected scores; a fifth procedure of calculating, based on the mean and variance of the durations, the normalization constant used by the third procedure to normalize the durations; and a sixth procedure of storing the mean and variance of the durations used when the fourth procedure calculates a score and when the fifth procedure calculates the normalization constant.

5. A voice recognition method comprising: a first procedure of receiving a voice signal input from the outside; a second procedure of receiving the voice signal from the first procedure, obtaining, based on a set of standard patterns prepared in advance and a grammar, a plurality of word strings in descending order of their matching score against the voice signal, and outputting the plurality of word strings, a word duration sequence corresponding to each of the word strings, and a score corresponding to each of the word strings; a third procedure of receiving the word strings, the duration sequences, and the scores from the second procedure and normalizing the durations, from the word strings and the durations, using normalization constants; a fourth procedure of correcting the scores output by the second procedure based on the mean and variance of the durations and outputting the corrected scores; a fifth procedure of calculating, based on the mean and variance of the durations, a first-type normalization constant for multiplying the durations by a constant and a second-type normalization constant for adding a constant to the durations, as the normalization constants used by the third procedure to normalize the durations; and a sixth procedure of storing the mean and variance of the durations used when the fourth procedure calculates a score and when the fifth procedure calculates the normalization constants.

6. A voice recognition method comprising: a first procedure of receiving a voice signal input from the outside; a second procedure of receiving the voice signal from the first procedure, obtaining, based on a set of standard patterns prepared in advance and a grammar, a plurality of word strings in descending order of their matching score against the voice signal, and outputting the plurality of word strings, a word duration sequence corresponding to each of the word strings, and a score corresponding to each of the word strings; a third procedure of receiving the word strings, the duration sequences, and the scores from the second procedure and normalizing the durations, from the word strings and the durations, using a normalization constant; a fourth procedure of correcting the scores output by the second procedure based on the mean and variance of the durations and outputting the corrected scores; a fifth procedure of calculating, based on the mean and variance of the durations, the normalization constant used by the third procedure to normalize the durations, so that the squared error between the duration sequence and the mean durations is minimized; and a sixth procedure of storing the mean and variance of the durations used when the fourth procedure calculates a score and when the fifth procedure calculates the normalization constant.

7. A voice recognition program causing a computer to execute: a first procedure of receiving a voice signal input from the outside; a second procedure of receiving the voice signal from the first procedure, obtaining, based on a set of standard patterns prepared in advance and a grammar, a plurality of word strings in descending order of their matching score against the voice signal, and outputting the plurality of word strings, a word duration sequence corresponding to each of the word strings, and a score corresponding to each of the word strings; a third procedure of receiving the word strings, the duration sequences, and the scores from the second procedure and normalizing the durations, from the word strings and the durations, using a normalization constant; a fourth procedure of correcting the scores output by the second procedure based on the mean and variance of the durations and outputting the corrected scores; a fifth procedure of calculating, based on the mean and variance of the durations, the normalization constant used by the third procedure to normalize the durations; and a sixth procedure of storing the mean and variance of the durations used when the fourth procedure calculates a score and when the fifth procedure calculates the normalization constant.

8. A voice recognition program causing a computer to execute: a first procedure of receiving a voice signal input from the outside; a second procedure of receiving the voice signal from the first procedure, obtaining, based on a set of standard patterns prepared in advance and a grammar, a plurality of word strings in descending order of their matching score against the voice signal, and outputting the plurality of word strings, a word duration sequence corresponding to each of the word strings, and a score corresponding to each of the word strings; a third procedure of receiving the word strings, the duration sequences, and the scores from the second procedure and normalizing the durations, from the word strings and the durations, using normalization constants; a fourth procedure of correcting the scores output by the second procedure based on the mean and variance of the durations and outputting the corrected scores; a fifth procedure of calculating, based on the mean and variance of the durations, a first-type normalization constant for multiplying the durations by a constant and a second-type normalization constant for adding a constant to the durations, as the normalization constants used by the third procedure to normalize the durations; and a sixth procedure of storing the mean and variance of the durations used when the fourth procedure calculates a score and when the fifth procedure calculates the normalization constants.

9. A voice recognition program causing a computer to execute: a first procedure of receiving a voice signal input from the outside; a second procedure of receiving the voice signal from the first procedure, obtaining, based on a set of standard patterns prepared in advance and a grammar, a plurality of word strings in descending order of their matching score against the voice signal, and outputting the plurality of word strings, a word duration sequence corresponding to each of the word strings, and a score corresponding to each of the word strings; a third procedure of receiving the word strings, the duration sequences, and the scores from the second procedure and normalizing the durations, from the word strings and the durations, using a normalization constant; a fourth procedure of correcting the scores output by the second procedure based on the mean and variance of the durations and outputting the corrected scores; a fifth procedure of calculating, based on the mean and variance of the durations, the normalization constant used by the third procedure to normalize the durations, so that the squared error between the duration sequence and the mean durations is minimized; and a sixth procedure of storing the mean and variance of the durations used when the fourth procedure calculates a score and when the fifth procedure calculates the normalization constant.