JP2001083978A

JP2001083978A - Voice recognition device

Info

Publication number: JP2001083978A
Application number: JP2000164077A
Authority: JP
Inventors: Junichi Nakabashi; 順一中橋; Junko Yagi; 順子八木
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1999-07-15
Filing date: 2000-06-01
Publication date: 2001-03-30

Abstract

(57) [Summary]
[Problem] Input speech contains portions with high temporal stationarity, such as vowels, and portions with low temporal stationarity, such as consonants. By varying the analysis window length according to these characteristics of the input speech, a speech recognition device is provided that achieves higher recognition performance than conventional devices without increasing the recognition processing time.
[Solution] Speech analysis units 103 and 104 each convert the input speech into a feature vector at each time, using mutually different analysis window lengths. A feature vector correlation calculation unit 105 calculates the correlation value between the feature vectors at each time. According to this correlation value, the optimal feature vector is selected from among the feature vectors calculated with the different analysis window lengths, and the speech is recognized by the HMM method using the selected vector.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a speech recognition device that performs automatic speech recognition on a microcomputer, electronic computer, or the like, using the similarity between the time series of feature vectors of input speech and the time series of feature vectors of reference speech or a statistical model thereof.

[0002]

[Prior Art] In a speech recognition device, speech input from a microphone or the like is generally converted into a time series of spectra for recognition. The most basic feature vector representing a spectrum is the cepstrum, which is defined as the Fourier transform of the logarithmic spectrum (hereinafter simply called the spectrum). The mel cepstrum, which is the cepstrum warped to the mel scale, and the outputs of an FFT or of BPFs (band-pass filters) are also used as feature vectors for speech recognition.
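As a rough illustration of the cepstrum described above, the following Python sketch takes the log magnitude spectrum of one frame and transforms it again. The function names `dft` and `real_cepstrum`, the naive O(N^2) transform, and the small `floor` that guards against log(0) are assumptions made for illustration and do not come from the patent.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def real_cepstrum(frame, floor=1e-12):
    """Real cepstrum of one frame: transform of the log magnitude spectrum."""
    spectrum = dft(frame)
    log_mag = [math.log(max(abs(c), floor)) for c in spectrum]
    # The log magnitude spectrum of a real frame is real and even, so the
    # forward and inverse transforms differ only by the 1/N scale factor.
    n = len(frame)
    return [c.real / n for c in dft(log_mag)]
```

A practical front end would use an FFT and usually a mel-scaled filter bank; the quadratic transform here is only meant to keep the sketch free of dependencies.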

[0003] In recent years, various feature vectors have also been reported with the aim of improving the recognition rate, such as the delta cepstrum, which represents the temporal change of the spectrum [S. Furui: "Speaker-Independent Isolated Word Recognition Using Dynamic Features of Speech Spectrum", IEEE Trans., ASSP-34, No.1, pp.52-59 (1986-2)], and the weighted cepstrum, which is intended to emphasize the spectrum [Y. Tohkura: "A Weighted Cepstral Distance Measure for Speech Recognition", IEEE Trans., ASSP-35, No.10, pp.1414-1422 (1987-10)].

[0004] Viewed in terms of analysis window length, however, these conventional feature vectors share a limitation. The cepstrum and the weighted cepstrum are calculated with a predetermined analysis window length that is always constant, and the delta cepstrum is likewise calculated as a linear regression coefficient over a predetermined number of frames (cepstra from at least two frames). With all of these feature vectors, speech analysis is performed with a fixed analysis window length regardless of the nature of the input speech.
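To make the "linear regression coefficient over a fixed number of frames" concrete, here is a minimal Python sketch of a delta feature. The name `delta_features`, the symmetric span of `2 * half_width + 1` frames, and the repetition of edge frames are illustrative assumptions; the patent only states that a predetermined number of frames is used.

```python
def delta_features(frames, half_width=2):
    """Per-dimension regression slope over 2*half_width+1 frames, edges clamped."""
    denom = sum(k * k for k in range(-half_width, half_width + 1))
    last = len(frames) - 1
    deltas = []
    for t in range(len(frames)):
        dims = len(frames[t])
        slope = [0.0] * dims
        for k in range(-half_width, half_width + 1):
            src = frames[min(max(t + k, 0), last)]  # repeat edge frames
            for d in range(dims):
                slope[d] += k * src[d]
        deltas.append([s / denom for s in slope])
    return deltas
```

Because `half_width` is fixed in advance, the time span covered by each delta value is the same for vowels and consonants alike, which is the limitation the paragraph above points out.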

[0005] A method has also been proposed in which a plurality of linear regression coefficients are combined into a single feature vector of larger dimension. However, this method requires additional recognition processing time in proportion to the number of coefficients combined, and is not suited to building real-time systems.

[0006] FIG. 7 is a block diagram of a conventional speech recognition device that uses a cepstrum calculated with a fixed analysis window length. In FIG. 7, speech input to a microphone 700 passes through a low-pass filter 701, which cuts frequency components at or above 1/2 of the sampling frequency, and is then sampled by an A/D conversion unit 702 at a predetermined sampling frequency (for example, 8 kHz) and converted into a digital signal. This digital signal is converted into a time series of feature vectors (cepstra) by a speech analysis unit 703. An HMM recognition unit 704 calculates the similarity between the time series of feature vectors and the model of each recognition target word stored in an HMM storage unit 707, and a recognition result determination unit 706 determines the word with the highest similarity to the input speech as the recognition result.

[0007]

[Problems to Be Solved by the Invention] However, input speech contains portions with high temporal stationarity, such as vowels, which remain stationary for about 100 msec, for example, and portions with low temporal stationarity, such as consonants and semivowels. Conventionally, both were analyzed with the same analysis window length. As a result, a window length suited to vowels lowers the time resolution in consonant portions, and conversely a window length suited to consonants lowers the frequency resolution in vowel portions, so that recognition performance deteriorates. Moreover, if feature vectors analyzed in advance with a plurality of analysis window lengths are simply concatenated, the amount of information becomes enormous and the processing required for recognition increases greatly.
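The trade-off between time resolution and frequency resolution can be checked with simple arithmetic. The sketch below assumes a DFT-based analysis; the function name `window_resolution` and the window sizes of 128 and 256 samples are illustrative choices, while the 8 kHz sampling rate is the example value given in the text.

```python
def window_resolution(num_samples, sample_rate_hz):
    """Return (window duration in ms, DFT bin spacing in Hz) for one window."""
    duration_ms = 1000.0 * num_samples / sample_rate_hz
    bin_spacing_hz = sample_rate_hz / num_samples
    return duration_ms, bin_spacing_hz


# Doubling the window halves the bin spacing but doubles the time span.
print(window_resolution(128, 8000))  # (16.0, 62.5)
print(window_resolution(256, 8000))  # (32.0, 31.25)
```

A 32 ms window resolves a stationary vowel more finely in frequency, but smears a consonant burst that lasts only a few milliseconds.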

[0008] Accordingly, an object of the inventions of claims 1 to 3, 5, and 6 of the present application is to provide a speech recognition method with higher recognition performance than conventional methods, without greatly increasing the recognition processing time, by making the analysis window length variable according to the characteristics of the input speech. An object of the invention of claim 4 of the present application is to provide a speech recognition device with high recognition performance by emphasizing the feature vector obtained according to the characteristics of the input speech.

[0009]

[Means for Solving the Problems] The invention of claim 1 of the present application comprises: speech input means for inputting speech; at least two speech analysis means, each of which converts the input speech into a feature vector at each time using a different analysis window length; feature vector correlation calculation means for calculating at least one of a correlation value and a similarity between the feature vectors output from the speech analysis means; feature vector selection means for selecting, from among the feature vectors calculated by the speech analysis means and according to the correlation value or similarity calculated by the feature vector correlation calculation means, the feature vector of the longer analysis window length when the correlation value or similarity is large, on the grounds that the input speech is changing little, and the feature vector of the shorter analysis window length when the correlation value or similarity is small, on the grounds that the input speech is changing greatly (the vector so selected is hereinafter called the selected feature vector); standard pattern storage means for storing a model for each recognition target word; recognition means for calculating the likelihood between the time series of selected feature vectors and the standard patterns stored in the standard pattern storage means; and recognition result determination means for determining a recognition result from the likelihoods for the standard patterns output by the recognition means.

[0010] According to the invention of claim 1 having these features, the feature vector of the longer analysis window length can be selected when the correlation value or similarity is large, on the grounds that the input speech is changing little, and that of the shorter analysis window length when the correlation value or similarity is small, on the grounds that the input speech is changing greatly. An analysis window length suited to the input speech can thus be selected, and recognition performance can be improved.

[0011] The invention of claim 2 of the present application comprises: speech input means for inputting speech; first speech analysis means for converting the input speech into a feature vector using a predetermined analysis window length; a buffer for storing the feature vector of at least one past time; feature vector correlation calculation means for calculating at least one of a correlation value and a similarity between the feature vector output from the first speech analysis means and the past feature vector stored in the buffer; analysis window length setting means for comparing the correlation value or similarity with a predetermined first threshold and setting an analysis window length longer than the predetermined analysis window length when it is at or above that threshold, and for comparing it with a predetermined second threshold smaller than the first threshold and setting an analysis window length shorter than the predetermined analysis window length when it is at or below that threshold; second speech analysis means for converting the input speech into a feature vector using the analysis window length set by the analysis window length setting means; standard pattern storage means for storing a model for each recognition target word; recognition means for calculating the likelihood between the sequence of feature vectors obtained from the second speech analysis means and the standard patterns stored in the standard pattern storage means; and recognition result determination means for determining a recognition result from the likelihoods for the standard patterns output by the recognition means.

[0012] According to the invention of claim 2 having these features, the optimal analysis window length calculated by the analysis window length setting means is determined as a function of the correlation value or similarity. The larger the correlation value or similarity between feature vectors, the less the input speech is taken to be changing and the longer the analysis window length that is calculated; the smaller the correlation value or similarity, the more the input speech is taken to be changing and the shorter the analysis window length that is calculated. An analysis window length suited to the input speech can thus be selected, and recognition performance can be improved.
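The two-threshold rule of claim 2 can be sketched in a few lines of Python. The function name `set_window_length`, the concrete lengths and thresholds in the usage example, and the choice to keep the predetermined length when the score falls between the two thresholds are assumptions; the claim itself only specifies the behaviour at or above the first threshold and at or below the second.

```python
def set_window_length(score, base_len, long_len, short_len, upper, lower):
    """Map a correlation or similarity score to an analysis window length."""
    if lower >= upper:
        raise ValueError("the second threshold must be smaller than the first")
    if score >= upper:   # adjacent frames alike: speech is changing little
        return long_len
    if score <= lower:   # adjacent frames differ: speech is changing a lot
        return short_len
    return base_len      # assumed: otherwise keep the predetermined length


# Hypothetical lengths in samples and hypothetical thresholds.
print(set_window_length(0.95, 256, 512, 128, upper=0.9, lower=0.5))  # 512
```

In the claimed arrangement the score would come from comparing the current feature vector with the buffered one from a past time, and the resulting length would drive the second speech analysis means.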

[0013] The invention of claim 3 of the present application comprises: speech input means for inputting speech; first speech analysis means for converting the input speech into a feature vector using a predetermined analysis window length; a buffer for storing at least two feature vectors obtained by the first speech analysis means; linear regression coefficient calculation means for calculating a linear regression coefficient from the feature vectors stored in the buffer; analysis window length setting means for comparing the linear regression coefficient with a predetermined first threshold and setting an analysis window length longer than the predetermined analysis window length when it is at or below that threshold, and for comparing it with a predetermined second threshold larger than the first threshold and setting an analysis window length shorter than the predetermined analysis window length when it is at or above that threshold; second speech analysis means for converting the input speech into a feature vector using the analysis window length set by the analysis window length setting means; standard pattern storage means for storing a model for each recognition target word; recognition means for calculating the likelihood between the sequence of feature vectors obtained from the second speech analysis means and the standard patterns stored in the standard pattern storage means; and recognition result determination means for determining a recognition result from the likelihoods for the standard patterns output by the recognition means.

[0014] According to the invention of claim 3 having these features, the optimal analysis window length calculated by the analysis window length setting means is determined as a function of the linear regression coefficient. The smaller the linear regression coefficient, the less the input speech is taken to be changing and the longer the analysis window length that is calculated; the larger the linear regression coefficient, the more the input speech is taken to be changing and the shorter the analysis window length that is calculated. An analysis window length suited to the input speech can thus be selected, and recognition performance can be improved.
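A possible reading of claim 3 in code: fit a least-squares slope over the buffered feature vectors, reduce it to one number, and map that number to a window length with the thresholds reversed relative to the correlation case. How the vector-valued regression coefficient is reduced to a scalar is not stated in the text, so the Euclidean norm used here is an assumption, as are the names `regression_slope_norm` and `window_from_slope` and the behaviour between the two thresholds.

```python
import math


def regression_slope_norm(buffer):
    """Euclidean norm of the per-dimension least-squares slope over the buffer."""
    n = len(buffer)
    if n < 2:
        raise ValueError("need at least two buffered feature vectors")
    t_mean = (n - 1) / 2.0
    denom = sum((t - t_mean) ** 2 for t in range(n))
    slopes = [
        sum((t - t_mean) * buffer[t][d] for t in range(n)) / denom
        for d in range(len(buffer[0]))
    ]
    return math.sqrt(sum(s * s for s in slopes))


def window_from_slope(slope, base_len, long_len, short_len, low_thr, high_thr):
    """Small slope -> long window; large slope -> short window."""
    if low_thr >= high_thr:
        raise ValueError("the second threshold must be larger than the first")
    if slope <= low_thr:
        return long_len
    if slope >= high_thr:
        return short_len
    return base_len  # assumed: otherwise keep the predetermined length
```

For example, a buffer whose first coefficient rises by 1 per frame gives a slope norm of 1, which would select the short window for any `high_thr` at or below 1.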

[0015] The invention of claim 4 of the present application comprises: speech input means for inputting speech; speech analysis means for converting the input speech into a feature vector using a predetermined analysis window length; a buffer for storing the feature vector of at least one past time; feature vector correlation calculation means for calculating at least one of a correlation value and a similarity between the feature vector output from the speech analysis means and the past feature vector stored in the buffer; emphasis coefficient calculation means for comparing the correlation value or similarity with a predetermined threshold and, when it is at or below the threshold, calculating an emphasis coefficient for emphasizing the feature vector; speech emphasis means for replacing the feature vector with a feature vector emphasized by the emphasis coefficient; standard pattern storage means for storing a model for each recognition target word; recognition means for calculating the likelihood between the time series of feature vectors obtained from the speech emphasis means and the standard patterns stored in the standard pattern storage means; and recognition result determination means for determining a recognition result from the likelihoods for the standard patterns output by the recognition means.

[0016] According to the invention of claim 4 having these features, the emphasis coefficient of the emphasis coefficient calculation means is calculated as a function of the correlation value or similarity, so that the smaller the correlation value or similarity, the more the input speech is taken to be changing and the larger the emphasis coefficient is made. A feature vector suited to the input speech can thus be obtained, and recognition performance can be improved.
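The text does not give the functional form of the emphasis coefficient, only that it is computed when the score is at or below the threshold and grows as the score falls. The sketch below therefore uses a simple linear gain purely as a placeholder; `emphasis_coefficient`, `emphasize`, and the `strength` parameter are hypothetical names, and multiplying the whole vector by the gain is one assumed way of "emphasizing" it.

```python
def emphasis_coefficient(score, threshold, strength=1.0):
    """Gain of 1 above the threshold, growing linearly as the score drops below it."""
    if score > threshold:
        return 1.0
    return 1.0 + strength * (threshold - score)


def emphasize(feature_vector, score, threshold, strength=1.0):
    """Replace the feature vector by a copy scaled with the emphasis coefficient."""
    gain = emphasis_coefficient(score, threshold, strength)
    return [gain * x for x in feature_vector]
```

With this placeholder, frames in rapidly changing regions (low score) are scaled up while stationary frames pass through unchanged.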

[0017] The invention of claim 5 of the present application comprises: speech input means for inputting speech; at least two pitch extraction means, each of which converts the input speech into a pitch frequency (fundamental frequency) of the speech at each time using a different analysis window length; pitch correlation calculation means for calculating at least one of a correlation value and a similarity between the pitch frequencies output from the pitch extraction means; analysis window length setting means for setting, from among the analysis window lengths used by the pitch extraction means and according to the correlation value or similarity, the longer analysis window length when the correlation value or similarity is large, on the grounds that the input speech is changing little, and the shorter analysis window length when the correlation value or similarity is small, on the grounds that the input speech is changing greatly; speech analysis means for converting the input speech into a feature vector using the analysis window length set by the analysis window length setting means; standard pattern storage means for storing a model for each recognition target word; recognition means for calculating the likelihood between the sequence of feature vectors obtained by the speech analysis means and the standard patterns stored in the standard pattern storage means; and recognition result determination means for determining a recognition result from the likelihoods for the standard patterns output by the recognition means.

[0018] According to the invention of claim 5 having these features, the optimal analysis window length calculated by the analysis window length setting means is determined on the basis of the correlation value or similarity. The larger the correlation value or similarity between the pitch frequencies, the less the pitch frequency of the input speech depends on the analysis window length; that is, the input speech is taken to be in a stationary vowel portion with little change, and a long analysis window length is calculated. The smaller the correlation value or similarity, the more the input speech is taken to be in a consonant or phoneme transition with much change, and a short analysis window length is calculated. An analysis window length suited to the input speech can thus be selected, and recognition performance can be improved.
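One way to picture claim 5 is to estimate the pitch of the same instant with a short and a long window and compare the two estimates. The patent does not say which pitch extractor or which similarity measure is used, so the autocorrelation peak picker, the relative-agreement score, the 80 to 400 Hz search range, and the names `autocorr_pitch`, `pitch_agreement`, and `choose_window` below are all assumptions.

```python
import math


def autocorr_pitch(frame, sample_rate_hz, min_hz=80.0, max_hz=400.0):
    """Pitch in Hz from the lag with the largest autocorrelation in the search range."""
    min_lag = int(sample_rate_hz / max_hz)
    max_lag = min(int(sample_rate_hz / min_hz), len(frame) - 1)
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate_hz / best_lag


def pitch_agreement(f_short, f_long):
    """1.0 when both estimates coincide, smaller as they diverge."""
    return 1.0 - abs(f_short - f_long) / max(f_short, f_long)


def choose_window(frame_short, frame_long, sample_rate_hz, threshold):
    """Return the long window length if the two pitch estimates agree, else the short."""
    score = pitch_agreement(
        autocorr_pitch(frame_short, sample_rate_hz),
        autocorr_pitch(frame_long, sample_rate_hz),
    )
    return len(frame_long) if score > threshold else len(frame_short)
```

On a steady 200 Hz tone sampled at 8 kHz, windows of 160 and 320 samples both yield a lag of 40 samples, so the estimates agree and the long window is chosen; in a transition the two estimates would be expected to drift apart.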

[0019] The invention of claim 6 of the present application comprises: speech input means for inputting speech; pitch extraction means for converting the input speech into a pitch frequency (fundamental frequency) of the speech using a predetermined analysis window length; a buffer for storing the pitch frequency of at least one past time obtained by the pitch extraction means; pitch correlation calculation means for calculating at least one of a correlation value and a similarity between the pitch frequency output from the pitch extraction means and the past pitch frequency stored in the buffer; analysis window length setting means for comparing the correlation value or similarity with a predetermined first threshold and setting an analysis window length longer than the predetermined analysis window length when it is at or above the first threshold, and for comparing it with a predetermined second threshold smaller than the first threshold and setting an analysis window length shorter than the predetermined analysis window length when it is at or below the second threshold; speech analysis means for converting the input speech into a feature vector using the analysis window length set by the analysis window length setting means; standard pattern storage means for storing a model for each recognition target word; recognition means for calculating the likelihood between the sequence of feature vectors obtained by the speech analysis means and the standard patterns stored in the standard pattern storage means; and recognition result determination means for determining a recognition result from the likelihoods for the standard patterns output by the recognition means.

[0020] According to the invention of claim 6 having these features, the optimal analysis window length calculated by the analysis window length setting means is determined on the basis of the correlation value or similarity. The larger the correlation value or similarity between the pitch frequencies, the less the pitch frequency of the input speech changes between adjacent times; that is, the input speech is taken to be in a stationary vowel portion with little change, and a long analysis window length is calculated. The smaller the correlation value or similarity, the more the input speech is taken to be in a consonant or phoneme transition with much change, and a short analysis window length is calculated. An analysis window length suited to the input speech can thus be selected, and recognition performance can be improved.

[0021]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to FIGS. 1 to 6. FIG. 1 is a block diagram showing the configuration of a speech recognition device according to a first embodiment of the present invention. This embodiment describes the case where there are two speech analysis units. In FIG. 1, a microphone 100 is for inputting speech, and the input speech (hereinafter called the input speech) is passed to a low-pass filter 101. The low-pass filter 101 cuts frequency components at or above 1/2 of the sampling frequency, and its output is supplied to an A/D conversion unit 102. The A/D conversion unit 102 samples the input signal at a predetermined sampling frequency (for example, 8 kHz) and converts it into a digital signal. The microphone 100, the low-pass filter 101, and the A/D conversion unit 102 constitute speech input means for inputting speech. A first speech analysis unit 103 converts this digital signal into a first feature vector (cepstrum) using a predetermined analysis window length 1. A second speech analysis unit 104 converts the digital signal into a second feature vector (cepstrum) using a predetermined analysis window length 2 that differs from analysis window length 1. A feature vector correlation calculation unit 105 calculates the correlation value or similarity between the first and second feature vectors. A feature vector selection unit 106 selects, on the basis of that correlation value or similarity, whichever of the two feature vectors has the optimal analysis window length as the selected feature vector. An HMM storage unit 108 is standard pattern storage means that stores a model for each word to be recognized, and an HMM recognition unit 107 is recognition means that calculates the likelihood between the model of each recognition target word stored in the HMM storage unit 108 and the selected feature vector sequence. A recognition result determination unit 109 is recognition result determination means that determines the word with the highest likelihood as the recognition result.

[0022] The operation of the speech recognition device of the present invention configured as described above is explained below, taking as an example the case where the optimal analysis window length is chosen according to the magnitude of the correlation value or similarity, and treating separately the cases where that value is large and where it is small. Here, analysis window lengths 1 and 2 of the speech analysis units 103 and 104 are taken to be W and 2W (twice W), respectively.

(Step 1-1) The digital signal U = {u_1, u_2, u_3, ..., u_T} sampled by the A/D conversion unit 102 is converted by the first speech analysis unit 103, with analysis window length W, into a first feature vector sequence A = {a_1, a_2, ..., a_i, ..., a_I}. It is likewise converted by the second speech analysis unit 104, with analysis window length 2W (twice W), into a second feature vector sequence B = {b_1, b_2, ..., b_j, ..., b_J}.
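Step 1-1 needs two frame sequences that refer to the same time instants even though their window lengths differ. A minimal sketch of one way to arrange that is shown below: both analyses use the same hop and place each window around the same sample index. The name `centered_frames`, the zero padding at the edges, and the centring convention are assumptions; the text only requires that the two feature vectors correspond to the same time.

```python
def centered_frames(signal, window_len, hop, num_frames):
    """Cut num_frames frames of window_len samples placed around t = i*hop, zero-padded."""
    half = window_len // 2
    frames = []
    for i in range(num_frames):
        start = i * hop - half
        frames.append([
            signal[n] if 0 <= n < len(signal) else 0.0
            for n in range(start, start + window_len)
        ])
    return frames


# Frames for window lengths W and 2W share the same frame index i.
signal = [float(n) for n in range(10)]
short_frames = centered_frames(signal, window_len=2, hop=2, num_frames=5)
long_frames = centered_frames(signal, window_len=4, hop=2, num_frames=5)
```

Each pair `short_frames[i]`, `long_frames[i]` would then be passed through the same feature extraction (for example a cepstrum) to obtain the aligned vectors compared in the next step.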

[0023] (Step 1-2) The correlation value r(a_i, b_j) between the feature vectors at the same time in the first feature vector sequence A and the second feature vector sequence B is calculated, for example, according to Equation 1.

[Equation 1]

Here, s(a, b) denotes the variance of a and b. In the case of the similarity h(a_i, b_j), the value is calculated according to Equation 2.

[Equation 2]

The magnitude of this correlation value or similarity is interpreted as follows. A large correlation value or similarity means that analyzing the speech at the same time with the long analysis window length 2W and with the short analysis window length W yields equivalent feature vectors, so the speech can be regarded as a stationary portion (vowel portion) with little change in frequency content. Conversely, a small correlation value or similarity means that the speech can be regarded as a consonant portion with much change in frequency content.
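Equations 1 and 2 are not reproduced in this text, so the following Python sketch substitutes two standard measures as stand-ins: the Pearson correlation coefficient (covariance normalised by the two standard deviations, which is at least consistent with the mention of s(a, b)) and the cosine similarity. Both choices, and the function names `correlation` and `cosine_similarity`, are assumptions rather than the patent's formulas.

```python
import math


def correlation(a, b):
    """Pearson correlation between two equal-length feature vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0  # a constant vector has no defined correlation
    return cov / math.sqrt(var_a * var_b)


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
```

Either score is close to 1 when the W and 2W analyses produce feature vectors of the same shape, which is the situation the paragraph above associates with stationary vowel portions.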

【００２４】（ステップ１−３) （相関値又は類似度が予め定めた閾値より大きい場合）
相関値ｒ（ａ_i , ｂ_j ) 又は類似度ｈ（ａ_i , ｂ_j ) が
予め定めた閾値より大きい場合は、長い分析窓長２Ｗを
選択すればいいので特徴ベクトル選択部１０６では分析
窓長２Ｗの特徴ベクトルを現時刻の最適特徴ベクトルと
選択する。（相関値又は類似度が予め定めた閾値より小さい場合)
一方、相関値ｒ（ａ_i , ｂ_j ) 又は類似度ｈ（ａ_i , ｂ
_j ) が予め定めた閾値より小さい場合は、短い分析窓長
Ｗを選択すればいいので特徴ベクトル選択部１０６では
分析窓長Ｗの特徴ベクトルを現時刻の最適特徴ベクトル
と選択する。これにより自動的に母音・子音に最適な分
析窓長を選択していることになる。(Step 1-3) (When the correlation value or similarity is larger than a predetermined threshold)
If the correlation value r(a_i, b_j) or the similarity h(a_i, b_j) is larger than a predetermined threshold, the long analysis window length 2W is appropriate, so the feature vector selection unit 106 selects the feature vector of analysis window length 2W as the optimal feature vector at the current time. (When the correlation value or similarity is smaller than a predetermined threshold)
On the other hand, if the correlation value r(a_i, b_j) or the similarity h(a_i, b_j) is smaller than the predetermined threshold, the short analysis window length W is appropriate, so the feature vector selection unit 106 selects the feature vector of analysis window length W as the optimal feature vector at the current time. In this way, the analysis window length best suited to vowels and consonants is selected automatically.
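The selection in Step 1-3 can be sketched as a per-frame choice between the two analyses. This is an illustrative sketch only: Equation 2 is not reproduced in this text, so cosine similarity is used as a stand-in measure, and the names `cosine_similarity` and `select_optimal_vectors` are hypothetical. The text does not say what happens when the value equals the threshold; this sketch falls back to the short window in that case.

```python
import math


def cosine_similarity(a, b):
    """Stand-in similarity measure (an assumption; Equation 2 is not shown)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def select_optimal_vectors(short_seq, long_seq, threshold, measure=cosine_similarity):
    """For each frame, keep the long-window vector when both analyses agree.

    short_seq: feature vectors computed with window length W
    long_seq:  feature vectors computed with window length 2W at the same times
    """
    if len(short_seq) != len(long_seq):
        raise ValueError("both sequences must cover the same frames")
    optimal = []
    for short_vec, long_vec in zip(short_seq, long_seq):
        value = measure(short_vec, long_vec)
        # Large value: stationary (vowel-like) frame, so use the 2W vector.
        # Otherwise: changing (consonant-like) frame, so use the W vector.
        optimal.append(long_vec if value > threshold else short_vec)
    return optimal
```

Because both analyses are computed for every frame anyway, the selection adds only one comparison per frame, which matches the stated goal of not increasing recognition time significantly.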

【００２５】（ステップ１−４)以上のようにして求ま
った最適特徴ベクトル系列ＣとＨＭＭ記憶部１０８に記
憶されている各認識対象語彙の隠れマルコフモデルλ
(w) との尤度Ｌ(w) を数３に従ってＨＭＭ認識部１０７
で算出する。(Step 1-4) The HMM recognition unit 107 calculates, according to Equation 3, the likelihood L(w) between the optimal feature vector sequence C obtained as described above and the hidden Markov model λ(w) of each recognition target vocabulary stored in the HMM storage unit 108.

【数３】 (Equation 3)

【００２６】（ステップ１−５)各認識対象語彙の尤度
Ｌ(w) が算出されたら、認識結果判定部１０９では数４
に従って認識結果ｙを決定する。(Step 1-5) Once the likelihood L(w) of each recognition target vocabulary has been calculated, the recognition result determination unit 109 determines the recognition result y according to Equation 4.
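Equation 4 is not reproduced in this text. Elsewhere the description states that the entry with the highest likelihood is taken as the recognition result, so a minimal sketch, assuming the decision rule is a plain argmax over the vocabulary, could look like this. The function name and the example words and scores are hypothetical.

```python
def decide_recognition_result(likelihoods):
    """Return the vocabulary entry w whose model gives the highest likelihood L(w).

    likelihoods: dict mapping each vocabulary entry to its (log-)likelihood.
    """
    if not likelihoods:
        raise ValueError("no candidate vocabulary entries")
    return max(likelihoods, key=likelihoods.get)


# Hypothetical log-likelihoods for three vocabulary entries.
scores = {"hai": -120.5, "iie": -98.2, "stop": -143.0}
result = decide_recognition_result(scores)
```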

【数４】以上のような（ステップ１−１）から（ステップ１−
５）により、分析窓長を可変にして認識処理時間を大幅
に増加させることなく認識精度の高い音声認識を実現す
ることができる。(Equation 4) Through (Step 1-1) to (Step 1-5) above, the analysis window length is made variable, and speech recognition with high recognition accuracy can be achieved without significantly increasing the recognition processing time.

【００２７】尚この実施の形態では音声分析部が２つの
場合について説明しているが、更に多数の音声分析部を
用いて相関値又は類似度に対応させてその音声分析部か
らの特徴ベクトル系列を選択するようにしてもよい。Although this embodiment describes the case of two speech analysis units, a larger number of speech analysis units may be used, with the feature vector sequence of one of them selected according to the correlation value or similarity.

【００２８】次に、図２は本発明の第２の実施の形態に
おける音声認識装置の構成を示すブロック図である。本
実施の形態ではバッファに記憶する過去の特徴ベクトル
が1時刻前の１つのみの場合を説明する。図２におい
て、マイクロフォン２００は音声を入力するものであ
り、入力された音声（以降、入力音声と呼ぶ）は低域通
過フィルタ２０１に入力される。低域通過フィルタ２０
１はサンプリング周波数の１／２以上の周波数成分をカ
ットするものであり、その出力はＡ／Ｄ変換部２０２に
与えられる。Ａ／Ｄ変換部２０２は入力信号を予め定め
られたサンプリング周波数（例えば8kHz）でサンプリン
グしてディジタル信号に変換するものである。マイクロ
フォン２００，低域通過フィルタ２０１，Ａ／Ｄ変換部
２０２は、音声を入力する音声入力手段を構成してい
る。第１の音声分析部２０３はこのディジタル信号を予
め定められた分析窓長によって特徴ベクトル（ケプスト
ラム）に変換するものである。バッファ２０４は１時刻
前の特徴ベクトルを記憶するものである。特徴ベクトル
間相関算出部２０５は音声分析部２０３の算出結果であ
る特徴ベクトルとバッファ２０４に記憶されている１時
刻前の特徴ベクトルから、この２つのベクトル間の相関
値又は類似度を算出するものである。分析窓長設定部２
０６はその相関値又は類似度が予め定めた第１の閾値に
比べ大きい場合には、音声分析部２０３の予め定められ
た分析窓長よりも長い分析窓長に設定し、第１の閾値よ
り小さい予め定めた第２の閾値に比べて小さい場合には
音声分析部２０３の予め定められた分析窓長よりも短い
分析窓長に設定するものである。そして第２の音声分析
部２０７はその分析窓長（以降、最適分析窓長と呼ぶ)
にて再度Ａ／Ｄ変換部２０２の出力であるディジタル信
号からの音声分析を行い、その特徴ベクトル( 以降、最
適特徴ベクトルと呼ぶ) を出力するものである。この最
適特徴ベクトルの系列に対して、ＨＭＭ認識部２０８に
て、ＨＭＭ格納部２０９に記憶されている各認識対象語
彙のモデルとの尤度を計算し、認識結果判定部２１０に
て最も尤度の高いものを認識結果として判定する。Next, FIG. 2 is a block diagram showing the configuration of a speech recognition device according to a second embodiment of the present invention. This embodiment describes the case where the buffer stores only one past feature vector, namely the one from one time step earlier. In FIG. 2, a microphone 200 receives speech, and the input speech (hereinafter, input speech) is fed to a low-pass filter 201. The low-pass filter 201 cuts frequency components at or above 1/2 of the sampling frequency, and its output is supplied to an A/D conversion unit 202. The A/D conversion unit 202 samples the input signal at a predetermined sampling frequency (for example, 8 kHz) and converts it into a digital signal. The microphone 200, the low-pass filter 201, and the A/D conversion unit 202 constitute speech input means. A first speech analysis unit 203 converts this digital signal into a feature vector (cepstrum) using a predetermined analysis window length. A buffer 204 stores the feature vector from one time step earlier. A feature vector correlation calculation unit 205 calculates the correlation value or similarity between the feature vector output by the speech analysis unit 203 and the feature vector from one time step earlier stored in the buffer 204. An analysis window length setting unit 206 sets an analysis window length longer than the predetermined analysis window length of the speech analysis unit 203 when the correlation value or similarity is larger than a predetermined first threshold, and sets an analysis window length shorter than the predetermined one when the correlation value or similarity is smaller than a predetermined second threshold that is itself smaller than the first threshold. A second speech analysis unit 207 then analyzes the digital signal output by the A/D conversion unit 202 again using that analysis window length (hereinafter, the optimal analysis window length) and outputs the resulting feature vector (hereinafter, the optimal feature vector). For the sequence of optimal feature vectors, an HMM recognition unit 208 calculates the likelihood against the model of each recognition target vocabulary stored in an HMM storage unit 209, and a recognition result determination unit 210 determines the entry with the highest likelihood as the recognition result.

【００２９】以上のように構成された本発明の音声認識
装置について、相関値又は類似度が大きい場合と小さい
場合とに分けてその動作を説明する。但し、音声分析部
２０３の分析窓長をＷとして説明する。（ステップ２−１）Ａ／Ｄ変換部２０２でサンプリング
されたディジタル信号Ｕ＝｛ｕ₁,ｕ₂,ｕ₃,・・・，ｕ
_T ｝を、分析窓長Ｗの音声分析部２０３により特徴ベク
トルａ_i に変換する。又、バッファ２０４には１時刻前
の特徴ベクトル列ａ_i-1 が記憶されている。The operation of the speech recognition device of the present invention configured as described above will be described separately for the case where the correlation value or similarity is large and the case where it is small. In this description, the analysis window length of the speech analysis unit 203 is W. (Step 2-1) The digital signal U = {u_1, u_2, u_3, ..., u_T} sampled by the A/D conversion unit 202 is converted into a feature vector a_i by the speech analysis unit 203 with analysis window length W. The buffer 204 stores the feature vector a_{i-1} from one time step earlier.

【００３０】（ステップ２−２）特徴ベクトルａ_i と１
時刻前の特徴ベクトルａ_i-1 の特徴ベクトル間の相関値
ｒ(ａ_i , ａ_i-1 ) を例えば、数５に従って算出する。(Step 2-2) The correlation value r(a_i, a_{i-1}) between the feature vector a_i and the feature vector a_{i-1} from one time step earlier is calculated according to, for example, Equation 5.

【数５】但し、ｓ（ａ，ａ）はａ，ａの分散を示す。又、類似度
ｈ（ａ_i , ａ_i-1 ) の場合は数６に従って算出する。(Equation 5) Here, s(a, a) denotes the variance (covariance) of the two vectors. When the similarity h(a_i, a_{i-1}) is used instead, it is calculated according to Equation 6.

【数６】この相関値又は類似度の大小は以下のような意味を持つ
と考える。相関値又は類似度が大きいということは、隣
り合う特徴ベクトルの相関があり、周波数的な変化の少
ない定常部分（母音部分）であり、逆に、相関値又は類
似度が小さいということは、周波数的な変化の多い子音
部分といえる。(Equation 6) The magnitude of this correlation value or similarity is interpreted as follows. A large correlation value or similarity means that adjacent feature vectors are correlated, which indicates a stationary part with little change in frequency (a vowel part). Conversely, a small correlation value or similarity indicates a consonant part with much change in frequency.

【００３１】（ステップ２−３)ここで次式によって分
析窓長ＷをＷ’に変換する。ｒ（ａ_i , ａ_i-1 ) 又はｈ
（ａ_i , ａ_i-1 ) をｘとすると、ここで用いられる関数
ｆ(x) はｘに依存してＷへの倍数を決定する関数であ
る。(Step 2-3) Here, the analysis window length W is converted into W' by the following equation. Letting x denote r(a_i, a_{i-1}) or h(a_i, a_{i-1}), the function f(x) used here determines, depending on x, the multiplier applied to W.

【数７】例えばｆ(x) ＝（ｘ−Ｔｈ１）＋１（ｘ≧Ｔｈ１）ｆ(x) ＝ｅｘｐ（ｘ−Ｔｈ２）（ｘ≦Ｔｈ２）ｆ(x) ＝１（Ｔｈ２＜ｘ＜Ｔｈ１）のように相関値又は類似度が第１の閾値Ｔｈ１より大き
い場合は、分析窓長Ｗをより大きくする値を返し、逆に
第２の閾値Ｔｈ２より小さい場合は、分析窓長を小さく
する値を返す関数とする。又閾値Ｔｈ１と閾値Ｔｈ２の
間の値を取る場合は、１を返す。（相関値又は類似度が第１の閾値Ｔｈ１より大きい場
合）相関値ｒ（ａ_i , ａ_i-1 ) 又は類似度ｈ（ａ_i , ａ
_i-1 ) がＴｈ１より大きい場合は、分析窓長をＷより大
きな分析窓長Ｗ’を数７により算出し設定する。（相関値又は類似度が第２の閾値Ｔｈ２より小さい場
合)一方、相関値ｒ（ａ_i , ａ_i-1 ) 又は類似度ｈ（ａ_i
, ａ_i-1 ) が第２の閾値Ｔｈ２より小さい場合は、短
い分析窓長Ｗ’を数７を用いて算出し設定する。(Equation 7) For example, f(x) may be defined as
f(x) = (x - Th1) + 1 (x ≥ Th1)
f(x) = exp(x - Th2) (x ≤ Th2)
f(x) = 1 (Th2 < x < Th1)
That is, f(x) returns a value that lengthens the analysis window length W when the correlation value or similarity is larger than the first threshold Th1, returns a value that shortens the analysis window length when it is smaller than the second threshold Th2, and returns 1 when it lies between Th2 and Th1. (When the correlation value or similarity is larger than the first threshold Th1) If the correlation value r(a_i, a_{i-1}) or the similarity h(a_i, a_{i-1}) is larger than Th1, an analysis window length W' larger than W is calculated by Equation 7 and set. (When the correlation value or similarity is smaller than the second threshold Th2) On the other hand, if the correlation value r(a_i, a_{i-1}) or the similarity h(a_i, a_{i-1}) is smaller than the second threshold Th2, a shorter analysis window length W' is calculated by Equation 7 and set.
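The example function f(x) can be written out directly. The sketch below assumes that Equation 7, which is not reproduced in this text, has the form W' = f(x) * W, consistent with the statement that f(x) determines the multiplier applied to W. Rounding to a whole number of samples and the `min_len` floor are additional assumptions, and the function names are hypothetical.

```python
import math


def window_multiplier(x, th1, th2):
    """Example f(x): >= 1 above Th1, <= 1 below Th2, exactly 1 in between."""
    if th2 >= th1:
        raise ValueError("the second threshold Th2 must be smaller than Th1")
    if x >= th1:
        return (x - th1) + 1.0      # stationary frame: lengthen the window
    if x <= th2:
        return math.exp(x - th2)    # changing frame: shorten the window
    return 1.0                      # in between: keep the window as it is


def adapted_window_length(w, x, th1, th2, min_len=1):
    """Assumed form of Equation 7: W' = f(x) * W, rounded to whole samples."""
    return max(min_len, round(w * window_multiplier(x, th1, th2)))
```

For instance, with W = 256 samples, Th1 = 0.8 and Th2 = 0.3, a correlation of 0.9 gives a multiplier of about 1.1, while a correlation of -0.7 gives exp(-1), roughly 0.37, so the window shrinks to about a third.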

【００３２】（ステップ２−４)以上のようにして求ま
った第２の分析窓長Ｗ’をもって、再度第２の音声分析
部２０７にて特徴ベクトルを算出し、その特徴ベクトル
の系列ＤとＨＭＭ記憶部２０９に記憶されている各認識
対象語彙の隠れマルコフモデルλ(w) との尤度Ｌ’(w)
を数８に従ってＨＭＭ認識部２０８で算出する。(Step 2-4) Using the second analysis window length W' obtained as described above, the second speech analysis unit 207 calculates the feature vector again, and the HMM recognition unit 208 calculates, according to Equation 8, the likelihood L'(w) between the resulting feature vector sequence D and the hidden Markov model λ(w) of each recognition target vocabulary stored in the HMM storage unit 209.

【数８】 (Equation 8)

【００３３】（ステップ２−５)各認識対象語彙の尤度
Ｌ’(w) が算出されたら、認識結果判定部２１０では数
９に従って認識結果ｙ’を決定する。(Step 2-5) After the likelihood L '(w) of each vocabulary to be recognized has been calculated, the recognition result determination unit 210 determines the recognition result y' according to equation (9).

【数９】以上のように（ステップ２−１）から（ステップ２−
５）により分析窓長を可変にすることにより、認識処理
時間を大幅に増加させることなく認識精度の高い音声認
識方法を実現することができる。関数ｆ(x) はここで示
した関数に限定されるものでなく、このような傾向を持
った任意の関数を用いることができる。(Equation 9) By making the analysis window length variable through (Step 2-1) to (Step 2-5) above, a speech recognition method with high recognition accuracy can be realized without significantly increasing the recognition processing time. The function f(x) is not limited to the one shown here; any function with the same tendency can be used.

【００３４】図３は、本発明の第３の実施の形態におけ
る音声認識装置の構成を示すブロック図を示す。図３に
おいて、マイクロフォン３００は音声を入力するもので
あり、入力された音声（以降、入力音声と呼ぶ）は低域
通過フィルタ３０１に入力される。低域通過フィルタ３
０１はサンプリング周波数の１／２以上の周波数成分を
カットするものであり、その出力はＡ／Ｄ変換部３０２
に与えられる。Ａ／Ｄ変換部３０２は入力信号を予め定
められたサンプリング周波数（例えば8kHz）でサンプリ
ングして、ディジタル信号に変換するものである。マイ
クロフォン３００，低域通過フィルタ３０１，Ａ／Ｄ変
換部３０２は、音声を入力する音声入力手段を構成して
いる。第１の音声分析部３０３はこのディジタル信号を
予め定められた分析窓長によって特徴ベクトル（ケプス
トラム）に変換するものである。バッファ３０４には線
形回帰係数算出部３０５で線形回帰係数を算出するため
に、少なくとも２つ以上の特徴ベクトルを記憶してお
く。分析窓長設定部３０６は線形回帰係数算出部３０５
で算出された特徴ベクトルの線形回帰係数が予め定めた
第１の閾値Ｔｈ３に比べ小さい場合には音声分析部３０
３の予め定められた分析窓長Ｗよりも長い分析窓長Ｗ”
に設定し、又、予め定めた第２の閾値Ｔｈ４（＞Ｔｈ
３）に比べて大きい場合には音声分析部３０３の予め定
められた分析窓長よりも短い分析窓長Ｗ”に設定するも
のである。第２の音声分析部３０７はその分析窓長Ｗ”
にて再度Ａ／Ｄ変換部３０２の出力であるディジタル信
号からの音声分析を行うものである。そしてその特徴ベ
クトルの系列に対して、ＨＭＭ認識部３０８にて、ＨＭ
Ｍ格納部３０９に記憶されている各認識対象語彙のモデ
ルとの尤度を計算し、認識結果判定部３１０にて最も入
力音声と尤度の高いものを認識結果として判定する。FIG. 3 is a block diagram showing the configuration of a speech recognition device according to a third embodiment of the present invention. In FIG. 3, a microphone 300 receives speech, and the input speech (hereinafter, input speech) is fed to a low-pass filter 301. The low-pass filter 301 cuts frequency components at or above 1/2 of the sampling frequency, and its output is supplied to an A/D conversion unit 302. The A/D conversion unit 302 samples the input signal at a predetermined sampling frequency (for example, 8 kHz) and converts it into a digital signal. The microphone 300, the low-pass filter 301, and the A/D conversion unit 302 constitute speech input means. A first speech analysis unit 303 converts this digital signal into a feature vector (cepstrum) using a predetermined analysis window length. A buffer 304 stores at least two feature vectors so that a linear regression coefficient calculation unit 305 can calculate the linear regression coefficient. An analysis window length setting unit 306 sets an analysis window length W" longer than the predetermined analysis window length W of the speech analysis unit 303 when the linear regression coefficient of the feature vectors calculated by the linear regression coefficient calculation unit 305 is smaller than a predetermined first threshold Th3, and sets an analysis window length W" shorter than the predetermined one when the coefficient is larger than a predetermined second threshold Th4 (> Th3). A second speech analysis unit 307 analyzes the digital signal output by the A/D conversion unit 302 again using that analysis window length W". For the resulting feature vector sequence, an HMM recognition unit 308 calculates the likelihood against the model of each recognition target vocabulary stored in an HMM storage unit 309, and a recognition result determination unit 310 determines the entry with the highest likelihood for the input speech as the recognition result.

【００３５】以上のように構成された本発明の音声認識
装置について、その動作を説明する。但し、音声分析部
３０３の分析窓長をＷとして説明する。（ステップ３−１）Ａ／Ｄ変換部３０２でサンプリング
されたディジタル信号Ｕ＝｛ｕ₁,ｕ₂,ｕ₃,・・・，ｘ
_T ｝を、分析窓長Ｗの音声分析部３０３により特徴ベク
トルａ_i に変換する。又バッファ３０４には少なくとも
２つ以上の特徴ベクトルが記憶されている。The operation of the speech recognition device of the present invention configured as described above will now be described. In this description, the analysis window length of the speech analysis unit 303 is W. (Step 3-1) The digital signal U = {u_1, u_2, u_3, ..., u_T} sampled by the A/D conversion unit 302 is converted into a feature vector a_i by the speech analysis unit 303 with analysis window length W. The buffer 304 stores at least two feature vectors.

【００３６】（ステップ３−２）特徴ベクトルの線形回
帰係数Δａは例えば数１０で算出される。(Step 3-2) The linear regression coefficient Δa of the feature vector is calculated by, for example, Expression 10.

【数１０】この線形回帰係数の大小は以下のような意味を持つと考
える。線形回帰係数が予め定めた第１の閾値Ｔｈ３より
小さいということは、近隣の特徴ベクトルの変動が少な
く、周波数的な変化の少ない定常部分（母音部分）であ
り、逆に、線形回帰係数が予め定めた第２の閾値Ｔｈ４
より大きいということは、近隣の特徴ベクトルの変動が
激しく、周波数的な変化の多い子音部分といえる。(Equation 10) The magnitude of this linear regression coefficient is interpreted as follows. A linear regression coefficient smaller than the predetermined first threshold Th3 means that neighboring feature vectors vary little, which indicates a stationary part with little change in frequency (a vowel part). Conversely, a linear regression coefficient larger than the predetermined second threshold Th4 means that neighboring feature vectors vary strongly, which indicates a consonant part with much change in frequency.

【００３７】（ステップ３−３)以下の数１１を用いて
分析窓長ＷをＷ”に変換する。(Step 3-3) The analysis window length W is converted to W ″ using the following equation (11).

【数１１】ｇ（Δａ）はΔａに依存してＷへの倍数を決定する関数
である。例えばｇ（Δａ）＝（Ｔｈ３−Δａ）＋１（Δａ≦Ｔｈ３）ｇ（Δａ）＝ｅｘｐ（Ｔｈ４−Δａ）（Δａ≧Ｔｈ４）ｇ（Δａ）＝１（Ｔｈ３＜Δａ＜Ｔｈ４）のように線形回帰係数Δａが閾値Ｔｈ３より小さい場合
は、分析窓長Ｗをより大きくする値を返し、逆に閾値Ｔ
ｈ４より大きい場合は、分析窓長を小さくする値を返す
関数とする。又閾値Ｔｈ３と閾値Ｔｈ４の間の値を取る
場合は、１を返す。（線形回帰係数が予め定めた第１の閾値Ｔｈ３より小さ
い場合）線形回帰係数Δａが小さい場合は、分析窓長を
Ｗより大きな分析窓長Ｗ”を数１１により算出し設定す
る。（線形回帰係数が予め定めた第２の閾値Ｔｈ４より大き
い場合)一方、線形回帰係数Δａが予め定めた第２の閾
値Ｔｈ４より大きい場合は、小さい分析窓長Ｗ”を数１
１を用いて算出し設定する。[Equation 11] g(Δa) is a function that determines, depending on Δa, the multiplier applied to W. For example, g(Δa) may be defined as
g(Δa) = (Th3 - Δa) + 1 (Δa ≤ Th3)
g(Δa) = exp(Th4 - Δa) (Δa ≥ Th4)
g(Δa) = 1 (Th3 < Δa < Th4)
That is, g(Δa) returns a value that lengthens the analysis window length W when the linear regression coefficient Δa is smaller than the threshold Th3, returns a value that shortens the analysis window length when Δa is larger than the threshold Th4, and returns 1 when Δa lies between Th3 and Th4. (When the linear regression coefficient is smaller than the predetermined first threshold Th3) If the linear regression coefficient Δa is small, an analysis window length W" larger than W is calculated by Equation 11 and set. (When the linear regression coefficient is larger than the predetermined second threshold Th4) On the other hand, if the linear regression coefficient Δa is larger than the predetermined second threshold Th4, a smaller analysis window length W" is calculated by Equation 11 and set.
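The example function g(Δa) mirrors f(x) with the direction reversed, since a small regression coefficient indicates a stationary frame. The sketch below treats Δa as an already computed scalar (Equation 10 is not reproduced in this text) and assumes that Equation 11 has the form W" = g(Δa) * W. The function names and the rounding to whole samples are hypothetical.

```python
import math


def regression_window_multiplier(delta_a, th3, th4):
    """Example g(Δa): >= 1 below Th3, <= 1 above Th4, exactly 1 in between."""
    if th3 >= th4:
        raise ValueError("the first threshold Th3 must be smaller than Th4")
    if delta_a <= th3:
        return (th3 - delta_a) + 1.0    # little variation: lengthen the window
    if delta_a >= th4:
        return math.exp(th4 - delta_a)  # strong variation: shorten the window
    return 1.0                          # in between: keep the window as it is


def regression_adapted_window(w, delta_a, th3, th4, min_len=1):
    """Assumed form of Equation 11: W'' = g(Δa) * W, rounded to whole samples."""
    return max(min_len, round(w * regression_window_multiplier(delta_a, th3, th4)))
```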

【００３８】（ステップ３−４)以上のようにして求ま
った第２の分析窓長Ｗ”をもって、再度第２の音声分析
部３０７にて特徴ベクトルを算出し、その特徴ベクトル
の系列ＥとＨＭＭ記憶部３０９に記憶されている各認識
対象語彙の隠れマルコフモデルλ(w) との尤度Ｌ”(w)
を数１２に従ってＨＭＭ認識部３０８で算出する。(Step 3-4) Using the second analysis window length W" obtained as described above, the second speech analysis unit 307 calculates the feature vector again, and the HMM recognition unit 308 calculates, according to Equation 12, the likelihood L"(w) between the resulting feature vector sequence E and the hidden Markov model λ(w) of each recognition target vocabulary stored in the HMM storage unit 309.

【数１２】 (Equation 12)

【００３９】（ステップ３−５)各認識対象語彙の尤度
Ｌ" (w) が算出されたら、認識結果判定部３１０では数
１３に従って認識結果ｙ" を決定する。(Step 3-5) When the likelihood L ″ (w) of each vocabulary to be recognized is calculated, the recognition result determination unit 310 determines the recognition result y ″ according to Expression 13.

【数１３】以上のような（ステップ３−１）から（ステップ３−
５）に示すように、分析窓長を可変にして認識処理時間
を大幅に増加させることなく認識精度の高い音声認識方
法を実現することができる。尚関数ｇ（Δａ）はここで
示した関数に限定されるものでなく、このような傾向を
持った任意の関数を用いることができる。(Equation 13) As shown in (Step 3-1) to (Step 3-5) above, by making the analysis window length variable, a speech recognition method with high recognition accuracy can be realized without significantly increasing the recognition processing time. The function g(Δa) is not limited to the one shown here; any function with the same tendency can be used.

【００４０】図４は、本発明の第４の実施の形態におけ
る音声認識装置の構成を示すブロック図を示す。本実施
の形態ではバッファに記憶する過去の特徴ベクトルが１
時刻前の１つのみの場合を説明する。図４において、マ
イクロフォン４００は音声を入力するものであり、入力
された音声（以降、入力音声と呼ぶ）は低域通過フィル
タ４０１に入力される。低域通過フィルタ４０１はサン
プリング周波数の１／２以上の周波数成分をカットする
ものであり、その出力はＡ／Ｄ変換部４０２に与えられ
る。Ａ／Ｄ変換部４０２は入力信号を予め定められたサ
ンプリング周波数（例えば8kHz）でサンプリングして、
ディジタル信号に変換するものである。マイクロフォン
４００，低域通過フィルタ４０１，Ａ／Ｄ変換部４０２
は、音声を入力する音声入力手段を構成している。第１
の音声分析部４０３はこのディジタル信号を予め定めら
れた分析窓長によって特徴ベクトル（ケプストラム）に
変換するものである。バッファ４０４には１時刻前の特
徴ベクトルが記憶されており、特徴ベクトル間相関算出
部４０５は音声分析部４０３の算出結果である特徴ベク
トルとバッファ４０４に記憶されている１時刻前の特徴
ベクトルからこの２つのベクトル間の相関値又は類似度
を算出するものである。音声強調係数算出部４０６はそ
の相関値又は類似度が、予め定めた閾値に比べて小さい
場合のみ特徴ベクトルを強調すべく強調係数を設定する
ものである。音声強調部４０７で特徴ベクトルを強調
し、その特徴ベクトルの系列に対して、ＨＭＭ認識部４
０８にて、ＨＭＭ格納部４０９に記憶されている各認識
対象語彙のモデルとの尤度を計算し、認識結果判定部２
１０にて最も入力音声と尤度の高いものを認識結果とし
て判定する。FIG. 4 is a block diagram showing the configuration of a speech recognition device according to a fourth embodiment of the present invention. This embodiment describes the case where the buffer stores only one past feature vector, namely the one from one time step earlier. In FIG. 4, a microphone 400 receives speech, and the input speech (hereinafter, input speech) is fed to a low-pass filter 401. The low-pass filter 401 cuts frequency components at or above 1/2 of the sampling frequency, and its output is supplied to an A/D conversion unit 402. The A/D conversion unit 402 samples the input signal at a predetermined sampling frequency (for example, 8 kHz) and converts it into a digital signal. The microphone 400, the low-pass filter 401, and the A/D conversion unit 402 constitute speech input means. A first speech analysis unit 403 converts this digital signal into a feature vector (cepstrum) using a predetermined analysis window length. A buffer 404 stores the feature vector from one time step earlier, and a feature vector correlation calculation unit 405 calculates the correlation value or similarity between the feature vector output by the speech analysis unit 403 and the feature vector from one time step earlier stored in the buffer 404. A speech emphasis coefficient calculation unit 406 sets an emphasis coefficient for emphasizing the feature vector only when the correlation value or similarity is smaller than a predetermined threshold. A speech emphasis unit 407 emphasizes the feature vector. For the resulting feature vector sequence, an HMM recognition unit 408 calculates the likelihood against the model of each recognition target vocabulary stored in an HMM storage unit 409, and a recognition result determination unit 410 determines the entry with the highest likelihood for the input speech as the recognition result.

【００４１】以上のように構成された本発明の音声認識
装置について、その動作を説明する。（ステップ４−１）Ａ／Ｄ変換部４０２でサンプリング
されたディジタル信号Ｕ＝｛ｕ₁,ｕ₂,ｕ₃,・・・，ｕ
_T ｝を、分析窓長Ｗの音声分析部４０３により特徴ベク
トルａ_i に変換する。又、バッファ４０４には１時刻前
の特徴ベクトル列ａ_i-1 が記憶されている。The operation of the speech recognition device of the present invention configured as described above will now be described. (Step 4-1) The digital signal U = {u_1, u_2, u_3, ..., u_T} sampled by the A/D conversion unit 402 is converted into a feature vector a_i by the speech analysis unit 403 with analysis window length W. The buffer 404 stores the feature vector a_{i-1} from one time step earlier.

【００４２】（ステップ４−２）特徴ベクトルａ_i と１
時刻前の特徴ベクトルａ_i-1 の特徴ベクトル間の相関値
ｒ（ａ_i , ａ_i-1 ) を例えば、数１４に従って算出す
る。(Step 4-2) The correlation value r(a_i, a_{i-1}) between the feature vector a_i and the feature vector a_{i-1} from one time step earlier is calculated according to, for example, Equation 14.

【数１４】但し、ｓ（ａ, ａ) はａ, ａの分散を示す。又、類似度
ｈ（ａ_i , ａ_i-1 ) の場合は数１５に従って算出する。[Equation 14] Here, s(a, a) denotes the variance (covariance) of the two vectors. When the similarity h(a_i, a_{i-1}) is used instead, it is calculated according to Equation 15.

【数１５】この相関値又は類似度の大小は以下のような意味を持つ
と考える。相関が大きい場合は周波数的変化の少ない母
音部分と考えられ強調する必要はないが、一方、相関値
又は類似度が小さいということは、周波数的な変化の多
い子音部分と考えられるため、強調することにより認識
性能を向上できる。(Equation 15) The magnitude of this correlation value or similarity is interpreted as follows. A large correlation indicates a vowel part with little change in frequency, which does not need to be emphasized. A small correlation value or similarity, on the other hand, indicates a consonant part with much change in frequency, so emphasizing it can improve recognition performance.

【００４３】（ステップ４−３)相関値ｒ（ａ_i , ａ_i-1
) 又は類似度ｈ（ａ_i , ａ_i-1 ) が小さい場合にの
み、音声強調係数Ｚを数１６を用いて算出し設定する。(Step 4-3) Only when the correlation value r(a_i, a_{i-1}) or the similarity h(a_i, a_{i-1}) is small, the speech emphasis coefficient Z is calculated using Equation 16 and set.

【数１６】ｒ（ａ_i , ａ_i-1 ) 又はｈ（ａ_i , ａ_i-1 ) をｘとする
と、ｅ(x) はｘに依存して音声強調係数を算出する関数
である。例えばｅ(x) ＝（ｘ−Ｔｈ）＋１（ｘ≦Ｔｈ）ｅ(x) ＝１（Ｔｈ＜ｘ）のように相関値又は類似度が閾値Ｔｈより小さい場合
は、各フレームの特徴ベクトルを強調する値を返す閾値
とする。又相関値又は類似度が閾値Ｔｈ以上を取る場合
は、１を返す。算出した音声強調係数Ｚを用いて音声強
調を以下のように行う。ａ_i ＝ａ_i ＋（ａ_i −ａ_i-1 ）＊Ｚのようにａ_i がａ_i-1 に比較して変化した方向と変化の
大きさに応じた強調を行う。(Equation 16) Letting x denote r(a_i, a_{i-1}) or h(a_i, a_{i-1}), e(x) is a function that calculates the speech emphasis coefficient depending on x. For example, e(x) may be defined as
e(x) = (x - Th) + 1 (x ≤ Th)
e(x) = 1 (Th < x)
That is, e(x) is a function that returns a value for emphasizing the feature vector of each frame when the correlation value or similarity is smaller than the threshold Th, and returns 1 when the correlation value or similarity is larger than Th. Using the calculated speech emphasis coefficient Z, speech emphasis is performed as follows:
a_i = a_i + (a_i - a_{i-1}) * Z
In this way, a_i is emphasized according to the direction and magnitude of its change relative to a_{i-1}.
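The emphasis step can be sketched as follows, using the example e(x) and the update rule a_i = a_i + (a_i - a_{i-1}) * Z from the text. The function names are hypothetical. Since Step 4-3 states that the coefficient is set only when the correlation value or similarity is small, this sketch leaves the frame unchanged when x exceeds Th; that is one reading of the passage, not a confirmed detail.

```python
def emphasis_coefficient(x, th):
    """Example e(x): (x - Th) + 1 for x <= Th, otherwise 1."""
    if x <= th:
        return (x - th) + 1.0
    return 1.0


def emphasize(current, previous, z):
    """Apply a_i <- a_i + (a_i - a_{i-1}) * Z element-wise."""
    if len(current) != len(previous):
        raise ValueError("feature vectors must have the same dimension")
    return [c + (c - p) * z for c, p in zip(current, previous)]


def emphasize_frame(current, previous, x, th):
    """Emphasize only low-correlation (consonant-like) frames.

    Assumption: frames with x > Th are passed through unchanged.
    """
    if x > th:
        return list(current)
    return emphasize(current, previous, emphasis_coefficient(x, th))
```

The update pushes each component further along the direction in which it moved since the previous frame, which exaggerates the rapid spectral change typical of consonants.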

【００４４】(ステップ４−４)以上のようにして求まっ
た音声強調係数をもって、音声分析部４０３の出力であ
る特徴ベクトルを強調し、その特徴ベクトルの系列Ｆと
ＨＭＭ記憶部４０９に記憶されている各認識対象語彙の
隠れマルコフモデルλ(w) との尤度Ｌ'''(w)を数１７に
従ってＨＭＭ認識部４０８で算出する。(Step 4-4) The feature vector output from the speech analysis unit 403 is enhanced using the speech enhancement coefficient obtained as described above, and the sequence F of the feature vector is stored in the HMM storage unit 409. The likelihood L ′ ″ (w) of each vocabulary to be recognized with the hidden Markov model λ (w) is calculated by the HMM recognition unit 408 according to Expression 17.

【数１７】 [Equation 17]

【００４５】（ステップ４−５)各認識対象語彙の尤度
Ｌ'''(w)が算出されたら、認識結果判定部４１０では数
１8 に従って認識結果ｙ''' を決定する。(Step 4-5) After the likelihood L ′ ″ (w) of each vocabulary to be recognized has been calculated, the recognition result determining unit 410 determines the recognition result y ′ ″ according to equation (18).

【数１８】以上のように（ステップ４−１）から（ステップ４−
５）により、分析窓長を可変にして認識処理時間を大幅
に増加させることなく認識精度の高い音声認識方法を実
現することができる。尚関数ｅ(x) はここで示した関数
に限定されるものでなく、このような傾向を持った任意
の関数を用いることができる。(Equation 18) Through (Step 4-1) to (Step 4-5) above, the analysis window length is made variable, and a speech recognition method with high recognition accuracy can be realized without significantly increasing the recognition processing time. The function e(x) is not limited to the one shown here; any function with the same tendency can be used.

【００４６】次に、図５は本発明の第５の実施の形態に
おける音声認識装置の構成を示すブロック図である。本
実施の形態ではピッチ抽出部が２つの場合について説明
する。図５において、マイクロフォン５００は音声を入
力するものであり、入力された音声（以降、入力音声と
呼ぶ）は低域通過フィルタ５０１に入力される。低域通
過フィルタ５０１はサンプリング周波数の１／２以上の
周波数成分をカットするものであり、その出力はＡ／Ｄ
変換部５０２に与えられる。Ａ／Ｄ変換部５０２は入力
信号を予め定められたサンプリング周波数（例えば８ｋ
Ｈｚ）でサンプリングして、ディジタル信号に変換する
ものである。マイクロフォン５００，低域通過フィルタ
５０１，Ａ／Ｄ変換部５０２は、音声を入力する音声入
力手段を構成している。第１のピッチ抽出部５０３はこ
のディジタル信号を予め定められた分析窓長１によって
第１の音声のピッチ周波数（基本周波数ともいう）に変
換するものである。ピッチ周波数の抽出には、ケプスト
ラム法や偏自己相関法などが一般的によく用いられる。
又第２のピッチ抽出部５０４はこのディジタル信号を予
め定められた分析窓長２によって第２の音声のピッチ周
波数（基本周波数）に変換するものである。ピッチ間相
関算出部５０５は第１，第２のピッチ周波数間の相関値
又は類似度を算出するものである。分析窓長設定部５０
６はその相関値又は類似度を元に２つの分析窓長のうち
最適な分析窓長を選択し、そして音声分析部５０７はそ
の分析窓長（以降、最適分析窓長と呼ぶ）にてＡ／Ｄ変
換部５０２の出力であるディジタル信号からの音声分析
を行い、その特徴ベクトル（以降、最適特徴ベクトルと
呼ぶ）を出力するものである。ＨＭＭ記憶部５０９は認
識の対象となる語彙に対するモデルを記憶する標準パタ
ン記憶手段であり、ＨＭＭ認識部５０８はＨＭＭ記憶部
５０９に記憶されている各認識対象語彙のモデルと最適
特徴ベクトルの系列に対して尤度を計算する認識手段で
ある。認識結果判定部５１０は最も尤度の高いものを認
識結果として判定する認識結果判定手段である。Next, FIG. 5 is a block diagram showing the configuration of a speech recognition device according to a fifth embodiment of the present invention. This embodiment describes the case of two pitch extraction units. In FIG. 5, a microphone 500 receives speech, and the input speech (hereinafter, input speech) is fed to a low-pass filter 501. The low-pass filter 501 cuts frequency components at or above 1/2 of the sampling frequency, and its output is supplied to an A/D conversion unit 502. The A/D conversion unit 502 samples the input signal at a predetermined sampling frequency (for example, 8 kHz) and converts it into a digital signal. The microphone 500, the low-pass filter 501, and the A/D conversion unit 502 constitute speech input means. A first pitch extraction unit 503 converts this digital signal into a first pitch frequency (also called the fundamental frequency) of the speech using a predetermined analysis window length 1. The cepstrum method and the partial autocorrelation method are commonly used for pitch frequency extraction. A second pitch extraction unit 504 converts this digital signal into a second pitch frequency (fundamental frequency) of the speech using a predetermined analysis window length 2. An inter-pitch correlation calculation unit 505 calculates the correlation value or similarity between the first and second pitch frequencies. An analysis window length setting unit 506 selects the optimal one of the two analysis window lengths based on that correlation value or similarity, and a speech analysis unit 507 analyzes the digital signal output by the A/D conversion unit 502 using that analysis window length (hereinafter, the optimal analysis window length) and outputs the resulting feature vector (hereinafter, the optimal feature vector). An HMM storage unit 509 is standard pattern storage means that stores a model for each vocabulary entry to be recognized, and an HMM recognition unit 508 is recognition means that calculates the likelihood between the model of each recognition target vocabulary stored in the HMM storage unit 509 and the sequence of optimal feature vectors. A recognition result determination unit 510 is recognition result determination means that determines the entry with the highest likelihood as the recognition result.

【００４７】以上のように構成された本発明の音声認識
装置について、最適な分析窓長選択の判断を相関値又は
類似度の大小と考えた場合を例に、相関値又は類似度が
大きい場合と小さい場合とに分けてその動作を説明す
る。但し、ピッチ抽出部５０３，５０４の分析窓長１，
２をここではＷとその２倍の２Ｗとして説明する。（ステップ５−１）Ａ／Ｄ変換部５０２でサンプリング
されたディジタル信号Ｕ＝｛ｕ₁,ｕ₂,ｕ₃,・・・，ｕ
_T ｝を、分析窓長Ｗの第１のピッチ抽出部５０３により
第１のピッチ周波数の系列Ｃ＝｛ｃ₁,ｃ₂,・・・，ｃ
_i ，・・・，ｃ_I ｝に変換する。又、分析窓長２Ｗ（Ｗ
の２倍）の第２のピッチ抽出部５０４により第２のピッ
チ周波数の系列Ｄ＝｛ｄ₁,ｄ₂,・・・，ｄ_j ，・・・ｄ
_J ｝に変換する。The operation of the speech recognition device of the present invention configured as described above will be described separately for the case where the correlation value or similarity is large and the case where it is small, taking as an example the case where the optimal analysis window length is chosen according to the magnitude of the correlation value or similarity. In this description, analysis window lengths 1 and 2 of the pitch extraction units 503 and 504 are W and its double, 2W. (Step 5-1) The digital signal U = {u_1, u_2, u_3, ..., u_T} sampled by the A/D conversion unit 502 is converted by the first pitch extraction unit 503 with analysis window length W into a first pitch frequency sequence C = {c_1, c_2, ..., c_i, ..., c_I}. It is also converted by the second pitch extraction unit 504 with analysis window length 2W (twice W) into a second pitch frequency sequence D = {d_1, d_2, ..., d_j, ..., d_J}.

【００４８】（ステップ５−２）第１のピッチ周波数系
列Ｃと第２のピッチ周波数系列Ｄの同時刻のピッチ周波
数間の相関値ｒ（ｃ_i ，ｄ_j ) を例えば、数１９に従っ
て算出する。[0048] (Step 5-2) The correlation value r(c_i, d_j) between the pitch frequencies of the first pitch frequency sequence C and the second pitch frequency sequence D at the same time is calculated according to, for example, Equation 19.

【数１９】但し、ｓ（ｃ，ｄ）はｃ，ｄの分散を示す。又、類似度
ｈ（ｃ_i , ｄ_j ) の場合は数２０に従って算出する。[Equation 19] Here, s(c, d) denotes the variance (covariance) of c and d. When the similarity h(c_i, d_j) is used instead, it is calculated according to Equation 20.

【数２０】この相関値又は類似度の大小は以下のような意味を持つ
と考える。相関値又は類似度が大きいということは、同
時刻の音声を長い分析窓長２Ｗ及び短い分析窓長Ｗで音
声分析しても同等のピッチ周波数を算出するため、周波
数的な変化の少ない定常部分（母音部分）であり、逆
に、相関値又は類似度が小さいということは、周波数的
な変化の多い子音部分ということができる。(Equation 20) The magnitude of this correlation value or similarity is interpreted as follows. A large correlation value or similarity means that analyzing the speech at the same time with the long analysis window length 2W and with the short analysis window length W yields equivalent pitch frequencies, which indicates a stationary part with little change in frequency (a vowel part). Conversely, a small correlation value or similarity indicates a consonant part with much change in frequency.

【００４９】(ステップ５−３) （相関値又は類似度が予め定めた閾値より大きい場合）
相関値ｒ（ｃ_i ，ｄ_j ) 又は類似度ｈ（ｃ_i , ｄ_j ) が
予め定めた閾値より大きい場合は、長い分析窓長２Ｗを
選択すればいいので分析窓長設定部５０６では分析窓長
２Ｗを最適窓長Ｗ''' と設定する。（相関値又は類似度が小さい場合)一方、相関値ｒ（ｃ_i
，ｄ_j ) 又は類似度ｈ（ｃ_i , ｄ_j ) が予め定めた閾
値より小さい場合は、短い分析窓長Ｗを選択すればいい
ので分析窓長設定部５０６では分析窓長Ｗを最適窓長
Ｗ''' と設定する。これにより自動的に母音・子音に最
適な分析窓長を選択していることになる。(Step 5-3) (When the correlation value or similarity is larger than a predetermined threshold)
If the correlation value r(c_i, d_j) or the similarity h(c_i, d_j) is larger than a predetermined threshold, the long analysis window length 2W is appropriate, so the analysis window length setting unit 506 sets the analysis window length 2W as the optimal window length W'''. (When the correlation value or similarity is small) On the other hand, if the correlation value r(c_i, d_j) or the similarity h(c_i, d_j) is smaller than the predetermined threshold, the short analysis window length W is appropriate, so the analysis window length setting unit 506 sets the analysis window length W as the optimal window length W'''. In this way, the analysis window length best suited to vowels and consonants is selected automatically.

[0050] (Step 5-4) Using the optimum analysis window length W''' obtained as described above, the speech analysis unit 507 calculates feature vectors, and the HMM recognition unit 508 calculates, according to Equation 21, the likelihood L''''(w) between the feature vector sequence G and the hidden Markov model λ(w) of each vocabulary item to be recognized stored in the HMM storage unit 509.

[Equation 21]

(Step 5-5) Once the likelihood L''''(w) of each vocabulary item to be recognized has been calculated, the recognition result determination unit 510 determines the recognition result y'''' according to Equation 22.
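Steps 5-4 and 5-5 amount to scoring the observation sequence against one hidden Markov model per vocabulary item and keeping the best-scoring item. Equations 21 and 22 are not reproduced in this text, so the following Python sketch is an assumption-laden illustration rather than the patent's formulas: it uses the standard forward algorithm on a toy discrete-observation HMM (a real recognizer would score continuous feature vectors), and the names `forward_likelihood`, `recognize`, and the two example models are hypothetical.

```python
def forward_likelihood(obs, initial, transition, emission):
    """Return P(obs | model) for a discrete HMM using the forward algorithm.

    obs        -- list of observation symbol indices
    initial    -- initial[s]: probability of starting in state s
    transition -- transition[p][s]: probability of moving from state p to s
    emission   -- emission[s][k]: probability of emitting symbol k in state s
    """
    n_states = len(initial)
    # Initialisation with the first observation.
    alpha = [initial[s] * emission[s][obs[0]] for s in range(n_states)]
    # Induction over the remaining observations.
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[p] * transition[p][s] for p in range(n_states))
            * emission[s][symbol]
            for s in range(n_states)
        ]
    return sum(alpha)


def recognize(obs, models):
    """Score obs against every word model and return (best_word, scores)."""
    scores = {word: forward_likelihood(obs, *params)
              for word, params in models.items()}
    return max(scores, key=scores.get), scores


# Two hypothetical two-state word models that differ only in their emissions.
start = [1.0, 0.0]
trans = [[0.5, 0.5], [0.0, 1.0]]
models = {
    "a": (start, trans, [[0.9, 0.1], [0.1, 0.9]]),
    "b": (start, trans, [[0.1, 0.9], [0.9, 0.1]]),
}
best_word, scores = recognize([0, 1], models)
```

In practice log-likelihoods are usually preferred over raw probabilities to avoid underflow on long sequences; the plain product form is kept here only for readability.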

[Equation 22]

By making the analysis window length variable through (Step 5-1) to (Step 5-5) as described above, a speech recognition method with high recognition accuracy can be realized without significantly increasing the recognition processing time.

[0051] Although this embodiment describes the case of two pitch extraction units, a larger number of pitch extraction units may be used, with the analysis window length chosen from among those of the pitch extraction units according to the correlation value or similarity.

[0052] Next, FIG. 6 is a block diagram showing the configuration of a speech recognition device according to a sixth embodiment of the present invention. This embodiment describes the case where the buffer stores only one past pitch frequency, namely that of one time step earlier. In FIG. 6, a microphone 600 receives speech, and the received speech (hereinafter, the input speech) is fed to a low-pass filter 601. The low-pass filter 601 removes frequency components at or above 1/2 of the sampling frequency, and its output is supplied to an A/D conversion unit 602. The A/D conversion unit 602 samples the input signal at a predetermined sampling frequency (for example, 8 kHz) and converts it into a digital signal. The microphone 600, the low-pass filter 601, and the A/D conversion unit 602 constitute speech input means for inputting speech. A pitch extraction unit 603 converts this digital signal into the pitch frequency (fundamental frequency) of the speech using a predetermined analysis window length. The cepstrum method, the partial autocorrelation method, and the like are commonly used for pitch extraction. A buffer 604 stores the pitch frequency of one time step earlier. An inter-pitch correlation calculation unit 605 calculates the correlation value or similarity between the pitch frequency output by the pitch extraction unit 603 and the pitch frequency of one time step earlier stored in the buffer 604. An analysis window length setting unit 606 sets an analysis window length longer than the predetermined analysis window length of the pitch extraction unit 603 when the correlation value or similarity is larger than a predetermined first threshold, and sets an analysis window length shorter than that predetermined analysis window length when the correlation value or similarity is smaller than a predetermined second threshold, which is itself smaller than the first threshold. A speech analysis unit 607 then analyzes the digital signal output by the A/D conversion unit 602 using that analysis window length (hereinafter, the optimum analysis window length) and outputs the resulting feature vector (hereinafter, the optimum feature vector). An HMM storage unit 609 is standard pattern storage means that stores a model for each vocabulary item to be recognized, and an HMM recognition unit 608 is recognition means that calculates the likelihood between the model of each vocabulary item stored in the HMM storage unit 609 and the sequence of optimum feature vectors. A recognition result determination unit 610 is recognition result determination means that determines the item with the highest likelihood to be the recognition result.

[0053] The operation of the speech recognition device of the present invention configured as described above is explained below, taking as an example the case where the optimum analysis window length is chosen according to the magnitude of the correlation value or similarity, and treating the large and small cases separately. The analysis window length of the pitch extraction unit 603 is denoted W.

(Step 6-1) The digital signal U = {u_1, u_2, u_3, ..., u_T} sampled by the A/D conversion unit 602 is converted into a pitch frequency c_i by the pitch extraction unit 603 with analysis window length W. The buffer 604 holds the pitch frequency sequence c_{i-1} of one time step earlier.
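To make the role of the window length W in Step 6-1 concrete, the following Python sketch estimates the pitch of a single analysis window. It deliberately swaps in a plain time-domain autocorrelation peak search, which is simpler than the cepstrum or partial autocorrelation methods named in the text. The function name `estimate_pitch`, the search range `f_min`/`f_max`, and the synthetic 200 Hz test tone are all assumptions made for illustration.

```python
import math


def estimate_pitch(frame, sample_rate, f_min, f_max):
    """Estimate the pitch (Hz) of one analysis window by autocorrelation.

    frame       -- list of samples; its length is the analysis window length W
    sample_rate -- sampling frequency in Hz
    f_min/f_max -- hypothetical pitch search range in Hz
    Returns 0.0 when no positive autocorrelation peak is found.
    """
    lag_min = max(1, int(sample_rate / f_max))
    lag_max = min(int(sample_rate / f_min), len(frame) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        overlap = len(frame) - lag
        # Mean product of the frame with itself shifted by `lag` samples.
        score = sum(frame[n] * frame[n + lag] for n in range(overlap)) / overlap
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


# Synthetic check: a 200 Hz sine sampled at 8 kHz, window of 400 samples.
sample_rate = 8000
window = 400
frame = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(window)]
pitch = estimate_pitch(frame, sample_rate, f_min=150, f_max=400)
```

A longer window gives more samples per lag and therefore a more stable estimate on stationary sounds, while a shorter window tracks rapid changes better, which is the trade-off the embodiment exploits.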

[0054] (Step 6-2) The correlation value r(c_i, c_{i-1}) between the pitch frequency sequence c_i and the pitch frequency c_{i-1} of one time step earlier is calculated, for example, according to Equation 23.

[Equation 23]

Here, s(c, c) denotes the variance of c and c. The similarity h(c_i, c_{i-1}) is calculated according to Equation 24.

[Equation 24]

The magnitude of this correlation value or similarity is interpreted as follows. A large correlation value or similarity means that adjacent pitch frequencies are correlated, so the segment is a stationary portion with little change in frequency (a vowel portion). Conversely, a small correlation value or similarity indicates a consonant portion with large changes in frequency.
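Equation 23 itself is not reproduced in this text. Since the surrounding description defines s(·, ·) as a variance term, one plausible reading is the usual normalized correlation r(c, d) = s(c, d) / sqrt(s(c, c) · s(d, d)). The Python sketch below implements that reading as an assumption, not as the patent's exact formula. The function names and the example pitch values are hypothetical, and returning 0.0 for a constant sequence is a choice made here because the text does not cover that case.

```python
import math


def covariance(c, d):
    """Population covariance s(c, d) of two equal-length sequences."""
    mean_c = sum(c) / len(c)
    mean_d = sum(d) / len(d)
    return sum((x - mean_c) * (y - mean_d) for x, y in zip(c, d)) / len(c)


def correlation(c, d):
    """Normalized correlation r(c, d) = s(c, d) / sqrt(s(c, c) * s(d, d)).

    Assumption: returns 0.0 when either sequence has zero variance.
    """
    denom = math.sqrt(covariance(c, c) * covariance(d, d))
    return covariance(c, d) / denom if denom > 0.0 else 0.0


# Hypothetical pitch tracks (Hz) for the current and the previous time step.
steady = correlation([100.0, 110.0, 120.0], [102.0, 111.0, 123.0])
opposed = correlation([100.0, 110.0, 120.0], [123.0, 111.0, 102.0])
```

Here `steady` comes out close to 1 (a vowel-like, stationary stretch) and `opposed` comes out close to -1, so only the first would exceed a high first threshold Th1.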

[0055] (Step 6-3) The analysis window length W is then converted by the following equation. Let x be the correlation value r(c_i, c_{i-1}) or the similarity h(c_i, c_{i-1}); the function f(x) used here determines, depending on x, the factor by which W is multiplied.

[Equation 25]

For example, f(x) may be defined as

f(x) = (x − Th1) + 1   (x ≥ Th1)
f(x) = exp(x − Th2)   (x ≤ Th2)
f(x) = 1   (Th2 < x < Th1)

so that it returns a value that lengthens the analysis window length W when the correlation value or similarity is larger than the first threshold Th1, and a value that shortens the analysis window length when it is smaller than the second threshold Th2. When the value lies between the thresholds Th1 and Th2, the function returns 1.

(When the correlation value or similarity is larger than the first threshold Th1) When the correlation value r(c_i, c_{i-1}) or the similarity h(c_i, c_{i-1}) is larger than the first threshold Th1, an analysis window length W'''' longer than W is calculated using Equation 25 and set.

(When the correlation value or similarity is smaller than the second threshold Th2) Conversely, when the correlation value r(c_i, c_{i-1}) or the similarity h(c_i, c_{i-1}) is smaller than the second threshold Th2, an analysis window length W'''' shorter than W is calculated using Equation 25 and set.
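The example function f(x) of Step 6-3 translates directly into code. The sketch below assumes that Equation 25, which is not reproduced in this text, has the form W'''' = f(x) · W, as suggested by the description of f(x) as the factor applied to W. The function names, the rounding to a whole number of samples, and the threshold values are hypothetical.

```python
import math


def window_multiplier(x, th1, th2):
    """Example f(x) from Step 6-3; th2 must be smaller than th1."""
    if th2 >= th1:
        raise ValueError("the second threshold must be smaller than the first")
    if x >= th1:
        return (x - th1) + 1.0   # at least 1: lengthen the window
    if x <= th2:
        return math.exp(x - th2)  # at most 1: shorten the window
    return 1.0                    # between the thresholds: keep W


def adapted_window_length(x, base_window, th1, th2):
    """Assumed form of Equation 25: W'''' = f(x) * W, rounded to samples."""
    return max(1, round(window_multiplier(x, th1, th2) * base_window))


# Hypothetical thresholds Th1 = 0.8, Th2 = 0.3 and base window W = 256 samples.
longer = adapted_window_length(0.9, 256, 0.8, 0.3)
same = adapted_window_length(0.5, 256, 0.8, 0.3)
shorter = adapted_window_length(0.1, 256, 0.8, 0.3)
```

Note that with this particular f(x) the multiplier is continuous at both thresholds (it equals 1 at x = Th1 and at x = Th2), so the window length does not jump when the score crosses a threshold.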

[0056] (Step 6-4) Using the second analysis window length W'''' obtained as described above, the speech analysis unit 607 calculates feature vectors, and the HMM recognition unit 608 calculates, according to Equation 26, the likelihood L''''(w) between the feature vector sequence H and the hidden Markov model λ(w) of each vocabulary item to be recognized stored in the HMM storage unit 609.

[Equation 26]

[0057] (Step 6-5) Once the likelihood L''''(w) of each vocabulary item to be recognized has been calculated, the recognition result determination unit 610 determines the recognition result y according to Equation 27.

[Equation 27]

[0058] By making the analysis window length variable through (Step 6-1) to (Step 6-5) as described above, a speech recognition method with high recognition accuracy can be realized without significantly increasing the recognition processing time. The function f(x) is not limited to the one shown here; any function with the same tendency may be used.

【００５９】[0059]

[Effects of the Invention] As described in detail above, according to the inventions of claims 1 to 3, 5, and 6 of the present application, speech can be analyzed using the optimum analysis window length. Accordingly, the spectrum of a long analysis frame is selected for highly correlated sounds such as vowels, and the spectrum of a short analysis frame is selected for weakly correlated sounds such as consonants, so that the analysis window length is set automatically and a high recognition rate can be achieved. Further, according to the invention of claim 4 of the present application, an optimum feature vector suited to the characteristics of the input signal can be obtained, which improves the recognition rate.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of a speech recognition device according to a first embodiment of the present invention.

FIG. 2 is a block diagram showing the configuration of a speech recognition device according to a second embodiment of the present invention.

FIG. 3 is a block diagram showing the configuration of a speech recognition device according to a third embodiment of the present invention.

FIG. 4 is a block diagram showing the configuration of a speech recognition device according to a fourth embodiment of the present invention.

FIG. 5 is a block diagram showing the configuration of a speech recognition device according to a fifth embodiment of the present invention.

FIG. 6 is a block diagram showing the configuration of a speech recognition device according to a sixth embodiment of the present invention.

FIG. 7 is a block diagram showing the configuration of a conventional speech recognition device.

[Explanation of symbols]

100, 200, 300, 400, 500, 600, 700: Microphone
101, 201, 301, 401, 501, 601, 701: Low-pass filter
102, 202, 302, 402, 502, 602, 702: A/D conversion unit
103, 104, 203, 303, 403, 703: Speech analysis unit
105, 205, 405: Feature vector correlation calculation unit
106: Feature vector selection unit
107, 208, 308, 408, 508, 608, 704: HMM recognition unit
108, 205, 209, 309, 409, 509, 609, 705: HMM storage unit
109, 210, 310, 410, 510, 610, 706: Recognition result determination unit
204, 304, 404, 604: Buffer
206, 306, 506, 606: Analysis window length setting unit
207, 307, 507, 607: Speech analysis unit
305: Linear regression coefficient calculation unit
406: Speech enhancement coefficient calculation unit
407: Speech enhancement unit
503, 504, 603: Pitch extraction unit
505, 605: Inter-pitch correlation calculation unit

Claims

1. A speech recognition device comprising: speech input means for inputting speech; at least two speech analysis means for converting the input speech into feature vectors, each with a different analysis window length, at each time; feature vector correlation calculating means for calculating at least one of a correlation value and a similarity between the feature vectors output from the speech analysis means; feature vector selecting means for selecting, according to the correlation value or similarity calculated by the feature vector correlation calculating means and from among the feature vectors calculated by the speech analysis means, a feature vector with a long analysis window length when the correlation value or similarity is large, on the basis that the change in the input speech is small, and a feature vector with a short analysis window length when the correlation value or similarity is small, on the basis that the change in the input speech is large (hereinafter, the selected feature vector); standard pattern storage means for storing a model for each vocabulary item to be recognized; recognition means for calculating the likelihood between the time series of the selected feature vectors and the standard patterns stored in the standard pattern storage means; and recognition result determining means for determining a recognition result from the likelihoods for the respective standard patterns output from the recognition means.

2. A speech recognition device comprising: speech input means for inputting speech; first speech analysis means for converting the input speech into a feature vector with a predetermined analysis window length; a buffer for storing the feature vector of at least one past time; feature vector correlation calculating means for calculating at least one of a correlation value and a similarity between the feature vector output from the first speech analysis means and the past feature vector stored in the buffer; analysis window length setting means for comparing the correlation value or similarity with a predetermined first threshold and setting an analysis window length longer than the predetermined analysis window length when it is equal to or greater than the first threshold, and for comparing it with a predetermined second threshold smaller than the first threshold and setting an analysis window length shorter than the predetermined analysis window length when it is equal to or smaller than the second threshold; second speech analysis means for converting the input speech into a feature vector using the analysis window length set by the analysis window length setting means; standard pattern storage means for storing a model for each vocabulary item to be recognized; recognition means for calculating the likelihood between the sequence of feature vectors obtained from the second speech analysis means and the standard patterns stored in the standard pattern storage means; and recognition result determining means for determining a recognition result from the likelihoods for the respective standard patterns output from the recognition means.

3. A speech recognition device comprising: speech input means for inputting speech; first speech analysis means for converting the input speech into a feature vector with a predetermined analysis window length; a buffer for storing at least two feature vectors obtained by the first speech analysis means; linear regression coefficient calculating means for calculating a linear regression coefficient from the feature vectors stored in the buffer; analysis window length setting means for comparing the linear regression coefficient with a predetermined first threshold and setting an analysis window length longer than the predetermined analysis window length when it is less than the first threshold, and for comparing it with a predetermined second threshold larger than the first threshold and setting an analysis window length shorter than the predetermined analysis window length when it is greater than the second threshold; second speech analysis means for converting the input speech into a feature vector using the analysis window length set by the analysis window length setting means; standard pattern storage means for storing a model for each vocabulary item to be recognized; recognition means for calculating the likelihood between the sequence of feature vectors obtained by the second speech analysis means and the standard patterns stored in the standard pattern storage means; and recognition result determining means for determining a recognition result from the likelihoods for the respective standard patterns output from the recognition means.

4. A speech recognition device comprising: speech input means for inputting speech; speech analysis means for converting the input speech into a feature vector with a predetermined analysis window length; a buffer for storing the feature vector of at least one past time; feature vector correlation calculating means for calculating at least one of a correlation value and a similarity between the feature vector output from the speech analysis means and the past feature vector stored in the buffer; enhancement coefficient calculating means for comparing the correlation value or similarity with a predetermined threshold and, when it is equal to or less than the threshold, calculating an enhancement coefficient for enhancing the feature vector; speech enhancement means for replacing the feature vector with a feature vector enhanced by the enhancement coefficient; standard pattern storage means for storing a model for each vocabulary item to be recognized; recognition means for calculating the likelihood between the time series of the resulting feature vectors and the standard patterns stored in the standard pattern storage means; and recognition result determining means for determining a recognition result from the likelihoods for the respective standard patterns output from the recognition means.

5. A speech recognition device comprising: speech input means for inputting speech; at least two pitch extracting means for converting the input speech into the pitch frequency (fundamental frequency) of the speech, each with a different analysis window length, at each time; inter-pitch correlation calculating means for calculating at least one of a correlation value and a similarity between the pitch frequencies output from the pitch extracting means; analysis window length setting means for setting, according to the correlation value or similarity and from among the analysis window lengths used by the pitch extracting means, a long analysis window length when the correlation value or similarity is large, on the basis that the change in the input speech is small, and a short analysis window length when the correlation value or similarity is small, on the basis that the change in the input speech is large; speech analysis means for converting the input speech into a feature vector using the analysis window length set by the analysis window length setting means; standard pattern storage means for storing a model for each vocabulary item to be recognized; recognition means for calculating the likelihood between the sequence of feature vectors obtained by the speech analysis means and the standard patterns stored in the standard pattern storage means; and recognition result determining means for determining a recognition result from the likelihoods for the respective standard patterns output from the recognition means.

6. A speech recognition device comprising: speech input means for inputting speech; pitch extracting means for converting the input speech into the pitch frequency (fundamental frequency) of the speech with a predetermined analysis window length; a buffer for storing at least one past pitch frequency extracted by the pitch extracting means; inter-pitch correlation calculating means for calculating at least one of a correlation value and a similarity between the pitch frequency output from the pitch extracting means and the past pitch frequency stored in the buffer; analysis window length setting means for comparing the correlation value or similarity with a predetermined first threshold and setting an analysis window length longer than the predetermined analysis window length when it is equal to or greater than the first threshold, and for comparing it with a predetermined second threshold smaller than the first threshold and setting an analysis window length shorter than the predetermined analysis window length when it is equal to or smaller than the second threshold; speech analysis means for converting the input speech into a feature vector using the analysis window length set by the analysis window length setting means; standard pattern storage means for storing a model for each vocabulary item to be recognized; recognition means for calculating the likelihood between the sequence of feature vectors obtained by the speech analysis means and the standard patterns stored in the standard pattern storage means; and recognition result determining means for determining a recognition result from the likelihoods for the respective standard patterns output from the recognition means.