JP2011180308A - Voice recognition device and recording medium

Info

Publication number: JP2011180308A
Application number: JP2010043553A
Authority: JP
Inventors: Masatomo Okumura; 真知奥村; Hiroaki Kojima; 宏明児島; Hiroshi Omura; 浩大村
Original assignee: National Institute of Advanced Industrial Science and Technology AIST
Current assignee: National Institute of Advanced Industrial Science and Technology AIST
Priority date: 2010-02-26
Filing date: 2010-02-26
Publication date: 2011-09-15

Abstract

[Problem] To reliably and robustly extract features intrinsic to speech, regardless of the varied conditions of real usage environments, and to use them for speech recognition.
[Solution] Methods based on the hidden Markov model (HMM), currently the most widespread general approach to speech recognition, evaluate the feature quantities at all analysis points (frames) across the entire speech section uniformly when matching input speech against a standard pattern, so speech and noise are reflected equally in the score. In contrast, the present method concentrates on the feature quantities at the local maxima and minima of the formants, where the features important for identifying speech reside, and thereby enables noise-robust speech recognition.
[Selected drawing] FIG. 2

Description

The present invention relates to a speech recognition device and a recording medium.

Known techniques for improving recognition accuracy in noisy environments include preprocessing methods, such as computing the noise spectrum from silent portions and subtracting it from the whole signal (spectral subtraction) or normalizing the feature quantities by the long-term average of the cepstrum (cepstral mean normalization), as well as methods that statistically model the expected noise with a Markov model (see, for example, Patent Document 1).

Patent Document 1: JP 2006-139033 A

However, when the above speech recognition methods match input speech against a standard pattern, they evaluate the feature quantities at all analysis points (frames) across the entire speech section on the same scale, so speech and noise are reflected equally in the score. As a result, recognition accuracy drops sharply under noise, even for small-vocabulary speech recognition.
If the noise characteristics are reasonably stationary and predictable, conventional methods can cope by modeling the noise statistically, as in Patent Document 1. In real applications, however, unexpected noise is unavoidable, and a new approach is required.

The present invention therefore concentrates on the feature quantities at the local maxima and minima of the formants, where the features important for identifying speech reside, to enable noise-robust speech recognition. In addition, when constructing the evaluation patterns for recognition, it uses knowledge from phonetics and acoustic analysis to generate chain patterns of formant maxima and minima automatically from the dictionary-notation text of the words to be recognized. This removes the need for the statistical learning from large amounts of data that conventional methods require, and allows any vocabulary written as text to be a recognition target. The object of the invention is to provide such a speech recognition device, and a storage medium storing the speech recognition feature data used by it.

A speech recognition device according to the present invention comprises:
means for extracting local maxima and local minima, together with the temporal sections in their vicinity, from the trajectory of the change pattern along the time axis of a parameter obtained by frequency analysis of a speech signal uttered by a speaker, the parameter being formant frequency, formant power level, fundamental frequency, autocorrelation coefficient, cepstrum peak value, spectral tilt, pitch channel power, or high-frequency power;
means for generating a feature pattern of the speech signal as a chain pattern linking the feature parameters in those sections;
means for generating, from the written form of each word in the vocabulary targeted for recognition, a standard feature pattern corresponding to that word; and
means for determining the word that matches the speech signal by matching the feature pattern of the uttered speech signal against the standard feature patterns.
A speech recognition device according to the present invention also comprises:
means for generating, from the feature parameters and feature pattern obtained by frequency analysis of a speech signal uttered by a speaker, a sequence of articulatory description categories (voiced, unvoiced, silent, or voiced consonant) that classify the temporal sub-sections of the speech signal by how the sound is produced;
means for deriving, from the written form of each word in the vocabulary targeted for recognition, a standard symbol string of articulatory description categories corresponding to that word;
means for generating a correspondence table between the sequence of articulatory description categories for the uttered speech signal and the standard symbol strings of articulatory description categories; and
means for determining the word that matches the speech signal by pattern matching that uses the correspondence table together with the feature pattern.
A speech recognition device according to the present invention further comprises means for generating a plurality of standard symbol strings for a single word by applying, to the standard symbol string of articulatory description categories, transformation rules that systematize knowledge about variation in speech.
The speech recognition device preferably further comprises means for outputting the speech recognition result for the speaker.
A recording medium according to the present invention stores, for each spoken-language system, at least one of: a dictionary giving the pronunciation of every word to be recognized in symbols; a dictionary describing the pronunciation variation patterns of those words in symbols; standard formant feature values for each phoneme; and the standard temporal change pattern of the formant features of each word.
According to the present invention, the feature quantities at the local maxima and minima of the formants, where the features important for identifying speech reside, can be used intensively, so noise-robust speech recognition is possible.
With this configuration, when evaluation patterns for speech recognition are constructed, chain patterns of formant maxima and minima can be generated automatically from the dictionary-notation text of the words to be recognized, using knowledge from phonetics and acoustic analysis. Statistical learning from large amounts of data is not needed, and any word written as text can be a recognition target.

As described above, the present invention provides a speech recognition device, a speech recognition method, and a speech recognition program capable of noise-robust recognition of arbitrary words without learning from large amounts of data, as well as a storage medium storing the various information used in such speech recognition.

FIG. 1 is a diagram showing the configuration of a computer that operates as a speech recognition device according to an embodiment of the present invention.
FIG. 2 is a diagram showing the configuration and processing flow of the speech recognition system.
FIG. 3 is a diagram showing an example of formant-related feature quantities extracted from a speech signal.
FIG. 4 is a diagram showing an example of extracting the local maxima and minima of a formant frequency trajectory and their neighboring sections.
FIG. 5 is a diagram showing the chain patterns of formant frequency maxima and minima in a sequence of two vowels.
FIG. 6 is a diagram showing the sequence of articulatory description categories extracted from a speech signal sample of the word "暗証番号" (PIN).

Preferred embodiments of the present invention are described in detail below with reference to the drawings. FIG. 1 is a diagram showing the configuration of a computer that operates as a speech recognition device according to an embodiment of the present invention.
As shown in FIG. 1, the computer that operates as the speech recognition device 10 includes a central processing unit (CPU) 12a, a main storage device (memory) 12b, an auxiliary storage device 12c, an input/output interface 12e, a microphone 12f, and a monitor 12d or a signal output port 12g.
The CPU 12a, the memory 12b, the monitor 12d, and the input/output interface 12e are connected to one another via a system bus 12h. The microphone 12f and the signal output port 12g are connected to the system bus 12h via the input/output interface 12e.

The flow of speech recognition in the speech recognition device 10 is outlined below with reference to FIG. 1. In the speech recognition system, the set of words to be recognized is fixed in advance. The user speaks one of the words in this set into the microphone 12f. The microphone 12f picks up the speech as an analog signal, which the input/output interface 12e converts into digital data. This digital data is referred to below as the "speech signal" or, in the sense that it is the digitized waveform of the analog signal, as "speech waveform data". The speech recognition program is stored in the auxiliary storage device 12c, such as an HDD or flash memory, and is executed by the CPU 12a and the memory 12b. The recognition result for the input speech signal is either displayed on the monitor 12d or output from the signal output port 12g as a signal for controlling another device.

The speech recognition system that makes the computer operate as the speech recognition device 10 is described next. FIG. 2 is a diagram showing the processing procedure in the speech recognition system. The speech recognition system 20 shown in FIG. 2 includes a speech signal analysis unit, an articulatory description section extraction unit, an articulatory description category determination unit, a feature sequence generation unit, a recognition target word set pronunciation dictionary database (DB), an articulatory pronunciation description generation unit, an articulatory pronunciation description dictionary database (DB), a pronunciation modification rule database (DB), a pronunciation pattern generation unit, a pronunciation pattern dictionary database (DB), a phoneme standard feature database (DB), a standard feature sequence generation unit, and a feature pattern matching unit.

The speech signal to be recognized is input to the speech signal analysis unit, which obtains basic feature quantities using signal processing techniques centered on frequency analysis. The feature quantities are computed at fixed time intervals (for example, in ms) so that they can be expressed as patterns of change over time. Table 1 shows examples of suitable feature quantities obtained by frequency analysis of the speech signal.

Here, pitch is the fundamental frequency corresponding to the height of the voice. It can be extracted either by estimating periodicity from the autocorrelation function of the speech waveform or by estimation using the cepstrum, which is the Fourier transform of the log power spectrum. A preferred implementation uses both and selects between them according to the noise characteristics. A formant is a frequency at which the inside of the mouth (the vocal tract), acting as an acoustic tube during speech production, resonates strongly, and it is regarded as an important feature quantity for recognizing speech. A commonly known formant extraction method approximates the resonance characteristics of the vocal tract with an AR model and obtains its poles from the linear prediction coefficients (LPC). FIG. 3 shows an example of formant extraction.
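The LPC-based formant estimation mentioned above can be sketched as follows. This is a minimal illustration, not the implementation used in the embodiment: the function names `lpc_coefficients` and `formants_from_lpc` are hypothetical, the autocorrelation method is assumed for fitting the AR model, and no windowing or pre-emphasis is applied (practical systems usually add both and also filter poles by bandwidth).

```python
import numpy as np


def lpc_coefficients(frame, order):
    """Fit an AR model x[n] ~ sum_k a[k] * x[n-k] by the autocorrelation method.

    Returns the array a[1..order]. Hypothetical helper for illustration.
    """
    x = np.asarray(frame, dtype=float)
    # Autocorrelation at lags 0..order.
    r = np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(order + 1)])
    # Toeplitz normal equations R a = r[1:].
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])


def formants_from_lpc(a, sample_rate):
    """Estimate resonance frequencies (Hz) from the angles of the AR-model poles."""
    poles = np.roots(np.concatenate(([1.0], -np.asarray(a, dtype=float))))
    poles = poles[np.imag(poles) > 0]  # keep one pole of each conjugate pair
    freqs = np.angle(poles) * sample_rate / (2.0 * np.pi)
    return np.sort(freqs)
```

For a frame of voiced speech, an order of roughly two poles per expected formant plus a few extra is a common rule of thumb; that choice is an assumption here and is not stated in the source.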

Next, the resulting time-series patterns of the feature quantities are input to the articulatory description section extraction unit. As shown in FIG. 3, this unit finds, from the time-varying pattern of the formant frequencies, the sections surrounding the peaks (where the trajectory takes a local maximum) and the valleys (where it takes a local minimum); these are the {p, v} sections. The time width of each surrounding section is suitably set by specifying a threshold on the amount of change in the feature quantity. Because formants carry information fundamental to speech production, these sections are considered to contain a concentration of the information important for distinguishing speech sounds. Noise in sections other than the p and v sections has no effect, so robustness to noise can be expected. These sections are referred to below as articulatory description sections. For some kinds of consonants, such as unvoiced consonants, feature quantities other than formants are important, and a feature quantity suited to the type of phoneme can be adopted. The description below uses formant feature quantities as the example.
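A minimal sketch of the {p, v} section extraction is given below. It is one possible reading of the description, not the embodiment's algorithm: the function name `extract_pv_sections` is hypothetical, and the threshold is interpreted as the maximum allowed deviation from the extreme value within a section.

```python
import numpy as np


def extract_pv_sections(track, threshold):
    """Return (kind, start, end) tuples for a feature trajectory.

    kind is 'p' for a peak (local maximum) and 'v' for a valley (local minimum).
    start and end are inclusive frame indices; the section extends while the
    trajectory stays within `threshold` of the extreme value (an assumption).
    """
    y = np.asarray(track, dtype=float)
    sections = []
    for i in range(1, len(y) - 1):
        if y[i] > y[i - 1] and y[i] >= y[i + 1]:
            kind = "p"
        elif y[i] < y[i - 1] and y[i] <= y[i + 1]:
            kind = "v"
        else:
            continue
        start = end = i
        # Grow the section to the left and right of the extremum.
        while start > 0 and abs(y[start - 1] - y[i]) <= threshold:
            start -= 1
        while end < len(y) - 1 and abs(y[end + 1] - y[i]) <= threshold:
            end += 1
        sections.append((kind, start, end))
    return sections
```

Frames outside the returned sections are simply never looked at by later stages, which is where the expected noise robustness comes from.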

As a representative example of a formant change pattern, FIG. 5 shows the p-v patterns that arise when two vowels occur in sequence. For example, in the sequence from /i/ to /a/, F1 (the first formant) follows the pattern v→p and F2 (the second formant) follows p→v. Capturing such global characteristics of the pattern is an advantage of this method.
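The vowel-pair example can be reproduced with a small sketch. The formant values in `VOWEL_FORMANTS` are rough illustrative numbers chosen for this example, not values from the source, and the function name `pv_chain` is hypothetical.

```python
# Rough illustrative (F1, F2) values in Hz; these are assumptions,
# not the averages used in the embodiment.
VOWEL_FORMANTS = {
    "a": (800, 1300),
    "i": (300, 2300),
    "u": (350, 1400),
    "e": (500, 1900),
    "o": (500, 900),
}


def pv_chain(first, second):
    """Describe how F1 and F2 move when vowel `first` is followed by `second`.

    'v->p' means the formant rises from a valley to a peak,
    'p->v' means it falls from a peak to a valley.
    """
    chain = {}
    for index, name in enumerate(("F1", "F2")):
        start = VOWEL_FORMANTS[first][index]
        end = VOWEL_FORMANTS[second][index]
        if start < end:
            chain[name] = "v->p"
        elif start > end:
            chain[name] = "p->v"
        else:
            chain[name] = "flat"
    return chain
```

With these assumed values, `pv_chain("i", "a")` gives a rising F1 and a falling F2, matching the pattern described for /i/ followed by /a/.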

Next, to obtain a more detailed feature pattern, the phonemes that make up speech are grouped into broad classes, and the global structure of the speech is described as a sequence of these classes. These classes are called articulatory description categories. In a preferred implementation, for example, speech is classified by the presence or absence of vocal cord vibration and described using four categories: {voiced, unvoiced, voiced consonant, silence}. Silence here includes silent sections within a word, such as the closure of a plosive or a geminate consonant (sokuon).

The articulatory description category is determined by the articulatory description category determination unit in FIG. 2, using the p-v pattern values obtained by the preceding articulatory description section extraction unit together with other feature quantities. As a suitable example, Table 2 shows the feature parameters used to determine the four articulatory description categories described above.

For example, a voiced consonant section (Vconso) lies within a voiced section and is determined from a valley (v) in the autocorrelation peak value and a valley (v) in the full-band power parameter. FIG. 6 shows an example of the articulatory description categories obtained in this way. In this example, a speech signal of the utterance "暗証番号" (PIN) was analyzed, the articulatory description categories were determined, and the result is displayed as rectangles in different shades.

The method for generating the standard patterns is described next. The set of words to be recognized is decided, and a description of those words in phoneme-level symbols is prepared as the recognition target word set pronunciation dictionary DB in FIG. 2.
The articulatory pronunciation description generation unit converts the recognition target word set pronunciation dictionary DB into the articulatory description corresponding to each phoneme, which yields the articulatory pronunciation description dictionary DB. For example, with the classification by vocal cord vibration described above, where voiced sounds are written (V), unvoiced sounds (L), voiced consonants (C), and silence (S), the articulatory pronunciation description dictionary DB is written with the four symbols {V, L, C, S}.
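The conversion from a phoneme sequence to the {V, L, C, S} notation can be sketched as below. The phoneme inventory and the mapping tables are assumptions made for illustration; in particular, writing a voiceless plosive as closure plus release ("SL") is inferred from the statement that silence includes plosive closures, and the function name `articulatory_description` is hypothetical.

```python
# Illustrative phoneme classes (assumptions, not the embodiment's tables).
VOWELS = {"a", "i", "u", "e", "o"}
VOICED_CONSONANTS = {"b", "d", "g", "z", "m", "n", "r", "w", "y"}
VOICELESS_PLOSIVES = {"p", "t", "k"}
VOICELESS_FRICATIVES = {"s", "h", "f"}
GEMINATE_MARK = "q"  # hypothetical symbol for the silent part of a geminate


def articulatory_description(phonemes):
    """Map a list of phoneme symbols to a string over {V, L, C, S}."""
    symbols = []
    for phoneme in phonemes:
        base = phoneme.rstrip(":")  # treat a long vowel like its short form
        if base in VOWELS:
            symbols.append("V")
        elif base in VOICED_CONSONANTS:
            symbols.append("C")
        elif base in VOICELESS_PLOSIVES:
            symbols.append("SL")  # silent closure followed by unvoiced release
        elif base in VOICELESS_FRICATIVES:
            symbols.append("L")
        elif base == GEMINATE_MARK:
            symbols.append("S")
        else:
            raise ValueError(f"unknown phoneme: {phoneme!r}")
    return "".join(symbols)
```

Under these assumed tables, the phoneme sequence a-d-a-p-u-t-a: maps to VCVSLVSLV.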

Human speech generally varies acoustically even for the same word, so a pronunciation dictionary should contain not just one standard pronunciation but multiple pronunciation variants. To this end, the pronunciation modification pattern generation unit automatically generates a pronunciation dictionary DB that includes a variety of pronunciation modification patterns, using the pronunciation modification rule DB. The pronunciation modification rules include rules based on phonetic knowledge, such as devoicing and elision, as well as empirical rules derived from observing the acoustic analysis results of many speech data samples.
Table 3 shows examples of pronunciation modification rules based on the articulatory pronunciation description method described above.

For example, the word "アダプター" (adapter) is written as VCVSLVSLV in the sequence obtained directly from the dictionary, but applying rewrite rules x and ii also yields sequences such as VCVSLSV.
From the pronunciation pattern dictionary obtained in this way, the standard feature sequence generation unit associates each vowel phoneme corresponding to a V section with the previously determined average formant values (F1 to F3) of the five Japanese vowels. This yields, as the standard pattern, a sequence of feature vectors whose feature quantities are formant frequencies.
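The generation of pronunciation variants by rewrite rules can be sketched as follows. The two rules in the usage note are hypothetical stand-ins chosen so that VCVSLVSLV can reach VCVSLSV; they are not the rules of Table 3, and the function name `expand_variants` is likewise an assumption.

```python
def expand_variants(base, rules):
    """Return every string reachable from `base` by repeatedly applying rules.

    `rules` is a list of (pattern, replacement) pairs applied at any position.
    Rules that never lengthen the string are assumed, so the search terminates.
    """
    variants = {base}
    frontier = [base]
    while frontier:
        current = frontier.pop()
        for pattern, replacement in rules:
            start = current.find(pattern)
            while start != -1:
                candidate = (
                    current[:start] + replacement + current[start + len(pattern):]
                )
                if candidate not in variants:
                    variants.add(candidate)
                    frontier.append(candidate)
                start = current.find(pattern, start + 1)
    return variants
```

For example, with the hypothetical rules `[("LVS", "LS"), ("SLV", "SV")]` (a vowel dropped between unvoiced sounds, and an unvoiced release dropped before a vowel), `expand_variants("VCVSLVSLV", rules)` contains both the dictionary form and VCVSLSV.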

Meanwhile, for the input speech signal, the sequence of pronunciation descriptions obtained by the articulatory description category determination unit is aligned with the pronunciation pattern dictionary DB, which reflects the pronunciation modification rules. The feature quantities in the corresponding articulatory description sections are then selectively taken and concatenated into a time series, which gives the feature pattern corresponding to the input speech signal. This processing is performed by the feature sequence generation unit in FIG. 2.
The feature pattern of the input speech is then matched against the standard patterns described above, and a degree of fit is computed using a suitable distance measure. The best-fitting word is the recognition result.
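The matching step can be sketched as below. The source only says that a suitable distance measure is used; dynamic time warping (DTW) with Euclidean frame distance is swapped in here as one plausible choice, and the names `dtw_distance` and `recognize` are hypothetical.

```python
import numpy as np


def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of feature vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(a[i - 1] - b[j - 1])
            # Allow a match, or stretching of either sequence.
            cost[i, j] = local + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1]
            )
    return cost[n, m]


def recognize(input_pattern, standard_patterns):
    """Return the word whose standard pattern is closest to the input pattern."""
    return min(
        standard_patterns,
        key=lambda word: dtw_distance(input_pattern, standard_patterns[word]),
    )
```

A usage example with made-up (F1, F2) vectors: given `standard = {"aki": [(800, 1300), (300, 2300)], "uso": [(350, 1400), (500, 900)]}`, an input of `[(780, 1250), (790, 1320), (320, 2250)]` is assigned to "aki", because its vowel sections lie close to the /a/ and /i/ targets.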

The following describes the results of building a prototype speech recognition system with this method and evaluating its effectiveness.
CENSREC is widely used as common data for evaluating speech recognition performance on Japanese words in noisy environments. Within it, CENSREC-3, which targets word recognition, contains speech samples recorded inside a car under a variety of driving conditions and with several types of microphone. CENSREC-3 assumes the hidden Markov model (HMM), the standard speech recognition approach, and provides sentence speech for training and speech for 50 words, listed in Table 4, for evaluation. Because the present method performs no statistical learning, the sentence speech is not used. Table 4 lists the recognition target vocabulary used in the speech recognition experiment.

Table 5 shows the results of recognition by the present method on the CENSREC-3 evaluation word speech.

A direct comparison is not possible because the experimental conditions differ, including how the standard models are generated. Nevertheless, compared with the CENSREC-3 baseline recognition performance, the results indicate that the present method may achieve high recognition accuracy in noisy environments with low-dimensional feature quantities.

In the above, the entire speech recognition program shown in FIG. 2, as well as the recognition target word set pronunciation dictionary DB, the articulatory pronunciation description dictionary DB, the pronunciation modification rule DB, the pronunciation pattern dictionary DB, and the phoneme standard feature DB, can be recorded on a medium such as a CD-ROM and used in the speech recognition device 10. For example, each recognition target word set can be recorded on its own CD-ROM and used according to the application.

The speech recognition device 10 described above provides the following effects. Because it achieves speech recognition that operates robustly across diverse noise environments and diverse pronunciations, it can be adopted in everyday living environments and at production sites such as factories.
In addition, compared with existing methods based on the general hidden Markov model (HMM), the speech recognition device 10 does not require statistical learning from large amounts of data and its standard models are simple, so recognition processing is lightweight and the program implementation is compact.
Furthermore, because the speech recognition device 10 uses feature quantities that correspond to the state of articulation, it is also considered advantageous for applications such as voice training.

10 Speech recognition device
12a Central processing unit (CPU)
12b Main storage device (memory)
12c Auxiliary storage device
12d Monitor
12e Input/output interface
12f Microphone
12g Signal output port
12h System bus
20 Speech recognition system

Claims

1. A speech recognition device comprising:
means for extracting local maxima and local minima, together with the temporal sections in their vicinity, from the trajectory of the change pattern along the time axis of a parameter obtained by frequency analysis of a speech signal uttered by a speaker, the parameter being formant frequency, formant power level, fundamental frequency, autocorrelation coefficient, cepstrum peak value, spectral tilt, pitch channel power, or high-frequency power;
means for generating a feature pattern of the speech signal as a chain pattern linking the feature parameters in said sections;
means for generating, from the written form of each word in the vocabulary targeted for speech recognition, a standard feature pattern corresponding to that word; and
means for determining the word that matches the speech signal by matching the feature pattern of the speech signal uttered by the speaker against the standard feature pattern.

2. A speech recognition device comprising:
means for generating, from the feature parameters and feature pattern obtained by frequency analysis of a speech signal uttered by a speaker, a sequence of articulatory description categories, namely a voiced category, an unvoiced category, a silence category, or a voiced consonant category, that classify the temporal sub-sections of the speech signal by how the sound is produced;
means for deriving, from the written form of each word in the vocabulary targeted for speech recognition, a standard symbol string of the articulatory description categories corresponding to that word;
means for generating a correspondence table between the sequence of articulatory description categories corresponding to the speech signal uttered by the speaker and the standard symbol string of the articulatory description categories; and
means for determining the word that matches the speech signal by performing pattern matching that uses the correspondence table together with the feature pattern.

3. The speech recognition device according to claim 2, comprising means for generating a plurality of standard symbol strings for a single word by applying, to the standard symbol string of the articulatory description categories, transformation rules that systematize knowledge about variation in speech.

4. The speech recognition device according to claim 2, further comprising means for outputting the speech recognition result for the speaker.

5. A recording medium that stores, for each spoken-language system, at least one of: a dictionary giving the pronunciation of every word to be recognized in symbols; a dictionary describing the pronunciation variation patterns of those words in symbols; standard formant feature values for each phoneme; and the standard temporal change pattern of the formant features of each word.