JPH09114484A

JPH09114484A - Voice recognition device

Info

Publication number: JPH09114484A
Application number: JP7275866A
Authority: JP
Inventors: Toshiyuki Takezawa; 寿幸竹澤; Takuma Morimoto; 逞森元
Original assignee: ATR ONSEI HONYAKU TSUSHIN KENKYUSHO KK; ATR Interpreting Telecommunications Research Laboratories
Current assignee: ATR ONSEI HONYAKU TSUSHIN KENKYUSHO KK; ATR Interpreting Telecommunications Research Laboratories
Priority date: 1995-10-24
Filing date: 1995-10-24
Publication date: 1997-05-02
Anticipated expiration: 2015-10-24
Also published as: JP2880436B2

Abstract

PROBLEM TO BE SOLVED: To eliminate the ambiguity of modification (dependency) relations in syntactic analysis by determining those relations for each speech section divided by silent sections and the like, and recognizing the uttered speech on that basis. SOLUTION: A silent section detection unit 30 detects silent sections and the like, including pauses, redundant words, and breaks based on prosodic information, from the time series of feature parameters output from a buffer memory 3, and outputs a detection signal to an LR parser 5. The LR parser 5 reads the data of the speech section of each delimiter unit indicated by the detection signal and performs section-limited HMM-LR processing, using the HMM-LR method, on that speech section. The modification relations within the speech section of each delimiter unit are thus determined first, and the modification relations between words, phrases, or clauses in the speech sections of different delimiter units are determined afterwards.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a speech recognition apparatus, and more particularly to a speech recognition apparatus that performs continuous speech recognition by detecting silent sections and the like, such as pauses (silent sections) or redundant words, in uttered speech. In this specification, "silent sections and the like" refers collectively to pauses, redundant words, and breaks identified from cues such as prosodic information.

[0002]

[Prior Art] In recent years, continuous speech recognition has been actively researched, and sentence-level speech recognition systems have been built at several research institutions. Most of these systems take carefully spoken speech as input. In human-to-human communication, however, redundant words (fillers) typified by "ano" and "eto" occur frequently, as do hesitations, slips of the tongue, and restatements, which appear as pauses, that is, states such as silent sections in which speech is temporarily absent.

[0003] FIG. 9 shows, in stack form, the speech recognition operation when a conventional continuous speech recognition apparatus processes the example sentence of FIG. 2, "kireina kuroi kami no onnanoko o mita" ("I saw a girl with beautiful black hair"). The operation of the conventional apparatus is described with reference to FIG. 9. First, as shown in state stack 201 of FIG. 9, the uttered sequence "kireina" ("beautiful") is recognized and pushed onto the stack as characters. Because "kireina" in state stack 201 is listed in the speech recognition dictionary, it is converted into the symbol "adj", which denotes an adjective phrase, as shown in state stack 202. Next, the uttered sequence "kuroi" ("black") is recognized and pushed as characters, as shown in state stack 203; because "kuroi" is also listed in the speech recognition dictionary, it is converted into the symbol "adj", as shown in state stack 204.

[0004] Next, the uttered sequence "kami no" ("of hair") is recognized and pushed as characters, as shown in state stack 205; because "kami no" is listed in the speech recognition dictionary, it is converted into the symbol "NP", which denotes a noun phrase, as shown in state stack 206. Then, in state stack 206, the syntax rule that an adjective phrase "adj" followed by a noun phrase "NP" forms a noun phrase "NP" is applied: the "adj" converted from "kuroi" ("black") and the "NP" converted from "kami no" are reduced to a single "NP", as shown in state stack 207. That is, the "NP" in state stack 207 represents "kuroi kami no" ("of black hair").

[0005] At state stack 207 there are two options: to apply, or not to apply, the syntax rule that "adj" followed by "NP" forms "NP". If the rule is applied, "kireina" ("beautiful") modifies "kami no" ("of hair"). If it is not applied, the structure remains one in which "kireina" does not modify "kami no", and the decision on the dependency relation is carried over to later recognition processing. In such a case the conventional continuous speech recognition apparatus therefore splits the stacking device into two and continues recognition in both. One device continues from state stack 208, obtained by applying the syntax rule to state stack 207; the other continues from state stack 207 unchanged. Here, the "NP" in state stack 208 of the first device represents "kireina kuroi kami no" ("of beautiful black hair").

[0006] In the first device, the uttered sequence "onnanoko o" ("girl", with the object marker) is recognized and pushed as characters on top of the "NP" representing "kireina kuroi kami no" ("of beautiful black hair"), as shown in state stack 209; because "onnanoko o" is listed in the speech recognition dictionary, it is converted into the symbol "NP", as shown in state stack 210. Next, in state stack 210, the syntax rule that a noun phrase "NP" followed by a noun phrase "NP" forms a noun phrase "NP" is applied: the "NP" converted from "kireina kuroi kami no" and the "NP" converted from "onnanoko o" are reduced to a single "NP", as shown in state stack 211. The "NP" in state stack 211 represents "kireina kuroi kami no onnanoko" ("a girl with beautiful black hair"). Then the uttered sequence "mita" ("saw") is recognized and pushed as characters, as shown in state stack 212; because "mita" is listed in the speech recognition dictionary, it is converted into "VP", which denotes a verb phrase, as shown in state stack 213, and the whole is recognized as one sentence, as shown in state stack 214. The result is a recognition in which "kireina" ("beautiful") modifies "kami no" ("of hair").

[0007] In the second device, the uttered sequence "onnanoko o" ("girl", with the object marker) is recognized and pushed as characters, as shown in state stack 221, and is converted into the symbol "NP" denoting a noun phrase, as shown in state stack 222. Next, in state stack 222, the syntax rule is applied: the "NP" converted from "kuroi kami no" ("of black hair") and the "NP" converted from "onnanoko o" are reduced to a single "NP", as shown in state stack 223. The "NP" in state stack 223 represents "kuroi kami no onnanoko" ("a girl with black hair"). The syntax rule is then applied again: the "adj" converted from "kireina" ("beautiful") and the "NP" converted from "kuroi kami no onnanoko o" are reduced to "NP", as shown in state stack 224. That is, "kireina" is recognized as modifying "onnanoko o". Next, the uttered sequence "mita" ("saw") is recognized and pushed as characters, as shown in state stack 225; it is converted into "VP", denoting a verb phrase, as shown in state stack 226, and the whole is recognized as one sentence, as shown in state stack 227. The result is a recognition in which "kireina" modifies "onnanoko o".

[0008]

[Problems to Be Solved by the Invention] As detailed above, when the example sentence of FIG. 2 is recognized by the conventional continuous speech recognition apparatus, two recognition results with different structures are obtained, one in which "kireina" ("beautiful") modifies "kami no" ("of hair") and one in which it modifies "onnanoko o" ("girl"), so the ambiguity of dependency relations in syntactic analysis cannot be resolved. A further problem is that, as a consequence, the ambiguity grows as longer utterances are handled.
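The ambiguity described above can be reproduced with a small enumeration. The following is only a sketch under stated assumptions: the grammar is a toy containing the two rules quoted in the description (adj NP → NP and NP NP → NP) plus an assumed sentence rule NP VP → S, and the romanized tokens and the function name `parses` are illustrative, not part of the patent. Under this toy grammar the enumeration finds the two bracketings discussed above and also a third one, in which both adjectives attach to the compound noun phrase.

```python
# Toy grammar: binary rules mapping (left category, right category) -> parent.
# The S rule is an assumption of this sketch.
RULES = {("adj", "NP"): "NP", ("NP", "NP"): "NP", ("NP", "VP"): "S"}

# Dictionary lookup result for each word of the example sentence.
WORDS = [("kireina", "adj"), ("kuroi", "adj"), ("kami-no", "NP"),
         ("onnanoko-o", "NP"), ("mita", "VP")]


def parses(i, j):
    """Return every (category, bracketing) that covers WORDS[i:j]."""
    if j - i == 1:
        word, cat = WORDS[i]
        return [(cat, word)]
    found = []
    for k in range(i + 1, j):  # try every split point
        for left_cat, left_tree in parses(i, k):
            for right_cat, right_tree in parses(k, j):
                parent = RULES.get((left_cat, right_cat))
                if parent is not None:
                    found.append((parent, f"({left_tree} {right_tree})"))
    return found


sentences = [tree for cat, tree in parses(0, len(WORDS)) if cat == "S"]
for tree in sentences:
    print(tree)
```

Each printed bracketing corresponds to one stack history that a parser without pause information would have to keep alive in parallel.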

[0009] An object of the present invention is to solve the above problems and to provide a speech recognition apparatus capable of resolving the ambiguity of dependency relations in syntactic analysis.

[0010]

[Means for Solving the Problems] The speech recognition apparatus according to claim 1 of the present invention is a speech recognition apparatus provided with speech recognition means that recognizes input uttered speech and outputs a speech recognition result, characterized by comprising detection means that detects, based on the input uttered speech, at least one of a pause, a redundant word, and a phrase or clause boundary and outputs a detection signal, wherein the speech recognition means determines dependency relations in syntactic analysis based on the detection signal and recognizes the uttered speech.

[0011] The speech recognition apparatus according to claim 2 is the speech recognition apparatus according to claim 1, characterized in that the speech recognition means performs speech recognition processing on each speech section of input uttered speech consisting of a plurality of speech sections divided by at least one of the pause, the redundant word, and the phrase or clause boundary, and then determines the dependency relations between words, phrases, or clauses belonging to different speech sections, thereby recognizing the input uttered speech.

[0012] The speech recognition apparatus according to claim 3 is the speech recognition apparatus according to claim 1 or 2, characterized in that the detection means detects the pause by detecting that at least one of the following conditions is satisfied: a first condition that the power of the uttered speech remains at or below a predetermined threshold for a duration within a predetermined range, and a second condition that the number of zero crossings of the uttered speech remains at or above a predetermined threshold for a predetermined time.

[0013] The speech recognition apparatus according to claim 4 is the speech recognition apparatus according to claim 1 or 2, characterized in that the detection means detects at least one of the pause, the redundant word, and the phrase or clause boundary by determining whether it matches a corresponding predetermined language model.

[0014]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings.

<First Embodiment> FIG. 1 is a block diagram of a continuous speech recognition apparatus 81 according to a first embodiment of the present invention. The continuous speech recognition apparatus 81 of the first embodiment is an SSS (Successive State Splitting)-LR (left-to-right, rightmost-derivation type) speaker-independent continuous speech recognition apparatus. A phoneme matching unit 4 performs phoneme matching using the network of hidden Markov models (hereinafter, HMMs) stored in a hidden Markov network (hereinafter, HM network) memory 11, and sends the resulting speech recognition score to a phoneme-context-dependent LR parser (hereinafter, LR parser) 5. In response, the LR parser 5 performs continuous speech recognition on one sentence of the input uttered speech and sends phoneme prediction data to the phoneme matching unit 4, and speech recognition proceeds in this way. In particular, the first embodiment includes a silent section detection unit 30 that, based on the time series of feature parameters output from a buffer memory 3, detects silent sections and the like, including pauses, redundant words, and breaks identified from cues such as prosodic information, and outputs a detection signal to the LR parser 5. The LR parser 5 reads the data of the speech section of each delimiter unit indicated by the detection signal from the silent section detection unit 30 and performs section-limited HMM-LR processing, using the HMM-LR method, on that speech section. When the end of the last delimiter unit is reached, it performs section-unlimited HMM-LR processing on the whole sentence of the input uttered speech and outputs speech recognition result data.

[0015] In SSS, the model is a probabilistic model that represents the temporal evolution of speech parameters as probabilistic transitions between probabilistic stationary signal sources (states) assigned in the phoneme feature space. The model is refined successively by repeatedly splitting individual states in either the context direction or the time direction according to a likelihood-maximization criterion.

[0016] In FIG. 1, the speaker's uttered speech is input to a microphone 1, converted into a speech signal, and input to a feature extraction unit 2. The feature extraction unit 2 A/D-converts the input speech signal and then performs, for example, LPC analysis to extract 34-dimensional feature parameters consisting of log power, 16th-order cepstrum coefficients, Δ log power, and 16th-order Δ cepstrum coefficients. The time series of extracted feature parameters is input to the phoneme matching unit 4 via the buffer memory 3.
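The 34 dimensions follow from 1 (log power) + 16 (cepstrum) + 1 (Δ log power) + 16 (Δ cepstrum). The sketch below shows how such a vector could be assembled from 17 static values per frame. The description does not state how the Δ terms are computed, so the centred two-frame difference used here, along with the function name `add_deltas`, is an assumption made purely for illustration; the LPC analysis itself is not shown.

```python
def add_deltas(static_frames):
    """Append delta features to each frame.

    static_frames: list of 17-element lists [log_power, c1, ..., c16].
    Returns a list of 34-element lists [static..., delta...].
    The delta is an assumed centred difference, clamped at the edges.
    """
    last = len(static_frames) - 1
    out = []
    for t, frame in enumerate(static_frames):
        prev_frame = static_frames[max(t - 1, 0)]
        next_frame = static_frames[min(t + 1, last)]
        delta = [(b - a) / 2.0 for a, b in zip(prev_frame, next_frame)]
        out.append(list(frame) + delta)
    return out


# Three dummy frames whose 17 static values all equal the frame index.
frames = [[float(t)] * 17 for t in range(3)]
features = add_deltas(frames)
print(len(features[0]))  # 34
```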

[0017] The HM network in the HM network memory 11 connected to the phoneme matching unit 4 is represented as a plurality of networks whose nodes are states, and each state holds the following information:
(a) state number
(b) acceptable context class
(c) lists of preceding states and succeeding states
(d) parameters of the output probability density distribution
(e) self-transition probability and transition probabilities to succeeding states
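The per-state record listed in (a) to (e) can be pictured as a small data structure. This is a sketch, not the apparatus's actual memory layout: the class and field names are hypothetical, and representing the acceptable context class as sets of admissible left and right phonemes around a centre phoneme is an assumption of this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class DiagonalGaussianMixture:
    """(d) Output density: mixture weights, means, diagonal variances."""
    weights: List[float]
    means: List[List[float]]
    variances: List[List[float]]


@dataclass
class HMNetState:
    state_number: int                       # (a)
    left_contexts: Set[str]                 # (b) assumed representation
    centre: str                             # (b)
    right_contexts: Set[str]                # (b)
    preceding: List[int]                    # (c)
    succeeding: List[int]                   # (c)
    output_pdf: DiagonalGaussianMixture     # (d)
    self_transition_prob: float             # (e)
    next_transition_probs: Dict[int, float] # (e) keyed by succeeding state

    def accepts(self, left: str, centre: str, right: str) -> bool:
        """True if this state can accept the given phoneme context."""
        return (centre == self.centre
                and left in self.left_contexts
                and right in self.right_contexts)


state = HMNetState(
    state_number=7, left_contexts={"k"}, centre="a", right_contexts={"m", "n"},
    preceding=[3], succeeding=[9],
    output_pdf=DiagonalGaussianMixture([1.0], [[0.0] * 34], [[1.0] * 34]),
    self_transition_prob=0.6, next_transition_probs={9: 0.4},
)
print(state.accepts("k", "a", "n"))
```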

[0018] In the first embodiment, because it must be possible to identify which speaker each distribution derives from, the HM network is created by converting a predetermined speaker-mixture HM network. The output probability density function is a Gaussian mixture with 34-dimensional diagonal covariance matrices, and each component distribution is trained on samples from one particular speaker.

[0019] The phoneme matching unit 4 performs phoneme matching in response to a phoneme matching request from the LR parser 5. With the request, the LR parser 5 passes the phoneme matching section and phoneme context information consisting of the target phoneme and the phonemes before and after it. Based on the received phoneme context information, the phoneme matching unit 4 selects one model by concatenating the states on the HM network that can accept that context, within the constraints of the preceding-state and succeeding-state lists. Using this model, it computes the likelihood of the data in the phoneme matching section and returns that value to the LR parser 5 as the phoneme matching score. Because the model used here is equivalent to an HMM, the likelihood is computed with the forward-pass algorithm used for ordinary HMMs, without modification.
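As an illustration of the forward-pass likelihood mentioned above, the following sketch scores a frame sequence against a left-to-right chain of states in the log domain. It is deliberately simplified: each state has a single diagonal Gaussian rather than a mixture, every state has only a self-loop and a transition to the next state, and the function names and the `stay` key are hypothetical. It assumes 0 < stay < 1 for every state.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    top = max(values)
    if top == -math.inf:
        return -math.inf
    return top + math.log(sum(math.exp(v - top) for v in values))


def forward_log_likelihood(frames, states):
    """Log-likelihood of frames under a left-to-right chain of states.

    states: list of dicts with keys 'mean', 'var', 'stay' (self-loop prob).
    The path must start in the first state and exit from the last one.
    """
    alpha = [-math.inf] * len(states)
    alpha[0] = log_gauss_diag(frames[0], states[0]["mean"], states[0]["var"])
    for x in frames[1:]:
        new_alpha = []
        for j, st in enumerate(states):
            stay = alpha[j] + math.log(st["stay"])
            move = (alpha[j - 1] + math.log(1.0 - states[j - 1]["stay"])
                    if j > 0 else -math.inf)
            emit = log_gauss_diag(x, st["mean"], st["var"])
            new_alpha.append(log_sum_exp([stay, move]) + emit)
        alpha = new_alpha
    return alpha[-1] + math.log(1.0 - states[-1]["stay"])


model = [{"mean": [0.0], "var": [1.0], "stay": 0.5},
         {"mean": [5.0], "var": [1.0], "stay": 0.5}]
matching = forward_log_likelihood([[0.0], [0.1], [5.0], [4.9]], model)
reversed_order = forward_log_likelihood([[5.0], [4.9], [0.0], [0.1]], model)
print(matching > reversed_order)
```

The returned value plays the role of the phoneme matching score: frames that follow the model's state order score higher than the same frames in the wrong order.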

[0020] Meanwhile, the silent section detection unit 30 detects, based on the time series of feature parameters output from the buffer memory 3, silent sections and the like, including pauses, redundant words, and breaks identified from cues such as prosodic information, and outputs a detection signal to the LR parser 5. The silent section detection unit 30 recognizes a redundant word by matching the input against phoneme models of redundant words stored in advance in its internal memory. A pause, which is a silent section, is detected when one of the following two conditions is satisfied.
(First detection condition) The time t0 during which the power is at or below a predetermined threshold level falls, for example, in the following range: preferably 50 ms ≤ t0 ≤ 3 s, and more preferably 50 ms ≤ t0 ≤ 500 ms.
(Second detection condition) The time t1 during which the number of zero crossings, at which the input speech signal crosses zero potential, is at or above a predetermined threshold falls, for example, in the following range: preferably 50 ms ≤ t1 ≤ 3 s, and more preferably 50 ms ≤ t1 ≤ 500 ms.
As for breaks identified from cues such as prosodic information, a point where the intonation rises or falls sharply is presumed to be a phrase or clause boundary. Such a break or boundary is identified by detecting that the fundamental frequency, among the input feature parameters, has risen or fallen sharply with at least a predetermined slope.
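The two pause conditions can be sketched as a run-length test over per-frame measurements. This is a minimal illustration, assuming that frame power and zero-crossing counts have already been computed at a fixed frame period; the function names, the threshold values in the usage lines, and the 10 ms frame period are assumptions, while the 50 ms to 500 ms default window is the "more preferable" range quoted above.

```python
def _runs(flags):
    """Yield (start, end) index pairs of consecutive True values."""
    start = None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            yield start, i
            start = None
    if start is not None:
        yield start, len(flags)


def detect_pauses(power, zero_cross, frame_ms, power_thresh, zc_thresh,
                  min_ms=50, max_ms=500):
    """Return sorted (start_frame, end_frame) spans judged to be pauses.

    Condition 1: power <= power_thresh for min_ms..max_ms.
    Condition 2: zero-crossing count >= zc_thresh for min_ms..max_ms.
    Either condition is sufficient.
    """
    low_power = [p <= power_thresh for p in power]
    high_zc = [z >= zc_thresh for z in zero_cross]
    spans = set()
    for flags in (low_power, high_zc):
        for start, end in _runs(flags):
            if min_ms <= (end - start) * frame_ms <= max_ms:
                spans.add((start, end))
    return sorted(spans)


# 10 ms frames: an 80 ms quiet stretch (a pause) and a 20 ms dip (too short).
power = [9] * 5 + [1] * 8 + [9] * 5 + [1] * 2 + [9] * 3
zero_cross = [0] * len(power)
pauses = detect_pauses(power, zero_cross, frame_ms=10,
                       power_thresh=2, zc_thresh=50)
print(pauses)
```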

[0021] The LR parser 5 reads the data of the speech section of each delimiter unit indicated by the detection signal from the silent section detection unit 30 and performs section-limited HMM-LR processing, using the HMM-LR method, on that speech section. When the end of the last delimiter unit is reached, it performs section-unlimited HMM-LR processing on the whole sentence of the input uttered speech and outputs speech recognition result data. Here, section-limited HMM-LR processing is speech recognition processing by the LR parser 5 using HMMs that is confined to the speech section of a single delimiter unit. Section-unlimited HMM-LR processing is speech recognition processing by the LR parser 5 using HMMs that is not confined to a section: for one sentence of the input uttered speech, it applies the syntax rules in the LR table memory 13 to words, phrases, or clauses belonging to the speech sections of different delimiter units. As shown in FIG. 5, a speech section is one of the sections into which a sentence of the input uttered speech is divided by silent sections and the like (shown in parentheses in FIG. 5), and a delimiter unit is the unit, bracketed in FIG. 5, consisting of a speech section and the silent section or the like that follows it. In this specification, "silent sections and the like" refers collectively to pauses, redundant words, and breaks identified from cues such as prosodic information, and a pause unit is a delimiter unit delimited by a pause, as shown in FIG. 5.
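The relationship between speech sections and delimiter units can be illustrated with a short grouping routine. This is a sketch only: the event representation (a list of `("speech", label)` and `("silence", label)` pairs), the romanized labels, and the function name `split_into_units` are assumptions of the example, not the format of the detection signal.

```python
def split_into_units(events):
    """Group events into delimiter units.

    Each unit is (speech_section, silence), where speech_section is the list
    of speech labels and silence is the silent section or the like that
    follows it (None if the utterance ends without one).
    """
    units = []
    speech_section = []
    for kind, label in events:
        if kind == "speech":
            speech_section.append(label)
        else:
            # A pause, redundant word, or prosodic break closes the unit.
            units.append((speech_section, label))
            speech_section = []
    if speech_section:
        units.append((speech_section, None))
    return units


events = [("speech", "kireina"), ("silence", "△"),
          ("speech", "kuroi"), ("speech", "kami-no"),
          ("speech", "onnanoko-o"), ("speech", "mita")]
units = split_into_units(events)
print(units)
```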

[0022] As is well known, a predetermined context-free grammar (CFG) in the context-free grammar database memory 20 is automatically converted in advance into an LR table, which is stored in the LR table memory 13. Referring to the LR table and to a speaker model memory 12 that contains, for example, a phoneme duration model, the LR parser 5 processes the input phoneme prediction data from left to right without backtracking. Where there is syntactic ambiguity, the stack is split and all candidate analyses are processed in parallel. The LR parser 5 predicts the next phoneme from the LR table in the LR table memory 13 and outputs phoneme prediction data to the phoneme matching unit 4. In response, the phoneme matching unit 4 performs matching with reference to the information in the HM network memory 11 corresponding to that phoneme and returns the likelihood to the LR parser 5 as the speech recognition score. Continuous speech is recognized by concatenating phonemes in this way one after another.

[0023] In the continuous speech recognition apparatus 81 of the first embodiment configured as described above, the feature extraction unit 2, the phoneme matching unit 4, and the LR parser 5 are implemented, for example, by a digital computer.

[0024] FIG. 6 is a flowchart of the speech recognition processing executed by the LR parser 5 of the continuous speech recognition apparatus 81 of FIG. 1. The speech recognition processing is described below with reference to FIG. 6.

[0025] As shown in FIG. 6, in step S1 the HMM work area and the LR parser 5 are initialized. Specifically, one cell with state stack 0 is created. A cell used in the continuous speech recognition apparatus 81 is a data structure holding the information needed for recognition analysis by the conventional HMM-LR method: an LR work area containing a state stack, and an HMM work area consisting of a speech recognition score and a probability table.

[0026] In step S2, the data of the speech section of the delimiter unit indicated by the detection signal from the silent section detection unit 30 is read. In step S3, section-limited HMM-LR processing using the HMM-LR method is performed on the speech section of the delimiter unit whose data has been read. In step S4, it is determined whether the end of the last of the delimiter units has been reached. If not (NO in step S4), the processing returns to step S2 and steps S2 and S3 are repeated. If the end of the last delimiter unit has been reached (YES in step S4), the processing proceeds to step S5, section-unlimited HMM-LR processing is performed, and the speech recognition processing ends.
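The control flow of steps S1 to S5 can be summarized in a few lines. The sketch below shows only the loop structure: the two processing passes are supplied as placeholder callables, the work-area initialization is reduced to an empty list, and all names are hypothetical. It does not implement HMM-LR processing itself.

```python
def recognize(delimiter_units, section_limited_pass, section_unlimited_pass):
    """Run the S1..S5 flow over a list of (speech_section, silence) units."""
    # S1: initialise the work areas (reduced here to an empty result list).
    partial_results = []
    for speech_section, _silence in delimiter_units:
        # S2: read the speech section of the next delimiter unit.
        # S3: section-limited processing, confined to that speech section.
        partial_results.append(section_limited_pass(speech_section))
        # S4: the loop ends once the last delimiter unit has been processed.
    # S5: section-unlimited processing over the whole sentence.
    return section_unlimited_pass(partial_results)


# Placeholder passes that only make the data flow visible.
units = [(["a"], "△"), (["b", "c"], None)]
result = recognize(units,
                   section_limited_pass=lambda section: "+".join(section),
                   section_unlimited_pass=lambda parts: " | ".join(parts))
print(result)
```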

[0027] Next, the speech recognition operation of the continuous speech recognition apparatus 81 of the first embodiment in FIG. 1 is described using the example sentence of FIG. 2, "kireina kuroi kami no onnanoko o mita" ("I saw a girl with beautiful black hair"). FIG. 2 shows an example sentence that contains an ambiguity of dependency relations in structural analysis of the sentence, that is, syntactic analysis. If the example sentence is analyzed from its character string alone, an ambiguity remains between at least two dependency relations: the first, indicated by the arrow above the sentence in FIG. 2, and the second, indicated by the arrow below it. That is, it is unclear whether the sentence has the first dependency relation, in which "kireina" ("beautiful") modifies "onnanoko" ("girl"), giving "beautiful girl", or the second, in which "kireina" modifies "kami" ("hair"), giving "beautiful hair". The inventors found that a pause, which is a silent section, can be used to settle on one of these two dependency relations. If there is a pause between "kireina" and "kuroi" ("black") (marked "△" between them in FIG. 2), the first dependency relation, in which "kireina" modifies "onnanoko", can be selected; if there is a pause between "kami no" and "onnanoko o" (marked "△" between them in FIG. 2), the second dependency relation, in which "kireina" modifies "kami", can be selected. The present invention uses this rule relating pauses to dependency relations to remove the ambiguity of dependency relations in syntactic analysis while performing speech recognition.
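The effect of the pause rule can be shown with a toy two-pass reducer: phrases are first built inside each speech section, and only then are the sections combined. This is a hedged sketch, not the HMM-LR procedure: it replaces the LR table with eager reduction (reduce whenever a rule matches the top two stack items), it withholds the assumed sentence rule NP VP → S until the cross-section pass, and the rule tables and function names are illustrative.

```python
PHRASE_RULES = {("adj", "NP"): "NP", ("NP", "NP"): "NP"}
SENTENCE_RULES = {("NP", "VP"): "S"}  # assumed; applied only across sections


def reduce_eagerly(items, rules):
    """Shift items one by one, reducing the top two whenever a rule matches."""
    stack = []
    for item in items:
        stack.append(item)
        while len(stack) >= 2:
            (left_cat, left_tree), (right_cat, right_tree) = stack[-2], stack[-1]
            parent = rules.get((left_cat, right_cat))
            if parent is None:
                break
            stack[-2:] = [(parent, f"({left_tree} {right_tree})")]
    return stack


def parse_with_pauses(sections):
    """Section-limited pass per section, then one pass over the whole sentence."""
    partial = []
    for section in sections:
        partial.extend(reduce_eagerly(section, PHRASE_RULES))
    return reduce_eagerly(partial, {**PHRASE_RULES, **SENTENCE_RULES})


K, B = ("adj", "kireina"), ("adj", "kuroi")
H, G, V = ("NP", "kami-no"), ("NP", "onnanoko-o"), ("VP", "mita")

# Pause after "kireina": the adjective attaches to the whole "girl" phrase.
first = parse_with_pauses([[K], [B, H, G, V]])
# Pause after "kami no": the adjective attaches to "hair".
second = parse_with_pauses([[K, B, H], [G, V]])
print(first[0][1])
print(second[0][1])
```

In both cases a single analysis survives, because the section boundary decides which reductions happen before the sections are joined.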

【００２８】図３は、図２の例文において第１の係り受
け関係を有する場合の連続音声認識装置８１の音声認識
動作をスタック形式で示す図である。以下に第１の係り
受け関係を有する場合の音声認識動作を図３を参照して
説明する。まず、図３の状態スタック５１に示すよう
に、ＬＲパーザ５で「きれいな」という発声音声の系列
が認識されて文字として積まれ、次に「きれいな」の認
識処理の直後でポーズが無音区間等検出部３０によって
検出されて、検出信号が当該検出部３０からＬＲパーザ
５に入力されて「きれいな」という文字の上にポーズを
表示する「△」として積まれる。次に、状態スタック５
１における「きれいな」という文字は音声認識用辞書に
載っているので、状態スタック５２に示すように形容詞
句を表す「ａｄｊ」という文字に変換される。次に、Ｌ
Ｒパーザ５で「黒い」という発声音声の系列が認識され
て状態スタック５３に示すようにポーズを表示する
「△」の上に文字として積まれ、状態スタック５３にお
ける「黒い」は音声認識辞書に載っているので状態スタ
ック５４に示すように形容詞句を表す「ａｄｊ」という
文字に変換される。ここで、状態スタック５４において
「きれいな」が変換された形容詞句の「ａｄｊ」と「黒
い」が変換された形容詞句の「ａｄｊ」とには、間にポ
ーズを表示する「△」が積まれているので構文規則は適
用されない。FIG. 3 shows, in stack form, the speech recognition operation of the continuous speech recognition apparatus 81 when the example sentence of FIG. 2 has the first dependency relation. This operation is described below with reference to FIG. 3. First, as shown in state stack 51 of FIG. 3, the LR parser 5 recognizes the utterance "pretty" and pushes it as a string. Immediately after "pretty" is recognized, the silent section detection unit 30 detects a pause and sends a detection signal to the LR parser 5, and the pause marker "△" is pushed on top of "pretty". Since the string "pretty" in state stack 51 is in the speech recognition dictionary, it is converted to the symbol "adj", representing an adjective phrase, as shown in state stack 52. Next, the LR parser 5 recognizes the utterance "black" and pushes it as a string on top of the pause marker "△", as shown in state stack 53; since "black" in state stack 53 is in the speech recognition dictionary, it is converted to "adj", as shown in state stack 54. In state stack 54, no syntax rule is applied between the "adj" converted from "pretty" and the "adj" converted from "black", because the pause marker "△" lies between them.

【００２９】次に、ＬＲパーザ５で「髪の」という発声
音声の系列が認識されて状態スタック５５に示すように
「黒い」が変換された形容詞句の「ａｄｊ」の上に文字
として積まれ、状態スタック５５における「髪の」とい
う文字は音声認識辞書に載っているので状態スタック５
６に示すように名詞句を表す「ＮＰ」という文字に変換
される。さらに、状態スタック５６において、形容詞句
の「ａｄｊ」と名詞句の「ＮＰ」とは名詞句の「ＮＰ」
になるという構文規則が適用されて、「黒い」が変換さ
れた形容詞句の「ａｄｊ」と「髪の」が変換された名詞
句の「ＮＰ」とは状態スタック５７に示すように名詞句
の「ＮＰ」に変換される。すなわち、状態スタック５７
における名詞句の「ＮＰ」は「黒い髪の」を表す。次
に、「女の子を」という発声音声の系列が認識されて状
態スタック５８に示すように「黒い髪の」を表す名詞句
の「ＮＰ」の上に文字として積まれ、状態スタック５８
における「女の子を」の文字は音声認識辞書に載ってい
るので状態スタック５９に示すように名詞句を表す「Ｎ
Ｐ」という文字に変換される。Next, the LR parser 5 recognizes the utterance "hair" and, as shown in state stack 55, pushes it as a string on top of the adjective phrase "adj" converted from "black". Since the string "hair" in state stack 55 is in the speech recognition dictionary, it is converted to the symbol "NP", representing a noun phrase, as shown in state stack 56. In state stack 56, the syntax rule that an adjective phrase "adj" followed by a noun phrase "NP" forms a noun phrase "NP" is then applied, so the "adj" converted from "black" and the "NP" converted from "hair" are reduced to a single "NP", as shown in state stack 57. That is, the "NP" in state stack 57 represents "black hair". Next, the utterance "girl" is recognized and pushed as a string on top of the "NP" representing "black hair", as shown in state stack 58; since "girl" in state stack 58 is in the speech recognition dictionary, it is converted to "NP", as shown in state stack 59.

【００３０】次に状態スタック５９において、名詞句の
「ＮＰ」と名詞句の「ＮＰ」は名詞句の「ＮＰ」になる
という構文規則が適用されて、状態スタック５９の「黒
い髪の」が変換された名詞句の「ＮＰ」と「女の子を」
が変換された名詞句の「ＮＰ」は状態スタック６０に示
すように名詞句の「ＮＰ」に変換される。ここで、状態
スタック６０の名詞句の「ＮＰ」は「黒い髪の女の子」
を表す。そして、ＬＲパーザ５で「見た」という発声音
声の系列が認識されて状態スタック６１に示すように
「黒い髪の女の子」を表す名詞句の「ＮＰ」の上に文字
として積まれ、状態スタック６１における「見た」は音
声認識用辞書に載っているので状態スタック６２に示す
ように動詞句を表す「ＶＰ」に変換される。Next, in state stack 59, the syntax rule that a noun phrase "NP" followed by a noun phrase "NP" forms a noun phrase "NP" is applied, so the "NP" converted from "black hair" and the "NP" converted from "girl" are reduced to a single "NP", as shown in state stack 60. The "NP" in state stack 60 represents "black-haired girl". The LR parser 5 then recognizes the utterance "saw" and pushes it as a string on top of the "NP" representing "black-haired girl", as shown in state stack 61; since "saw" in state stack 61 is in the speech recognition dictionary, it is converted to "VP", representing a verb phrase, as shown in state stack 62.

【００３１】そして、最後のポーズ単位の末端まで到達
していると判断されて、ポーズを表示する「△」の前後
に位置する「きれいな」を表す形容詞句の「ａｄｊ」と
「黒い髪の女の子を」を表す名詞句の「ＮＰ」とに、形
容詞句の「ａｄｊ」と名詞句の「ＮＰ」とは名詞句の
「ＮＰ」になるという構文規則が適用されて状態スタッ
ク６３に示すように名詞句の「ＮＰ」に変換される。こ
こで、状態スタック６３の名詞句の「ＮＰ」は、「きれ
いな」が「女の子を」に係る構造の「きれいな黒い髪の
女の子を」を表す。さらに、状態スタック６４に示すよ
うに文章を表す「Ｓ」に変換されて、「きれいな」が
「女の子を」に係るような構造の音声認識結果のみが出
力される。以上のようにポーズを表示する「△」の前後
に位置する「きれいな」を表す形容詞句の「ａｄｊ」と
「黒い髪の女の子を」を表す名詞句の「ＮＰ」との間の
構文規則の適用を最後のポーズ単位の末端まで到達して
から実行するので、「きれいな」が「女の子を」に係る
構造の音声認識結果のみを出力することができる。以上
のように第１の実施形態では、複数の音声区間からなる
入力された発声音声の１つの文の各音声区間の音声認識
を実行した後、区間を限定せず、入力された発声音声の
１つの文に対して異なる音声区間に属する語、句又は節
の間にＬＲテーブルメモリ１３内の構文規則を適用して
異なる音声区間に属する語、句又は節の間の係り受け関
係を決定している。It is then judged that the end of the last pause unit has been reached. The syntax rule that an adjective phrase "adj" followed by a noun phrase "NP" forms a noun phrase "NP" is applied to the "adj" representing "pretty" and the "NP" representing "black-haired girl", which lie on either side of the pause marker "△", and they are reduced to an "NP", as shown in state stack 63. The "NP" in state stack 63 represents "pretty black-haired girl" with the structure in which "pretty" modifies "girl". It is further converted to "S", representing a sentence, as shown in state stack 64, and only the speech recognition result in which "pretty" modifies "girl" is output. Because the syntax rule between the "adj" and the "NP" on either side of the pause marker is applied only after the end of the last pause unit has been reached, only the result with the structure in which "pretty" modifies "girl" is output. Thus, in the first embodiment, after speech recognition has been executed on each speech section of one sentence of input uttered speech consisting of a plurality of speech sections, the syntax rules in the LR table memory 13 are applied, without restricting the section, between words, phrases, or clauses belonging to different speech sections of that sentence, and the dependency relations between them are thereby determined.
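The stack walkthrough above can be illustrated with a minimal sketch. It is not the HMM-LR parser of the embodiment: the lexicon, the three rules, the English glosses, and the names `parse`, `reduce_top`, and `reduce_leftmost` are all assumptions made for illustration. Phase 1 reduces only within a pause unit (a pause entry never matches a rule), and phase 2 drops the pause markers and reduces across the former boundaries.

```python
PAUSE = "PAUSE"  # stands in for the pause marker "△" on the stack

# Toy lexicon and rules taken from the walkthrough (illustrative assumptions).
LEXICON = {"pretty": "adj", "black": "adj", "hair": "NP", "girl": "NP", "saw": "VP"}
PHRASE_RULES = {("adj", "NP"): "NP", ("NP", "NP"): "NP"}
SENTENCE_RULES = {**PHRASE_RULES, ("NP", "VP"): "S"}


def reduce_top(stack, rules):
    """Reduce the top two entries while a rule matches; PAUSE never matches."""
    while len(stack) >= 2:
        (lcat, ltree), (rcat, rtree) = stack[-2], stack[-1]
        parent = rules.get((lcat, rcat))
        if parent is None:
            return
        stack[-2:] = [(parent, f"[{ltree} {rtree}]")]


def reduce_leftmost(stack, rules):
    """Repeatedly reduce the leftmost adjacent pair that matches a rule."""
    changed = True
    while changed:
        changed = False
        for i in range(len(stack) - 1):
            (lcat, ltree), (rcat, rtree) = stack[i], stack[i + 1]
            parent = rules.get((lcat, rcat))
            if parent is not None:
                stack[i:i + 2] = [(parent, f"[{ltree} {rtree}]")]
                changed = True
                break


def parse(tokens):
    stack = []
    # Phase 1: section-limited processing; no reduction crosses a pause.
    for tok in tokens:
        if tok == PAUSE:
            stack.append((PAUSE, PAUSE))
        else:
            stack.append((LEXICON[tok], tok))
            reduce_top(stack, PHRASE_RULES)
    # Phase 2: end of the last pause unit; reduce across the boundaries.
    stack = [entry for entry in stack if entry[0] != PAUSE]
    reduce_leftmost(stack, SENTENCE_RULES)
    return stack


# Pause after "pretty": "pretty" attaches to the whole "black hair girl" phrase.
print(parse(["pretty", PAUSE, "black", "hair", "girl", "saw"]))
# [('S', '[[pretty [[black hair] girl]] saw]')]
```

Moving the pause so that it follows "hair" makes the same sketch return `[[[pretty [black hair]] girl] saw]`, the reading in which "pretty" modifies "hair".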

【００３２】図４は、図２の例文において第２の係り受
け関係を有する場合の連続音声認識装置８１の音声認識
動作をスタック形式で示す図である。以下に第２の係り
受け関係を有する場合の音声認識動作を図４を参照して
説明する。まず、図４の状態スタック１５１に示すよう
に、ＬＲパーザ５で「きれいな」という発声音声の系列
が認識されて文字として積まれる。次に、状態スタック
１５１における「きれいな」という文字は音声認識用辞
書に載っているので、状態スタック１５２に示すように
形容詞句を表す「ａｄｊ」という文字に変換される。次
に、ＬＲパーザ５で「黒い」という発声音声の系列が認
識されて状態スタック１５３に示すように「きれいな」
を表す形容詞句の「ａｄｊ」の上に文字として積まれ、
状態スタック１５３における「黒い」は音声認識辞書に
載っているので状態スタック１５４に示すように形容詞
句を表す「ａｄｊ」という文字に変換される。FIG. 4 shows, in stack form, the speech recognition operation of the continuous speech recognition apparatus 81 when the example sentence of FIG. 2 has the second dependency relation. This operation is described below with reference to FIG. 4. First, as shown in state stack 151 of FIG. 4, the LR parser 5 recognizes the utterance "pretty" and pushes it as a string. Since the string "pretty" in state stack 151 is in the speech recognition dictionary, it is converted to the symbol "adj", representing an adjective phrase, as shown in state stack 152. Next, the LR parser 5 recognizes the utterance "black" and pushes it as a string on top of the "adj" representing "pretty", as shown in state stack 153; since "black" in state stack 153 is in the speech recognition dictionary, it is converted to "adj", as shown in state stack 154.

【００３３】次に、ＬＲパーザ５で「髪の」という発声
音声の系列が認識されて状態スタック１５５に示すよう
に「黒い」を表す形容詞句の「ａｄｊ」の上に文字とし
て積まれ、次に「髪の」の認識処理の直後でポーズが無
音区間等検出部３０によって検出されて、検出信号が当
該検出部３０からＬＲパーザ５に入力されてポーズを表
示する「△」として「髪の」の文字の上に積まれる。そ
して、状態スタック１５５における「髪の」という文字
は音声認識辞書に載っているので状態スタック１５６に
示すように名詞句を表す「ＮＰ」という文字に変換され
る。Next, the LR parser 5 recognizes the utterance "hair" and pushes it as a string on top of the adjective phrase "adj" representing "black", as shown in state stack 155. Immediately after "hair" is recognized, the silent section detection unit 30 detects a pause and sends a detection signal to the LR parser 5, and the pause marker "△" is pushed on top of "hair". Since the string "hair" in state stack 155 is in the speech recognition dictionary, it is converted to the symbol "NP", representing a noun phrase, as shown in state stack 156.

【００３４】さらに、状態スタック１５６において、形
容詞句の「ａｄｊ」と名詞句の「ＮＰ」とは名詞句の
「ＮＰ」になるという構文規則が適用されて、「黒い」
が変換された形容詞句の「ａｄｊ」と「髪の」が変換さ
れた名詞句の「ＮＰ」とは状態スタック１５７に示すよ
うに名詞句の「ＮＰ」に変換される。すなわち、状態ス
タック１５７における名詞句の「ＮＰ」は「黒い髪の」
を表す。さらに、状態スタック１５７において、形容詞
句の「ａｄｊ」と名詞句の「ＮＰ」とは名詞句の「Ｎ
Ｐ」になるという構文規則が適用されて、「きれいな」
が変換された形容詞句の「ａｄｊ」と「黒い髪の」を表
す名詞句の「ＮＰ」とは状態スタック１５８に示すよう
に名詞句の「ＮＰ」に変換される。これによって、「き
れいな」が「髪の」にかかる構造として認識される。Further, in state stack 156, the syntax rule that an adjective phrase "adj" followed by a noun phrase "NP" forms a noun phrase "NP" is applied, so the "adj" converted from "black" and the "NP" converted from "hair" are reduced to a single "NP", as shown in state stack 157. That is, the "NP" in state stack 157 represents "black hair". The same rule is then applied in state stack 157, so the "adj" converted from "pretty" and the "NP" representing "black hair" are reduced to an "NP", as shown in state stack 158. "Pretty" is thereby recognized as modifying "hair".

【００３５】次に、「女の子を」という発声音声の系列
が認識されて状態スタック１５９に示すようにポーズを
表示する「△」の上に文字として積まれ、状態スタック
１５９における「女の子を」の文字は音声認識辞書に載
っているので状態スタック１６０に示すように名詞句を
表す「ＮＰ」という文字に変換される。ここで、状態ス
タック１６０において「きれいな黒い髪の」を表す名詞
句の「ＮＰ」と「女の子を」が変換された名詞句の「Ｎ
Ｐ」とには、間にポーズを表示する「△」が積まれてい
るので構文規則は適用されない。そして、「見た」とい
う発声音声の系列が認識されて状態スタック１６１に示
すように「女の子を」が変換された名詞句の「ＮＰ」の
上に文字として積まれ、状態スタック１６１における
「見た」は音声認識用辞書に載っているので状態スタッ
ク１６２に示すように動詞句を表す「ＶＰ」に変換され
る。Next, the utterance "girl" is recognized and pushed as a string on top of the pause marker "△", as shown in state stack 159; since "girl" in state stack 159 is in the speech recognition dictionary, it is converted to "NP", representing a noun phrase, as shown in state stack 160. In state stack 160, no syntax rule is applied between the "NP" representing "pretty black hair" and the "NP" converted from "girl", because the pause marker "△" lies between them. The utterance "saw" is then recognized and pushed as a string on top of the "NP" converted from "girl", as shown in state stack 161; since "saw" in state stack 161 is in the speech recognition dictionary, it is converted to "VP", representing a verb phrase, as shown in state stack 162.

【００３６】そして、ＬＲパーザ５で最後のポーズ単位
の末端まで到達していると判断されて、状態スタック１
６２におけるポーズを表示する「△」の前後に位置する
「きれいな黒い髪の」を表す名詞句の「ＮＰ」と「女の
子を」を表す名詞句の「ＮＰ」とにＬＲテーブルメモリ
１３内の構文規則が適用されて、状態スタック１６３に
示すように名詞句の「ＮＰ」に変換されて、さらに状態
スタック１６４に示すように文章を表す「Ｓ」の文字に
変換されて、「きれいな」が「黒い髪の」に係るような
構造の音声認識結果のみが出力される。The LR parser 5 then judges that the end of the last pause unit has been reached. The syntax rules in the LR table memory 13 are applied to the "NP" representing "pretty black hair" and the "NP" representing "girl", which lie on either side of the pause marker "△" in state stack 162, and they are reduced to an "NP", as shown in state stack 163. This is further converted to "S", representing a sentence, as shown in state stack 164, and only the speech recognition result with the structure in which "pretty" modifies "black hair" is output.

【００３７】以上の第１の実施形態の連続音声認識装置
８１は、無音区間等を検出して検出信号を出力する無音
区間等検出部３０を備え、ＬＲパーザ５は、無音区間等
検出部３０から入力された検出信号で示された区切り単
位の音声区間のデータを読み込んで、当該音声区間に対
してＨＭＭ−ＬＲ法を用いた区間制限付きＨＭＭ−ＬＲ
処理を実行し、最後の区切り単位の末端まで到達すると
入力された発声音声の１つの文に対して区間制限無しＨ
ＭＭ−ＬＲ処理を実行することにより音声認識結果デー
タを出力する。これによって、各区切り単位の音声区間
内における係り受け関係を決定した後、異なる区切り単
位の音声区間に属する語、句又は節の間の係り受け関係
を決定できるので、統語解析における係り受け関係の曖
昧性を解消することができる。The continuous speech recognition apparatus 81 of the first embodiment described above includes the silent section detection unit 30, which detects silent sections and the like and outputs a detection signal. The LR parser 5 reads the data of each speech section, in the delimiter units indicated by the detection signal input from the silent section detection unit 30, and executes section-limited HMM-LR processing, based on the HMM-LR method, on that speech section. When the end of the last delimiter unit is reached, it executes section-unrestricted HMM-LR processing on one sentence of the input uttered speech and outputs speech recognition result data. The dependency relations within the speech section of each delimiter unit are thus determined first, after which the dependency relations between words, phrases, or clauses belonging to the speech sections of different delimiter units can be determined, so the ambiguity of dependency relations in syntactic analysis can be resolved.

【００３８】＜第２の実施形態＞図７は、本発明に係る
第２の実施形態である連続音声認識装置８２のブロック
図である。図７の第２の実施形態の連続音声認識装置８
２は、図１の第１の実施形態の連続音声認識装置８１の
隠れマルコフ網メモリ１１に代えて隠れマルコフ網メモ
リ１１ａを備え、かつ無音区間等検出部３０を除いて構
成される。第２の実施形態の連続音声認識装置８２にお
いては、ポーズや冗長語並びに韻律的な情報等を手がか
りとする区切りなどの無音区間等をＨＭＭでモデル化し
たモデルが隠れマルコフ網メモリ１１ａに格納され、当
該モデルを用いて無音区間等の検出を音素照合部４で行
っている。<Second Embodiment> FIG. 7 is a block diagram of a continuous speech recognition apparatus 82 according to a second embodiment of the present invention. The apparatus 82 of FIG. 7 differs from the continuous speech recognition apparatus 81 of the first embodiment in FIG. 1 in that it has a hidden Markov network memory 11a in place of the hidden Markov network memory 11 and omits the silent section detection unit 30. In the apparatus 82, models in which silent sections and the like (pauses, redundant words, and delimiters cued by prosodic information and the like) are modeled by HMMs are stored in the hidden Markov network memory 11a, and the phoneme matching unit 4 uses these models to detect silent sections and the like.

【００３９】図８は、図７の連続音声認識装置８２にお
いて実行される音声認識処理を示すフローチャートであ
る。以下、図８を参照して第２の実施形態の連続音声認
識装置８２の音声認識処理について説明する。まず、ス
テップＳ１０においては、ＨＭＭ作業域の初期化、並び
にＬＲパーザ５の初期化を実行する。具体的には、状態
スタック０のセルを１個作成する。そして、ステップＳ
１１において、例えば、特徴パラメータの処理単位であ
る音声フレーム（例えば２０ミリ秒）毎に音声データの
読み込みを行い、ステップＳ１２において区間制限付き
ＨＭＭ−ＬＲ処理を実行する。次にステップＳ１３にお
いて無音区間等を検出したか否かが判断され、無音区間
等を検出していない場合はステップＳ１１に進みステッ
プＳ１１，Ｓ１２の処理が繰り返され、無音区間等を検
出した場合はステップＳ１４に進む。FIG. 8 is a flowchart of the speech recognition processing executed in the continuous speech recognition apparatus 82 of FIG. 7. The processing is described below with reference to FIG. 8. First, in step S10, the HMM work area and the LR parser 5 are initialized; specifically, one cell of state stack 0 is created. In step S11, speech data is read in, for example one speech frame (for example, 20 milliseconds) at a time, the frame being the processing unit of the feature parameters, and in step S12 section-limited HMM-LR processing is executed. In step S13, it is judged whether a silent section or the like has been detected. If not, the process returns to step S11 and steps S11 and S12 are repeated; if so, the process proceeds to step S14.

【００４０】ステップＳ１４において、すべての音声区
間の音声認識処理が終了したか否かが判断され、すべて
の音声区間の処理が終了していないときは（ステップＳ
１４においてＮＯ）ステップＳ１１に進み、ステップＳ
１１，Ｓ１２，Ｓ１３の処理を繰り返し、すべての音声
区間の処理が終了したと判断されると（ステップＳ１４
においてＹＥＳ）ステップＳ１５に進み、入力された発
声音声の１つの文に対して区間制限無しＨＭＭ−ＬＲ処
理を実行して音声認識処理を終了する。In step S14, it is judged whether speech recognition processing has finished for all speech sections. If it has not (NO in step S14), the process returns to step S11 and steps S11, S12, and S13 are repeated. If it has (YES in step S14), the process proceeds to step S15, where section-unrestricted HMM-LR processing is executed on one sentence of the input uttered speech, and the speech recognition processing ends.
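The control flow of steps S10 to S15 can be sketched as follows. This is only an outline of the loop structure: the function `recognize` and its three callbacks are hypothetical stand-ins, strings replace speech frames, and "_" plays the role of a frame in which a silent section or the like is detected. No HMM-LR processing is implemented here.

```python
def recognize(frames, limited_step, boundary_detected, unrestricted_pass):
    """Mirror the order of steps S10 to S15; the three callbacks are hypothetical."""
    sections = [[]]                          # S10: initialise the work area
    for frame in frames:                     # S11: read the next speech frame
        limited_step(sections[-1], frame)    # S12: section-limited processing
        if boundary_detected(frame):         # S13: silent section or the like?
            sections.append([])              # begin the next speech section
    # S14: the input is exhausted, so every speech section has been processed.
    return unrestricted_pass([s for s in sections if s])  # S15


# Toy stand-ins for the three processing stages (assumptions, not the patent's).
def limited_step(section, frame):
    if frame != "_":
        section.append(frame)


def boundary_detected(frame):
    return frame == "_"


def unrestricted_pass(sections):
    return ["".join(section) for section in sections]


result = recognize("ab_cd_e", limited_step, boundary_detected, unrestricted_pass)
print(result)  # ['ab', 'cd', 'e']
```

Keeping the per-section step and the final pass as separate callbacks reflects the two-stage structure of the flowchart: work confined to one speech section happens inside the loop, and work spanning sections happens once at the end.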

【００４１】以上の第２の実施形態の連続音声認識装置
８２は、無音区間等の検出を隠れマルコフ網メモリ１１
ａに格納されたＨＭＭでモデル化した無音区間等のモデ
ルを使用して音素照合部４で行い、ＬＲパーザ５は、音
声データを読み込んで、１つの音声区間に対してＨＭＭ
−ＬＲ法を用いた区間制限付きＨＭＭ−ＬＲ処理を実行
し、各音声区間についての処理が終了すると入力された
発声音声の１つの文に対して区間制限無しＨＭＭ−ＬＲ
処理を実行することにより音声認識結果データを出力す
る。これによって、各区切り単位の音声区間内における
係り受け関係を決定した後、異なる音声区間に属する
語、句又は節の間の係り受け関係を決定できるので、統
語解析における係り受け関係の曖昧性を解消することが
できる。In the continuous speech recognition apparatus 82 of the second embodiment described above, the phoneme matching unit 4 detects silent sections and the like using the HMM-based models of silent sections and the like stored in the hidden Markov network memory 11a. The LR parser 5 reads in the speech data and executes section-limited HMM-LR processing, based on the HMM-LR method, on each speech section; when processing of every speech section is finished, it executes section-unrestricted HMM-LR processing on one sentence of the input uttered speech and outputs speech recognition result data. The dependency relations within the speech section of each delimiter unit are thus determined first, after which the dependency relations between words, phrases, or clauses belonging to different speech sections can be determined, so the ambiguity of dependency relations in syntactic analysis can be resolved.

【００４２】以上の第１と第２の実施形態においては、
入力された発声音声の１つの文に対して区間制限無しＨ
ＭＭ−ＬＲ処理を実行することにより音声認識結果デー
タを出力するようにした。しかしながら、本発明はこれ
に限らず、入力された発声音声の１つの句又は節等の１
つのシーケンスの発声音声に対して区間制限無しＨＭＭ
−ＬＲ処理を実行するようにしてもよいし、連続音声認
識装置のスイッチがオンされてからオフされるまでの間
に入力される発声音声に対して区間制限無しＨＭＭ−Ｌ
Ｒ処理を実行するようにしてもよい。以上のように構成
しても第１と第２の実施形態と同様に動作し同様の効果
を有する。In the first and second embodiments above, speech recognition result data is output by executing section-unrestricted HMM-LR processing on one sentence of the input uttered speech. The present invention is not limited to this, however: the section-unrestricted HMM-LR processing may instead be executed on the uttered speech of one sequence, such as one phrase or clause of the input uttered speech, or on the uttered speech input between the time the switch of the continuous speech recognition apparatus is turned on and the time it is turned off. These configurations operate in the same way as the first and second embodiments and have the same effects.

【００４３】以上の第１と第２の実施形態においては、
ＨＭＭ−ＬＲ法を用いた音声認識装置について述べてい
るが、本発明はこれに限らず、ニューラルネットワーク
を用いた音声認識装置など他の種類の音声認識装置に適
用することができる。Although the first and second embodiments above describe a speech recognition apparatus using the HMM-LR method, the present invention is not limited to this and can be applied to other types of speech recognition apparatus, such as one using a neural network.

【００４４】[0044]

【発明の効果】本発明に係る請求項１記載の音声認識装
置は、入力された発声音声に基づいてポーズと冗長語と
句又は節の境界とのうちの少なくとも１つを検出して検
出信号を出力する検出手段を備え、上記音声認識手段
は、上記検出信号に基づいて統語解析における係り受け
関係を決定して上記発声音声の音声認識をしている。こ
れによって、統語解析における係り受け関係の曖昧性を
解消できる。[Effects of the Invention] The speech recognition apparatus according to claim 1 of the present invention includes detection means that detects, based on the input uttered speech, at least one of a pause, a redundant word, and a boundary of a phrase or clause and outputs a detection signal, and the speech recognition means determines dependency relations in syntactic analysis based on the detection signal and recognizes the uttered speech. The ambiguity of dependency relations in syntactic analysis can thereby be resolved.

【００４５】また、請求項２記載の音声認識装置は、請
求項１記載の音声認識装置において、上記音声認識手段
は、上記ポーズと冗長語と句又は節の境界とのうちの少
なくとも１つによって分割された複数の音声区間からな
る入力された発声音声の各音声区間について音声認識処
理をした後、異なる音声区間に属する語、句又は節の間
の係り受け関係を決定して、上記入力された発声音声の
音声認識をしている。これによって、統語解析における
係り受け関係の曖昧性を解消できる。In the speech recognition apparatus according to claim 2, which is the apparatus of claim 1, the speech recognition means performs speech recognition processing on each speech section of input uttered speech consisting of a plurality of speech sections divided by at least one of the pause, the redundant word, and the boundary of a phrase or clause, then determines the dependency relations between words, phrases, or clauses belonging to different speech sections, and thereby recognizes the input uttered speech. The ambiguity of dependency relations in syntactic analysis can thereby be resolved.

【００４６】さらに、請求項３記載の音声認識装置は、
請求項１又は２記載の音声認識装置において、上記検出
手段は、上記発声音声のパワーが、所定の時間の範囲だ
け、所定のしきい値以下である第１の条件と、上記発声
音声のゼロクロスの数が、所定の時間の間において、所
定のしきい値以上である第２の条件とのうち少なくとも
１つの条件が満足することを検出することにより上記ポ
ーズを検出している。これによって、上記ポーズに基づ
いて統語解析における係り受け関係を決定でき、統語解
析における係り受け関係の曖昧性を解消できる。Further, in the speech recognition apparatus according to claim 3, which is the apparatus of claim 1 or 2, the detection means detects the pause by detecting that at least one of two conditions is satisfied: a first condition that the power of the uttered speech stays at or below a predetermined threshold for a predetermined time range, and a second condition that the number of zero crossings of the uttered speech is at or above a predetermined threshold during a predetermined time. Dependency relations in syntactic analysis can thereby be determined based on the pause, and their ambiguity can be resolved.
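The two pause conditions of claim 3 (low power, or a high zero-crossing count, over a predetermined time) can be sketched as below. The text gives no threshold values or frame layout, so `power_max`, `zc_min`, `min_frames`, and the function names are assumptions. As a simplification, both conditions are evaluated per frame and a pause is reported only when such frames persist for `min_frames` consecutive frames.

```python
def frame_power(frame):
    """Mean squared amplitude of one non-empty frame of samples."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossings(frame):
    """Number of sign changes between consecutive samples."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))


def detect_pauses(frames, power_max, zc_min, min_frames):
    """Return (start, end) frame index ranges, end exclusive, judged to be pauses."""
    pauses, start = [], None
    for i, frame in enumerate(frames):
        # First condition: low power. Second condition: many zero crossings.
        quiet = frame_power(frame) <= power_max or zero_crossings(frame) >= zc_min
        if quiet and start is None:
            start = i
        elif not quiet and start is not None:
            if i - start >= min_frames:
                pauses.append((start, i))
            start = None
    if start is not None and len(frames) - start >= min_frames:
        pauses.append((start, len(frames)))
    return pauses


loud = [1.0, 1.0, 1.0, 1.0]      # high power, no zero crossings
silent = [0.0, 0.0, 0.0, 0.0]    # zero power
frames = [loud, silent, silent, loud, silent]
print(detect_pauses(frames, power_max=0.01, zc_min=3, min_frames=2))  # [(1, 3)]
```

The trailing single silent frame is not reported because it is shorter than `min_frames`, which is how the sketch models the "predetermined time" requirement.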

【００４７】またさらに、請求項４記載の音声認識装置
は、請求項１又は２記載の音声認識装置において、上記
検出手段は、上記ポーズと冗長語と句又は節の境界との
うちの少なくとも１つを、それぞれの予め決められた言
語モデルに一致するか否かを判断することにより検出し
ている。これによって、音声認識過程で上記ポーズと冗
長語と句又は節の境界とのうちの少なくとも１つを検出
でき、統語解析における係り受け関係の曖昧性を解消で
きる。Furthermore, in the speech recognition apparatus according to claim 4, which is the apparatus of claim 1 or 2, the detection means detects at least one of the pause, the redundant word, and the boundary of a phrase or clause by judging whether each matches its respective predetermined language model. At least one of the pause, the redundant word, and the boundary of a phrase or clause can thereby be detected during the speech recognition process, and the ambiguity of dependency relations in syntactic analysis can be resolved.

[Brief description of the drawings]

【図１】本発明に係る第１の実施形態である連続音声
認識装置のブロック図である。FIG. 1 is a block diagram of a continuous voice recognition device according to a first embodiment of the present invention.

【図２】図１の連続音声認識装置８１の音声認識動作
を説明するために用いた第１と第２の２つの係り受け関
係を有する一例文を示す図である。FIG. 2 is a diagram showing an example sentence having the first and second dependency relations, used to describe the speech recognition operation of the continuous speech recognition apparatus 81 of FIG. 1.

【図３】図１の連続音声認識装置８１の音声認識動作
の一例をスタック形式で示す図である。FIG. 3 is a diagram showing an example of a voice recognition operation of a continuous voice recognition device 81 of FIG. 1 in a stack format.

【図４】図１の連続音声認識装置８１の音声認識動作
の図３とは異なる例をスタック形式で示す図である。FIG. 4 is a diagram showing, in stack form, an example of the speech recognition operation of the continuous speech recognition apparatus 81 of FIG. 1 that differs from the example of FIG. 3.

【図５】図２の例文の音声区間、ポーズ（無音区間
等）及びポーズ単位（区切り単位）を示す図である。FIG. 5 is a diagram showing the speech sections, pauses (silent sections and the like), and pause units (delimiter units) of the example sentence of FIG. 2.

【図６】図１の連続音声認識装置８１のＬＲパーザ５
によって実行される音声認識処理を示すフローチャート
である。FIG. 6 is a flowchart showing the speech recognition processing executed by the LR parser 5 of the continuous speech recognition apparatus 81 of FIG. 1.

【図７】本発明に係る第２の実施形態である連続音声
認識装置８２のブロック図である。FIG. 7 is a block diagram of a continuous voice recognition device 82 according to a second embodiment of the present invention.

【図８】図７の連続音声認識装置８２のＬＲパーザ５
によって実行される音声認識処理を示すフローチャート
である。FIG. 8 is a flowchart showing the speech recognition processing executed by the LR parser 5 of the continuous speech recognition apparatus 82 of FIG. 7.

【図９】従来例の連続音声認識装置の音声認識動作を
スタック形式で示す図である。FIG. 9 is a diagram showing a speech recognition operation of a continuous speech recognition apparatus of a conventional example in a stack format.

[Explanation of symbols]

１…マイクロホン、２…特徴抽出部、３…バッファメモリ、４…音素照合部、５…ＬＲパーザ、１１，１１ａ…隠れマルコフ網メモリ、１２…話者モデルメモリ、１３…ＬＲテーブルメモリ、２０…文脈自由文法データベースメモリ、３０…無音区間等検出部。 DESCRIPTION OF SYMBOLS 1 ... Microphone, 2 ... Feature extraction part, 3 ... Buffer memory, 4 ... Phoneme matching part, 5 ... LR parser, 11, 11a ... Hidden Markov network memory, 12 ... Speaker model memory, 13 ... LR table memory, 20 ... Context-free grammar database memory, 30 ... Silent section detection unit.

Claims

[Claims]

1. A speech recognition apparatus comprising speech recognition means for recognizing input uttered speech and outputting a speech recognition result, the apparatus further comprising detection means for detecting, based on the input uttered speech, at least one of a pause, a redundant word, and a boundary of a phrase or clause and outputting a detection signal, wherein the speech recognition means determines dependency relations in syntactic analysis based on the detection signal and recognizes the uttered speech.

2. The speech recognition apparatus according to claim 1, wherein the speech recognition means performs speech recognition processing on each speech section of input uttered speech consisting of a plurality of speech sections divided by at least one of the pause, the redundant word, and the boundary of a phrase or clause, then determines dependency relations between words, phrases, or clauses belonging to different speech sections, and thereby recognizes the input uttered speech.

3. The speech recognition apparatus according to claim 1 or 2, wherein the detection means detects the pause by detecting that at least one of the following conditions is satisfied: a first condition that the power of the uttered speech is at or below a predetermined threshold for a predetermined time range, and a second condition that the number of zero crossings of the uttered speech is at or above a predetermined threshold during a predetermined time.

4. The speech recognition apparatus according to claim 1 or 2, wherein the detection means detects at least one of the pause, the redundant word, and the boundary of a phrase or clause by determining whether each matches its respective predetermined language model.