JP2007156107A

JP2007156107A - Speech recognition apparatus and method

Info

Publication number: JP2007156107A
Application number: JP2005351308A
Authority: JP
Inventors: Hiroki Yamamoto; 寛樹山本
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2005-12-05
Filing date: 2005-12-05
Publication date: 2007-06-21

Abstract

PROBLEM TO BE SOLVED: To avoid repeating the same misrecognition in speech recognition results.
SOLUTION: Speech input from a voice input unit 201 is recognized by an acoustic processing unit 202 through a search unit 204, and a plurality of recognition results are output together with their recognition scores. A recognition result selection unit 205 selects a predetermined number of recognition results in descending order of recognition score. When a recognition result output unit 209 outputs the selected results, the highest-ranked recognition result is held in a recognition result history 208. If the current highest-ranked recognition result matches the previous entry held in the recognition result history 208, the rank of that result is lowered before output, which prevents the same misrecognition from being repeated.
[Selected drawing] Figure 2

Description

The present invention relates to a speech recognition apparatus, and a method therefor, that performs speech recognition and outputs the recognition result.

With recent improvements in speech recognition technology, devices that can be operated by voice have come into practical use. One open problem in speech recognition is how to handle the case where an erroneous recognition result is output.

To address this problem, speech recognition apparatuses configured so as not to repeat the same recognition error are disclosed in, for example, Patent Literature 1 and Patent Literature 2.

The speech recognition apparatus disclosed in Patent Literature 1 includes means by which the user inputs whether a recognition result is correct, and means for storing, for each recognition candidate, the number of times the user has judged it to be a misrecognition. A recognition candidate whose misrecognition count exceeds a predetermined number is omitted from the output within the maximum number of output candidates (hereinafter, N). However, if the score of the omitted candidate falls within a predetermined rank, the candidate is output at the (N+1)th position or later.

The speech recognition apparatus disclosed in Patent Literature 2 performs the following control. If speech is re-input between the notification of a recognition result and the confirmation of that result, and the re-input speech belongs to the same predetermined category as the previous recognition result, then any recognition result for the re-input speech that is regarded as identical to the previous result is excluded when the recognition result is determined. For example, if the speech input "Tokyo-to Ota-ku Showajima" yields the recognition result "Tokyo-to Ota-ku Jonanjima", the recognition dictionary is changed or the recognition results are post-processed so that "Tokyo-to Ota-ku Jonanjima" is not output on re-input. This avoids repeating the same misrecognition.

The speech recognition apparatus disclosed in Patent Literature 2 is more convenient for the user than the apparatus disclosed in Patent Literature 1, because the user does not need to judge whether each recognition result is correct.
Japanese Patent No. 3505931
Japanese Patent No. 3580643

The speech recognition apparatus disclosed in Patent Literature 1 requires the user to perform an operation to judge each recognition result as correct or incorrect, so the first problem is that many operations are needed before a recognition result is confirmed. In addition, a recognition candidate whose misrecognition count exceeds the predetermined number is output at the (N+1)th position or later even when that candidate has been recognized correctly. The second problem is therefore that, when the display area for recognition results is limited to N or fewer entries, the user must change the display in some way, for example by switching screens or scrolling, to find the candidate.

The speech recognition apparatus disclosed in Patent Literature 2 needs no operation for judging whether a recognition result is correct, so the first problem of Patent Literature 1 is resolved. However, because a candidate identical to a past recognition result is excluded on re-input, a candidate that has once been treated as a misrecognition is not output on re-input within a predetermined time (the period until the confirmation process). For example, if the user mistakenly performs a re-input operation even though the output recognition result was correct, the desired recognition result cannot be obtained for that predetermined time.

A further problem is the need to support utterances that correct only part of a recognition result rather than all of it. In conversation between people, as shown below, a speaker correcting a mishearing usually corrects only the wrong part instead of repeating the whole content.

A: "What is your address?"
B: "Tokyo-to Ota-ku Shimomaruko."
A: "Tokyo-to Ota-ku Shinmaruko, is that right?"
B: "No, Shimomaruko."
In this way, in conversation between people, the speaker often conveys only the part that was misheard ("Shimomaruko" in the example above). The same applies to a speech recognition apparatus: when speaking to make a correction, it is tedious for the user to repeat the part of the recognition result that was already correct ("Tokyo-to Ota-ku" in the example above). It is more convenient for the user if a recognition result can be corrected by the same procedure as in the conversation above.

The present invention has been made to solve the problems described above, and its object is to provide a speech recognition apparatus and method that correct misrecognitions efficiently.

As one means of achieving the above object, the speech recognition apparatus of the present invention has the following configuration.

Specifically, the apparatus comprises: receiving means for receiving speech information; speech recognition means for recognizing the speech information and obtaining a plurality of recognition results together with their recognition scores; selection means for selecting, from the plurality of recognition results, a plurality of recognition results including at least the recognition result with the highest recognition score; output means for outputting the selected recognition results; and history means for holding, as history, the recognition result with the highest recognition score among the selected recognition results. The selection means lowers the rank of the recognition result with the highest recognition score obtained by the speech recognition means when that result matches at least part of the history held in the history means.

As another means of achieving the above object, the speech recognition method of the present invention has the following configuration.

Specifically, the method comprises: a receiving step of receiving speech information; a speech recognition step of recognizing the speech information and obtaining a plurality of recognition results together with their recognition scores; a selection step of selecting, from the plurality of recognition results, a plurality of recognition results including at least the recognition result with the highest recognition score; an output step of outputting the selected recognition results; and a history step of holding, as history, the recognition result with the highest recognition score among the selected recognition results. The selection step lowers the rank of the recognition result with the highest recognition score obtained in the speech recognition step when that result matches at least part of the held history.

According to the present invention configured as above, when the highest-ranked recognition result matches the recognition result of the immediately preceding utterance, that result is judged to be a misrecognition and its rank is lowered below first place, so that misrecognitions can be corrected efficiently.

The present invention is described in detail below on the basis of preferred embodiments, with reference to the accompanying drawings. The configurations shown in the following embodiments are only examples, and the present invention is not limited to the illustrated configurations.

<First Embodiment>
FIG. 1 is a block diagram showing the schematic configuration of a speech recognition apparatus according to an embodiment of the present invention. In FIG. 1, 101 is a central processing unit (CPU), 102 is a control memory (ROM), and 103 is a memory (RAM). 104 denotes operation keys such as a keyboard and buttons, 105 is a display device such as a liquid crystal display, 106 is a voice input device such as a microphone, and 107 is a voice output device such as a speaker. 108 is a data bus that carries signals between these components.

The control program that implements the speech recognition apparatus of this embodiment, and the data used by that program, are stored in the ROM 102. Under the control of the CPU 101, the control program and data are loaded into the RAM 103 through the data bus 108 as needed and executed by the CPU 101. The result of execution, that is, the speech recognition result, is displayed on the display device 105 or output from the speaker 107 using speech synthesis.

FIG. 2 is a block diagram showing the functional configuration of the speech recognition apparatus in this embodiment. In FIG. 2, 201 is a voice input unit, 202 is an acoustic processing unit, 203 is a likelihood calculation unit, 204 is a search unit, 205 is a recognition result selection unit, and 209 is a recognition result output unit. Further, 206 is an acoustic model, 207 is a language model, and 208 is a recognition result history.

A speech signal input to the apparatus is captured by the voice input unit 201, which detects the speech section. The acoustic processing unit 202 extracts, from the speech signal of the detected section, the features used for speech recognition, such as LPC cepstrum or mel cepstrum coefficients. The likelihood calculation unit 203 refers to the extracted features and the acoustic model 206 and calculates the likelihoods of the acoustic models (HMMs) needed for the search process executed by the search unit 204.

The search unit 204 constructs the HMM sequences needed for speech recognition from the acoustic model 206 and the language model 207, which records a recognition dictionary listing the words to be recognized and the acceptable grammar. Then, referring to the likelihood of each HMM calculated by the likelihood calculation unit 203, it obtains a cumulative likelihood (recognition score) for each HMM sequence using the Viterbi algorithm or a similar method. A higher recognition score thus indicates higher recognition confidence. Although an example using HMMs is described here, other existing methods, such as one based on DP (dynamic programming), may be used. In that case as well, the recognition score is assumed here to take a higher value when recognition confidence is higher.
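As an illustration of how a cumulative likelihood can serve as a recognition score, the following minimal sketch runs the Viterbi algorithm in the log domain over one small HMM. The function name `viterbi_log_score`, the list-of-lists layout, and the choice to take the best score over all final states are assumptions made for this example; the patent does not specify them.

```python
import math

# Log probability used for impossible starts or transitions.
NEG_INF = -math.inf


def viterbi_log_score(log_init, log_trans, log_emit):
    """Return the log-likelihood of the single best state path.

    log_init[s]     : log probability of starting in state s
    log_trans[p][s] : log probability of moving from state p to state s
    log_emit[t][s]  : log likelihood of frame t under state s
    (Hypothetical layout chosen for this sketch.)
    """
    n_states = len(log_init)
    # Best score of any path that ends in each state after the first frame.
    best = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    for frame in log_emit[1:]:
        # Extend every path by one frame and keep the best predecessor.
        best = [
            max(best[p] + log_trans[p][s] for p in range(n_states)) + frame[s]
            for s in range(n_states)
        ]
    # Assumption: any state may end the path; a real recognizer would
    # usually require the path to end in a designated final state.
    return max(best)
```

Computing this score for each candidate HMM sequence and sorting the candidates by it would give the score-ordered list that the selection step works on.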

The recognition result selection unit 205 selects a predetermined number N of recognition results in descending order of recognition score. In making this selection, it refers to the recognition result history 208, which stores the immediately preceding recognition result, to decide the ranks of the selected results; the details are described later. The recognition result output unit 209 displays the recognition results at predetermined positions according to the ranks decided by the recognition result selection unit 205.

The recognition result selection process in the speech recognition apparatus configured as above is described with reference to the flowchart of FIG. 3.

The recognition result selection unit 205 compares the first-ranked recognition result with the immediately preceding recognition result stored in the recognition result history 208. If the first-ranked result matches all or part of the stored result (S101), its rank is lowered to Nth (S102), and the results ranked 2nd through Nth each move up by one. If no preceding recognition result is stored at step S101, or if the first-ranked result does not match the preceding result at all, the ranks are left unchanged.

The recognition result output unit 209 then outputs the N recognition results according to the decided ranks (S103), and the first-ranked recognition result is stored in the recognition result history 208 to update the history (S104).
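Steps S101 to S104 can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `ResultSelector`, the helper `matches_history`, and the token-based test for a partial match (romanized text split on whitespace) are assumptions made for this example, since the text does not say how a partial match is detected.

```python
def matches_history(candidate, previous):
    """S101: True if candidate equals the previous result or one of its parts.

    Assumption: results are romanized strings whose parts (ward, town)
    are separated by whitespace.
    """
    if previous is None:
        return False
    return candidate == previous or candidate in previous.split()


class ResultSelector:
    """Keeps the previous first-ranked result and reorders new results."""

    def __init__(self, n):
        self.n = n            # number of results to output
        self.history = None   # first-ranked result of the previous utterance

    def select(self, scored):
        """scored: list of (text, score); a higher score means higher confidence."""
        ranked = sorted(scored, key=lambda item: item[1], reverse=True)
        top = [text for text, _ in ranked[: self.n]]
        if top and matches_history(top[0], self.history):  # S101
            # S102: move the first result to the last position;
            # every other result moves up by one.
            top.append(top.pop(0))
        if top:
            self.history = top[0]  # S104: remember the result output as first
        return top                 # S103: results in output order
```

The history stores the result that was actually output in first place, so a demoted candidate is not demoted again on the next utterance unless it reaches first place in the output.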

The operation of the speech recognition apparatus of the present invention is described below using a concrete utterance example. Here, the apparatus is assumed to function as an address input device.

First, with reference to FIG. 4, consider the case where the user says "Ota-ku Showajima" in the first utterance, the utterance is misrecognized, and the user speaks again to correct it. In this example, the number N of recognition results to be output is N = 3.

FIG. 4 is a table showing the user's utterances and their recognition results. In FIG. 4, column 301 gives the utterance count, column 302 gives the recognition results in order of recognition score as obtained by the search unit 204, column 303 gives the first-ranked recognition result of the immediately preceding utterance as stored in the recognition result history 208, and column 304 gives the recognition results output by the recognition result output unit 209. Row 305 shows the first utterance and row 306 shows the second utterance. For the first utterance in row 305, the user's utterance "Ota-ku Showajima" shown in column 301 yields the following top three recognition scores, shown in column 302.

1st: Ota-ku Jonanjima
2nd: Ota-ku Showajima
3rd: Ota-ku Keihinjima
In this case there is no preceding recognition result history in column 303, so the recognition result selection unit 205 does not change the ranks (S101). As shown in column 304, the recognition result output unit 209 outputs the recognition results in order of recognition score (S103), and the first-ranked result that was output, "Ota-ku Jonanjima", is stored in the recognition result history 208 (S104).

Because the first-ranked result that was output differs from what was actually said, the user makes a second utterance, "Showajima", to correct the recognition result, as shown in row 306. For this second utterance, the top three recognition scores shown in column 302 are as follows.

1st: Jonanjima
2nd: Showajima
3rd: Keihinjima
Here the first-ranked result "Jonanjima" matches "Jonanjima", a part of the preceding recognition result "Ota-ku Jonanjima" shown in column 303 (S101). The recognition result selection unit 205 therefore lowers "Jonanjima" to 3rd place and moves the 2nd-ranked "Showajima" up to 1st and the 3rd-ranked "Keihinjima" up to 2nd (S102). As shown in column 304, the recognition result output unit 209 outputs the recognition results in the order changed by the recognition result selection unit 205 (S103), and "Showajima", which was output as the first-ranked result, is stored in the recognition result history 208 (S104).

As described above, lowering the rank of a recognition result that matches part of the recognition result of the immediately preceding utterance avoids repeating the same recognition error ("Jonanjima" in this example).

The method of displaying recognition results in the speech recognition apparatus (address input device) of this embodiment is described below with reference to FIG. 5. FIG. 5 shows a display example of recognition results. The recognition result output unit 209 displays the "ward" and the "town name" of the first-ranked recognition result in areas 401 and 402, respectively. At the same time, the 1st to 3rd recognition results are displayed in areas 403, 404, and 405, respectively. Areas 403 to 405 can be selected with the operation keys 104, such as a mouse or keyboard, and the selected recognition result is reflected in areas 401 and 402.
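The mapping from a recognition result to the two display fields can be sketched as below. Everything here is an assumption made for illustration: the function names, the use of romanized text in which a ward name ends in "-ku", and in particular the rule that a field not present in a partial result keeps its previous value, which the text does not state explicitly.

```python
def split_address(result):
    """Split a result into (ward, town); either part may be None.

    Assumption: romanized text, parts separated by whitespace,
    ward names ending in "-ku".
    """
    ward = None
    town = None
    for token in result.split():
        if token.endswith("-ku"):
            ward = token
        else:
            town = token
    return ward, town


def update_fields(fields, result):
    """Return new display fields, overwriting only the parts found in result.

    Assumption: a part that is absent from the result keeps the value
    it had before (for example, the ward after a town-only correction).
    """
    ward, town = split_address(result)
    return {
        "ward": ward if ward is not None else fields.get("ward"),
        "town": town if town is not None else fields.get("town"),
    }
```

For example, applying "Ota-ku Jonanjima" and then the town-only result "Showajima" leaves the ward field at "Ota-ku" and changes only the town field.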

FIG. 6 shows how the display changes, under this display method, for the speech input example of FIG. 4.

In FIG. 6, 500 is a display example before the first utterance, in which all display areas are empty.

510 shows an example in which the first utterance, "Ota-ku Showajima", has been made and the recognition results of row 305 are displayed. For the utterance "Ota-ku Showajima", the first-ranked result selected by the recognition result selection unit 205 is "Ota-ku Jonanjima"; its "Ota-ku" is displayed in area 511 and its "Jonanjima" in area 512. The 1st to 3rd recognition results obtained at this time are displayed in areas 513 to 515.

520 shows a display example when the second utterance, "Showajima", has been recognized. As described earlier, the utterance "Showajima" yields "Jonanjima" as the first-ranked recognition result. However, because this partly matches the preceding recognition result "Ota-ku Jonanjima" (S101), the recognition result selection unit 205 lowers "Jonanjima" to 3rd place and moves up the results ranked below it, "Showajima" and "Keihinjima" (S102). The recognition result output unit 209 therefore displays "Showajima", which has moved up to 1st place, in area 522, and outputs the 1st to 3rd recognition results to areas 523 to 525 in the order decided by the recognition result selection unit 205.

In this embodiment, the user can select the desired result from the displayed recognition results. For example, as shown in area 525 of display example 520, "Jonanjima" has been demoted by the recognition result selection unit 205 and is displayed as the 3rd recognition result, but it can still be selected with the operation keys 104, such as a mouse. The display after this selection is shown at 530: "Jonanjima", selected by the user, is displayed in area 532. In this case, the recognition result "Jonanjima" selected by the user is stored in the recognition result history 208.

As described above, in this embodiment, even when a recognition result matches part or all of the preceding recognition result, that is, even when it is judged to be a misrecognition, it is still output and displayed in Nth place (N = 3 in this example). Therefore, as long as the display has room for at least N recognition results, the user can select that result without any operation such as switching screens.

This embodiment has described the case where the user corrects only the town name, but the present invention is of course not limited to this example and also applies when only the ward name, or both the ward name and the town name, are corrected. For instance, if the utterance "Ota-ku Showajima" yields "Setagaya-ku Keihinjima" as the first-ranked recognition result and the user says "Ota-ku" to correct only the ward name, the same procedure as in this embodiment applies. The same holds when the user says "Ota-ku Showajima" to correct both the ward name and the town name.

In step S102 of FIG. 3, an example was described in which the first-ranked recognition result, when it matches all or part of the preceding recognition result, is moved to Nth place (N being the number of recognition results the apparatus outputs). However, the present invention is not limited to this example; the same effect is obtained if the result that was first is moved to any rank from 2nd to Nth.
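The variation in which the demoted result may go to any rank from 2nd to Nth can be sketched as a small helper. The name `demote_first` and its 1-based `target_rank` parameter are hypothetical and chosen only for this illustration.

```python
def demote_first(results, target_rank):
    """Move the first-ranked result to target_rank (1-based).

    target_rank must lie between 2 and len(results); the results that
    were ranked 2nd through target_rank each move up by one.
    """
    if not 2 <= target_rank <= len(results):
        raise ValueError("target_rank must be between 2 and the number of results")
    reordered = list(results)          # leave the caller's list untouched
    first = reordered.pop(0)
    reordered.insert(target_rank - 1, first)
    return reordered
```

With `target_rank` equal to the number of results this reproduces the move to Nth place described for step S102; a smaller value keeps the demoted result closer to the top of the list.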

As described above, this embodiment prevents the same misrecognition from being output repeatedly as the first-ranked recognition result, while still outputting that result at a rank other than first. Therefore, if the result is in fact correct, that is, if the judgment that it was a misrecognition is itself wrong, the user can still select it. In addition, when correcting a recognition result the user needs to say only the part to be corrected, which improves convenience.

<Other Embodiments>
Although an embodiment has been described in detail above, the present invention can be embodied as, for example, a system, an apparatus, a method, a program, or a storage medium (recording medium). Specifically, it may be applied to a system composed of a plurality of devices or to an apparatus consisting of a single device.

The present invention is also achieved by supplying a software program that implements the functions of the embodiment described above to a system or apparatus, either directly or remotely, and having a computer of that system or apparatus read and execute the supplied program code. The program in this case is one corresponding to the flowchart shown in the drawings for the embodiment.

Accordingly, the program code itself, installed on a computer in order to implement the functional processing of the present invention on that computer, also implements the present invention. In other words, the present invention includes the computer program itself for implementing its functional processing.

In that case, the program may take any form that provides the functions of a program, such as object code, a program executed by an interpreter, or script data supplied to an OS.

Recording media for supplying the program include, for example, a floppy (registered trademark) disk, hard disk, optical disk, magneto-optical disk, MO, CD-ROM, CD-R, CD-RW, magnetic tape, nonvolatile memory card, ROM, and DVD (DVD-ROM, DVD-R).

The program can also be supplied as follows. A browser on a client computer connects to a website on the Internet, and the computer program of the present invention itself (or a compressed file that includes an automatic installation function) is downloaded from there to a recording medium such as a hard disk. The program can also be supplied by dividing the program code that constitutes it into a plurality of files and downloading each file from a different website. In other words, a WWW server that lets a plurality of users download the program files for implementing the functional processing of the present invention on a computer is also included in the present invention.

The program of the present invention may also be encrypted, stored on a storage medium such as a CD-ROM, and distributed to users, with users who satisfy predetermined conditions being allowed to download key information for decryption from a website via the Internet. Such a user can then use the key information to execute the encrypted program and install it on a computer.

The functions of the embodiment described above are realized when the computer executes the program it has read. In addition, an OS or the like running on the computer may perform part or all of the actual processing based on the instructions of the program, and the functions of the embodiment described above can also be realized by that processing.

Furthermore, the functions of the embodiment described above are also realized when the program read from the recording medium is written to a memory provided on a function expansion board inserted into the computer or in a function expansion unit connected to the computer, and is then executed. That is, a CPU or the like provided on the function expansion board or in the function expansion unit can perform part or all of the actual processing based on the instructions of the program.

FIG. 1 is a block diagram showing the schematic configuration of a speech recognition apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the functional configuration of the speech recognition apparatus.
FIG. 3 is a flowchart showing the speech recognition processing.
FIG. 4 is a table explaining utterance contents and their recognition results.
FIG. 5 is a diagram showing an example of a speech recognition result display.
FIG. 6 is a diagram showing an example of a speech recognition result display.

Claims

1. A speech recognition apparatus comprising:
receiving means for receiving speech information;
speech recognition means for recognizing the speech information and acquiring a plurality of recognition results together with recognition scores;
selection means for selecting, from the plurality of recognition results, a plurality of recognition results including at least the recognition result having the highest recognition score;
output means for outputting the selected recognition results; and
history means for holding, as a history, the recognition result having the highest recognition score among the selected recognition results,
wherein the selection means lowers the rank of the recognition result having the highest recognition score acquired by the speech recognition means when that recognition result matches at least a part of the history held in the history means.

2. The speech recognition apparatus according to claim 1, wherein the selection means lowers the rank of the recognition result having the highest recognition score acquired by the speech recognition means when that recognition result matches a part of the latest history held in the history means.

3. The speech recognition apparatus according to claim 1, wherein, when lowering the rank of the recognition result having the highest recognition score, the selection means lowers it to a rank within the number of recognition results selected by the selection means.

4. The speech recognition apparatus according to claim 1, wherein the output means displays the selected recognition results.

5. The speech recognition apparatus according to claim 4, further comprising operator selection means for selecting, from among the recognition results displayed by the output means, a recognition result in accordance with an instruction from an operator,
wherein the output means outputs the recognition result selected by the operator selection means as the highest-ranked recognition result.

6. The speech recognition apparatus according to claim 5, wherein the operator's instruction to the operator selection means is given with a pointing device.

7. The speech recognition apparatus according to claim 5, wherein the history means holds the recognition result selected by the operator selection means as a history.

8. A speech recognition method comprising:
a receiving step of receiving speech information;
a speech recognition step of recognizing the speech information and acquiring a plurality of recognition results together with recognition scores;
a selection step of selecting, from the plurality of recognition results, a plurality of recognition results including at least the recognition result having the highest recognition score;
an output step of outputting the selected recognition results; and
a history step of holding, as a history, the recognition result having the highest recognition score among the selected recognition results,
wherein the selection step lowers the rank of the recognition result having the highest recognition score acquired in the speech recognition step when that recognition result matches at least a part of the history held in the history step.

9. The speech recognition method according to claim 8, wherein the selection step lowers the rank of the recognition result having the highest recognition score to a rank within the number of recognition results selected by the selection step.

10. The speech recognition method according to claim 8, wherein the output step displays the selected recognition results,
the method further comprises an operator selection step of selecting, from among the recognition results displayed in the output step, a recognition result in accordance with an instruction from an operator, and
the output step outputs the recognition result selected in the operator selection step as the highest-ranked recognition result.

11. The speech recognition method according to claim 10, wherein the history step holds the recognition result selected in the operator selection step as a history.

12. A program that, when executed on an information processing apparatus, realizes the speech recognition method according to any one of claims 8 to 11.
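
For illustration only, the selection, history, and operator selection steps recited in the claims might be sketched as follows in Python. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: the class name NBestSelector, its method names, the default of three displayed results, and the sample strings are all hypothetical. The sketch also assumes that only the latest history entry is compared, that a demoted result moves to the last selected rank, and that the history stores whatever result is output at the top rank.

```python
from dataclasses import dataclass, field


@dataclass
class NBestSelector:
    """Hypothetical selection step with a history of top-ranked outputs."""

    n_best: int = 3  # assumed number of results to output
    history: list[str] = field(default_factory=list)

    def select(self, scored: list[tuple[str, float]]) -> list[str]:
        """Pick the n_best results by score, demoting a repeated top result."""
        ranked = sorted(scored, key=lambda item: item[1], reverse=True)
        selected = [text for text, _ in ranked[: self.n_best]]
        if not selected:
            return selected
        # Assumption: compare only with the latest history entry.
        if len(selected) > 1 and self.history and selected[0] == self.history[-1]:
            # Assumption: demote to the last selected rank so it stays in the output.
            selected.append(selected.pop(0))
        # Assumption: the history holds whatever is output at the top rank.
        self.history.append(selected[0])
        return selected

    def operator_select(self, selected: list[str], choice: str) -> list[str]:
        """Move the operator's choice to the top rank and record it in the history."""
        if choice not in selected:
            raise ValueError(f"{choice!r} is not among the displayed results")
        reordered = [choice] + [text for text in selected if text != choice]
        # Assumption: the operator's choice replaces the latest history entry.
        if self.history:
            self.history[-1] = choice
        else:
            self.history.append(choice)
        return reordered


# Made-up scores for the same utterance recognized twice in a row.
selector = NBestSelector(n_best=3)
scored = [("alpha", 0.9), ("beta", 0.8), ("gamma", 0.5), ("delta", 0.1)]
print(selector.select(scored))  # first attempt: ordered by score
print(selector.select(scored))  # second attempt: repeated top result is demoted
```

Demoting the repeated result to the last selected rank, rather than dropping it, keeps it available for selection by the operator in case it was in fact correct.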