JP2009104047A

JP2009104047A - Information processing method and information processing apparatus

Info

Publication number: JP2009104047A
Application number: JP2007277587A
Authority: JP
Inventors: Makoto Hirota (廣田 誠)
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2007-10-25
Filing date: 2007-10-25
Publication date: 2009-05-14

Abstract

Problem: In speech-registration-type speech recognition, to bring run-time recognition accuracy close to the accuracy obtained when the user utters the speech two or more times at registration, without burdening the user with two or more utterances at registration.
Solution: Registration-target speech uttered by a user is acquired, and speech information corresponding to that speech is registered in a memory. Recognition-target speech uttered separately by the user is recognized using the speech information registered in the memory, and one or more recognition results are output. The recognition result the user intended is identified from among the output results, and speech information corresponding to the speech uttered for recognition is registered as registered speech information for the identified result.
Selected drawing: FIG. 2

Description

The present invention relates to an information processing method for performing speech-registration-type speech recognition.

Besides methods that recognize speech against a grammar describing the recognizable vocabulary, there is a method in which one or more utterances are registered in advance and, at run time, the registered speech most similar to the input speech is found. The latter is called speech-registration-type speech recognition. As an example, consider a telephone that places a call when the user speaks the name of the other party instead of dialing the number. In the destination speech registration mode, the user utters “Ichirō”, and that speech is registered in association with the telephone number “03-1XXX-XXXX” of Ichiro Suzuki. After registration, uttering “Ichirō” in calling mode places a call to “03-1XXX-XXXX”. Similarly, if the user registers the utterance “Haisha-san” (“the dentist”) for the telephone number “045-2XX-XXXX” of “○× Dental Clinic”, then simply uttering “Haisha-san” afterwards places a call to the clinic. The registered speech can be any words the user likes.

A common way to implement such speech-registration-type speech recognition is as follows. A speech recognition grammar 1 that accepts any syllable string is prepared. The utterance made at registration is recognized with grammar 1, and a syllable string is output as the recognition result. This syllable string is registered, in association with the target telephone number, in a speech recognition grammar 2 used at run time. In calling mode, the input speech is recognized using grammar 2.
Patent Document 1: JP 2000-276187 A

The example above assumes that the speech is uttered only once at registration. This does not give sufficient run-time accuracy, so systems are often designed to have the user utter the speech two or more times at registration. That design improves accuracy but increases the burden on the user.

Patent Document 1 discloses a method in which, when the first input speech at run time is unrecognizable or misrecognized, the second input speech that the user utters as a restatement is registered as registered speech. Specifically, when the first and second input speech are similar, the second is judged to be a restatement, and if the user then performs a manual operation, the first speech is additionally registered as registered speech for the command given by that manual operation. Additionally registering run-time input speech in this way improves subsequent recognition accuracy, so a single utterance suffices at registration. However, this method presupposes that the user repeats the same words after a recognition error and then performs a manual operation. That situation does not reliably occur, so it is uncertain whether the additional registration will ever take effect.

To solve the above problem, an information processing apparatus according to the present invention comprises: acquisition means for acquiring speech uttered by a user; registration means for registering, in a memory, speech information corresponding to registration-target speech acquired by the acquisition means as registered speech information; speech recognition means for recognizing recognition-target speech acquired by the acquisition means using the speech information registered in the memory, and outputting one or more recognition results; and identification means for identifying, from among the one or more recognition results output by the speech recognition means, the recognition result intended by the user, wherein the registration means registers speech information corresponding to the recognition-target speech as registered speech information corresponding to the recognition result identified by the identification means.

According to the present invention, run-time recognition accuracy can be brought close to the accuracy obtained when the speech is uttered two or more times at registration, without burdening the user with uttering it two or more times at registration.

Embodiments of the present invention are described below in detail with reference to the accompanying drawings. The constituent elements described in these embodiments are only examples and are not intended to limit the scope of the present invention to them.

Embodiment 1 of the present invention is described below in detail with reference to the drawings. In this embodiment, operating a mobile phone such as 101 in FIG. 1 is described as an example of the information processing apparatus according to the present invention. The mobile phone 101 has a liquid crystal screen 102 and can be operated with keys 103 and by speech input from a microphone 104.

FIG. 2 is a block diagram showing the configuration of the mobile phone 101 according to this embodiment. In the figure, 201 is a speech input unit. 202 is a speech recognition unit. 203 is a speech recognition grammar holding unit used by the speech recognition unit 202. 204 is a determination unit that determines whether to present a confirmation dialog to the user. 205 is a confirmation unit that, when the determination unit 204 determines that a confirmation dialog is to be presented, presents it to the user via a display screen or a speech output unit (neither shown) and receives input from the user. 206 is a syllable recognition unit that, once the user's instruction input at the confirmation unit 205 is complete, recognizes the speech acquired via the speech input unit 201 and outputs a syllable sequence as speech information corresponding to that speech. 207 is a syllable recognition grammar holding unit used by the syllable recognition unit 206. 208 is a registration unit that registers the syllable sequence output by the syllable recognition unit 206 in the speech recognition grammar holding unit 203 as registered speech information.

FIG. 3 shows the hardware configuration of the mobile phone 101 according to this embodiment. In the figure, 301 is a CPU, which operates according to a program that implements the operating procedure of the mobile phone 101 described later. 302 is a RAM, which provides the storage area the program needs to run. 303 is a ROM, which holds, among other things, the program that implements the operating procedure. 304 denotes the various buttons shown as 103 in FIG. 1. 305 is the microphone shown as 104 in FIG. 1. 306 is the LCD shown as 102 in FIG. 1. 307 is a bus.

This embodiment describes a case in which, using the telephone-number memory registration function, speech is registered for a stored telephone number and the stored number is then called up by that speech.

First, the process of storing a telephone number in memory is described with reference to the flowchart of FIG. 4. Screen 601 in FIG. 6 shows the memory registration screen after it has been called up by a predetermined operation on the mobile phone. As shown in 601, the user is prompted for the telephone number to register and enters it using the keys 103 (S401). Next, a screen such as 602 prompts the user to input the registration speech that will be used to call up that number. The user utters the registration-target speech into the microphone 104 (S402). Any utterance may be used as the registration speech. Here, suppose that the number belongs to a friend, Taro Yamada, and that the user utters “Tarō-kun”. The syllable recognition unit 206 recognizes this input speech (S403), using the grammar in the syllable recognition grammar holding unit 207. This grammar is written so that it accepts any syllable string pattern, and the recognition result is output as a syllable string.
If the input speech “Tarō-kun” is recognized accurately, the resulting syllable string is “ta-roo-ku-n”. Syllable recognition is not always 100% accurate, however, and a syllable string that differs from the correct one, such as “toa-reo-ku-n”, may be output. The output syllable string is registered in the speech recognition grammar of the speech recognition grammar holding unit 203 in association with the telephone number being registered, 0901111XXXX (S404), and a message such as 603 in FIG. 6 is displayed. FIG. 9 shows an example of the speech recognition grammar 203 after speech has been registered for several telephone numbers in the same way. Each entry pairs the syllable string obtained by syllable recognition of the registration speech with the corresponding telephone number. For reference, the right-hand side of the figure shows what was actually uttered.
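The registration steps S403 and S404 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `SpeechGrammar` class, the `register_number` function, and the `recognize_syllables` callback are hypothetical names, and the grammar is modeled as a plain mapping from telephone number to registered syllable strings.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SpeechGrammar:
    """Toy stand-in for the grammar held by unit 203 (hypothetical structure)."""
    # telephone number -> syllable strings registered for that number
    entries: Dict[str, List[str]] = field(default_factory=dict)

    def register(self, phone_number: str, syllables: str) -> None:
        self.entries.setdefault(phone_number, []).append(syllables)


def register_number(
    grammar: SpeechGrammar,
    phone_number: str,
    audio: bytes,
    recognize_syllables: Callable[[bytes], str],
) -> str:
    """Recognize the registration utterance and pair it with the number."""
    syllables = recognize_syllables(audio)      # S403: syllable recognition
    grammar.register(phone_number, syllables)   # S404: store the pair
    return syllables


# Usage: a fake recognizer that returns the imperfect result from the text.
grammar = SpeechGrammar()
register_number(grammar, "0901111XXXX", b"...", lambda audio: "toa-reo-ku-n")
```

Keeping a list of syllable strings per number, rather than a single string, is a design assumption that lets a later utterance be added for the same number.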

Next, the operation of calling up a stored telephone number by speech input and placing a call is described with reference to the flowchart of FIG. 5. Screen 701 in FIG. 7 shows the screen for calling up a stored number by speech input, after it has been brought up by a predetermined operation on the mobile phone. As shown in 701, the user is prompted for speech input and speaks (S501). The speech recognition unit 202 recognizes this recognition-target input speech (S502), using the grammar in the speech recognition grammar holding unit 203. In this example, FIG. 9 is the speech recognition grammar. The recognition result is output to the display screen by display control means as an N-best list with confidence scores. For example, if the user says “Tarō-kun” to call the friend Taro Yamada, the recognition result is output in the form shown in FIG. 10.
It is then determined whether the first-ranked candidate differs significantly from the second-ranked and lower candidates (S503). Here, the difference is judged significant if the confidence score is 0.7 or higher. In FIG. 10 the first-ranked confidence score is 0.75, so the difference is judged significant. In this case the first-ranked telephone number is taken to match what the user intended, and a call is placed to that number (S507). Screen 702 in FIG. 7 shows the display while calling. If, on the other hand, the recognition result is as shown in FIG. 11, the first-ranked confidence score does not reach the threshold, so a confirmation dialog with the N-best list, such as 802 in FIG. 8, is displayed (S504). The user selects the desired telephone number from the list by a predetermined key operation (S505). Here, the user selects the second number. A call is then placed to the selected number (S507), but before that the process of S506 is performed. Once the user has selected the desired number, it is known that the speech uttered in S501 corresponds to the telephone number 0901111XXXX. The speech information corresponding to that speech is therefore additionally registered as registered speech information corresponding to the operation information for calling 0901111XXXX. The registration method is the same as steps S403 and S404 in FIG. 4. As a result, the speech recognition grammar in the speech recognition grammar holding unit 203 becomes as shown in FIG. 12, with two syllable strings registered for the telephone number 0901111XXXX (Taro Yamada's number).
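The run-time flow S503 to S507 can be sketched as below. This is a hedged illustration under stated assumptions: `handle_call`, the `choose` callback (standing in for the confirmation dialog), and the dictionary-based grammar are hypothetical, and only the 0.7 threshold comes from the text.

```python
from typing import Callable, Dict, List, Tuple

# N-best list: (telephone number, confidence score), best candidate first.
NBest = List[Tuple[str, float]]

THRESHOLD = 0.7  # confidence value the embodiment uses in S503


def handle_call(
    n_best: NBest,
    input_syllables: str,
    grammar: Dict[str, List[str]],
    choose: Callable[[NBest], str],
) -> str:
    """Return the number to dial; learn from the input if the user had to choose."""
    top_number, top_score = n_best[0]
    if top_score >= THRESHOLD:
        # S503 judged the top candidate significant: dial it directly (S507).
        return top_number
    # S504-S505: show the N-best dialog and let the user pick (hypothetical callback).
    chosen = choose(n_best)
    # S506: the run-time utterance becomes an extra registration for that number.
    grammar.setdefault(chosen, []).append(input_syllables)
    return chosen  # S507
```

With a top score of 0.75, as in FIG. 10, the function returns the first number without touching the grammar. With a lower top score, the number returned by `choose` gains a second syllable string, which mirrors the state shown in FIG. 12.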

In this way, although the user performs the registration operation only once, the input speech used when calling up the telephone number makes it possible to reach a state in which two or more utterances are registered, which improves the subsequent recognition rate.

In Embodiment 1, S503 of FIG. 5 decides whether the difference is significant according to whether the first-ranked candidate's confidence score exceeds a predetermined value. Alternatively, the difference may be judged significant when the confidence score of the first-ranked candidate exceeds that of the second-ranked candidate by at least a predetermined margin.
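The margin-based variant of the S503 test can be sketched as follows. The function name, the default margin of 0.2, and the handling of a list with fewer than two candidates are all assumptions made for illustration; the text specifies only that the gap between the first and second scores is compared with a predetermined difference.

```python
from typing import Sequence


def is_significant_by_margin(scores: Sequence[float], min_margin: float = 0.2) -> bool:
    """Judge the top candidate significant if it leads the runner-up by min_margin.

    `scores` must be sorted best first. `min_margin` is a hypothetical value.
    """
    if not scores:
        return False  # nothing was recognized
    if len(scores) == 1:
        return True   # assumption: a lone candidate has no competitor
    return scores[0] - scores[1] >= min_margin
```

For example, scores of 0.75 and 0.5 pass the test, while scores of 0.45 and 0.4 do not, even though neither top score is compared with an absolute threshold.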

In the embodiments above, whether to display the dialog 802 of FIG. 8 is decided from the confidence score of the recognition result. Alternatively, the action for the first-ranked result may be executed before any dialog is displayed, and the N-best dialog displayed only if the user cancels that action. For example, as shown in FIG. 13, if a call is placed to the first-ranked telephone number for the input speech in 1301 and the user immediately cancels it by a predetermined operation as in 1303, the dialog may be displayed as in 1304.

The embodiments above show an example in which a telephone number stored in a mobile phone's memory is called up by speech, but the present invention is of course applicable to speech operation of a variety of devices and systems. Any user interface that uses speech-registration-type speech recognition qualifies, for example speech operation of a copier, a digital camera, or a digital television. The interaction may also be speech-only, with no display. For example, suppose that in a call-center speech dialogue such as
(1) Sys: For product inquiries, press “1”. For repair inquiries, press “2”.
(2) User: [presses “2”]
(3) Sys: For a compact camera, press “1” ...
(4) User: [presses “1”]
(5) Sys: We will connect you to a representative. To register the operations up to this point, say the speech you want to register. If you do not want to register, please hold.
(6) User: Kamera-koshō (“camera malfunction”)
(7) Sys: Your speech has been registered. We will connect you to a representative.
the user registers speech corresponding to the operation information for making a compact-camera repair inquiry. When this user later calls the call center again, the dialogue proceeds as follows.
(1)' Sys: For product inquiries, press “1”. For repair inquiries, press “2”.
(2)' User: Kamera-koshō
(3)' Sys: Is “Kamera-koshō” correct?
(4)' User: Yes
(5)' Sys: We will connect you to a representative.
From the confirmation in (3)' and the reply in (4)', the input speech of (2)' is judged to be intended to reach the representative for compact-camera repair inquiries. The system therefore connects the user to the representative in (5)' and also registers the input speech of (2)' as registered speech. The system speech in (3)' is generated by using the registered speech of (6) for the “Kamera-koshō” part and concatenating it with the speech data for “Is ... correct?”.

In the embodiments above, the syllable string obtained by recognizing the registration-time input speech with the syllable recognition unit 206 is registered in the speech recognition grammar holding unit 203 by the registration unit 208, and the run-time input speech is recognized by the speech recognition unit 202 using the speech recognition grammar in the holding unit 203. The invention can instead be implemented with the configuration shown in FIG. 14. In the figure, 1401 is a speech input unit. 1402 is a syllable recognition unit. 1403 is a syllable recognition grammar holding unit used by the syllable recognition unit 1402. 1404 is an evaluation unit. 1405 is a determination unit that determines whether to present a confirmation dialog to the user. 1406 is a confirmation unit that, when the determination unit 1405 determines that a confirmation dialog is to be presented, presents it to the user and receives input from the user. 1407 is a syllable string holding unit. 1408 is a registration unit that, once the user's instruction input at the confirmation unit 1406 is complete, registers the syllable sequence output by the syllable recognition unit 1402 in the syllable string holding unit 1407.

In this configuration, input speech is recognized by the syllable recognition unit 1402 both at registration and at run time. The syllable recognition unit 1402 outputs a syllable string as the recognition result. Registered speech is stored in the syllable string holding unit 1407 in the form of a syllable string, and the registered content is the same as in FIG. 9. The evaluation unit 1404 computes the degree of match between the syllable string output by the syllable recognition unit 1402 and each syllable string registered in the syllable string holding unit 1407, and outputs an N-best list with match scores in descending order of match. The determination unit 1405 operates on the content of the N-best output, as in Embodiment 1.
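One way the evaluation unit 1404 could score syllable strings is sketched below. The text does not say how the degree of match is computed, so the use of a normalized edit distance over syllables is purely an assumption, as are the function names and the hyphen-separated string format.

```python
from typing import Dict, List, Tuple


def edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance between two syllable sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def match_score(a: str, b: str) -> float:
    """Score in [0, 1]; 1.0 means identical syllable strings (assumed metric)."""
    sa, sb = a.split("-"), b.split("-")
    return 1.0 - edit_distance(sa, sb) / max(len(sa), len(sb))


def n_best(
    input_syllables: str, registered: Dict[str, List[str]], n: int = 3
) -> List[Tuple[str, float]]:
    """Rank numbers by their best-matching registered string, best first."""
    scored = [
        (number, max(match_score(input_syllables, s) for s in strings))
        for number, strings in registered.items()
        if strings
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:n]
```

Taking the maximum over all strings registered for a number means that each additional registration can only raise that number's score for a similar utterance, which is consistent with the accuracy benefit the text attributes to multiple registrations.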

In the embodiments above, the result the user intended is identified from the speech recognition results by means of a confirmation dialog or confirmation message, but the present invention is not limited to this form. The first candidate of the recognition result may be executed, and if the user does not cancel the execution, the first candidate is judged to have matched the user's intention and the information of the input speech is registered as registered speech for that candidate.
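The cancel-free variant can be sketched as follows. The `dial` and `was_cancelled` callbacks are hypothetical stand-ins for the device's call function and for whatever mechanism detects a cancel operation, neither of which the text specifies.

```python
from typing import Callable, Dict, List, Tuple


def execute_and_learn(
    n_best: List[Tuple[str, float]],
    input_syllables: str,
    grammar: Dict[str, List[str]],
    dial: Callable[[str], None],
    was_cancelled: Callable[[], bool],
) -> bool:
    """Execute the top candidate; register the input only if it is not cancelled.

    Returns True when the input speech was registered.
    """
    top_number = n_best[0][0]
    dial(top_number)              # act on the first candidate right away
    if was_cancelled():
        return False              # the user rejected the result: learn nothing
    # No cancellation is treated as implicit confirmation of the first candidate.
    grammar.setdefault(top_number, []).append(input_syllables)
    return True
```

How long to wait before treating the absence of a cancel operation as confirmation is a design choice left open here.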

The embodiments above register a syllable string as the speech information, but the present invention is not limited to this, and speech data may be registered instead. In that case, speech data corresponding to the registration-target speech uttered by the user is first stored in memory, and recognition-target speech uttered separately by the user is compared with the stored speech data to output one or more recognition results. The recognition result the user intended is then identified from among the output results, and speech data corresponding to the recognition-target speech is registered in memory as registered speech information corresponding to the identified result.

The object of the present invention is also achieved as follows. A storage medium on which the program code of software implementing the functions of the embodiments described above is recorded is supplied to a system or apparatus, and the computer (or CPU or MPU) of that system or apparatus reads and executes the program code stored on the storage medium.

In this case, the program code read from the storage medium itself implements the functions of the embodiments described above, and the storage medium storing the program code constitutes the present invention.

Storage media that can be used to supply the program code include, for example, a flexible disk, hard disk, optical disk, magneto-optical disk, CD-ROM, CD-R, magnetic tape, nonvolatile memory card, and ROM.

The embodiments of the present invention are not limited to the case in which the functions described above are implemented by the computer executing the program code it has read. They also include the case in which an OS (operating system) or the like running on the computer performs part or all of the actual processing based on the instructions of the program code, and that processing implements the functions of the embodiments described above.

The functions of the embodiments may also be implemented as follows. The program code read from the storage medium is written to a memory provided on a function expansion board inserted in the computer or in a function expansion unit connected to the computer. Based on the instructions of the program code, a CPU or the like provided on the function expansion board or function expansion unit then performs part or all of the actual processing, and that processing implements the functions of the embodiments described above.

FIG. 1 shows a mobile phone according to Embodiment 1 of the present invention.
FIG. 2 is a block diagram of the mobile phone 101 according to Embodiment 1.
FIG. 3 is a hardware configuration diagram of the mobile phone 101 according to Embodiment 1.
FIG. 4 is a flowchart of the process of registering speech for a stored telephone number on the mobile phone 101 according to Embodiment 1.
FIG. 5 is a flowchart of the operation of calling up a stored telephone number by speech input and placing a call on the mobile phone 101 according to Embodiment 1.
FIG. 6 shows an example of screen transitions when registering speech for a stored telephone number on the mobile phone 101 according to Embodiment 1.
FIG. 7 shows an example of screen transitions when calling up a stored telephone number by speech input and placing a call on the mobile phone 101 according to Embodiment 1.
FIG. 8 shows an example of screen transitions, including the confirmation dialog, when calling up a stored telephone number by speech input and placing a call on the mobile phone 101 according to Embodiment 1.
FIG. 9 shows an example of the speech recognition grammar according to Embodiment 1.
FIG. 10 shows an example of the N-best list with confidence scores output by the speech recognition unit according to Embodiment 1.
FIG. 11 shows another example of the N-best list with confidence scores output by the speech recognition unit according to Embodiment 1.
FIG. 12 shows an example of the speech recognition grammar according to Embodiment 1 after a syllable string of registered speech has been newly added.
FIG. 13 shows an example of screen transitions on the mobile phone 101 according to Embodiment 3, in which a confirmation dialog is displayed when a cancel operation is performed immediately after a stored telephone number is called up by speech input.
FIG. 14 is a block diagram of the mobile phone 101 according to Embodiment 5.

Explanation of symbols

201 Speech input unit
202 Speech recognition unit
203 Speech recognition grammar holding unit
204 Determination unit
205 Confirmation unit
206 Syllable recognition unit
207 Syllable recognition grammar holding unit
208 Registration unit

Claims

An information processing apparatus comprising:
acquisition means for acquiring a voice uttered by a user;
registration means for registering, in a memory as registered voice information, voice information corresponding to a registration target voice acquired by the acquisition means;
speech recognition means for recognizing a recognition target voice acquired by the acquisition means using the registered voice information registered in the memory, and outputting one or more recognition results; and
specifying means for specifying, from among the one or more recognition results output by the speech recognition means, a recognition result intended by the user,
wherein the registration means registers voice information corresponding to the recognition target voice as registered voice information corresponding to the recognition result specified by the specifying means.

The information processing apparatus according to claim 1, wherein:
the registration means registers, as the registered voice information, the voice information corresponding to the registration target voice in association with predetermined operation information;
the speech recognition means outputs, as the recognition result, operation information corresponding to the voice information;
the specifying means specifies, from among one or more pieces of operation information output as recognition results, operation information intended by the user; and
the registration means registers the voice information corresponding to the recognition target voice as registered voice information corresponding to the operation information specified by the specifying means.

The information processing apparatus described above, further comprising display control means for displaying, on a display screen, the one or more recognition results output by the speech recognition means,
wherein the specifying means specifies, as the recognition result intended by the user, the recognition result selected by the user from among the one or more recognition results displayed by the display control means.

The information processing apparatus according to claim 1, further comprising determination means for determining whether to output a confirmation dialog that presents the user with the one or more recognition results output by the speech recognition means and has the user select the intended recognition result,
wherein, when the determination means determines to output the confirmation dialog, the specifying means specifies the recognition result selected by the user via the confirmation dialog as the recognition result intended by the user.

The information processing apparatus according to claim 4, wherein the speech recognition means outputs the recognition results with certainty factors, and
the determination means determines to output the confirmation dialog when the certainty factor of the first candidate among the recognition results does not differ significantly from those of the other candidates.

The information processing apparatus according to claim 5, wherein a significant difference exists when the certainty factor of the first candidate is greater than a predetermined value, or when the difference between the certainty factor of the first candidate and the certainty factor of the second candidate is greater than a predetermined value.
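
By way of non-limiting illustration only, the confirmation-dialog decision recited in the two preceding claims might be sketched as follows. The function names, the 0-to-1 certainty scale, the default threshold values, and the handling of a single candidate are all assumptions made for this sketch; the claims state only that "predetermined values" are used.

```python
from typing import List, Tuple

# (recognition result, certainty factor); the 0.0-1.0 scale is an assumption.
Candidate = Tuple[str, float]


def has_significant_difference(nbest: List[Candidate],
                               min_certainty: float = 0.8,
                               min_margin: float = 0.2) -> bool:
    """Return True when the first candidate stands out from the others.

    The difference is treated as significant when the first candidate's
    certainty exceeds min_certainty, or when its lead over the second
    candidate exceeds min_margin. Both thresholds are hypothetical.
    """
    if not nbest:
        raise ValueError("at least one recognition candidate is required")
    ranked = sorted(nbest, key=lambda c: c[1], reverse=True)
    first = ranked[0][1]
    # Assumption: a lone candidate is compared against a certainty of 0.0.
    second = ranked[1][1] if len(ranked) > 1 else 0.0
    return first > min_certainty or (first - second) > min_margin


def needs_confirmation_dialog(nbest: List[Candidate],
                              min_certainty: float = 0.8,
                              min_margin: float = 0.2) -> bool:
    """Output the dialog only when no significant difference exists."""
    return not has_significant_difference(nbest, min_certainty, min_margin)


# Two close, low-certainty candidates: the user would be asked to choose.
ask_user = needs_confirmation_dialog([("Ichiro", 0.55), ("Haishasan", 0.50)])
```

Under these assumed thresholds, a first candidate that is either confident on its own or well ahead of the second candidate is accepted without a dialog, which limits how often the user is interrupted.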

The information processing apparatus according to claim 4, further comprising execution means for executing processing corresponding to one of the recognition results output by the speech recognition means,
wherein the determination means determines to output the confirmation dialog when the user performs a cancel operation on the processing executed by the execution means.

The information processing apparatus according to claim 1, further comprising execution means for executing processing corresponding to one of the recognition results output by the speech recognition means,
wherein, when the user does not cancel the processing executed by the execution means, the specifying means specifies the recognition result corresponding to that processing as the recognition result intended by the user.

The information processing apparatus according to claim 1, wherein the voice information corresponding to the voice is voice data corresponding to the voice or a syllable string obtained by voice recognition of the voice.

An information processing method comprising:
an acquisition step of acquiring a voice uttered by a user;
a first registration step of registering, in a memory as registered voice information, voice information corresponding to a registration target voice acquired in the acquisition step;
a speech recognition step of recognizing a recognition target voice acquired in the acquisition step using the registered voice information registered in the memory, and outputting one or more recognition results;
a specifying step of specifying, from among the one or more recognition results output in the speech recognition step, a recognition result intended by the user; and
a second registration step of registering, in the memory, voice information corresponding to the recognition target voice as registered voice information corresponding to the recognition result specified in the specifying step.
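
By way of non-limiting illustration only, the steps of the above method might be sketched as follows. The class name `VoiceRegistry`, its method names, and the use of romanized syllable strings as voice information are assumptions made for this sketch. The string comparison with `difflib.SequenceMatcher` is merely a stand-in for acoustic matching and is not the recognition technique of the embodiments.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


class VoiceRegistry:
    """Hypothetical memory that maps operation information to voice samples."""

    def __init__(self) -> None:
        self._entries: Dict[str, List[str]] = {}

    def register(self, operation: str, voice_info: str) -> None:
        # First registration step: store the registration target voice.
        self._entries.setdefault(operation, []).append(voice_info)

    def recognize(self, voice_info: str, n: int = 3) -> List[Tuple[str, float]]:
        # Speech recognition step: score each operation by its closest
        # registered sample and return up to n results, best first.
        scored = []
        for operation, samples in self._entries.items():
            best = max(SequenceMatcher(None, voice_info, s).ratio()
                       for s in samples)
            scored.append((operation, best))
        scored.sort(key=lambda item: item[1], reverse=True)
        return scored[:n]

    def confirm(self, operation: str, voice_info: str) -> None:
        # Second registration step: once the intended result is specified,
        # the recognition target voice itself becomes a registered sample.
        if operation not in self._entries:
            raise KeyError(operation)
        self._entries[operation].append(voice_info)

    def samples(self, operation: str) -> List[str]:
        return list(self._entries.get(operation, []))


registry = VoiceRegistry()
registry.register("03-1XXX-XXXX", "i chi ro o")      # uttered once at registration
registry.register("045-2XX-XXXX", "ha i sha sa n")

results = registry.recognize("i chi ro")             # later utterance when calling
intended = results[0][0]                             # specifying step (top result here)
registry.confirm(intended, "i chi ro")
```

In this sketch each successful call adds one more sample for the intended telephone number, so the registry gradually approaches what multiple utterances at registration time would have provided.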

A program for causing a computer to execute the information processing method according to claim 10.

A computer-readable storage medium storing the program according to claim 11.