JP2020034832A

JP2020034832A - Dictionary generation device, voice recognition system, and dictionary generation method

Info

Publication number: JP2020034832A
Application number: JP2018162731A
Authority: JP
Inventors: Daisuke Yamazaki; Takashi Ito
Original assignee: Denso Ten Ltd
Current assignee: Denso Ten Ltd
Priority date: 2018-08-31
Filing date: 2018-08-31
Publication date: 2020-03-05

Abstract

To improve the recognition rate of voice recognition. SOLUTION: A dictionary generation device according to an embodiment includes a reception unit, a generation unit, and an update unit. The reception unit receives an uttered voice of a word spoken with a pronunciation different from the standard pronunciation. The generation unit generates a phoneme string based on the uttered voice received by the reception unit. The update unit updates a voice dictionary by adding the phoneme string generated by the generation unit to the voice dictionary. SELECTED DRAWING: Figure 3

Description

The present invention relates to a dictionary generation device, a voice recognition system, and a dictionary generation method.

Conventionally, there are in-vehicle devices that accept voice operations spoken by a user. Such an in-vehicle device typically accepts a voice operation by searching a dictionary for voice recognition based on the uttered voice.

Patent Document 1 discloses that, when a plurality of candidates exist for an uttered voice, an additional voice input is accepted so that the candidates are finally narrowed down to one.

Patent Document 1: Japanese Unexamined Patent Application Publication No. H11-250078 (JP-A-11-250078)

However, the prior art leaves room for improvement in the recognition rate of voice recognition. That is, commonly used voice dictionaries for voice recognition are created from utterances by native speakers, so the recognition rate drops for utterances by non-native speakers.

The present invention has been made in view of the above, and an object thereof is to provide a dictionary generation device, a voice recognition system, and a dictionary generation method that can improve the recognition rate of voice recognition.

To solve the above problem and achieve the object, a dictionary generation device according to an embodiment includes a reception unit, a generation unit, and an update unit. The reception unit receives an uttered voice of a word spoken with a pronunciation different from the standard pronunciation. The generation unit generates a phoneme string based on the uttered voice received by the reception unit. The update unit updates a voice dictionary by adding the phoneme string generated by the generation unit to the voice dictionary.

According to the present invention, the recognition rate of voice recognition can be improved.

FIG. 1 is a diagram showing an outline of the dictionary generation method.
FIG. 2 is a schematic diagram of the voice recognition system.
FIG. 3 is a block diagram of the dictionary generation device.
FIG. 4 is a diagram showing a specific example of the voice dictionary.
FIG. 5 is a diagram showing an example of the rule information.
FIG. 6 is a block diagram of the in-vehicle device.
FIG. 7 is a diagram showing a specific example of the in-vehicle voice dictionary.
FIG. 8 is a flowchart showing a processing procedure executed by the dictionary generation device.
FIG. 9A is a flowchart (part 1) showing a processing procedure executed by the in-vehicle device.
FIG. 9B is a flowchart (part 2) showing a processing procedure executed by the in-vehicle device.
FIG. 10 is a diagram showing an example of microphone placement.

Hereinafter, a dictionary generation device, a voice recognition system, and a dictionary generation method according to an embodiment will be described in detail with reference to the accompanying drawings. The present invention is not limited by this embodiment.

First, the dictionary generation method according to the embodiment will be described with reference to FIG. 1. FIG. 1 is a diagram showing an outline of the dictionary generation method. The dictionary generation method is executed by the dictionary generation device 1 shown in FIG. 1.

The dictionary generation device 1 shown in FIG. 1 generates a voice dictionary used when performing voice recognition. In such a voice dictionary, a phoneme string and an acoustic model are associated with each word.

In immigrant nations such as those in Europe and America, pronunciation may differ from speaker to speaker even within the same language (for example, English). Specifically, a native speaker and a non-native speaker may pronounce the same word differently.

Moreover, because a typical voice dictionary is generated for native speakers, the recognition accuracy of voice recognition is lower for non-native speakers.

Therefore, the dictionary generation method according to the embodiment improves the recognition accuracy of voice recognition by adding phoneme strings for non-native speakers to the voice dictionary.

Specifically, as shown in FIG. 1, the dictionary generation device 1 first receives an uttered voice of a word spoken by a non-native speaker with a pronunciation different from the standard pronunciation (step S1), and generates a phoneme string based on the received uttered voice (step S2).

For example, an administrator of the dictionary generation device 1 (hereinafter simply referred to as the "administrator") can collect uttered voices of non-native speakers by having non-native speakers utter specified words, or by collecting utterances of non-native speakers available on the Internet.

The administrator then inputs the voice data of the collected uttered voices into the dictionary generation device 1, whereby the dictionary generation device 1 receives the uttered voices.

A phoneme string is a character string in which phonemes are written as characters. That is, the dictionary generation device 1 performs a process of transcribing the uttered voice into a phoneme string. This makes it easier to obtain a phoneme string than when the administrator transcribes it from the uttered voice by hand.

Thereafter, the dictionary generation device 1 adds the phoneme string to the voice dictionary 41 (step S3). As a result, in the voice dictionary 41, the phoneme strings of non-native speakers are associated with the phoneme string for native speakers.

The example in FIG. 1 shows a case where the phoneme string "ジェイソン" (jeison), written here in kana, is registered in advance as the default (for native speakers) for "Jason" in the voice dictionary 41.

The example also shows a case where phoneme strings such as "ジェイスン" (jeisun), "ヤソン" (yason), and "ハソン" (hason), all written in kana, have been added for "Jason". Therefore, by using the voice dictionary 41, the uttered voices "ジェイスン", "ヤソン", and "ハソン" can each be recognized as "Jason".

In this way, the dictionary generation method according to the embodiment updates the voice dictionary 41 by adding phoneme strings based on non-native uttered voices to the voice dictionary 41. This makes it possible to improve the recognition rate of voice recognition.

In addition, because the phoneme strings are generated on the dictionary generation device 1 side, the voice dictionary 41 can be updated easily.

As described above, a plurality of phoneme strings are registered for one word in the voice dictionary 41. Consequently, similar phoneme strings may be registered for more than one word. For example, when the words "Jason" and "Yason" are both registered in the voice dictionary 41, the phoneme string "ヤソン" (yason), written in kana, may be registered for both "Jason" and "Yason".

In this case, both "Jason" and "Yason" are extracted from the voice dictionary 41 for the uttered voice "ヤソン", so the recognition rate of voice recognition may actually decrease.

For this reason, the recognition rate of voice recognition can also be improved by weighting each phoneme string registered in the voice dictionary 41. The details of this point will be described later.
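The ambiguity described above, and the way weighting could resolve it, can be sketched as follows. This is only a minimal illustration, not the weighting method of the embodiment: the function names and the weight values are assumptions introduced here, and only the "Jason"/"Yason" entries come from the text.

```python
from collections import defaultdict

# Word -> registered phoneme strings (kana), following the Jason/Yason example.
voice_dictionary = {
    "Jason": ["ジェイソン", "ジェイスン", "ヤソン", "ハソン"],
    "Yason": ["ヤソン"],
}

# Hypothetical weights per (word, phoneme string); the values are illustrative only.
weights = {("Jason", "ヤソン"): 0.3, ("Yason", "ヤソン"): 0.7}


def build_reverse_index(dictionary):
    """Map each phoneme string to every word it is registered for."""
    index = defaultdict(list)
    for word, phoneme_strings in dictionary.items():
        for phoneme_string in phoneme_strings:
            index[phoneme_string].append(word)
    return index


def rank_candidates(phoneme_string, index, weights, default_weight=1.0):
    """Return candidate words for a phoneme string, highest weight first."""
    candidates = index.get(phoneme_string, [])
    return sorted(
        candidates,
        key=lambda word: weights.get((word, phoneme_string), default_weight),
        reverse=True,
    )


index = build_reverse_index(voice_dictionary)
ambiguous = index["ヤソン"]                          # both words are extracted
ranked = rank_candidates("ヤソン", index, weights)   # weighting orders them
```

Without weights, "ヤソン" yields two candidates with no way to choose between them; with weights, the candidate list has a defined order.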

Next, the configuration of the voice recognition system S according to the embodiment will be described with reference to FIG. 2. FIG. 2 is a schematic diagram of the voice recognition system S. As shown in FIG. 2, the voice recognition system S includes the dictionary generation device 1 and a plurality of in-vehicle devices 50-1 to 50-n (n is a natural number of 2 or more). The in-vehicle devices 50-1 to 50-n are examples of a voice recognition device. Hereinafter, the in-vehicle devices 50-1 to 50-n are simply referred to as the in-vehicle device 50.

The dictionary generation device 1 and the in-vehicle devices 50 are connected via a network N and can perform data communication with each other. Each in-vehicle device 50 is a voice recognition device equipped with a voice dictionary generated and updated by the dictionary generation device 1, and supports voice operations by a user in the vehicle.

For example, as described later, the user can, by voice operation via the in-vehicle device 50, call up the phone book of a user terminal such as a smartphone or set a destination in a navigation device.

Next, a configuration example of the dictionary generation device 1 will be described with reference to FIG. 3. FIG. 3 is a block diagram of the dictionary generation device 1. As shown in FIG. 3, the dictionary generation device 1 includes a communication unit 2, a control unit 3, and a storage unit 4.

The communication unit 2 is a communication module that performs data communication with each in-vehicle device 50 via the network N. The communication unit 2 can distribute the voice dictionary to each in-vehicle device 50 in accordance with an instruction from the control unit 3, and can acquire a distribution request for the voice dictionary from each in-vehicle device 50.

The control unit 3 includes a reception unit 30, a generation unit 31, an adjustment unit 32, an update unit 33, and a distribution unit 34. The control unit 3 includes, for example, a computer having a CPU (Central Processing Unit), a ROM (Read Only Memory), a RAM (Random Access Memory), an HDD (Hard Disk Drive), and input/output ports, as well as various circuits.

The CPU of the computer functions as the reception unit 30, the generation unit 31, the adjustment unit 32, the update unit 33, and the distribution unit 34 of the control unit 3 by, for example, reading and executing a program stored in the ROM.

Some or all of the reception unit 30, the generation unit 31, the adjustment unit 32, the update unit 33, and the distribution unit 34 of the control unit 3 may also be implemented in hardware such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array).

The storage unit 4 corresponds to, for example, the RAM or the HDD. The RAM or the HDD holds a voice dictionary database 40 and a rule information database 42. The dictionary generation device 1 may also acquire the above-described program and various information via another computer connected over a wired or wireless network, or via a portable recording medium.

The voice dictionary database 40 is a database that stores the above-described voice dictionary 41 as a table. FIG. 4 is a diagram showing a specific example of the voice dictionary 41. As shown in FIG. 4, the voice dictionary 41 is information in which "name", "default", "additions", and "accent" are associated with one another.

"Name" is an example of a word. "Default" indicates the phoneme string for native speakers. "Additions" indicates the phoneme strings for non-native speakers added by the control unit 3.

"Accent" indicates the accent of the speaker who utters each phoneme string. FIG. 4 shows native, Nordic, Germanic, and Spanish as example accents. These accents are added by the control unit 3 based on the rule information described later.
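One way to picture the table of FIG. 4 is as a record per word that holds the default phoneme string plus the added phoneme strings tagged with an accent. The sketch below is an assumption about layout only: the class and function names are hypothetical, and the pairing of each added phoneme string with a particular accent is illustrative, since the text does not state which accent each string belongs to.

```python
from dataclasses import dataclass, field


@dataclass
class DictionaryEntry:
    name: str                   # the word, e.g. a personal name
    default: str                # phoneme string for native speakers
    additions: dict = field(default_factory=dict)  # phoneme string -> accent


def add_phoneme_string(dictionary, name, phoneme_string, accent):
    """Associate a non-native phoneme string with an existing word."""
    entry = dictionary[name]
    if phoneme_string != entry.default:
        # Keep the first accent recorded for a phoneme string.
        entry.additions.setdefault(phoneme_string, accent)
    return entry


voice_dictionary = {"Jason": DictionaryEntry(name="Jason", default="ジェイソン")}

# Accent labels below are assumed for illustration.
add_phoneme_string(voice_dictionary, "Jason", "ヤソン", "Nordic")
add_phoneme_string(voice_dictionary, "Jason", "ハソン", "Spanish")
add_phoneme_string(voice_dictionary, "Jason", "ジェイソン", "native")  # already the default
```

Keeping the default separate from the additions mirrors the "default" and "additions" columns, so the native phoneme string is never overwritten by an update.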

Returning to FIG. 3, the rule information database 42 will be described. The rule information database 42 is a database that stores, as a table, rule information, that is, information on the regularities between phoneme strings based on native utterances and phoneme strings based on non-native utterances.

FIG. 5 is a diagram showing an example of the rule information. As shown in FIG. 5, the regularity information 43 is information in which "regularity", "application target", "native", and "nonnative" are associated with one another.

"Regularity" includes a "classification" indicating the category of the regularity and an "event" corresponding to that classification. "Application target" indicates a specific example of a word to which the regularity applies, and "native" and "nonnative" indicate the native and non-native pronunciations, respectively.

Here, for ease of explanation, "native" and "nonnative" are each shown in kana, but they may also be expressed as phoneme strings.

As shown in FIG. 5, for example, under "pronunciation of J", a native speaker pronounces the sound as ジョ (jo), ジェ (je), or ジュ (ju), whereas a non-native speaker of Spanish background pronounces it as イ (i), ヨ (yo), ホ (ho), or フ (fu).

Accordingly, where a native speaker pronounces "jessica" as "ジェシカ" (jeshika), a non-native speaker of Spanish background pronounces it as "イェシカ" (yeshika).

For "henry", where a native speaker pronounces it as "ヘンリ" (henri), a non-native speaker of French background pronounces it as "アンリ" (anri), and a non-native speaker of Spanish background pronounces it as "エンリ" (enri). Similarly, for "james", where a native speaker pronounces it as "ジェイムス" (jeimusu), a non-native speaker of French background pronounces it as "ジャムス" (jamusu), and a non-native speaker of Spanish background pronounces it as "ハメス" (hamesu). Note that for "henry" and "james", the alphabetic spelling differs from the pronunciation by non-native speakers, so it is difficult to estimate the phoneme string from the spelling.

In this way, the regularity information 43 indicates the regularities of phoneme strings according to the attributes of non-native speakers, and by referring to the regularity information 43 it is possible to identify which kind of non-native speaker a user is. The regularity information 43 may be input to and updated in the dictionary generation device 1 by the administrator, or the dictionary generation device 1 may be provided with a machine learning function so that the regularity information 43 is updated as appropriate on the dictionary generation device 1 side.
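The idea of identifying a speaker's background by consulting the regularity information can be sketched with a simple lookup. The rows below use only the "jessica", "henry", and "james" examples given in the text; the table layout, the field names, and the `identify_accent` function are assumptions made for illustration, and exact string matching stands in for whatever comparison the device actually performs.

```python
# Rows modeled on the examples in the text; the layout itself is hypothetical.
REGULARITY_INFO = [
    {"target": "jessica", "native": "ジェシカ",
     "nonnative": {"Spanish": "イェシカ"}},
    {"target": "henry", "native": "ヘンリ",
     "nonnative": {"French": "アンリ", "Spanish": "エンリ"}},
    {"target": "james", "native": "ジェイムス",
     "nonnative": {"French": "ジャムス", "Spanish": "ハメス"}},
]


def identify_accent(word, pronunciation):
    """Return 'native', an accent label, or None if the table has no match."""
    for row in REGULARITY_INFO:
        if row["target"] != word.lower():
            continue
        if pronunciation == row["native"]:
            return "native"
        for accent, nonnative_pronunciation in row["nonnative"].items():
            if pronunciation == nonnative_pronunciation:
                return accent
    return None
```

For instance, `identify_accent("henry", "アンリ")` returns `"French"`, while a pronunciation not present in the table returns `None`.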

Returning to the description of FIG. 3, the reception unit 30 of the control unit 3 will be described. The reception unit 30 receives an uttered voice of a word spoken with a pronunciation different from the standard pronunciation. That is, the reception unit 30 receives the uttered voice of a word spoken by a non-native speaker as a voice waveform.

As described above, the reception unit 30 can receive the uttered voices collected by the administrator. At this time, the reception unit 30 accepts input of the character string of the word from the administrator and notifies the generation unit 31 of the character string of the word that the uttered voice represents.

The reception unit 30 may also accept input of the native language of the non-native speaker. In this case, the control unit 3 can update the rule information database 42 based on the information on the native language.

Next, the generation unit 31 will be described. The generation unit 31 generates a phoneme string based on the uttered voice received by the reception unit 30. For example, the generation unit 31 has a table indicating the correlation between voice waveforms and the corresponding phoneme strings, and can generate a phoneme string based on the uttered voice by referring to this table. The phoneme string generated by the generation unit 31 is notified to the adjustment unit 32.

The adjustment unit 32 adjusts the phoneme string generated by the generation unit 31 and notifies the update unit 33 of the adjusted phoneme string. For example, the adjustment unit 32 converts the phoneme string into speech and outputs it as reproduced sound to the administrator from a speaker (not shown). The administrator can manually adjust the phoneme string by comparing the reproduced sound with the actual uttered voice. This brings the phoneme string closer to the actual uttered voice.

The adjustment unit 32 can also perform the above processing automatically. That is, the adjustment unit 32 compares the voice waveform based on the phoneme string with the voice waveform of the actual uttered voice, and corrects the phoneme string as appropriate at the portions where the waveforms differ.

The adjustment unit 32 repeats this processing until the voice waveform based on the phoneme string substantially matches the voice waveform of the actual uttered voice, thereby bringing the phoneme string closer to the actual uttered voice. By adjusting the phoneme string in this way, the adjustment unit 32 makes it possible to reproduce the actual uttered voice accurately from the phoneme string. Therefore, the voice recognition rate when using the voice dictionary 41 can be improved.
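The automatic adjustment loop (synthesize, compare, correct the differing portion, repeat until the waveforms substantially match) can be sketched with a deliberately simplified toy model. Everything here is an assumption made for illustration: the "synthesizer" maps each phoneme symbol to a fixed two-sample segment, both waveforms are assumed to be aligned phoneme by phoneme, and Latin symbols stand in for phonemes. A real system would need actual speech synthesis and time alignment.

```python
# Hypothetical toy synthesizer: each phoneme maps to a fixed waveform segment.
TOY_SEGMENTS = {
    "j": [0.9, 0.1], "y": [0.2, 0.8], "h": [0.5, 0.5],
    "a": [0.1, 0.1], "s": [0.7, 0.7], "o": [0.3, 0.6], "n": [0.4, 0.2],
}
SEG_LEN = 2  # samples per phoneme in this toy model


def synthesize(phonemes):
    """Concatenate the toy segments for a phoneme sequence."""
    wave = []
    for phoneme in phonemes:
        wave.extend(TOY_SEGMENTS[phoneme])
    return wave


def segment_distance(a, b):
    """Squared error between two equally long waveform segments."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def adjust(phonemes, actual_wave, tolerance=1e-6, max_rounds=10):
    """Correct one mismatching phoneme per round until the waveforms match."""
    phonemes = list(phonemes)
    for _ in range(max_rounds):
        synth = synthesize(phonemes)
        mismatches = [
            i for i in range(len(phonemes))
            if segment_distance(synth[i * SEG_LEN:(i + 1) * SEG_LEN],
                                actual_wave[i * SEG_LEN:(i + 1) * SEG_LEN]) > tolerance
        ]
        if not mismatches:
            break  # waveforms substantially match
        i = mismatches[0]
        target = actual_wave[i * SEG_LEN:(i + 1) * SEG_LEN]
        # Replace the phoneme with the one whose segment is closest to the target.
        phonemes[i] = min(TOY_SEGMENTS,
                          key=lambda p: segment_distance(TOY_SEGMENTS[p], target))
    return phonemes
```

In this toy setting, starting from the sequence for "jason" and comparing against a waveform synthesized from "yason", the loop replaces only the first phoneme and then stops because no mismatching segment remains.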

The update unit 33 updates the voice dictionary 41 by adding the phoneme string notified from the adjustment unit 32 to the voice dictionary 41. The update unit 33 updates the voice dictionary 41 by associating the phoneme string for non-native speakers with the phoneme string for native speakers.

As a result, the voice dictionary 41 contains the phoneme string for native speakers and the phoneme strings for each kind of non-native speaker. Therefore, by using the voice dictionary 41, the uttered voices of both native speakers and non-native speakers can be recognized. In other words, the recognition rate of voice recognition can be improved.

The distribution unit 34 distributes the voice dictionary 41 to each in-vehicle device 50 based on a distribution request from that in-vehicle device 50. The distribution request includes the text registered in the external devices controlled by the in-vehicle device 50.

The distribution unit 34 extracts the phoneme strings associated with that text from the voice dictionary database 40 and distributes them to the in-vehicle device 50. That is, the distribution unit 34 transmits only the phoneme strings corresponding to the text registered in the external devices. This makes it possible to limit the data size of the in-vehicle voice dictionary 70 (see FIG. 6) held by each in-vehicle device 50.
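The extraction step described above amounts to filtering the voice dictionary by the text contained in the distribution request. A minimal sketch follows; the function name, the dictionary layout, and the sample entries are assumptions for illustration (the "Henry" entry reuses pronunciations from the regularity examples purely as sample data).

```python
def extract_for_request(voice_dictionary, requested_texts):
    """Return only the entries whose word appears in the distribution request."""
    return {
        text: list(voice_dictionary[text])
        for text in requested_texts
        if text in voice_dictionary
    }


# Illustrative server-side dictionary: word -> phoneme strings (kana).
server_dictionary = {
    "Jason": ["ジェイソン", "ジェイスン", "ヤソン", "ハソン"],
    "Henry": ["ヘンリ", "アンリ", "エンリ"],
    "James": ["ジェイムス", "ジャムス", "ハメス"],
}

# Text registered in an external device, e.g. names in a phone book.
request = ["Jason", "Henry", "Alice"]
in_vehicle_dictionary = extract_for_request(server_dictionary, request)
```

Words that the in-vehicle device never needs ("James" here) are not sent, and requested words with no entry ("Alice") are simply skipped in this sketch.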

In this way, the distribution unit 34 can improve the convenience of the voice dictionary 41 by optimizing, for each in-vehicle device 50, the voice dictionary 41 to be distributed.

Next, the in-vehicle device 50 will be described with reference to FIG. 6. FIG. 6 is a block diagram of the in-vehicle device 50. FIG. 6 also shows a user terminal 80 and a navigation device 81. The user terminal 80 and the navigation device 81 are examples of external devices.

The user terminal 80 is, for example, a terminal with a hands-free call function, such as a smartphone owned by the user. For example, the user terminal 80 notifies the in-vehicle device 50 of the text information of the names registered in its phone book and of the operation history of the user terminal 80. The user terminal 80 can also make hands-free calls via the in-vehicle device 50.

The navigation device 81 is a module that provides guidance along the travel route of the vehicle, and notifies the in-vehicle device 50 of text information on the destinations that can be registered. The navigation device 81 can also set a destination based on the user's voice operation via the in-vehicle device 50.

The in-vehicle device 50 includes a control unit 6 and a storage unit 7. The control unit 6 includes an acquisition unit 60, a recognition unit 61, a weighting unit 62, and an execution unit 63. The control unit 6 includes, for example, a computer having a CPU (Central Processing Unit), a ROM (Read Only Memory), a RAM (Random Access Memory), an HDD (Hard Disk Drive), and input/output ports, as well as various circuits.

The CPU of the computer functions as the acquisition unit 60, the recognition unit 61, the weighting unit 62, and the execution unit 63 of the control unit 6 by, for example, reading and executing a program stored in the ROM.

Some or all of the acquisition unit 60, the recognition unit 61, the weighting unit 62, and the execution unit 63 of the control unit 6 may also be implemented in hardware such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array).

The storage unit 7 corresponds to, for example, the RAM or the HDD. The RAM or the HDD stores an in-vehicle voice dictionary 70, identification information 71, and command information 72. The in-vehicle device 50 may also acquire the above-described program and various information via another computer connected over a wired or wireless network, or via a portable recording medium.

The in-vehicle device 50 is connected to a communication device (not shown), connects to the network N (see FIG. 2) via that communication device, and can perform data communication with the dictionary generation device 1.

The acquisition unit 60 acquires the voice dictionary from the dictionary generation device 1. Specifically, the acquisition unit 60 acquires text information, that is, information on the text registered in the user terminal 80 and the navigation device 81 to be controlled, and transmits to the dictionary generation device 1 a distribution request for the voice dictionary corresponding to that text information.

The text here refers to the text of words needed when the user performs a voice operation, such as names registered in the phone book of the user terminal 80 and place names registered in the navigation device 81. Upon acquiring the phoneme strings corresponding to each text from the dictionary generation device 1, the acquisition unit 60 adds them to the storage unit 7 as the in-vehicle voice dictionary 70.

In this way, the acquisition unit 60 acquires phoneme strings based on the text information and adds them to the in-vehicle voice dictionary 70, which limits the data size of the in-vehicle voice dictionary 70.

The acquisition unit 60 can also acquire the operation history of the user terminal 80 from the user terminal 80. The operation history includes the utterance history of the user terminal 80 and the browsing history of the web browser of the user terminal 80. The operation history is notified to the weighting unit 62 described later.

The recognition unit 61 performs voice recognition based on the user's uttered voice input from a microphone (not shown), the in-vehicle voice dictionary 70, and the identification information 71. For example, the recognition unit 61 operates in a first mode, in which voice recognition is performed based on the in-vehicle voice dictionary 70, and a second mode, in which voice recognition is performed based on the identification information 71.

The first mode is a mode for accepting voice operations by the user based on the in-vehicle voice dictionary 70. The user utters a keyword corresponding to an operation command, and when the recognition unit 61 recognizes that keyword, the recognition unit 61 shifts from the second mode to the first mode.

Specifically, when the user wants to call an entry in the phone book, uttering "Call XX" calls the phone book entry corresponding to the word that follows "Call" (the XX above). The recognition unit 61 outputs the word recognized in the first mode to the execution unit 63.

In the first mode, when the recognition unit 61 extracts a plurality of words from a single uttered voice, it can also output each of those words to the weighting unit 62.

The phoneme sequences registered in the in-vehicle voice dictionary 70 are weighted by the weighting unit 62 described later. This makes it possible to suppress misrecognition by the recognition unit 61. A specific example of the in-vehicle voice dictionary 70 will be described later with reference to FIG. 7.

The second mode, on the other hand, is a mode for identifying the user's "accent". That is, in the second mode the recognition unit 61 can identify the user's accent from the user's conversation inside the vehicle.

Specifically, the recognition unit 61 identifies the user's accent by referring to the specifying information 71. The specifying information 71 is information for identifying the accent peculiar to each group of non-native speakers, and is created by an administrator based on the regularity information 43 described above. The recognition unit 61 outputs information on the accent identified in the second mode to the weighting unit 62.

The weighting unit 62 weights each phoneme sequence registered in the in-vehicle voice dictionary 70. Specifically, while the recognition unit 61 is operating in the second mode, the weighting unit 62 weights each phoneme sequence registered in the in-vehicle voice dictionary 70 based on the accent information input from the recognition unit 61.

FIG. 7 is a diagram showing a specific example of the in-vehicle voice dictionary 70. As shown in FIG. 7, the in-vehicle voice dictionary 70 is information in which "name", "phoneme sequence", "accent", "weight", and "valid flag" are associated with one another.

The weighting unit 62 updates the "weight" of each phoneme sequence based on the "accent" information input from the recognition unit 61. The example in FIG. 7 shows a case where the weight for Nordic is "0.7" and the weight for Germanic is "0.3".

In other words, this is a case where the user is highly likely to be a Nordic or Germanic non-native speaker. Therefore, the valid flags of the Nordic and Germanic phoneme sequences are set to "1" to enable those sequences, and the valid flags of all other phoneme sequences are set to "0" to disable them.

As a result, the recognition unit 61 performs voice recognition using only the enabled phoneme sequences among those registered in the in-vehicle voice dictionary 70, which makes it possible to suppress misrecognition.
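The dictionary layout and valid-flag handling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: the class and function names, the example name, its phoneme strings, and the rule "enable an entry when its accent weight is above a threshold" are all hypothetical; only the five fields and the weights 0.7 and 0.3 come from the description.

```python
from dataclasses import dataclass


@dataclass
class DictionaryEntry:
    """One row of the in-vehicle voice dictionary (fields follow FIG. 7)."""
    name: str
    phonemes: str       # illustrative phoneme string, not taken from the source
    accent: str
    weight: float = 0.0
    valid: bool = True


def apply_accent_weights(entries, accent_weights, threshold=0.0):
    """Copy the per-accent weight onto each entry and set its valid flag.

    Assumption: an entry is enabled when its accent weight exceeds `threshold`.
    """
    for entry in entries:
        entry.weight = accent_weights.get(entry.accent, 0.0)
        entry.valid = entry.weight > threshold
    return entries


def active_entries(entries):
    """Return only the entries the recognizer should search."""
    return [entry for entry in entries if entry.valid]


entries = [
    DictionaryEntry("Michael", "m ih k a e l", "Nordic"),
    DictionaryEntry("Michael", "m ih ch a e l", "Germanic"),
    DictionaryEntry("Michael", "m ay k ah l", "Other"),
]
apply_accent_weights(entries, {"Nordic": 0.7, "Germanic": 0.3})
```

With these weights, `active_entries(entries)` keeps the Nordic and Germanic rows and skips the third one, which mirrors the valid flags "1" and "0" in the description.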

In addition, when the recognition unit 61 recognizes a plurality of words for a single uttered voice, the weighting unit 62 can display those words on a display screen (not shown) and perform weighting based on the user's selection operation.

That is, by weighting according to the user's selection operation, the weighting unit 62 can weight each phoneme sequence simply and accurately. By weighting each phoneme sequence in this way, the weighting unit 62 can make the in-vehicle voice dictionary 70 the voice dictionary best suited to each speaker. This improves recognition accuracy while suppressing misrecognition.

Note that the weighting unit 62 can also delete phoneme sequences whose valid flag is off from the in-vehicle voice dictionary 70. Furthermore, once the user's "accent" has been identified, it is also possible to acquire only the phoneme sequences corresponding to that accent from the dictionary generation device 1.

This keeps the data size of the in-vehicle voice dictionary 70 small. Acquiring only the phoneme sequences corresponding to that accent from the dictionary generation device 1 also reduces the amount of data communicated.

The weighting unit 62 can also perform weighting based on the operation history of the user terminal 80. In this case, the weighting unit 62 estimates the user's native language based on the country calling code contained in telephone numbers in the utterance history and the country code contained in domains in the web browser history. The weighting unit 62 then estimates the "accent" described above from that native language and weights each phoneme sequence accordingly.
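The history-based estimate above can be sketched as a simple vote over country codes. The lookup tables, the function name, and the majority-vote rule are assumptions made for illustration; the description only states that country calling codes and country-code domains are used. The three calling codes and three country-code top-level domains shown are real assignments, but the table is deliberately tiny.

```python
from collections import Counter
from urllib.parse import urlparse

# Small illustrative tables (assumption: a real system would hold full tables).
CALLING_CODE_TO_COUNTRY = {"+46": "SE", "+49": "DE", "+81": "JP"}
CCTLD_TO_COUNTRY = {"se": "SE", "de": "DE", "jp": "JP"}


def estimate_country(phone_numbers, urls):
    """Return the country seen most often in the histories, or None."""
    votes = Counter()
    for number in phone_numbers:
        for code, country in CALLING_CODE_TO_COUNTRY.items():
            if number.startswith(code):
                votes[country] += 1
                break
    for url in urls:
        host = urlparse(url).hostname or ""
        tld = host.rsplit(".", 1)[-1].lower()
        if tld in CCTLD_TO_COUNTRY:
            votes[CCTLD_TO_COUNTRY[tld]] += 1
    return votes.most_common(1)[0][0] if votes else None
```

The returned country would then be mapped to a native language and an accent before weighting; that mapping is left out here because the description does not specify it.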

Returning to FIG. 6, the execution unit 63 will now be described. In the first mode described above, the execution unit 63 converts the word recognized by the recognition unit 61 into a command and controls the user terminal 80 or the navigation device 81.

The command information 72 is information on the commands used to control the user terminal 80 and the navigation device 81. The execution unit 63 converts a word into a command by referring to the command information 72.

The execution unit 63 then outputs the command to the user terminal 80 or the navigation device 81, which performs the operation specified by the command. In other words, the user terminal 80 and the navigation device 81 can accept voice operations from the user.
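The word-to-command conversion performed by the execution unit 63 can be sketched as a table lookup. This is only an illustration: the table contents, the target and operation names, and the "navigate" keyword are hypothetical; the description confirms only that a keyword such as "Call" is followed by the word to act on and that the command information 72 is consulted.

```python
# Hypothetical stand-in for the command information 72:
# keyword -> (target device, operation name).
COMMAND_INFO = {
    "call": ("user_terminal", "DIAL"),
    "navigate": ("navigation_device", "SET_DESTINATION"),
}


def to_command(recognized_text):
    """Convert recognized text such as 'Call Anna' into a command dict.

    Returns None when the leading keyword is not in COMMAND_INFO.
    """
    keyword, _, argument = recognized_text.partition(" ")
    entry = COMMAND_INFO.get(keyword.lower())
    if entry is None:
        return None
    target, operation = entry
    return {"target": target, "operation": operation, "argument": argument}
```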

Next, the processing procedure executed by the dictionary generation device 1 will be described with reference to FIG. 8. FIG. 8 is a flowchart showing the processing procedure executed by the dictionary generation device 1.

As shown in FIG. 8, the dictionary generation device 1 first determines whether it has received an uttered voice of a word from a non-native speaker (step S101). If it has (step S101, Yes), it generates a phoneme sequence for the uttered voice (step S102).

The dictionary generation device 1 then adds the phoneme sequence to the voice dictionary 41 (step S103) and ends the processing. If no uttered voice has been received in step S101 (step S101, No), the dictionary generation device 1 ends the processing without doing anything further.
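The flow of FIG. 8 can be sketched as follows. The function name, the dictionary representation (word mapped to a list of phoneme sequences), and the `generate_phonemes` callback are assumptions for illustration; the description does not say how the phoneme sequence is generated. Skipping a sequence that is already registered is also an addition of this sketch.

```python
def update_voice_dictionary(voice_dictionary, word, utterance, generate_phonemes):
    """Steps S101 to S103: add a phoneme sequence for a non-native utterance.

    Returns True when the utterance was processed, False when none was received.
    """
    # S101: was an uttered voice received?
    if utterance is None:
        return False
    # S102: generate the phoneme sequence (the generator is supplied by the caller).
    phonemes = generate_phonemes(utterance)
    # S103: register the sequence as an additional pronunciation of the word.
    variants = voice_dictionary.setdefault(word, [])
    if phonemes not in variants:
        variants.append(phonemes)
    return True
```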

Next, the processing procedure executed by the in-vehicle device 50 will be described with reference to FIGS. 9A and 9B. FIGS. 9A and 9B are flowcharts showing the processing procedure executed by the in-vehicle device 50. FIG. 9A shows the processing procedure of the in-vehicle device 50 in the first mode, and FIG. 9B shows the processing procedure of the in-vehicle device 50 in the second mode.

As shown in FIG. 9A, the in-vehicle device 50 first determines whether it has received a voice input (step S201). If it has (step S201, Yes), it determines whether exactly one word candidate was recognized based on the in-vehicle voice dictionary 70 (step S202).

If there is not exactly one candidate (step S202, No), the in-vehicle device 50 presents the two or more word candidates to the user (step S203). The in-vehicle device 50 then receives the user's selection operation on the candidates (step S204) and weights the phoneme sequences based on the selection operation (step S205).

The in-vehicle device 50 then updates the in-vehicle voice dictionary 70 (step S206), outputs a command based on the selection operation (step S207), and ends the processing. If there was exactly one word candidate in step S202 (step S202, Yes), the in-vehicle device 50 proceeds directly to step S207. If no voice input has been received (step S201, No), the in-vehicle device 50 ends the processing.
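The first-mode flow of FIG. 9A can be sketched as below. Everything named here is hypothetical: the function, the `choose` callback standing in for the on-screen selection, the fixed weight increment, and the "CALL" command string. For brevity the weights are keyed by candidate word rather than by phoneme sequence, and an empty candidate list is treated like a missing input.

```python
def handle_first_mode(candidates, weights, choose, step=0.1):
    """Steps S201 to S207 for one utterance.

    candidates: recognized word candidates, or None when no voice input arrived.
    weights:    mutable dict of per-candidate weights (updated in place).
    choose:     callback that returns the candidate the user selected.
    """
    # S201: no voice input (or nothing recognized) ends the processing.
    if not candidates:
        return None
    if len(candidates) == 1:
        # S202, Yes: a single candidate goes straight to S207.
        selected = candidates[0]
    else:
        # S203, S204: present the candidates and receive the selection.
        selected = choose(candidates)
        # S205, S206: weight the selected candidate and update the dictionary.
        weights[selected] = weights.get(selected, 0.0) + step
    # S207: output the command.
    return f"CALL {selected}"
```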

Next, the processing procedure of the in-vehicle device 50 in the second mode will be described with reference to FIG. 9B. As shown in FIG. 9B, the in-vehicle device 50 determines whether it has received a voice input (step S211). If it has (step S211, Yes), it determines whether the accent could be identified based on the specifying information 71 (step S212).

If the accent could be identified (step S212, Yes), the in-vehicle device 50 weights the phoneme sequences based on that accent (step S213), updates the in-vehicle voice dictionary 70 (step S214), and ends the processing.

If no voice input has been received (step S211, No), or if the accent could not be identified (step S212, No), the in-vehicle device 50 ends the processing without doing anything further.
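The second-mode flow of FIG. 9B can be sketched in the same style. The description does not define the contents of the specifying information 71, so this sketch assumes it maps each accent to a set of characteristic sound substitutions and picks the accent with the largest overlap; the feature labels, the function name, and the weight increment are all hypothetical.

```python
def handle_second_mode(observed_features, specifying_info, weights, step=0.1):
    """Steps S211 to S214: identify the accent and raise its weight.

    observed_features: set of features heard in the conversation (assumed format).
    specifying_info:   dict accent -> set of characteristic features (assumed).
    weights:           mutable dict of per-accent weights (updated in place).
    Returns the identified accent, or None when nothing could be identified.
    """
    # S211: no voice input ends the processing.
    if not observed_features:
        return None
    # S212: score each accent by how many of its features were observed.
    scores = {
        accent: len(features & observed_features)
        for accent, features in specifying_info.items()
    }
    best = max(scores, key=scores.get, default=None)
    if best is None or scores[best] == 0:
        return None
    # S213, S214: weight the accent, capped at 1.0, and update the dictionary.
    weights[best] = min(1.0, weights.get(best, 0.0) + step)
    return best
```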

As described above, the dictionary generation device 1 according to the embodiment includes the receiving unit 30, the generation unit 31, and the update unit 33. The receiving unit 30 receives an uttered voice of a word with a pronunciation different from the standard pronunciation. The generation unit 31 generates a phoneme sequence based on the uttered voice received by the receiving unit 30.

The update unit 33 updates the voice dictionary 41 by adding the phoneme sequence generated by the generation unit 31 to the voice dictionary 41. The dictionary generation device 1 according to the embodiment can therefore improve the recognition rate of voice recognition.

Incidentally, a vehicle may carry more than one user. For this reason, the in-vehicle device 50 can also perform the weighting described above for each user by identifying each user's seating position.

FIG. 10 is a diagram showing an example of microphone placement. As shown in FIG. 10, a plurality of microphones M1 to M4 are mounted in the vehicle C, and each of the microphones M1 to M4 is connected to the in-vehicle device 50.

The microphone M1 is provided in front of the driver's seat and detects the driver's voice. The microphone M2 is provided in front of the front passenger seat and detects the voice of the user sitting there. The microphones M3 and M4 are provided on the side walls next to the rear seats and each detect the voice of a user sitting in the rear seats.

The in-vehicle device 50 can identify the speaker based on the detection results of the microphones M1 to M4. That is, when the driver speaks, the voice input from the microphone M1 is louder than the voice input from the other microphones M2 to M4.

In that case, the in-vehicle device 50 can recognize that the speaker is the driver. The in-vehicle device 50 can then identify the accent of each speaker and weight the phoneme sequences in the in-vehicle voice dictionary 70 accordingly.

That is, in this case the in-vehicle voice dictionary 70 is weighted separately for each of the microphones M1 to M4. When each user performs voice input, the in-vehicle device 50 can improve the voice recognition rate by using the in-vehicle voice dictionary 70 tuned for the corresponding microphone M1 to M4.
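The loudness-based speaker identification described above can be sketched as follows. Using the root mean square of each microphone's samples as the loudness measure, and representing the input as a dict of sample lists, are assumptions of this sketch; the description only says that the speaker's own microphone picks up a relatively louder voice.

```python
import math


def rms(samples):
    """Root-mean-square level of one microphone's samples (0.0 when empty)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def identify_speaker_mic(mic_samples):
    """Return the id of the microphone with the loudest input.

    mic_samples: dict such as {"M1": [...], "M2": [...], ...} (assumed format).
    """
    return max(mic_samples, key=lambda mic: rms(mic_samples[mic]))


frames = {
    "M1": [0.5, -0.6, 0.4],
    "M2": [0.1, -0.1, 0.1],
    "M3": [0.05, 0.02, -0.03],
    "M4": [0.0, 0.01, -0.01],
}
speaker_mic = identify_speaker_mic(frames)
```

The returned microphone id could then be used as the key into a per-microphone set of accent weights, so that each seat gets its own tuned dictionary.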

Although the case where the in-vehicle device 50 identifies the speaker based on the volume at each of the microphones M1 to M4 has been described here, the present invention is not limited to this. For example, the speaker may be identified by analyzing each user's mouth in an image captured by a camera that images the vehicle interior.

In the embodiment described above, the in-vehicle device 50 was used as an example of the voice recognition device, but the present invention is not limited to this. The voice recognition device can be applied to various devices that perform voice recognition, such as smartphones, tablet terminals, personal computers, and home appliances.

In the embodiment described above, the case where the voice dictionary 41 is used for voice recognition has been described, but the present invention is not limited to this. The phoneme sequences registered in the voice dictionary 41 can also be applied to text-to-speech. In this case, text can be read aloud in an accent that is easy for the particular user to understand.

Further effects and modifications can be readily derived by those skilled in the art. The broader aspects of the present invention are therefore not limited to the specific details and representative embodiments shown and described above. Accordingly, various modifications are possible without departing from the spirit or scope of the general inventive concept as defined by the appended claims and their equivalents.

Reference Signs List

1 Dictionary generation device
30 Receiving unit
31 Generation unit
32 Adjustment unit
33 Update unit
34 Distribution unit
41 Voice dictionary
43 Regularity information
50 In-vehicle device
60 Acquisition unit
61 Recognition unit
62 Weighting unit
63 Execution unit
70 In-vehicle voice dictionary
71 Specifying information
72 Command information
S Voice recognition system

Claims

1. A dictionary generation device comprising:
a receiving unit that receives an uttered voice of a word with a pronunciation different from the standard pronunciation;
a generation unit that generates a phoneme sequence based on the uttered voice received by the receiving unit; and
an update unit that updates a voice dictionary by adding the phoneme sequence generated by the generation unit to the voice dictionary.

2. A voice recognition system comprising: the dictionary generation device according to claim 1; and a voice recognition device that performs voice recognition based on the voice dictionary updated by the dictionary generation device.

3. The voice recognition system according to claim 2, wherein the voice recognition device includes an acquisition unit that acquires the phoneme sequence corresponding to a word registered in an external device.

4. The voice recognition system according to claim 2, wherein the voice recognition device includes a weighting unit that weights each phoneme sequence registered in the voice dictionary based on rule information indicating the regularity of the phoneme sequences for each of the different pronunciations.

5. The voice recognition system according to claim 4, wherein
the voice recognition device includes a recognition unit that recognizes a word registered in the voice dictionary based on a voice input, and
the weighting unit, when a plurality of words are extracted by the recognition unit, presents the plurality of words to a user and performs the weighting based on the user's selection operation on the plurality of words.

6. The voice recognition system according to claim 4, wherein the weighting unit selects the regularity to be applied to the user based on a conversation voice of the user and increases the weight of the selected regularity.

7. A dictionary generation method comprising:
a receiving step of receiving an uttered voice of a word with a pronunciation different from the standard pronunciation;
a generation step of generating a phoneme sequence based on the uttered voice received in the receiving step; and
an updating step of updating a voice dictionary by adding the phoneme sequence generated in the generation step to the voice dictionary.