JP2021064110A

JP2021064110A - Voice authentication device, voice authentication system and voice authentication method

Info

Publication number: JP2021064110A
Application number: JP2019187784A
Authority: JP
Inventors: Ayumi Kokubu; Yuichi Fujita; Shigenobu Nishida; Shugo Saito; Yasutaka Urakawa
Original assignee: Glory Ltd; Fuetrek Co Ltd
Current assignee: Glory Ltd; Fuetrek Co Ltd
Priority date: 2019-10-11
Filing date: 2019-10-11
Publication date: 2021-04-22
Anticipated expiration: 2039-10-11
Also published as: JP7339116B2

Abstract

PROBLEM TO BE SOLVED: To provide a voice authentication device, a voice authentication system, and a voice authentication method that ensure high security at the time of authentication while reducing the burden on the user at the time of registration. SOLUTION: The device includes an authentication character string generation unit 23 that generates a second utterance character string containing characters included in a first utterance character string; a voiceprint data generation unit 25 that extracts voiceprint data corresponding to the second utterance character string from first voice data generated when the user speaks the first utterance character string; and an authentication unit 26 that authenticates the user using the voiceprint data and second voice data generated when the user speaks the second utterance character string. SELECTED DRAWING: Fig. 3

Description

The present invention relates to a voice authentication device, a voice authentication system, and a voice authentication method that perform authentication by voice.

Voice authentication technology, which authenticates a user using the voice the user speaks, is in widespread use. It performs authentication by matching a user's voice registered in advance against the user's voice newly acquired at the time of authentication.

Typical matching schemes in voice authentication technology include the password scheme and the free-word scheme. In the password scheme, voice data obtained when the user speaks a predetermined password is registered in advance, and authentication is performed by matching that voice data against voice data obtained when the user speaks the password again at the time of authentication. In the free-word scheme, authentication is performed by using the voice data registered in advance and voice data obtained when the user speaks arbitrary content to determine whether the characteristics of the user's voice match.

In such voice authentication technology, if, for example, the voice data from registration is stolen, an impostor may be authenticated fraudulently. To avoid this, it is conceivable to register in advance a plurality of pieces of voice data obtained when the user speaks a plurality of passwords, and to perform authentication using one or several of those passwords.

However, speaking a plurality of passwords to register voice data takes a great deal of time and places a heavy burden on the user. There is therefore a demand for an authentication technology that reduces the burden on the user at the time of registration and is highly secure.

Patent Document 1 discloses a technique in which, instead of registering voice data obtained when the user speaks a password, voiceprint data is registered for each element of the sounds that make up the password, and authentication is performed by comparing the voice data obtained at the time of authentication with the voiceprint data element by element.

Patent Document 1: Japanese Unexamined Patent Publication No. 2005-128307

In the technique disclosed in Patent Document 1, voiceprint data must be registered in advance for every sound element that can appear in a password set at the time of authentication. This allows the technique to have the user speak a different password at each authentication. Fraudulent authentication can therefore be prevented even if the voice data from registration is stolen, but because voiceprint data is registered for every sound element, the heavy burden placed on the user at the time of registration is not improved.

An object of the present invention is to provide a voice authentication device, a voice authentication system, and a voice authentication method that ensure high security at the time of authentication while reducing the burden on the user at the time of registration.

The voice authentication device of the present invention includes: an authentication character string generation unit that generates an authentication character string containing at least one of the characters included in a predetermined character string; a voiceprint data generation unit that generates, from first voice data generated when a user speaks the predetermined character string, first voiceprint data for the portion corresponding to the authentication character string, and that generates second voiceprint data based on second voice data generated when the user speaks the authentication character string; and an authentication unit that authenticates the user by matching the first voiceprint data against the second voiceprint data.

The voice authentication system of the present invention includes: a voice data generation unit that generates voice data based on a user's utterance; a presentation unit that presents to the user a character string the user is asked to speak; an authentication character string generation unit that generates an authentication character string containing at least one of the characters included in a predetermined character string; a voiceprint data generation unit that generates, from first voice data generated when the user speaks the predetermined character string, first voiceprint data for the portion corresponding to the authentication character string, and that generates second voiceprint data based on second voice data generated when the user speaks the authentication character string; and an authentication unit that authenticates the user by matching the first voiceprint data against the second voiceprint data.

The voice authentication method of the present invention generates an authentication character string containing at least one of the characters included in a predetermined character string; generates, from first voice data generated when a user speaks the predetermined character string, first voiceprint data for the portion corresponding to the authentication character string; generates second voiceprint data based on second voice data generated when the user speaks the authentication character string; and authenticates the user using the first voiceprint data and the second voiceprint data.
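
The steps of the method can be pictured with a minimal Python sketch. Everything concrete in it is an assumption made for illustration: the function names `mean_vector`, `cosine_similarity`, and `authenticate`, the use of a mean feature vector as stand-in voiceprint data, the cosine-similarity comparison, and the 0.9 threshold are not taken from the specification, which does not fix how voiceprint data is computed or matched.

```python
import math

def mean_vector(frames):
    """Average equal-length feature vectors (a stand-in for voiceprint generation)."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def authenticate(first_segments, second_segments, auth_strings, threshold=0.9):
    """Compare voiceprints only for the authentication character strings.

    first_segments / second_segments: dict mapping a character string to the
    feature frames cut out of the first / second voice data (hypothetical layout).
    """
    for text in auth_strings:
        if text not in first_segments or text not in second_segments:
            return False
        first_voiceprint = mean_vector(first_segments[text])
        second_voiceprint = mean_vector(second_segments[text])
        if cosine_similarity(first_voiceprint, second_voiceprint) < threshold:
            return False
    return True
```

Only the portions of the registered voice data that correspond to the authentication character string are consulted, so a single registration utterance can back many different challenges.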

It is possible to ensure high security at the time of authentication and to reduce the burden on the user at the time of registration.

FIG. 1: Diagram for explaining the configuration of the voice authentication system
FIG. 2: Diagram for explaining the configuration of the terminal device
FIG. 3: Diagram for explaining the configuration of the voice authentication device
FIG. 4: Diagram for explaining the configuration of the voice recognition unit
FIG. 5: Flowchart for explaining the voice registration process for registering the user's voice in the voice authentication system
FIG. 6: Diagram showing the relationship between the service the user wants to receive and the security level
FIG. 7: Flowchart for explaining the voice authentication process that performs voice authentication using pre-registered voice

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. However, unnecessarily detailed description, such as detailed description of well-known matters or duplicate description of substantially identical configurations, may be omitted.

The following description and the referenced drawings are provided to help those skilled in the art understand the present invention, and are not intended to limit the scope of the claims.

<Configuration of voice authentication system 1>
First, the configuration of the voice authentication system 1 according to the embodiment of the present invention will be described with reference to FIG. 1. As shown in FIG. 1, the voice authentication system 1 includes a terminal device 10, a voice authentication device 20, and a network 30.

The terminal device 10 is a device that accepts voice input from a user of the voice authentication system 1 (hereinafter simply referred to as the user) and generates voice data. Examples of the terminal device 10 include mobile terminal devices such as tablet terminals and smartphones, and computers such as a PC (Personal Computer) installed at a specific location.

In the present embodiment, the terminal device 10 is installed at, for example, the place where authentication for receiving a desired service is performed when the user wants to receive that service, and the place where the voice data required for authentication is registered. Before receiving the desired service, the user registers the voice data required for voice authentication through the terminal device 10. The user can then receive the desired service when the voice authentication device 20 succeeds in voice authentication using the registered voice data and voice data newly generated at the time of authentication.

In the present embodiment, voice authentication means performing authentication using voice data. In the following description, registering in advance the voice data used for voice authentication is referred to as voice registration. In voice authentication, the success or failure of authentication is determined by matching the voice data registered through voice registration against voice data newly generated at the time of authentication. Details of the voice registration process and the voice authentication process in the voice authentication system 1 will be described later.

The desired service is not particularly limited in the present invention. Examples of the desired service include admission to a specific facility (such as a members-only store), payment of charges, and remittance to other users.

For example, when the user wishes to enter a members-only sports club, the user performs voice registration in advance using the terminal device 10 installed at the entrance of the sports club. The user then has the voice authentication system 1 perform voice authentication via the terminal device 10. If authentication succeeds, the user is allowed to enter the sports club.

Likewise, when the user wishes to pay for a product at the time of purchase, the user performs voice registration in advance using the terminal device 10 installed at the cash register of the store where the product is to be purchased. The user then has the voice authentication system 1 perform voice authentication via the terminal device 10. If authentication succeeds, the charge is settled using the user's payment account registered in advance.

Returning to the description of FIG. 1, the voice authentication device 20 is connected to the terminal device 10 via the network 30, receives voice data from the terminal device 10, and performs voice registration and voice authentication using that voice data. An example of the voice authentication device 20 is a cloud server.

The network 30 is, for example, a public network such as the Internet.

<Configuration of terminal device 10>
Next, the configuration of the terminal device 10 will be described with reference to FIG. 2. As shown in FIG. 2, the terminal device 10 includes a communication unit 11, a voice data generation unit 12, a presentation unit 13, and an input unit 14.

The communication unit 11 transmits and receives various data to and from the voice authentication device 20 via the network 30.

The voice data generation unit 12 accepts voice input from a user speaking toward the terminal device 10 and generates voice data based on that input. The voice data generation unit 12 is, for example, a device with a built-in microphone. The method by which the voice data generation unit 12 generates voice data is not particularly limited in the present invention; for example, a method of generating voice data by A/D conversion of the audio signal input from the microphone may be adopted. The voice data generation unit 12 may further encode the voice data into a compressed format.

In the following description, data obtained by A/D conversion of an audio signal, data obtained by encoding the A/D-converted data, and data obtained through the various processes described later (noise reduction, feature extraction, and phoneme conversion) are collectively referred to as voice data. Among voice data, data from which an individual can be identified is referred to as voiceprint data. Because voice authentication generally uses voiceprint data, voice authentication and voiceprint authentication are treated as substantially synonymous in the following description.

When the terminal device 10 performs voice registration or voice authentication of the user, the presentation unit 13 presents to the user the character string the user is to speak (the first utterance character string or the second utterance character string, both described later). The presentation unit 13 is, for example, a display, in which case it presents the character string by displaying it. Alternatively, the presentation unit 13 is, for example, a speaker, in which case it presents the character string by voice. In the present invention, the presentation unit 13 is not limited to a display or a speaker, and may be a device that presents the character string in braille, for example. The presentation unit 13 may also combine a display and a speaker, or other devices.

In the present invention, a character string means a sequence of characters. A character string may be a sequence of characters with no meaning, but it is preferably a sequence that is meaningful as a whole (a word or the like). A character means a symbol or the like used as a means of conveying language. The kind of symbol is not particularly limited, but it is preferable that one symbol represent one sound (a so-called syllabic character). The following description covers the case where kana or Arabic numerals are used as the characters.

The presentation unit 13 may present only one character string or a plurality of character strings. When the presentation unit 13 presents a plurality of character strings, the user may be allowed to select which of them to speak. The presentation unit 13 may present the plurality of character strings in an order that depends on the security level described later, or in a random order.

The input unit 14 accepts input operations by the user other than voice. The input unit 14 is, for example, an operation device such as a touch panel, a keyboard, a mouse, or a trackball.

With this configuration, the terminal device 10 presents the character string the user is to speak at the time of voice authentication and, when the user speaks the character string, acquires that speech as voice data.

The character string presented by the presentation unit 13 is generated by the first utterance character string generation unit 22 or the authentication character string generation unit 23 of the voice authentication device 20, both described later. Details of the operation of the first utterance character string generation unit 22 and the authentication character string generation unit 23 will be described later.

<Configuration of voice authentication device 20>
Next, the configuration of the voice authentication device 20 will be described with reference to FIG. 3. As shown in FIG. 3, the voice authentication device 20 includes a communication unit 21, a first utterance character string generation unit 22, an authentication character string generation unit 23, a voice recognition unit 24, a voiceprint data generation unit 25, an authentication unit 26, a service provision unit 27, a user data acquisition unit 28, and a storage unit 29.

The communication unit 21 transmits and receives various data to and from the terminal device 10 via the network 30.

The first utterance character string generation unit 22 generates the first utterance character string, which is the character string the user is to speak during the voice registration process of the voice authentication system 1. The method by which the first utterance character string generation unit 22 generates the first utterance character string is not particularly limited, but it is desirable to generate a character string that is easy for the user to speak.

For example, the first utterance character string generation unit 22 may generate the character string by referring to a corpus or user-specific dictionary data stored in the storage unit 29 described later. Alternatively, the first utterance character string generation unit 22 may generate the first utterance character string so that it includes at least part of the personal information about the user obtained by the user data acquisition unit 28 described later, or a word associated with that personal information.

The authentication character string generation unit 23 generates the second utterance character string, which is the character string the user is to speak during the voice authentication process of the voice authentication system 1. The authentication character string generation unit 23 first generates authentication character strings, each containing some of the characters included in the first utterance character string generated by the first utterance character string generation unit 22, and then combines the authentication character strings as appropriate to generate the second utterance character string.
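
As a rough illustration of this two-step generation, the following Python sketch picks some units of the first utterance character string as authentication character strings and joins them into a second utterance character string. The function name `generate_second_utterance`, the assumption that the first utterance character string is already split into units such as words, and the use of random selection are all hypothetical; the specification gives its own concrete examples later.

```python
import random

def generate_second_utterance(first_units, count, rng=None):
    """Choose `count` distinct units as authentication character strings
    and concatenate them into a second utterance character string.

    first_units: units (for example, words) of the first utterance string.
    rng: optional random.Random instance, useful for reproducible runs.
    """
    if rng is None:
        rng = random.Random()
    if not 1 <= count <= len(first_units):
        raise ValueError("count must be between 1 and the number of units")
    auth_strings = rng.sample(first_units, count)
    return auth_strings, "".join(auth_strings)
```

Because the selection changes from one call to the next, a recording of an earlier challenge is unlikely to match a later one, which is the motivation stated in the background section.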

The character strings generated by the first utterance character string generation unit 22 and the authentication character string generation unit 23 will be described in detail later with specific examples.

The voice recognition unit 24 converts the voice data that the communication unit 21 receives from the terminal device 10 into a character string. It also cuts out the voice data for each predetermined character string and outputs it in association with that character string. The voice recognition unit 24 then associates the cut-out voice data, the character string, and the time information of the character string (start time and end time) with one another, and stores the associated information in the storage unit 29.
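
The association between cut-out voice data, character string, and time information could be held in a record like the one sketched below. The `Segment` class, the `cut_segments` function, and the representation of voice data as a list of samples at a fixed sample rate are assumptions for illustration, not structures named in the specification.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    text: str       # recognized character string
    start: float    # start time in seconds
    end: float      # end time in seconds
    samples: list   # voice data cut out for this character string

def cut_segments(samples, sample_rate, timed_strings):
    """Cut voice data into per-string segments.

    timed_strings: iterable of (text, start_seconds, end_seconds),
    as a recognizer might report them.
    """
    segments = []
    for text, start, end in timed_strings:
        first = int(round(start * sample_rate))
        last = int(round(end * sample_rate))
        segments.append(Segment(text, start, end, samples[first:last]))
    return segments
```

Keeping the time information alongside each segment makes it possible to go back to the original recording later and extract exactly the portion that corresponds to a given character string.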

The voice recognition unit 24 may cut out the voice data for each character included in the character string converted from the voice data, but it is preferable to cut out the voice data for each word or each phrase included in the converted character string. The character string converted by the voice recognition unit 24 and the voice data it cuts out will be described in detail later with specific examples.

During the voice registration process of the voice authentication system 1, the voice recognition unit 24 performs the above operations (conversion to a character string, cutting out the voice data corresponding to each character string, and association with the time information of the character string) on the voice data obtained when the user actually speaks the first utterance character string. In the following description, the voice data obtained when the user actually speaks the first utterance character string is referred to as the first voice data.

During the voice authentication process of the voice authentication system 1, the voice recognition unit 24 performs the above operations on the voice data obtained when the user actually speaks the second utterance character string. In the following description, the voice data obtained when the user actually speaks the second utterance character string is referred to as the second voice data. In addition, during the voice authentication process, the voice recognition unit 24 removes from the user's voice data the portion corresponding to the additional character string described later, and outputs only the voice data and time information of the remaining portion as the result of voice recognition.

The voice recognition unit 24 will be described with reference to FIG. 4. As shown in FIG. 4, the voice recognition unit 24 includes a noise reduction unit 241, an utterance detection unit 242, a feature extraction unit 243, a phoneme conversion unit 244, a character string conversion unit 245, an acoustic model 246, a language model 247, and an additional character string removal unit 248.

The noise reduction unit 241 reduces noise contained in the voice data (such as noise from the user's surroundings). The noise reduction method used by the noise reduction unit 241 is not particularly limited in the present invention, and a known noise reduction method can be adopted.

The utterance detection unit 242 cuts out, from the input voice data, the portion in which the user is speaking.

The feature extraction unit 243 extracts, for example, MFCCs (Mel-Frequency Cepstrum Coefficients) as features of the voice data. A known technique may be adopted for feature extraction, and the extraction method is not particularly limited in the present invention.
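
MFCC extraction relies on a filter bank whose bands are spaced evenly on the mel scale. The Python sketch below shows only that frequency mapping, using one widely used convention, mel = 2595 log10(1 + f / 700). The specification does not state which mel formula, frequency range, or number of bands is used, so the function names and the parameters here are assumptions.

```python
import math

def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (2595 * log10(1 + f / 700) convention)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(low_hz, high_hz, n_bands):
    """Return n_bands + 2 edge frequencies equally spaced on the mel scale."""
    low, high = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high - low) / (n_bands + 1)
    return [mel_to_hz(low + i * step) for i in range(n_bands + 2)]
```

The edges come out densely packed at low frequencies and sparse at high frequencies, which roughly mirrors how human hearing resolves pitch.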

The phoneme conversion unit 244 converts the extracted features into phonemes. A phoneme means the smallest unit of speech in a language and, in the present embodiment, denotes one sound corresponding to one character. As a specific example, a phoneme is the voice data (speech waveform) of the portion corresponding to "a" (あ), extracted from the voice data obtained when the user actually speaks the character string "aiueo" (あいうえお). The phoneme conversion unit 244 converts one piece of voice data into a collection of phonemes.

The method by which the phoneme conversion unit 244 converts features into phonemes is not particularly limited in the present invention, but the following method can be adopted. For example, the phoneme conversion unit 244 converts features into phonemes by a statistical method using the acoustic model 246. The acoustic model 246 is a model, built in advance, of the probability distribution relating features to phonemes. The phoneme conversion unit 244 compares (matches) a given feature against the acoustic model and outputs the phoneme with the highest likelihood as the phoneme corresponding to that feature. For example, the phoneme conversion unit 244 may add the most useful features one at a time so as to raise the likelihood, repeating until the likelihood no longer changes.
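
Selecting the phoneme with the highest likelihood can be sketched as follows. The toy acoustic model here, one diagonal Gaussian per phoneme stored as a (mean, variance) pair, and the names `log_gaussian` and `most_likely_phoneme` are assumptions chosen to keep the example short; the specification does not describe the internal form of the acoustic model 246.

```python
import math

def log_gaussian(x, mean, var):
    """Log-density of a diagonal Gaussian evaluated at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def most_likely_phoneme(feature, acoustic_model):
    """Return the phoneme whose model assigns the feature the highest likelihood.

    acoustic_model: dict mapping phoneme -> (mean vector, variance vector).
    """
    return max(
        acoustic_model,
        key=lambda phoneme: log_gaussian(feature, *acoustic_model[phoneme]),
    )
```

Working in log-likelihoods avoids numerical underflow when many feature dimensions are multiplied together, and it does not change which phoneme wins.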

文字列変換部２４５は、音素を文字列に変換する。すなわち、文字列変換部２４５は、入力された音声データの内容を文字列として出力する。 The character string conversion unit 245 converts phonemes into character strings. That is, the character string conversion unit 245 outputs the content of the input voice data as a character string.

文字列変換部２４５が音素を文字列に変換する方法については、本発明では特に限定しないが、以下のような方法を採用することができる。例えば、文字列変換部２４５は、言語モデル２４７を用いた統計的手法により、音素を文字列に変換する。言語モデル２４７は、品詞（単語）の繋がり等、文法構造に関する確率分布をモデル化したものであり、辞書データを含むモデルである。言語モデル２４７は、音声認識や音声認証の結果等に応じて動的に更新されてもよい。 The method by which the character string conversion unit 245 converts a phoneme into a character string is not particularly limited in the present invention, but the following method can be adopted. For example, the character string conversion unit 245 converts phonemes into character strings by a statistical method using the language model 247. The language model 247 is a model of a probability distribution related to a grammatical structure such as a connection of part of speech (word), and is a model including dictionary data. The language model 247 may be dynamically updated according to the result of voice recognition or voice authentication.

For example, the character string conversion unit 245 extracts from the language model 247 the words, parts of speech, or connections between them that correspond to the phoneme sequence output from the phoneme conversion unit 244, and outputs the most likely character string as the character string corresponding to those phonemes. In this way, voice data that makes sense as a sentence can be converted into a character string with high accuracy. In addition, the character string conversion unit 245 cuts out voice data for each converted character string (word), acquires the time information (start time and end time) of the character string, and outputs the converted character string, the cut-out voice data, and the time information in association with one another.

The additional character string removing unit 248 refers to the storage unit 29, detects the character strings corresponding to authentication character strings among the character strings output by the character string conversion unit 245, and removes every character string other than the detected authentication character strings (the removed strings correspond to additional character strings). The additional character string removing unit 248 then outputs only the authentication character strings together with the voice data and time information associated with them.
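The filtering performed by the additional character string removing unit 248 might look like the following sketch. It assumes, for simplicity, that the recognizer's output segments line up exactly with the stored authentication character strings; the `Segment` type and `remove_additional_strings` function are illustrative names, not part of the specification.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Segment:
    """One recognized character string with its cut-out audio and time span."""
    text: str
    audio: bytes
    start: float  # seconds
    end: float    # seconds


def remove_additional_strings(segments: Iterable[Segment],
                              auth_strings: Iterable[str]) -> List[Segment]:
    """Keep only the segments whose text is a stored authentication string."""
    wanted = set(auth_strings)
    return [seg for seg in segments if seg.text in wanted]


# Hypothetical recognizer output for the utterance "なかよしたぬきのこ".
recognised = [
    Segment("なかよし", b"\x00\x01", 0.0, 0.8),    # additional string ("nakayoshi")
    Segment("たぬきのこ", b"\x02\x03", 0.8, 1.9),  # authentication string ("tanuki no ko")
]
kept = remove_additional_strings(recognised, ["たぬきのこ"])
```

The surviving segment keeps its audio and time information, which is what the later voiceprint step consumes.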

Returning to the description of FIG. 3, the voiceprint data generation unit 25 generates voiceprint data from the voice data that the voice recognition unit 24 outputs in association with each character string. When the voice recognition unit 24 outputs a plurality of character strings, the voiceprint data generation unit 25 generates separate voiceprint data from the voice data associated with each character string.

Voiceprint data is data on the voiceprint, a feature unique to a user's voice. The method of generating the voiceprint data is not particularly limited in the present invention, and any known method may be adopted as appropriate. One common method generates the voiceprint data by taking the difference from a standard phoneme model built from the voices of many speakers.

During the voice registration process of the voice authentication system 1, the voiceprint data generation unit 25 generates voiceprint data from the voice data associated with the character strings that the voice recognition unit 24 output based on the first voice data. In the following description, the voiceprint data generated by the voiceprint data generation unit 25 during the voice registration process is referred to as the first voiceprint data. During the voice registration process, the first voiceprint data is associated with the character strings that the voice recognition unit 24 output based on the first voice data, and is stored in the storage unit 29 described later.

During the voice authentication process of the voice authentication system 1, on the other hand, the voiceprint data generation unit 25 generates voiceprint data from the voice data associated with the character strings that the voice recognition unit 24 output based on the second voice data. In the following description, the voiceprint data generated by the voiceprint data generation unit 25 during the voice authentication process is referred to as the second voiceprint data.

During the voice authentication process of the voice authentication system 1, the authentication unit 26 authenticates the user by collating the first voiceprint data stored in the storage unit 29 with the second voiceprint data generated by the voiceprint data generation unit 25. The authentication unit 26 determines that authentication has succeeded when the match rate between the first voiceprint data and the second voiceprint data is equal to or greater than a predetermined threshold, and that authentication has failed when the match rate is less than the threshold.
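The threshold decision can be sketched as follows. The specification does not define how the match rate is computed, so cosine similarity between two feature vectors is used here purely as a stand-in; the vector representation, the function names, and the default threshold of 0.8 are all assumptions.

```python
import math
from typing import Sequence


def match_rate(first: Sequence[float], second: Sequence[float]) -> float:
    """Cosine similarity, used here only as a stand-in for the match rate."""
    if len(first) != len(second):
        raise ValueError("voiceprints must have the same length")
    dot = sum(a * b for a, b in zip(first, second))
    norm = math.sqrt(sum(a * a for a in first)) * math.sqrt(sum(b * b for b in second))
    return dot / norm if norm else 0.0


def authenticate(first: Sequence[float], second: Sequence[float],
                 threshold: float = 0.8) -> bool:
    """Succeed when the match rate is at or above the threshold, fail below it."""
    return match_rate(first, second) >= threshold
```

Note that the comparison is inclusive (`>=`), mirroring the "equal to or greater than" wording of the success condition.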

The service providing unit 27 provides the desired service to the user when the authentication unit 26 has successfully authenticated the user. In the present embodiment, the voice authentication device 20 is described as having the service providing unit 27 for convenience of explanation, but the present invention is not limited to this.

In the present invention, instead of the voice authentication device 20 having the service providing unit 27, the voice authentication system 1 may further include a service providing device that provides the desired service to the user (that is, a service providing device independent of the voice authentication device 20 may be provided). In this case, the service providing device provides the service when authentication by the authentication unit 26 of the voice authentication device 20 has succeeded.

The user data acquisition unit 28 acquires data on the user's personal information (name, address, date of birth, gender, age, and so on), referred to as user data, that the user has entered via the terminal device 10. For example, the user data acquisition unit 28 may acquire, via the network 30 and the communication unit 21, user data that the user entered through the input unit 14 of the terminal device 10.

The storage unit 29 is a storage device that stores the various data used in the voice authentication device 20. The storage unit 29 stores, for example, at least one of a corpus and dictionary data for generating the first utterance character string. The storage unit 29 also stores information in which the user's voice data, character strings, and the time information of those character strings are associated with one another. In addition, the storage unit 29 stores, for each user, the user's identification information, first utterance character string, authentication character string, second utterance character string, first voice data, second voice data, first voiceprint data, and second voiceprint data in association with one another.

<Operation example of voice authentication system 1>
An operation example of the voice authentication system 1 configured as described above is explained below with reference to FIGS. 5 and 6.

[1] Voice registration process
First, the voice registration process, which registers the user's voice in the voice authentication system 1, is described with reference to FIG. 5.

In step S1, the voice authentication device 20 receives a request from the user, entered via the terminal device 10, to start the registration process. The user makes this request through, for example, the input unit 14 of the terminal device 10. The voice registration process of the voice authentication system 1 then starts.

In step S2, the first utterance character string generation unit 22 of the voice authentication device 20 generates the first utterance character string, which is the character string the user is asked to speak when registering his or her voice.

As described above, the first utterance character string generated by the first utterance character string generation unit 22 is preferably a character string that is easy for the user to speak. It may, however, be a random character string, or a character string that includes at least part of the user's personal information (such as date of birth, age, address, or name) or a word associated with that personal information.

An example of a random character string is a phrase commonly used in a particular country or region (for example, "Genkotsuyama no Tanuki-san" (げんこつやまのたぬきさん)) obtained at random from the corpus. Character strings obtained from the corpus may be stored in the storage unit 29 as dictionary data according to, for example, user attributes derived from the user data, and the first utterance character string generation unit 22 may refer to that dictionary data to generate the first utterance character string at random. The dictionary data may be updated as needed according to, for example, the results of voice recognition or voice authentication.

Examples of character strings that include personal information such as the user's date of birth, age, or address are "1995 nen 1 gatsu 1 nichi" (January 1, 1995), "33 sai" (33 years old), and "Chiyoda-ku Sotokanda" (a district address). These examples are written in hiragana in the original, but katakana may be used instead.

The first utterance character string generation unit 22 may also generate the first utterance character string by combining a random character string obtained from the corpus or the dictionary data with a character string based on the user data (for example, "Genkotsu 33 sai no Tanuki").

It is more desirable that the first utterance character string generation unit 22 generate a plurality of different first utterance character strings in this way.
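One possible way to combine a corpus phrase with a user-data fragment is sketched below. The splicing rule (insert the fragment at a random position in the whole phrase) is an assumption and is simpler than the example in the text, which also drops part of the phrase; `make_first_utterance` and its parameters are hypothetical names.

```python
import random
from typing import List


def make_first_utterance(corpus: List[str], user_fragments: List[str],
                         rng: random.Random) -> str:
    """Pick a corpus phrase and splice one user-data fragment into it."""
    phrase = rng.choice(corpus)
    if not user_fragments:
        return phrase
    fragment = rng.choice(user_fragments)
    cut = rng.randint(0, len(phrase))  # insertion point; both ends are allowed
    return phrase[:cut] + fragment + phrase[cut:]


rng = random.Random(0)  # seeded only to make the sketch repeatable
corpus = ["げんこつやまのたぬきさん"]  # "Genkotsuyama no Tanuki-san"
fragments = ["33さい"]                 # "33 sai", derived from the user's age
candidates = [make_first_utterance(corpus, fragments, rng) for _ in range(3)]
```

Calling the function several times yields several candidates; this sketch does not filter duplicates, so a real implementation aiming for "a plurality of different" strings would need to do so.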

Returning to the description of FIG. 5, in step S3 the presentation unit 13 of the terminal device 10 presents to the user the first utterance character string generated by the first utterance character string generation unit 22 in step S2. At the same time, the presentation unit 13 asks the user to speak (read aloud) the first utterance character string.

For example, when the presentation unit 13 is a display, it displays a message such as "Please read the displayed characters aloud." together with the first utterance character string. When the presentation unit 13 is a speaker, it issues a message such as "Please say the words that follow." and then plays back the first utterance character string as audio. When a plurality of first utterance character strings have been generated, the presentation unit 13 presents them and outputs a message asking the user to select one of them and speak it.

When the user then speaks the first utterance character string, the voice data generation unit 12 of the terminal device 10 generates the first voice data corresponding to the first utterance character string.

In step S4, the voice authentication device 20 acquires the first voice data from the terminal device 10.

In step S5, the voice recognition unit 24 of the voice authentication device 20 converts the first voice data into a character string, cuts out voice data for each word (or character, or phrase) contained in the converted character string, and outputs the character string and the cut-out voice data in association with each other.

A specific example follows. When the first voice data is the voice data obtained when the user says "Genkotsuyama no Tanuki-san", the voice recognition unit 24 cuts out the voice data corresponding to "Genkotsu", "yama", "no", "Tanuki", and "san", respectively.

In this example, the voice recognition unit 24 divides the first utterance character string word by word when cutting out the voice data. However, the present invention does not particularly limit the granularity (per character, per word, or per phrase) at which the voice recognition unit 24 cuts out the voice data. For example, the voice recognition unit 24 may divide the first utterance character string character by character ("ge", "n", "ko", "tsu", ...) and cut out the voice data accordingly.
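Cutting out voice data per word (or per character) amounts to slicing the audio by each unit's time span. The sketch below assumes the audio is a plain sequence of samples and that the recognizer has already supplied (text, start, end) spans in seconds; `cut_segments`, the sample rate, and the sample values are illustrative only.

```python
from typing import List, Sequence, Tuple


def cut_segments(samples: Sequence[int], sample_rate: int,
                 spans: List[Tuple[str, float, float]]) -> List[Tuple[str, List[int]]]:
    """Slice the sample sequence for each (text, start_sec, end_sec) span."""
    result = []
    for text, start, end in spans:
        lo = int(round(start * sample_rate))
        hi = int(round(end * sample_rate))
        result.append((text, list(samples[lo:hi])))
    return result


samples = list(range(30))  # stand-in for 3 seconds of audio at 10 samples/s
spans = [("げんこつ", 0.0, 1.0), ("やま", 1.0, 1.5)]  # "genkotsu", "yama"
pieces = cut_segments(samples, 10, spans)
```

The same function works at character granularity if the spans are given per character instead of per word.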

Returning to the description of FIG. 5, in step S6 the authentication character string generation unit 23 of the voice authentication device 20 generates an authentication character string, which is a character string used in the voice authentication process described later.

The authentication character string generated by the authentication character string generation unit 23 includes characters (words, phrases) contained in the first utterance character string generated by the first utterance character string generation unit 22.

For example, when the first utterance character string is "Genkotsuyama no Tanuki-san" (げんこつやまのたぬきさん), the authentication character strings may include the following.
(1) Genkotsu (げんこつ)
(2) Tanukiyama (たぬきやま)
(3) Tanuki no Ko (たぬきのこ)
(4) Koyama-san (こやまさん)
(5) Geyatasa (げやたさ)

Character string (1) is an example in which the authentication character string is generated from a part of the first utterance character string without changing the order of the characters or the meaning of the words. Character strings (2) and (3) are examples in which the authentication character string is generated from a part of the first utterance character string by partly changing the order of the characters while staying within the meaning of each word contained in the first utterance character string.

Character string (4) is an example in which the authentication character string is generated from a part of the first utterance character string so that a meaning different from that of the first utterance character string arises. Character string (5) is an example in which the authentication character string is generated by extracting characters at random from a part of the first utterance character string, with no regard to the character order or word meanings of the first utterance character string.

In this way, the authentication character string generation unit 23 generates authentication character strings that include characters contained in the first utterance character string. In the example above, the authentication character string generation unit 23 generates a plurality of authentication character strings, but the present invention is not limited to this, and only one authentication character string may be generated.
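The relationship between the first utterance character string and the authentication character strings can be checked mechanically. The sketch below assumes the strict reading that every character of an authentication character string must appear in the first utterance character string, which holds for all five examples above; the function names are hypothetical.

```python
import random

FIRST = "げんこつやまのたぬきさん"  # "Genkotsuyama no Tanuki-san"


def is_valid_auth_string(candidate: str, first_utterance: str) -> bool:
    """True when the candidate is non-empty and uses only characters of the first string."""
    return bool(candidate) and set(candidate) <= set(first_utterance)


def random_auth_string(first_utterance: str, length: int, rng: random.Random) -> str:
    """Draw characters at random, ignoring order and meaning, as in example (5)."""
    return "".join(rng.choice(first_utterance) for _ in range(length))


# Examples (1) to (5) from the text.
examples = ["げんこつ", "たぬきやま", "たぬきのこ", "こやまさん", "げやたさ"]
all_valid = all(is_valid_auth_string(s, FIRST) for s in examples)
```

Because every character is covered by the registration utterance, voice data for each authentication character string can later be assembled from the per-character audio already recorded.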

In step S6, it is more desirable that the authentication character string generation unit 23 use, as the first utterance character string from which the authentication character string is generated, the character string that the voice recognition unit 24 obtained by converting the first voice data in step S5. However, the present invention is not limited to this, and the authentication character string generation unit 23 may acquire the first utterance character string directly from the first utterance character string generation unit 22.

In step S7, the authentication character string generation unit 23 generates the second utterance character string using the authentication character string generated in step S6. The second utterance character string is the character string the user is asked to speak in the voice authentication process described later.

When a plurality of authentication character strings were generated in step S6, the authentication character string generation unit 23 uses at least one of them to generate the second utterance character string in step S7. It may use more than one authentication character string to generate the second utterance character string. It may also generate the second utterance character string by adding an additional character string, that is, a character string other than an authentication character string, to one or more authentication character strings.

The additional character string is an extra character string added to the authentication character string in order to raise the security level of the voice authentication process. The method of generating it is not particularly limited. For example, it may be generated by obtaining from the corpus or the dictionary data a character string that tends to be used together with the authentication character string, or a completely random character string with no meaning may be used.

Specific examples of the second utterance character string are described below for the case where the authentication character strings are character strings (1) to (5) above.

As one example, the authentication character string generation unit 23 may simply use one of character strings (1) to (5) (for example, character string (2) "Tanukiyama") as the second utterance character string without modification.

As another example, the authentication character string generation unit 23 may generate the second utterance character string by combining several of character strings (1) to (5).

When character string (1) "Genkotsu" and character string (3) "Tanuki no Ko" are combined, the second utterance character string is "Genkotsu Tanuki no Ko". Alternatively, the authentication character string generation unit 23 may treat character string (1) "Genkotsu" and character string (3) "Tanuki no Ko" as two separate second utterance character strings.

Likewise, when character string (2) "Tanukiyama", character string (4) "Koyama-san", and character string (5) "Geyatasa" are combined, the second utterance character string is "Geyatasa Tanukiyama Koyama-san". Alternatively, the authentication character string generation unit 23 may treat character strings (2), (4), and (5) as three separate second utterance character strings.

As a further example, the authentication character string generation unit 23 may generate the second utterance character string by adding an additional character string to any one of character strings (1) to (5) or to a combination of them. For example, when the additional character string "Nakayoshi" is added to character string (3) "Tanuki no Ko", the second utterance character string is "Nakayoshi Tanuki no Ko".

In this way, when there are a plurality of authentication character strings, the authentication character string generation unit 23 may use at least one of them as the second utterance character string, may combine some but not all of them, or may combine all of them. The authentication character string generation unit 23 may also generate the second utterance character string by adding to the authentication character string an additional character string that is either unrelated to it or has an affinity with it.
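The combinations described above reduce to simple concatenation, as the following sketch shows. Placing the additional character string in front follows the one example given in the text ("Nakayoshi" before "Tanuki no Ko"); whether it may also be placed elsewhere is not stated, so that placement is an assumption, as is the function name.

```python
from typing import Sequence


def build_second_utterance(auth_strings: Sequence[str], additional: str = "") -> str:
    """Concatenate authentication strings in order, with an optional additional prefix."""
    if not auth_strings:
        raise ValueError("at least one authentication string is required")
    return additional + "".join(auth_strings)


# (1) + (3): "Genkotsu" + "Tanuki no Ko"
combined = build_second_utterance(["げんこつ", "たぬきのこ"])
# Additional string "Nakayoshi" + (3)
with_extra = build_second_utterance(["たぬきのこ"], additional="なかよし")
```

Passing a single authentication string with no additional string reproduces the simplest case, where one authentication character string is used as the second utterance character string as it is.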

Here, the authentication character string generation unit 23 determines the number of authentication character strings used to generate the second utterance character string, and whether an additional character string is added, according to a preset security level (authentication level). The security level may be determined in advance for each of a plurality of factors, such as the content of the service the user wants to receive, the place where the service is to be received, the frequency of use, and the environmental sound level.

The security level is described below with specific examples, with reference to FIG. 6. The relationship between the content of the service the user wants to use and the security level is, for example, as follows. As shown in FIG. 6, entering or leaving a facility is security level "1". A payment of 9,999 yen or less is security level "2". A payment of 10,000 yen or more is security level "3". Remittance to another person is security level "4".

In the present embodiment, a larger security level number means that higher security is required. In the service examples above, remittance to another person requires higher security than entering or leaving a facility.

The relationship between the place of use and the security level is, for example, as follows. As shown in FIG. 6, when the place of use is the user's home, the security level is "1". When the place of use is a financial institution, the security level is "2". When the place of use is a public institution (such as a municipal or government office), the security level is "3". When the place of use is a convenience store, many people are likely to be nearby, so higher security is required and the security level is "4".

The relationship between the frequency of use and the security level is, for example, as follows. As shown in FIG. 6, when the service was last used within one day, the security level is "1". When it was last used within one week, the security level is "2". When it was last used within one month, the security level is "3". When the service is being used for the first time, the security level is "4".
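The per-factor levels of FIG. 6 are straightforward lookups. The sketch below covers the payment amount and the frequency of use. Two points are assumptions not stated in the text: "one month" is taken as 30 days, and an interval longer than one month is treated like a first use (level 4). The function names are hypothetical.

```python
from datetime import timedelta
from typing import Optional


def payment_level(amount_yen: int) -> int:
    """Level 2 for payments of 9,999 yen or less, level 3 for 10,000 yen or more."""
    return 2 if amount_yen <= 9999 else 3


def frequency_level(since_last_use: Optional[timedelta]) -> int:
    """Map the time since the previous use to a level; None means first use."""
    if since_last_use is None:
        return 4
    if since_last_use <= timedelta(days=1):
        return 1
    if since_last_use <= timedelta(weeks=1):
        return 2
    if since_last_use <= timedelta(days=30):  # assumption: one month = 30 days
        return 3
    return 4  # assumption: the text does not cover intervals over one month
```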

Although not shown in FIG. 6, the security level may also be set as appropriate based on the environmental sound level. The environmental sound level is a value indicating the loudness of the ambient sound the user is exposed to in the environment where the user intends to use the service. The security level may be set to rise as the environmental sound level falls (that is, as the surroundings become quieter), or to rise as the environmental sound level rises (that is, as the surroundings become noisier).

Apart from the examples above, a security level may be set in advance for each user. Specifically, the security level of user A may be preset to "1" and that of user B to "3", for example. The security level for each user may be set in advance according to, for example, a trust score calculated from the user's service usage history to date.

Furthermore, the security levels set for each of the factors described above may be combined to determine the final security level. FIG. 6 shows an example with four security levels, but five or more levels, or three or fewer, may be used. A different number of security levels may also be set for each factor.

A specific example in which the security level is set by a combination of factors is as follows. Even when the place of use is a convenience store, where many people and a noisy environment are expected (security level "4" in the example of FIG. 6), if the service is a relatively small payment such as 5,000 yen (security level "2" in the example of FIG. 6), the final security level may be set to "3", for example.
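The text gives only this one example (levels 4 and 2 yielding 3) and does not state a combination rule. A rounded average is one rule that is consistent with that example; the sketch below uses it purely as an assumption, and `combine_levels` is a hypothetical name.

```python
import math
from typing import Sequence


def combine_levels(levels: Sequence[int]) -> int:
    """Combine per-factor levels by averaging and rounding half up (an assumed rule)."""
    if not levels:
        raise ValueError("at least one factor level is required")
    mean = sum(levels) / len(levels)
    return math.floor(mean + 0.5)


# Convenience store (level 4) combined with a 5,000 yen payment (level 2).
final_level = combine_levels([4, 2])
```

Other rules, such as taking the maximum or a weighted average, would be equally compatible with a design that wants some factors to dominate; the text leaves this open.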

The higher the security level set in this way, the more authentication character strings and additional character strings the authentication character string generation unit 23 combines to generate the second utterance character string. As a specific example, when there are four security levels as shown in FIG. 6, at security level "4" the authentication character string generation unit 23 generates the second utterance character string by combining all of authentication character strings (1) to (5) with an additional character string other than those authentication character strings. At security level "1", on the other hand, the authentication character string generation unit 23 may generate the second utterance character string by combining, for example, only two of authentication character strings (1) to (5).

When the security level is relatively high, it is more desirable that the authentication character string generation unit 23 generate the second utterance character string using an authentication character string based on the user data.

Through the processing described above, the higher the required security level, the more complex the generated second utterance character string becomes, because it is based on a larger number of authentication character strings and additional character strings.
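A minimal sketch of this level-dependent generation is given below. Only levels "4" (all five authentication character strings plus an additional character string) and "1" (for example, two authentication character strings) come from the description; the entries for levels 2 and 3, the `POLICY` table, and the function name `build_second_utterance` are hypothetical.

```python
import random

AUTH_STRINGS = ["げんこつ", "たぬきやま", "たぬきのこ", "こやまさん", "げやたさ"]

# level -> (number of authentication strings, whether to add an additional string)
# Levels 1 and 4 follow the description; levels 2 and 3 are assumed values.
POLICY = {1: (2, False), 2: (3, False), 3: (4, True), 4: (5, True)}


def build_second_utterance(level, auth_strings, additional, rng):
    """Return the parts of a second utterance character string for a level."""
    count, add_extra = POLICY[level]
    parts = rng.sample(auth_strings, count)  # distinct strings, random order
    if add_extra:
        # Insert the additional string at a random position among the parts.
        parts.insert(rng.randrange(len(parts) + 1), additional)
    return parts


rng = random.Random(0)
parts = build_second_utterance(4, AUTH_STRINGS, "なかよし", rng)
second_utterance = "".join(parts)
```

Returning the parts rather than the joined string keeps the boundary between authentication and additional character strings available for later processing.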

Returning to the description of FIG. 5: in step S8, the voiceprint data generation unit 25 generates the first voiceprint data corresponding to the authentication character strings, based on the voice data cut out by the voice recognition unit 24.

The generation of the first voiceprint data by the voiceprint data generation unit 25 is described below using a specific example. In the following description, the first utterance character string is "げんこつやまのたぬきさん" (genkotsuyama no tanuki-san), the first utterance character string has been divided character by character, and the authentication character strings (1) "げんこつ" (genkotsu), (2) "たぬきやま" (tanukiyama), (3) "たぬきのこ" (tanukinoko), (4) "こやまさん" (koyamasan), and (5) "げやたさ" (geyatasa) are assumed to have been generated.

First, the user utters the first utterance character string "げんこつやまのたぬきさん", and the character string conversion unit 245 of the voice recognition unit 24 cuts out voice data from the resulting first voice data. From this cut-out voice data, the voiceprint data generation unit 25 acquires only the portions corresponding to the authentication character strings.

Specifically, based on the voice data that the voice recognition unit 24 generated for each character of the first utterance character string, the voiceprint data generation unit 25 generates first voiceprint data for each of the authentication character strings (1) to (5).
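The way per-character voice data from a single registration utterance can serve several authentication character strings is sketched below. The audio is a placeholder (one fake sample per character), the helper names are hypothetical, and the actual voiceprint feature extraction is not shown because the description does not specify it. Using the first occurrence of a repeated character (such as "ん") is also an assumption.

```python
FIRST_UTTERANCE = "げんこつやまのたぬきさん"


def index_char_segments(aligned):
    """Map each character to the audio cut out for it (first occurrence wins)."""
    table = {}
    for char, samples in aligned:
        table.setdefault(char, samples)
    return table


def segments_for(auth_string, char_segments):
    """Collect, in order, the registered audio for each character of auth_string."""
    missing = [c for c in auth_string if c not in char_segments]
    if missing:
        raise KeyError(f"no registered audio for: {''.join(missing)}")
    return [char_segments[c] for c in auth_string]


# Placeholder audio: one fake sample per character, equal to its position.
aligned = [(c, [float(i)]) for i, c in enumerate(FIRST_UTTERANCE)]
char_segments = index_char_segments(aligned)

# Audio for authentication character string (2), taken from the one utterance.
tanukiyama = segments_for("たぬきやま", char_segments)
```

All five authentication character strings in the example can be assembled this way, since every character they use occurs in the first utterance character string.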

In step S9, the voice authentication device 20 stores, in the storage unit 29, the user's identification information, the first utterance character string, the first voice data, the plurality of authentication character strings, the second utterance character string, and the first voiceprint data corresponding to each authentication character string, all associated with one another as a speaker model. A speaker model for each user is thereby stored in the storage unit 29, and the voice registration process that allows the user to use the voice authentication system 1 is complete.
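One possible shape for such a speaker model is sketched below. The field names, the in-memory `storage` dictionary standing in for the storage unit 29, and the placeholder feature vectors are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SpeakerModel:
    """Items that step S9 associates with one another for a single user."""
    user_id: str
    first_utterance: str
    first_voice_data: bytes
    auth_strings: list[str]
    second_utterance: str
    # authentication character string -> first voiceprint data (placeholder vectors)
    first_voiceprints: dict[str, list[float]] = field(default_factory=dict)


# Stand-in for the storage unit 29: one speaker model per user.
storage: dict[str, SpeakerModel] = {}


def register(model: SpeakerModel) -> None:
    storage[model.user_id] = model


register(SpeakerModel(
    user_id="user-001",
    first_utterance="げんこつやまのたぬきさん",
    first_voice_data=b"",  # raw audio omitted in this sketch
    auth_strings=["げんこつ", "たぬきやま", "たぬきのこ", "こやまさん", "げやたさ"],
    second_utterance="なかよしたぬきのこ",
    first_voiceprints={"たぬきのこ": [0.1, 0.2, 0.3]},
))
```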

In the voice registration process shown in FIG. 5, step S6, in which the authentication character string generation unit 23 generates the authentication character strings, may be performed at any time after step S5, in which the voice recognition unit 24 converts the first voice data into a character string. That is, step S6 may be performed immediately after the voice recognition unit 24 recognizes the character string, or later (for example, after a user whose voice has already been registered logs in to the voice authentication system 1 for voice authentication).

Similarly, in the voice registration process shown in FIG. 5, step S8, in which the voiceprint data generation unit 25 generates the first voiceprint data, may be performed at any time after step S5, in which the voice recognition unit 24 cuts out the voice data corresponding to the authentication character strings. That is, step S8 may be performed immediately after the voice recognition unit 24 outputs the voice data associated with the recognized character string, or later (for example, in the background while voice registration or voice authentication processing is being performed for other users).

[2] Voice authentication process
Next, the voice authentication process performed when a user attempts to use a service is described with reference to FIG. 7. The voice authentication process is performed for a user who has completed the voice registration process described above.

In step S11, the voice authentication device 20 receives a service start request from the user, input via the terminal device 10. The user makes the service start request through, for example, the input unit 14 of the terminal device 10. This starts the voice authentication process for the user.

In step S12, the presentation unit 13 of the terminal device 10 presents to the user the second utterance character string generated in the voice registration process described above. At the same time, the presentation unit 13 requests the user to utter (read aloud) the second utterance character string. The presentation unit 13 may present the second utterance character string in the same way as the first utterance character string is presented in step S3 of FIG. 5, for example.

When the user utters the second utterance character string in response to the request in step S12, the voice data generation unit 12 of the terminal device 10 generates second voice data corresponding to the second utterance character string.

In step S13, the voice authentication device 20 acquires the second voice data from the terminal device 10.

In step S14, the voice recognition unit 24 of the voice authentication device 20 converts the second voice data into a character string, cuts out voice data for each word (or character, or phrase) contained in the converted character string, and outputs each character string in association with the voice data cut out for it. When the second voice data contains a portion corresponding to an additional character string, the voice recognition unit 24 discards the voice data corresponding to the additional character string.

A specific example follows. When the second voice data is voice data obtained when the user utters "なかよしたぬきのこ" (nakayoshi tanukinoko), the voice recognition unit 24 first cuts out, from the second voice data, the voice data corresponding to "なかよし", the voice data corresponding to "たぬき", the voice data corresponding to "の", and the voice data corresponding to "こ".

Next, the voice recognition unit 24 refers to the speaker model stored in the storage unit 29, extracts from the cut-out voice data only the voice data corresponding to an authentication character string, and removes the voice data that does not correspond to an authentication character string. Voice data that does not correspond to an authentication character string is voice data corresponding to an additional character string. In the example above, the voice recognition unit 24 discards the voice data corresponding to "なかよし" and outputs the voice data corresponding to "たぬき", "の", and "こ".
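The removal of the additional character string can be sketched as follows. The description does not say how a cut-out segment is matched against the registered authentication character strings; the rule used here (keep the longest contiguous run of segments whose joined text equals a registered authentication character string, discard everything else) and the function name `drop_additional` are assumptions.

```python
def drop_additional(segments, auth_strings):
    """Keep segments that together spell a registered authentication string.

    segments: list of (text, audio) pairs in utterance order.
    Assumption: a run of consecutive segments is kept when its joined text
    equals an authentication string; unmatched segments are treated as
    belonging to an additional character string and are discarded.
    """
    known = set(auth_strings)
    kept = []
    i = 0
    while i < len(segments):
        match_end = None
        for j in range(i + 1, len(segments) + 1):
            if "".join(text for text, _ in segments[i:j]) in known:
                match_end = j  # remember the longest match starting at i
        if match_end is None:
            i += 1  # segment i is part of an additional string: discard it
        else:
            kept.extend(segments[i:match_end])
            i = match_end
    return kept


auth_strings = ["げんこつ", "たぬきやま", "たぬきのこ", "こやまさん", "げやたさ"]
segments = [("なかよし", b"a"), ("たぬき", b"b"), ("の", b"c"), ("こ", b"d")]
kept = drop_additional(segments, auth_strings)
```

With the utterance "なかよしたぬきのこ", the segment for "なかよし" is dropped and the segments for "たぬき", "の", and "こ" are kept, matching the example in the text.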

Returning to the description of FIG. 7: in step S15, the voiceprint data generation unit 25 generates the second voiceprint data using the voice data that the voice recognition unit 24 output in step S14 based on the second voice data. In the example above, the voiceprint data generation unit 25 generates second voiceprint data corresponding to "たぬき", "の", and "こ".

In step S16, the authentication unit 26 performs voice authentication of the user using the first voiceprint data generated in the voice registration process described above and the second voiceprint data generated in step S15.

A specific example follows. As in the example given for the voice registration process, the authentication character strings are assumed to be (1) "げんこつ", (2) "たぬきやま", (3) "たぬきのこ", (4) "こやまさん", and (5) "げやたさ". The voiceprint data generation unit 25 is assumed to have generated second voiceprint data corresponding to "たぬき", "の", and "こ".

In this case, the storage unit 29 stores the first voiceprint data corresponding to authentication character string (3) "たぬきのこ". The authentication unit 26 reads this first voiceprint data from the storage unit 29 and acquires the second voiceprint data corresponding to "たぬき", "の", and "こ" from the voiceprint data generation unit 25. The authentication unit 26 then collates the first voiceprint data with the second voiceprint data for each of the character strings "たぬき", "の", and "こ" and calculates a match rate. The authentication unit 26 determines that authentication has succeeded when the match rate is equal to or higher than a predetermined threshold, and that authentication has failed when the match rate is lower than the predetermined threshold.
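The collation and threshold decision can be sketched as follows. The description does not define how the match rate is computed; this sketch assumes that voiceprint data are fixed-length feature vectors keyed by segment text, that the match rate is the mean cosine similarity over the segments, and that the threshold is 0.8. All three are illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def authenticate(first, second, threshold=0.8):
    """Return True when the match rate reaches the threshold.

    first / second: dict mapping segment text -> feature vector.
    Assumption: match rate = mean cosine similarity over the segments
    present in the second voiceprint data.
    """
    if not second or any(key not in first for key in second):
        return False
    rate = sum(cosine(first[k], second[k]) for k in second) / len(second)
    return rate >= threshold


# Placeholder vectors standing in for real voiceprint features.
first = {"たぬき": [1.0, 0.0], "の": [0.0, 1.0], "こ": [1.0, 1.0]}
same_speaker = {"たぬき": [1.0, 0.0], "の": [0.0, 1.0], "こ": [1.0, 1.0]}
other_speaker = {"たぬき": [0.0, 1.0], "の": [1.0, 0.0], "こ": [1.0, -1.0]}

accepted = authenticate(first, same_speaker)
rejected = authenticate(first, other_speaker)
```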

Returning to the description of FIG. 7: in step S17, if the result of the authentication in step S16 is success (step S17: success), the process proceeds to step S18; if the result is failure (step S17: failure), the process proceeds to step S19.

In step S18, the service providing unit 27 starts providing the service requested by the user to the user whose voice authentication succeeded.

In step S19, on the other hand, the service providing unit 27 causes the user whose voice authentication failed to be notified of the failure via, for example, the presentation unit 13 of the terminal device 10.

<Operation and effects>
As described above, the voice authentication device 20 according to the embodiment of the present invention includes: the authentication character string generation unit 23, which generates an authentication character string including at least one of the characters included in the first utterance character string; the voiceprint data generation unit 25, which generates first voiceprint data for the portion corresponding to the authentication character string from the first voice data generated when the user utters the first utterance character string, and generates second voiceprint data based on the second voice data generated when the user utters the second utterance character string including the authentication character string; and the authentication unit 26, which authenticates the user by collating the first voiceprint data with the second voiceprint data.

With this configuration, the first utterance character string that the user utters during the voice registration process, in which the user's voice is registered with the system, and the second utterance character string that the user utters during the voice authentication process can be made different from each other. As a result, even if voice data of the user uttering the first utterance character string is stolen, for example, authentication does not succeed with the stolen data during the voice authentication process. The security of voice authentication is therefore improved.

In addition, the authentication character strings are generated from the characters included in the first utterance character string, and the voiceprint data is generated based on the first voice data obtained by having the user actually utter the first utterance character string. The user therefore does not need to utter the authentication character strings themselves at registration, and the burden on the user during the voice registration process is reduced even when a plurality of authentication character strings are prepared. In effect, a plurality of authentication character strings are obtained by having the user utter, just once, a phrase short enough not to be a burden. This reduces the effort required for registration and improves user convenience.

In other words, the voice authentication device 20 according to the embodiment of the present invention can both improve the security of voice authentication and reduce the user's effort at registration.

Further, in the voice authentication device 20 according to the embodiment of the present invention, the authentication character string generation unit 23 generates a plurality of authentication character strings, each including characters included in the first utterance character string, and generates a second utterance character string including at least one of the plurality of authentication character strings. In doing so, the authentication character string generation unit 23 generates a second utterance character string that includes a number of authentication character strings corresponding to a preset security level (authentication level).

With this configuration, the second utterance character string can be generated flexibly according to the level of security required of the voice authentication. When the required security level is high, high security can be ensured by combining many authentication character strings to generate a complex second utterance character string. When a high security level is not required, a relatively simple second utterance character string can be generated from a small number of authentication character strings. This avoids a situation in which the user is made to utter a complex second utterance character string during the voice authentication process even though the required security level is low, which would place a heavy burden on the user at authentication. Voice authentication can therefore be performed with a good balance between the required security level and the burden on the user.

Further, in the voice authentication device 20 according to the embodiment of the present invention, the authentication character string generation unit 23 generates the second utterance character string by adding, to the authentication character string, an additional character string different from the authentication character string. Based on a preset security level, the authentication character string generation unit 23 changes the number of authentication character strings constituting the second utterance character string and determines whether to add an additional character string.

Because the user is made to utter a character string that is not among the generated authentication character strings, the variety of utterances increases and unauthorized use can be further reduced.

The voice authentication device 20 according to the embodiment of the present invention further includes the first utterance character string generation unit 22, which generates the first utterance character string based on the user's personal information.

With this configuration, the first utterance character string that the user utters during the voice registration process can be a character string that includes the user's personal information. Because the second utterance character string that the user utters during the voice authentication process is generated based on the first utterance character string, including a character string related to the user in the first utterance character string makes it possible to generate a second utterance character string that the user can easily recognize and utter.

<Modifications>
Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment and can be modified as appropriate without departing from the spirit of the present invention.

In the embodiment described above, the voice authentication device 20, a cloud server that acquires via the network 30 the user's voice data generated by the terminal device 10, was described as an example of the voice authentication device of the present invention, but the present invention is not limited to this.

Not all of the functions of the voice authentication device of the present invention need to be implemented in the voice authentication device 20 as a cloud server; some of them may be implemented on the terminal device 10 side. For example, any of the first utterance character string generation unit 22, the authentication character string generation unit 23, the voice recognition unit 24, the voiceprint data generation unit 25, and the authentication unit 26 may be implemented in the terminal device 10.

Alternatively, all of the functions of the voice authentication device of the present invention may be implemented on the terminal device side. In this case, the voice authentication system requires no components other than the terminal device. Even then, the service providing unit that provides services to the user, for example, may be installed outside the terminal device, in which case the terminal device and the service providing unit only need to be able to communicate via a network.

In the embodiment described above, authentication is performed using only the user's voice, but the present invention is not limited to this, and multi-factor authentication that adds authentication by other factors may be performed. One example of multi-factor authentication is as follows. The user's biometric information (face image, iris image, fingerprint image, finger vein image, etc.) is registered in advance, and when the user requests a service, authentication using this biometric information is performed together with the voice authentication process. Providing the service only when both succeed further improves the security of authentication.

The factor used in addition to voice data is not limited to biometric information; it may be knowledge information (something the user knows) or possession information (something the user has). In other words, any form of multi-factor authentication may be used.

For example, when authentication uses both voice and a password, asking the user to enter the password before speaking makes it possible to look up the speaker model of the user associated with that password, which shortens the time needed to narrow down the speaker models to be referenced during voice authentication. Both high security and user convenience can thus be achieved.

In the embodiment described above, the additional character string is simply a character string different from the authentication character strings, but the additional character string may, for example, be varied from user to user (using user data). In that case, the number of users whose voice recognition results match can be narrowed further, which improves the accuracy of voiceprint authentication and identity verification.

The present invention is suitable for a voice authentication device or a voice authentication system that performs voice authentication of a user.

1 Voice authentication system
10 Terminal device
11 Communication unit
12 Voice data generation unit
13 Presentation unit
14 Input unit
20 Voice authentication device
21 Communication unit
22 First utterance character string generation unit
23 Authentication character string generation unit
24 Voice recognition unit
241 Noise reduction unit
242 Utterance detection unit
243 Feature extraction unit
244 Phoneme conversion unit
245 Character string conversion unit
246 Acoustic model
247 Language model
248 Additional character string removal unit
25 Voiceprint data generation unit
26 Authentication unit
27 Service providing unit
28 User data acquisition unit
29 Storage unit
30 Network

Claims

A voice authentication device comprising:
an authentication character string generation unit that generates an authentication character string including at least one of the characters included in a predetermined character string;
a voiceprint data generation unit that generates, from first voice data generated by a user uttering the predetermined character string, first voiceprint data of the portion corresponding to the authentication character string, and generates second voiceprint data based on second voice data generated by the user uttering the authentication character string; and
an authentication unit that authenticates the user by collating the first voiceprint data with the second voiceprint data.

The voice authentication device according to claim 1, further comprising
a voice recognition unit that converts voice data into a character string, divides the voice data based on the converted character string, and outputs the character string and the divided voice data.

The voice authentication device according to claim 1 or 2, wherein
the authentication character string generation unit generates the authentication character string by combining a plurality of characters included in the predetermined character string.

The voice authentication device according to any one of claims 1 to 3, wherein
the authentication character string generation unit generates a plurality of mutually different authentication character strings and generates an utterance authentication character string including at least one of the plurality of authentication character strings, and
the voiceprint data generation unit generates the second voiceprint data based on voice data generated by the user uttering the utterance authentication character string.

The voice authentication device according to claim 4, wherein
the authentication character string generation unit generates the utterance authentication character string by adding, to the authentication character string, an additional character string different from the authentication character string.

The authentication character string generation unit changes the number of the authentication character strings included in the utterance authentication character string based on a preset authentication level.
The voice authentication device according to claim 4 or 5.

The authentication character string generation unit determines whether or not to add an additional character string to the authentication character string based on a preset authentication level.
The voice authentication device according to claim 6.

The authentication level is set in relation to the loudness of the surrounding environmental sound at the time of voice authentication by the authentication unit.
The voice authentication device according to claim 6 or 7.

The authentication level is set in relation to the content of the service provided to the user when the user is successfully authenticated.
The voice authentication device according to any one of claims 6 to 8.

The authentication level is set in relation to the location of the service provided to the user if the user is successfully authenticated.
The voice authentication device according to any one of claims 6 to 9.

The authentication level is set in relation to the execution frequency of the service provided to the user when the user is successfully authenticated.
The voice authentication device according to any one of claims 6 to 10.

The voice authentication device according to any one of claims 1 to 11, further comprising a service providing unit that permits provision of a predetermined service to the user when the user is successfully authenticated by the authentication unit.

The voice authentication device according to any one of claims 1 to 12, further comprising
a predetermined character string generation unit that generates the predetermined character string based on personal information of the user.

A voice authentication system comprising:
a voice data generation unit that generates voice data based on a user's utterance;
a presentation unit that presents a character string that the user is asked to utter;
an authentication character string generation unit that generates an authentication character string including at least one of the characters included in a predetermined character string;
a voiceprint data generation unit that generates, from first voice data generated by the user uttering the predetermined character string, first voiceprint data of the portion corresponding to the authentication character string, and generates second voiceprint data based on second voice data generated by the user uttering the authentication character string; and
an authentication unit that authenticates the user by collating the first voiceprint data with the second voiceprint data.

A voice authentication method comprising:
generating an authentication character string including at least one of the characters included in a predetermined character string;
generating, from first voice data generated by a user uttering the predetermined character string, first voiceprint data of the portion corresponding to the authentication character string;
generating second voiceprint data based on second voice data generated by the user uttering the authentication character string; and
authenticating the user using the first voiceprint data and the second voiceprint data.