JP2018180424A

JP2018180424A - Speech recognition apparatus and speech recognition method

Info

Publication number: JP2018180424A
Application number: JP2017083427A
Authority: JP
Inventors: Takio Haga; Shunichi Sudo
Original assignee: Alpine Electronics Inc
Current assignee: Alpine Electronics Inc
Priority date: 2017-04-20
Filing date: 2017-04-20
Publication date: 2018-11-15

Abstract

Problem to be solved: To provide a speech recognition apparatus and a speech recognition method that can make only the speech of a specific occupant the target of triggerless speech recognition, without the cumbersome work of registering a voice profile. Solution: In a speech recognition apparatus 100 capable of recognizing speech input from a microphone 200 installed in a vehicle, at least one of a face photograph of a user to be targeted for speech recognition and a seat to be targeted for speech recognition is registered as a recognition target. An image captured by a camera installed in the vehicle is recognized, and when it is determined that the mouth of an occupant corresponding to the registered recognition target is moving, the speech input from the microphone 200 is recognized. As a result, only the speech of a specific occupant can be made the target of triggerless speech recognition without the cumbersome work of registering a voice profile. Selected drawing: FIG. 1

Description

The present invention relates to a speech recognition apparatus and a speech recognition method, and is particularly suitable for a speech recognition apparatus that requires no trigger such as operation of a speech button or a specific action.

Vehicles are equipped with various electronic devices such as audio devices, air conditioners, and navigation devices. To avoid one-handed driving and the like while operating these devices, systems that allow the electronic devices to be operated by speech recognition are also provided. With this speech recognition technology, the driver can operate the various electronic devices without taking a hand off the steering wheel (that is, without manually operating an operation unit such as a remote controller or an operation panel).

A speech recognition apparatus usually recognizes specific words, phrases, simple command sentences, and the like uttered by the user (hereinafter collectively referred to as "words") as speech commands. The electronic device performs control according to the word (speech command) recognized by the speech recognition apparatus. Such a speech recognition apparatus compares the speech pattern of each recognition target word registered in a speech recognition dictionary with the user's uttered speech, searches for the speech pattern with the highest similarity to the uttered speech, and recognizes the word associated with that speech pattern as the word of the uttered speech.
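The dictionary lookup described above can be illustrated with a small sketch. The text does not say how similarity is measured; cosine similarity over toy feature vectors is used here purely as an assumption, and the names `cosine_similarity`, `recognize_word`, and `dictionary` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recognize_word(utterance, dictionary):
    """Return the registered word whose speech pattern best matches the utterance."""
    return max(dictionary, key=lambda w: cosine_similarity(utterance, dictionary[w]))


# Hypothetical dictionary: recognition target word -> toy speech pattern.
dictionary = {
    "play music": [0.9, 0.1, 0.0],
    "set temperature": [0.1, 0.8, 0.3],
    "go home": [0.0, 0.2, 0.9],
}

best = recognize_word([0.8, 0.2, 0.1], dictionary)
```

A real recognizer would compare acoustic feature sequences rather than three-element vectors; the sketch only shows the "pick the most similar registered pattern" step.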

In a conventional speech recognition apparatus, the user presses a provided speech button to enter a speech recognition mode, and the apparatus recognizes the user's speech input from the microphone and executes a command. Apparatuses that enter the speech recognition mode in response to a specific action such as clapping hands, instead of operation of the speech button, are also known. Recently, speech recognition that requires no trigger such as operation of a speech button or a specific action (hereinafter referred to as triggerless speech recognition) has also been provided. In triggerless speech recognition, the microphone is kept on at all times, the input speech is identified, and it is determined whether it is a word corresponding to a speech command.

When there are a plurality of occupants in the vehicle, the speech of several occupants may be input to the microphone at the same time. In view of the possibility of malfunction in such a case, a speech recognition system has been devised that is provided with a plurality of microphones and an imaging device that determines from an in-vehicle image whether each seat is occupied, and that identifies the seat position of the person who input the speech from the speech input from the plurality of microphones and the seating information from the imaging device (see, for example, Patent Document 1).

Also, when there are a plurality of occupants in the vehicle, because the microphone is always on in triggerless speech recognition, conversation between occupants is input from the microphone in addition to speech intended for speech recognition. To address this, a speech recognition apparatus is known in which driver profile data is created in advance using the loudness of the driver's voice and its variation while the driver is talking with a passenger, the pitch of the voice, the difference in loudness from noise sections, the characteristics of the voice at the start or end of an utterance, the speaking rate, and the like. After detecting operation of a talk switch by the driver, the apparatus compares the driver's state with the profile data to determine whether the driver's utterance is conversation directed to a passenger or speech input directed to the in-vehicle system, and selectively recognizes the speech input directed to the in-vehicle system (see, for example, Patent Document 2).

A vehicle control device is also known that includes first to third occupant intention recognition means for recognizing the intention of an occupant based on the speech uttered by the occupant, the movement of the occupant's lips, and the occupant's gestures, and that controls the operation of vehicle-mounted equipment according to the recognition results of the first to third occupant intention recognition means (see, for example, Patent Document 3).

Patent Document 1: JP 2004-354930 A
Patent Document 2: JP 2008-250236 A
Patent Document 3: JP 2011-232637 A

When there are a plurality of occupants in the vehicle cabin, because the microphone is always on in triggerless speech recognition, speech uttered by any occupant is input from the microphone and becomes a target of speech recognition. Consequently, there has been a problem that it is not possible to make only the speech of a specific occupant the target of triggerless speech recognition.

It is conceivable to register the voiceprint, speaking style, and other characteristics of a specific occupant as a voice profile and to compare the speech input from the microphone with the voice profile, so that only the speech of the specific occupant is targeted for speech recognition. In this case, however, a voice profile must be created and registered for each word to be recognized as a speech command, which makes the registration work extremely cumbersome.

The present invention has been made to solve this problem, and its object is to make it possible to target only the speech of a specific occupant for triggerless speech recognition without the cumbersome work of registering a voice profile.

To solve the above problem, in the present invention, at least one of a face photograph of a user to be targeted for speech recognition and a seat to be targeted for speech recognition is registered as a recognition target. An image captured by a camera installed in the vehicle is recognized, and when it is determined that the mouth of an occupant corresponding to the registered recognition target is moving, speech input from a microphone installed in the vehicle is recognized.

According to the present invention configured as described above, once a face photograph or a seat has been registered as a target of speech recognition, only the speech input from the microphone while it is determined that the mouth of the specific occupant whose face photograph is registered, or of the specific occupant sitting in the registered seat, is moving is subjected to speech recognition processing, out of all the speech constantly input from the microphone. This makes it possible to target only the speech of a specific occupant for triggerless speech recognition without the cumbersome work of registering a voice profile.
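The gating rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the apparatus's actual implementation: the `Occupant` record, its fields, and `should_recognize` are hypothetical, and the image recognition that would produce `user_id` and `mouth_moving` is assumed to exist elsewhere. How a registered face and a registered seat are combined when both are present is not specified in this passage; the sketch treats a match on either as sufficient.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Occupant:
    user_id: Optional[str]  # identity from face matching; None if no match
    seat: str               # seat position the occupant is sitting in
    mouth_moving: bool      # result of mouth-movement detection for this frame


def should_recognize(occupants, registered_users, registered_seats):
    """Return True if any registered recognition target is currently speaking."""
    for occ in occupants:
        is_target = occ.user_id in registered_users or occ.seat in registered_seats
        if is_target and occ.mouth_moving:
            return True
    return False


# Example frame: the driver is silent, an unidentified rear passenger is talking.
frame = [
    Occupant(user_id="alice", seat="driver", mouth_moving=False),
    Occupant(user_id=None, seat="rear_left", mouth_moving=True),
]
```

With only "alice" registered, the rear passenger's speech would be ignored; with the "rear_left" seat registered, the same frame would pass microphone input on to recognition.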

A block diagram showing an example of the functional configuration of the speech recognition apparatus according to the present embodiment.
A diagram showing an example of the configuration of an in-vehicle system to which the speech recognition apparatus of the present embodiment is applied.
A diagram showing the face photograph selection screen displayed by the recognition target registration unit of the present embodiment.
A diagram showing the seat selection screen displayed by the recognition target registration unit of the present embodiment.
A flowchart showing an example of the operation of the speech recognition apparatus according to the present embodiment.
A flowchart showing an example of the operation of the image recognition unit according to the present embodiment.
A flowchart showing an example of the operation of the speech recognition control unit according to the present embodiment.

An embodiment of the present invention will now be described with reference to the drawings. FIG. 1 is a block diagram showing an example of the functional configuration of the speech recognition apparatus 100 according to the present embodiment. FIG. 2 is a diagram showing an example of the configuration of an in-vehicle system to which the speech recognition apparatus 100 of the present embodiment is applied. The in-vehicle system is a system mounted on a vehicle and includes the speech recognition apparatus 100 of the present embodiment, a microphone 200, a front seat camera 301 (corresponding to the camera in the claims), a rear seat camera 302 (corresponding to the camera in the claims), and a touch panel 400.

As shown in FIG. 2, the speech recognition apparatus 100 of the present embodiment is installed in the center console, dashboard, or the like of the vehicle. The microphone 200 is installed near the front seats of the vehicle and can pick up the speech of both occupants seated in the front seats and occupants seated in the rear seats. The front seat camera 301 is provided at a position, such as the ceiling near the front seats, from which the faces of occupants seated in the front seats can be photographed. The rear seat camera 302 is provided at a position, such as the ceiling near the rear seats, from which the faces of occupants seated in the rear seats can be photographed. The touch panel 400 is provided at a position, such as the vehicle dashboard, where at least the occupant seated in the driver's seat among the front seats can perform touch operations.

The speech recognition apparatus 100 of the present embodiment recognizes an occupant's speech input from the microphone 200 (words such as specific words, phrases, and simple command sentences) as a speech command, and controls an electronic device 500 mounted on the vehicle based on the recognition result. The electronic device 500 controlled by the speech recognition apparatus 100 is, for example, a navigation device, an audio device, or an air conditioner.

As shown in FIG. 1, the speech recognition apparatus 100 of the present embodiment includes, as its functional configuration, a list registration unit 11 (corresponding to the occupant list registration unit in the claims), a recognition target registration unit 12, an image recognition unit 13, a speech processing unit 14, a speech recognition unit 15, an electronic device control unit 16, and a speech recognition control unit 17. The speech recognition apparatus 100 also includes, as storage media, an occupant list storage unit 21, an in-vehicle configuration list storage unit 22, and a recognition target storage unit 23.

Each of the functional blocks 11 to 17 can be implemented in hardware, by a DSP (Digital Signal Processor), or in software. When implemented in software, for example, each of the functional blocks 11 to 17 is actually configured with a computer's CPU, RAM, ROM, and the like, and is realized by running a program stored in a recording medium such as RAM, ROM, a hard disk, or a semiconductor memory.

The list registration unit 11 registers face photographs of a plurality of users as an occupant list 21a. The occupant list 21a is a list that holds, for each of a plurality of users, information identifying the user (hereinafter referred to as "user identification information") in association with the user's face photograph. The face photograph of a given user means image data generated by photographing that user's face. Because the face photograph is used for face recognition of the user, as described later, it is desirable that the face photograph be image data generated by photographing the user's face from the front.

The occupant list storage unit 21 stores the occupant list 21a. In the present embodiment, "registering a user's face photograph as the occupant list 21a" means registering a combination of user identification information and a face photograph in the occupant list 21a stored in the occupant list storage unit 21.

In the present embodiment, the list registration unit 11 registers a face photograph by the following process. First, face photographs (image data) are stored in advance in a predetermined folder of the speech recognition apparatus 100. Then, in response to a user instruction on the touch panel 400, the list registration unit 11 displays on the touch panel 400 a user interface that allows user identification information to be entered and one of the face photographs stored in the predetermined folder to be designated. Hereinafter, the person who makes entries in this user interface is referred to as the "input person". Using the user interface, the input person enters the user identification information of the user to be registered and designates that user's face photograph from among the face photographs stored in the predetermined folder. The face photograph is designated, for example, by specifying its file name. The list registration unit 11 acquires the entered user identification information and the designated face photograph, associates them with each other, and registers them in the occupant list 21a stored in the occupant list storage unit 21. For example, the owner of the vehicle registers in the occupant list 21a, in advance, the users who may ride in the vehicle and who are permitted to be targets of speech recognition.
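The registration flow just described (enter user identification information, designate a photo by file name from a predetermined folder, store the pair) might be modeled as below. The class name `OccupantList`, its methods, and the file names are hypothetical; the folder is represented as a plain set of file names rather than a real directory so that the sketch stays self-contained.

```python
class OccupantList:
    """In-memory stand-in for the occupant list: user ID -> face photo file name."""

    def __init__(self):
        self._entries = {}

    def register(self, user_id, photo_filename, folder_files):
        """Associate a user ID with a photo that exists in the predetermined folder."""
        if not user_id:
            raise ValueError("user identification information is required")
        if photo_filename not in folder_files:
            raise ValueError(f"{photo_filename!r} is not in the photo folder")
        self._entries[user_id] = photo_filename

    def photo_for(self, user_id):
        return self._entries[user_id]

    def users(self):
        return sorted(self._entries)


# Hypothetical contents of the predetermined folder.
folder = {"taro.jpg", "hanako.jpg"}

occupants = OccupantList()
occupants.register("Taro", "taro.jpg", folder)
occupants.register("Hanako", "hanako.jpg", folder)
```

Rejecting file names that are not in the folder mirrors the constraint that the input person can only designate photos already stored there.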

The process of the list registration unit 11 described above is only an example, and face photographs may be registered by other processes. For example, the list registration unit 11 may acquire a face photograph as follows. For a user whose face photograph has not been registered, the user's face is photographed by the front seat camera 301 or the rear seat camera 302 to generate image data, and the list registration unit 11 acquires the image data generated from the result of that photographing as the face photograph. In this case, the user does not need to store a face photograph in the predetermined folder in advance, and can take advantage of the front seat camera 301 and rear seat camera 302 mounted on the vehicle to register face photographs of occupants actually riding in the vehicle as needed. As another example, the list registration unit 11 may communicate with an external device, acquire the information needed to register a face photograph from that device, and register the face photograph based on the acquired information.

The list registration unit 11 also registers the plurality of seats present in the vehicle as an in-vehicle configuration list 22a. The in-vehicle configuration list 22a is a list that holds, for each of the seats present in the vehicle, information indicating the position of the seat in the vehicle (hereinafter referred to as "seat position information").

As shown in FIG. 1, the in-vehicle configuration list storage unit 22 stores the in-vehicle configuration list 22a. In the present embodiment, "registering seats as the in-vehicle configuration list 22a" means registering the seat position information of each of the seats present in the vehicle in the in-vehicle configuration list 22a stored in the in-vehicle configuration list storage unit 22. As one example, the list registration unit 11 displays on the touch panel 400 a user interface for entering seat position information and, based on the input to that user interface, registers the seat position information of each seat present in the vehicle in the in-vehicle configuration list 22a stored in the in-vehicle configuration list storage unit 22.

The positions of the seats in a vehicle are basically fixed by the vehicle model. In view of this, an in-vehicle configuration list 22a with contents corresponding to the vehicle model may be registered in advance by the company that provides the speech recognition apparatus 100, the company that provides the vehicle, or the like. The list registration unit 11 may also register the in-vehicle configuration list 22a by the following process. The list registration unit 11 displays on the touch panel 400 a user interface for entering the vehicle model and has the user enter the model. The list registration unit 11 acquires the vehicle model entered in the user interface and then acquires the in-vehicle configuration list 22a corresponding to that model. For example, a database holding, for each vehicle model, an in-vehicle configuration list 22a with contents corresponding to the model is stored in the speech recognition apparatus 100, and the list registration unit 11 acquires from this database the in-vehicle configuration list 22a corresponding to the model entered by the user. As another example, the list registration unit 11 communicates with an external device (for example, a server on the Internet) that holds an in-vehicle configuration list 22a for each vehicle model, and acquires the in-vehicle configuration list 22a from that external device. The list registration unit 11 stores the acquired in-vehicle configuration list 22a in the in-vehicle configuration list storage unit 22.
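As an illustration of the model-based lookup, the sketch below tries a local database first and then an optional external source. Trying the two sources in sequence is an assumption made for the example (the text presents them as alternatives), and the names `LOCAL_SEAT_DB`, `load_cabin_layout`, `fetch_remote`, the model keys, and the seat labels are all hypothetical.

```python
# Hypothetical local database: vehicle model -> seat position information.
LOCAL_SEAT_DB = {
    "sedan-5": ["driver", "front_passenger", "rear_left", "rear_center", "rear_right"],
    "coupe-2": ["driver", "front_passenger"],
}


def load_cabin_layout(vehicle_model, local_db, fetch_remote=None):
    """Return the seat list for a vehicle model.

    Looks in the local database first; if the model is unknown and a
    fetch_remote callable is supplied (standing in for an external server),
    asks it instead. Raises LookupError when neither source knows the model.
    """
    if vehicle_model in local_db:
        return list(local_db[vehicle_model])
    if fetch_remote is not None:
        layout = fetch_remote(vehicle_model)
        if layout:
            return list(layout)
    raise LookupError(f"no cabin layout known for {vehicle_model!r}")


layout = load_cabin_layout("sedan-5", LOCAL_SEAT_DB)
```

Returning a copy of the stored list keeps later edits to a registered layout from altering the shared database entry.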

The recognition target registration unit 12 performs one of the following processes according to the operation mode, M1 to M3. In operation mode M1, the recognition target registration unit 12 registers, as a recognition target, the face photograph of a user to be targeted for speech recognition. In operation mode M2, it registers, as a recognition target, a seat to be targeted for speech recognition. In operation mode M3, it registers, as recognition targets, a seat to be targeted for speech recognition in addition to the face photograph of a user to be targeted for speech recognition. As described later, in the present embodiment, the user corresponding to the information registered as a recognition target by the recognition target registration unit 12 becomes the target of speech recognition, and the speech recognition unit 15 performs speech recognition while that user is speaking.
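The three operation modes differ only in which kinds of recognition target are registered, which a small sketch can make explicit. The `Mode` enum and `build_recognition_targets` are hypothetical names chosen for illustration, and the dictionary stands in for the recognition target storage unit 23.

```python
from enum import Enum


class Mode(Enum):
    M1 = "face"            # register face photographs only
    M2 = "seat"            # register seats only
    M3 = "face_and_seat"   # register face photographs and seats


def build_recognition_targets(mode, selected_faces, selected_seats):
    """Return what would be stored as recognition targets for the given mode."""
    targets = {"faces": [], "seats": []}
    if mode in (Mode.M1, Mode.M3):
        targets["faces"] = list(selected_faces)
    if mode in (Mode.M2, Mode.M3):
        targets["seats"] = list(selected_seats)
    return targets


m3_targets = build_recognition_targets(Mode.M3, ["taro.jpg"], ["driver"])
```

In mode M1 any seat selection passed in is ignored, and in mode M2 any face selection is ignored, so the stored targets always reflect the active mode.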

The processing of the recognition target registration unit 12 is described below for each of the operation modes M1 to M3. The user can select the operation mode by predetermined means, for example by performing a touch operation on the touch panel 400.

(Processing of the recognition target registration unit 12 in operation mode M1)
In operation mode M1, the recognition target registration unit 12 registers, as a recognition target, the face photograph of a user to be targeted for speech recognition, based on the occupant list 21a stored in the occupant list storage unit 21.

More specifically, when registering a recognition target, the recognition target registration unit 12 acquires the occupant list 21a stored in the occupant list storage unit 21. Next, the recognition target registration unit 12 displays a face photograph selection screen G1 on the touch panel 400 based on the acquired occupant list 21a.

FIG. 3 is a diagram showing an example of the face photograph selection screen G1. As shown in FIG. 3, the face photograph selection screen G1 has a block BL1 for each user registered in the occupant list 21a. User identification information N1 is displayed on the left side of each block BL1, and a face photograph image F1 is displayed on the right side. The face photograph image F1 is a thumbnail of the face photograph (image data). A check box CK1 is also displayed in each block. By touching the check box CK1, the user can check or uncheck it. The face photograph selection screen G1 also has a confirm button BK1 for confirming the input on the screen. Hereinafter, the person who operates the face photograph selection screen G1 is referred to as the "operator". The person who operates the seat selection screen G2 described later is likewise referred to as the "operator".

For the one or more users to be targeted for speech recognition, the operator checks the check box CK1 corresponding to each user and then touches the confirm button BK1. On detecting that the confirm button BK1 has been touched, the recognition target registration unit 12 acquires the face photographs corresponding to the checked check boxes CK1. Next, the recognition target registration unit 12 registers the acquired face photographs (the face photographs of the users to be targeted for speech recognition) in the recognition target storage unit 23. As a result, the recognition target storage unit 23 holds the face photograph of each user to be targeted for speech recognition.
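The confirm-button handling for the face photograph selection screen can be sketched as a single function. The name `on_confirm_face_selection`, the dictionary shapes, and the sample data are hypothetical; a real implementation would be driven by touch panel events rather than a plain dictionary of check box states.

```python
def on_confirm_face_selection(occupant_list, checked, recognition_targets):
    """Store the face photos of all checked users as recognition targets.

    occupant_list: user ID -> face photo (here just a file name)
    checked: user ID -> state of that user's check box when confirm is pressed
    recognition_targets: stand-in for the recognition target storage unit
    """
    selected = [occupant_list[uid] for uid, is_checked in checked.items() if is_checked]
    recognition_targets["faces"] = selected
    return selected


occupant_list = {"Taro": "taro.jpg", "Hanako": "hanako.jpg", "Jiro": "jiro.jpg"}
checked = {"Taro": True, "Hanako": False, "Jiro": True}
recognition_targets = {}
on_confirm_face_selection(occupant_list, checked, recognition_targets)
```

Replacing the stored list on each confirmation, rather than appending to it, is a design assumption: it makes unchecking a user and confirming again remove that user from the targets.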

As described above, in operation mode M1, the recognition target registration unit 12 registers, as a recognition target, the face photograph of a user selected from the occupant list 21a by user operation. With the simple operation of touching the face photograph selection screen G1 to select users, the operator can make specific occupants among those riding in the vehicle the targets of speech recognition.

(Processing of the recognition target registration unit 12 in operation mode M2)
In operation mode M2, the recognition target registration unit 12 registers, as a recognition target, a seat to be targeted for speech recognition, based on the in-vehicle configuration list 22a stored in the in-vehicle configuration list storage unit 22.

More specifically, when registering a recognition target, the recognition target registration unit 12 acquires the in-vehicle configuration list 22a stored in the in-vehicle configuration list storage unit 22. Next, the recognition target registration unit 12 displays a seat selection screen G2 on the touch panel 400 based on the acquired in-vehicle configuration list 22a.

FIG. 4 is a diagram showing an example of the seat selection screen G2. As shown in FIG. 4, the seat selection screen G2 has an image that schematically represents the vehicle interior, with the seats in which passengers can sit clearly indicated. A check box CK2 is displayed on the seat selection screen G2 in association with each seat. The seat selection screen G2 also has a confirm button BK2 for confirming the input on the screen.

The operator checks the check box CK2 for each of the one or more seats to be targeted for voice recognition and then touches the confirm button BK2. "Targeting a seat for voice recognition" means targeting the user seated in that seat for voice recognition. On detecting that the confirm button BK2 has been touched, the recognition target registration unit 12 acquires the seat position information of each seat whose check box CK2 is checked. It then registers the acquired seat position information (information indicating the position of a seat targeted for voice recognition) in the recognition target storage unit 23. Registering seat position information in the recognition target storage unit 23 corresponds to registering the seat targeted for voice recognition as a recognition target. As a result, the recognition target storage unit 23 holds the seat position information of every seat targeted for voice recognition.

As described above, in operation mode M2, the recognition target registration unit 12 registers, as recognition targets, the seats selected by user operation from the in-vehicle configuration list 22a. By the simple operation of touching the seat selection screen G2 to designate seats, the operator can make the occupants seated in specific seats of the vehicle targets of voice recognition.

(Processing of the recognition target registration unit 12 in operation mode M3)
In operation mode M3, based on the occupant list 21a stored in the occupant list storage unit 21 and the in-vehicle configuration list 22a stored in the in-vehicle configuration list storage unit 22, the recognition target registration unit 12 registers as recognition targets the seats targeted for voice recognition in addition to the face photographs of the users targeted for voice recognition.

That is, in operation mode M3, the recognition target registration unit 12 first causes the touch panel 400 to display the face photograph selection screen G1 and, based on the user operation on that screen, registers the face photographs of the users targeted for voice recognition in the recognition target storage unit 23 as recognition targets. It then causes the touch panel 400 to display the seat selection screen G2 and, based on the user operation on that screen, registers the seats targeted for voice recognition (their seat position information) in the recognition target storage unit 23 as recognition targets. As a result, the recognition target storage unit 23 holds both the face photograph of each user targeted for voice recognition and the seat position information of each seat targeted for voice recognition.

As described above, in operation mode M3, the recognition target registration unit 12 registers the seats selected by user operation as recognition targets in addition to the face photographs selected by user operation. As described later, when a user corresponding to a selected face photograph is seated in a selected seat, that user becomes a target of voice recognition, and the voice recognition processing is executed while that user is speaking. By the simple operation of touching the face photograph selection screen G1 and the seat selection screen G2 to select face photographs and seats, the operator can make a specific occupant seated in a specific seat a target of voice recognition.

The processing of the recognition target registration unit 12 has been described above for each of operation modes M1 to M3. In that description, a face photograph or a seat is selected when the operator touches the screen. In this case, the user's touch operation on the screen corresponds to the "user operation". Alternatively, the selection may be made by gesture input, for example. As one example, the voice recognition device 100 is provided with an infrared sensor used for gesture recognition, and the recognition target registration unit 12 recognizes, as appropriate, the gestures made by the user based on the detection values of the infrared sensor. Gesture recognition may instead be based on the images captured by the front seat camera 301, the rear seat camera 302, or the like. When registering face photographs, the recognition target registration unit 12 displays on the touch panel 400 a user interface on which the users whose face photographs are to be registered can be selected by gesture. For example, it displays the face photograph selection screen G1 shown in FIG. 3 on the touch panel 400. It then detects the operator's gestures and, based on the detected gestures, acquires the face photograph of each user the operator selected. For example, when the face photograph selection screen G1 is displayed as the user interface, the following gestures are defined in advance: a gesture for selecting the user whose check box CK1 is to be checked or unchecked, a gesture for checking or unchecking the check box CK1, and a gesture for operating the confirm button BK1. The recognition target registration unit 12 then identifies the users selected by the operator based on the recognized gestures and registers the face photographs of those users. It executes the same processing when registering seats. In this case, the gesture made by the user corresponds to the "user operation".

The image recognition unit 13 recognizes the images captured by the front seat camera 301 and the rear seat camera 302 installed in the vehicle and determines whether the mouth of an occupant corresponding to a recognition target registered by the recognition target registration unit 12 is moving. In making this determination, the image recognition unit 13 executes different processing depending on the operation mode. The processing of the image recognition unit 13 is described in detail below for each of operation modes M1 to M3.

(Processing of the image recognition unit 13 in operation mode M1)
In operation mode M1, the image recognition unit 13 recognizes the face images of occupants captured by the front seat camera 301 and the rear seat camera 302 and determines whether the mouth of an occupant who is the same person as a user whose face photograph is registered as a recognition target is moving.

In the present embodiment, the image recognition unit 13 executes the following processing in operation mode M1. The front seat camera 301 captures the area corresponding to the front seats of the vehicle at a predetermined cycle and outputs image data based on the captured result to the image recognition unit 13. Likewise, the rear seat camera 302 captures the area corresponding to the rear seats of the vehicle at a predetermined cycle and outputs image data based on the captured result to the image recognition unit 13. Hereinafter, the front seat camera 301 and the rear seat camera 302 may be referred to collectively as simply the "camera", and the image data they output based on the captured results is referred to as "captured image data".

Each time captured image data is input from the camera, the image recognition unit 13 analyzes the data and recognizes the face images it contains. The face images recognized here are images of the faces of the vehicle's occupants. Face images are recognized using existing face recognition technology. The image recognition unit 13 then acquires the face photographs stored in the recognition target storage unit 23 and determines whether the recognized face images include one that matches a face photograph registered in the recognition target storage unit 23 (the face photograph of a user targeted for voice recognition). This determination is made using existing face recognition technology, such as a technique that applies pattern matching with the feature points of the face region of the face photograph as a template. If a recognized face image matches a face photograph registered in the recognition target storage unit 23, the occupant corresponding to that face image is the same person as the user corresponding to the face photograph and is therefore a person targeted for voice recognition. Hereinafter, an occupant of the vehicle who is targeted for voice recognition is referred to as a "recognition target occupant".

When the captured image data contains the face image of a recognition target occupant, the image recognition unit 13 tracks and analyzes that face image across the captured image data input from the camera at the predetermined cycle and determines whether the recognition target occupant's mouth is moving.

In the present embodiment, the image recognition unit 13 determines whether the recognition target occupant's mouth is moving as follows. It identifies the region of the mouth (hereinafter, the "mouth region") in the recognition target occupant's face image contained in the captured image data, and tracks and analyzes the mouth region. Based on the result of the analysis, it determines that "the mouth is moving" when the mouth has moved continuously for a certain time or longer, and that "the mouth is not moving" when the mouth has remained still for a certain time or longer (for example, when it stays closed or stays open). People characteristically move their mouths continuously when speaking. Therefore, when a recognition target occupant's mouth is moving continuously, that occupant is very likely speaking. On this basis, the image recognition unit 13 determines whether the recognition target occupant is speaking by determining whether the occupant's mouth is moving. Because the purpose of this determination is to decide whether the occupant is speaking, the accuracy of the determination may of course be raised by applying predetermined processing so that mouth movements that differ from those of speech, such as the movements made while eating or laughing, are determined as "the mouth is not moving".

On determining that the recognition target occupant's mouth is moving, the image recognition unit 13 notifies the voice recognition control unit 17 of that fact. In the present embodiment, the image recognition unit 13 issues a notification that mouth movement has started (hereinafter, the "operation start notification") at the start of a continuous series of mouth movements, and a notification that mouth movement has stopped (hereinafter, the "operation end notification") at its end. Because the image recognition unit 13 determines that the mouth is moving only after the mouth has moved continuously for a certain time (for example, 0.5 seconds), there is a time lag between the moment the mouth actually starts to move and the moment the operation start notification is issued. As described later, the voice recognition control unit 17 executes processing that compensates for this time lag.
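The determination and notification behaviour described above can be illustrated with a minimal sketch. It is not the implementation of the embodiment: the class name MouthMotionDetector, the per-frame boolean input (whether the mouth region changed in that frame), and the 0.5-second threshold for stillness are assumptions made for illustration. Only the 0.5-second threshold for movement is the example value given in the text.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Tuple


@dataclass
class MouthMotionDetector:
    """Hypothetical sketch: turns per-frame mouth motion into start/end notifications."""
    min_moving_ms: int = 500  # 0.5 s, the example value given in the text
    min_still_ms: int = 500   # assumed; the text gives no value for stillness
    moving: bool = False      # current judgment ("the mouth is moving" or not)
    _run_start: Optional[int] = field(default=None, init=False, repr=False)

    def update(self, t_ms: int, moved: bool) -> Optional[str]:
        """Feed one frame; return "start", "end", or None."""
        if moved == self.moving:
            # The frame agrees with the current judgment, so any contrary run is broken.
            self._run_start = None
            return None
        if self._run_start is None:
            self._run_start = t_ms  # first frame of a contrary run
        needed = self.min_moving_ms if moved else self.min_still_ms
        if t_ms - self._run_start >= needed:
            # The contrary run lasted long enough: flip the judgment and notify.
            self.moving = moved
            self._run_start = None
            return "start" if moved else "end"
        return None


def notifications(frames: Iterable[Tuple[int, bool]]) -> List[Tuple[int, str]]:
    """Run a fresh detector over (time_ms, moved) frames and collect notifications."""
    detector = MouthMotionDetector()
    events = []
    for t_ms, moved in frames:
        kind = detector.update(t_ms, moved)
        if kind is not None:
            events.append((t_ms, kind))
    return events
```

In this sketch, a mouth that starts moving at 1000 ms produces the start notification only at 1500 ms, which is the time lag that the voice recognition control unit 17 later compensates for.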

(Processing of the image recognition unit 13 in operation mode M2)
In operation mode M2, the image recognition unit 13 recognizes the face images of occupants captured by the front seat camera 301 and the rear seat camera 302 and determines whether the mouth of an occupant in a seat registered as a recognition target is moving.

In the present embodiment, the image recognition unit 13 executes the following processing in operation mode M2. It first acquires the seat position information stored in the recognition target storage unit 23. Then, based on the acquired seat position information and the captured image data input from the camera, it recognizes the face images contained in the captured image data and identifies, among them, the face image of the occupant seated in the seat at the position indicated by the seat position information. The occupant corresponding to the identified face image is a recognition target occupant. The image recognition unit 13 then tracks and analyzes the recognition target occupant's face image in the captured image data and determines whether the recognition target occupant's mouth is moving. Depending on the result of that determination, it issues the operation start notification and the operation end notification described above to the voice recognition control unit 17 as appropriate.

(Processing of the image recognition unit 13 in operation mode M3)
In operation mode M3, the image recognition unit 13 recognizes the face images of occupants captured by the front seat camera 301 and the rear seat camera 302 and determines whether an occupant in a seat registered as a recognition target is a user whose face photograph is registered as a recognition target and whether that occupant's mouth is moving.

In the present embodiment, the image recognition unit 13 executes the following processing in operation mode M3. It first acquires the face photographs and the seat position information stored in the recognition target storage unit 23. Then, based on the acquired face photographs and seat position information and on the captured image data input from the camera, it recognizes the face images contained in the captured image data and identifies, among them, the face image of the occupant seated in the seat at the position indicated by the seat position information. It then determines whether the identified face image (the face image of the occupant seated in the seat at the position indicated by the seat position information) matches an acquired face photograph. An occupant whose face image is determined to match an acquired face photograph is seated in a seat registered as a recognition target and is a user corresponding to a face photograph registered as a recognition target, and is therefore a person targeted for voice recognition. That is, the occupant is a "recognition target occupant".

The image recognition unit 13 then tracks and analyzes the recognition target occupant's face image in the captured image data and determines whether the recognition target occupant's mouth is moving. Depending on the result of that determination, it issues the operation start notification and the operation end notification described above to the voice recognition control unit 17 as appropriate.
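How the recognition target occupants differ across operation modes M1 to M3 can be summarized in a short sketch. It assumes that face recognition has already been performed elsewhere and has yielded, for each detected face, a seat label and (if a registered face photograph matched) a user identifier. The names DetectedFace, target_occupants, and the seat and user labels are hypothetical and do not appear in the embodiment.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class DetectedFace:
    """Hypothetical result of face recognition for one occupant."""
    seat: str                # seat in which the face was found
    user_id: Optional[str]   # matched registered user, or None if no photograph matched


def target_occupants(
    mode: str,
    faces: List[DetectedFace],
    registered_users: FrozenSet[str],
    registered_seats: FrozenSet[str],
) -> List[DetectedFace]:
    """Return the faces that belong to recognition target occupants in the given mode."""
    def is_target(face: DetectedFace) -> bool:
        user_ok = face.user_id is not None and face.user_id in registered_users
        seat_ok = face.seat in registered_seats
        if mode == "M1":   # registered face photograph only
            return user_ok
        if mode == "M2":   # registered seat only
            return seat_ok
        if mode == "M3":   # registered user seated in a registered seat
            return user_ok and seat_ok
        raise ValueError(f"unknown operation mode: {mode}")

    return [face for face in faces if is_target(face)]
```

Only the faces returned by such a selection step would then be tracked for mouth movement.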

The audio processing unit 14 performs the necessary audio processing, such as A/D conversion, on the sound picked up by the microphone 200 to generate an audio signal, and buffers the signal in the buffer 31. The buffer 31 is a storage area formed in volatile memory such as RAM. As a result of the processing by the audio processing unit 14, the buffer 31 cumulatively holds the audio signals based on the sound picked up by the microphone 200 during a predetermined period going back from the present. Hereinafter, the set of audio signals stored in the buffer 31 is referred to as "recorded voice data".
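A buffer that retains only the most recent period of audio can be sketched as follows. This is an illustrative assumption, not the structure of the buffer 31 itself: the class name AudioBuffer, the use of timestamped chunks, and the retention value are all hypothetical.

```python
from collections import deque
from typing import Any, Deque, List, Tuple


class AudioBuffer:
    """Hypothetical sketch of a buffer holding the last `retention_ms` of audio chunks."""

    def __init__(self, retention_ms: int) -> None:
        self.retention_ms = retention_ms
        self._chunks: Deque[Tuple[int, Any]] = deque()

    def append(self, t_ms: int, chunk: Any) -> None:
        """Store a chunk captured at t_ms and discard chunks older than the retention period."""
        self._chunks.append((t_ms, chunk))
        while self._chunks and self._chunks[0][0] <= t_ms - self.retention_ms:
            self._chunks.popleft()

    def read(self, from_ms: int, to_ms: int) -> List[Any]:
        """Return the chunks captured between from_ms and to_ms (both inclusive)."""
        return [chunk for t_ms, chunk in self._chunks if from_ms <= t_ms <= to_ms]
```

The retention period must be at least as long as the look-back time used when recognition starts; otherwise the audio preceding the operation start notification would already have been discarded.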

The voice recognition unit 15 recognizes the voice input from the microphone 200 installed in the vehicle. In the present embodiment, the voice recognition unit 15 sequentially reads the recorded voice data stored in the buffer 31 and determines whether the voice recorded in that data contains a word (utterance command) registered in advance. In the present embodiment, a word is a character string or the like that causes the electronic device 500 to execute specific processing corresponding to the word. For example, when the electronic device 500 is a navigation device, the registered words include a word that causes the current position of the vehicle to be displayed (for example, 「現在地表示」, "display current position") and a word that starts route guidance to the destination (for example, 「ナビ開始」, "start navigation"). The voice recognition unit 15 makes this determination using existing voice recognition technology, such as comparing the voice patterns of the registered words with the voice pattern of the voice recorded in the recorded voice data. On determining that a word is contained, the voice recognition unit 15 notifies the electronic device control unit 16 of that word.

When notified of a word by the voice recognition unit 15, the electronic device control unit 16 outputs to the electronic device 500 a control signal that causes the electronic device 500 to execute the processing corresponding to the notified word.
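The path from a recognized word to a control signal can be pictured with a small lookup sketch. It assumes, for simplicity, that the recognizer has already produced a text transcript; the embodiment instead compares voice patterns. The two words are the examples given in the text, while the signal names and the function control_signals are hypothetical.

```python
from typing import Dict, List

# Registered words mapped to hypothetical control signal names.
COMMANDS: Dict[str, str] = {
    "現在地表示": "SHOW_CURRENT_POSITION",
    "ナビ開始": "START_ROUTE_GUIDANCE",
}


def control_signals(transcript: str, commands: Dict[str, str] = COMMANDS) -> List[str]:
    """Return the control signals for every registered word found, in order of first appearance."""
    hits = []
    for word, signal in commands.items():
        position = transcript.find(word)
        if position != -1:
            hits.append((position, signal))
    return [signal for _, signal in sorted(hits)]
```

Speech that contains no registered word yields an empty list, so no control signal is sent to the electronic device.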

The voice recognition control unit 17 performs control so that the processing of the voice recognition unit 15 is carried out when the image recognition unit 13 determines that the mouth of an occupant corresponding to a recognition target is moving.

In the present embodiment, the voice recognition control unit 17 controls the voice recognition unit 15 based on the operation start notification and the operation end notification from the image recognition unit 13. More specifically, on receiving an operation start notification, the voice recognition control unit 17 causes the voice recognition unit 15 to start reading the recorded voice data stored in the buffer 31 from the position a predetermined time (for example, 0.5 seconds) before the timing of the operation start notification, and to start voice recognition (determining whether the voice contains a word). On receiving an operation end notification, the voice recognition control unit 17 causes the voice recognition unit 15 to read the recorded voice data, and to perform voice recognition on the data read, up to the position corresponding to the timing of the operation end notification, and then to stop voice recognition. As a result, the voice recognition unit 15 performs voice recognition on the sound picked up by the microphone 200 during the period from the predetermined time before the operation start notification until the operation end notification. As described above, the period from the operation start notification to the operation end notification is the period during which the image recognition unit 13 determined that the recognition target occupant's mouth was moving. Accordingly, in the present embodiment, the voice recognition unit 15 performs voice recognition on the sound picked up by the microphone 200 during the period in which the recognition target occupant's mouth is moving (the period in which the recognition target occupant is very likely speaking).
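The control described above amounts to converting the notifications into time windows over the recorded voice data. The following is a minimal sketch under stated assumptions: notifications arrive as (time in milliseconds, "start" or "end") pairs, audio is held as timestamped chunks, and window boundaries are treated as inclusive. The function names are hypothetical, and the 500 ms look-back is the example value from the text.

```python
from typing import Any, List, Optional, Tuple


def recognition_windows(
    events: List[Tuple[int, str]], lookback_ms: int = 500
) -> List[Tuple[int, int]]:
    """Turn start/end notifications into (from_ms, to_ms) windows for voice recognition."""
    windows = []
    open_from: Optional[int] = None
    for t_ms, kind in events:
        if kind == "start" and open_from is None:
            # Begin reading earlier than the notification to cover the detection time lag.
            open_from = max(0, t_ms - lookback_ms)
        elif kind == "end" and open_from is not None:
            windows.append((open_from, t_ms))
            open_from = None
    return windows


def audio_in_windows(
    chunks: List[Tuple[int, Any]], windows: List[Tuple[int, int]]
) -> List[Any]:
    """Keep only the chunks whose timestamp falls inside some recognition window."""
    return [
        chunk
        for t_ms, chunk in chunks
        if any(lo <= t_ms <= hi for lo, hi in windows)
    ]
```

Audio captured outside every window is never passed to the recognizer, which is how speech from occupants who are not recognition targets is kept out of voice recognition in this sketch.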

The reason the voice recognition control unit 17 causes the voice recognition unit 15 to read the recorded voice data stored in the buffer 31 from a position a predetermined time earlier than the present is as follows. As described above, there is a time lag between the moment the recognition target occupant's mouth actually starts to move and the moment the image recognition unit 13 issues the operation start notification. By having the voice recognition unit 15 read the recorded voice data from a position a predetermined time earlier than the present (the time of the operation start notification), the recognition target occupant's speech can be prevented from falling outside the scope of voice recognition by the voice recognition unit 15 because of this time lag.

In addition, the voice recognition control unit 17 controls the voice recognition unit 15 so that voice recognition by the voice recognition unit 15 is stopped during the period from an operation end notification from the image recognition unit 13 until the next operation start notification. That period is the period during which the image recognition unit 13 determined that the recognition target occupant's mouth was not moving.

As described above, in the present embodiment, the sound picked up by the microphone 200 while the recognition target occupant's mouth is moving is subjected to voice recognition by the voice recognition unit 15, whereas the sound picked up by the microphone 200 at any other time is not. This provides the following effects.

As described above, when the recognition target occupant's mouth is moving, that occupant is very likely speaking. According to the present embodiment, therefore, only the speech of recognition target occupants, that is, occupants designated as targets of voice recognition, can be subjected to voice recognition. Conventionally, making only the speech of specific occupants the target of voice recognition required creating and registering a voice profile for every word for each targeted occupant. According to the present embodiment, only the speech of specific occupants can be made the target of voice recognition (triggerless voice recognition) without such registration. The complicated work of creating and registering voice profiles is therefore unnecessary, which is highly convenient for the user.

FIG. 5 is a flowchart showing an operation example of the voice recognition device 100 according to the present embodiment. The flowchart shows an example of the overall operation of the voice recognition device 100 when settings related to voice recognition are made after the voice recognition device 100 is powered on.

As shown in FIG. 5, after the voice recognition device 100 is powered on, the recognition target registration unit 12 performs registration concerning the users targeted for voice recognition in accordance with the user's instructions (step S1). As described above, the recognition target registration unit 12 registers in the recognition target storage unit 23 the face photographs of the users targeted for voice recognition in operation mode M1, the seats targeted for voice recognition in operation mode M2, and both the face photographs of the targeted users and the targeted seats in operation mode M3.

After registration by the recognition target registration unit 12, the image recognition unit 13 starts recognizing the face images included in the captured image data and, based on that recognition, determining whether the mouth of the recognition target occupant is moving (step S2). The audio processing unit 14 starts buffering, in the buffer 31, the audio signal based on the sound collected by the microphone 200 (step S3). The voice recognition control unit 17 starts controlling the voice recognition unit 15 according to the determination result of the image recognition unit 13 (step S4). With these steps, the processing for voice recognition that targets the speech of the recognition target occupant (hereinafter referred to as the "specific voice recognition process") begins.

FIG. 6 is a flowchart showing an operation example of the image recognition unit 13 during execution of the specific voice recognition process.

As shown in FIG. 6, the image recognition unit 13 acquires the information registered in the recognition target storage unit 23 (step S11). That information is a face photograph in operation mode M1, seat position information in operation mode M2, and both a face photograph and seat position information in operation mode M3. Hereinafter, regardless of the operation mode, the information registered in the recognition target storage unit 23 is collectively referred to as "registration information".

Next, based on the registration information acquired in step S11 and the captured image data input from the camera, the image recognition unit 13 performs the processing described above and identifies, among the face images included in the captured image data, the face image corresponding to the recognition target occupant specified by the registration information (step S12). A plurality of face images may be identified in step S12. In that case, the image recognition unit 13 executes steps S13 to S16 described below in parallel for each of those face images.

Next, for the captured image data input intermittently from the camera, the image recognition unit 13 tracks and analyzes the face image identified in step S12 and monitors whether continuous movement of the recognition target occupant's mouth has started (step S13). Step S13 continues until the start of continuous mouth movement is detected. On detecting that continuous movement of the recognition target occupant's mouth has started (step S13: YES), the image recognition unit 13 issues the operation start notification described above (step S14).

Next, the image recognition unit 13 monitors whether the continuous movement of the recognition target occupant's mouth has stopped (step S15). Step S15 continues until the stop of continuous mouth movement is detected. On detecting that the continuous movement of the recognition target occupant's mouth has stopped (step S15: YES), the image recognition unit 13 issues the operation end notification described above (step S16). After step S16, the procedure returns to step S13. As a result of this processing, the image recognition unit 13 issues the operation start notification and the operation end notification at timings that follow the movement of the recognition target occupant's mouth.
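The loop of steps S13 to S16 behaves like a two-state machine, which may be sketched as below. The sketch assumes that an upstream image analysis already yields, for each captured frame, a boolean indicating whether continuous mouth movement is present; the function name `mouth_notifications` and the `("start", index)` / `("end", index)` tuples are hypothetical and chosen only for illustration.

```python
def mouth_notifications(moving_flags):
    """Emit start/end notifications from per-frame mouth-movement flags.

    moving_flags: iterable of booleans, one per captured frame (assumed input).
    Returns a list of ("start", frame_index) and ("end", frame_index) tuples.
    """
    events = []
    speaking = False  # False: waiting in step S13; True: waiting in step S15
    for index, moving in enumerate(moving_flags):
        if not speaking and moving:
            # Step S13: YES, so issue the operation start notification (S14).
            events.append(("start", index))
            speaking = True
        elif speaking and not moving:
            # Step S15: YES, so issue the operation end notification (S16),
            # then return to monitoring for the next start (S13).
            events.append(("end", index))
            speaking = False
    return events
```

When a plurality of face images is tracked, one such state would be kept per face image, matching the statement that steps S13 to S16 run in parallel for each of them.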

FIG. 7 is a flowchart showing an operation example of the voice recognition control unit 17 during execution of the specific voice recognition process.

As shown in FIG. 7, the voice recognition control unit 17 monitors whether an operation start notification has been issued by the image recognition unit 13 (step S21). Step S21 continues until an operation start notification is received. When an operation start notification is received (step S21: YES), the voice recognition control unit 17 causes the voice recognition unit 15 to start voice recognition (step S22). As described above, when voice recognition starts, the voice recognition control unit 17 causes the voice recognition unit 15 to read the recorded voice data from a position a predetermined time before the timing of the operation start notification.

Next, the voice recognition control unit 17 monitors whether an operation end notification has been issued by the image recognition unit 13 (step S23). Step S23 continues until an operation end notification is received. When an operation end notification is received (step S23: YES), the voice recognition control unit 17 causes the voice recognition unit 15 to stop voice recognition (step S24). As described above, when voice recognition stops, the voice recognition control unit 17 first causes the voice recognition unit 15 to read the recorded voice data up to the position corresponding to the timing of the operation end notification and to perform voice recognition on the data read, and then stops the execution of voice recognition. After step S24, the procedure returns to step S21. As a result of this processing, the voice recognition unit 15 performs voice recognition on the sound collected by the microphone 200 only while the mouth of the recognition target occupant is moving.
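The way steps S21 to S24 select a span of the buffered audio can be sketched as follows. The sketch assumes that the buffer 31 can be modeled as a simple list of samples and that notifications arrive as `(kind, position)` pairs indexing into that list; `recognition_segments` and `pre_roll` are hypothetical names, and the actual buffer layout and the length of the predetermined time are not specified in the text.

```python
def recognition_segments(buffer, events, pre_roll):
    """Return the buffered samples to hand to voice recognition.

    buffer:   list of audio samples (stand-in for the buffer 31).
    events:   list of ("start", position) / ("end", position) pairs.
    pre_roll: number of samples to back up before each start position,
              standing in for the "predetermined time" of the text.
    """
    segments = []
    start = None
    for kind, position in events:
        if kind == "start" and start is None:
            # Step S22: begin reading a predetermined time before the
            # operation start notification, clamped to the buffer start.
            start = max(0, position - pre_roll)
        elif kind == "end" and start is not None:
            # Step S24: read up to the position of the operation end
            # notification, then stop until the next start notification.
            segments.append(buffer[start:position])
            start = None
    return segments
```

Each returned segment corresponds to one interval during which the recognition target occupant's mouth was moving, extended backward by the pre-roll.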

As described in detail above, in the present embodiment, at least one of the face photograph of the user targeted for voice recognition and the seat targeted for voice recognition is registered as a recognition target; an image captured by the camera installed in the vehicle is recognized; and when the mouth of the occupant corresponding to the registered recognition target is determined to be moving, the voice input from the microphone 200 installed in the vehicle is recognized.

Consequently, once a face photograph or a seat is registered as the target of voice recognition, only part of the voice constantly input from the microphone 200 is subjected to voice recognition processing: the voice input while the mouth of the specific occupant whose face photograph is registered, or of the specific occupant sitting in the registered seat, is determined to be moving. Only the speech of a specific occupant can thus be made the target of voice recognition (triggerless voice recognition) without the burdensome work of registering voice profiles.

In the embodiment described above, the screen used when the user selects a face photograph was described using the example of FIG. 3, but the content of that screen is not limited to the content illustrated in FIG. 3. The same applies to the screen used when the user selects a seat.

In the embodiment described above, in operation mode M3, the image recognition unit 13 determines that an occupant is the recognition target occupant when the occupant in the seat registered as the recognition target is the user whose face photograph is registered as the recognition target. Alternatively, the image recognition unit 13 may operate as follows. The image recognition unit 13 recognizes the face images of the occupants captured by the camera, determines whether the mouth of the occupant who is the same person as the user whose face photograph is registered as the recognition target is moving, and also determines whether the mouth of the occupant in the seat registered as the recognition target is moving. In this case, the occupant sitting in the seat registered as the recognition target and the occupant corresponding to the user whose face photograph is registered as the recognition target are each determined to be a recognition target occupant, and the voice recognition unit 15 performs voice recognition, under the control of the voice recognition control unit 17, for the period during which the mouth of a recognition target occupant is moving.
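The difference between the two interpretations of operation mode M3 comes down to combining the face match and the seat match with a logical AND or a logical OR. A minimal sketch, assuming that face matching has already produced a user identifier for each occupant and that seats are identified by labels (both assumptions, as are all names below):

```python
def is_target_both(occupant_user, occupant_seat, registered_user, registered_seat):
    """Embodiment described first: the occupant must be the registered user
    AND must be sitting in the registered seat."""
    return occupant_user == registered_user and occupant_seat == registered_seat


def is_target_either(occupant_user, occupant_seat, registered_user, registered_seat):
    """Alternative: the occupant is a recognition target occupant if they are
    the registered user OR are sitting in the registered seat."""
    return occupant_user == registered_user or occupant_seat == registered_seat
```

With user "A" and the driver's seat registered, an occupant "B" sitting in the driver's seat is rejected by the first predicate and accepted by the second.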

In the embodiment described above, a single microphone 200 is provided in the vehicle and collects the speech of the occupants riding in the vehicle. The following configuration may be used instead. A microphone 200 is provided near each seat of the vehicle. The microphone 200 provided near a given seat collects the speech of the occupant sitting in that seat. Each microphone 200 can be switched on and off by the image recognition unit 13.

The image recognition unit 13 performs the processing described above to determine whether the recognition target occupant is moving his or her mouth. When it determines that movement of the recognition target occupant's mouth has started, it turns on the microphone 200 provided near the seat in which the recognition target occupant is sitting and issues an operation start notification to the voice recognition control unit 17. In the embodiment described above, the image recognition unit 13 determines that the mouth is moving when the mouth has moved continuously for a certain time or longer. In this configuration, by contrast, the image recognition unit 13 determines that continuous mouth movement has started as soon as the recognition target occupant's mouth changes from a non-moving state to a moving state, and immediately turns on the microphone 200 and issues the operation start notification. When the image recognition unit 13 determines, after mouth movement has started, that the continuous mouth movement has ended, it turns off the microphone 200 and issues an operation end notification to the voice recognition control unit 17. Based on the notifications from the image recognition unit 13, the voice recognition control unit 17 causes the recorded voice data to be read from the position corresponding to the timing of the operation start notification to the position corresponding to the timing of the operation end notification, and causes voice recognition to be performed on the data read.
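The per-seat microphone variant can be sketched as a small controller that switches the microphone near a seat and returns the matching notification. The class name `SeatMicrophones`, its method names, and the seat labels are hypothetical; the specification only states that the image recognition unit 13 turns the microphone 200 on or off and issues the notifications.

```python
class SeatMicrophones:
    """Tracks the on/off state of one microphone per seat (illustrative only)."""

    def __init__(self, seats):
        # All microphones start in the off state.
        self.is_on = {seat: False for seat in seats}

    def on_mouth_started(self, seat):
        # Mouth changed from not moving to moving: turn the nearby
        # microphone on at once and issue the operation start notification.
        self.is_on[seat] = True
        return ("start", seat)

    def on_mouth_stopped(self, seat):
        # Continuous mouth movement ended: turn the microphone off and
        # issue the operation end notification.
        self.is_on[seat] = False
        return ("end", seat)
```

Because the microphone is switched on only at the moment mouth movement begins, no audio from before the start notification exists in this variant, which is consistent with reading the recorded voice data from the position of the start notification rather than from an earlier position.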

As a result, while the mouth of the recognition target occupant is moving, sound is collected by the microphone 200 located near that occupant, and the voice recognition unit 15 performs voice recognition on the collected sound (the speech of the recognition target occupant). Because the speech of the recognition target occupant is collected by the microphone 200 located near that occupant, the likelihood that noise is included in the sound collected by the microphone 200 is reduced, and highly accurate voice recognition can be achieved. In addition, the microphone 200 is not turned on unnecessarily, so resource consumption can be effectively suppressed.

In addition, each of the embodiments described above is merely one example of how the present invention may be embodied, and the technical scope of the present invention must not be construed restrictively on that basis. That is, the present invention can be implemented in various forms without departing from its gist or its main features.

11 List registration unit (occupant list registration unit)
12 Recognition target registration unit
13 Image recognition unit
15 Voice recognition unit
17 Voice recognition control unit
22 In-vehicle configuration list storage unit
100 Voice recognition device

Claims

A voice recognition device comprising:
a recognition target registration unit that registers, as a recognition target, a face photograph of a user for whom voice recognition is to be performed;
a voice recognition unit that recognizes voice input from a microphone installed in a vehicle;
an image recognition unit that recognizes an image captured by a camera installed in the vehicle and determines whether the mouth of an occupant corresponding to the recognition target registered by the recognition target registration unit is moving; and
a voice recognition control unit that controls the voice recognition unit to perform its processing when the image recognition unit determines that the mouth of the occupant corresponding to the recognition target is moving.

The voice recognition device according to claim 1, wherein the image recognition unit recognizes a face image of the occupant captured by the camera and determines whether the mouth of the occupant who is the same person as the user whose face photograph is registered as the recognition target is moving.

The voice recognition device according to claim 1 or 2, further comprising an occupant list registration unit that registers face photographs of a plurality of users as an occupant list,
wherein the recognition target registration unit registers, as the recognition target, the face photograph of a user selected from the occupant list by a user operation.

The voice recognition device according to claim 1, wherein the recognition target registration unit registers, as the recognition target, a seat for which the voice recognition is to be performed, instead of the face photograph of the user, and
the image recognition unit recognizes a face image of the occupant captured by the camera and determines whether the mouth of the occupant in the seat registered as the recognition target is moving.

The voice recognition device according to claim 4, further comprising an in-vehicle configuration list storage unit that stores a plurality of seats as an in-vehicle configuration list,
wherein the recognition target registration unit registers, as the recognition target, a seat selected from the in-vehicle configuration list by a user operation.

The voice recognition device according to claim 1, wherein the recognition target registration unit registers, as the recognition target, a seat for which the voice recognition is to be performed, in addition to the face photograph of the user, and
the image recognition unit recognizes a face image of the occupant captured by the camera and determines whether the occupant in the seat registered as the recognition target is the user whose face photograph is registered as the recognition target and whether the mouth of that occupant is moving.

The voice recognition device according to claim 1, wherein the recognition target registration unit registers, as the recognition target, a seat for which the voice recognition is to be performed, in addition to the face photograph of the user, and
the image recognition unit recognizes a face image of the occupant captured by the camera, determines whether the mouth of the occupant who is the same person as the user whose face photograph is registered as the recognition target is moving, and determines whether the mouth of the occupant in the seat registered as the recognition target is moving.

A voice recognition method comprising:
a first step of registering, as a recognition target, at least one of a face photograph of a user for whom voice recognition is to be performed and a seat for which the voice recognition is to be performed;
a second step in which an image recognition unit of a voice recognition device recognizes an image captured by a camera installed in a vehicle and determines whether the mouth of an occupant corresponding to the recognition target registered by the recognition target registration unit is moving; and
a third step in which a voice recognition control unit of the voice recognition device controls a voice recognition unit to perform processing for recognizing voice input from a microphone installed in the vehicle when the image recognition unit determines that the mouth of the occupant corresponding to the recognition target is moving.