JP2019008274A

JP2019008274A - Voice information processing system, control method of voice information processing system, program of voice information processing system and storage medium

Info

Publication number: JP2019008274A
Application number: JP2018075244A
Authority: JP
Inventors: 真人藤野; Masato Fujino
Original assignee: Fairy Devices Inc
Current assignee: Fairy Devices Inc
Priority date: 2017-06-26
Filing date: 2018-04-10
Publication date: 2019-01-17
Also published as: JP2020042292A

Abstract

PROBLEM TO BE SOLVED: To present a technical idea that optimizes processing by analyzing, parsing, recognizing, evaluating, and correcting voice content with high accuracy and at low cost, and that grasps more accurately the situation in which a conversation is taking place. SOLUTION: The system includes input means for inputting voice information relating to voice; pre-stage processing means for performing, on the input voice information, pre-stage processing that facilitates identification processing; and optimization means that applies predetermined processing to the voice information processed by the pre-stage processing means, performs task processing based on first information, corrects the first information when the evaluation of the task processing is insufficient, and optimizes by repeating the series of processing until the evaluation becomes sufficient. [Selected drawing] FIG. 5

Description

The present invention relates to voice information processing in conversation, and in particular to a voice information processing system, a control method for the voice information processing system, a program for the voice information processing system, and a recording medium.

In recent years, voice information processing technology has advanced remarkably. Disclosed examples include a voice dialogue system that lets the user easily grasp the state of the system so that the user and the system can always converse smoothly (see, for example, Patent Document 1); a dialogue recording system that accepts complaints and the like by voice and conveys them to a handler in a form usable for subsequent processing (see, for example, Patent Document 2); and an electronic device with a dialogue function that can converse smoothly with a user (see, for example, Patent Document 3).

The invention described in Patent Document 1 is a voice dialogue system comprising a microphone, voice input means, voice analysis means, voice recognition means, syntax analysis means, intention extraction means, dialogue management means, problem solving means, response sentence generation means, voice synthesis means, voice output means, a speaker, and a plurality of intermediate response processing means. The intermediate response processing means take as input the processing result of any one or more of the input-side means (the voice input means, voice analysis means, voice recognition means, syntax analysis means, and intention extraction means) and output their processing result to one or more of the output-side means (the voice output means, voice synthesis means, and response sentence generation means).

The invention described in Patent Document 2 comprises a recording device that records voice data of a dialogue, and an information processing device that generates an identifier for identifying a specific location in the recorded voice data and causes the recording device to record it. The information processing device accepts a request to generate an identifier for voice data recorded in the recording device, generates the identifier, and records it in the recording device in association with the voice data to be recorded. The recording device holds the voice data and the identifier data, as well as text data obtained when a voice recognition unit performs voice recognition on the voice data.

In the invention described in Patent Document 3, a refrigerator has a microphone and a speaker and provides a dialogue function that acquires voice and speaks in response to it. The refrigerator includes a position specifying unit that specifies the position of a user within a predetermined range near the refrigerator, a microphone control unit that adjusts the microphone sensitivity, and a speaker control unit that adjusts the speaker volume, so that each takes a value corresponding to the user position specified by the position specifying unit.

Patent Document 1: Japanese Patent No. 3454897
Patent Document 2: JP 2000-067064 A
Patent Document 3: JP 2017-069835 A

However, with the invention described in Patent Document 1, an echo (parroting) response or a back-channel response lets the user recognize that his or her utterance has been input as voice, so the user can proceed to the next utterance with confidence; but because fixed phrases are used, the system neither evaluates irregular utterances mixed with noise or echo nor corrects the fixed phrases.

The invention described in Patent Document 2 determines whether a predetermined reference value is reached; when the value is below the reference, it judges that speech has paused, and when the reference value is subsequently exceeded, it judges that speech has begun and outputs a cue signal. However, it performs no processing to grasp the emotions of the other party in the dialogue or to remedy mishearing.

Furthermore, the invention described in Patent Document 3 responds passively, speaking in reply to acquired voice, but it does not proactively speak to the other party.

The present application is intended to solve these problems. It aims to present a technical idea that optimizes processing by analyzing, parsing, recognizing, evaluating, and correcting voice content with high accuracy and at low cost, and that grasps more accurately the situation in which a conversation is taking place.

To solve the above problems, the invention according to claim 1 comprises: input means for inputting voice information relating to voice; pre-stage processing means for performing, on the input voice information, pre-stage processing that facilitates identification processing; and optimization means that applies predetermined processing to the voice information processed by the pre-stage processing means, performs task processing based on first information, corrects the first information when the evaluation of the task processing is insufficient, and optimizes by repeating the series of processing until the evaluation becomes sufficient.

Here, "voice" means sound waves that include physical sounds (for example, the sound of knocking on a desk or door), human voices, and noise (for example, sirens, animal cries, and sneezes).

The "first information" is information that includes the scenario design associated with application software (hereinafter, "app") and the flow that specifies which of the various means to select, in what order and how to execute them, how to evaluate the result, and how to repeat when the evaluation is insufficient.

The invention according to claim 2 is characterized in that, in addition to the configuration of claim 1, the optimization means comprises: first evaluation means for evaluating the result of the task processing; correction means for correcting the first information when the evaluation is insufficient; and repetition means for repeating the series of processing from the pre-stage processing means to the correction means.
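As a non-limiting illustration, the evaluate, correct, and repeat cycle of claims 1 and 2 might be sketched as follows. The names `FirstInfo`, `optimize`, `noise_gate`, `threshold`, and `max_rounds` are hypothetical and do not appear in the specification; in particular, the upper bound on the number of rounds is an added assumption so that the loop always terminates.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class FirstInfo:
    """Hypothetical stand-in for the "first information" (flow settings)."""
    noise_gate: int  # illustrative tunable parameter, not taken from the source


def optimize(
    audio: List[int],
    info: FirstInfo,
    preprocess: Callable[[List[int], FirstInfo], List[int]],
    run_task: Callable[[List[int], FirstInfo], int],
    evaluate: Callable[[int], float],
    revise: Callable[[FirstInfo, float], FirstInfo],
    threshold: float = 0.8,   # assumed meaning of a "sufficient" evaluation
    max_rounds: int = 10,     # assumed safety bound, not in the source
) -> Tuple[int, FirstInfo, int]:
    """Repeat preprocess -> task -> evaluate, revising info until the score suffices."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    for round_no in range(1, max_rounds + 1):
        cleaned = preprocess(audio, info)  # pre-stage processing
        result = run_task(cleaned, info)   # task processing based on the first information
        score = evaluate(result)           # first evaluation means
        if score >= threshold:
            break
        if round_no < max_rounds:
            info = revise(info, score)     # correction means
    return result, info, round_no


# Toy usage: raise a noise gate until exactly two loud samples remain.
samples = [5, 20, 90, 15, 80]
result, final_info, rounds = optimize(
    samples,
    FirstInfo(noise_gate=0),
    preprocess=lambda a, i: [x for x in a if x >= i.noise_gate],
    run_task=lambda cleaned, i: len(cleaned),
    evaluate=lambda n: 1.0 if n == 2 else 0.0,
    revise=lambda i, s: replace(i, noise_gate=i.noise_gate + 10),
)
```

Passing the four stages in as callables keeps the loop itself independent of any particular signal processing or task, which mirrors the way the claims describe the stages as separate means.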

The invention according to claim 3 is characterized in that, in addition to the configuration of claim 1, it comprises determination means which, when the content of the voice is analyzed and a response is made in a room, causes the client-side voice input/output device to perform the information processing if its processing capability is sufficient, and causes the cloud side to perform the information processing if it is not.

The invention according to claim 4 is characterized in that, in addition to the configuration of claim 1, it comprises an external system that manages the setting of the indoor environment, intentional interpretation, and dialogue.

Here, "intentional interpretation" means estimating the speaker's intention and producing an interpretation that reflects the estimation result.

The invention according to claim 5 is characterized in that, in addition to the configuration of claim 1, the external system comprises at least one of: imaging means for imaging the outside of the housing of the voice input/output device; vibration means for vibrating the housing; rotation means for rotating the housing; and projection means for projecting an image onto a wall outside the housing.

The invention according to claim 6 is characterized in that, in addition to the configuration of claim 1, external content can be used for the intentional interpretation and for the management of the dialogue.

The invention according to claim 7 is characterized in that, in addition to the configuration of claim 1, it comprises: inference means that performs information processing, including analysis processing, parsing processing, and recognition processing of the content of the voice, and infers attributes of the speaker including age and gender; collection means for collecting logs used when designing second information; analysis means for analyzing the logs; and second evaluation means for evaluating the response and the second information; and in that optimization is achieved through continuous improvement based on the evaluation.

Here, the "second information" is information about the flow that specifies which of the various means to use, in what order to process, how to evaluate, and how to repeat when the result is insufficient.

The recognition processing recognizes, from the collected voice information, not only the speaker but also laughter, applause, calls, and the like, and further analyzes, parses, and recognizes environmental sounds. From these results it performs speaker identification, gender estimation, age estimation, and the like, and from intonation determination it provides various information such as the speaker's place of origin.

The invention according to claim 8 is characterized in that, in addition to the configuration of claim 7, the inference means comprises interpretation means for intentionally interpreting the dialogue with the speaker, and management means for managing the dialogue with the speaker.

Here, managing the dialogue with the speaker includes, for the purpose of improving customer satisfaction, recording what emotion the speaker felt in response to which utterance and, when the client-side voice input/output device is used in a call center, alerting the operator or reporting to a manager. When the client-side voice input/output device is used in a meeting and an attendee becomes emotional, it also includes inserting a break to calm the attendee down or uttering a voice message that encourages calm.

The invention according to claim 9 is characterized in that, in addition to the configuration of claim 4, the environment determination means comprises: size determination means for determining the size of the room; noise level recognition means for recognizing the noise level in the room; and reverberation level recognition means for recognizing the reverberation level in the room.
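The specification does not state how the noise level is recognized. Purely as an illustrative assumption, one common approach is to compute the RMS level of a block of normalized samples in decibels relative to full scale (dBFS) and bucket it with thresholds. The function names and the threshold values below are hypothetical.

```python
import math
from typing import Sequence


def rms_dbfs(samples: Sequence[float]) -> float:
    """RMS level of samples normalized to [-1.0, 1.0], in dB relative to full scale."""
    if not samples:
        raise ValueError("samples must not be empty")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return float("-inf") if rms == 0.0 else 20.0 * math.log10(rms)


def noise_level_label(dbfs: float,
                      quiet_below: float = -50.0,   # assumed threshold
                      loud_above: float = -20.0) -> str:  # assumed threshold
    """Bucket a dBFS value into a coarse label for downstream flow selection."""
    if dbfs < quiet_below:
        return "quiet"
    if dbfs > loud_above:
        return "loud"
    return "moderate"
```

A coarse label of this kind could, for example, feed the choice of pre-stage processing, although the source leaves that linkage unspecified.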

The invention according to claim 10 is characterized in that, in addition to the configuration of claim 5, it comprises image display means provided on the housing for displaying an image.

The invention according to claim 11 is characterized in that, in addition to the configuration of claim 5, it comprises fingerprint authentication means provided on the housing for recognizing a user.

The invention according to claim 12 is characterized in that, in addition to the configuration of claim 1, the processing capability of the client-side voice input/output device includes the processor's computation speed, memory size, sensor types, microphone array, number of speakers, number of LEDs, built-in camera, number of application software programs, and front-end signal processing compatible with other companies' devices.

The invention according to claim 13 is characterized in that, in addition to the configuration of claim 1, it comprises: feature extraction means for extracting the manner of speaking as a feature from the speaker's voice; and speaker identification means that stores the feature in association with information on the speaker and identifies a speaker by matching the feature of newly input voice against the speaker information stored in the storage means.
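As a non-limiting illustration of the store-and-match step in claim 13, the sketch below enrolls one feature vector per speaker and identifies new input by cosine similarity. The specification does not say what the feature is or how matching is done, so the vector representation, the cosine measure, the `SpeakerRegistry` class, and the 0.9 threshold are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SpeakerRegistry:
    """Hypothetical store linking a speaking-style feature to a speaker ID."""

    def __init__(self, threshold: float = 0.9) -> None:
        self.threshold = threshold  # assumed minimum similarity for a match
        self._features: Dict[str, List[float]] = {}

    def enroll(self, speaker_id: str, feature: Sequence[float]) -> None:
        """Store the feature in association with the speaker's information."""
        self._features[speaker_id] = list(feature)

    def identify(self, feature: Sequence[float]) -> Optional[str]:
        """Return the best-matching enrolled speaker, or None if no one is close enough."""
        best_id, best_score = None, self.threshold
        for speaker_id, stored in self._features.items():
            score = cosine(feature, stored)
            if score >= best_score:
                best_id, best_score = speaker_id, score
        return best_id
```

Returning `None` below the threshold lets a caller treat an unknown voice as a new speaker rather than forcing a match.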

The invention according to claim 14 is characterized in that, in addition to the configuration of claim 7, it comprises emotion identification means for identifying the speaker's emotion.

The invention according to claim 15 is characterized in that, in addition to the configuration of claim 14, it comprises light emitting means that is provided on the outer periphery of the housing, lights in a circulating pattern so that the emission color of one part of the track differs from that of the remainder, and, when the speaker is detected, emits light so that the differently colored part stops in the direction of the speaker.
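As a non-limiting illustration of the lighting behavior in claim 15, the sketch below models the ring as a list of colors with one accent position that advances each frame and locks onto the speaker's direction once an azimuth is available. The LED count, the colors, the convention that LED 0 sits at 0 degrees, and all function names are assumptions.

```python
from typing import List, Optional


def led_index_for_azimuth(azimuth_deg: float, led_count: int) -> int:
    """Map a direction in degrees to the nearest LED, assuming LED 0 sits at 0 degrees."""
    step = 360.0 / led_count
    return int(round((azimuth_deg % 360.0) / step)) % led_count


def next_highlight(current: int, led_count: int,
                   speaker_azimuth_deg: Optional[float] = None) -> int:
    """Advance the accent LED by one, or stop it at the speaker once one is detected."""
    if speaker_azimuth_deg is not None:
        return led_index_for_azimuth(speaker_azimuth_deg, led_count)
    return (current + 1) % led_count


def ring_frame(highlight: int, led_count: int,
               base: str = "blue", accent: str = "white") -> List[str]:
    """Colors for one frame: every LED shows base except the accent position."""
    return [accent if i == highlight else base for i in range(led_count)]
```

Calling `next_highlight` once per frame without an azimuth produces the circulating pattern; passing the detected azimuth on every subsequent frame holds the accent LED toward the speaker.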

The invention according to claim 16 is characterized by: inputting voice information relating to voice; performing, on the input voice information, pre-stage processing that facilitates identification processing; applying predetermined processing to the pre-processed voice information; performing task processing based on first information; correcting the first information when the evaluation of the task processing is insufficient; and optimizing by repeating the series of processing until the evaluation becomes sufficient.

The invention according to claim 17 is a computer-readable program of a voice information processing system, the program causing a computer to function as: input means for inputting voice information relating to voice; pre-stage processing means for performing, on the input voice information, pre-stage processing that facilitates identification processing; and optimization means that applies predetermined processing to the voice information processed by the pre-stage processing means, performs task processing based on first information, corrects the first information when the evaluation of the task processing is insufficient, and optimizes by repeating the series of processing until the evaluation becomes sufficient.

The invention according to claim 18 is a recording medium on which the program according to claim 17 is recorded.

According to the present invention, processing can be optimized by analyzing, parsing, recognizing, and evaluating voice content with high accuracy and at low cost, and the situation in which a conversation is taking place can be grasped more accurately.

FIG. 1 is an example of a configuration diagram of the entire voice information processing system according to an embodiment of the present invention.
FIG. 2 is an example of a hardware block diagram of the cloud-side server used in the voice information processing system shown in FIG. 1.
FIG. 3 is an example of a hardware block diagram of the voice information processing system shown in FIG. 1.
FIG. 4 is an example of a software block diagram of the voice information processing system shown in FIG. 2.
FIG. 5 is an example of a software stack diagram showing the processing content of the voice information processing system shown in FIG. 2.
FIG. 6 is an example of an external view of the voice information processing system shown in FIG. 2.
FIG. 7 is an example of a flowchart showing the overall operation of the voice information processing system shown in FIG. 2.
FIG. 8 is another example of a flowchart showing the overall operation of the voice information processing system shown in FIG. 2.
FIG. 9 is another example of a flowchart showing the overall operation of the voice information processing system shown in FIG. 2.

Embodiments of the present invention will be described with reference to the drawings.

<Configuration>
<Entire system>
FIG. 1 is an example of a configuration diagram of the entire voice information processing system according to an embodiment of the present invention. In this system, a cloud-side server 20 and a client-side voice input/output device 100 are connected via a network 10.

The voice information processing system may be configured by linking the voice input/output device 100 with a smartphone, and the Internet connection may be made through a Wi-Fi router. Any communication method, such as wireless, infrared, or wired, may be used between the voice input/output device 100 and the smartphone. When the present application is applied to a plurality of voice information processing systems, for example, there will be as many voice input/output devices 100 as there are systems.

<Hardware configuration of cloud-side server and voice input/output device>
Next, the cloud-side server 20 is described in detail with reference to FIG. 2. As shown in FIG. 2, the cloud-side server 20 includes a database (hereinafter, "DB") 21, a processor 22, an output device 23, an input device 24, an interface 26, and the like. The processor (also called the "computer") 22 processes data related to the management of voice information, and the DB 21 stores data such as information related to the management of voice information, control programs, and the like. The output device 23 includes a display, a printer, and the like, and outputs various information as needed. The input device 24 includes a keyboard, a barcode reader, a scanner, and the like, and inputs information as needed; it is taken to include any device capable of inputting information. The cloud-side server 20 may consist of a single system or a plurality of systems, provided that it can ultimately carry out the operations of the voice information processing system.

<Hardware configuration of voice input/output device>
Next, the voice input/output device 100 is described in detail with reference to FIG. 3. As shown in the figure, the voice input/output device 100 mainly includes an expansion unit 201, a storage unit 202, a microphone unit 203, a microphone control unit 204, a signal processing unit 205, a communication unit 206, a voice generation unit 207, an inaudible sound generation unit 208, and a display unit 209. The display unit 209 may have an LED (Light Emitting Diode) 210 and an LCD (Liquid Crystal Display) 211. The LED 210 may be ring-shaped.

The voice input/output device 100 may further include the components indicated by broken lines: an imaging unit 212, a personal authentication unit 213, an IR (Infrared) unit 214, a projection unit 215, a vibration unit 216, and a rotation unit 217.

The expansion unit 201 is a component for connecting a USB (Universal Serial Bus) memory or other USB device to the voice input/output device 100.

The storage unit 202 is a component that stores data such as the control program of the voice input/output device 100, voice data, personal data, and image data. Examples include ROM (Read Only Memory), RAM (Random Access Memory: rewritable memory), HDD (Hard Disk Drive), and SSD (Solid State Drive).

The microphone unit 203 consists of at least one microphone, and the microphone control unit 204 performs control such as AGC (Automatic Gain Control) and forming.

The signal processing unit 205 applies processing such as ambient noise removal to the voice signal from the microphone and performs accurate recognition processing, then stores the processed information in the storage unit 202 and causes the voice generation unit 207 to produce voice. It also performs speaker identification and emotion identification on the voice signal from the microphone, stores the processed voice information in the storage unit 202 together with direction-of-arrival information, speaker identification information, and emotion identification information, and at the same time presents a corresponding display on the display unit 209. In addition, the information can be sent to the outside through the communication unit 206 or the expansion unit 201 so that more detailed analysis can be performed by cloud processing or the like. With this processing, voice information from a noise source located in a specific direction can be blocked out or, conversely, only information from a specific direction can be recorded.
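As a non-limiting illustration of blocking out or keeping only a specific direction, the sketch below filters already-processed segments by their direction-of-arrival tag. It operates on metadata only and is not a beamforming implementation; the `Segment` record, its fields, and the sector parameters are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    """Hypothetical record of one processed utterance and its metadata."""
    azimuth_deg: float  # direction of arrival
    speaker: str        # speaker identification result
    text: str           # recognized content


def angular_diff(a_deg: float, b_deg: float) -> float:
    """Smallest absolute angle between two directions, in degrees (0 to 180)."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)


def filter_by_direction(segments: List[Segment], center_deg: float,
                        width_deg: float, keep: bool = True) -> List[Segment]:
    """Keep only (keep=True) or block out (keep=False) segments inside a sector."""
    half = width_deg / 2.0
    return [s for s in segments
            if (angular_diff(s.azimuth_deg, center_deg) <= half) == keep]
```

With `keep=False` the same function blocks a known noise direction; with `keep=True` it records only the chosen sector, matching the two cases described above.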

The storage unit 202 has a multi-layer structure, which makes it possible to organize related information for the voice information to be recorded, such as its direction of arrival, speaker identification, and emotion identification.

The signal processing unit 205 has the communication unit 206 for wireless communication with external devices via Wi-Fi, Bluetooth (registered trademark), or the like, and the expansion unit 201 for hard-wired connection to external devices. Ambient noise collected by an external microphone can be input through the expansion port to reduce the influence of ambient noise, and the USB port can be used to communicate with external devices.

The inaudible sound generation unit 208 generates ultrasonic waves, and the distance to a speaker or a wall can be measured from their reflection.
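As a non-limiting illustration, echo ranging of this kind is usually computed from the round-trip time of the reflected pulse: the distance is the speed of sound multiplied by the round-trip time, divided by two. The function name and the assumed speed of sound (about 343 m/s in dry air at roughly 20 °C) are not taken from the specification.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed: dry air at about 20 degrees Celsius


def echo_distance_m(round_trip_s: float,
                    speed_m_s: float = SPEED_OF_SOUND_M_S) -> float:
    """Distance to a reflector from the round-trip time of an ultrasonic pulse.

    The pulse travels to the reflector and back, so the one-way distance
    is half of speed * time.
    """
    if round_trip_s < 0:
        raise ValueError("round_trip_s must be non-negative")
    return speed_m_s * round_trip_s / 2.0
```

For example, an echo that returns after 20 ms corresponds to a reflector roughly 3.4 m away under the assumed speed of sound.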

The LED 210 of the display unit 209 may be a ring-shaped LED that lights in a circulating pattern, blinks, or changes its emission interval or emission color. The LCD 211 may be provided on the top plate or a side face of the housing of the voice input/output device 100 and may be color or monochrome.

The imaging unit 212 is a component that images the surroundings of the voice input/output device 100; an example is a CCD (Charge Coupled Device) camera. The images captured by the imaging unit 212 may be moving images or still images.

The personal authentication unit 213 is a component that identifies a user's fingerprint or voiceprint. It may be a fingerprint identification device provided on the top plate of the voice input/output device 100, or a voiceprint identification device (or software) that identifies a voiceprint from the speaker's voice.

The IR unit 214 is an infrared sensor and can be used as a human presence sensor for monitoring intrusion and detecting visitors.

The projection unit 215 is a projector provided in the housing of the voice input/output device 100; for a meeting or a travel briefing, for example, it projects a map or an agenda onto an indoor whiteboard, wall, or screen.

The vibration unit 216 draws the user's attention by vibrating the housing of the client-side voice input/output device 100. Examples of the vibration unit 216 include a piezoelectric element and a motor with an eccentric cam on its output shaft.

The rotation unit 217 consists of a base provided on the bottom face of the voice input/output device 100, a rotation shaft provided on the base, and a motor provided on the rotation shaft to rotate the housing. The rotation unit 217 can change the orientation of the projection unit 215 and the LCD 211.

<Software configuration of cloud-side server and voice input/output device>
The software configuration of the cloud-side server and the voice input/output device is described with reference to FIG. 4.

<Cloud-side server>
The cloud-side server 20 includes input means 41, output means 42, storage means 43, determination means 44, optimization means 45, translation means 49, first control means 50, inference means 51, and communication means 52.

最適化手段４５は、評価手段４６、修正手段４７、及び繰返手段４８を備える。推論手段５１は、感情識別手段５１ａ、方位検出手段５１ｂ、話者識別手段５１ｃ、収集手段５１ｄ、解釈手段５１ｅ、管理手段５１ｆ、サイズ判断手段５１ｇ、ノイズレベル認識手段５１ｈ、及び残響レベル認識手段５１ｉを備える。 The optimization unit 45 includes an evaluation unit 46, a correction unit 47, and a repetition unit 48. Inference means 51 includes emotion identification means 51a, orientation detection means 51b, speaker identification means 51c, collection means 51d, interpretation means 51e, management means 51f, size determination means 51g, noise level recognition means 51h, and reverberation level recognition means 51i. Is provided.

The input means 41 inputs information as necessary. It covers every device capable of inputting information and is realized by the input device 24 shown in FIG. 2.

The output means 42 outputs various kinds of information as necessary and is realized by the output device 23 shown in FIG. 2.

The storage means 43 stores the control program of the cloud-side server and data such as information on the management of voice information, and is realized by the database 21 shown in FIG. 2. The control program is configured for proactive behavior at startup; for example, the device speaks a greeting first when it detects a person.

The determination means 44 decides where processing takes place when the content of speech in the room is analyzed and a response is produced. If the processing capability of the client-side voice input/output device 100 is adequate, the voice input/output device 100 performs the information processing; if not, the cloud-side server 20 performs it. The determination means 44 is realized by the processor 22 shown in FIG. 2.

Here, the processing capability of the client-side voice input/output device 100 includes the processor's computation speed, the memory size, the sensor types, the microphone array, the number of speakers, the number of LEDs, the built-in camera, the number of application programs, and front-end signal processing that can work with other companies' devices.
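The client-or-cloud determination can be sketched as a simple capability check. This is a minimal illustration only: the `Capability` fields, the `choose_processor` name, and the rule that every requirement must be met are assumptions, since the specification lists the capability factors but not how they are compared.

```python
from dataclasses import dataclass


@dataclass
class Capability:
    """Hypothetical subset of the capability factors listed in the text."""
    cpu_ops_per_sec: float  # processor computation speed (assumed unit)
    memory_mb: int          # memory size
    mic_count: int          # size of the microphone array


def choose_processor(device: Capability, required: Capability) -> str:
    """Return "client" if the device meets every requirement, else "cloud"."""
    sufficient = (
        device.cpu_ops_per_sec >= required.cpu_ops_per_sec
        and device.memory_mb >= required.memory_mb
        and device.mic_count >= required.mic_count
    )
    return "client" if sufficient else "cloud"
```

A task whose requirements exceed the device on any one factor falls back to the cloud side, which mirrors the two branches described for the determination means 44.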

The optimization means 45 applies predetermined processing to the voice information processed by the pre-stage processing means, which prepares the signal so that identification processing becomes easier. It performs task processing based on the first information, corrects the first information when the evaluation of the task processing is insufficient, and optimizes by repeating this series of processing as many times as needed until the evaluation is sufficient. It is realized by the database 21 and the processor 22 shown in FIG. 2.

The first information includes the scenario design associated with the application and the flow that specifies which of the various means to select, in what order and how to execute them, how to evaluate the result, and that the processing is repeated as many times as needed when the result is insufficient.

The evaluation means 46 of the optimization means 45 evaluates the result of the task processing. The correction means 47 corrects the first information when the result is insufficient. The repetition means 48 repeats the series of processing from the evaluation means 46 through the correction means 47 as many times as needed.
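The evaluate, correct, and repeat cycle of the optimization means 45 can be sketched as a small loop. All names here (`optimize`, `run_task`, `evaluate`, `correct`, `threshold`) are hypothetical, and the `max_rounds` cap is an added safety assumption; the text itself allows any number of repetitions.

```python
from typing import Callable, Tuple, TypeVar

Plan = TypeVar("Plan")      # stands in for the "first information"
Result = TypeVar("Result")  # outcome of the task processing


def optimize(
    plan: Plan,
    run_task: Callable[[Plan], Result],
    evaluate: Callable[[Result], float],
    correct: Callable[[Plan, float], Plan],
    threshold: float,
    max_rounds: int = 100,
) -> Tuple[Plan, float]:
    """Repeat task -> evaluation -> correction until the score is sufficient."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    score = float("-inf")
    for _ in range(max_rounds):
        score = evaluate(run_task(plan))   # role of evaluation means 46
        if score >= threshold:
            break                          # evaluation is sufficient
        plan = correct(plan, score)        # role of correction means 47
    return plan, score
```

The loop body plays the part of the repetition means 48; the caller supplies the task, the scoring rule, and the correction rule.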

The translation means 49 automatically identifies the language and translates, for example, from Japanese into multiple other languages and from those languages into Japanese. It is realized by the database 21 and the processor 22 shown in FIG. 2. The translation means 49 may also have what amounts to an interpretation function (or simultaneous interpretation function) that converts Japanese speech into speech in other languages and vice versa. In this case the speech is synthesized by speech synthesis means, which may estimate the speaker's gender and age and produce a voice that matches them. When translating, the translation means 49 can also record the translated content as text. The text data is preferably linked to the speaker and may be color-coded by speaker.

The first control means 50 performs overall control of the means of the cloud-side server 20 and is realized by the processor 22 shown in FIG. 2.

The inference means 51 performs information processing, including analysis, parsing, and recognition of the speech content, and infers attributes of the speaker such as age and gender. It is realized by the database 21 and the processor 22 shown in FIG. 2.

The emotion identification means 51a identifies the speaker's emotion and is realized by the database 21 and the processor 22 shown in FIG. 2. The emotion identified by the emotion identification means 51a is preferably recorded in association with the speaker's utterance.

The direction detection means 51b detects the direction of the speaker as seen from the voice input/output device 100 and is realized by the input device 24 and the processor 22 shown in FIG. 2.

The speaker identification means 51c relies on feature extraction means that extracts, from a speaker's voice, the difference from an average acoustic model of speech as a feature. It stores the extracted feature in association with the speaker's information and identifies the speaker by matching the feature of newly input speech against the speaker information stored in the storage means 43. It is realized by the database 21 and the processor 22 shown in FIG. 2. Speaker identification may also be realized through face image recognition or fingerprint authentication by providing the voice input/output device 100, described later, with imaging means 73 or fingerprint authentication means 72.
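The enroll-then-match behavior of the speaker identification means 51c can be sketched as follows. This is an illustration under stated assumptions: features are plain numeric vectors, the "difference from an average acoustic model" is taken as an element-wise subtraction, and matching uses cosine similarity with a hypothetical `min_similarity` cutoff. The specification does not name the feature type or the matching measure.

```python
import math
from typing import Dict, List, Optional, Sequence


def deviation(features: Sequence[float], average: Sequence[float]) -> List[float]:
    """Feature = difference between a voice and the average model (assumed form)."""
    return [f - a for f, a in zip(features, average)]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class SpeakerRegistry:
    """Stores per-speaker deviation features and matches new speech to them."""

    def __init__(self, average: Sequence[float]) -> None:
        self.average = list(average)
        self.enrolled: Dict[str, List[float]] = {}

    def enroll(self, name: str, features: Sequence[float]) -> None:
        self.enrolled[name] = deviation(features, self.average)

    def identify(self, features: Sequence[float], min_similarity: float = 0.8) -> Optional[str]:
        """Return the best-matching enrolled speaker, or None if none is close enough."""
        probe = deviation(features, self.average)
        best_name, best_score = None, min_similarity
        for name, reference in self.enrolled.items():
            score = cosine(probe, reference)
            if score >= best_score:
                best_name, best_score = name, score
        return best_name
```

Returning `None` for an unmatched voice leaves room for the fallback the text mentions, such as face image recognition or fingerprint authentication.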

The collection means 51d collects the logs used when designing the second information and is realized by the database 21 and the processor 22 shown in FIG. 2. As described above, the second information is information on the flow: which of the various means to use, in what order to process, how to evaluate, and that the processing is repeated when the result is insufficient.

The interpretation means 51e interprets the intent of the dialogue with the speaker and is realized by the database 21 and the processor 22 shown in FIG. 2.

The management means 51f manages the dialogue with the speaker and is realized by the database 21 and the processor 22 shown in FIG. 2.

The size determination means 51g determines the size of the room and is realized by the interface 26 and the processor 22 shown in FIG. 2. Via the interface 26, the size determination means 51g may cause the client-side voice input/output device 100 to emit a non-audible sound intermittently, pick up the sound reflected from the surroundings with the microphones, and thereby grasp the environment of the voice input/output device 100 (two-dimensional direction and distance).
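The distance part of this echo measurement follows from the round-trip time of the reflected pulse. The sketch below assumes a speed of sound of 343 m/s (air at about 20 °C) and hypothetical function names; how the device actually measures the delay per direction is not described in the text.

```python
from typing import Dict, List, Tuple

SPEED_OF_SOUND_M_S = 343.0  # assumption: dry air at roughly 20 degrees Celsius


def echo_distance_m(round_trip_s: float, speed_m_s: float = SPEED_OF_SOUND_M_S) -> float:
    """Distance to a reflector: the pulse travels there and back, so halve the path."""
    if round_trip_s < 0:
        raise ValueError("round-trip time must be non-negative")
    return speed_m_s * round_trip_s / 2.0


def scan_to_polar(delays_by_azimuth: Dict[float, float]) -> List[Tuple[float, float]]:
    """Convert {azimuth in degrees: round-trip delay in seconds} to sorted (azimuth, distance) pairs."""
    return [(az, echo_distance_m(t)) for az, t in sorted(delays_by_azimuth.items())]
```

For example, a 10 ms round trip corresponds to a reflector about 1.7 m away; a set of such pairs gives the two-dimensional direction and distance picture mentioned above.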

The noise level recognition means 51h recognizes the noise level in the room and is realized by the interface 26 and the processor 22 shown in FIG. 2. Via the interface 26, it can obtain the noise level, measured before noise removal, from the room audio captured by the microphones of the client-side voice input/output device 100. From the room's noise level, it can be determined whether the voice input/output device 100 is located in, for example, a reception area, a conference room, a call center, or some other environment.
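One way to picture this is to compute a level from raw samples and look it up against calibrated bounds. The sketch below is illustrative only: it assumes samples normalized to the range -1 to 1, measures level in dB relative to full scale, and takes the bounds from the caller, because the specification gives no numeric thresholds for any environment.

```python
import math
from typing import List, Sequence, Tuple


def level_dbfs(samples: Sequence[float]) -> float:
    """RMS level in dB relative to full scale; silence maps to negative infinity."""
    if not samples:
        raise ValueError("samples must not be empty")
    mean_square = sum(x * x for x in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")
    return 10.0 * math.log10(mean_square)  # equals 20 * log10(RMS)


def classify_environment(level_db: float, upper_bounds: List[Tuple[float, str]]) -> str:
    """Return the label of the lowest bound that the level does not exceed.

    upper_bounds holds (upper limit in dB, label) pairs that must be calibrated
    per installation; any values used with this function are placeholders.
    """
    for upper, label in sorted(upper_bounds):
        if level_db <= upper:
            return label
    return "other"
```

The labels would be the environments named in the text (reception, conference room, call center), with "other" as the fallback.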

The reverberation level recognition means 51i recognizes the reverberation level in the room and is realized by the interface 26 and the processor 22 shown in FIG. 2. Via the interface 26, it can obtain the reverberation level from the room audio captured by the microphones of the client-side voice input/output device 100. The environment of the client-side voice input/output device 100 can be judged from the room's reverberation level.

The communication means 52 exchanges information between the cloud-side server 20 and the client-side voice input/output device 100 via the network 10 and can be realized by the interface 26 shown in FIG. 2.

<Voice input/output device>
The client-side voice input/output device 100 mainly comprises input means 61, output means 62, pre-stage processing means 63, light emitting means 64, communication means 65, second control means 66, storage means 67, input/output means 68, and detection means 69. The voice input/output device 100 may further comprise image display means 71, fingerprint authentication means 72, imaging means 73, external information input means 74, vibration means 75, and rotation means 76.

The input means 61 inputs voice information and is realized by the microphone unit 203 and the microphone control unit 204 shown in FIG. 3. The input means 61 applies beamforming, blind source separation, reverberation suppression, noise suppression, echo cancellation (barge-in), and voice activity detection (VAD) processing.

The output means 62 generates a non-audible sound (20 kHz to 40 kHz, preferably 30 kHz) and is realized by the non-audible sound generation unit 208 shown in FIG. 3. Examples of the non-audible sound generation unit 208 include an ultrasonic speaker using a moving coil and an ultrasonic speaker using a piezoelectric element.

The pre-stage processing means 63 removes noise from the microphone signal, cancels echo, and performs beamforming, blind source separation, and reverberation suppression. It is realized by the microphone control unit 204 shown in FIG. 3.

The light emitting means 64 is provided on the outer periphery of the housing. It can light in a circulating pattern in which part of the ring shows a color different from the rest, and when a speaker is detected it can stop the differently colored part in the direction of the speaker. It is realized by the LED 210 shown in FIG. 3.

The communication means 65 exchanges information between the client-side voice input/output device 100 and the cloud-side server 20 via the network 10 and is realized by the communication unit 206 shown in FIG. 3.

The second control means 66 performs overall control of the voice input/output device 100 and is realized by the signal processing unit 205 shown in FIG. 3. An example of the signal processing unit 205 is a processor.

The storage means 67 stores the program for overall control of the voice input/output device 100 and is realized by the storage unit 202 shown in FIG. 3. Examples of the storage means 67 include ROM, RAM, an HDD, and an SSD, and it may be configured to store voice information, personal information, image information, and fingerprint information.

The input/output means 68 is for connecting a USB flash memory or other USB devices and is realized by the expansion unit 201 shown in FIG. 3.

The detection means 69 detects a person approaching or passing the voice input/output device 100; the IR unit 214 shown in FIG. 3 is one example. An example of the detection means 69 is a human presence sensor.

The image display means 71 displays images such as still images and moving images, including text, and is realized by the LCD 211 shown in FIG. 3.

The fingerprint authentication means 72 recognizes the user and is realized by the personal authentication unit 213 shown in FIG. 3. An example of the fingerprint authentication means 72 is a fingerprint sensor.

The imaging means 73 is a digital camera and is realized by the imaging unit 212 shown in FIG. 3.

The external information input means 74 inputs content from outside and is realized by the expansion unit 201 shown in FIG. 3.

The vibration means 75 vibrates the housing of the voice input/output device 100 and is realized by the vibration unit 216 shown in FIG. 3.

The rotation means 76 rotates (swivels) the housing of the voice input/output device 100 about a vertical central axis and is realized by the rotation unit 217 shown in FIG. 3.

The projection means 77 projects images onto an indoor screen, whiteboard, wall, or the like and is realized by the projection unit 215 shown in FIG. 3.

<Software stack>
The configuration of the analysis, parsing, and recognition processing described above is explained following the software stack diagram of FIG. 5. The stack consists of a usage log collection/analysis unit 503, an intent interpretation/dialogue management technology unit 504, a speech recognition unit 505, a speaker identification unit 506, an environmental sound recognition unit 507, an emotion analysis unit 508, a front-end signal processing technology unit 509, a microphone array processing technology unit 510, a multi-microphone array processing technology unit 510, a multi-microphone hardware unit 512, a sensor 515, and so on. An external system 513 and external content 514 may be connected to the intent interpretation/dialogue management technology unit 504. The scenario design is evaluated according to the dialogue applications 501-1 to 501-n, and usage logs are collected and analyzed for continuous improvement. Although the figure shows three arrows for continuous improvement, arrows could equally be drawn for the intent interpretation/dialogue management technology unit 504, the speech recognition unit 505, the speaker identification unit 506, the environmental sound recognition unit 507, the emotion analysis unit 508, the front-end signal processing technology unit 509, the microphone array processing technology unit 510, and the multi-microphone hardware unit 512, because these are also continuously improved, with their order rearranged or some of them omitted.

The multi-microphone hardware unit 512 and the front-end signal processing technology unit 509 can also flexibly accommodate other companies' products.

The microphone array processing technology unit 510 consists of beamforming, blind source separation, reverberation suppression, and similar processing. The front-end signal processing technology unit 509 consists of noise suppression, echo cancellation (barge-in), voice activity detection (VAD), and similar processing. The speech recognition unit 505, which converts speech into text, consists of multilingual support, automatic language identification, mixed-language processing, and similar processing. The speaker identification unit 506 consists of pre-training, clustering, speaker verification (authentication), and similar processing. The emotion analysis unit 508 consists of multi-class emotion classification, emotion mapping, intonation recognition, and similar processing. The environmental sound recognition unit 507 consists of applause and laughter recognition, overlap detection, scene estimation, and similar processing. In addition, abnormal sound inspection consists of acoustic sensory inspection, normal/abnormal sound discrimination, and similar processing.

The configuration of the speech processing has been described above. The speech processing recognizes characteristic sounds such as laughter, applause, and doorbells, and performs speaker identification, gender estimation, intonation determination, and the like. To make these possible, preprocessing is applied to the audio: speech segmentation, noise suppression, reverberation suppression, and sound source localization, that is, beamforming that extracts sound from a specified angle. The recognition results from each process are then analyzed and managed by the usage log collection/analysis technology unit 503 and the intent interpretation/dialogue management technology unit 504, for example in cooperation with related systems.

In addition to collecting human speech with, for example, microphones, the device emits, for example from a speaker, a super-audible or non-audible sound that human hearing cannot perceive, picks up its reflection with the microphones, and analyzes and recognizes this information through speech processing to grasp the material of reflecting objects, their distance, and other aspects of the surroundings of the voice input/output device 100. To make this possible, multiple microphones are mounted, and microphone mounting sections (not shown), each carrying microphones arranged in a horizontal circle, are stacked vertically, for example in two layers, which enables detection in the vertical direction. Further, the emitted super-audible or non-audible sound may be shaped into pulses, and an information sound unique to a particular voice input/output device 100 may be added in the interval between one pulse and the next, which makes it possible to distinguish multiple voice input/output devices 100.

According to the present embodiment, the user can freely choose which processes to use, such as speech processing (speaker identification, emotion identification, and so on), usage log collection, analysis, and intent interpretation, and in what order to use them, including whether to remove noise before reverberation processing or to perform reverberation processing before removing noise.
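A user-selectable processing order of this kind can be sketched as a pipeline built from a table of named stages. The stage names, `build_pipeline`, and the two toy stage functions below are hypothetical stand-ins chosen only to show that swapping the order changes the output; they are not the signal processing used by the device.

```python
from typing import Callable, Dict, List, Sequence

Signal = List[float]
Stage = Callable[[Signal], Signal]


def build_pipeline(stages: Dict[str, Stage], order: Sequence[str]) -> Stage:
    """Compose the named stages in the order the user selected."""
    order = list(order)
    unknown = [name for name in order if name not in stages]
    if unknown:
        raise KeyError(f"unknown stages: {unknown}")

    def run(signal: Signal) -> Signal:
        for name in order:
            signal = stages[name](signal)
        return signal

    return run


def noise_gate(signal: Signal) -> Signal:
    """Toy noise suppression: zero out samples below a fixed magnitude."""
    return [x if abs(x) >= 0.1 else 0.0 for x in signal]


def toy_dereverb(signal: Signal) -> Signal:
    """Toy reverberation suppression: subtract half of the previous sample."""
    out, prev = [], 0.0
    for x in signal:
        out.append(x - 0.5 * prev)
        prev = x
    return out


STAGES = {"noise_suppression": noise_gate, "reverberation_suppression": toy_dereverb}
noise_first = build_pipeline(STAGES, ["noise_suppression", "reverberation_suppression"])
reverb_first = build_pipeline(STAGES, ["reverberation_suppression", "noise_suppression"])
```

Running `noise_first` and `reverb_first` on the same samples gives different results, which is why the order is exposed as a choice rather than fixed.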

<Voice information processing system>
The system shown in FIG. 6 comprises the voice input/output device 100 and the cloud-side server 20.

The voice input/output device 100 comprises an LED (light-emitting diode) ring 102 arranged along the outer edge of the top plate of a housing 101 in which many through holes are formed; a plurality of PDM microphones 103-1 to 103-16 (for example 16, although the number is not limited) arranged circumferentially on the side surface of the housing 101 in a common plane; a speaker group (squawker 104S and tweeter 104T) arranged inside the housing 101 facing downward; and an upward-convex conical reflector 105 on the bottom of the housing 101. Various circuit boards are provided inside the housing 101. Reference numeral 106 denotes an LED serving as a power lamp. Reference numeral 107 denotes a power cord, although a battery can also be installed.

The housing 101 is cylindrical in the figure but is not limited to that shape; it may be prismatic, conical, frustopyramidal, or frustoconical.

The LED ring 102 is a multi-color light-emitting device made up of many three-color LEDs arranged in a ring. Several adjacent LEDs can emit a color different from that of the remaining LEDs, and the differently colored segment can rotate along the circular track, stop, or circulate. For example, several LEDs in the direction of the speaker may be lit white while the remaining LEDs are lit blue. The ring is not limited to this; the LEDs may blink instead of staying lit, or their brightness may vary with the intensity of the speaker's voice, like a level indicator.
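The white-toward-the-speaker, blue-elsewhere example can be sketched as a mapping from speaker direction to LED colors. The number of LEDs, the convention that LED 0 sits at azimuth 0 with even spacing, and the `ring_colors` name are all assumptions made for illustration.

```python
from typing import List


def ring_colors(
    num_leds: int,
    azimuth_deg: float,
    half_width: int = 1,
    highlight: str = "white",
    base: str = "blue",
) -> List[str]:
    """Light 2 * half_width + 1 LEDs centered on the speaker direction.

    Assumes LED 0 is at azimuth 0 and LEDs are evenly spaced around the ring.
    """
    if num_leds < 1:
        raise ValueError("num_leds must be positive")
    center = round((azimuth_deg % 360.0) / 360.0 * num_leds) % num_leds
    colors = [base] * num_leds
    for offset in range(-half_width, half_width + 1):
        colors[(center + offset) % num_leds] = highlight  # wraps around the ring
    return colors
```

Calling the function with a changing azimuth on each frame would produce the rotating segment, and holding the azimuth fixed corresponds to stopping it in the speaker's direction.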

Because the 16 PDM microphones 103-1 to 103-16 are arranged at equal intervals on the side surface of the housing 101 in a common plane, the horizontal direction of a sound source can be resolved from reflected sound within a range of 20 degrees, and the voice of a person near the voice input/output device 100 can be picked up as a sound source.
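The geometry of an equally spaced circular array can be sketched in a few lines. With 16 microphones the angular pitch is 360/16 = 22.5 degrees, which is of the same order as the 20-degree figure in the text; whether the two are directly related is not stated in the source. The array radius and the function names are assumptions.

```python
import math
from typing import List, Tuple


def mic_angles_deg(count: int) -> List[float]:
    """Azimuth of each microphone, assuming microphone 0 sits at 0 degrees."""
    return [i * 360.0 / count for i in range(count)]


def mic_positions(count: int, radius_m: float) -> List[Tuple[float, float]]:
    """Planar (x, y) coordinates of the microphones on a circle of the given radius."""
    return [
        (radius_m * math.cos(math.radians(a)), radius_m * math.sin(math.radians(a)))
        for a in mic_angles_deg(count)
    ]


def nearest_mic(azimuth_deg: float, count: int) -> int:
    """Index of the microphone closest to a given source direction."""
    pitch = 360.0 / count
    return round((azimuth_deg % 360.0) / pitch) % count
```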

The squawker 104S is a speaker that produces ordinary audible sound, and the tweeter 104T is a speaker that produces non-audible sound (ultrasound). The reflector 105 reflects the audible and non-audible sound from the squawker 104S and the tweeter 104T radially outward from the housing 101. Non-audible sound from the tweeter 104T is reflected by the reflector 105 and leaves the housing 101; when it bounces off a person speaking, it returns toward the housing 101 and is picked up by the PDM microphones 103-1 to 103-16, so the device can function like a radar (or sonar). The non-audible sound may be pulse-modulated.

The cloud system 300 consists of a server group 301 comprising multiple servers and may perform software processing for the voice input/output device 100, such as deep learning.

<Operation 1>
The operations in the flowchart shown in FIG. 7 are carried out by the processor 22 of the cloud-side server 20.

In this system, the user can set a scenario in advance as the first information. A scenario is, for example, a story describing how the system responds in a dialogue between the speaker and the voice input/output device 100.

First, the user creates multiple sets of information (for example, the first information, the second information, and the third information) on the cloud-side server 20 (step S10).

When the power switch of the voice input/output device 100 is turned on (step S11), the processor 22 determines whether a mode setting signal has been received from outside the voice input/output device 100, for example from a smartphone (not shown) (step S12).

If the processor 22 determines that there is an external mode setting signal (step S12/YES), one of the modes (call center mode, reception mode, conference mode, and so on) is set (step S13). If it determines that there is no external mode setting signal (step S12/NO), the process proceeds to step S14.

In step S14, the processor 22 determines whether the processing capability of the client-side voice input/output device 100 is sufficient. If it is sufficient (step S14/YES), processing is performed on the client-side voice input/output device 100 (step S15); if it is not (step S14/NO), processing is performed on the cloud side (step S16). The process then proceeds to step S17.

The processor 22 determines whether optimization has already been completed (step S17). If not (step S17/NO), it selects the first information (step S18) and executes the pre-stage processing that makes identification processing easier. As the first information, the pre-stage processing selects, for example, at least one of beamforming, blind source separation, and reverberation suppression, determines the order, and executes them. It likewise selects from noise suppression, echo cancellation, and voice activity detection as appropriate, determines the order, and executes them (step S19).
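The phrase "select at least one and determine the order" defines a finite set of candidate configurations, which can be enumerated as ordered selections. The sketch below only lists that search space; the `candidate_orders` name is hypothetical, and the specification does not say that candidates are enumerated exhaustively.

```python
from itertools import permutations
from typing import Iterator, Sequence, Tuple


def candidate_orders(processes: Sequence[str]) -> Iterator[Tuple[str, ...]]:
    """Yield every ordered selection of one or more of the given processes."""
    for size in range(1, len(processes) + 1):
        yield from permutations(processes, size)


ARRAY_PROCESSES = ["beamforming", "blind_source_separation", "reverberation_suppression"]
candidates = list(candidate_orders(ARRAY_PROCESSES))
```

For three processes this gives 3 single-process choices, 6 ordered pairs, and 6 full orderings, 15 candidates in total, any of which could serve as a starting point for the first information.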

次にタスクの評価（ステップＳ２０）を行い、評価が十分か否かを判断し（ステップＳ２１）、評価が十分でないと判断した場合（ステップＳ２１／ＮＯ）、ステップＳ１０に戻り、クラウド側で第１の情報を修正し、ステップＳ１０〜ステップＳ２３を何回でも繰り返す。 Next, the task is evaluated (step S20), it is determined whether or not the evaluation is sufficient (step S21), and if it is determined that the evaluation is not sufficient (step S21 / NO), the process returns to step S10 and the cloud side 1 information is corrected, and steps S10 to S23 are repeated any number of times.

If the processor 22 determines in step S21 that the task evaluation is sufficient (step S21/YES), it selects the second information (step S22) and executes identification processing. In the identification processing, at least one of, for example, speech recognition, speaker identification, emotion analysis, and environmental sound recognition is selected as the second information, the order is determined, and the selected processing is executed (step S23).

Next, the task is evaluated (step S24) and it is determined whether the evaluation is sufficient (step S25). If the evaluation is determined to be insufficient (step S25/NO), the process returns to step S10, the second information is corrected on the cloud side, and steps S10 to S24 are repeated as many times as necessary.

If the processor 22 determines in step S25 that the task evaluation is sufficient (step S25/YES), it selects the third information (step S26) and executes the dialogue application. For the dialogue application, at least one of usage log collection, analysis, intent interpretation, and dialogue management is selected as the third information, the order is determined, and the selected processing is executed (step S27).

Next, the task is evaluated (step S28) and it is determined whether the evaluation is sufficient (step S29). If the evaluation is determined to be insufficient (step S29/NO), the process returns to step S10, the third information is corrected on the cloud side, and steps S10 to S28 are repeated as many times as necessary.

When the optimization is complete (step S30), the processor 22 executes the application (step S31). If the process is not to end (step S32/NO), it returns to step S12; if it is to end (step S32/YES), the process ends. In the latter case, the device may be configured so that the power switch turns off automatically.

The flowchart shown in FIG. 7 is merely one example, and the invention is not limited to it. For example, whether the indoor environment stays the same while the number of speakers changes, the indoor environment changes while the speakers and their number stay the same, the mode in use changes, or the number or grade of the microphones in use changes, steps S18 to S21, steps S22 to S25, and steps S26 to S29 can be reordered as appropriate, or some of them omitted. This provides continuous improvement and allows such changes to be handled flexibly.

For example, when processing follows the flowchart shown in FIG. 7, beamforming and blind source separation are not performed if the microphone unit has only one microphone, and noise suppression is performed only once if the microphone is not a high-performance one. If the microphone unit has many microphones, for example 16 or more, beamforming and blind source separation are performed, followed by processing such as noise suppression and echo cancellation.
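The microphone-count rule in the preceding paragraph can be sketched as below. The function name `select_pre_stage` is hypothetical. The text does not say what happens between 2 and 15 microphones, so this sketch assumes array processing is applied only at or above the `many_mics` threshold, and it always schedules a single noise-suppression pass.

```python
from typing import List


def select_pre_stage(num_mics: int, many_mics: int = 16) -> List[str]:
    """Pick pre-stage processing steps from the number of microphones.

    many_mics=16 follows the example in the text; treating every smaller
    array like the single-microphone case is an assumption of this sketch.
    """
    if num_mics < 1:
        raise ValueError("at least one microphone is required")

    steps: List[str] = []
    if num_mics >= many_mics:
        # Array processing only makes sense with multiple microphones.
        steps += ["beamforming", "blind source separation"]
    steps.append("noise suppression")          # performed once
    if num_mics >= many_mics:
        steps.append("echo cancellation")
    return steps
```

With one microphone the plan is noise suppression alone; with the 16-microphone unit it is beamforming, blind source separation, noise suppression, and echo cancellation, in that order.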

<Operation 2>
The operations of the flowchart shown in FIG. 8 are performed by the processor 22 of the cloud-side server 20. The flowchart in FIG. 8 differs from that in FIG. 7 in that steps S18 to S21 and steps S26 to S29 are interchanged. This is an example of processing performed when, with processing carried out on the cloud-side server, the flowchart in FIG. 7 does not produce a sufficient result. This processing likewise provides continuous improvement and allows a flexible response.
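Reordering or omitting the three evaluate-and-correct stages can be expressed as data rather than as separate control flow. The sketch below is one possible arrangement: the stage keys, `plan_stages`, and the two order constants are hypothetical names, and the candidate lists simply restate the options named in the description.

```python
from typing import Dict, Iterable, List, Tuple

# Candidate processing for each stage, as listed in the description.
STAGES: Dict[str, List[str]] = {
    "pre_stage": ["beamforming", "blind source separation",
                  "reverberation suppression", "noise suppression",
                  "echo cancellation", "voice activity detection"],
    "identification": ["speech recognition", "speaker identification",
                       "emotion analysis", "environmental sound recognition"],
    "dialogue_app": ["usage log collection", "analysis",
                     "intent interpretation", "dialogue management"],
}

FIG7_ORDER = ["pre_stage", "identification", "dialogue_app"]
# First and third stages interchanged, as in the FIG. 8 variant.
FIG8_ORDER = ["dialogue_app", "identification", "pre_stage"]


def plan_stages(order: Iterable[str],
                skip: Iterable[str] = ()) -> List[Tuple[str, List[str]]]:
    """Return (stage, candidates) pairs in the requested order, minus skips."""
    order = list(order)
    skipped = set(skip)
    unknown = [name for name in order if name not in STAGES]
    if unknown:
        raise ValueError(f"unknown stage(s): {unknown}")
    return [(name, STAGES[name]) for name in order if name not in skipped]
```

Keeping the order in a list means a change of room, speakers, mode, or microphones only has to produce a different list, not a different program.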

<Operation 3>
The flowchart shown in FIG. 9 assumes the case where, after use of the voice input/output device 100 has ended, the device is left with its power switch off at a reception desk, in a conference room, or on the desk of a call center operator, and the power switch is turned on again on the next business day.

When the power switch is turned on (step S91), the processor 22 determines whether the indoor environment, the speakers, the number of people, and the mode are all unchanged (step S92). If nothing has changed (step S92/YES), it executes the application (step S93).

After executing the application, the processor 22 determines whether to end (step S94). If so, the process ends (step S94/YES); if not, the process returns to step S92 (step S94/NO).

If the indoor environment, the speakers, the number of people, or the mode has changed (step S92/NO), the processor 22 determines whether the processing capability of the client-side voice input/output device 100 is sufficient (step S95).

If the processor 22 determines that the processing capability of the client-side voice input/output device 100 is sufficient (step S95/YES), processing is performed by the voice input/output device 100 (step S96). If it determines that the processing capability is not sufficient (step S95/NO), processing is performed on the cloud side (step S97). The process then proceeds to step S98.

The processor 22 determines whether a mode setting signal is present (step S98). If it determines that a mode setting signal has been received from outside (step S98/YES), one of the call center mode, reception mode, conference mode, … is set (step S99). If it determines that no mode setting signal has been received from outside (step S98/NO), the process proceeds to step S14 (see FIG. 7).

The determination in step S92 allows unnecessary processing to be skipped, which improves efficiency.
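The start-up flow of Operation 3 (steps S92 to S99) can be sketched as a single decision function. The `Conditions` record, the action strings, and `next_actions` are hypothetical names for this illustration. Comparing a stored snapshot with the current one is an assumed way of detecting "no change", which the description does not spell out.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Conditions:
    """Hypothetical snapshot of what step S92 compares."""
    room: str
    speakers: FrozenSet[str]   # also captures the number of people
    mode: str


def next_actions(previous: Conditions,
                 current: Conditions,
                 client_sufficient: bool,
                 mode_signal: Optional[str] = None) -> List[str]:
    """Return the actions taken after power-on, in order."""
    if previous == current:                      # S92/YES: nothing changed
        return ["run_app"]                       # S93, re-optimization skipped

    # S95 to S97: choose where the processing runs.
    actions = ["process_on_client" if client_sufficient else "process_on_cloud"]
    if mode_signal is not None:                  # S98/YES
        actions.append(f"set_mode:{mode_signal}")    # S99
    else:                                        # S98/NO
        actions.append("goto_S14")               # continue with the FIG. 7 flow
    return actions
```

The early return is where the efficiency gain comes from: an unchanged room, speaker set, and mode go straight to the application.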

The voice input/output device 100 according to the present invention described above is implemented by a control program that causes a computer to execute the processing. As one example, the case where the functions of the present invention are implemented by a program is described below.

One example is a program for a voice information processing system, the program being computer-readable and causing a computer to function as:
input means for inputting voice information relating to voice;
pre-stage processing means for performing, on the input voice information, pre-stage processing that facilitates identification processing; and
optimization means for performing optimization by applying predetermined processing to the voice information processed by the pre-stage processing means, performing task processing based on the first information, correcting the first information when the evaluation of the task processing is not sufficient, and repeating the series of processes until the evaluation becomes sufficient.

As a result, the voice input/output device 100 according to the present invention can be realized anywhere, as long as a computer environment capable of executing the program is available.

Such a program may be stored on a computer-readable recording medium.

<Recording medium>
Examples of the recording medium include computer-readable recording media such as a CD-ROM, a flexible disk (FD), and a CD-R; semiconductor memories such as flash memory, RAM, ROM, and FeRAM; and an HDD.

CD-ROM is an abbreviation for Compact Disc Read Only Memory. FD stands for Flexible Disk. CD-R is an abbreviation for CD Recordable. FeRAM is an abbreviation for Ferroelectric RAM and refers to ferroelectric memory. HDD is an abbreviation for Hard Disk Drive.

The embodiment described above is one example of a preferred embodiment of the present invention. The present invention is not limited to it, and various modifications are possible without departing from the gist of the invention.

The present invention makes it possible not only to understand the situation of a subject in detail, mainly on the basis of voice information, but also to recognize matters expected to occur in the future and to provide that information to all parties concerned. It is therefore applicable to settings that require grasping a situation mainly through voice.

DESCRIPTION OF SYMBOLS
10 ... Network
20 ... Cloud-side server
21 ... Database (DB)
22 ... Processor
23 ... Output device
24 ... Input device
26 ... Interface
100 ... Voice input/output device
101 ... Housing
102 ... LED ring
103-1 to 103-16 ... PDM microphones
104 ... Speaker group
104S ... Squawker
104T ... Tweeter
105 ... Reflector plate
106 ... Power lamp
107 ... Power cord
201 ... Expansion unit
202 ... Storage unit
203 ... Microphone unit
204 ... Microphone control unit
205 ... Signal processing unit
206 ... Communication unit
207 ... Sound generation unit
208 ... Non-audible sound generation unit
209 ... Display unit

Claims

A voice information processing system comprising:
input means for inputting voice information relating to voice;
pre-stage processing means for performing, on the input voice information, pre-stage processing that facilitates identification processing; and
optimization means for performing optimization by applying predetermined processing to the voice information processed by the pre-stage processing means, performing task processing based on first information, correcting the first information when the evaluation of the task processing is not sufficient, and repeating the series of processes until the evaluation becomes sufficient.

The voice information processing system according to claim 1, wherein the optimization means includes:
first evaluation means for evaluating a result of the task processing;
correction means for correcting the first information when the evaluation is not sufficient; and
repeating means for repeating the series of processes from the pre-stage processing means to the correction means.

The voice information processing system according to claim 1, further comprising determination means which, when the content of voice in a room is analyzed and responded to, causes the client-side voice input/output device to perform the information processing if its processing capability is adequate, and causes the information processing to be performed on the cloud side if the processing capability of the voice input/output device is not adequate.

The voice information processing system according to claim 1, further comprising an external system that manages indoor environment settings, intent interpretation, and dialogue.

The voice information processing system according to claim 1, wherein the external system includes at least one of: an imaging unit that images the outside of the voice input/output device, a vibration unit that vibrates the housing, a rotating unit that rotates the housing, and a projection unit that projects an image onto a wall outside the housing.

The voice information processing system according to claim 1, wherein external content can be used for intent interpretation and dialogue management.

The voice information processing system according to claim 1, further comprising:
inference means for performing information processing including analysis, parsing, and recognition of the content of the voice, and for inferring attributes including the speaker's age and sex;
collecting means for collecting logs used when designing the second information;
analyzing means for analyzing the logs; and
second evaluation means for evaluating the response and the second information,
wherein the system is optimized through continuous improvement based on the evaluation.

The voice information processing system according to claim 7, wherein the inference means includes:
interpretation means for interpreting the intent of the conversation with the speaker; and
management means for managing the dialogue with the speaker.

The voice information processing system according to claim 4, wherein the environment judgment means includes:
size determining means for determining the size of the room;
noise level recognition means for recognizing the noise level in the room; and
reverberation level recognition means for recognizing the reverberation level in the room.

The voice information processing system according to claim 5, further comprising image display means provided in the housing for displaying an image.

The voice information processing system according to claim 5, further comprising fingerprint authentication means provided in the housing for recognizing a user.

The voice information processing system according to claim 1, wherein the processing capability of the client-side voice input/output device includes the processing speed of the processor, memory size, sensor types, microphone array, number of speakers, number of LEDs, built-in camera, number of application software programs, and front-end signal processing capable of supporting other companies' devices.

The voice information processing system according to claim 1, further comprising:
feature extraction means for extracting a feature from the voice of the speaker; and
speaker identification means for identifying the speaker by storing the feature in storage means in association with speaker information and comparing a feature of newly input voice with the speaker information stored in the storage means.

The voice information processing system according to claim 7, further comprising emotion identifying means for identifying the speaker's emotion.

The voice information processing system according to claim 14, further comprising light emitting means that is provided on the outer periphery of the housing, emits light so that a portion whose emission color differs from that of the remainder circulates along the periphery, and, when the speaker is detected, emits light so that the differently colored portion stops in the direction of the speaker.

A control method for a voice information processing system, comprising:
inputting voice information relating to voice;
performing, on the input voice information, pre-stage processing that facilitates identification processing; and
performing optimization by applying predetermined processing to the voice information that has undergone the pre-stage processing, performing task processing based on first information, correcting the first information when the evaluation of the task processing is not sufficient, and repeating the series of processes until the evaluation becomes sufficient.

A program for a voice information processing system, the program being computer-readable and causing a computer to function as:
input means for inputting voice information relating to voice;
pre-stage processing means for performing, on the input voice information, pre-stage processing that facilitates identification processing; and
optimization means for performing optimization by applying predetermined processing to the voice information processed by the pre-stage processing means, performing task processing based on first information, correcting the first information when the evaluation of the task processing is not sufficient, and repeating the series of processes until the evaluation becomes sufficient.

A recording medium on which the program according to claim 17 is recorded.