JP2016531375A

JP2016531375A - Local and remote speech processing

Info

Publication number: JP2016531375A
Application number: JP2016543926A
Authority: JP
Inventors: Nikko Strom; Peter Spalding VanLund; Bjorn Hoffmeister
Original assignee: Amazon Technologies Inc
Current assignee: Amazon Technologies Inc
Priority date: 2013-09-20
Filing date: 2014-09-09
Publication date: 2016-10-06
Also published as: WO2015041892A1; EP3047481A1; CN105793923A; EP3047481A4

Abstract

A user device may be configured to detect a trigger expression spoken by a user and to respond by interpreting the words or phrases that follow as a command. The command may be recognized by sending audio containing those words or phrases to a remote service configured to perform speech recognition. Certain commands may be designated as local commands and may be detected locally rather than by relying on the remote service. Upon detecting the trigger expression, the audio is streamed to the remote service and is also analyzed locally to detect utterances of local commands. When a local command is detected, the corresponding function is initiated immediately, and any subsequent activity or response by the remote service is canceled or ignored.

Description

The present invention relates to local and remote speech processing.

Cross-Reference to Related Applications
This application claims priority to US Patent Application No. 14/033,302, filed on September 20, 2013, entitled "Local and Remote Speech Processing", which is incorporated herein by reference in its entirety.

Homes, offices, automobiles, and public spaces are becoming more wired and connected with the proliferation of computing devices such as notebook computers, tablets, entertainment systems, and portable communication devices. As computing devices evolve, the ways in which users interact with these devices continue to evolve. For example, people can interact with computing devices through mechanical devices (e.g., keyboards, mice, etc.), electrical devices (e.g., touch screens, touch pads, etc.), and optical devices (e.g., motion detectors, cameras, etc.). Another way to interact with computing devices is through audio devices that capture and respond to human speech.

The detailed description is given with reference to the accompanying drawings. In the figures, the leftmost digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical components or features.
FIG. 1 is a block diagram of an exemplary voice interaction computing architecture that includes a local audio device and a remote speech processing service. FIG. 2 and two further figures are flow diagrams, each illustrating an exemplary process for detecting command expressions that can be acted upon by the local audio device in conjunction with the remote speech processing service.

The present disclosure relates generally to a speech interface system that provides or facilitates speech-based interaction with a user. The system includes a local device having a microphone that captures audio containing user speech. Spoken user commands may be prefaced by a keyword, referred to as a trigger expression or wake expression. Audio following the trigger expression may be streamed to a remote service for speech recognition, and the service may respond by performing a function or by providing a command to be executed by the audio device.

Communication with the remote service may introduce response latency, which in most cases can be kept within acceptable limits. Some spoken commands, however, may call for lower latency. As an example, spoken commands related to certain types of media rendering, such as "stop", "pause", and "hang up", may need to be performed with little perceptible latency.

According to various embodiments, certain command expressions, referred to herein as local commands or local command expressions, are detected by or at the local device rather than by the remote service. More specifically, the local device is configured to detect a trigger or alert expression indicating that subsequent speech is intended by the user to form a command. Upon detecting the trigger expression, the local device initiates a communication session with the remote service and begins streaming received audio to the service. In response, the remote service performs speech recognition on the received audio and attempts to identify the user's intent based on the recognized speech. In response to recognizing the user's intent, the remote service may perform a corresponding function. In some cases, the function may be performed in conjunction with the local device. For example, the remote service may send a command to the local device indicating that the local device should execute the command to perform the corresponding function.

Concurrently with the activity of the remote service, the local device monitors or analyzes the audio to detect an occurrence of a local command expression following the trigger expression. Upon detecting a local command expression in the audio, the local device immediately implements the corresponding function. In addition, further action by the remote service is stopped or canceled to avoid duplicate actions in response to a single user utterance. Action by the remote service may be stopped by explicitly notifying the remote service that the utterance has been acted upon locally, by terminating or canceling the communication session, and/or by forgoing execution of any command specified by the remote service in response to remote recognition of the user speech.
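The concurrent flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in the disclosure: `RemoteSession`, `handle_utterance`, and the `LOCAL_COMMANDS` set are hypothetical names, audio frames are stood in for by already-recognized word tokens, and the remote service is simulated by an in-memory object.

```python
from dataclasses import dataclass, field

# Assumed set of local command expressions (illustrative only).
LOCAL_COMMANDS = {"stop", "pause"}


@dataclass
class RemoteSession:
    """Hypothetical stand-in for a streaming session with the remote service."""
    frames: list = field(default_factory=list)
    cancelled: bool = False

    def send(self, frame):
        # Frames sent after cancellation are dropped.
        if not self.cancelled:
            self.frames.append(frame)

    def cancel(self):
        self.cancelled = True


def detect_local_command(frame):
    """Toy local detector: a frame is a word token in this sketch."""
    return frame if frame in LOCAL_COMMANDS else None


def handle_utterance(frames, session, run_local):
    """Stream every frame to the remote session while checking it locally.

    On the first local command, run it at once and cancel the remote
    session so that the utterance is not acted upon twice.
    """
    for frame in frames:
        session.send(frame)
        command = detect_local_command(frame)
        if command is not None:
            run_local(command)
            session.cancel()
            return command
    return None
```

For example, calling `handle_utterance(["stop", "now"], RemoteSession(), print)` runs the local handler for "stop" and cancels the session before "now" is streamed.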

FIG. 1 shows an example of a voice interaction system 100. The system 100 may include or utilize a local voice-based audio device 102 that may be located within an environment 104, such as a home, and that may be used to interact with a user 106. The voice interaction system 100 may also include or utilize a remote, network-based speech command service 108 that is configured to receive audio, to recognize speech in the audio, and to perform functions, referred to herein as service-identified functions, in response to the recognized speech. A service-identified function may be implemented by the speech command service 108 independently of the audio device 102 and/or by providing a command to the audio device 102 for local execution.

In certain embodiments, the primary mode of user interaction with the audio device 102 may be through speech. For example, the audio device 102 may receive spoken command expressions from the user 106 and may provide services in response to the commands. The user may speak a predefined wake or trigger expression (e.g., "Awake"), which may be followed by a command or instruction (e.g., "I'd like to go to a movie. Please tell me what's playing at the local cinema."). The services provided may include performing actions or activities, rendering media, obtaining and/or providing information, providing information via generated or synthesized speech through the audio device 102, initiating Internet-based services on behalf of the user 106, and so forth.

The local audio device 102 and the speech command service 108 are configured to act in conjunction with each other to receive and respond to command expressions from the user 106. The command expressions may include command expressions that are detected and acted upon by the local device 102 independently of the speech command service 108. The command expressions may also include commands that are interpreted and acted upon by or in conjunction with the remote speech command service 108.

The audio device 102 may have one or more microphones 110 and one or more speakers or transducers 112 that facilitate interaction with the user 106. The microphone 110 produces a microphone signal, also referred to as an input audio signal, that represents audio from the environment 104, including sounds or expressions uttered by the user 106.

In some cases, the microphone 110 may comprise a microphone array that is used in conjunction with audio beamforming techniques to produce an input audio signal that is focused in a selected direction. Similarly, multiple directional microphones 110 may be used to produce an audio signal corresponding to one of multiple available directions.
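As an illustration of focusing a microphone array in a selected direction, the sketch below uses delay-and-sum beamforming, which is one common technique and is not named in the source. The function name `delay_and_sum` and the use of whole-sample integer delays are assumptions made for brevity.

```python
def delay_and_sum(channels, delays):
    """Combine microphone channels into one signal steered toward a direction.

    channels: list of equal-length sample lists, one per microphone.
    delays:   per-channel lag in whole samples for sound arriving from the
              selected direction (an assumed, precomputed input).
    Each channel is advanced by its lag so that sound from the selected
    direction lines up across channels, then the channels are averaged.
    Samples shifted past the end of a channel are treated as zero.
    """
    if len(channels) != len(delays):
        raise ValueError("one delay is required per channel")
    length = len(channels[0])
    output = []
    for i in range(length):
        total = 0.0
        for samples, delay in zip(channels, delays):
            j = i + delay
            if 0 <= j < length:
                total += samples[j]
        output.append(total / len(channels))
    return output
```

When the delays match the true arrival lags, a pulse that reaches the second microphone one sample late adds coherently; with mismatched delays the same pulse is smeared and attenuated.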

The audio device 102 includes processing logic, which in many cases may comprise a processor 114 and memory 116. The processor 114 may comprise multiple processors and/or a processor having multiple cores. The processor 114 may also comprise or include a digital signal processor for processing audio signals.

The memory 116 may contain applications and programs in the form of computer-executable instructions that are executed by the processor 114 to perform acts or actions that implement the desired functionality of the audio device 102, including the functionality specifically described below. The memory 116 may be a type of computer-readable storage media and may include volatile and non-volatile memory. Thus, the memory 116 may include, but is not limited to, RAM, ROM, EEPROM, flash memory, or other memory technology.

The audio device 102 may include a plurality of applications, services, and/or functions 118, referred to collectively below as functional components 118, which are executed by the processor 114 to provide services and functionality. The applications and other functional components 118 may include media playback services such as music players. Other services or operations provided or performed by the applications and other functional components 118 may include, as examples, requesting and consuming entertainment (e.g., gaming, finding and playing music, movies or other content, etc.), personal management (e.g., calendaring, note taking, etc.), online shopping, financial transactions, database inquiries, person-to-person voice communications, and so forth.

In some embodiments, the functional components 118 may be pre-installed on the audio device 102 and may implement core functionality of the audio device 102. In other embodiments, one or more of the applications or other functional components 118 may be installed by the user 106, or otherwise installed after the audio device 102 has been put into use by the user 106, and may implement additional or customized functionality as desired by the user 106.

The processor 114 may be configured by audio processing functionality or components 120 to process input audio signals generated by the microphone 110 and/or output audio signals provided to the speaker 112. As an example, the audio processing component 120 may implement acoustic echo cancellation to reduce audio echo generated by acoustic coupling between the microphone 110 and the speaker 112. The audio processing component 120 may also implement noise reduction to reduce noise in received audio signals, such as elements of input audio signals other than user speech. In certain embodiments, the audio processing component 120 may include one or more audio beamformers that are responsive to multiple microphones 110 to generate an audio signal that is focused in a direction from which user speech has been detected.

The audio device 102 may also be configured to implement one or more expression detectors or speech recognition components 122, which may be used to detect a trigger expression in speech captured by the microphone 110. The term "trigger expression" is used herein to indicate a word, phrase, or other utterance that is used to signal the audio device 102 that subsequent user speech is intended by the user to be interpreted as a command.

The one or more speech recognition components 122 may also be used to detect commands or command expressions in the speech captured by the microphone 110. The term "command expression" is used herein to indicate a word, phrase, or other utterance that corresponds to or is associated with a function that is to be performed by the audio device 102 or by another service or entity that is accessible to the audio device 102, such as the speech command service 108. For example, the words "stop", "pause", and "hang up" may be used as command expressions. The "stop" and "pause" command expressions may indicate that media playback activities should be interrupted. The "hang up" command expression may indicate that a current person-to-person communication should be terminated. Other command expressions, corresponding to different functions, may also be used. Command expressions may include conversational directives such as "find a nearby Italian restaurant".

Command expressions may include local command expressions that are to be interpreted by the audio device 102 without relying on the speech command service 108. Generally, local command expressions are relatively short expressions, such as single words or short phrases, that can be easily detected by the audio device 102. Local command expressions may correspond to device functions for which relatively low response latency is desired, such as media control or media playback control functions. The services of the speech command service 108 may be utilized for other command expressions for which greater response latency is acceptable. Command expressions that are to be acted upon by the speech command service are referred to herein as remote command expressions.

In some cases, the speech recognition component 122 may be implemented using automatic speech recognition (ASR) techniques. For example, large-vocabulary speech recognition techniques may be used for keyword detection, and the output of the speech recognition may be monitored for occurrences of the keyword. As an example, the speech recognition may use hidden Markov models and Gaussian mixture models to recognize speech and to provide a continuous word stream corresponding to the speech input. The word stream may then be monitored to detect one or more specified words or expressions.
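Monitoring a recognizer's word stream for specified expressions can be sketched as below. The recognizer itself is out of scope here: the sketch assumes its output is already available as an iterable of words, and the function name `monitor_word_stream` is hypothetical.

```python
from collections import deque


def monitor_word_stream(words, expressions):
    """Yield each watched expression as soon as its last word appears.

    words:       iterable of recognized words (assumed recognizer output).
    expressions: phrases to watch for, e.g. ["stop", "hang up"].
    """
    targets = [tuple(e.lower().split()) for e in expressions]
    if not targets:
        return
    # Keep only as many recent words as the longest expression needs.
    recent = deque(maxlen=max(len(t) for t in targets))
    for word in words:
        recent.append(word.lower())
        window = tuple(recent)
        for target in targets:
            if window[-len(target):] == target:
                yield " ".join(target)
```

Because the function is a generator, a caller can react to a detected expression before the rest of the stream has been recognized.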

Alternatively, the speech recognition component 122 may be implemented by one or more keyword spotters. A keyword spotter is a functional component or algorithm that evaluates an audio signal to detect the presence of one or more predefined words or expressions in the audio signal. Generally, a keyword spotter uses simplified ASR techniques to detect a specific word or a limited number of words rather than attempting to recognize a large vocabulary. For example, a keyword spotter may provide a notification when a specified word is detected in an audio signal, rather than providing a textual or word-based output. A keyword spotter using these techniques may compare different words based on hidden Markov models (HMMs), which represent words as series of states. Generally, an utterance is analyzed by comparing its model to a keyword model and to a background model. Comparing the model of the utterance with the keyword model yields a score that represents the likelihood that the utterance corresponds to the keyword. Comparing the model of the utterance with the background model yields a score that represents the likelihood that the utterance corresponds to a generic word or words other than the keyword. The two scores can be compared to determine whether the keyword has been uttered.
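The two-score comparison can be illustrated with a log-likelihood ratio test. In this sketch each model is deliberately simplified to a single one-dimensional Gaussian over a scalar feature, in place of the hidden Markov models mentioned above; the function names, the `(mean, variance)` model format, and the threshold value are all assumptions.

```python
import math


def gaussian_log_likelihood(features, mean, variance):
    """Log-likelihood of scalar features under one Gaussian (stand-in model)."""
    return sum(
        -0.5 * (math.log(2 * math.pi * variance) + (x - mean) ** 2 / variance)
        for x in features
    )


def keyword_detected(features, keyword_model, background_model, threshold=0.0):
    """Compare the keyword score with the background score.

    keyword_model and background_model are (mean, variance) pairs.
    Returns True when the keyword model explains the features better than
    the background model by more than `threshold` (in log-likelihood units).
    """
    keyword_score = gaussian_log_likelihood(features, *keyword_model)
    background_score = gaussian_log_likelihood(features, *background_model)
    return (keyword_score - background_score) > threshold
```

Raising the threshold trades missed detections for fewer false alarms, which is the usual tuning knob in a detector of this shape.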

The audio device 102 may further include control functionality 124, referred to herein as a controller or control logic, that is configured to interact with the other components of the audio device 102 in order to implement the logical functionality of the audio device 102.

The control logic 124, the audio processing component 120, the speech recognition component 122, and the functional components 118 may comprise executable instructions, programs, and/or program modules that are stored in the memory 116 and executed by the processor 114.

The speech command service 108 may in some instances be part of a network-accessible computing platform that is maintained and accessible via a network 126 such as the Internet. Network-accessible computing platforms such as this may be referred to using terms such as "software as a service (SaaS)", "on-demand computing", "platform computing", "network-accessible platform", "cloud services", "data centers", and so forth.

The audio device 102 and/or the speech command service 108 may be communicatively coupled to the network 126 via wired technologies (e.g., wires, universal serial bus (USB), fiber optic cable, etc.), wireless technologies (e.g., radio frequency (RF), cellular, mobile telephone networks, satellite, Bluetooth, etc.), or other connection technologies. The network 126 is representative of any type of communication network, including data and/or voice networks, and may be implemented using wired infrastructure (e.g., coaxial cable, fiber optic cable, etc.), wireless infrastructure (e.g., RF, cellular, microwave, satellite, Bluetooth (registered trademark), etc.), and/or other connection technologies.

Although the audio device 102 is described herein as a voice-controlled or speech-based interface device, the techniques described herein may be implemented in conjunction with various different types of devices, such as telecommunications devices and components, hands-free devices, entertainment devices, media playback devices, and so forth.

The speech command service 108 generally provides functionality for receiving an audio stream from the audio device 102, recognizing speech in the audio stream, determining user intent from the recognized speech, and performing an action or service in response to the user intent. The provided action may in some cases be performed in conjunction with the audio device 102, and in these cases the speech command service 108 may return a response to the audio device 102 indicating a command that is to be executed by the audio device 102.

The speech command service 108 includes processing logic, which in many cases may comprise one or more servers, computers, and/or processors 128. The speech command service 108 may also have memory 130 containing applications and programs in the form of instructions that are executed by the processor 128 to perform acts or actions that implement the desired functionality of the speech command service, including the functionality specifically described herein. The memory 130 may be a type of computer storage media and may include volatile and non-volatile memory. Thus, the memory 130 may include, but is not limited to, RAM, ROM, EEPROM, flash memory, or other memory technology.

Among other logical and physical components not specifically shown, the speech command service 108 may include a speech recognition component 132. The speech recognition component 132 may include automatic speech recognition (ASR) functionality that recognizes human speech in an audio signal.

The speech command service 108 may also include a natural language understanding (NLU) component 134 that determines user intent based on the recognized speech.

The speech command service 108 may also include a command interpreter and action dispatcher 136 (referred to below simply as the command interpreter 136) that determines functions or commands corresponding to the user intent. In some cases, a command may correspond to a function that is to be performed at least in part by the audio device 102, and in those cases the command interpreter 136 may provide a response to the audio device 102 indicating a command for implementing such a function. Examples of commands or functions that may be performed by the audio device in response to directives from the command interpreter 136 include playing music or other media, increasing or decreasing the volume of the speaker 112, generating audible speech through the speaker 112, initiating certain types of communication with users of similar devices, and so forth.

Note that the speech command service 108 may also perform functions, in response to speech recognized from received audio, that involve entities or devices not shown in FIG. 1. For example, the speech command service 108 may interact with other network-based services to obtain information or services on behalf of the user 106. Furthermore, the speech command service 108 may itself have various elements and functionality that can respond to speech uttered by the user 106.

In operation, the microphone 110 of the audio device 102 captures or receives audio containing speech of the user 106. The audio is processed by the audio processing component 120, and the processed audio is received by the speech recognition component 122. The speech recognition component 122 analyzes the audio to detect occurrences of a trigger expression in the speech contained in the audio. Upon detecting the trigger expression, the controller 124 begins sending or streaming the received audio to the speech command service 108, along with a request for the speech command service 108 to recognize and interpret the user speech and to initiate a function corresponding to any interpreted intent.

Concurrently with sending the audio to the speech command service 108, the speech recognition component 122 continues to analyze the received audio to detect an occurrence of a local command expression in the user speech. Upon detecting a local command expression, the controller 124 initiates or performs the device function that corresponds to the local command expression. For example, in response to the local command expression "stop", the controller 124 may initiate a function that stops media playback. The controller 124 may interact with one or more of the functional components 118 when initiating or performing the function.

Meanwhile, the speech command service 108, in response to receiving the audio, concurrently analyzes the audio to recognize speech, to determine the user's intent, and to determine a service-identified function that is to be implemented in response to that intent. However, after locally detecting and acting upon a local command expression, the audio device 102 may take action to cancel, nullify, or invalidate any service-identified function that might eventually be initiated by the speech command service 108. For example, the audio device 102 may cancel its previous request by sending a cancellation message to the speech command service 108 and/or by stopping the streaming of audio to the speech command service 108. As another example, the audio device may ignore or discard any response or service-specified command that is received from the speech command service 108 in response to the earlier request. In some cases, the audio device may notify the speech command service 108 of the action that has been performed locally in response to the local command expression, and the speech command service 108 may modify its subsequent behavior based on this information. For example, the speech command service 108 may forgo an action that it might otherwise have performed in response to speech recognized in the received audio.
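One of the options above, ignoring or discarding a late response from the service, can be sketched on the device side as follows. The request identifier, the dictionary-shaped response, and the names `ResponseFilter` and `apply_remote_response` are hypothetical; the source does not specify a message format.

```python
class ResponseFilter:
    """Tracks requests that were already handled locally (illustrative only)."""

    def __init__(self):
        self._handled_locally = set()

    def mark_handled_locally(self, request_id):
        # Called when a local command expression was detected and acted upon.
        self._handled_locally.add(request_id)

    def accept(self, request_id):
        # A response is accepted only if its request was not handled locally.
        return request_id not in self._handled_locally


def apply_remote_response(response, response_filter, execute):
    """Execute a service-specified command unless its request is stale.

    response: dict with assumed keys "request_id" and "command".
    Returns True if the command was executed, False if it was discarded.
    """
    if not response_filter.accept(response["request_id"]):
        return False
    execute(response["command"])
    return True
```

Filtering on the device guards against a response that was already in flight when the cancellation was sent, so it complements rather than replaces an explicit cancel message.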

FIG. 2 illustrates an example method 200 that may be performed by the audio device 102 in conjunction with the speech command service 108 in order to recognize and respond to user speech. The method 200 is described in the context of the system 100 of FIG. 1, although the method 200 may also be performed in other environments and may be implemented in different ways.

Actions on the left side of FIG. 2 are performed at or by the local audio device 102. Actions on the right side of FIG. 2 are performed at or by the remote speech command service 108.

An action 202 comprises receiving an audio signal that is captured by or in conjunction with the microphone 110. The audio signal contains or represents audio from the environment 104 and may contain user speech. The audio signal may be an analog electrical signal or may comprise a digital signal such as a digital audio stream.

An action 204 comprises detecting an occurrence of a trigger expression in the received audio and/or user speech. This may be performed by the speech recognition component 122 as described above, which in some embodiments may comprise a keyword spotter. If the trigger expression is not detected, the action 204 is repeated in order to monitor continuously for occurrences of the trigger expression. The remaining actions shown in FIG. 2 are performed in response to detecting the trigger expression.

If the trigger expression is detected in the action 204, an action 206 is performed. The action 206 comprises sending subsequently received audio to the speech command service 108, along with a service request 208 for the speech command service 108 to recognize speech in the audio and to implement a function corresponding to the recognized speech. Functions initiated by the speech command service 108 in this way are referred to herein as service-identified functions, and in certain cases may be performed in conjunction with the audio device 102. For example, a function may be initiated by sending a command to the audio device 102.

The sending 206 may comprise streaming or otherwise transmitting a digital audio stream 210 to the speech command service 108, where the stream represents or contains audio received from the microphone 110 subsequent to detection of the trigger expression. In certain embodiments, the action 206 may comprise opening or initiating a communication session between the audio device 102 and the speech command service 108. In particular, the request 208 may be used to establish a communication session with the speech command service 108 for the purposes of recognizing speech, understanding intent, and determining an action or function to be performed in response to the user speech. The request 208 may be followed or accompanied by the streamed audio 210. In some cases, the audio stream 210 provided to the speech command service 108 may include a portion of the received audio beginning at a time just prior to the trigger expression.
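One way to include audio from just before the trigger expression is to keep a short rolling buffer of recent frames. The following is only a sketch under assumptions that are not in the source: audio is modeled as a sequence of opaque frames, and `stream_with_preroll`, `is_trigger`, and `preroll` are hypothetical names.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator


def stream_with_preroll(frames: Iterable[str],
                        is_trigger: Callable[[str], bool],
                        preroll: int) -> Iterator[str]:
    """Yield up to `preroll` frames captured before the trigger frame,
    then the trigger frame and every frame after it."""
    recent: Deque[str] = deque(maxlen=preroll)  # rolling pre-trigger buffer
    triggered = False
    for frame in frames:
        if triggered:
            yield frame            # stream everything after the trigger
        elif is_trigger(frame):
            triggered = True
            yield from recent      # audio from just before the trigger
            yield frame
        else:
            recent.append(frame)   # nothing is sent before the trigger


frames = ["a", "b", "c", "TRIGGER", "stop", "x"]
streamed = list(stream_with_preroll(frames, lambda f: f == "TRIGGER", 2))
print(streamed)
```

Frames seen before the trigger are never streamed unless they fall inside the buffer, which matches the idea that streaming starts only in response to the trigger expression.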

The communication session may be associated with a communication or session identifier (ID) that identifies the communication session established between the audio device 102 and the speech command service 108. The session ID may be used or included in future communications relating to a particular user utterance or audio stream. In some cases, the session ID may be generated by the audio device 102 and provided in the request 208 to the speech command service 108. Alternatively, the session ID may be generated by the speech command service 108 and provided by the speech command service 108 in its acknowledgment of the request 208. The term "request(ID)" is used herein to indicate a request having a particular session ID. A response from the speech command service 108 relating to the same session, request, or audio stream may be indicated by the term "response(ID)".

In certain embodiments, each communication session and corresponding session ID may correspond to a single user utterance. For example, the audio device 102 may establish a session upon detecting the trigger expression. The audio device 102 may then continue to stream audio to the speech command service 108 as part of the same session until the end of the user utterance. The speech command service 108 may provide a response to the audio device 102 through the session, using the same session ID. The response may in some cases indicate a command to be executed by the audio device 102 in response to speech recognized by the speech command service 108 in the received audio 210. The communication session may remain open until the audio device 102 receives a response from the speech command service 108 or until the audio device 102 cancels the request.
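The session bookkeeping described above can be sketched as follows. This is a minimal illustration rather than the described implementation: `Request`, `Response`, `open_session`, and `matches` are hypothetical names, and generating the ID on the device is only one of the two options mentioned in the text.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """A "request(ID)": a service request tagged with a session ID."""
    session_id: str


@dataclass(frozen=True)
class Response:
    """A "response(ID)": a service response carrying the same session ID."""
    session_id: str
    command: str


def open_session() -> Request:
    # Assumption: the device generates the ID. The service could equally
    # generate it and return it when acknowledging the request.
    return Request(session_id=uuid.uuid4().hex)


def matches(request: Request, response: Response) -> bool:
    """Return True if the response belongs to the request's session."""
    return request.session_id == response.session_id


first = open_session()
second = open_session()
reply = Response(session_id=first.session_id, command="stop")
print(matches(first, reply), matches(second, reply))
```

Because each utterance gets its own ID, a late response can be attributed to the utterance that caused it even when a newer session is already open.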

The speech command service 108 receives the request 208 and the audio stream 210 in an action 212. In response, the speech command service 108 performs an action 214 of recognizing speech in the received audio and determining the user's intent as expressed by the recognized speech, using the speech recognition component 132 and the natural language understanding component 134 of the speech command service 108. An action 216, performed by the command interpreter 136, comprises identifying and initiating a service-identified function that fulfills the determined user intent. The service-identified function may in some cases be performed by the speech command service 108 independently of the audio device 102. In other cases, the speech command service 108 may identify a function to be performed by the audio device 102 and may send a corresponding command to the audio device 102 for execution by the audio device 102.

Concurrently with the actions performed by the speech command service 108, the local audio device 102 performs further actions to determine whether the user has uttered a local command expression and to perform a corresponding local function in response to any such uttered local command expression. Specifically, an action 218, performed in response to detecting the trigger expression in the action 204, comprises analyzing the audio received in the action 202 to detect an occurrence of a local command expression following or immediately following the trigger expression in the received user speech. This may be performed by the speech recognition component 122 of the audio device 102 as described above, which in some embodiments may comprise a keyword spotter.

In response to detecting a local command expression in the action 218, an action 220 is performed of immediately initiating the device function that is associated with the local command expression. For example, the local command expression "stop" may be associated with a function that stops media playback.
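The association between local command expressions and device functions can be pictured as a lookup table. The sketch below assumes, for illustration only, that a local recognizer has already turned the audio into a list of words; the trigger word "computer", `MediaPlayer`, and `find_local_command` are hypothetical, and only the "immediately following" case is handled.

```python
from typing import Callable, Dict, List, Optional


class MediaPlayer:
    """Hypothetical stand-in for a functional component of the device."""

    def __init__(self) -> None:
        self.playing = True

    def stop(self) -> None:
        self.playing = False


def find_local_command(words: List[str], trigger: str,
                       local_commands: Dict[str, Callable[[], None]]) -> Optional[str]:
    """Return the local command expression that immediately follows the
    trigger expression, or None if there is none."""
    for i, word in enumerate(words[:-1]):
        if word == trigger and words[i + 1] in local_commands:
            return words[i + 1]
    return None


player = MediaPlayer()
local_commands = {"stop": player.stop}  # expression -> device function

expression = find_local_command(["computer", "stop"], "computer", local_commands)
if expression is not None:
    local_commands[expression]()  # initiate the device function right away
print(expression, player.playing)
```

Utterances such as "computer play jazz" return `None` here, so they would be left entirely to the remote service.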

Also in response to detecting the local command expression in the action 218, the audio device 102 performs an action 222 of stopping or canceling the request 208 to the speech command service 108. This may include canceling or nullifying implementation of the service-identified function that might otherwise have been implemented by the speech command service 108 in response to the received request 208 and the accompanying audio 210.

In certain implementations, the action 222 may comprise sending an explicit notification or command to the speech command service 108, requesting that the speech command service 108 cancel any further recognition activity with respect to the service request 208 and/or cancel implementation of any service-identified function that might otherwise have been initiated in response to recognized speech. Alternatively, the audio device 102 may simply inform the speech command service 108 of any function that has been performed locally in response to local recognition of the local command expression, and the speech command service 108 may respond by canceling the service request 208 or by performing other actions as appropriate.

In certain implementations, the speech command service 108 may implement a service-identified function by identifying a command to be executed by the audio device 102. In response to receiving a notification that the service request 208 is to be canceled, the speech command service 108 may forego sending the command to the audio device 102. Alternatively, the speech command service may be allowed to complete its processing and send the command to the audio device 102, whereupon the audio device 102 may ignore the command or forego executing it.

In some implementations, the speech command service may be configured to notify the audio device 102 before initiating a service-identified function, and may delay implementation of the service-identified function until it receives permission from the audio device 102. In this case, the audio device 102 may be configured to deny such permission when a local command expression has been recognized locally.
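The permission variant can be sketched as a small handshake. The classes and method names below (`AudioDevice`, `SpeechCommandService`, `grant_permission`, `implement`) are hypothetical, and the in-process method call stands in for whatever network exchange a real system would use.

```python
from typing import List, Set, Tuple


class AudioDevice:
    """Hypothetical device that remembers which sessions it handled locally."""

    def __init__(self) -> None:
        self.locally_handled: Set[str] = set()

    def on_local_command(self, session_id: str) -> None:
        self.locally_handled.add(session_id)

    def grant_permission(self, session_id: str) -> bool:
        # Deny permission if a local command was recognized for this session.
        return session_id not in self.locally_handled


class SpeechCommandService:
    """Hypothetical service that asks the device before implementing a function."""

    def __init__(self, device: AudioDevice) -> None:
        self.device = device
        self.performed: List[Tuple[str, str]] = []

    def implement(self, session_id: str, function_name: str) -> bool:
        if not self.device.grant_permission(session_id):
            return False  # permission denied: the function is not implemented
        self.performed.append((session_id, function_name))
        return True


device = AudioDevice()
service = SpeechCommandService(device)
device.on_local_command("s1")  # "stop" was recognized locally in session s1
denied = service.implement("s1", "stop_playback")
granted = service.implement("s2", "play_music")
print(denied, granted, service.performed)
```

The extra round trip is what makes this variant safer against duplicate actions but slower than acting first and canceling afterwards.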

The various approaches described above may be used in situations that call for different amounts of command latency. For example, waiting for communications from the speech command service before implementing a function introduces relatively high latency, which may be unacceptable in some situations, although such prior communication may guard against duplicate or unintended actions. Immediately implementing locally recognized command expressions, and either ignoring subsequent commands from the speech command service or subsequently canceling the request to the speech command service, may be more appropriate in situations where lower latency is desired.

Note that the actions of the speech command service 108 shown in FIG. 2 are performed in parallel with, and asynchronously to, the actions 218, 220, and 222 of the audio device 102. Some implementations assume that the audio device 102 is able to detect and act upon a local command expression relatively quickly, so that it can perform the action 222 of canceling the request 208, and subsequent processing by the speech command service 108, before the service-identified function of the action 216 has been implemented or executed.

FIG. 3 illustrates an example method 300 in which the speech command service 108 returns a command to the audio device 102, and in which the audio device 102 is configured to ignore the command or forego executing it in situations where a local command expression has already been detected and acted upon by the audio device 102. The initial actions are similar or identical to those described above. Actions performed by the audio device 102 are shown on the left, and actions performed by the speech command service 108 are shown on the right.

An action 302 comprises receiving an audio signal containing user speech. An action 304 comprises analyzing the audio signal to detect a trigger expression in the user speech. The subsequent actions shown in FIG. 3 are performed in response to detecting the trigger expression.

An action 306 comprises sending a request 308 and audio 310 to the speech command service 108. An action 312 comprises receiving the request 308 and the audio 310 at the speech command service 108. An action 314 comprises recognizing the user speech and determining the user's intent based on the recognized user speech.

In response to the determined user intent, the speech command service 108 performs an action 316 of sending a command 318 to the audio device 102, for execution by the audio device 102, in order to implement a service-identified function corresponding to the recognized user intent. For example, the command may comprise a "stop" command indicating that the audio device 102 is to stop playing music.

An action 320, performed by the audio device 102, comprises receiving and executing the command. The action 320 is shown in a dashed box to indicate that it is performed conditionally, based on whether a local command expression has been detected and acted upon by the audio device 102. Specifically, the action 320 is not performed if a local command expression has been detected by the audio device 102.

In parallel with the actions performed by the speech command service 108, the audio device 102 performs an action 322 of analyzing the received audio to detect an occurrence of a local command expression following or immediately following the trigger expression in the received user speech. In response to detecting a local command expression, an action 324 is immediately performed of initiating the local device function that is associated with the local command expression.

Also in response to detecting the local command expression in the action 322, the audio device 102 performs an action 326 of foregoing execution of the received command 318. More specifically, any commands received from the speech command service 108 in response to the request 308 are discarded or ignored. Responses and commands corresponding to the request 308 may be identified by the session ID associated with the responses.

If a local command expression is not detected in the action 322, the audio device performs the action 320 of receiving and executing the command 318 from the speech command service 108.
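The device-side logic of the method 300 can be sketched as a filter on incoming commands, keyed by session ID. This is an illustrative sketch only: the class, its method names, and the string log are hypothetical, and the action numbers in the comments map the code back to FIG. 3.

```python
from typing import List, Set


class AudioDevice:
    """Hypothetical device that ignores service commands for sessions
    in which it has already acted on a local command expression."""

    def __init__(self) -> None:
        self.handled_sessions: Set[str] = set()
        self.executed: List[str] = []

    def on_local_command(self, session_id: str, device_function: str) -> None:
        # Actions 322 and 324: act immediately on the local command.
        self.executed.append("local:" + device_function)
        self.handled_sessions.add(session_id)

    def on_service_command(self, session_id: str, command: str) -> bool:
        if session_id in self.handled_sessions:
            return False  # action 326: discard the command
        self.executed.append("service:" + command)  # action 320
        return True


device = AudioDevice()
device.on_local_command("s1", "stop")
ignored = device.on_service_command("s1", "stop")      # late reply for s1
accepted = device.on_service_command("s2", "play_music")
print(ignored, accepted, device.executed)
```

Here the service is never told about the local action, so it always completes its processing; the duplicate "stop" is simply dropped on arrival.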

FIG. 4 illustrates an example method 400 in which the audio device 102 is configured to actively cancel a request to the speech command service 108 after locally detecting a local command expression. The initial actions are similar or identical to those described above. Actions performed by the audio device 102 are shown on the left, and actions performed by the speech command service 108 are shown on the right.

An action 402 comprises receiving an audio signal containing user speech. An action 404 comprises analyzing the audio signal to detect a trigger expression in the user speech. The subsequent actions shown in FIG. 4 are performed in response to detecting the trigger expression.

An action 406 comprises sending a request 408 and audio 410 to the speech command service 108. An action 412 comprises receiving the request 408 and the audio 410 at the speech command service 108. An action 414 comprises recognizing the user speech and determining the user's intent based on the recognized user speech.

An action 416 comprises determining whether the request 408 has been canceled by the audio device 102. As an example, the audio device 102 may send a cancellation message or may terminate the current communication session in order to cancel the request. If the request has been canceled by the audio device 102, no further action is taken by the speech command service. If the request has not been canceled, an action 418 is performed, which comprises sending a command 420 to the audio device 102, for execution by the audio device 102, in order to implement a service-identified function corresponding to the recognized user intent.

An action 422, performed by the audio device 102, comprises receiving and executing the command. The action 422 is shown in a dashed box to indicate that it is performed conditionally, depending on whether a command has been sent by and received from the speech command service 108, which in turn depends on whether the audio device 102 has canceled the request 408.

In parallel with the actions performed by the speech command service 108, the audio device 102 performs an action 424 of analyzing the received audio to detect an occurrence of a local command expression following or immediately following the trigger expression in the received user speech. In response to detecting a local command expression, an action 426 is immediately performed of initiating the local device function that is associated with the local command expression.

Also in response to detecting the local command expression in the action 424, the audio device 102 performs an action 428 of requesting the speech command service 108 to cancel the request 408 and/or to cancel implementation of any service-identified function that might otherwise have been performed in response to speech recognized by the speech command service 108 in the audio received from the audio device 102. This may comprise communicating with the speech command service 108, such as by sending a cancellation notification or request.

In some cases, the cancellation may comprise a reply to a communication or notification from the speech command service 108 indicating the pending implementation of a service-identified function by the speech command service. In response to receiving such a notification, the audio device 102 may reply and request cancellation of the pending implementation. Alternatively, the audio device 102 may cancel implementation of any function that might otherwise have been performed in response to detecting the local command expression, and may instruct the speech command service 108 to proceed with implementation of the pending function.

If a local command expression is not detected in the action 424, the audio device 102 performs the action 422 of executing the command 420 received from the speech command service 108. The action 422 may occur asynchronously, upon receiving the command 420 from the speech command service.
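The parallel, asynchronous flow of the method 400 can be sketched with two cooperating coroutines. This is a simulation under stated assumptions, not the described system: both sides run in one process, the `asyncio.sleep` delays stand in for recognition time, a shared `asyncio.Event` stands in for the cancellation message, and all function names are hypothetical.

```python
import asyncio
from typing import List


async def speech_command_service(cancelled: asyncio.Event, outbox: List[str]) -> None:
    """Hypothetical service: recognize speech, then send a command unless canceled."""
    await asyncio.sleep(0.05)  # stands in for remote recognition (actions 412, 414)
    if cancelled.is_set():     # action 416: has the request been canceled?
        return                 # canceled: take no further action
    outbox.append("stop")      # action 418: send the command to the device


async def audio_device(local_command_heard: bool) -> List[str]:
    """Hypothetical device: race local detection against the remote service."""
    cancelled = asyncio.Event()
    outbox: List[str] = []
    log: List[str] = []
    service = asyncio.create_task(speech_command_service(cancelled, outbox))
    await asyncio.sleep(0.01)  # stands in for faster local detection (action 424)
    if local_command_heard:
        log.append("local:stop")  # action 426: initiate the device function
        cancelled.set()           # action 428: cancel the request
    await service
    log.extend("service:" + command for command in outbox)  # action 422
    return log


with_local = asyncio.run(audio_device(True))
without_local = asyncio.run(audio_device(False))
print(with_local, without_local)
```

The sketch relies on local detection finishing before the simulated service does; if the cancellation arrived too late, the device would still need a fallback such as ignoring the command on receipt.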

The embodiments described above may be implemented programmatically, such as with computers, processors, digital signal processors, analog processors, and so forth. In other embodiments, however, one or more of the components, functions, or elements may be implemented using specialized or dedicated circuits, including analog circuits and/or digital logic circuits. The term "component", as used herein, is intended to include any hardware, software, logic, or combination of the foregoing that is used to implement the functionality attributed to the component.

Although the subject matter has been described in language specific to structural features, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features described. Rather, the specific features are disclosed as illustrative forms of implementing the claims.

Clauses
1. One or more non-transitory computer-readable media storing computer-executable instructions that, when executed, cause one or more processors to perform acts comprising: receiving audio that contains user speech; detecting a trigger expression in the user speech; in response to detecting the trigger expression in the user speech, streaming the received audio to a remote speech command service and analyzing the received audio to detect a local command expression that follows the trigger expression in the user speech, the local command expression being associated with a device function; in response to detecting the local command expression following the trigger expression in the user speech, initiating the device function; receiving a response from the remote speech command service, wherein the response indicates a command to be executed in response to speech recognized by the remote speech command service in the streamed audio; if the local command expression is not detected following the trigger expression in the user speech, executing the command indicated by the response; and if the local command expression is detected following the trigger expression in the user speech, foregoing execution of the command indicated by the response.

2. The one or more computer-readable media of clause 1, wherein the streaming is associated with a communication identifier and the response indicates the communication identifier.

3. The one or more computer-readable media of clause 1, wherein the device function comprises a media control function.

4. The one or more computer-readable media of clause 1, the acts further comprising stopping the streaming of the received audio in response to detecting the command expression.

5. A method comprising: receiving audio that contains user speech; detecting a trigger expression in the user speech; in response to detecting the trigger expression in the user speech, sending the received audio to a speech command service in order to recognize speech in the received audio and to perform a first function corresponding to the recognized speech, and analyzing the received audio to detect a local command expression that follows the trigger expression in the received audio, the local command expression being associated with a second function; and in response to detecting the local command expression following the trigger expression in the received audio, initiating the second function and canceling execution of the first function.

6. The method of clause 5, wherein canceling execution of the first function comprises requesting the speech command service to cancel execution of the first function.

7. The method of clause 5, further comprising receiving a communication from the speech command service indicating pending execution of the first function, wherein canceling execution of the first function comprises requesting the speech command service to cancel the pending execution of the first function.

8. The method of clause 5, further comprising receiving a command corresponding to the first function from the speech command service, wherein canceling execution of the first function comprises foregoing execution of the command received from the speech command service.

9. The method of clause 5, further comprising notifying the speech command service that the second function has been initiated.

10. The method of clause 5, wherein canceling execution of the first function comprises notifying the speech command service that the second function has been initiated.

11. The method of clause 5, wherein the second function comprises a media control function.

12. The method of clause 5, further comprising establishing a communication session with the speech command service in response to detecting the trigger expression in the audio, wherein canceling execution of the first function comprises terminating the communication session.

13. The method of clause 5, further comprising: associating an identifier with the received audio; and receiving, from the speech command service, a response that indicates the identifier and a command corresponding to the first function; wherein canceling execution of the first function comprises foregoing execution of the command.

14. A system comprising:
one or more speech recognition components configured to recognize user speech in received audio, detect a trigger expression in the user speech, and detect a local command expression in the user speech; and
control logic configured to perform operations in response to detection of the trigger expression in the user speech by the one or more speech recognition components, the operations comprising:
transmitting the audio to a speech command service to recognize speech in the audio and to perform a first function corresponding to the recognized speech; and
in response to detection of the local command expression in the user speech by the one or more speech recognition components, (a) identifying a second function corresponding to the local command expression and (b) canceling execution of at least one of the first and second functions.

15. The system of clause 14, wherein the one or more speech recognition components comprise one or more keyword spotters.

16. The system of clause 14, wherein canceling execution of the at least one of the first and second functions comprises requesting the speech command service to cancel execution of the first function.

17. The system of clause 14, wherein canceling execution of the at least one of the first and second functions comprises ignoring a command received from the speech command service.

18. The system of clause 14, wherein the second function comprises a media control function.

19. The system of clause 14, wherein the operations further comprise ceasing transmission of the audio in response to detection of the local command expression in the user speech.

20. The system of clause 14, wherein canceling execution of the at least one of the first and second functions comprises notifying the speech command service that the second function has been initiated.

Claims

A device storing computer-executable instructions that, when executed, cause one or more processors of the device to perform operations comprising:
Receiving audio including user speech;
Detecting a trigger expression in the user speech;
In response to detecting the trigger expression in the user speech,
Streaming the received audio to a remote speech command service, and
Analyzing the received audio to detect a local command expression, associated with a device function, following the trigger expression in the user speech;
In response to detecting the local command expression following the trigger expression in the user speech, initiating the device function;
Receiving a response from the remote speech command service, wherein the response indicates a command to be executed in response to speech recognized by the remote speech command service in the streamed audio;
Executing the command indicated by the response if the local command expression is not detected following the trigger expression in the user speech; and
Refraining from executing the command indicated by the response if the local command expression is detected following the trigger expression in the user speech.
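
The decision at the end of this claim (execute the remote command only when no local command expression was detected) can be illustrated with a minimal Python sketch. It is not the implementation described in the specification: the class `UtteranceSession`, its method names, and the idea of passing the keyword spotter's result in as `spotted_phrase` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class UtteranceSession:
    """Hypothetical state for one utterance that began with the trigger expression."""

    # Hypothetical mapping: local command expression -> device function.
    local_functions: Dict[str, Callable[[], None]]
    # Stand-in for audio streamed to the remote speech command service.
    streamed: List[bytes] = field(default_factory=list)
    local_command: Optional[str] = None

    def on_audio(self, chunk: bytes, spotted_phrase: Optional[str] = None) -> None:
        """Stream the chunk and act on the local keyword spotter's result, if any."""
        self.streamed.append(chunk)
        if self.local_command is None and spotted_phrase in self.local_functions:
            self.local_command = spotted_phrase
            # Initiate the device function immediately, without waiting for the service.
            self.local_functions[spotted_phrase]()

    def on_remote_response(self, command: Callable[[], None]) -> bool:
        """Execute the remote command only if no local command expression was detected."""
        if self.local_command is not None:
            return False  # refrain from executing the command in the response
        command()
        return True
```

In this sketch the remote response is simply dropped on the device; the dependent claims describe other ways of canceling, such as asking the service to cancel.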

The device of claim 1, wherein the streaming is associated with a communication identifier and the response indicates the communication identifier.

The device of claim 1, wherein the device function comprises a media control function.

The device of claim 1, wherein the operations further comprise stopping the streaming of the received audio in response to detecting the local command expression.

A method comprising:
Receiving audio including user speech;
Detecting a trigger expression in the user speech;
In response to detecting the trigger expression in the user speech,
Transmitting the received audio to a speech command service to recognize the speech in the received audio and perform a first function corresponding to the recognized speech, and
Analyzing the received audio to detect a local command expression, associated with a second function, following the trigger expression in the received audio; and
In response to detecting the local command expression following the trigger expression in the received audio,
Initiating the second function, and
Canceling execution of the first function.

The method of claim 5, wherein canceling execution of the first function comprises requesting the speech command service to cancel execution of the first function.

The method of claim 5, further comprising receiving, from the speech command service, a communication indicating pending execution of the first function, wherein canceling execution of the first function comprises requesting the speech command service to cancel the pending execution of the first function.

The method of claim 5, further comprising receiving, from the speech command service, a command corresponding to the first function, wherein canceling execution of the first function comprises refraining from executing the command received from the speech command service.

The method of claim 5, further comprising notifying the speech command service that the second function has been initiated.

The method of claim 5, further comprising:
Associating an identifier with the received audio; and
Receiving, from the speech command service, a response indicating the identifier and a command corresponding to the first function,
Wherein canceling execution of the first function comprises refraining from executing the command.
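
As a rough illustration of how an identifier lets the device match a late response to the utterance that produced it, the following Python sketch tags each utterance with an integer and discards responses whose identifier was marked when a local command expression was detected. The class `ResponseFilter` and its method names are hypothetical; the claim does not specify the identifier format or any bookkeeping.

```python
import itertools
from typing import Callable, Set


class ResponseFilter:
    """Hypothetical bookkeeping that correlates service responses with utterances."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)    # assumed identifier scheme: 1, 2, 3, ...
        self._suppressed: Set[int] = set()

    def begin_utterance(self) -> int:
        """Associate a fresh identifier with newly received audio."""
        return next(self._ids)

    def local_command_detected(self, identifier: int) -> None:
        """Mark the utterance so that its remote command will not be executed."""
        self._suppressed.add(identifier)

    def handle_response(self, identifier: int, command: Callable[[], None]) -> bool:
        """Run the command unless its utterance was already handled locally."""
        if identifier in self._suppressed:
            self._suppressed.discard(identifier)
            return False
        command()
        return True
```

Because each response carries its identifier, a response for an earlier, locally handled utterance can be dropped without affecting a later utterance.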

A system comprising:
One or more speech recognition components configured to recognize user speech in received audio, detect a trigger expression in the user speech, and detect a local command expression in the user speech; and
Control logic configured to perform operations in response to detection of the trigger expression in the user speech by the one or more speech recognition components,
The operations comprising:
Transmitting the audio to a speech command service to recognize speech in the audio and to perform a first function corresponding to the recognized speech; and
In response to detection of the local command expression in the user speech by the one or more speech recognition components, (a) identifying a second function corresponding to the local command expression and (b) canceling execution of at least one of the first and second functions.

The system of claim 11, wherein canceling execution of the at least one of the first and second functions comprises requesting the speech command service to cancel execution of the first function.

The system of claim 11, wherein canceling execution of the at least one of the first and second functions comprises ignoring a command received from the speech command service.

The system of claim 11, wherein the operations further comprise ceasing transmission of the audio in response to detecting the local command expression in the user speech.

The system of claim 11, wherein canceling execution of the at least one of the first and second functions comprises notifying the speech command service that the second function has been initiated.
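
The dependent system claims name several things the control logic may do once the local command expression is detected: ask the service to cancel, ignore the service's command, stop transmitting audio, or notify the service that the second function has started. The Python sketch below lays these options side by side. All names (`LocalAction`, `StubService`, `DeviceState`, `on_local_command`) and the message strings are hypothetical placeholders, not an interface defined by the specification.

```python
from enum import Enum, auto
from typing import List, Tuple


class LocalAction(Enum):
    REQUEST_CANCEL = auto()     # ask the service to cancel the first function
    IGNORE_RESPONSE = auto()    # ignore the command the service sends back
    STOP_TRANSMITTING = auto()  # cease sending audio to the service
    NOTIFY_STARTED = auto()     # tell the service the second function has started


class StubService:
    """Stand-in for the speech command service; records messages sent to it."""

    def __init__(self) -> None:
        self.messages: List[Tuple[str, str]] = []

    def send(self, kind: str, detail: str) -> None:
        self.messages.append((kind, detail))


class DeviceState:
    """Minimal device-side flags assumed for this illustration."""

    def __init__(self) -> None:
        self.transmitting = True
        self.ignore_remote = False


def on_local_command(action: LocalAction, service: StubService,
                     state: DeviceState, second_function: str) -> None:
    """Apply one of the options after a local command expression is detected."""
    if action is LocalAction.REQUEST_CANCEL:
        service.send("cancel", "first_function")
    elif action is LocalAction.IGNORE_RESPONSE:
        state.ignore_remote = True
    elif action is LocalAction.STOP_TRANSMITTING:
        state.transmitting = False
    elif action is LocalAction.NOTIFY_STARTED:
        service.send("started", second_function)
```

The options are not mutually exclusive; a device could, for example, stop transmitting and also notify the service.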