WO2012014275A1

WO2012014275A1 - Audio transmitting/receiving device, audio transmitting/receiving system and server device

Info

Publication number: WO2012014275A1
Application number: PCT/JP2010/062558
Authority: WO
Inventors: 隆真亀谷
Original assignee: Pioneer Corp
Current assignee: Pioneer Corp
Priority date: 2010-07-26
Filing date: 2010-07-26
Publication date: 2012-02-02
Anticipated expiration: 2013-01-26

Abstract

The audio transmitting/receiving device, audio transmitting/receiving system and server device provide optimal conversation quality suited to the situation of a user engaged in a conversation. The audio transmitting/receiving device (10A) comprises: transmitting means (100, 200) for transmitting and receiving conversation data over a network (20) to and from an audio transmitting/receiving device (10B) provided at a base other than the home base, among multiple bases, said conversation data representing the content of a conversation and including at least audio data; a conversation feature amount acquiring means (300) for acquiring conversation feature amounts for a home base user (A) located at the home base and another base user (B) located at the other base, respectively; a processing delay amount determining means (600) for determining, on the basis of the acquired conversation feature amounts, a total of the processing delay amounts in the entire audio transmitting/receiving system; and a processing delay amount controlling means (600) for controlling the processing delay amounts so as to satisfy the determined total.

Description

Audio transmitting/receiving device, audio transmitting/receiving system and server device

The present invention relates to the technical field of audio transmitting/receiving devices, audio transmitting/receiving systems and server devices that allow, for example, a conversation between users at remote locations to proceed smoothly.

As a device with this kind of purpose, Patent Document 1, for example, proposes a real-time audio reproducing device. According to Patent Document 1, this device can secure overall conversation quality by keeping audio loss within the maximum loss rate at which a conversation still holds and keeping audio delay within the maximum delay time at which a conversation still holds.

Patent Document 2 discloses a configuration with a buffer monitoring unit and a buffer control unit that adjust, during silent intervals, a buffer for absorbing delay jitter, based on, for example, the number of buffer underflows that occurred during voiced intervals.

Patent Document 1: JP 2002-223247 A
Patent Document 2: Japanese Patent Laid-Open No. 11-215812

In general, a longer delay time suppresses audio dropouts and errors more effectively and can therefore improve the audio quality of a conversation. On the other hand, with a long delay time, more time passes between the moment a user finishes speaking and the moment that user perceives the start of the other party's utterance, so the smooth progress of the conversation is easily hindered. The technical idea of Patent Document 1, keeping the delay time within the maximum delay time at which a conversation still holds, is therefore useful in that it can find a compromise between audio quality and smooth progress.

The maximum delay time at which a conversation holds, however, can vary in any number of ways depending on, for example, the personality or the physical or mental load of each participating user, or that user's individual circumstances at the moment. For example, the maximum delay time is short for an impatient user and long for an easygoing one, short for an irritated user and long for a relaxed one, and either longer or shorter for an unwell user depending on that user's condition. Separately, it naturally becomes shorter when a user is in a hurry for some reason.

Applying the device disclosed in Patent Document 1 requires setting in advance the maximum delay time at which a conversation holds, but fixing such a variable quantity beforehand, whether experimentally, empirically or theoretically, is extremely difficult in practice. The maximum delay time described in Patent Document 1 is therefore not necessarily the true maximum delay time at which a conversation holds. Even if the device of Patent Document 1 is applied to a conversation, comfort may be reduced for some users; conversely, the delay time may be too short, causing an unnecessary loss of audio quality. In other words, the device of Patent Document 1 has the technical problem that, lacking any means of adapting to the user's circumstances, it is difficult to optimize the delay time.

This problem arises just the same even when buffer control aimed at reducing the audio packet discard rate by absorbing delay jitter is possible, as in the device of Patent Document 2.

The present invention has been made in view of this problem, and its object is to provide an audio transmitting/receiving device, an audio transmitting/receiving system and a server device that can provide optimal conversation quality suited to the circumstances of the users participating in a conversation.

In order to solve the above problem, the audio transmitting/receiving device of claim 1 is an audio transmitting/receiving device in an audio transmitting/receiving system that includes audio transmitting/receiving devices installed at a plurality of bases, each accommodated in a network, and that can establish a conversation among users present at the respective bases via the audio transmitting/receiving devices, the device comprising: communication means for transmitting and receiving, via the network, conversation data that represents the content of the conversation and includes at least audio data, to and from the audio transmitting/receiving device installed at another base other than the home base among the plurality of bases; conversation feature amount acquiring means for acquiring, for each of a home base user present at the home base and an other base user present at the other base, a predetermined conversation feature amount related to the timing of utterances in the conversation; processing delay amount determining means for determining, on the basis of the acquired conversation feature amounts, the total amount, over the entire audio transmitting/receiving system, of a processing delay amount of the conversation data whose magnitude corresponds both to the magnitude of the time delay of the conversation and to the level of the audio quality of the conversation; and processing delay amount controlling means for controlling the processing delay amount so that the determined total amount is satisfied.

In order to solve the above problem, the audio transmitting/receiving system of claim 14 is an audio transmitting/receiving system that includes audio transmitting/receiving devices installed at a plurality of bases, each accommodated in a network, and that can establish a conversation among users present at the respective bases via the audio transmitting/receiving devices, wherein each audio transmitting/receiving device comprises: communication means for transmitting and receiving, via the network, conversation data that represents the content of the conversation and includes at least audio data, to and from the audio transmitting/receiving device installed at another base other than the home base among the plurality of bases; conversation feature amount acquiring means for acquiring, for each of a home base user present at the home base and an other base user present at the other base, a predetermined conversation feature amount related to the timing of utterances in the conversation; processing delay amount determining means for determining, on the basis of the acquired conversation feature amounts, the total amount, over the entire audio transmitting/receiving system, of a processing delay amount of the conversation data whose magnitude corresponds both to the magnitude of the time delay of the conversation and to the level of the audio quality of the conversation; and processing delay amount controlling means for controlling the processing delay amount so that the determined total amount is satisfied.

In order to solve the above problem, the server device of claim 15 is a server device in an audio transmitting/receiving system that includes the server device and audio transmitting/receiving devices, each accommodated in a network, the audio transmitting/receiving devices being installed at a plurality of bases and each comprising communication means for transmitting and receiving, via the network, conversation data that represents the content of a conversation and includes at least audio data, to and from the audio transmitting/receiving device installed at another base other than the home base among the plurality of bases, the system being capable of establishing a conversation among users present at the respective bases via the audio transmitting/receiving devices, the server device comprising: conversation feature amount acquiring means for acquiring, from the plurality of audio transmitting/receiving devices via the network, a predetermined conversation feature amount related to the timing of utterances in the conversation for each of a home base user present at the home base and an other base user present at the other base; processing delay amount determining means for determining, on the basis of the acquired conversation feature amounts, the total amount, over the entire audio transmitting/receiving system, of a processing delay amount of the conversation data whose magnitude corresponds both to the magnitude of the time delay of the conversation and to the level of the audio quality of the conversation; and notifying means for notifying the plurality of audio transmitting/receiving devices of the determined total amount via the network.

Brief Description of the Drawings

FIG. 1 is a schematic configuration diagram conceptually showing the configuration of a remote conference system according to a first example of the present invention.
FIG. 2 is a block diagram conceptually showing the configuration of an audio transmitting/receiving device in the remote conference system of FIG. 1.
FIG. 3 is a flowchart of buffer capacity control executed in the audio transmitting/receiving device of FIG. 2.
FIG. 4 is a timing chart explaining the concept of reaction time.
FIG. 5 is a flowchart of reaction time calculation processing executed in the audio transmitting/receiving device of FIG. 2.
FIG. 6 is a flowchart of reaction waiting time calculation processing executed in the audio transmitting/receiving device of FIG. 2.
FIG. 7 is a timing chart explaining the concept of a first calculation model.
FIG. 8 is a timing chart explaining the concept of a second calculation model.
FIG. 9 is a schematic configuration diagram conceptually showing the configuration of a remote conference system according to a second example of the present invention.

<Embodiment of the Audio Transmitting/Receiving Device>

An embodiment of the audio transmitting/receiving device of the present invention is an audio transmitting/receiving device in an audio transmitting/receiving system that includes audio transmitting/receiving devices installed at a plurality of bases, each accommodated in a network, and that can establish a conversation among users present at the respective bases via the audio transmitting/receiving devices, the device comprising: communication means for transmitting and receiving, via the network, conversation data that represents the content of the conversation and includes at least audio data, to and from the audio transmitting/receiving device installed at another base other than the home base among the plurality of bases; conversation feature amount acquiring means for acquiring, for each of a home base user present at the home base and an other base user present at the other base, a predetermined conversation feature amount related to the timing of utterances in the conversation; processing delay amount determining means for determining, on the basis of the acquired conversation feature amounts, the total amount, over the entire audio transmitting/receiving system, of a processing delay amount of the conversation data whose magnitude corresponds both to the magnitude of the time delay of the conversation and to the level of the audio quality of the conversation; and processing delay amount controlling means for controlling the processing delay amount so that the determined total amount is satisfied.

The audio transmitting/receiving device according to the embodiment is one of the audio transmitting/receiving devices that make up an audio transmitting/receiving system, that is, a system that can establish conversations between users present at a plurality of different bases. The device is configured to be accommodated in the network either at all times or only when some condition is satisfied (the condition is not limited in any way).

The "network" in the embodiment is a concept that encompasses various data communication networks, such as a WAN (Wide Area Network), a LAN (Local Area Network), or the Internet, connected as appropriate via such a WAN or LAN or via a telephone line, ADSL (Asymmetric Digital Subscriber Line), an optical fiber cable or the like.

The audio transmitting/receiving device according to the embodiment has communication means, through which it can transmit and receive conversation data via the network to and from an audio transmitting/receiving device installed at another base different from the home base (that is, the base of the other party to the conversation). Various information, data and data files, including the conversation data, may be exchanged with the device at the other base by way of an appropriate server device, or without a server device, for example by P2P (Peer to Peer). Accordingly, the components of the audio transmitting/receiving system may also take various forms.

The home base means the base at which the audio transmitting/receiving device according to the embodiment is installed; it does not refer restrictively to any one particular base.

"Conversation data" in the embodiment is a concept that encompasses data that is necessary for, or can lend meaning to, a conversation between users present at different bases, and is defined in particular as including at least audio data. Conversation data may additionally include, for example, image data or video data. "Conversation" means an act of communication between users that involves speech, and it occurs in a wide range of situations. For example, "conversation" in the embodiment suitably includes, besides everyday conversation, the utterances of individual participants in a meeting or conference.

Of this conversation data, the data transmitted to the audio transmitting/receiving device at the other base may be, for example, data obtained by encoding, with an encoder or the like, the analog voice of the home base user collected by sound collecting means such as a microphone. This data is received by the device at the other base (where it becomes received data), decoded by, for example, a decoder, and finally presented to the other base user as analog sound through output means such as a speaker. The same applies when the speaking side and the listening side are swapped. In the device according to the embodiment, repeating such transmission and reception of conversation data, for example, establishes a conversation between users present at different bases.

The audio transmitting/receiving device according to the embodiment comprises processing delay amount determining means for determining the total amount, over the entire audio transmitting/receiving system, of the processing delay amount of the conversation data (hereinafter referred to as the "total system processing delay amount" where appropriate), and processing delay amount controlling means for controlling the processing delay amount so that the determined total system processing delay amount is satisfied.

Here, the "processing delay amount of the conversation data" is the amount of delay that arises when the audio transmitting/receiving device at each of the bases processes conversation data. More specifically, it is a controllable delay whose magnitude corresponds both to the magnitude of the time delay of the conversation between the home base user and the other base user and to the level of the audio quality of that conversation. The total system processing delay amount is the sum of these processing delay amounts on the speaking side and the listening side. Thus, to obtain high audio quality with few dropouts and coding errors, for example, a larger total system processing delay amount is preferable.

The time delay of a conversation, meanwhile, affects conversation quality along a different dimension from audio quality. Specifically, when the time delay becomes too large, it becomes difficult for the users to share a common time axis, and the real-time character of the conversation deteriorates. This loss of real-time character in itself reduces comfort and causes the user discomfort. It also leads to a secondary loss of conversation quality through reduced smoothness, for instance when a user realizes only after starting to speak that the other party is already speaking, and it easily produces a kind of vicious circle.

Thus, in an audio transmitting/receiving system, the audio quality and the comfort of a conversation are in a trade-off relationship. It is therefore necessary to optimize the total system processing delay amount, which affects both.

The conversation data according to the embodiment may include image data or video data for identifying a user's facial expression, but transmitting and receiving such data is itself a high-load process, and outputting audio in synchronization with images or video increases the time delay of the conversation. If, on the other hand, the two are controlled independently, facial expression and audio fall out of synchronization, so the effect of suppressing the loss of comfort is limited. In other words, the problem described above is not of a kind that is resolved by also using images or video.

In the audio transmitting/receiving device according to the embodiment, therefore, the conversation feature amount acquiring means acquires a predetermined conversation feature amount for each of the home base user present at the home base and the other base user present at the other base. The processing delay amount determining means is configured to determine the total system processing delay amount on the basis of the acquired conversation feature amounts.

A conversation feature amount means a physical quantity, a control quantity, a normalized index value or the like related to the timing of utterances in a conversation. In one suitable form, utterance timing is broadly divided into the timing at which a user starts speaking after the other party has finished speaking, and the timing at which a user, having finished speaking and judged the other party unresponsive, starts speaking again. Both are important elements that define the basic rhythm of a conversation, and they can have a significant relationship with the optimal total system processing delay amount. For example, the optimal total system processing delay amount can be said to be small for a user whose utterance timing is early and large for a user whose utterance timing is late.
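The two kinds of utterance timing described above can be illustrated with a minimal Python sketch. It assumes a time-ordered log of utterances with start and end times in seconds. The `Utterance` type and the functions `response_gaps` and `retry_gaps` are hypothetical names introduced only for illustration, and the sketch ignores transmission and processing delay, which a measurement taken at one base would have to account for.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    speaker: str   # hypothetical user label, e.g. "A" or "B"
    start: float   # utterance start time in seconds
    end: float     # utterance end time in seconds


def response_gaps(log: List[Utterance], speaker: str) -> List[float]:
    """Gaps from the end of the other party's utterance to the start of
    `speaker`'s next utterance (assumes `log` is sorted by start time)."""
    gaps = []
    for prev, cur in zip(log, log[1:]):
        if cur.speaker == speaker and prev.speaker != speaker and cur.start >= prev.end:
            gaps.append(cur.start - prev.end)
    return gaps


def retry_gaps(log: List[Utterance], speaker: str) -> List[float]:
    """Gaps from the end of `speaker`'s own utterance to the start of the same
    speaker's next utterance when the other party did not speak in between."""
    gaps = []
    for prev, cur in zip(log, log[1:]):
        if cur.speaker == speaker and prev.speaker == speaker and cur.start >= prev.end:
            gaps.append(cur.start - prev.end)
    return gaps


log = [
    Utterance("A", 0.0, 2.0),
    Utterance("B", 2.5, 4.0),   # B answers 0.5 s after A stops
    Utterance("A", 4.25, 5.0),  # A answers 0.25 s after B stops
    Utterance("A", 7.0, 8.0),   # B stays silent, A speaks again after 2.0 s
]
print(response_gaps(log, "A"), retry_gaps(log, "A"))
```

Short gaps in either list would suggest a user with early utterance timing, for whom the text indicates a smaller total delay is appropriate.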

Utterance timing in conversation, however, varies enormously with each user's personality and preferences. Even for the same user, it can change in any number of ways with the user's physical or mental load or available time at the moment, or with other individual circumstances. The conversation feature amount acquiring means makes it possible to track such variable utterance timing continually and with high accuracy.

Once the total system processing delay amount has been determined in this way on the basis of conversation feature amounts that reflect the users' state, the processing delay amount controlling means controls the processing delay amount of the conversation data so that the determined total system processing delay amount is satisfied. The operation of the processing delay amount controlling means may also include the process of determining the processing delay amount that each base should bear.

Here, the total system processing delay amount according to the embodiment is the total processing delay amount to be secured by the system as a whole, made up of the speaking-side audio transmitting/receiving device and the listening-side audio transmitting/receiving device. In theory, as long as the total system processing delay amount does not change, the rhythm and tempo of the conversation do not change, so conversation quality in terms of comfort does not change. Viewed this way, as long as the total does not change, there can be a corresponding degree of freedom in how the processing delay amount is distributed between the speaking side and the listening side (that is, in the share of the processing delay that each bears). In such cases, the individual processing delay amounts are not uniquely determined by the total system processing delay amount.

To keep the total system processing delay amount at the determined value, however, cooperation with the audio transmitting/receiving device installed at the other base is indispensable. This cooperation may be achieved by notifying the device at the other base, via the network, of a request value that follows necessarily from the determined total system processing delay amount and the control value of the processing delay amount at the home base. Alternatively, it may be achieved by sharing the determined total system processing delay amount with the device at the other base and having the two agree on a distribution ratio in accordance with a predefined algorithm. For example, the processing delay amount at each base may be an equal share of the total system processing delay amount.
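The two coordination options mentioned above can be sketched as simple arithmetic. This is a minimal illustration only: the function names and the use of milliseconds are assumptions, not taken from the specification, and the network notification itself is left out.

```python
from typing import List


def request_value_for_other_base(total_ms: float, local_ms: float) -> float:
    """Delay the other base has to bear so that both shares add up to the
    determined total (hypothetical helper; units assumed to be ms)."""
    if not 0 <= local_ms <= total_ms:
        raise ValueError("local share must lie between 0 and the total")
    return total_ms - local_ms


def equal_shares(total_ms: float, n_bases: int = 2) -> List[float]:
    """Equal division of the total among the participating bases."""
    if n_bases < 1:
        raise ValueError("at least one base is required")
    return [total_ms / n_bases] * n_bases


# The home base controls its own delay to 80 ms out of a 200 ms total,
# so the request value sent to the other base is the remainder.
print(request_value_for_other_base(200.0, 80.0))
print(equal_shares(200.0))
```

In both cases the shares sum to the determined total, which is the condition the text requires for the conversation rhythm to stay unchanged.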

Thus, the audio transmitting/receiving device according to the embodiment can always balance the audio quality and the comfort of a conversation in accordance with the users' state at each moment. In other words, it provides optimal conversation quality.

By way of supplement, when the processing delay amount or the total system processing delay amount is determined in advance experimentally, empirically, theoretically or by simulation, the user's circumstances at the moment are in effect not taken into account at all. Some accommodation of basic user traits that can be estimated beforehand, such as personality, may be possible, but there is almost no adaptability to uncertain factors and disturbances. The total system processing delay amount therefore tends to deviate from the true optimal value, which varies from user to user and, even for the same user, from moment to moment. As a result, it happens with a frequency that is by no means low that the total system processing delay amount is held down needlessly even though it could be increased to improve audio quality, or, conversely, that it is too large, so that the tempo and rhythm of the conversation depart from those natural to the user and the user's comfort suffers.

It is also not unusual for the users participating in a conversation to differ in personality, preferences, and circumstances. Even in such a case, the voice transmitting/receiving apparatus of the embodiment sets the total system processing delay amount on the basis of the conversation feature amounts acquired for both the own-site user and the other-site user. The largest total system processing delay amount that causes discomfort to neither party can therefore be set, which is extremely useful in practice.

Note that "acquisition" by the conversation feature amount acquisition means signifies finally fixing a value as a reference value for control, and the process by which this is done is not limited in any way. That is, the conversation feature amount acquisition means may acquire the conversation feature amount from outside, for example via the network, or may acquire it through internal processing such as calculation, derivation, estimation, identification, or selection. The process of acquiring the own-site user's conversation feature amount may also differ from that of acquiring the other-site user's.

The conversation feature amount acquisition means preferably repeats the acquisition each time an utterance occurs, or at regular or irregular intervals. This is more effective because the latest circumstances of the user can be accommodated. Where statistical processing or the like based on previously obtained conversation feature amounts is applied, abrupt changes in the conversation feature amount are prevented and its estimation accuracy improves, which is also effective.

The conversation feature amount acquisition means, the processing delay amount determining means, and the processing delay amount control means provided in the voice transmitting/receiving apparatus of the embodiment may each, or as a whole, take various forms, such as a CPU (Central Processing Unit), an MPU (Micro Processing Unit), or another arithmetic processing unit, a processor, a controller, or a functional module.

In one aspect of the embodiment of the voice transmitting/receiving apparatus of the present invention, the conversation feature amount acquisition means acquires, as the conversation feature amount, at least one of: an own-site user reaction time, which is the time from the point at which output of the other-site user's utterance ends at the own site to the point at which the own-site user starts speaking; and an other-site user reaction time, which is the time from the point at which output of the own-site user's utterance ends at the other site to the point at which the other-site user starts speaking.

According to this aspect, the own-site user reaction time, the other-site user reaction time, or both are acquired as the conversation feature amount.

Here, the "reaction time" is the time from the point at which output of one party's utterance ends to the point at which the other party starts speaking, and it is a conversation feature amount that correlates with the aforementioned "timing of starting to speak after the other party's utterance ends". "Utterance output" means output through an output means such as a loudspeaker, so its start and end points fall later in time than the start and end points of the other party's actual act of speaking. The reaction time is thus each user's pure reaction time, free of the influence of the processing delay and the other delays described above, and it defines the basic tempo and rhythm of conversation for the individual user. It is therefore well suited as a reference value for determining the optimum total system processing delay amount.
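As an illustration of this definition, the reaction time can be computed from two locally observed timestamps. The sketch below assumes that both timestamps are taken on the same local clock, and that an utterance begun before playback ends (overlapping speech) yields no sample; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TurnEvent:
    # Time at which playback of the remote party's utterance ended at the
    # local loudspeaker (seconds, local clock).
    output_end_s: float
    # Time at which the local user started speaking (seconds, local clock).
    speech_start_s: float


def reaction_time(event):
    """Return the local user's reaction time in seconds, or None.

    Both timestamps are local, so network and processing delays do not
    enter the result. Overlapping speech is skipped (an assumption; the
    specification does not say how to treat it).
    """
    delta = event.speech_start_s - event.output_end_s
    return delta if delta >= 0 else None
```

Because the measurement starts at the end of the loudspeaker output rather than at the end of the remote party's actual speech, the result reflects only the listener's own response behaviour.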

Because the reaction time is a value specific to each user, it is natural for the own-site user reaction time and the other-site user reaction time to differ. Whether they are nearly equal or widely different, determining the total system processing delay amount on the basis of at least one of them, and more preferably both, makes it possible to provide optimal conversation quality in which at least the necessary level of comfort is secured.

In one aspect of the embodiment of the voice transmitting/receiving apparatus of the present invention in which the other-site user reaction time and the own-site user reaction time are acquired, the processing delay amount determining means determines the total amount within a range at or below the minimum of: an own-site first allowable processing delay amount, which increases or decreases as the acquired other-site user reaction time increases or decreases; and an other-site first allowable processing delay amount, which increases or decreases as the acquired own-site user reaction time increases or decreases.

According to this aspect, the total system processing delay amount is determined within a range at or below the minimum of the own-site first allowable processing delay amount and the other-site first allowable processing delay amount (in a two-party conversation, the smaller of the two). Conversation quality that causes discomfort to neither the own-site user nor the other-site user can therefore be provided.
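A minimal sketch of this bound follows. The specification states only that each first allowable processing delay amount grows and shrinks with the corresponding reaction time; the linear mapping and the `gain` parameter used here are assumptions made purely for illustration.

```python
def first_allowable_delay(peer_reaction_ms, gain=0.5):
    """Monotonic mapping from a reaction time to an allowable delay.

    The linear form and the default gain of 0.5 are hypothetical; any
    non-decreasing mapping would match the description.
    """
    return gain * peer_reaction_ms


def total_delay_upper_bound(own_reaction_ms, other_reaction_ms, gain=0.5):
    # The own-site allowance depends on the OTHER user's reaction time,
    # and the other-site allowance on the OWN user's reaction time.
    own_first = first_allowable_delay(other_reaction_ms, gain)
    other_first = first_allowable_delay(own_reaction_ms, gain)
    return min(own_first, other_first)
```

Taking the minimum means the total is capped by whichever allowance is tighter, so neither party is pushed past its limit.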

The calculation of the own-site and other-site first allowable processing delay amounts may be performed in any component of the voice transmitting/receiving system. That is, as long as the total system processing delay amount is ultimately determined within a range at or below their minimum, not all of this calculation need be performed by the voice transmitting/receiving apparatus of the embodiment. For example, the apparatus of the embodiment may merely acquire, via the network, the other-site user reaction time detected at the other site and calculate the own-site first allowable processing delay amount. In that case, the other site acquires the own-site user reaction time transmitted from the own site and calculates the other-site first allowable processing delay amount. Distributing the processing among the apparatuses in this way prevents the load from concentrating on a single voice transmitting/receiving apparatus.

In another aspect of the embodiment of the voice transmitting/receiving apparatus of the present invention in which the other-site user reaction time and the own-site user reaction time are acquired, the communication means transmits, via the network, own-site user reaction time data corresponding to the acquired own-site user reaction time to the voice transmitting/receiving apparatus installed at the other site.

According to this aspect, the acquired own-site user reaction time is transmitted to the other site as own-site user reaction time data. Of the processes for calculating the various reference values ultimately used to determine the total system processing delay amount, the part based on the own-site user reaction time can therefore be entrusted to the apparatus at the other site, and the processing burden can be distributed.

In another aspect of the embodiment of the voice transmitting/receiving apparatus of the present invention in which the other-site user reaction time and the own-site user reaction time are acquired, the conversation feature amount acquisition means further acquires, as the conversation feature amount, an own-site user reaction waiting time, which is the time from the point at which the own-site user finishes speaking at the own site to the point at which the own-site user starts speaking again.

According to this aspect, the own-site user reaction waiting time is acquired as the conversation feature amount.

Here, the "reaction waiting time" is the time from the point at which one party's utterance output ends to the point at which that same party starts speaking again, and it is a conversation feature amount that correlates with the aforementioned "timing of starting to speak again after judging that the other party has not responded once one's own utterance has ended". The reaction waiting time thus reflects each user's personality, preferences, and circumstances, and it defines the basic tempo and rhythm of conversation for the individual user. It is therefore well suited as a reference value for determining the optimum total system processing delay amount.

In this aspect, the total amount may be determined within a range at or below the minimum of: an own-site second allowable processing delay amount, which increases or decreases with the difference between the acquired own-site user reaction waiting time and the acquired other-site user reaction time; and an other-site second allowable processing delay amount, which increases or decreases with the difference between an other-site user reaction waiting time, acquired as the time from the point at which the other-site user finishes speaking at the other site to the point at which the other-site user starts speaking again, and the acquired own-site user reaction time.

According to this aspect, the total system processing delay amount is determined within a range at or below the minimum of the own-site second allowable processing delay amount and the other-site second allowable processing delay amount (in a two-party conversation, the smaller of the two). Conversation quality that causes discomfort to neither the own-site user nor the other-site user can therefore be provided.

In particular, the second allowable processing delay amount, which increases or decreases with the difference between the reaction waiting time and the reaction time, can be an extremely useful reference value in practice for avoiding the double-talk state that arises when one user, wrongly judging that the other user has not responded to his or her utterance, speaks again.
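The role of this difference can be illustrated with a short sketch. It assumes, purely for illustration, that the second allowable processing delay amount equals the difference itself, clamped at zero; the specification requires only that the allowance grow and shrink with the difference.

```python
def second_allowable_delay(waiting_ms, peer_reaction_ms):
    """Slack available before the speaker gives up waiting and speaks again.

    waiting_ms: the speaker's reaction waiting time.
    peer_reaction_ms: the listener's reaction time.
    Using the raw difference as the allowance is an assumption.
    """
    return max(0.0, waiting_ms - peer_reaction_ms)
```

For example, if a speaker typically waits 1500 ms before repeating and the listener typically responds after 600 ms, about 900 ms of added delay can be tolerated before the reply arrives too late and both parties talk at once.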

The calculation of the own-site and other-site second allowable processing delay amounts may be performed in any component of the voice transmitting/receiving system. That is, as long as the total system processing delay amount is ultimately determined within a range at or below their minimum, not all of this calculation need be performed by the voice transmitting/receiving apparatus of the embodiment. For example, the apparatus of the embodiment may merely acquire, via the network, the other-site user reaction time detected at the other site and calculate the own-site second allowable processing delay amount using the own-site user reaction waiting time acquired at the own site. In that case, the other site acquires the own-site user reaction time transmitted from the own site and calculates the other-site second allowable processing delay amount using the other-site user reaction waiting time acquired at the other site. Distributing the processing among the apparatuses in this way prevents the load from concentrating on a single voice transmitting/receiving apparatus.

Further, in this case, the processing delay amount determining means may determine the total amount within a range at or below the minimum of: the own-site first allowable processing delay amount, which increases or decreases as the acquired other-site user reaction time increases or decreases; the other-site first allowable processing delay amount, which increases or decreases as the acquired own-site user reaction time increases or decreases; the own-site second allowable processing delay amount; and the other-site second allowable processing delay amount.

According to this aspect, the total system processing delay amount is determined within a range at or below the minimum of the own-site first, other-site first, own-site second, and other-site second allowable processing delay amounts, each of which is calculated as a reference value reflecting the user's state. A decline in conversation quality caused by reduced comfort can therefore be suppressed more reliably.
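Combining the four reference values can be sketched as below. The function names, and the idea of a separately requested `desired_ms` that is capped by the limit, are hypothetical and serve only to show what "within a range at or below the minimum" implies.

```python
def total_delay_limit(own_first_ms, other_first_ms, own_second_ms, other_second_ms):
    """Upper limit for the total delay: the tightest of the four allowances."""
    candidates = (own_first_ms, other_first_ms, own_second_ms, other_second_ms)
    if any(c < 0 for c in candidates):
        raise ValueError("allowable delays must be non-negative")
    return min(candidates)


def choose_total_delay(limit_ms, desired_ms):
    # Any value at or below the limit is admissible; here the delay that
    # voice quality would call for (desired_ms, an assumption) is capped.
    return min(limit_ms, desired_ms)
```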

In another aspect of the embodiment of the voice transmitting/receiving apparatus of the present invention, the voice transmitting/receiving apparatus comprises a buffer that temporarily stores the conversation data before or after the transmission and reception, and the processing delay amount control means controls the buffer capacity of the buffer on the basis of the determined total amount.

A buffer is preferably provided in order to absorb the jitter that arises between data packets and to prevent the degradation in voice quality caused by dropouts, data loss, and the like. On the other hand, the size of the buffer capacity corresponds one-to-one to the size of the time delay of the conversation, and increasing it unconditionally is not acceptable from the standpoint of overall conversation quality. The buffer capacity is therefore an appropriate actual control target for the processing delay amount control means in controlling the processing delay amount of the embodiment.

When the buffer capacity is taken as the control target, there is some freedom in how the buffer capacities in the voice transmitting/receiving system are controlled on the basis of the determined total system processing delay amount. For example, where the buffer in each voice transmitting/receiving apparatus includes a reception buffer that temporarily stores received data after reception and a transmission buffer that temporarily stores transmission data before transmission, the processing delay amount control means can decide the distribution of buffer capacity among them relatively freely, provided that the processing delay amount corresponding to the sum of the capacities of the reception and transmission buffers at the own site and the reception and transmission buffers at the other site equals the total system processing delay amount.
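One way to picture this distribution is sketched below. The weight-based split, the buffer names, and the conversion from delay to bytes at a constant bit rate are all assumptions for illustration; the specification leaves the distribution ratio open.

```python
def allocate_buffers(total_delay_ms, weights):
    """Split the total delay across buffers in proportion to their weights.

    weights: mapping of hypothetical buffer names to relative weights.
    The shares add up to total_delay_ms (up to floating-point rounding).
    """
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive value")
    return {name: total_delay_ms * w / weight_sum for name, w in weights.items()}


def delay_to_bytes(delay_ms, bitrate_bps):
    """Buffer capacity in bytes holding delay_ms of audio at a constant bit rate."""
    return round(delay_ms * bitrate_bps / 8000)
```

For instance, splitting 200 ms equally over the reception and transmission buffers at both sites gives 50 ms per buffer, which at 64 kbit/s corresponds to 400 bytes each.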

In another aspect of the embodiment of the voice transmitting/receiving apparatus of the present invention, the processing delay amount control means controls, on the basis of the determined total amount, the coding rate used in coding the conversation data.

To establish a conversation between users at remote locations, a coding process such as encoding is typically required.

Here, the coding rate, meaning for example the encoding bit rate used when encoding the input voice at the own site, corresponds to voice quality and also to the time delay: a higher rate gives higher voice quality and a larger time delay, and a lower rate gives lower voice quality and a smaller time delay.

A coding rate of this kind is therefore an appropriate actual control target for the processing delay amount control means in controlling the processing delay amount of the embodiment.
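As a sketch of how a delay budget might drive the coding rate, the example below selects the highest rate whose delay fits the budget. The rate table is entirely hypothetical: the specification states only that a higher coding rate corresponds to higher voice quality and a larger delay, and gives no figures.

```python
# Hypothetical (bit rate in bit/s, processing delay in ms) pairs, sorted
# by rate. These numbers are illustrative and not taken from the source.
RATE_TABLE = [(16_000, 20.0), (32_000, 40.0), (64_000, 80.0)]


def pick_coding_rate(delay_budget_ms):
    """Return the highest coding rate whose delay fits within the budget.

    Falls back to the lowest rate when nothing fits (an assumption).
    """
    fitting = [rate for rate, delay in RATE_TABLE if delay <= delay_budget_ms]
    return max(fitting) if fitting else RATE_TABLE[0][0]
```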

In another aspect of the embodiment of the voice transmitting/receiving apparatus of the present invention, the apparatus further comprises transmission state acquisition means for acquiring the transmission state of the network, and the processing delay amount determining means determines the total amount on the basis of the acquired conversation feature amount and the acquired transmission state.

According to this aspect, the transmission state of the network acquired by the transmission state acquisition means is reflected in the determination of the total system processing delay amount, so that the total system processing delay amount can be determined more accurately.

In this aspect, the transmission state acquisition means may acquire the transmission delay amount of the network as the transmission state of the network.

The transmission delay amount of the network defines the time delay of the conversation in the same dimension as the processing delay amount. It is therefore well suited as the transmission state to be reflected in determining the total system processing delay amount. The transmission delay amount of such a network preferably refers to the RTT (Round Trip Time).
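Because network delay and processing delay share the same dimension, one simple way to reflect the RTT is to deduct it from the delay the users can tolerate. The sketch below rests on that assumption; the specification does not prescribe how the transmission state enters the determination, and the function name is hypothetical.

```python
def processing_delay_budget(allowable_total_ms, rtt_ms):
    """Delay left for processing once the network round trip is deducted.

    Assumes the allowance derived from the conversation feature amounts
    covers one exchange (utterance out, reply back), so the full RTT is
    subtracted. Clamped at zero when the network alone uses up the allowance.
    """
    return max(0.0, allowable_total_ms - rtt_ms)
```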

In another aspect of the embodiment of the voice transmitting/receiving apparatus of the present invention, the apparatus further comprises statistical processing means for statistically processing the acquired conversation feature amount, and the processing delay amount determining means determines the total amount on the basis of the statistically processed conversation feature amount.

According to this aspect, the statistical processing means applies statistical processing to the acquired conversation feature amount. That is, conversation feature amounts acquired over a fixed or variable past period are reflected in the conversation feature amount used to determine the total system processing delay amount. The reliability of the conversation feature amount can thus be improved, and conversation quality can be maintained stably.

The practical form of the statistical processing is not particularly limited; one suitable form is to take the arithmetic mean of the conversation feature amounts acquired over a fixed past period. In doing so, measures such as excluding obviously abnormal values from the samples may be taken.
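The averaging with outlier exclusion mentioned above might look like the following sketch. The median-based rejection rule and the `max_ratio` threshold are assumptions; the specification does not define what counts as an obviously abnormal value.

```python
from statistics import mean, median


def smoothed_feature(samples, max_ratio=3.0):
    """Arithmetic mean of recent samples after dropping obvious outliers.

    A sample is treated as abnormal when it exceeds max_ratio times the
    median of the window (a hypothetical rule). Samples are assumed to be
    non-negative durations, so at least one sample always survives.
    """
    if not samples:
        raise ValueError("no samples")
    limit = median(samples) * max_ratio
    kept = [s for s in samples if s <= limit]
    return mean(kept)
```

Given reaction times of 400, 500, 600, and 9000 ms, the 9000 ms sample (for example, a user who stepped away) is discarded and the smoothed value is 500 ms.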

In another aspect of the embodiment of the voice transmitting/receiving apparatus of the present invention, the apparatus further comprises storage means for storing the acquired conversation feature amount.

According to this aspect, the acquired conversation feature amount is stored, in a volatile or nonvolatile manner, in a storage device that may take various forms, such as an HDD (Hard Disk Drive), a flash memory, an FDD (Floppy (registered trademark) Disc Drive), a DVD, or a BDD (Blu-ray Disc Drive), so that the total system processing delay amount can be determined smoothly. In addition, the initial value of the total system processing delay amount for the next conversation of the same kind can be determined on the basis of the stored conversation feature amount, which also shortens the time taken for the total system processing delay amount to converge on its optimum.

An embodiment of the voice transmitting/receiving system of the present invention is a voice transmitting/receiving system that includes voice transmitting/receiving apparatuses installed at a plurality of sites, each accommodated in a network, and that can establish a conversation, via the voice transmitting/receiving apparatuses, between users present at the respective sites. Each voice transmitting/receiving apparatus comprises: communication means for transmitting and receiving, via the network, conversation data that represents the content of the conversation and includes at least voice data, to and from the voice transmitting/receiving apparatus installed at another site, other than the own site, among the plurality of sites; conversation feature amount acquisition means for acquiring, for each of an own-site user present at the own site and an other-site user present at the other site, a predetermined conversation feature amount related to the timing of utterances in the conversation; processing delay amount determining means for determining, on the basis of the acquired conversation feature amounts, the total amount, over the entire voice transmitting/receiving system, of the processing delay amount of the conversation data, the magnitude of which corresponds both to the magnitude of the time delay of the conversation and to the level of the voice quality of the conversation; and processing delay amount control means for controlling the processing delay amount so that the determined total amount is satisfied.

Because the embodiment of the voice transmitting/receiving system includes the voice transmitting/receiving apparatus of the embodiment described above, optimal conversation quality can be obtained.

An embodiment of the server device of the present invention is the server device in a voice transmitting/receiving system that includes, each accommodated in a network, the server device and voice transmitting/receiving apparatuses installed at a plurality of sites, each apparatus comprising communication means for transmitting and receiving, via the network, conversation data that represents the content of a conversation and includes at least voice data, to and from the voice transmitting/receiving apparatus installed at another site, other than the own site, among the plurality of sites, the system being able to establish a conversation, via the voice transmitting/receiving apparatuses, between users present at the respective sites. The server device comprises: conversation feature amount acquisition means for acquiring from the plurality of voice transmitting/receiving apparatuses, via the network, a predetermined conversation feature amount related to the timing of utterances in the conversation for each of an own-site user present at the own site and an other-site user present at the other site; processing delay amount determining means for determining, on the basis of the acquired conversation feature amounts, the total amount, over the entire voice transmitting/receiving system, of the processing delay amount of the conversation data, the magnitude of which corresponds both to the magnitude of the time delay of the conversation and to the level of the voice quality of the conversation; and notification means for notifying the plurality of voice transmitting/receiving apparatuses of the determined total amount via the network.

Because the embodiment of the server device includes the conversation feature amount acquisition means and the processing delay amount determining means of the embodiment of the voice transmitting/receiving apparatus described above, optimal conversation quality can be obtained. Moreover, having the server device carry out the acquisition of the conversation feature amounts and the determination of the total system processing delay amount markedly reduces the burden on the voice transmitting/receiving apparatuses that make up the voice transmitting/receiving system, which is very useful in practice. In this case, in each voice transmitting/receiving apparatus, means similar to the processing delay amount control means, for example, need only control the processing delay amount so that the optimum total system processing delay amount announced by the notification means of the server device is satisfied. Alternatively, the server device may execute appropriate commands to exercise higher-level control over the voice transmitting/receiving apparatuses. Where the server device can also specify how the processing delay amount is distributed among the voice transmitting/receiving apparatuses on the basis of the total system processing delay amount, the burden on the voice transmitting/receiving apparatuses can be reduced still further.
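The server-side flow (collect the conversation feature amounts, determine the total, notify every apparatus) can be sketched as follows. The class, its method names, the callback-based notification, and the linear `gain` mapping are all hypothetical; a real system would carry these messages over the network.

```python
class DelayServer:
    """Toy model of the server device; not the patented implementation."""

    def __init__(self, gain=0.5):
        self.gain = gain       # hypothetical reaction-time-to-delay factor
        self.reaction_ms = {}  # latest reported reaction time per site
        self.listeners = {}    # notification callback per site

    def register(self, site, notify):
        self.listeners[site] = notify

    def report(self, site, reaction_ms):
        """Store a site's reaction time; notify once all sites have reported."""
        self.reaction_ms[site] = reaction_ms
        if set(self.listeners) <= set(self.reaction_ms):
            # The smallest reaction time gives the tightest allowance.
            total = self.gain * min(self.reaction_ms.values())
            for notify in self.listeners.values():
                notify(total)
```

In this sketch each apparatus only reports its measurement and receives the determined total, which mirrors the reduced per-apparatus burden described above.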

As described above, the embodiment of the voice transmitting/receiving apparatus of the present invention includes the communication means, the conversation feature amount acquisition means, the processing delay amount determining means, and the processing delay amount control means, so that optimal conversation quality can be obtained.

As described above, the embodiment of the voice transmitting/receiving system of the present invention includes the embodiment of the voice transmitting/receiving apparatus of the present invention, so that optimal conversation quality can be obtained.

As described above, the embodiment of the server apparatus of the present invention includes the conversation feature amount acquisition means, the processing delay amount determination means, and the notification means, and can therefore obtain optimum conversation quality.

These operations and other advantages of the present invention will become apparent from the embodiments described below.

Hereinafter, various preferred embodiments of the present invention will be described with reference to the drawings as appropriate.
<First embodiment>
<Configuration of embodiment>
First, the configuration of the remote conference system 1 according to the first embodiment of the present invention will be described with reference to FIG. 1. FIG. 1 is a schematic configuration diagram conceptually showing the configuration of the remote conference system 1.

In FIG. 1, the remote conference system 1 is a voice conference system, and an example of the "voice transmitting/receiving system" according to the present invention. It is accommodated in a wide area network 20 (an IP (Internet Protocol) network) that connects a site X (an example of the "own site" according to the present invention) and a site Y (an example of the "other site" according to the present invention), which are located apart from each other.

The remote conference system 1 consists of a voice transmitting/receiving apparatus 10A (an example of the "voice transmitting/receiving apparatus" according to the present invention), which is installed at site X and used by a user A present at site X (an example of the "own site user" according to the present invention), and a voice transmitting/receiving apparatus 10B (another example of the "voice transmitting/receiving apparatus" according to the present invention), which is installed at site Y and used by a user B present at site Y (an example of the "other site user" according to the present invention). Through the remote conference system 1, user A and user B can hold a smooth voice conference by exchanging voice information.

Next, the configuration of one of the voice transmitting/receiving apparatuses constituting the remote conference system 1 will be described, together with its operation, with reference to FIG. 2. FIG. 2 is a block diagram conceptually showing the configuration of the voice transmitting/receiving apparatus 10A. In FIG. 2, parts that also appear in FIG. 1 are given the same reference numerals, and their description is omitted as appropriate.

In the remote conference system 1, the hardware configuration of the voice transmitting/receiving apparatus 10A is assumed to be equivalent to that of the voice transmitting/receiving apparatus 10B.

In FIG. 2, the voice transmitting/receiving apparatus 10A includes a voice input unit 100, a voice output unit 200, a conversation feature amount detection unit 300, a conversation feature amount statistical processing unit 400, a storage device 500, a processing delay amount determination unit 600, an RTT measurement unit 700, a processing delay information communication unit 800, and a buffer control unit 900.

The voice input unit 100 includes a voice input unit 110, an encoder 120, a transmission buffer 130, and a voice data transmission unit 140, and is a unit capable of transmitting the speech of user A as voice data to the voice transmitting/receiving apparatus 10B via the network 20.

The voice input unit 110 is an input interface whose input terminal (reference numeral omitted) is connected to a microphone (not shown), and is configured to capture the speech of user A input through the microphone as an analog voice signal.

The encoder 120 is a digital conversion device that encodes the analog voice signal input via the voice input unit 110 at a predetermined encoding rate (encoding bit rate) and converts it into digital voice data. The encoder 120 may encode the analog voice data in accordance with any of various known standards, for example a standard such as MPEG (Moving Picture Experts Group).

The transmission buffer 130 is a volatile storage device that temporarily accumulates the digital voice data obtained through the encoder 120, up to a data amount corresponding to a predetermined transmission buffer capacity Da_s.

The voice data transmission unit 140 is a transmission interface whose output terminal (reference numeral omitted) is connected to the network 20, and is configured to transmit the digital voice data sequentially output from the transmission buffer 130 to the voice transmitting/receiving apparatus 10B via the network 20. That is, the voice data transmission unit 140 is an example of the "communication unit" according to the present invention, and the transmitted digital voice data is an example of the "conversation data" according to the present invention.

The voice output unit 200 includes a voice data reception unit 210, a reception buffer 220, a decoder 230, and a voice output unit 240, and is a unit capable of outputting the speech of user B through an output device such as a speaker.

The voice data reception unit 210 is a reception interface whose output terminal (reference numeral omitted) is connected to the network 20, and is configured to sequentially capture the digital voice data corresponding to the speech of user B transmitted via the network 20. That is, the voice data reception unit 210 is another example of the "communication unit" according to the present invention, and the received digital voice data is another example of the "conversation data" according to the present invention.

The reception buffer 220 is a volatile storage device that temporarily accumulates the digital voice data obtained through the voice data reception unit 210, up to a data amount corresponding to a predetermined reception buffer capacity Da_r.

The decoder 230 is an analog conversion device that decodes the digital voice data sequentially output from the reception buffer 220 and converts it into analog voice data. The function of the decoder 230 is the counterpart of the encoding function of the encoder 120, and the two are naturally configured to perform data conversion in accordance with the same standard.

The voice output unit 240 is an output interface whose output terminal (reference numeral omitted) is connected to a speaker (not shown), and is configured to output the speech of user B through the speaker.

The conversation feature amount detection unit 300 is an example of the "conversation feature amount acquisition means" according to the present invention, and is configured to detect, as the conversation feature amounts of user A, a user A reaction time R_A and a user A reaction waiting time T_A, both described later.

The conversation feature amount statistical processing unit 400 is an example of the "statistical processing means" according to the present invention, and is configured to statistically process the conversation feature amounts detected as appropriate by the conversation feature amount detection unit 300. The conversation feature amount statistical processing unit 400 holds the detected conversation feature amounts for a fixed number of past samples, and outputs the arithmetic mean of the held sample values.
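As an illustration only, the averaging over a fixed number of past samples can be sketched in Python as follows. The class name FeatureAverager, the window size, and the sample values are hypothetical; the specification states only that a fixed number of past samples is held and that their arithmetic mean is output.

```python
from collections import deque


class FeatureAverager:
    """Keeps the most recent `window` samples and reports their arithmetic mean."""

    def __init__(self, window: int = 8) -> None:
        if window < 1:
            raise ValueError("window must be at least 1")
        # A deque with maxlen drops the oldest sample once the window is full.
        self._samples = deque(maxlen=window)

    def add(self, value: float) -> float:
        """Store a newly detected sample and return the updated mean."""
        self._samples.append(value)
        return sum(self._samples) / len(self._samples)


# Hypothetical reaction-time samples in seconds, averaged over the last 3.
averager = FeatureAverager(window=3)
for reaction_time in (0.5, 0.75, 1.0, 1.25):
    mean_reaction_time = averager.add(reaction_time)
```

After the loop, only the last three samples remain in the window, so mean_reaction_time reflects 0.75, 1.0, and 1.25. Other statistics could be substituted for the mean without changing the interface.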

The storage device 500 is a nonvolatile storage device such as an HDD or a flash memory, and is an example of the "storage means" according to the present invention. The storage area of the storage device 500 stores reaction time data DAT_R_A, which represents the average value of the user A reaction time R_A output from the conversation feature amount statistical processing unit 400, and reaction waiting time data DAT_T_A, which represents the average value of the user A reaction waiting time T_A.

The processing delay amount determination unit 600 is a control device that includes an allowable processing delay estimation unit 610 and a negotiation unit 620 and is configured to determine the buffer capacity Da_s of the transmission buffer 130 and the buffer capacity Da_r of the reception buffer 220. It is an example of the "processing delay amount determination means" according to the present invention. The processing delay amount determination unit 600 executes the buffer capacity control described later in accordance with a control program stored in a ROM (Read Only Memory).

The allowable processing delay estimation unit 610 is a processor that executes the various processes, described later, for determining the maximum allowable processing delay amount dmax, that is, the largest processing delay amount permitted for the remote conference system 1 (an example of the "total amount, over the entire voice transmitting/receiving system, of the processing delay amount of the conversation data" according to the present invention). The allowable processing delay estimation unit 610 has a first calculation model and a second calculation model, both constructed theoretically in advance, and executes these processes on the basis of the two models.

The negotiation unit 620 is a processor that negotiates with the voice transmitting/receiving apparatus 10B at site Y over the distribution of buffer capacity, which is an example of the "processing delay amount" according to the present invention. Through this negotiation with the voice transmitting/receiving apparatus 10B, the negotiation unit 620 also determines the control target values of the transmission buffer capacity Da_s and the reception buffer capacity Da_r.

The RTT measurement unit 700 is a processor configured to measure the RTT, which is the transmission delay amount of the network 20. The RTT measurement unit 700 measures the RTT using the SR (Sender Report) and RR (Receiver Report) of RTCP (Real-time Transport Control Protocol).

The RTT defined here is the sum of two times: the time from when digital voice data is transmitted via the voice data transmission unit 140 of the voice transmitting/receiving apparatus 10A until that data is received via the voice data reception unit 210 of the voice transmitting/receiving apparatus 10B, and the time from when digital voice data is transmitted via the voice data transmission unit 140 of the voice transmitting/receiving apparatus 10B until that data is received via the voice data reception unit 210 of the voice transmitting/receiving apparatus 10A. In ordinary conversation, such as when user A and user B talk face to face at the same site, no time delay corresponding to this RTT exists. In the remote conference system 1, the RTT is therefore an index value that defines the time delay of conversation between users of the system.
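The specification does not state how the RTT measurement unit 700 derives the RTT from the RTCP reports. One common approach, described in RFC 3550, is for the sender to take the arrival time A of a receiver report and subtract the "last SR" timestamp (LSR) and the "delay since last SR" (DLSR) carried in that report, all expressed in units of 1/65536 second. The following Python sketch assumes that approach; the function name and the sample values are hypothetical.

```python
def rtcp_round_trip_seconds(arrival: int, lsr: int, dlsr: int) -> float:
    """Round-trip time A - LSR - DLSR, with all inputs in 1/65536-second units.

    `arrival` and `lsr` are the middle 32 bits of NTP timestamps, so the
    subtraction is done modulo 2**32 to tolerate timestamp wrap-around.
    """
    rtt_units = (arrival - lsr - dlsr) & 0xFFFFFFFF
    return rtt_units / 65536.0


# Illustrative values: arrival at 46864.500 s, last SR sent at 46853.125 s,
# and the receiver held the report for 5.250 s before replying.
rtt = rtcp_round_trip_seconds(0xB7108000, 0xB7052000, 0x00054000)
```

Because the receiver's holding time DLSR is subtracted, the result covers only the two network legs, which matches the sum of the two one-way times defined above.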

The processing delay information communication unit 800 is a communication interface whose output terminal is connected to the network 20, and which exchanges with the voice transmitting/receiving apparatus 10B the various data needed to establish the buffer capacity target values described above. The processing delay information communication unit 800 transmits to the voice transmitting/receiving apparatus 10B the user A reaction time R_A and the processing delay amount d set on the basis of each calculation model, and acquires from the voice transmitting/receiving apparatus 10B the user B reaction time R_B and the processing delay amount d set on the basis of each calculation model.

The buffer control unit 900 is a processor configured to variably control the buffer capacities of the transmission buffer 130 and the reception buffer 220, and is an example of the "processing delay amount control means" according to the present invention.
<Operation of embodiment>
Next, as the operation of this embodiment, the buffer capacity control executed by the processing delay amount determination unit 600 will be described in detail with reference to FIG. 3. FIG. 3 is a flowchart of the buffer capacity control. In the voice transmitting/receiving apparatus 10A, the transmission and reception of conversation data by the voice input unit 100 and the voice output unit 200 described above are executed as appropriate, separately from this buffer capacity control.

In FIG. 3, the buffer capacity control begins with the transmission of the user A reaction time R_A and the reception of the user B reaction time R_B (step S101). The user A reaction time R_A and the user B reaction time R_B are each an example of the "conversation feature amount" according to the present invention; the former is an example of the "own site user reaction time" and the latter an example of the "other site user reaction time".

In step S101, the reaction time data DAT_R_A is read from the storage device 500 and transmitted by the processing delay information communication unit 800 to the voice transmitting/receiving apparatus 10B via the network 20. Meanwhile, the processing delay information communication unit 800 acquires the reaction time data DAT_R_B corresponding to the user B reaction time R_B from the voice transmitting/receiving apparatus 10B via the network 20, and sends it to the allowable processing delay estimation unit 610.

The reaction time will now be described with reference to FIG. 4. FIG. 4 is a timing chart explaining the concept of the reaction time.

In FIG. 4, assume that user A and user B are conversing in an ideal environment in which no significant time delay occurs in the transmission of each other's voice. Such an ideal environment is, for example, one in which the conversation takes place face to face at the same site.

Suppose that user A finishes speaking at an arbitrary time T0 (see the hatched portion for user A). User B perceives the end of user A's utterance at time T0 and starts speaking at time T1, after a delay time specific to user B has elapsed from this utterance end time (T0) (see the hatched portion for user B). The delay time from the end of one party's utterance to the start of the other party's utterance is the reaction time. The reaction time varies widely with the user's personality, preferences, mental load, physical load, and various other circumstances that can change from case to case.

The user A reaction time R_A is the reaction time of user A at site X, and the user B reaction time R_B is the reaction time of user B at site Y. In a conversation between remote locations, as in the remote conference system 1, the end of one party's utterance means the end of the voice output through a voice output means such as a speaker (an example of the "speech output" according to the present invention). This is because, even after one party's act of speaking has ended, a conversation is not established unless the content of the utterance is recognized by the other party.

In each voice transmitting/receiving apparatus, the reaction time is calculated by a reaction time calculation process executed by the conversation feature amount detection unit 300. The reaction time calculation process will now be described with reference to FIG. 5. FIG. 5 is a flowchart of the reaction time calculation process. The process in FIG. 5 is assumed to be executed at site X.

In FIG. 5, it is first determined whether user B's voice output is present (step S201). If user B's voice output is absent (step S201: NO), the process returns to step S201 and the check is repeated. If user B's voice output is present (step S201: YES), the time at that moment is stored as the updated final voice output time Top (step S202).

When the final voice output time Top has been updated, it is determined whether user B's voice output is absent (step S203). If user B's voice output is continuing (step S203: NO), the process returns to step S202 and the final voice output time Top continues to be updated. If user B's voice output is absent (step S203: YES), it is determined whether user A's voice input is present (step S204).

Because user B does not necessarily speak continuously, the determination that user B's voice output is absent is made on the basis of whether the length of the silent interval has exceeded a preset reference value, so that a pause in the middle of a series of utterances is not mistaken for the end of the utterance.

If user A's voice input is absent (step S204: NO), that is, while user A is presumed to be thinking about a reply after user B's speech output has ended, step S203 is executed repeatedly. When user A's voice input starts, that is, when user A starts speaking (step S204: YES), the time value corresponding to the difference between the time T at that moment and the final voice output time Top updated in step S202 is determined as the user A reaction time R_A (step S205). Once the user A reaction time R_A has been calculated, the process returns to step S201 and the series of steps is repeated. The reaction time calculation process is executed as described above.
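As an illustration only, steps S201 to S205 can be sketched in Python as a small state machine over time-stamped activity flags. The frame representation, the function name, and the sample data are hypothetical, and the flags are assumed to be already smoothed by the silent-interval threshold mentioned above.

```python
from typing import Iterable, Iterator, Tuple

# (time in seconds, user B voice output active, user A voice input active)
Frame = Tuple[float, bool, bool]


def reaction_times(frames: Iterable[Frame]) -> Iterator[float]:
    """Yield the user A reaction time R_A for each turn change from B to A."""
    top = None  # final voice output time Top; None until B's output is seen
    for t, b_output, a_input in frames:
        if b_output:
            # S201: YES / S203: NO -> keep updating Top while B is being output.
            top = t
        elif top is not None and a_input:
            # S203: YES and S204: YES -> R_A = T - Top (S205).
            yield t - top
            top = None  # return to S201 and wait for B's next output
        # Otherwise B is silent and A has not started yet: keep waiting.


# Hypothetical trace: B is output until 0.5 s, A starts speaking at 1.5 s.
frames = [
    (0.0, True, False),
    (0.5, True, False),
    (1.0, False, False),
    (1.5, False, True),
    (2.0, False, True),
]
measured = list(reaction_times(frames))
```

In this trace, Top is last updated at 0.5 s and user A starts at 1.5 s, so a single reaction time is produced. Each value yielded here would be passed on for statistical processing.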

The user B reaction time R_B is calculated in the same way in the voice transmitting/receiving apparatus 10B.

Each time the user A reaction time R_A is calculated by the reaction time calculation process, it is sent to the conversation feature amount statistical processing unit 400 and subjected to statistical processing. As described earlier, the statistical processing is an arithmetic averaging over a predetermined number of past samples. The form of the statistical processing is not limited to arithmetic averaging.

The user A reaction time R_A that has undergone statistical processing by the conversation feature amount statistical processing unit 400 is stored in the storage device 500 as the reaction time data DAT_R_A. The reaction time data DAT_R_A is updated each time the user A reaction time R_A is calculated and the statistical processing is executed.

Returning to FIG. 3, when the transmission and reception of the reaction time data are complete, the user A reaction waiting time T_A is acquired (step S102).

The reaction waiting time is the time from when one user finishes speaking until that same user starts speaking again. The reaction waiting time varies widely with the user's personality, preferences, mental load, physical load, and various other circumstances that can change from case to case.

In each voice transmitting/receiving apparatus, the reaction waiting time is calculated by a reaction waiting time calculation process executed by the conversation feature amount detection unit 300. The reaction waiting time calculation process will now be described with reference to FIG. 6. FIG. 6 is a flowchart of the reaction waiting time calculation process. The process in FIG. 6 is assumed to be executed at site X.

In FIG. 6, it is first determined whether user A's voice input is present (step S301). If user A's voice input is absent (step S301: NO), the process returns to step S301 and the check is repeated. If user A's voice input is present (step S301: YES), the time at that moment is stored as the updated final voice input time Tip (step S302).

When the final voice input time Tip has been updated, it is determined whether user B's voice output is absent (step S303). If user B's voice output is present (step S303: NO), user B's response is regarded as having come back in a time shorter than the user A reaction waiting time T_A, and the process returns to step S301.

If user B's voice output is absent (step S303: YES), it is determined whether user A's voice input is present (step S304). If user A's voice input is absent (step S304: NO), the process returns to step S303. That is, the process waits until user A's voice input resumes.

If user A's voice input is present (step S304: YES), it is determined whether the time value corresponding to the difference between the current time T and the final voice input time Tip updated in step S302 is greater than a reference value T0 (step S305). User A does not necessarily speak continuously, so speech may be briefly interrupted even in the middle of a series of utterances. The reference value T0 is the criterion for determining whether user A's voice input belongs to such a continuing series of utterances.

That is, if the time value corresponding to T - Tip is equal to or less than the reference value T0 (step S305: NO), the process returns to step S301. If the time value corresponding to T - Tip is greater than the reference value T0 (step S305: YES), that time value is determined as the user A reaction waiting time T_A (step S306). The reaction waiting time calculation process is executed as described above.
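As an illustration only, steps S301 to S306 can be sketched in Python in the same style as a loop over time-stamped activity flags. The frame representation, the function name, the reference value, and the sample data are hypothetical. The sketch also assumes that the process returns to step S301 after step S306, which the description does not state explicitly.

```python
from typing import Iterable, Iterator, Tuple

# (time in seconds, user B voice output active, user A voice input active)
Frame = Tuple[float, bool, bool]


def reaction_waiting_times(frames: Iterable[Frame],
                           reference_t0: float) -> Iterator[float]:
    """Yield the user A reaction waiting time T_A whenever A resumes unanswered."""
    tip = None  # final voice input time Tip; None while waiting at S301
    for t, b_output, a_input in frames:
        if tip is not None and not b_output and a_input:
            # S303: YES and S304: YES -> compare T - Tip with T0 (S305).
            gap = t - tip
            if gap > reference_t0:
                yield gap  # S306
        # Back at S301/S302: a frame with A's input always refreshes Tip.
        if a_input:
            tip = t
        elif b_output:
            # S303: NO -> B answered first, so this wait is discarded.
            tip = None


# Hypothetical trace: A stops at 0.5 s, B stays silent, A resumes at 2.5 s.
unanswered = [(0.0, False, True), (0.5, False, True), (1.0, False, False),
              (1.5, False, False), (2.0, False, False), (2.5, False, True)]
# Same trace, except that B answers at 1.5 s.
answered = [(0.0, False, True), (0.5, False, True), (1.0, False, False),
            (1.5, True, False), (2.0, False, False), (2.5, False, True)]

waits_unanswered = list(reaction_waiting_times(unanswered, reference_t0=1.0))
waits_answered = list(reaction_waiting_times(answered, reference_t0=1.0))
```

In the first trace the 0.5 s gap between consecutive frames of user A's speech does not exceed T0 and is ignored, while the 2.0 s gap before user A resumes is reported. In the second trace user B's answer resets the measurement, so nothing is reported.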

The user B reaction waiting time T_B is calculated in the same way in the voice transmitting/receiving apparatus 10B.

Each time the user A reaction waiting time T_A is calculated by the reaction waiting time calculation process, it is sent to the conversation feature amount statistical processing unit 400 and subjected to statistical processing. As described earlier, the statistical processing is an arithmetic averaging over a predetermined number of past samples. The form of the statistical processing is not limited to arithmetic averaging.

The user A reaction waiting time T_A that has undergone statistical processing by the conversation feature amount statistical processing unit 400 is stored in the storage device 500 as the reaction waiting time data DAT_T_A. The reaction waiting time data DAT_T_A is updated each time the user A reaction waiting time T_A is calculated and the statistical processing is executed.

Returning again to FIG. 3, when the user A reaction waiting time T_A has been acquired, the processing delay amount d is set on the basis of the first calculation model and the second calculation model (step S103). The processing delay amount d is set separately for each of the first and second calculation models. Here, the processing delay amount d means the total processing delay amount of the remote conference system 1.

The first calculation model will now be described with reference to FIG. 7. FIG. 7 is a timing chart explaining the concept of the first calculation model. In FIG. 7, parts that also appear in FIG. 4 are given the same reference numerals, and their description is omitted as appropriate.

In FIG. 7, the processing delay amount d in the remote conference system 1 is defined by the following equation (1).

d = da + db + dproc ... (1)
In equation (1), da is the delay amount that corresponds one-to-one to the buffer capacity of the voice transmitting/receiving apparatus 10A. With the reception buffer delay amount da_r corresponding to the reception buffer capacity Da_r and the transmission buffer delay amount da_s corresponding to the transmission buffer capacity Da_s, it satisfies the relationship "da = da_r + da_s".

Likewise, in equation (1), db is the delay amount that corresponds one-to-one to the buffer capacity of the voice transmitting/receiving apparatus 10B. With the reception buffer delay amount db_r corresponding to the reception buffer capacity Db_r and the transmission buffer delay amount db_s corresponding to the transmission buffer capacity Db_s, it satisfies the relationship "db = db_r + db_s".

Further, dproc in equation (1) is the processing delay of the encoders and decoders in the voice transmitting/receiving apparatuses 10A and 10B.

More specifically, dproc is the sum of the encoder processing delay denca and the decoder processing delay ddeca of the voice transmitting/receiving apparatus 10A and the encoder processing delay dencb and the decoder processing delay ddecb of the voice transmitting/receiving apparatus 10B. Note that dproc is constant as long as the encoding rate of the encoders and decoders is constant.
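As an illustration of equation (1), the following minimal Python sketch adds up the delay components. The class name, function name and numeric values are assumptions chosen for this example and do not appear in the specification; all delays are taken to be in seconds.

```python
from dataclasses import dataclass


@dataclass
class EndpointDelays:
    """Delay components of one apparatus, in seconds (hypothetical container)."""
    recv_buffer: float  # da_r for apparatus 10A, db_r for apparatus 10B
    send_buffer: float  # da_s for apparatus 10A, db_s for apparatus 10B
    encoder: float      # denca or dencb
    decoder: float      # ddeca or ddecb

    def buffer_delay(self) -> float:
        # da = da_r + da_s (or db = db_r + db_s)
        return self.recv_buffer + self.send_buffer


def total_processing_delay(a: EndpointDelays, b: EndpointDelays) -> float:
    """Equation (1): d = da + db + dproc."""
    dproc = a.encoder + a.decoder + b.encoder + b.decoder
    return a.buffer_delay() + b.buffer_delay() + dproc


# Assumed sample values, for illustration only.
a = EndpointDelays(recv_buffer=0.020, send_buffer=0.020, encoder=0.005, decoder=0.005)
b = EndpointDelays(recv_buffer=0.030, send_buffer=0.010, encoder=0.005, decoder=0.005)
d = total_processing_delay(a, b)
print(f"d = {d:.3f} s")
```

With a fixed encoding rate, only the four buffer terms change between calls, which matches the remark that dproc is then constant.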

In FIG. 7, suppose that user A finishes speaking at time T0. When the conversation is held between the remote sites X and Y through the remote conference system 1, the time at which the end of user A's utterance is recognized differs from that in an ordinary conversation such as the one in FIG. 4.

That is, the end of user A's utterance is recognized at site Y at time T0', which follows T0 by the sum of the transmission buffer delay amount da_s of the voice transmitting/receiving apparatus 10A, the reception buffer delay amount db_r of the voice transmitting/receiving apparatus 10B, the encoder processing delay denca of apparatus 10A, the decoder processing delay ddecb of apparatus 10B, and the one-way delay OWDa of the network 20 from apparatus 10A to apparatus 10B. Equation (2) below therefore holds.

T0' = T0 + da_s + db_r + denca + ddecb + OWDa ... (2)

The one-way delay OWDa of the network 20 satisfies equation (3) below.

RTT = OWDa + OWDb ... (3)

Here, OWDb is the one-way delay of the network 20 from the voice transmitting/receiving apparatus 10B to the voice transmitting/receiving apparatus 10A.

Next, suppose that user B starts speaking at time T1', after the user B reaction time R_B has elapsed from time T0'. The time at which the start of user B's utterance is recognized also differs between site X and site Y. It is recognized at site X at time T2, which follows T1' by the sum of the transmission buffer delay amount db_s of the voice transmitting/receiving apparatus 10B, the reception buffer delay amount da_r of the voice transmitting/receiving apparatus 10A, the encoder processing delay dencb of apparatus 10B, the decoder processing delay ddeca of apparatus 10A, and the one-way delay OWDb. Equation (4) below therefore holds.

T2 = T0' + R_B + db_s + da_r + dencb + ddeca + OWDb ... (4)

That is, even if user B starts speaking after the user B reaction time R_B has elapsed from the end of user A's utterance, just as in the ordinary conversation illustrated in FIG. 4, the remote conference system 1 and the network 20 cause user A to recognize the start of user B's utterance only after the delay time TL defined by equation (5) below has elapsed.

TL = T2 - T0 = R_B + RTT + d ... (5)

The first calculation model uses the index value Z given by equation (6) below.

Z = TL / R_B ... (6)

The index value Z is greater than 1 and approaches 1 as conditions approach the ordinary conversation environment (such as the one illustrated in FIG. 4). The larger Z becomes, the more user A feels that the conversation is not proceeding smoothly, and the less comfortable the conversation may become.
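The timing relations of equations (2) to (6) can be checked numerically. The sketch below is illustrative only: the function name, argument names and sample values are assumptions, not part of the specification. It computes T0', T2, TL and Z and confirms that TL agrees with R_B + RTT + d.

```python
import math


def conversation_timeline(t0, r_b, da_s, da_r, db_s, db_r,
                          denca, ddeca, dencb, ddecb, owd_a, owd_b):
    """Return (T0', T2, TL, Z) following equations (2), (4), (5) and (6)."""
    t0_prime = t0 + da_s + db_r + denca + ddecb + owd_a         # equation (2)
    t2 = t0_prime + r_b + db_s + da_r + dencb + ddeca + owd_b   # equation (4)
    tl = t2 - t0                                                # equation (5)
    z = tl / r_b                                                # equation (6)
    return t0_prime, t2, tl, z


# Assumed sample values in seconds, for illustration only.
params = dict(t0=0.0, r_b=1.0,
              da_s=0.02, da_r=0.02, db_s=0.02, db_r=0.02,
              denca=0.005, ddeca=0.005, dencb=0.005, ddecb=0.005,
              owd_a=0.03, owd_b=0.03)
t0_prime, t2, tl, z = conversation_timeline(**params)

# Cross-check against the closed form TL = R_B + RTT + d.
rtt = params["owd_a"] + params["owd_b"]                         # equation (3)
d = sum(params[k] for k in ("da_s", "da_r", "db_s", "db_r",
                            "denca", "ddeca", "dencb", "ddecb"))  # equation (1)
consistent = math.isclose(tl, params["r_b"] + rtt + d)
print(f"T0'={t0_prime:.3f}  T2={t2:.3f}  TL={tl:.3f}  Z={z:.3f}  consistent={consistent}")
```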

A maximum value F (for example, F = 1.2) is determined experimentally in advance for the index value Z. F is an adapted value above which the degradation in conversation quality perceived by user A is judged to be no longer negligible. Conversely, as long as Z is no greater than F, user A does not perceive any practically significant lack of smoothness in the conversation with user B.

Specifically, equation (7) below is set as the conditional expression.

(R_B + RTT + d) / R_B ≤ F ... (7)

Rearranging equation (7) gives equation (8) below.

d ≤ (F - 1) R_B - RTT ... (8)

Equation (8) defines the range of values that the processing delay amount d may take. The processing delay amount d in equation (8) is an example of the "own-site first allowable processing delay amount" according to the present invention.
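Equation (8) translates directly into a small helper. The following Python sketch is a hedged illustration: the function name and the sample inputs are assumptions, and the default F = 1.2 is simply the example value mentioned in the text. It also shows that when d sits exactly on the bound, the index value Z equals F.

```python
import math


def first_model_bound(reaction_time, rtt, f_max=1.2):
    """Upper bound on d from equation (8): d <= (F - 1) * R - RTT."""
    return (f_max - 1.0) * reaction_time - rtt


r_b = 1.0    # assumed reaction time of user B, in seconds
rtt = 0.06   # assumed round-trip time of the network, in seconds
d_limit = first_model_bound(r_b, rtt)

# With d on the bound, Z = (R_B + RTT + d) / R_B reaches F.
z_at_limit = (r_b + rtt + d_limit) / r_b
print(f"d_limit = {d_limit:.3f} s, Z at limit = {z_at_limit:.3f}")
```

The same helper gives the bound for the other user when called with that user's counterpart reaction time. If the round-trip time alone exceeds (F - 1) times the reaction time, the bound is negative and no non-negative d satisfies the condition; this passage does not say how that case is handled.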

The processing delay amount d given by equation (8) is a value set for user A. For user B, a processing delay amount d is set by equation (9) below through the same process in the voice transmitting/receiving apparatus 10B. The processing delay amount d in equation (9) is an example of the "other-site first allowable processing delay amount" according to the present invention.

d ≤ (F - 1) R_A - RTT ... (9)

Next, the second calculation model is described with reference to FIG. 8, which is a timing chart explaining the concept of the second calculation model. In FIG. 8, portions that also appear in FIG. 7 are given the same reference signs, and their description is omitted where appropriate.

FIG. 8 adds the concept of the user A reaction waiting time T_A to FIG. 7.

The user A reaction waiting time T_A is the time from time T0, at which user A finishes speaking, to time T1', at which user A starts speaking again. As noted in the description of the first calculation model, after user A finishes speaking, user A recognizes the start of user B's utterance at time T2, that is, after the delay time TL has elapsed from time T0.

At time T2, however, user A may already have started speaking again on the assumption that user B has not responded, so that users A and B end up speaking at the same time. To ensure smooth conversation at site X, the user A reaction waiting time T_A must therefore be taken into account. The second calculation model is a calculation model that does so.

To avoid the simultaneous speech described above, the delay time TL need only be no greater than the user A reaction waiting time T_A, which gives equation (10) below.

R_B + RTT + d ≤ T_A ... (10)

Rearranging equation (10) gives equation (11) below.

d ≤ (T_A - R_B) - RTT ... (11)

Equation (11) defines the range of values that the processing delay amount d may take. The processing delay amount d in equation (11) is an example of the "own-site second allowable processing delay amount" according to the present invention.
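The bound of equation (11) can be sketched in the same way. The function name and the sample values below are assumptions made for illustration. The sketch also confirms that, with d on the bound, the delay time TL = R_B + RTT + d just reaches T_A, so user B's reply arrives no later than the moment user A would start speaking again.

```python
import math


def second_model_bound(waiting_time, reaction_time, rtt):
    """Upper bound on d from equation (11): d <= (T_A - R_B) - RTT."""
    return (waiting_time - reaction_time) - rtt


t_a = 1.5    # assumed reaction waiting time of user A, in seconds
r_b = 1.0    # assumed reaction time of user B, in seconds
rtt = 0.06   # assumed round-trip time of the network, in seconds
d_limit = second_model_bound(t_a, r_b, rtt)

# With d on the bound, TL equals T_A (equation (10) holds with equality).
tl_at_limit = r_b + rtt + d_limit
print(f"d_limit = {d_limit:.3f} s, TL at limit = {tl_at_limit:.3f} s")
```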

The processing delay amount d given by equation (11) is a value set for user A. For user B, a processing delay amount d is calculated by equation (12) below through the same process in the voice transmitting/receiving apparatus 10B. The processing delay amount d in equation (12) is an example of the "other-site second allowable processing delay amount" according to the present invention.

d ≤ (T_B - R_A) - RTT ... (12)

Once the processing delay amounts d have been set by the first and second calculation models, the maximum allowable processing delay amount dmax is determined (step S104). The maximum allowable processing delay amount dmax is a processing delay amount at which smooth conversation can be ensured for both user A and user B. The processing delay amounts set on the basis of the first and second calculation models must therefore be compared. More specifically, dmax must be a processing delay amount that satisfies all of equations (8), (9), (11) and (12).

In step S104, therefore, the processing delay amounts d set for user B on the basis of the first and second calculation models are acquired from the voice transmitting/receiving apparatus 10B via the processing delay information communication unit 800. Conversely, the processing delay amounts d set for user A on the basis of the first and second calculation models are transmitted to the voice transmitting/receiving apparatus 10B via the processing delay information communication unit 800. In this way, the conditional expressions used to determine the maximum allowable processing delay amount dmax are shared between the voice transmitting/receiving apparatuses 10A and 10B.

Subsequently, a processing delay amount d that satisfies all of equations (8), (9), (11) and (12) is determined. Specifically, the processing delay amount d bounded by whichever of these four conditional expressions has the smallest right-hand side satisfies all four.

On the other hand, a larger processing delay amount is preferable for preventing degradation of voice quality caused by audio dropouts, coding errors and the like. The maximum allowable processing delay amount dmax is therefore finally determined as the largest value within the range that satisfies these four expressions.
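The selection in step S104 amounts to taking the smallest of the four right-hand sides. The following Python sketch shows one way to express that; the function name, the dictionary keys and the sample values are assumptions made for illustration, and F = 1.2 is the example value from the text.

```python
import math


def max_allowable_delay(r_a, r_b, t_a, t_b, rtt, f_max=1.2):
    """Return (dmax, limiting_equation) as the tightest of the four bounds."""
    bounds = {
        "eq8": (f_max - 1.0) * r_b - rtt,   # own-site first allowable delay
        "eq9": (f_max - 1.0) * r_a - rtt,   # other-site first allowable delay
        "eq11": (t_a - r_b) - rtt,          # own-site second allowable delay
        "eq12": (t_b - r_a) - rtt,          # other-site second allowable delay
    }
    # The largest d satisfying all four inequalities is the smallest bound.
    limiting = min(bounds, key=bounds.get)
    return bounds[limiting], limiting


# Assumed sample values in seconds, for illustration only.
dmax, limiting = max_allowable_delay(r_a=0.8, r_b=1.0, t_a=1.5, t_b=1.4, rtt=0.06)
print(f"dmax = {dmax:.3f} s, limited by {limiting}")
```

In this example the user with the shorter reaction time (R_A) sets the limit, which reflects the idea that dmax must suit both parties to the conversation.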

Once the maximum allowable processing delay amount dmax has been determined, distribution processing is executed (step S105). The distribution processing distributes the determined dmax among the buffer capacities of the voice transmitting/receiving apparatuses 10A and 10B. It is executed by the negotiation unit 620.

Specifically, the negotiation unit 620 queries the voice transmitting/receiving apparatus 10B about its load status via the processing delay information communication unit 800. If neither apparatus 10B nor apparatus 10A has a processing-load problem, the negotiation unit 620 decides that apparatus 10A will bear 50% of the maximum allowable processing delay amount dmax and notifies apparatus 10B accordingly. Thus, absent any processing-load problem, 50% of dmax is normally borne by apparatus 10A.

The negotiation unit 620 further divides the delay amount to be borne by the voice transmitting/receiving apparatus 10A between the transmission buffer 130 and the reception buffer 220. Here too, the shares of the transmission buffer 130 and the reception buffer 220 are normally set equal. That is, the processing delay amount of the transmission buffer 130 is set to 25% of the maximum allowable processing delay amount dmax, and the processing delay amount of the reception buffer 220 is likewise set to 25% of dmax. The distribution ratio thus set is conveyed to the buffer control unit 900.
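The default distribution described above (50% per apparatus, then an equal split between the transmission and reception buffers) can be sketched as follows. The function name and parameters are assumptions; in particular, the text gives no rule for unequal shares when one apparatus is heavily loaded, and it does not state how apparatus 10B splits its own share, so the adjustable share_a and the symmetric split for 10B are hypothetical.

```python
import math


def distribute_delay(dmax, share_a=0.5, send_ratio=0.5):
    """Split dmax between apparatuses 10A and 10B, then between their buffers.

    share_a    -- fraction of dmax borne by apparatus 10A (0.5 in the normal case)
    send_ratio -- fraction of each apparatus's share given to its transmission buffer
    """
    if not (0.0 <= share_a <= 1.0 and 0.0 <= send_ratio <= 1.0):
        raise ValueError("ratios must lie in [0, 1]")
    total_a = dmax * share_a
    total_b = dmax - total_a
    return {
        "10A": {"send_buffer": total_a * send_ratio,
                "recv_buffer": total_a * (1.0 - send_ratio)},
        # Assumption: apparatus 10B applies the same split to its own share.
        "10B": {"send_buffer": total_b * send_ratio,
                "recv_buffer": total_b * (1.0 - send_ratio)},
    }


alloc = distribute_delay(0.10)  # assumed dmax of 0.10 s
for name, buffers in alloc.items():
    print(name, {k: round(v, 4) for k, v in buffers.items()})
```

With the defaults, each of the four buffers receives 25% of dmax, matching the percentages given for the transmission buffer 130 and the reception buffer 220.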

Having received the distribution ratio, the buffer control unit 900 controls the capacities of the transmission buffer 130 and the reception buffer 220 so that processing delay amounts corresponding to the set distribution ratio are obtained (step S106). Once the buffer capacities have been controlled, the processing returns to step S101 and the series of steps is repeated. Buffer capacity control is executed in this manner.

Thus, according to the buffer capacity control of this embodiment, the maximum allowable processing delay amount dmax is determined from the processing delay amounts d set on the basis of the first and second calculation models, and the processing delay amount d of the remote conference system 1 is controlled to this dmax through control of the reception buffer capacity Da_r and the transmission buffer capacity Da_s. The first and second calculation models are each constructed so as to reflect the users' conversation feature amounts. The dmax thus determined therefore matches the actual circumstances of users A and B, the parties to the conversation, and ensures at least a minimum level of smoothness for both.

A loss of comfort caused by impaired smoothness of conversation is therefore prevented. At the same time, the maximum allowable processing delay amount dmax is the largest value within the range in which such a loss of comfort can be prevented, so degradation of voice quality due to audio dropouts, coding errors and the like is also prevented. In other words, this embodiment provides optimal conversation quality suited to the circumstances of the users taking part in the conversation.

In this embodiment, two users, user A and user B, take part in the conversation. Needless to say, the same concept can be applied to conferences and conversations with three or more participants to provide optimal conversation quality suited to the users' circumstances.

This embodiment assumes an audio conference, but imaging means may also be placed at each site so that image data or video data is transmitted and received together with the voice data. In that case, a conference format such as a so-called TV conference may be adopted.

Because the voice transmitting/receiving apparatus of this embodiment includes the storage unit 500, it can also retain past conversation feature amounts of a plurality of users. There is of course no guarantee that past conversation feature amounts can be used unchanged in the next session. Using them as initial values, for example, is nevertheless advantageous because it suppresses variation in conversation quality during the initial operation of the conversation feature amount acquisition unit 300 and the processing delay amount determination unit 600.

In this embodiment, the processing delay amount is controlled through control of the buffer capacity, but the controlled quantity correlated with the processing delay amount may instead be the encoding rate of the voice data (for example, the encoding bit rate of the encoder 120). A higher encoding rate gives relatively higher sound quality but also increases the amount of data transmitted, so the processing delay amount increases. The same effect as above can therefore be obtained by controlling the encoding rate instead of, or in addition to, the buffer capacity described above.

<Second embodiment>
Next, a second embodiment of the present invention is described with reference to FIG. 9, which is a schematic configuration diagram conceptually showing the configuration of a remote conference system 2 according to the second embodiment. In FIG. 9, portions that also appear in FIG. 1 are given the same reference signs, and their description is omitted where appropriate.

In FIG. 9, the remote conference system 2 differs from the remote conference system 1 of the first embodiment in that it includes a server device 30 accommodated in the network 20.

The server device 30 is a computer system that mediates between the voice transmitting/receiving apparatuses 10A and 10B, and is an example of the "server device" according to the present invention.

The server device 30 includes the processing delay amount determination unit 600 of the first embodiment. The calculations of the processing delay amounts d based on the first and second calculation models, of the maximum allowable processing delay amount dmax, and of the buffer capacity of each voice transmitting/receiving apparatus are all executed by the server device 30.

The server device 30 also includes a notification unit that notifies each voice transmitting/receiving apparatus of the determined buffer capacity. In each voice transmitting/receiving apparatus that receives the notification, the buffer control unit 900 controls the transmission and reception buffers in accordance with the notified buffer capacity.

Thus, the second embodiment, like the first, provides conversation quality that does not impair comfort. In particular, relatively heavy processing, such as the various calculations based on the calculation models and the determination of the buffer capacity distribution ratio, is borne by the server device 30 instead of the voice transmitting/receiving apparatuses. This reduces the load on the voice transmitting/receiving apparatuses and allows the conversation to be controlled more smoothly.

The present invention is not limited to the embodiments described above and may be modified as appropriate without departing from the gist or spirit of the invention that can be read from the claims and the specification as a whole. Voice transmitting/receiving apparatuses, voice transmitting/receiving systems and server devices incorporating such modifications are also included in the technical scope of the present invention.

The present invention is applicable to apparatuses and systems that establish conversations between users at remote locations.

DESCRIPTION OF SYMBOLS
1 ... remote conference system, 10A, 10B ... voice transmitting/receiving apparatus, 20 ... network, 100 ... voice input unit, 110 ... voice input section, 120 ... encoder, 130 ... transmission buffer, 140 ... voice data transmission unit, 200 ... voice output unit, 210 ... voice data reception unit, 220 ... reception buffer, 230 ... decoder, 240 ... voice output section, 300 ... conversation feature amount detection unit, 400 ... conversation feature amount statistical processing unit, 500 ... storage unit, 600 ... processing delay amount determination unit, 610 ... allowable processing delay estimation unit, 620 ... negotiation unit, 700 ... RTT measurement unit, 800 ... processing delay information communication unit, 900 ... buffer control unit.

Claims

A voice transmitting/receiving apparatus for use in a voice transmitting/receiving system that includes voice transmitting/receiving apparatuses installed at each of a plurality of sites and each accommodated in a network, the system being capable of establishing a conversation between users at the plurality of sites via the voice transmitting/receiving apparatuses, the voice transmitting/receiving apparatus comprising:
communication means for transmitting and receiving conversation data, which represents the content of the conversation and includes at least voice data, to and from the voice transmitting/receiving apparatus installed at another site other than the own site among the plurality of sites;
conversation feature amount acquisition means for acquiring a predetermined conversation feature amount related to utterance timing in the conversation for each of an own-site user at the own site and an other-site user at the other site;
processing delay amount determining means for determining, on the basis of the acquired conversation feature amount, a total amount over the entire voice transmitting/receiving system of a processing delay amount of the conversation data, the magnitude of the processing delay amount corresponding to both the magnitude of a time delay of the conversation and the sound quality of the conversation; and
processing delay amount control means for controlling the processing delay amount so that the determined total amount is satisfied.

The voice transmitting/receiving apparatus according to claim 1, wherein the conversation feature amount acquisition means acquires, as the conversation feature amount, at least one of an own-site user reaction time, which is the time from when output of the other-site user's utterance ends at the own site until the own-site user starts speaking, and an other-site user reaction time, which is the time from when output of the own-site user's utterance ends at the other site until the other-site user starts speaking.

The voice transmitting/receiving apparatus according to claim 2, wherein the processing delay amount determining means determines the total amount within a range not exceeding the smaller of an own-site first allowable processing delay amount, whose magnitude varies with the acquired other-site user reaction time, and an other-site first allowable processing delay amount, whose magnitude varies with the acquired own-site user reaction time.

The voice transmitting/receiving apparatus according to claim 2, wherein the communication means transmits own-site user reaction time data corresponding to the acquired own-site user reaction time, via the network, to the voice transmitting/receiving apparatus installed at the other site.

The voice transmitting/receiving apparatus according to claim 2, wherein the conversation feature amount acquisition means further acquires, as the conversation feature amount, an own-site user reaction waiting time, which is the time from when the own-site user finishes speaking at the own site until the own-site user starts speaking again.

The voice transmitting/receiving apparatus according to claim 5, wherein the total amount is determined within a range not exceeding the smaller of an own-site second allowable processing delay amount, whose magnitude varies with the difference between the acquired own-site user reaction waiting time and the acquired other-site user reaction time, and an other-site second allowable processing delay amount, whose magnitude varies with the difference between an other-site user reaction waiting time, acquired as the time from when the other-site user finishes speaking at the other site until the other-site user starts speaking again, and the acquired own-site user reaction time.

The voice transmitting/receiving apparatus according to claim 6, wherein the processing delay amount determining means determines the total amount within a range not exceeding the minimum of an own-site first allowable processing delay amount, whose magnitude varies with the acquired other-site user reaction time, an other-site first allowable processing delay amount, whose magnitude varies with the acquired own-site user reaction time, the own-site second allowable processing delay amount, and the other-site second allowable processing delay amount.

The voice transmitting/receiving apparatus according to claim 1, further comprising a buffer for temporarily storing the conversation data before and after the transmission and reception, wherein the processing delay amount control means controls a buffer capacity of the buffer on the basis of the determined total amount.

The voice transmitting/receiving apparatus according to claim 1, wherein the processing delay amount control means controls an encoding rate at which the conversation data is encoded on the basis of the determined total amount.

The voice transmitting/receiving apparatus according to claim 1, further comprising transmission state acquisition means for acquiring a transmission state of the network, wherein the processing delay amount determining means determines the total amount on the basis of the acquired conversation feature amount and the acquired transmission state.

The voice transmitting/receiving apparatus according to claim 10, wherein the transmission state acquisition means acquires a transmission delay amount of the network as the transmission state of the network.

The voice transmitting/receiving apparatus according to claim 1, further comprising statistical processing means for statistically processing the acquired conversation feature amount, wherein the processing delay amount determining means determines the total amount on the basis of the statistically processed conversation feature amount.

The voice transmitting/receiving apparatus according to claim 1, further comprising storage means for storing the acquired conversation feature amount.

A voice transmitting/receiving system that includes voice transmitting/receiving apparatuses installed at each of a plurality of sites and each accommodated in a network, and that is capable of establishing a conversation between users at the plurality of sites via the voice transmitting/receiving apparatuses, wherein each voice transmitting/receiving apparatus comprises:
communication means for transmitting and receiving conversation data, which represents the content of the conversation and includes at least voice data, to and from the voice transmitting/receiving apparatus installed at another site other than the own site among the plurality of sites;
conversation feature amount acquisition means for acquiring a predetermined conversation feature amount related to utterance timing in the conversation for each of an own-site user at the own site and an other-site user at the other site;
processing delay amount determining means for determining, on the basis of the acquired conversation feature amount, a total amount over the entire voice transmitting/receiving system of a processing delay amount of the conversation data, the magnitude of the processing delay amount corresponding to both the magnitude of a time delay of the conversation and the sound quality of the conversation; and
processing delay amount control means for controlling the processing delay amount so that the determined total amount is satisfied.

A server device in a voice transmitting/receiving system, the system including voice transmitting/receiving apparatuses that are installed at each of a plurality of sites, are each accommodated in a network, and each include communication means for transmitting and receiving, via the network, conversation data that represents the content of a conversation and includes at least voice data, to and from the voice transmitting/receiving apparatus installed at another site other than the own site among the plurality of sites, the system being capable of establishing a conversation between users at the plurality of sites via the voice transmitting/receiving apparatuses, the server device being accommodated in the network and comprising:
conversation feature amount acquisition means for acquiring, from the plurality of voice transmitting/receiving apparatuses via the network, a predetermined conversation feature amount related to utterance timing in the conversation for each of an own-site user at the own site and an other-site user at the other site;
processing delay amount determining means for determining, on the basis of the acquired conversation feature amount, a total amount over the entire voice transmitting/receiving system of a processing delay amount of the conversation data, the magnitude of the processing delay amount corresponding to both the magnitude of a time delay of the conversation and the sound quality of the conversation; and
notification means for notifying the plurality of voice transmitting/receiving apparatuses of the determined total amount via the network.