WO2012014275A1 - Audio transmitting/receiving device, audio transmitting/receiving system and server device - Google Patents
Audio transmitting/receiving device, audio transmitting/receiving system and server device
- Publication number
- WO2012014275A1 (application PCT/JP2010/062558)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- conversation
- processing delay
- voice
- delay amount
- time
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M3/00—Automatic or semi-automatic exchanges
- H04M3/22—Arrangements for supervision, monitoring or testing
- H04M3/2227—Quality of service monitoring
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L12/00—Data switching networks
- H04L12/02—Details
- H04L12/16—Arrangements for providing special services to substations
- H04L12/18—Arrangements for providing special services to substations for broadcast or conference, e.g. multicast
- H04L12/1813—Arrangements for providing special services to substations for broadcast or conference, e.g. multicast for computer conferences, e.g. chat rooms
- H04L12/1827—Network arrangements for conference optimisation or adaptation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L65/00—Network arrangements, protocols or services for supporting real-time applications in data packet communication
- H04L65/40—Support for services or applications
- H04L65/403—Arrangements for multi-party communication, e.g. for conferences
Definitions
- The present invention relates to the technical field of voice transmission / reception apparatuses, voice transmission / reception systems, and server apparatuses that allow, for example, conversations between users at remote locations to proceed smoothly.
- Patent Document 1 proposes a real-time audio reproducing device with this kind of purpose. This device is said to ensure overall conversation quality by keeping voice loss within the maximum voice drop rate at which a conversation can still be established, and voice delay within the maximum delay time at which a conversation can still be established.
- Patent Document 2 discloses a configuration that includes a buffer monitoring unit and a buffer control unit, which controls, during silent periods, a buffer for absorbing delay fluctuation, based on the number of buffer underflows that occurred during voiced periods.
- JP 2002-223247 A; Japanese Patent Laid-Open No. 11-215812
- In general, the longer the delay time, the more the voice quality of the conversation can be improved, because sound interruptions and errors can be suppressed.
- On the other hand, if the delay time is large, it takes a long time for one user to recognize the start of the other party's utterance after finishing his or her own, so the smooth progress of the conversation is likely to be hindered. The technical idea of keeping the delay time within the maximum delay time at which a conversation is established, as in the device of Patent Document 1, is therefore useful in that it finds a compromise between voice quality and smooth progress.
- However, the maximum delay time at which a conversation is established can change with, for example, the personality or the physical or mental load of each user participating in the conversation, or the user's specific circumstances at that time. For example, the maximum delay time is short for an impatient user and long for a leisurely user, short for an irritated user and long for a relaxed user, and becomes longer or shorter depending on physical condition. Apart from that, it naturally becomes shorter if the user is in a hurry for some reason.
- The maximum delay time described in Patent Document 1 is therefore not necessarily the maximum delay time at which a conversation is truly established. Even if the device of Patent Document 1 is applied to a conversation, comfort may be reduced for some users; conversely, the delay time may be too short, so that voice quality is lowered unnecessarily. That is, the device of Patent Document 1 has the technical problem that the delay time is difficult to optimize, because the device has no means of adapting to the situation on the user side.
- The present invention has been made in view of such problems, and its object is to provide an audio transmission / reception apparatus, an audio transmission / reception system, and a server apparatus that can provide optimal conversation quality in accordance with the circumstances of the users participating in a conversation.
- The voice transmission / reception apparatus is used in a voice transmission / reception system that includes voice transmission / reception apparatuses installed at a plurality of bases, each accommodated in a network, and that can establish a conversation between users respectively present at the plurality of bases through those apparatuses. The apparatus comprises: communication means for transmitting / receiving, via the network, conversation data representing the content of the conversation and including at least voice data, to / from the voice transmission / reception apparatus installed at another base other than the own base; conversation feature quantity acquisition means for acquiring a predetermined conversation feature quantity related to the utterance timing in the conversation for each of the own-base user present at the own base and the other-base user present at the other base; processing delay amount determining means for determining, based on the acquired conversation feature quantity, the total amount, over the entire voice transmission / reception system, of the processing delay amount of the conversation data, whose magnitude corresponds to the amount of time delay of the conversation and to the level of the voice quality of the conversation; and processing delay amount control means for controlling the processing delay amount so that the determined total amount is satisfied.
- The voice transmission / reception system includes voice transmission / reception apparatuses installed at a plurality of bases, each accommodated in a network, and can establish a conversation between users respectively present at the plurality of bases. Each voice transmission / reception apparatus comprises: communication means for transmitting / receiving, via the network, conversation data representing the content of the conversation and including at least voice data, to / from the voice transmission / reception apparatus installed at another base other than the own base; conversation feature quantity acquisition means for acquiring a predetermined conversation feature quantity related to the utterance timing in the conversation for each of the own-base user present at the own base and the other-base user present at the other base; processing delay amount determining means for determining, based on the acquired conversation feature quantity, the total amount, over the entire voice transmission / reception system, of the processing delay amount of the conversation data, whose magnitude corresponds to the amount of time delay of the conversation and to the level of the voice quality of the conversation; and processing delay amount control means for controlling the processing delay amount so that the determined total amount is satisfied.
- The server device is accommodated in a network and is used in a voice transmission / reception system that includes voice transmission / reception apparatuses installed at a plurality of bases, each comprising communication means for transmitting / receiving conversation data representing the content of the conversation, including at least voice data, to / from the voice transmission / reception apparatus installed at a base other than its own base, the system being capable of establishing a conversation between users respectively present at the plurality of bases via the voice transmission / reception apparatuses. The server device comprises: conversation feature quantity acquisition means for acquiring a predetermined conversation feature quantity from the plurality of voice transmission / reception apparatuses via the network; processing delay amount determining means for determining, based on the acquired conversation feature quantity, the total amount, over the entire voice transmission / reception system, of the processing delay amount of the conversation data, whose magnitude corresponds to the amount of time delay of the conversation and to the level of the voice quality of the conversation; and notifying means for notifying the plurality of voice transmission / reception apparatuses of the determined total amount via the network.
- FIG. 2 is a block diagram conceptually showing a configuration of an audio transmission / reception device in the remote conference system of FIG. 1. It is a flowchart of the buffer capacity control performed in the audio
- An embodiment of the voice transmission / reception apparatus of the present invention is a voice transmission / reception apparatus in a voice transmission / reception system that includes voice transmission / reception apparatuses installed at a plurality of bases, each accommodated in a network, and that can establish a conversation between users present at the plurality of bases via those apparatuses. The apparatus includes communication means for transmitting and receiving, via the network, conversation data representing the content of the conversation and including at least voice data, to and from the voice transmission / reception apparatus installed at another base other than the own base; conversation feature quantity acquisition means for acquiring a predetermined conversation feature quantity related to the timing of utterance in the conversation for each of the own-base user present at the own base and the other-base user present at the other base; and processing delay amount determining means for determining, based on the acquired conversation feature quantity, the total amount, over the entire voice transmission / reception system, of the processing delay amount of the conversation data, whose magnitude corresponds to the amount of time delay of the conversation and to the level of the voice quality of the conversation.
- The voice transmission / reception apparatus according to the embodiment is one of the voice transmission / reception apparatuses that make up a voice transmission / reception system, which is a system capable of establishing a conversation between users present at a plurality of different bases.
- The voice transmission / reception apparatus according to the embodiment is accommodated in a network either at all times or only when some condition is satisfied (the condition is not limited in any way).
- The “network” is a concept encompassing various data communication networks, such as a WAN (Wide Area Network), a LAN (Local Area Network), or the Internet in which WANs or LANs are connected as appropriate via telephone lines, ADSL (Asymmetric Digital Subscriber Line), optical fiber cables, or the like.
- The voice transmission / reception apparatus has communication means, through which it can send and receive conversation data via the network to and from a voice transmission / reception apparatus installed at another base (that is, the base on the other side of the conversation) different from the own base.
- The mode of transmitting and receiving the various information, data, or data files, including the conversation data, to and from the voice transmission / reception apparatuses installed at other bases may, for example, go through an appropriate server apparatus or be P2P (Peer To Peer). The components of the voice transmission / reception system are likewise not limited.
- The self-base means the base where the voice transmitting / receiving apparatus according to the embodiment is installed, and does not mean a specific base.
- The “conversation data” in the embodiment is a concept covering data that is necessary or meaningful for establishing a conversation between users at different bases, and is defined as including at least audio data.
- The conversation data may also include image data, video data, and the like.
- “Conversation” means an act of communication between users accompanied by voice, and it occurs in diverse situations.
- The “conversation” in the embodiment preferably includes, in addition to daily conversation, the remarks of each participant in a conference or meeting.
- The data transmitted to the voice transmitting / receiving apparatus installed at the other site may be, for example, encoded data obtained by passing the analog voice of the own-site user, collected through sound collecting means such as a microphone, through an encoder or the like.
- This data is received by the voice transmitting / receiving apparatus installed at the other site (that is, it becomes received data on the other-site side), is decoded via a decoder or the like, and is finally output via output means such as a speaker, so that it can be provided to the other-site user as analog voice. The same applies when the utterance side and the reception side are switched.
- By repeating the transmission and reception of conversation data in this way, the voice transmitting / receiving apparatus can, for example, establish a conversation between users at different bases.
- The voice transmission / reception apparatus includes processing delay amount determining means for determining the total amount of conversation data processing delay in the entire voice transmission / reception system (hereinafter referred to as the “total system processing delay amount” as appropriate), and processing delay amount control means for controlling the processing delay amount so that the total system processing delay amount is satisfied.
- The “processing delay amount of conversation data” is the amount of delay that occurs when the voice transmission / reception apparatuses at the respective bases process conversation data. It is a controllable delay amount whose magnitude corresponds both to the amount of time delay of the conversation between the own-site user and the other-site user and to the level of the voice quality of that conversation.
- The total system processing delay amount is the sum of the processing delay amounts on the utterance side and the reception side. For example, to obtain high voice quality with few sound interruptions and coding errors, a larger total system processing delay amount is better.
- The time delay of the conversation affects conversation quality in a dimension different from voice quality. More specifically, if the amount of time delay becomes too large, it becomes difficult for the users to share the same time axis, and the real-time nature of the conversation is reduced. This loss of real-time nature in itself lowers comfort and causes the user discomfort. It also causes a secondary deterioration in conversation quality through reduced smoothness, for example when a user notices only after starting to speak that the other party is already speaking, and it easily leads to a kind of vicious circle.
- In other words, the voice quality and the comfort of the conversation are in a trade-off relationship, so the total system processing delay amount, which affects both, needs to be optimized.
- The conversation data according to the embodiment may include image data and video data for identifying a user's facial expression.
- However, transmitting and receiving this kind of image and video data is itself a high-load process. If the voice is to be output in synchronization with the image or video, the amount of time delay of the conversation increases. If, on the other hand, they are controlled independently, the facial expression and the voice fall out of synchronization, so the effect of suppressing the decrease in comfort is limited. That is, the problems described above cannot be solved simply by also using images or video.
- The conversation feature quantity acquisition means acquires a predetermined conversation feature quantity for each of the own site user present at the own site and the other site user present at the other site.
- The processing delay amount determining means determines the total system processing delay amount based on the acquired conversation feature quantity.
- The conversation feature quantity means a physical quantity, a control quantity, or any of various standardized index values related to the timing of utterance in conversation.
- The timing of utterance in conversation is roughly divided into the timing for starting an utterance after the other party's utterance ends, and the timing for starting to speak again after judging, once one's own utterance has ended, that the other party is not responding.
- These are both important elements that define the basic rhythm at which a conversation progresses, and they can be significantly related to the optimum total system processing delay amount. For example, the optimum total system processing delay amount can be said to be small for a user whose utterance timing is early, and large for a user whose utterance timing is late.
- The timing of utterances in conversation varies widely with each user's personality and preferences. Even for the same user, it can change in many ways depending on the physical or mental load or the time available at that moment, or on various other specific circumstances.
- With the conversation feature quantity acquisition means, utterance timing that can vary in these ways can be grasped continually and with high accuracy.
- The processing delay amount control means controls the processing delay amount so that the determined total system processing delay amount is satisfied.
- The operation of the processing delay amount control means may include a process of determining the processing delay amount to be borne by each base.
- The total system processing delay amount is the total processing delay amount to be secured for the entire system, including the voice transmitting / receiving apparatus on the utterance side and the voice transmitting / receiving apparatus on the receiving side.
- As long as the total system processing delay amount is maintained, the rhythm, tempo, and the like of the conversation do not change, so the conversation quality related to comfort does not change. Cases can therefore arise in which the processing delay amount may be distributed between the uttering side and the receiving side as appropriate; in such cases, the processing delay amount at each base is arbitrary so long as the total system processing delay amount is satisfied.
- For example, the determined total system processing delay amount may be shared with the apparatus on the other-site side, and the distribution ratio may be determined through negotiation between the two according to a predetermined algorithm.
- Alternatively, the processing delay amount at each base may be an equal share of the total system processing delay amount.
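As an illustration only, the distribution of a total system processing delay amount among bases might be sketched as follows. The function name `split_total_delay`, the millisecond unit, and the weight-based ratio are assumptions of this sketch and do not come from the specification; with no weights given, the total is divided equally.

```python
from typing import List, Optional, Sequence


def split_total_delay(total_ms: float,
                      weights: Optional[Sequence[float]] = None,
                      n_sites: int = 2) -> List[float]:
    """Split a total processing delay among sites.

    Hypothetical helper: with no weights the total is divided equally;
    otherwise each site receives a share proportional to its weight.
    """
    if total_ms < 0:
        raise ValueError("total delay must be non-negative")
    if weights is None:
        weights = [1.0] * n_sites
    weight_sum = sum(weights)
    if weight_sum <= 0 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative and not all zero")
    # The shares always add up to the total, so the conversation tempo
    # seen by the users is unchanged by how the delay is distributed.
    return [total_ms * w / weight_sum for w in weights]
```

Calling `split_total_delay(200.0)` gives each of two bases the same share, while passing weights such as `[1, 2]` models a distribution ratio agreed between the two apparatuses.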
- With the voice transmitting and receiving apparatus, the voice quality of the conversation and comfort can always be balanced according to the state of the user at that time. That is, optimal conversation quality is provided.
- If the processing delay amount or the total system processing delay amount is determined in advance experimentally, empirically, theoretically, or based on simulations, the circumstances of the user at that time are not actually considered at all. Although the user's basic personality, which can be estimated in advance, can be accommodated, there is almost no adaptability to uncertainties and disturbance factors. The total system processing delay amount therefore tends to deviate from the true optimum value, which differs from user to user and changes even for the same user. As a result, the total system processing delay amount is suppressed unnecessarily even though it could be increased to improve voice quality, or conversely it is too large, so that the tempo or rhythm of the conversation deviates from the user's own tempo or rhythm and the user's comfort is reduced.
- In the embodiment, by contrast, the total system processing delay amount is set based on the conversation feature quantities acquired for both the own site user and the other site user, so the largest total system processing delay amount can be set within a range that causes neither user discomfort, which is extremely useful in practice.
- “Acquisition” by the conversation feature quantity acquisition means refers to finally settling the reference value used for control, and the process is not limited in any way. That is, the means may acquire the conversation feature quantity from outside through a network or the like, or may obtain it through internal processing such as calculation, derivation, estimation, identification, or selection. The process of acquiring the own site user's conversation feature quantity may also differ from that of the other site user.
- The conversation feature quantity acquisition means preferably repeats the acquisition each time an utterance occurs, or at regular or irregular intervals; this is more effective because it can follow the latest situation on the user side. Applying statistical processing based on previously obtained conversation feature quantities is also effective, because it prevents sudden changes in the conversation feature quantity and improves its estimation accuracy.
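The text does not say which statistical processing is used. As one possible sketch, an exponential moving average can damp sudden changes in a repeatedly acquired feature quantity; the class name `SmoothedFeature` and the smoothing factor `alpha` are assumptions introduced here for illustration.

```python
from typing import Optional


class SmoothedFeature:
    """Exponential moving average of a conversation feature quantity.

    Hypothetical helper: the choice of an exponential moving average is an
    assumption of this sketch, not something the specification prescribes.
    """

    def __init__(self, alpha: float = 0.2) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.value: Optional[float] = None

    def update(self, sample: float) -> float:
        """Fold a new sample into the estimate and return the estimate."""
        if self.value is None:
            # The first sample initializes the estimate directly.
            self.value = sample
        else:
            self.value = self.alpha * sample + (1.0 - self.alpha) * self.value
        return self.value
```

A smaller `alpha` gives more weight to past values and so suppresses outliers more strongly, at the cost of reacting more slowly to a real change in the user's state.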
- The conversation feature quantity acquisition means, processing delay amount determination means, and processing delay amount control means provided in the voice transmitting / receiving apparatus may, individually or as a whole, take various forms, such as various processors like a CPU (Central Processing Unit) or an MPU (Micro Processing Unit), various controllers, or various functional modules.
- In one aspect, the conversation feature quantity acquisition means acquires, as the conversation feature quantity, at least one of an own site user reaction time, which is the time from when the output of the other site user's utterance ends at the own site until the own site user starts an utterance, and an other site user reaction time, which is the time from when the output of the own site user's utterance ends at the other site until the other site user starts an utterance.
- According to this aspect, the own site user reaction time, the other site user reaction time, or both are acquired as the conversation feature quantity.
- The reaction time is the time from when the output of one party's utterance ends until the other party starts an utterance, and is a conversation feature quantity that correlates with the above-mentioned “timing for starting an utterance after the other party's utterance ends”.
- Here, utterance output means output via output means such as a speaker, and its start and end times are later than the start and end times of the other party's actual speech act. That is, the reaction time is each user's pure reaction time, excluding the effects of the processing delay and the various other delays described above, and it defines the basic tempo and rhythm of conversation for each user. It is therefore suitable as a reference value for determining the optimum total system processing delay amount.
- Because the reaction time is specific to each user, the own site user reaction time and the other site user reaction time naturally differ; they may be nearly equal or may differ greatly.
- The total system processing delay amount is determined based on at least one of the own site user reaction time and the other site user reaction time, and more preferably both, so the best conversation quality can be provided while at least the required level of comfort is ensured.
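A minimal sketch of measuring a reaction time as defined above, assuming each site records two timestamps on a common local clock. The names `TurnTiming` and `reaction_time`, and the decision to return `None` when the user starts speaking before playback has ended, are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnTiming:
    """Timestamps in seconds, observed at one site for one turn change."""
    remote_output_end: float   # remote user's utterance finished playing locally
    local_speech_start: float  # local user started the next utterance


def reaction_time(turn: TurnTiming) -> Optional[float]:
    """Return the local user's reaction time in seconds.

    Both timestamps are taken at the listener's site, so network and
    processing delays do not enter the result. Returns None if the user
    spoke before playback ended (an assumption of this sketch).
    """
    elapsed = turn.local_speech_start - turn.remote_output_end
    return elapsed if elapsed >= 0.0 else None
```

The own site user reaction time comes from timestamps taken at the own site, and the other site user reaction time from timestamps taken at the other site in the same way.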
- In another aspect, the processing delay amount determination means determines the total amount within a range equal to or less than the minimum of an own site first allowable processing delay amount, which changes with the magnitude of the acquired other site user reaction time, and an other site first allowable processing delay amount, which changes with the magnitude of the acquired own site user reaction time.
- According to this aspect, the total system processing delay amount is determined within a range equal to or less than the minimum of the own site first allowable processing delay amount and the other site first allowable processing delay amount (for a two-party conversation, the smaller of the two). Conversation quality that causes no discomfort for either the own site user or the other site user can therefore be provided.
- The own site first allowable processing delay amount and the other site first allowable processing delay amount may be calculated in any component of the voice transmission / reception system. That is, as long as the total system processing delay amount is finally determined within a range equal to or less than their minimum, not all of these calculations need to be performed by the voice transmitting / receiving apparatus according to the embodiment.
- For example, the voice transmitting / receiving apparatus may only acquire, via the network, the other site user reaction time detected at the other site and calculate the own site first allowable processing delay amount. In this case, the other site acquires the own site user reaction time transmitted from the own site and calculates the other site first allowable processing delay amount.
- If processing is distributed among the apparatuses in this way, concentration of load on a single voice transmitting / receiving apparatus can be prevented.
- In another aspect, the communication means transmits own site user reaction time data corresponding to the acquired own site user reaction time, via the network, to the voice transmission / reception apparatus installed at the other site.
- According to this aspect, the acquired own site user reaction time is transmitted to the other site as own site user reaction time data. Part of the calculation of the various reference values ultimately used to determine the total system processing delay amount can therefore be left to the other-site apparatus, which works from the own site user reaction time, so the processing burden can be distributed.
- In another aspect, the conversation feature quantity acquisition means further acquires, as the conversation feature quantity, an own site user reaction waiting time, which is the time from when the own site user ends an utterance at the own site until the own site user starts to speak again.
- According to this aspect, the own site user reaction waiting time is acquired as the conversation feature quantity.
- The reaction waiting time is the time from when a user ends an utterance until the same user starts to speak again, and is a conversation feature quantity that correlates with the “timing for starting to speak again” described above. That is, the reaction waiting time reflects each user's personality, preferences, and various circumstances, and it defines the basic tempo and rhythm of conversation for each user. It is therefore suitable as a reference value for determining the optimum total system processing delay amount.
- In this aspect, the total amount may be determined within a range equal to or less than the minimum of an own site second allowable processing delay amount, which changes with the difference between the acquired own site user reaction waiting time and the acquired other site user reaction time, and an other site second allowable processing delay amount, which changes with the difference between the other site user reaction waiting time and the own site user reaction time.
- In this case, the total system processing delay amount is determined within a range equal to or less than the minimum of the own site second allowable processing delay amount and the other site second allowable processing delay amount (for a two-party conversation, the smaller of the two). Conversation quality that causes no discomfort for either the own site user or the other site user can therefore be provided.
- The second allowable processing delay amount, which changes with the difference between the reaction waiting time and the reaction time, can be a reference value that is extremely useful in practice for avoiding the state in which both users speak at once because one user, misjudging that the other user has not reacted to his or her utterance, speaks again.
- The own site second allowable processing delay amount and the other site second allowable processing delay amount may be calculated in any component of the voice transmission / reception system. That is, as long as the total system processing delay amount is finally determined within a range equal to or less than their minimum, not all of these calculations need to be performed by the voice transmitting / receiving apparatus according to the embodiment.
- For example, the voice transmitting and receiving apparatus may only acquire, via the network, the other site user reaction time detected at the other site and calculate the own site second allowable processing delay amount using it together with the own site user reaction waiting time acquired at the own site. In this case, the other site acquires the own site user reaction time transmitted from the own site and calculates the other site second allowable processing delay amount using the other site user reaction waiting time acquired at the other site.
- The processing delay amount determination means may also determine the total amount within a range equal to or less than the minimum of the own site first allowable processing delay amount, which changes with the magnitude of the acquired other site user reaction time; the other site first allowable processing delay amount, which changes with the magnitude of the acquired own site user reaction time; the own site second allowable processing delay amount; and the other site second allowable processing delay amount.
- In this case, the total system processing delay amount is determined within a range equal to or less than the minimum of the own site first allowable processing delay amount, the other site first allowable processing delay amount, the own site second allowable processing delay amount, and the other site second allowable processing delay amount, each calculated as a reference value reflecting the state of the users. A decrease in conversation quality accompanying a decrease in comfort can therefore be suppressed more reliably.
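The minimum-of-four rule can be sketched as below. The text only states that each allowable amount "changes with" a reaction time or with a difference between a reaction waiting time and a reaction time; the concrete mappings used here (a proportional mapping with a `scale` factor, and a clamped difference) are hypothetical placeholders, as are all function names.

```python
def first_allowable_delay(peer_reaction_time_s: float, scale: float = 1.0) -> float:
    """Hypothetical mapping: allowable delay proportional to the peer's reaction time."""
    return scale * peer_reaction_time_s


def second_allowable_delay(waiting_time_s: float, peer_reaction_time_s: float) -> float:
    """Hypothetical mapping: slack between a user's reaction waiting time and
    the peer's reaction time, clamped at zero."""
    return max(0.0, waiting_time_s - peer_reaction_time_s)


def total_delay_upper_bound(own_reaction_s: float, other_reaction_s: float,
                            own_waiting_s: float, other_waiting_s: float) -> float:
    """Upper bound for the total system processing delay amount (seconds)."""
    candidates = [
        first_allowable_delay(other_reaction_s),                  # own site, first
        first_allowable_delay(own_reaction_s),                    # other site, first
        second_allowable_delay(own_waiting_s, other_reaction_s),  # own site, second
        second_allowable_delay(other_waiting_s, own_reaction_s),  # other site, second
    ]
    # The total must not exceed the smallest of the four allowable amounts.
    return min(candidates)
```

For a two-party conversation this yields a single bound that neither user's tolerance is exceeded by; in a real apparatus the mapping functions would be replaced by whatever relation the implementer adopts.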
- the voice transmitting / receiving apparatus includes a buffer for temporarily storing the conversation data before and after the transmission / reception, and the processing delay amount control means controls the buffer capacity of the buffer based on the determined total amount.
- the size of the buffer capacity has a one-to-one relationship with the size of the amount of time delay of conversation, and unconditionally increasing it is not allowed in terms of overall conversation quality. That is, the buffer capacity is appropriate as an actual control target of the processing delay amount control means when controlling the processing delay amount according to the embodiment.
- When the buffer capacity is taken as the control target, there is a degree of freedom as to how the buffer capacity provided in the voice transmission / reception system is controlled based on the determined total system processing delay amount.
- For example, when the buffer in each audio transmission / reception device includes a reception buffer that temporarily stores received data after reception and a transmission buffer that temporarily stores transmission data before transmission, the processing delay amount control means can determine the distribution ratio of the buffer capacity among them relatively freely, provided that the processing delay amount corresponding to the sum of the buffer capacities of the reception buffer and transmission buffer on the local site side and the reception buffer and transmission buffer on the other site side equals the total system processing delay amount.
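- One way to picture this freedom is a weighted split of the total delay across the four buffers. The sketch below is only an illustration: the buffer keys and the idea of expressing the distribution as relative weights are assumptions, not something the text prescribes.

```python
def distribute_buffer_delay(total_delay, weights):
    """Split a total delay budget across buffers in proportion to weights.

    weights: mapping of buffer name -> non-negative relative weight.
    The names used by callers (e.g. "own_tx", "own_rx", "other_tx",
    "other_rx") are hypothetical labels for the four buffers.
    """
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("at least one weight must be positive")
    # Each buffer receives its proportional share; the shares sum to the total.
    return {name: total_delay * w / weight_sum for name, w in weights.items()}
```

Whatever ratio is chosen, the shares always add up to the determined total, which is the only constraint stated above.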
- the processing delay amount control means controls a coding rate for coding the conversation data based on the determined total amount.
- Transmitting the conversation data over the network generally requires a conversion process such as encoding.
- the level of the encoding rate which means the encoding bit rate for encoding the input voice at the local site, corresponds to the level of the voice quality and the amount of time delay.
- this type of encoding rate is appropriate as an actual control target of the processing delay amount control means in controlling the processing delay amount according to the embodiment.
- the voice transmission / reception apparatus further includes a transmission state acquisition unit that acquires a transmission state of the network, wherein the processing delay amount determination unit determines the total amount based on the acquired conversation feature amount and the acquired transmission state.
- the total system processing delay amount can be determined more accurately.
- the transmission state acquisition means may acquire the transmission delay amount of the network as the transmission state of the network.
- the network transmission delay amount defines the conversation time delay in the same dimension as the processing delay amount. Therefore, it is suitable as a transmission state to be reflected in the determination of the total system processing delay amount. Note that the transmission delay amount of such a network preferably indicates RTT (Round Trip Time).
- the voice transmitting / receiving apparatus further includes statistical processing means for statistically processing the acquired conversation feature amount, and the processing delay amount determining means determines the total amount based on the statistically processed conversation feature amount.
- the statistical processing is performed on the acquired conversation feature by the statistical processing means. That is, the conversation feature value acquired over a past fixed or indefinite period is reflected in the conversation feature value to be reflected in the determination of the total system processing delay amount. For this reason, the reliability of the conversation feature amount can be improved, and the conversation quality can be stably maintained.
- the practical aspect of the statistical processing is not particularly limited, but may be a process of adding and averaging conversation feature amounts acquired over a certain period in the past as a suitable form. At this time, a measure such as excluding the apparent abnormal value from the sample may be taken.
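- A minimal sketch of such an averaging step is shown below. The window length and the rule for discarding abnormal values (anything above a multiple of the median) are assumptions chosen for illustration; the text leaves both open.

```python
from statistics import mean, median


def averaged_feature(samples, window=10, max_ratio=3.0):
    """Average the most recent samples after dropping apparent outliers.

    window and max_ratio are hypothetical tuning parameters.
    """
    recent = list(samples)[-window:]
    if not recent:
        raise ValueError("no samples")
    mid = median(recent)
    # Assumption: a sample more than max_ratio times the median is abnormal.
    kept = [s for s in recent if s <= max_ratio * mid] or recent
    return mean(kept)
```

With reaction times of 0.5, 0.6, 0.4 and 5.0 seconds, the 5.0 second sample is excluded and the result is the mean of the remaining three.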
- the voice transmitting / receiving apparatus further includes storage means for storing the acquired conversation feature quantity.
- According to this aspect, the acquired conversation feature amount is stored in storage means that can take various forms, such as an HDD (Hard Disk Drive), flash memory, FDD (Floppy (registered trademark) Disk Drive), DVD or BDD (Blu-ray Disc Drive). Therefore, the total system processing delay amount can be determined smoothly. In addition, since the initial value of the total system processing delay amount for the next similar conversation can be determined based on the stored conversation feature amount, it is also possible to shorten the time until the total system processing delay amount converges to its optimum value.
- An embodiment of the voice transmission / reception system of the present invention is a voice transmission / reception system that includes voice transmission / reception devices installed at a plurality of bases, each accommodated in a network, and that can establish a conversation between users present at the respective bases via the voice transmission / reception devices. Each voice transmission / reception device communicates with the voice transmission / reception device installed at another base among the plurality of bases, and includes processing delay amount determination means for determining the total amount, over the entire voice transmission / reception system, of the processing delay amount of the conversation data, which corresponds both to the magnitude of the time delay of the conversation and to the level of the voice quality of the conversation, and processing delay amount control means for controlling the processing delay amount so that the determined total amount is satisfied.
- Because the embodiment according to the voice transmission / reception system includes the voice transmission / reception device according to the above-described embodiment, it is possible to obtain the optimum conversation quality.
- An embodiment of the server device of the present invention is a server device in a voice transmission / reception system in which voice transmission / reception devices, each accommodated in a network, are installed at a plurality of bases, each device including communication means for transmitting / receiving, via the network, conversation data that represents the content of the conversation and includes at least voice data to and from the device installed at another base among the plurality of bases, so that a conversation can be established between users respectively present at the plurality of bases. The server device acquires, from the plurality of voice transmission / reception devices via the network, the conversation feature amounts of the users present at the respective local bases, and includes processing delay amount determining means for determining the total amount of the processing delay amount of the conversation data, which corresponds to the magnitude of the time delay amount of the conversation and to the level of the voice quality of the conversation.
- Because the embodiment according to the server apparatus includes the conversation feature amount acquisition unit and the processing delay amount determination unit according to the embodiment of the voice transmission / reception apparatus described above, it is possible to obtain the optimum conversation quality.
- When the server device is responsible for the conversation feature value acquisition process and the total system processing delay amount determination process in this way, the burden on the voice transmission / reception devices constituting the voice transmission / reception system can be remarkably reduced, which is beneficial. In this case, for example, a unit similar to the processing delay amount control unit in each voice transmitting / receiving device may control the processing delay amount so that the optimum value of the total system processing delay amount notified by the notification unit provided in the server device is satisfied.
- an appropriate command may be executed on the server device side to control the audio transmission / reception device.
- If the server device side also defines how the processing delay amount is distributed among the voice transmission / reception devices based on the total system processing delay amount, it is possible to further reduce the burden on the voice transmission / reception device side.
- Since the communication means, the conversation feature quantity acquisition means, the processing delay amount determination means, and the processing delay amount control means are provided, the optimum conversation quality can be obtained.
- According to the embodiment of the voice transmission / reception system of the present invention, since the embodiment of the voice transmission / reception apparatus of the present invention is provided, the optimum conversation quality can be obtained.
- Since the conversation feature quantity acquisition means, the processing delay amount determination means, and the notification means are provided, the optimum conversation quality can be obtained.
- FIG. 1 is a schematic configuration diagram conceptually showing the configuration of the remote conference system 1.
- The remote conference system 1 is an audio conference system, an example of the “audio transmission / reception system” according to the present invention, accommodated in a network 20, which is a wide area IP (Internet Protocol) network that connects a base X (i.e., an example of the “own base” according to the present invention) and a base Y (i.e., an example of the “other base” according to the present invention) that are separated from each other.
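- The remote conference system 1 includes the voice transmitting / receiving apparatus 10A, which is installed at the site X and used by the user A (that is, an example of the “own site user” according to the present invention) at the site X, and the voice transmitting / receiving apparatus 10B, which is installed at the site Y and used by the user B at the site Y.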
- the user A and the user B can perform a smooth audio conference by exchanging audio information by the remote conference system 1.
- FIG. 2 is a block diagram conceptually showing the configuration of the audio transmitting / receiving apparatus 10A.
- the same parts as those in FIG. 1 are denoted by the same reference numerals, and the description thereof is omitted as appropriate.
- the hardware configuration of the voice transmitting / receiving apparatus 10A is the same as that of the voice transmitting / receiving apparatus 10B.
- the voice transmission / reception device 10A includes a voice input unit 100, a voice output unit 200, a conversation feature amount detection unit 300, a conversation feature amount statistical processing unit 400, a storage device 500, a processing delay amount determination unit 600, an RTT measurement unit 700, a processing delay information communication unit 800, and a buffer control unit 900.
- the voice input unit 100 includes a voice input unit 110, an encoder 120, a transmission buffer 130, and a voice data transmission unit 140.
- the voice input unit 100 is a unit capable of transmitting the voice of the user A as voice data to the voice transmission / reception device 10B via the network 20.
- the voice input unit 110 is an input interface in which an input terminal (not shown) is connected to a microphone (not shown), and is configured to be able to capture the speech voice of the user A input via the microphone as an analog voice signal.
- the encoder 120 is a digital conversion device that encodes an analog audio signal input via the audio input unit 110 at a predetermined encoding rate (encoding bit rate) and converts the encoded audio signal into digital audio data.
- The encoding may conform to a standard such as MPEG (Moving Picture Experts Group).
- the transmission buffer 130 is a volatile storage device that temporarily accumulates digital audio data obtained through the encoder 120 by a data amount corresponding to a predetermined transmission buffer capacity Da_s.
- the audio data transmission unit 140 is a transmission interface whose output terminal (not shown) is connected to the network 20, and is configured to sequentially transmit the digital audio data sequentially output from the transmission buffer 130 to the voice transmitting / receiving device 10B via the network 20. That is, the audio data transmission unit 140 is an example of the “communication unit” according to the present invention, and the transmitted digital audio data is an example of the “conversation data” according to the present invention.
- the audio output unit 200 includes an audio data receiving unit 210, a reception buffer 220, a decoder 230, and an audio output unit 240, and is a unit that can output the voice of user B via an output device such as a speaker.
- the audio data receiving unit 210 is a reception interface whose terminal (not shown) is connected to the network 20, and is configured to be able to sequentially capture digital audio data corresponding to the user B's uttered voice transmitted via the network 20. That is, the voice data receiving unit 210 is another example of the “communication unit” according to the present invention, and the received digital voice data is another example of the “conversation data” according to the present invention.
- the reception buffer 220 is a volatile storage device that temporarily accumulates digital audio data obtained via the audio data reception unit 210 by a data amount corresponding to a predetermined reception buffer capacity Da_r.
- the decoder 230 is an analog conversion device that decodes digital audio data sequentially output from the reception buffer 220 and converts it into analog audio data.
- the function of the decoder 230 is paired with the encoding function of the encoder 120, and both are naturally configured to perform data conversion according to the same standard.
- the audio output unit 240 is an output interface in which an output terminal (not shown) is connected to a speaker (not shown), and is configured to be able to output user B's speech through the speaker.
- The conversation feature amount detecting unit 300 is an example of the “conversation feature amount acquisition means” according to the present invention, configured to be able to detect the user A reaction time R_A and the user A reaction waiting time T_A, described later, as conversation feature amounts of the user A.
- the conversation feature quantity statistical processing unit 400 is an example of the “statistic processing means” according to the present invention configured to be able to statistically process the conversation feature quantity appropriately detected by the conversation feature quantity detection unit 300.
- the conversation feature quantity statistical processing unit 400 is configured to be able to hold the detected conversation feature quantity with respect to a certain past sample, and is configured to perform an averaging process on the sample values of the held samples and output them.
- the storage device 500 is a nonvolatile storage device such as an HDD or a flash memory, and is an example of the “storage unit” according to the present invention.
- The storage area of the storage device 500 stores reaction time data DAT_R_A, which represents the average value of the user A reaction time R_A output via the conversation feature amount statistical processing unit 400, and reaction waiting time data DAT_T_A, which represents the average value of the user A reaction waiting time T_A.
- The processing delay amount determination unit 600 includes an allowable processing delay estimation unit 610 and a negotiation unit 620, and is a control apparatus, an example of the “processing delay amount determination means” according to the present invention, configured to determine the buffer capacity Da_s of the transmission buffer 130 and the buffer capacity Da_r of the reception buffer 220.
- the processing delay amount determination unit 600 is configured to be able to execute buffer capacity control, which will be described later, according to a control program stored in a ROM (Read Only Memory).
- The allowable processing delay estimation unit 610 is a processor that executes various types of processing for determining a maximum allowable processing delay amount dmax, described later, which is the maximum processing delay amount allowed for the remote conference system 1 and is an example of the “total amount” according to the present invention.
- the allowable processing delay estimation unit 610 includes a first calculation model and a second calculation model that are theoretically constructed in advance, and is configured to execute the processing based on these calculation models.
- the negotiation unit 620 is a processor that negotiates the distribution of the buffer capacity as an example of the “processing delay amount” according to the present invention with the voice transmitting / receiving apparatus 10B at the site Y.
- the negotiation unit 620 also determines control target values for the transmission buffer capacity Da_s and the reception buffer capacity Da_r through negotiation with the voice transmitting / receiving apparatus 10B.
- the RTT measurement unit 700 is a processor configured to be able to measure the RTT that is the transmission delay amount of the network 20.
- the RTT measuring unit 700 measures RTT using SR (Sender Report) or RR (Receiver Report) of RTCP (Real-time Transport Control Protocol).
- The RTT defined here is the time from when digital audio data is transmitted from the voice transmitting / receiving apparatus 10A via the audio data transmitting unit 140 until the corresponding response is received via the audio data receiving unit 210.
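- For orientation, the round-trip computation that RFC 3550 describes for RTCP report blocks can be sketched as follows. Whether the embodiment uses exactly this formula is an assumption; the text states only that SR or RR packets are used. The function and argument names are illustrative.

```python
def rtt_from_report_block(arrival_ntp32, lsr, dlsr):
    """Return the round-trip time in seconds from RTCP report-block fields.

    arrival_ntp32: arrival time of the report, middle 32 bits of NTP time
    lsr:           "last SR timestamp" field of the report block
    dlsr:          "delay since last SR" field of the report block
    All three values are in units of 1/65536 second (RFC 3550, 6.4.1).
    """
    # Subtract with 32-bit wrap-around, then convert to seconds.
    rtt_units = (arrival_ntp32 - lsr - dlsr) & 0xFFFFFFFF
    return rtt_units / 65536.0
```

Using the example values from RFC 3550 (arrival 0xb7108000, LSR 0xb7052000, DLSR 0x00054000), the function returns 6.125 seconds.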
- the processing delay information communication unit 800 is a communication interface whose output terminal is connected to the network 20 and performs transmission / reception of various data related to the above-described buffer amount target value formulation with the audio transmission / reception device 10B.
- The processing delay information communication unit 800 transmits the user A reaction time R_A and the processing delay amount d set based on each calculation model to the voice transmission / reception device 10B, and acquires the user B reaction time R_B and the processing delay amount d set based on each calculation model from the voice transmission / reception device 10B.
- the buffer control unit 900 is a processor as an example of the “processing delay amount control unit” according to the present invention configured to be able to variably control the buffer capacities of the transmission buffer 130 and the reception buffer 220.
- FIG. 3 is a flowchart of buffer capacity control.
- In the voice transmitting / receiving apparatus 10A, the above-described conversation data transmission / reception by the voice input unit 100 and the voice output unit 200 is executed as appropriate, separately from the buffer capacity control.
- In the buffer capacity control, a transmission process for the user A reaction time R_A and a reception process for the user B reaction time R_B are first executed (step S101).
- The user A reaction time R_A and the user B reaction time R_B are each an example of the “conversation feature quantity” according to the present invention; the former is an example of the “own base user reaction time” and the latter is an example of the “other base user reaction time”.
- In step S101, the reaction time data DAT_R_A is read from the storage device 500 and transmitted to the voice transmitting and receiving device 10B via the network 20 by the processing delay information communication unit 800.
- In addition, the reaction time data DAT_R_B corresponding to the user B reaction time R_B is acquired from the voice transmitting / receiving device 10B via the network 20 by the processing delay information communication unit 800 and sent to the allowable processing delay estimation unit 610.
- FIG. 4 is a timing chart for explaining the concept of reaction time.
- In FIG. 4, it is assumed that the user A and the user B are talking in an ideal environment where no significant time delay occurs in the mutual voice transmission.
- an ideal environment refers to, for example, an environment where conversations are made in a face-to-face state at the same site.
- the utterance of the user A is finished at an arbitrary time T0 (see the hatched portion of the user A).
- user B perceives the end of user A's utterance at time T0.
- The user B then starts an utterance at time T1, after a delay time unique to the user B (see the hatched portion of the user B).
- the delay time from the end time of one utterance to the start time of the other utterance is the reaction time.
- the reaction time varies in various ways according to the user's personality, taste, mental load state, physical load state, and other various circumstances that can be specifically changed each time.
- the user A reaction time R_A is the reaction time of the user A at the site X
- the user B reaction time R_B is the reaction time of the user B at the site Y.
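- The end time of one party's utterance is taken as the time at which that utterance finishes being output to the other party through a voice output means such as a speaker (i.e., “speech output” according to the present invention). This is because the conversation is not established unless the content of one party's utterance has been recognized by the other party when the utterance ends.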
- the reaction time is calculated by a reaction time calculation process executed by the conversation feature amount detection unit 300 in each voice transmitting / receiving device.
- the reaction time calculation process will be described with reference to FIG. 5.
- FIG. 5 is a flowchart of the reaction time calculation process.
- the reaction time calculation process in FIG. 5 is a process executed at the site X.
- In FIG. 5, it is first determined whether or not there is a voice output from the user B (step S201). If there is no voice output from user B (step S201: NO), the process returns to step S201, and a series of processes is repeated. When there is a voice output of user B (step S201: YES), the time at that time is updated as the final voice output time Top (step S202).
- When the final audio output time Top is updated, it is determined whether or not there is no audio output from the user B (step S203). If user B's voice output continues (step S203: NO), the process returns to step S202, and the update of the final voice output time Top is continued. On the other hand, if there is no voice output from user B (step S203: YES), it is determined whether there is a voice input from user A (step S204).
- The determination that there is no voice output from user B is made based on whether or not the length of the silent section exceeds a preset reference value, so that a pause within a series of utterances by the user B is not misjudged as the end of the utterance.
- When there is no voice input by the user A (step S204: NO), that is, while the user A is presumed to be thinking about a reply after the utterance output of the user B has ended, step S203 is repeatedly executed.
- When user A's voice input is started (utterance is started) (step S204: YES), a time value corresponding to the difference between the time T at that time and the final voice output time Top updated in step S202 is determined as the user A reaction time R_A (step S205). When the user A reaction time R_A is calculated, the process returns to step S201, and a series of processes is repeated.
- the reaction time calculation process is executed as described above.
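- The flow of steps S201 to S205 can be sketched as a loop over time-stamped activity flags. This is an illustrative reading of the flowchart, not the embodiment's code: the frame format is an assumption, and the flags are assumed to already reflect the silent-section check described above.

```python
def reaction_times(frames):
    """Collect reaction times of user A from a sequence of frames.

    frames: iterable of (time, b_output_active, a_input_active) tuples,
    a hypothetical representation of the voice activity at site X.
    """
    results = []
    top = None  # final voice output time Top; None until user B has spoken
    for t, b_out, a_in in frames:
        if b_out:
            # S201: YES / S203: NO -> keep updating Top (S202).
            top = t
        elif top is not None and a_in:
            # S203: YES and S204: YES -> reaction time = T - Top (S205).
            results.append(t - top)
            top = None  # return to S201 and wait for the next output
    return results
```

For instance, if user B's output is last active at 100 ms and user A's input starts at 400 ms, the function reports a reaction time of 300 ms.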
- The user B reaction time R_B is calculated in the same way in the voice transmitting and receiving device 10B.
- The reaction time R_A calculated by the reaction time calculation process is sent to the conversation feature amount statistical processing unit 400 each time it is calculated, and is subjected to statistical processing.
- the statistical process is an averaging process for the past predetermined samples. Note that the mode of the statistical processing is not limited to the averaging process.
- The reaction time R_A that has been subjected to statistical processing by the conversation feature amount statistical processing unit 400 is stored as the reaction time data DAT_R_A in the storage device 500.
- The reaction time data DAT_R_A is updated as appropriate each time the user A reaction time R_A is calculated and statistically processed.
- the reaction waiting time is a time value from the time when one user's utterance ends to the time when the one user starts speaking again.
- the reaction waiting time varies in various ways according to the user's personality, taste, mental load state, physical load state, and other various circumstances that can be specifically changed each time.
- the reaction waiting time is calculated by a reaction waiting time calculation process executed by the conversation feature amount detection unit 300 in each voice transmitting / receiving device.
- the reaction waiting time calculation process will be described with reference to FIG. 6.
- FIG. 6 is a flowchart of the reaction waiting time calculation process. Note that the reaction waiting time calculation process in FIG. 6 is a process executed at the site X.
- In FIG. 6, it is first determined whether or not there is a voice input from the user A (step S301). If there is no voice input by user A (step S301: NO), the process returns to step S301, and a series of processes is repeated. When there is a voice input of the user A (step S301: YES), the time at that time is updated as the final voice input time Tip (step S302).
- When the final voice input time Tip is updated, it is determined whether or not there is no voice output from the user B (step S303). If there is a voice output from user B (step S303: NO), the process returns to step S301 on the assumption that the response from user B was returned in a time shorter than the user A reaction waiting time T_A.
- If there is no voice output from user B (step S303: YES), it is determined whether there is a voice input from user A (step S304). If there is no voice input by user A (step S304: NO), the process returns to step S303. That is, the process is in a standby state until the voice input of the user A is resumed.
- If there is a voice input by the user A (step S304: YES), it is determined whether or not the time value corresponding to the difference between the current time T and the last voice input time Tip updated in step S302 is greater than the reference value T0 (step S305).
- The reference value T0 is a criterion for determining whether or not the voice input of the user A is merely the continuation of a single series of utterances.
- When the time value corresponding to T-Tip is equal to or less than the reference value T0 (step S305: NO), the process returns to step S301.
- When the time value corresponding to T-Tip is larger than the reference value T0 (step S305: YES), the time value corresponding to T-Tip is determined as the user A reaction waiting time T_A (step S306).
- the reaction waiting time calculation process is executed as described above.
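- Steps S301 to S306 can be sketched in the same style as a loop over time-stamped activity flags. The frame format and the parameter name `ref_gap` (standing in for the reference value T0) are assumptions, and restarting from step S301 after step S306 is also an assumption, since the text does not state where the flow goes next.

```python
def reaction_waiting_times(frames, ref_gap):
    """Collect reaction waiting times of user A.

    frames:  iterable of (time, a_input_active, b_output_active) tuples,
             a hypothetical representation of the activity at site X.
    ref_gap: reference value used in step S305 (called T0 in the text).
    """
    results = []
    tip = None  # final voice input time Tip; None while waiting in S301
    for t, a_in, b_out in frames:
        if tip is not None and b_out:
            # S303: NO -> user B answered in time; go back to S301.
            tip = None
        if a_in:
            if tip is not None and t - tip > ref_gap:
                # S305: YES -> T - Tip is the reaction waiting time (S306).
                results.append(t - tip)
            # S302: update Tip (also the assumed restart after S306).
            tip = t
    return results
```

If user A stops speaking at 100 ms, user B stays silent, and user A resumes at 1200 ms with a reference value of 500 ms, the function reports a reaction waiting time of 1100 ms.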
- The user B reaction waiting time T_B is calculated in the same way in the voice transmitting and receiving device 10B.
- The reaction waiting time T_A calculated by the reaction waiting time calculation process is sent to the conversation feature amount statistical processing unit 400 each time it is calculated, and is subjected to statistical processing.
- the statistical process is an averaging process for the past predetermined samples. Note that the mode of the statistical processing is not limited to the averaging process.
- The reaction waiting time T_A that has been subjected to statistical processing is stored as the reaction waiting time data DAT_T_A in the storage device 500.
- The reaction waiting time data DAT_T_A is updated as appropriate each time the user A reaction waiting time T_A is calculated and statistically processed.
- processing delay amount d is set based on the first calculation model and the second calculation model (step S103).
- the processing delay amount d is set for each of the first calculation model and the second calculation model.
- the processing delay amount d means the total processing delay amount of the remote conference system 1.
- FIG. 7 is a timing chart for explaining the concept of the first calculation model.
- the same reference numerals are given to the same portions as those in FIG. 4, and the description thereof will be omitted as appropriate.
- the processing delay amount d in the remote conference system 1 is defined by the following equation (1).
- $d = da + db + dproc$ (1)
- Here, da is a delay amount that has a one-to-one correspondence with the buffer capacity of the audio transmission / reception device 10A, and is the sum of the reception buffer delay amount da_r corresponding to the reception buffer capacity Da_r and the transmission buffer delay amount da_s corresponding to the transmission buffer capacity Da_s.
- $da = da_r + da_s$
- Likewise, db is a delay amount that corresponds to the buffer capacity of the audio transmission / reception device 10B on a one-to-one basis, and is the sum of the reception buffer delay amount db_r corresponding to the reception buffer capacity Db_r and the transmission buffer delay amount db_s corresponding to the transmission buffer capacity Db_s.
- $db = db_r + db_s$
- dproc is the processing delay of the encoder and decoder in each of the audio transmitting / receiving apparatuses 10A and 10B.
- Specifically, dproc is the sum of the encoder processing delay denca and the decoder processing delay ddeca in the voice transmitting / receiving apparatus 10A, and the encoder processing delay dencb and decoder processing delay ddecb in the voice transmitting and receiving apparatus 10B. Note that dproc has a constant value if the encoding rate in the encoder and decoder is constant.
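- As a small worked illustration of how these components add up to the processing delay amount d, consider the sketch below. The function name and the sample values are hypothetical; the argument names follow the symbols used in the text.

```python
def total_processing_delay(da_r, da_s, db_r, db_s, denca, ddeca, dencb, ddecb):
    """Sum buffer and codec delays into the system-wide processing delay d."""
    da = da_r + da_s                        # buffer delay at device 10A
    db = db_r + db_s                        # buffer delay at device 10B
    dproc = denca + ddeca + dencb + ddecb   # encoder and decoder delays
    return da + db + dproc
```

With 40 ms reception buffers, 20 ms transmission buffers and 5 ms for each encoder and decoder (illustrative numbers only), d comes to 140 ms.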
- In FIG. 7, it is assumed that the utterance of the user A ends at time T0.
- Unlike the normal conversation shown in FIG. 4, the time at which the end of the user A's utterance is recognized differs between the base X and the base Y.
- At the base Y, the end of the utterance of the user A is recognized at time T0′, after the transmission buffer delay amount da_s of the voice transmission / reception device 10A, the reception buffer delay amount db_r of the voice transmission / reception device 10B, the encoder processing delay denca of the voice transmission / reception device 10A, the decoder processing delay ddecb of the voice transmission / reception device 10B, and the one-way delay OWDa of the network 20 from the voice transmitting / receiving apparatus 10A to the voice transmitting / receiving apparatus 10B have elapsed from time T0, as expressed by the following equation (2).
- $T0' = T0 + da_s + db_r + denca + ddecb + OWDa$ (2)
- Note that the following equation (3) holds for the one-way delay OWDa of the network 20.
- $RTT = OWDa + OWDb$ (3)
- Here, OWDb is the one-way delay of the network 20 from the voice transmitting / receiving apparatus 10B to the voice transmitting / receiving apparatus 10A.
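- The user B starts an utterance at time T1′, after the user B reaction time R_B has elapsed from time T0′.
- The time at which the start of the user B's utterance is recognized also differs between the base X and the base Y. At the base X, the start of the user B's utterance is recognized at time T2, after the transmission buffer delay amount db_s of the voice transmitting / receiving apparatus 10B, the reception buffer delay amount da_r of the voice transmitting / receiving apparatus 10A, the encoder processing delay dencb of the voice transmitting / receiving apparatus 10B, the decoder processing delay ddeca of the voice transmitting / receiving apparatus 10A, and the one-way delay OWDb have elapsed from time T1′, as expressed by the following equation (4).
- $T2 = T0' + R_B + db_s + da_r + dencb + ddeca + OWDb$ (4)
- That is, even if the user B starts speaking when the user B reaction time R_B has elapsed after the end of the user A's utterance, just as in the normal conversation illustrated in FIG. 4, the user A recognizes the start of the user B's utterance only after the delay time TL defined by the following equation (5) has elapsed, owing to the influence of the remote conference system 1 and the network 20.
- $TL = T2 - T0 = R_B + d + RTT$ (5)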
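- The delay between the end of the user A's utterance and the moment the user A hears the user B's reply can be worked out by substituting equation (2) into equation (4) and grouping terms. The derivation below is a sketch that assumes the delay time TL is measured from time T0 and that the round-trip time is the sum of the two one-way delays, RTT = OWDa + OWDb.

```latex
\begin{align*}
TL = T2 - T0
  &= (da_s + db_r + denca + ddecb + OWDa) + R_B \\
  &\quad + (db_s + da_r + dencb + ddeca + OWDb) \\
  &= R_B + (da_r + da_s) + (db_r + db_s) \\
  &\quad + (denca + ddeca + dencb + ddecb) + (OWDa + OWDb) \\
  &= R_B + (da + db + dproc) + RTT \\
  &= R_B + d + RTT
\end{align*}
```

Under these assumptions, the extra delay the user A experiences beyond the user B reaction time is exactly the processing delay amount d plus the network round-trip time.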
- the index value Z takes a value larger than 1 and has a property of gradually approaching 1 as it approaches a normal conversation environment (an environment illustrated in FIG. 4). Further, as the index value Z increases, the user A feels that the conversation does not proceed smoothly, and may experience a decrease in comfort.
- The maximum value F is a fitted value determined such that, in the region above it, the decrease in conversation quality felt by user A can no longer be ignored.
- If the index value Z is equal to or less than the maximum value F, user A does not perceive any practically significant lack of smoothness in the conversation with user B.
- the above expression (8) is an expression that defines a range in which the processing delay amount d can be taken.
- the processing delay amount d in the above equation (8) is an example of the “own base first allowable processing delay amount” according to the present invention.
- The processing delay amount d shown in the above equation (8) is a value set for user A. For user B, a processing delay amount d is set by the following equation (9) through the same process in the voice transmitting/receiving apparatus 10B.
- the processing delay amount d in the following equation (9) is an example of the “other site first allowable processing delay amount” according to the present invention.
- FIG. 8 is a timing chart illustrating the concept of the second calculation model.
- the same parts as those in FIG. 7 are denoted by the same reference numerals, and the description thereof is omitted as appropriate.
- FIG. 7 illustrates a concept of user A reaction waiting time T A.
- The user A reaction waiting time T_A is the time from time T0, at which user A's utterance ends, to time T1′, at which user A starts speaking again.
- The start of user B's utterance is recognized by user A at time T2, after the delay time TL has elapsed from time T0.
- The second calculation model is a calculation model that takes the user A reaction waiting time T_A into account.
- Since it suffices for the delay time TL to be shorter than the user A reaction waiting time T_A, the following expression (10) holds.
- the above expression (11) is an expression that defines a range in which the processing delay amount d can be taken.
- the processing delay amount d in the equation (11) is an example of the “own base second allowable processing delay amount” according to the present invention.
- The processing delay amount d shown in the above equation (11) is a value set for user A. For user B, the processing delay amount d is calculated by the following equation (12) through the same process in the voice transmitting/receiving apparatus 10B.
- the processing delay amount d shown in the following expression (12) is an example of the “other site second allowable processing delay amount” according to the present invention.
- the maximum allowable processing delay amount dmax is determined (step S104).
- the maximum allowable processing delay amount dmax is a processing delay amount that can ensure the smoothness of conversation for both the user A and the user B. Therefore, it is necessary to compare the processing delay amounts set based on the first calculation model and the second calculation model. More specifically, the maximum allowable processing delay amount dmax needs to be a processing delay amount that satisfies all of the above formulas (8), (9), (11), and (12).
- In step S104, the processing delay amount d set for user B based on the first and second calculation models is acquired from the voice transmitting/receiving apparatus 10B via the processing delay information communication unit 800.
- the processing delay amount d set for the user A based on the first and second calculation models is transmitted to the voice transmitting / receiving apparatus 10B via the processing delay information communication unit 800. That is, the conditional expression relating to the determination of the maximum allowable processing delay amount dmax is shared between the voice transmitting / receiving apparatus 10A and the voice transmitting / receiving apparatus 10B.
- The processing delay amount d satisfying all of the above formulas (8), (9), (11), and (12) is then determined. Specifically, among these four formulas, the processing delay amount d defined by the conditional expression with the smallest right-hand term is the processing delay amount d that satisfies all four.
- the maximum allowable processing delay amount dmax is determined as the maximum value in a range that satisfies these four formulas.
- distribution processing is executed (step S105).
- the distribution process is a process of distributing the determined maximum allowable processing delay amount dmax to the buffer capacities of the voice transmitting / receiving apparatus 10A and the voice transmitting / receiving apparatus 10B. This distribution process is executed by the negotiation unit 620.
- The negotiation unit 620 confirms the load status with the voice transmitting/receiving apparatus 10B via the processing delay information communication unit 800. If there is no processing-load problem in either the voice transmitting/receiving apparatus 10B or the voice transmitting/receiving apparatus 10A, the negotiation unit 620 determines that the voice transmitting/receiving apparatus 10A bears 50% of the maximum allowable processing delay amount dmax, and transmits this determination to the voice transmitting/receiving apparatus 10B. As a result, if there is no processing-load problem, 50% of the maximum allowable processing delay amount dmax is normally borne on the voice transmitting/receiving apparatus 10A side.
- the negotiation unit 620 distributes the delay amount to be borne by the voice transmitting / receiving apparatus 10A between the transmission buffer 130 and the reception buffer 220.
- the burden rates of the transmission buffer 130 and the reception buffer 220 are also set equal here. That is, the processing delay amount related to the transmission buffer 130 is set to 25% of the maximum allowable processing delay amount dmax, and the processing delay amount related to the reception buffer 220 is also set to 25% of the maximum allowable processing delay amount dmax.
- the set distribution ratio is transmitted to the buffer control unit 900.
- the buffer control unit 900 to which the distribution ratio is transmitted controls the capacities of the transmission buffer 130 and the reception buffer 220 so that a processing delay amount corresponding to the set distribution ratio is obtained (step S106).
- the process returns to step S101, and a series of processes is repeated.
- the buffer capacity control is executed as described above.
- The maximum allowable processing delay amount dmax is determined from the processing delay amounts d set based on the first calculation model and the second calculation model, and the processing delay amount d of the remote conference system 1 is controlled to the maximum allowable processing delay amount dmax through control of the reception buffer capacity Da_r and the transmission buffer capacity Da_s.
- The first calculation model and the second calculation model are each constructed so as to reflect the conversation feature quantities of the users, and the determined maximum allowable processing delay amount dmax guarantees at least a minimum level of smoothness for both user A and user B, the parties to the conversation, in accordance with their actual circumstances.
- the maximum allowable processing delay amount dmax is the maximum value within a range in which this kind of comfort reduction can be prevented, and voice quality deterioration due to sound interruptions, coding errors, and the like is also prevented.
- the optimum conversation quality in accordance with the circumstances of the user participating in the conversation is provided.
- The users who participate in the conversation are user A and user B, but needless to say, the same concept can be applied to a meeting or conversation with three or more participants to provide optimum conversation quality in accordance with the users' circumstances.
- the audio conference is assumed. However, it is also possible to transmit and receive image data or video data together with audio data by arranging imaging means at each site. In this case, for example, it is possible to adopt a conference form called a TV conference or the like.
- The voice transmission/reception apparatus according to the present example is provided with a memory.
- The past conversation feature quantities can then be used as they are the next time.
- This is preferable because it suppresses variation in conversation quality during the initial operation of the conversation feature quantity acquisition unit 300 and the processing delay amount determination unit 600.
- the control of the processing delay amount is realized through the control of the buffer capacity.
- The encoding rate of the audio data (for example, the encoding bit rate of the encoder 120) is also a control amount correlated with the processing delay amount.
- As the encoding rate increases, the sound quality becomes relatively high, but the amount of transmitted data increases and the processing delay amount increases. Therefore, by controlling the encoding rate instead of or in addition to the buffer capacity described above, it is possible to obtain the same effect as described above.
- FIG. 9 is a schematic configuration diagram conceptually showing the configuration of the remote conference system 2 according to the second example of the present invention. In the figure, the same reference numerals are given to the same portions as those in FIG. 1, and the description thereof will be omitted as appropriate.
- the remote conference system 2 is different from the remote conference system 1 according to the first embodiment in that the remote conference system 2 includes a server device 30 accommodated in the network 20.
- the server device 30 is a computer system that mediates the voice transmission / reception devices 10A and 10B, and is an example of the “server device” according to the present invention.
- The server device 30 includes the processing delay amount determination unit 600 of the first embodiment, and all the calculation processes related to the processing delay amount d based on the first and second calculation models, the maximum allowable processing delay amount dmax, and the buffer capacity of each voice transmission/reception device are configured to be executed by the server device 30.
- the server device 30 includes a notification unit that notifies each voice transmission / reception device of the determined buffer capacity.
- the buffer control unit 900 performs this notification.
- Each transmission / reception buffer is controlled according to the buffer capacity.
- conversation quality that does not impair comfort is provided.
- relatively high-load processing such as various calculation processing based on the calculation model and buffer capacity distribution ratio determination processing is configured to be borne by the server device 30 instead of the voice transmission / reception device. This reduces the burden and enables smoother conversational operation control.
- the present invention can be applied to an apparatus or system that establishes a conversation between users at remote locations.
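Step S104 described above selects, among expressions (8), (9), (11), and (12), the one with the smallest right-hand term and takes it as the maximum allowable processing delay amount dmax. A minimal Python sketch of that selection follows. The class name, the function name, and the millisecond values are hypothetical and serve only as illustration; the actual right-hand sides come from the expressions themselves, which are not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DelayBound:
    """Upper bound on the processing delay amount d from one conditional expression."""
    label: str             # which expression produced the bound, e.g. "(8)"
    upper_bound_ms: float  # right-hand term of that expression, in milliseconds


def max_allowable_delay(bounds: List[DelayBound]) -> DelayBound:
    """Return the tightest bound, i.e. the largest d that satisfies every expression."""
    if not bounds:
        raise ValueError("at least one bound is required")
    return min(bounds, key=lambda b: b.upper_bound_ms)


# Hypothetical right-hand terms for expressions (8), (9), (11), and (12).
bounds = [
    DelayBound("(8)", 180.0),
    DelayBound("(9)", 240.0),
    DelayBound("(11)", 150.0),
    DelayBound("(12)", 210.0),
]
dmax = max_allowable_delay(bounds)
print(dmax.label, dmax.upper_bound_ms)  # prints: (11) 150.0
```

Taking the minimum of the upper bounds is what makes the result satisfy all four expressions at once while remaining the largest admissible value.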
Landscapes
- Engineering & Computer Science (AREA)
- Signal Processing (AREA)
- Multimedia (AREA)
- Computer Networks & Wireless Communication (AREA)
- Quality & Reliability (AREA)
- General Engineering & Computer Science (AREA)
- Telephonic Communication Services (AREA)
Abstract
Description
The present invention relates to the technical field of voice transmission/reception apparatuses, voice transmission/reception systems, and server apparatuses that enable conversations to proceed smoothly, for example between users at remote locations.
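As a device with this kind of purpose, Patent Document 1, for example, proposes a real-time voice reproduction apparatus. According to this apparatus, overall conversation quality can reportedly be ensured by keeping voice loss within the maximum voice loss rate at which a conversation can be sustained, and by keeping voice delay within the maximum delay time at which a conversation can be sustained.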
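Further, Patent Document 2 discloses a configuration including a buffer monitoring unit and a buffer control unit for controlling, during silent sections, a buffer that absorbs delay fluctuation, based on the number of buffer underflows that occurred in voiced sections and the like.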
In general, a longer delay time can suppress sound interruptions and errors, and can therefore improve the voice quality of a conversation. On the other hand, if the delay time is long, the time from the end of one user's own utterance until that user recognizes the start of the other party's utterance becomes long, so the smooth progress of the conversation is easily hindered. Therefore, the technical idea of keeping the delay time within the maximum delay time at which a conversation can be sustained, as in the apparatus of Patent Document 1, is useful in that it can find a compromise between voice quality and smooth progress.
The maximum delay time at which a conversation can be sustained, however, can vary in many ways depending on, for example, the personality or the physical or mental load state of each user participating in the conversation, or the individual circumstances of the user at that time. For example, the maximum delay time is short for an impatient user and long for an easygoing user, short for an irritated user and long for a relaxed user, and may be longer or shorter for an unwell user depending on the user's condition. Apart from that, it naturally becomes shorter if the user is in a hurry for some reason.
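When the apparatus disclosed in Patent Document 1 is applied, the maximum delay time at which a conversation can be sustained must be set in advance. However, it is extremely difficult in practice to fix in advance, whether experimentally, empirically, or theoretically, a maximum delay time that can vary in this way. Therefore, the maximum delay time described in Patent Document 1 is not necessarily the true maximum delay time for sustaining a conversation. For this reason, even if the apparatus of Patent Document 1 is applied to a conversation, comfort may be reduced for some users. Conversely, the delay time may be too short, causing an unnecessary reduction in voice quality. That is, the apparatus of Patent Document 1 has the technical problem that the delay time is difficult to optimize because the apparatus has no means of adapting to the circumstances of the users.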
Such a problem arises just the same even if buffer control aimed at reducing the voice packet discard rate by absorbing delay fluctuation can be performed, as in the apparatus of Patent Document 2.
The present invention has been made in view of such problems, and its object is to provide a voice transmission/reception apparatus, a voice transmission/reception system, and a server apparatus capable of providing optimal conversation quality suited to the circumstances of the users participating in a conversation.
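In order to solve the above-described problem, the voice transmission/reception apparatus according to claim 1 is a voice transmission/reception apparatus in a voice transmission/reception system that includes voice transmission/reception apparatuses installed at a plurality of bases, each accommodated in a network, and that can establish a conversation between users respectively present at the plurality of bases via the voice transmission/reception apparatuses. The apparatus is characterized by comprising: communication means for transmitting and receiving, via the network, conversation data that represents the content of the conversation and includes at least voice data, to and from the voice transmission/reception apparatus installed at another base, other than the own base, among the plurality of bases; conversation feature amount acquisition means for acquiring, for each of an own-base user present at the own base and an other-base user present at the other base, a predetermined conversation feature amount related to the timing of utterances in the conversation; processing delay amount determining means for determining, based on the acquired conversation feature amounts, the total amount, over the entire voice transmission/reception system, of a processing delay amount of the conversation data whose magnitude corresponds both to the magnitude of the time delay amount of the conversation and to the level of the voice quality of the conversation; and processing delay amount control means for controlling the processing delay amount so that the determined total amount is satisfied.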
In order to solve the above-described problem, the voice transmission/reception system according to claim 14 is a voice transmission/reception system that includes voice transmission/reception apparatuses installed at a plurality of bases, each accommodated in a network, and that can establish a conversation between users respectively present at the plurality of bases via the voice transmission/reception apparatuses. Each voice transmission/reception apparatus is characterized by comprising: communication means for transmitting and receiving, via the network, conversation data that represents the content of the conversation and includes at least voice data, to and from the voice transmission/reception apparatus installed at another base, other than the own base, among the plurality of bases; conversation feature amount acquisition means for acquiring, for each of an own-base user present at the own base and an other-base user present at the other base, a predetermined conversation feature amount related to the timing of utterances in the conversation; processing delay amount determining means for determining, based on the acquired conversation feature amounts, the total amount, over the entire voice transmission/reception system, of a processing delay amount of the conversation data whose magnitude corresponds both to the magnitude of the time delay amount of the conversation and to the level of the voice quality of the conversation; and processing delay amount control means for controlling the processing delay amount so that the determined total amount is satisfied.
In order to solve the above-described problem, the server apparatus according to claim 15 is a server apparatus in a voice transmission/reception system that includes the server apparatus and voice transmission/reception apparatuses, each accommodated in a network. The voice transmission/reception apparatuses are installed at a plurality of bases, and each comprises communication means for transmitting and receiving, via the network, conversation data that represents the content of a conversation and includes at least voice data, to and from the voice transmission/reception apparatus installed at another base, other than the own base, among the plurality of bases. The system can establish the conversation between users respectively present at the plurality of bases via the voice transmission/reception apparatuses. The server apparatus is characterized by comprising: conversation feature amount acquisition means for acquiring, from the plurality of voice transmission/reception apparatuses via the network, a predetermined conversation feature amount related to the timing of utterances in the conversation for each of an own-base user present at the own base and an other-base user present at the other base; processing delay amount determining means for determining, based on the acquired conversation feature amounts, the total amount, over the entire voice transmission/reception system, of a processing delay amount of the conversation data whose magnitude corresponds both to the magnitude of the time delay amount of the conversation and to the level of the voice quality of the conversation; and notification means for notifying the plurality of voice transmission/reception apparatuses, via the network, of the determined total amount.
<Embodiment of Voice Transmission/Reception Apparatus>
An embodiment of the voice transmission/reception apparatus of the present invention is a voice transmission/reception apparatus in a voice transmission/reception system that includes voice transmission/reception apparatuses installed at a plurality of bases, each accommodated in a network, and that can establish a conversation between users respectively present at the plurality of bases via the voice transmission/reception apparatuses. The apparatus comprises: communication means for transmitting and receiving, via the network, conversation data that represents the content of the conversation and includes at least voice data, to and from the voice transmission/reception apparatus installed at another base, other than the own base, among the plurality of bases; conversation feature amount acquisition means for acquiring, for each of an own-base user present at the own base and an other-base user present at the other base, a predetermined conversation feature amount related to the timing of utterances in the conversation; processing delay amount determining means for determining, based on the acquired conversation feature amounts, the total amount, over the entire voice transmission/reception system, of a processing delay amount of the conversation data whose magnitude corresponds both to the magnitude of the time delay amount of the conversation and to the level of the voice quality of the conversation; and processing delay amount control means for controlling the processing delay amount so that the determined total amount is satisfied.
The voice transmission/reception apparatus according to the embodiment is one of the apparatuses that make up a voice transmission/reception system, that is, a system capable of establishing conversations between users present at different bases. The apparatus according to the embodiment is accommodated in the network either at all times or only when some condition is satisfied (this kind of condition is not limited in any way).
The “network” according to the embodiment is a concept encompassing various data communication networks, such as a WAN (Wide Area Network), a LAN (Local Area Network), or the Internet connected as appropriate via such a WAN or LAN, or via a telephone line, ADSL (Asymmetric Digital Subscriber Line), an optical fiber cable, or the like.
The voice transmission/reception apparatus according to the embodiment has communication means, which enables it to transmit and receive conversation data via the network to and from the voice transmission/reception apparatus installed at another base different from the own base (that is, the base of the other party to the conversation). The conversation data and other information, data, or data files may be exchanged with the apparatus installed at the other base through an appropriate server apparatus, or without a server apparatus, as in P2P (Peer to Peer). Consequently, the components of the voice transmission/reception system may also take various forms.
Note that the own base means the base where the voice transmission/reception apparatus according to the embodiment is installed; it does not refer restrictively to one particular base.
The “conversation data” in the embodiment is a concept encompassing data that is necessary for, or can give meaning to, establishing a conversation between users at different bases, and is defined as including at least voice data. The conversation data may additionally include, for example, image data or video data. “Conversation” means an act of communication between users that involves voice, and it arises in a wide variety of situations. For example, “conversation” in the embodiment suitably includes, in addition to everyday conversation, the utterances of the participants in a meeting, a conference, or the like.
Of this conversation data, the data transmitted to the voice transmission/reception apparatus installed at the other base may be, for example, data obtained by encoding, with an encoder or the like, the analog voice of the own-base user collected through sound collecting means such as a microphone. This data is received by the apparatus installed at the other base (where it becomes received data), decoded by a decoder or the like, and finally presented to the other-base user as analog voice through output means such as a speaker. The same applies when the speaking side and the listening side are exchanged. In the apparatus according to the embodiment, repeating this transmission and reception of conversation data makes it possible to establish a conversation between users at different bases.
The voice transmission/reception apparatus according to the embodiment comprises processing delay amount determining means for determining the total amount, over the entire voice transmission/reception system, of the processing delay amount of the conversation data (hereinafter referred to as the “system total processing delay amount” as appropriate), and processing delay amount control means for controlling the processing delay amount so that the determined system total processing delay amount is satisfied.
Here, the “processing delay amount of the conversation data” is the amount of delay that arises when the voice transmission/reception apparatus at each of the plurality of bases processes conversation data. In particular, it is a controllable amount of delay whose magnitude corresponds both to the magnitude of the time delay of the conversation between the own-base user and the other-base user and to the level of the voice quality of that conversation. The system total processing delay amount is the sum of the processing delay amounts on the speaking side and the listening side. Therefore, to obtain high voice quality with few sound interruptions and coding errors, for example, a larger system total processing delay amount is better.
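How the processing delay feeds into the conversational time delay can be made explicit with the symbols of equations (2) and (4) quoted earlier, written here with subscripts (for example $d_{a\_s}$ for da_s). Equation (5) itself is not reproduced in this text, so the last line below is a reconstruction rather than the patent's own formula. It assumes that the delay time TL is measured from time T0 to time T2, as the timing-chart description states, and it uses the definition of dproc as the sum of the four encoder and decoder delays.

```latex
\begin{align*}
T_0' &= T_0 + d_{a\_s} + d_{b\_r} + d_{\mathrm{enca}} + d_{\mathrm{decb}} + \mathrm{OWD}_a && \text{(2)}\\
T_2  &= T_0' + R_B + d_{b\_s} + d_{a\_r} + d_{\mathrm{encb}} + d_{\mathrm{deca}} + \mathrm{OWD}_b && \text{(4)}\\
T_L  &= T_2 - T_0\\
     &= R_B + \left(d_{a\_s} + d_{a\_r} + d_{b\_s} + d_{b\_r}\right) + d_{\mathrm{proc}} + \mathrm{OWD}_a + \mathrm{OWD}_b,
\qquad d_{\mathrm{proc}} = d_{\mathrm{enca}} + d_{\mathrm{deca}} + d_{\mathrm{encb}} + d_{\mathrm{decb}}
\end{align*}
```

Under these assumptions, the four buffer delays are the only freely controllable terms, which is consistent with the buffer capacities being the quantities that are adjusted.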
The time delay of a conversation, however, affects conversation quality along a different dimension from voice quality. More specifically, if the time delay of the conversation becomes too large, it becomes difficult for the users to share the same time axis, and the real-time nature of the conversation declines. This decline itself reduces comfort and causes discomfort to the users. It also invites a secondary decline in conversation quality through reduced smoothness, such as a user realizing that the other party is speaking only after starting to speak, and it easily leads to a kind of vicious circle.
Thus, in a voice transmission/reception system, the voice quality and the comfort of a conversation are in a trade-off relationship. It is therefore necessary to optimize the system total processing delay amount, which affects both.
The conversation data according to the embodiment may include image data or video data for identifying a user's facial expression. However, transmitting and receiving such data is itself a high-load process, and attempting to output voice and image or video in synchronization increases the time delay of the conversation. If, on the other hand, they are controlled independently, the synchronization between facial expression and voice breaks down, so the effect of suppressing the decline in comfort is limited. That is, the problem described above is not one that can be resolved by also using images or video.
Therefore, in the voice transmission/reception apparatus according to the embodiment, the conversation feature amount acquisition means acquires a predetermined conversation feature amount for each of the own-base user present at the own base and the other-base user present at the other base. The processing delay amount determining means is configured to determine the system total processing delay amount based on the acquired conversation feature amounts.
The conversation feature amount means a physical quantity, a control quantity, or any of various normalized index values related to the timing of utterances in a conversation. In one preferred form, the timing of utterances in a conversation is broadly divided into the timing at which a user starts speaking after the other party finishes speaking, and the timing at which a user, having finished speaking, judges that the other party is not responding and starts speaking again. Both are important elements that define the basic rhythm of a conversation, and they can have a significant relationship with the optimal system total processing delay amount. For example, the optimal system total processing delay amount can be said to be small for a user whose utterance timing is early and large for a user whose utterance timing is late.
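The two utterance timings described above can be illustrated with a small Python sketch that estimates them from a time-ordered utterance log. Everything here is an assumption made for illustration: the `Utterance` record, the function name, the use of a simple mean, and the premise that all timestamps share a common clock with no compensation for transmission delay. The source does not specify how the feature amounts are actually measured.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Utterance:
    speaker: str  # e.g. "A" or "B"
    start: float  # utterance start time in seconds
    end: float    # utterance end time in seconds


def conversation_features(log: List[Utterance], user: str) -> Dict[str, Optional[float]]:
    """Estimate two timing features for `user` from consecutive utterance pairs."""
    reaction: List[float] = []
    waiting: List[float] = []
    for prev, cur in zip(log, log[1:]):
        gap = cur.start - prev.end
        if gap < 0 or cur.speaker != user:
            continue  # overlapping speech and other speakers are skipped in this sketch
        if prev.speaker != user:
            reaction.append(gap)  # user starts speaking after the other party finishes
        else:
            waiting.append(gap)   # user got no response and starts speaking again
    return {
        "reaction_time": mean(reaction) if reaction else None,
        "reaction_waiting_time": mean(waiting) if waiting else None,
    }


# Hypothetical log: A speaks, B answers, A answers, then A speaks again unprompted.
log = [
    Utterance("A", 0.0, 2.0),
    Utterance("B", 2.6, 4.0),
    Utterance("A", 4.4, 5.0),
    Utterance("A", 6.5, 7.0),
]
features_a = conversation_features(log, "A")
features_b = conversation_features(log, "B")
```

In this hypothetical log, user A reacts about 0.4 s after user B finishes and waits about 1.5 s before speaking again, while user B has no measured waiting time, so that entry is `None`.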
On the other hand, the timing of utterances in a conversation varies widely with each user's personality and preferences. Even for the same user, it can change in many ways depending on the physical or mental load state or the time available at that moment, or on various other individual circumstances. The conversation feature amount acquisition means makes it possible to keep track of such variable utterance timing continuously and with high accuracy.
When the total system processing delay amount has been determined in this way on the basis of conversation feature quantities that reflect the user's state, the processing delay amount control means controls the processing delay amount of the conversation data so that the determined total system processing delay amount is satisfied. The operation of the processing delay amount control means may also include the process of determining the processing delay amount to be borne by each site.
Here, the total system processing delay amount according to the embodiment is the total processing delay amount to be secured by the system as a whole, which consists of the voice transmitting/receiving device on the speaking side and the voice transmitting/receiving device on the listening side. In theory, as long as the total system processing delay amount does not change, the rhythm and tempo of the conversation do not change, so the conversation quality in terms of comfort does not change. In view of this, as long as the total system processing delay amount does not change, there may be a corresponding degree of freedom in how the processing delay amount is distributed between the speaking side and the listening side (that is, in the share of the processing delay borne by each). In such a case, the processing delay amount at each site is not uniquely determined by the total system processing delay amount.
However, maintaining the total system processing delay amount at the determined value requires cooperation with the voice transmitting/receiving system installed at the other site. This kind of cooperation may be achieved by notifying the device at the other site, via the network, of the required value that necessarily follows from the determined total system processing delay amount and the control value of the processing delay amount at the own site. Alternatively, it may be achieved by sharing the determined total system processing delay amount with the device at the other site and determining the distribution ratio through negotiation between the two according to a predefined algorithm. For example, the processing delay amount at each site may be an equal share of the total system processing delay amount.
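The two coordination schemes described in this paragraph can be sketched as follows. This is a minimal illustration, not the method of the embodiment; the function names `requested_remote_delay` and `split_equally` and the use of milliseconds are assumptions introduced here.

```python
def requested_remote_delay(total_ms: float, local_ms: float) -> float:
    """Required value to announce to the other site (hypothetical helper).

    It follows from the determined system total and the delay that the
    own site has decided to bear.
    """
    if not 0.0 <= local_ms <= total_ms:
        raise ValueError("local delay must lie between 0 and the system total")
    return total_ms - local_ms


def split_equally(total_ms: float, n_sites: int = 2) -> list[float]:
    """Equal division of the system total among the sites (hypothetical helper)."""
    if n_sites < 1:
        raise ValueError("at least one site is required")
    return [total_ms / n_sites] * n_sites


remote_ms = requested_remote_delay(200.0, 80.0)
shares = split_equally(200.0)
```

In the first scheme the own site fixes its share and announces the remainder; in the second both sites derive their shares from the shared total.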
Thus, the voice transmitting/receiving device according to the embodiment can always balance the voice quality of the conversation against its comfort in accordance with the user's state at each moment. In other words, optimum conversation quality is provided.
As a supplementary note, when the processing delay amount or the total system processing delay amount is determined in advance experimentally, empirically, theoretically, or on the basis of simulation or the like, the user's circumstances at the time of the conversation are in effect not taken into account at all. Some accommodation of a user's basic personality, which can be estimated in advance, may be possible, but there is almost no adaptability to uncertain factors or disturbances. The total system processing delay amount therefore tends to deviate from the true optimum value, which varies from user to user and, even for the same user, from moment to moment. As a result, situations arise with a frequency that is by no means low in which the total system processing delay amount is suppressed unnecessarily even though it could be increased to improve voice quality, or conversely in which the total system processing delay amount is so large that the tempo and rhythm of the conversation depart from the user's own tempo and rhythm and the user's comfort is reduced.
It is also not unusual for the users taking part in a conversation to differ from one another in personality, preferences, and circumstances. Even in such a case, the voice transmitting/receiving device according to the embodiment sets the total system processing delay amount on the basis of the conversation feature quantities acquired for each of the own-site user and the other-site user, so that the largest total system processing delay amount that causes discomfort to neither party can be set. This is extremely useful in practice.
Note that "acquisition" by the conversation feature quantity acquisition means refers to finally fixing a value as a reference value for control, and the process by which this is done is not limited in any way. That is, the conversation feature quantity acquisition means may acquire the conversation feature quantity from outside, for example via the network, or may acquire it through internal processing such as calculation, derivation, estimation, identification, or selection. The process for acquiring the conversation feature quantity of the own-site user may also differ from that for the other-site user.
The conversation feature quantity acquisition means preferably repeats the acquisition of the conversation feature quantity each time an utterance occurs, or at a fixed or variable period. This is more effective because the latest circumstances of the user can be reflected. In addition, applying statistical processing or the like based on previously obtained conversation feature quantities is effective because it prevents abrupt changes in the conversation feature quantity and improves the accuracy of its estimation.
The conversation feature quantity acquisition means, the processing delay amount determination means, and the processing delay amount control means provided in the voice transmitting/receiving device according to the embodiment may each, or as a whole, take various forms, such as various arithmetic processing units, for example a CPU (Central Processing Unit) or an MPU (Micro Processing Unit), various processors, controllers, or various functional modules.
In one aspect of the embodiment of the voice transmitting/receiving device of the present invention, the conversation feature quantity acquisition means acquires, as the conversation feature quantity, at least one of an own-site user reaction time, which is the time from the point at which output of the other-site user's utterance ends at the own site to the point at which the own-site user starts speaking, and an other-site user reaction time, which is the time from the point at which output of the own-site user's utterance ends at the other site to the point at which the other-site user starts speaking.
According to this aspect, the own-site user reaction time, the other-site user reaction time, or both are acquired as the conversation feature quantity.
Here, "reaction time" is the time from the point at which output of one party's utterance ends to the point at which the other party starts speaking, and is a conversation feature quantity that correlates with the "timing at which a user starts speaking after the other party has finished speaking" described above. "Utterance output" means output through an output means such as a speaker, and its start and end points are later in time than the start and end points of the other party's actual act of speaking. That is, the reaction time is each user's pure reaction time, free of the influence of the processing delay described above and of various other delays, and is a time that defines the basic tempo and rhythm of conversation for each individual user. It is therefore suitable as a reference value for determining the optimum value of the total system processing delay amount.
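As an illustration of this definition, the following sketch derives own-site user reaction times from a timestamped event log. The `Event` type, the event names, and the pairing rule (each end of remote utterance output is matched with the next local speech start) are assumptions made for the example; the document does not specify how the time points are detected.

```python
from dataclasses import dataclass


@dataclass
class Event:
    t: float   # time in seconds on the own-site clock
    kind: str  # "remote_output_end" or "local_speech_start" (hypothetical names)


def reaction_times(events: list[Event]) -> list[float]:
    """Time from the end of remote utterance output to the next local speech start."""
    times: list[float] = []
    last_output_end = None
    for e in sorted(events, key=lambda e: e.t):
        if e.kind == "remote_output_end":
            last_output_end = e.t
        elif e.kind == "local_speech_start" and last_output_end is not None:
            times.append(e.t - last_output_end)
            last_output_end = None  # each output end is used at most once
    return times


log = [
    Event(1.0, "remote_output_end"),
    Event(1.5, "local_speech_start"),
    Event(4.0, "remote_output_end"),
    Event(4.25, "local_speech_start"),
]
measured = reaction_times(log)
```

Because both time points are taken at the own site, after the remote audio has actually been played out, the measured interval does not include the processing or network delay.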
Since the reaction time is thus a time value specific to each user, it is only natural for the own-site user reaction time and the other-site user reaction time to differ. Whether they are nearly equal or very different, however, determining the total system processing delay amount on the basis of at least one of them, and more preferably both, makes it possible to provide optimum conversation quality in which at least the necessary level of comfort is ensured.
In one aspect of the embodiment of the voice transmitting/receiving device of the present invention in which the other-site user reaction time and the own-site user reaction time are acquired, the processing delay amount determination means determines the total amount within a range not exceeding the smallest of an own-site first allowable processing delay amount, which becomes larger or smaller as the acquired other-site user reaction time becomes larger or smaller, and an other-site first allowable processing delay amount, which becomes larger or smaller as the acquired own-site user reaction time becomes larger or smaller.
According to this aspect, the total system processing delay amount is determined within a range not exceeding the smallest of the own-site first allowable processing delay amount and the other-site first allowable processing delay amount (in a conversation between two parties, the smaller of the two). Conversation quality that causes discomfort to neither the own-site user nor the other-site user can therefore be provided.
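The rule of this aspect can be sketched as follows. The document only requires that each first allowable processing delay amount grow and shrink with the corresponding reaction time; the linear mapping and the `gain` coefficient below are assumptions chosen for illustration, as are the function names.

```python
def first_allowable_delay(reaction_time_ms: float, gain: float = 0.5) -> float:
    """Monotonically increasing mapping from reaction time to allowable delay.

    The linear form and the default gain are hypothetical.
    """
    return gain * reaction_time_ms


def total_delay_upper_bound(own_reaction_ms: float,
                            other_reaction_ms: float,
                            gain: float = 0.5) -> float:
    # Own-site first allowable delay follows the OTHER-site user's reaction time.
    own_first = first_allowable_delay(other_reaction_ms, gain)
    # Other-site first allowable delay follows the OWN-site user's reaction time.
    other_first = first_allowable_delay(own_reaction_ms, gain)
    # The system total is chosen at or below the smaller of the two.
    return min(own_first, other_first)


bound = total_delay_upper_bound(own_reaction_ms=400.0, other_reaction_ms=600.0)
```

With these example values the faster responder (400 ms) sets the bound, so the slower responder's larger tolerance is not used.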
The own-site first allowable processing delay amount and the other-site first allowable processing delay amount may be calculated in any component of the voice transmitting/receiving system. That is, as long as the total system processing delay amount is ultimately determined within a range not exceeding the smallest of these values, not all of these calculations need to be performed by the voice transmitting/receiving device according to the embodiment. For example, the voice transmitting/receiving device according to the embodiment may merely acquire, via the network, the other-site user reaction time detected at the other site and calculate the own-site first allowable processing delay amount. In this case, the other site acquires the own-site user reaction time transmitted from the own site and calculates the other-site first allowable processing delay amount. Distributing the processing across the devices in this way prevents the load from concentrating on a single voice transmitting/receiving device.
In another aspect of the embodiment of the voice transmitting/receiving device of the present invention in which the other-site user reaction time and the own-site user reaction time are acquired, the communication means transmits own-site user reaction time data corresponding to the acquired own-site user reaction time, via the network, to the voice transmitting/receiving device installed at the other site.
According to this aspect, the acquired own-site user reaction time is transmitted to the other site as own-site user reaction time data. Of the processes for calculating the various reference values that are ultimately used to determine the total system processing delay amount, the part based on the own-site user reaction time can therefore be entrusted to the device at the other site, and the processing load can be distributed.
In another aspect of the embodiment of the voice transmitting/receiving device of the present invention in which the other-site user reaction time and the own-site user reaction time are acquired, the conversation feature quantity acquisition means further acquires, as the conversation feature quantity, an own-site user reaction waiting time, which is the time from the point at which the own-site user finishes speaking at the own site to the point at which the own-site user starts speaking again.
According to this aspect, the own-site user reaction waiting time is acquired as a conversation feature quantity.
Here, "reaction waiting time" is the time from the point at which one party's utterance output ends to the point at which that same party starts speaking again, and is a conversation feature quantity that correlates with the "timing at which a user starts speaking again after finishing an utterance and judging that the other party has not responded" described above. That is, the reaction waiting time reflects each user's personality, preferences, and circumstances, and is a time that defines the basic tempo and rhythm of conversation for each individual user. It is therefore suitable as a reference value for determining the optimum value of the total system processing delay amount.
In this aspect, the total amount may be determined within a range not exceeding the smallest of an own-site second allowable processing delay amount, which becomes larger or smaller as the difference between the acquired own-site user reaction waiting time and the acquired other-site user reaction time becomes larger or smaller, and an other-site second allowable processing delay amount, which becomes larger or smaller as the difference between an other-site user reaction waiting time, acquired as the time from the point at which the other-site user finishes speaking at the other site to the point at which the other-site user starts speaking again, and the acquired own-site user reaction time becomes larger or smaller.
According to this aspect, the total system processing delay amount is determined within a range not exceeding the smallest of the own-site second allowable processing delay amount and the other-site second allowable processing delay amount (in a conversation between two parties, the smaller of the two). Conversation quality that causes discomfort to neither the own-site user nor the other-site user can therefore be provided.
In particular, the second allowable processing delay amount, which becomes larger or smaller as the difference between the reaction waiting time and the reaction time becomes larger or smaller, can be an extremely useful reference value in practice for avoiding the double-talk state that arises when one user, wrongly judging that the other user has not responded to his or her utterance, speaks again.
The own-site second allowable processing delay amount and the other-site second allowable processing delay amount may be calculated in any component of the voice transmitting/receiving system. That is, as long as the total system processing delay amount is ultimately determined within a range not exceeding the smallest of these values, not all of these calculations need to be performed by the voice transmitting/receiving device according to the embodiment. For example, the voice transmitting/receiving device according to the embodiment may merely acquire, via the network, the other-site user reaction time detected at the other site and calculate the own-site second allowable processing delay amount using the own-site user reaction waiting time acquired at the own site. In this case, the other site acquires the own-site user reaction time transmitted from the own site and calculates the other-site second allowable processing delay amount using the other-site user reaction waiting time acquired at the other site. Distributing the processing across the devices in this way prevents the load from concentrating on a single voice transmitting/receiving device.
Furthermore, in this case, the processing delay amount determination means may determine the total amount within a range not exceeding the smallest of the own-site first allowable processing delay amount, which becomes larger or smaller as the acquired other-site user reaction time becomes larger or smaller, the other-site first allowable processing delay amount, which becomes larger or smaller as the acquired own-site user reaction time becomes larger or smaller, the own-site second allowable processing delay amount, and the other-site second allowable processing delay amount.
According to this aspect, the total system processing delay amount is determined within a range not exceeding the smallest of the own-site first allowable processing delay amount, the other-site first allowable processing delay amount, the own-site second allowable processing delay amount, and the other-site second allowable processing delay amount, each of which is calculated as a reference value reflecting the user's state. A decline in conversation quality resulting from reduced comfort can therefore be suppressed more reliably.
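The combined rule, taking the smallest of the four allowable processing delay amounts, might look like the sketch below. The linear mappings, the `gain` values, and the clamping of a negative difference to zero are assumptions introduced for the example; the document states only that each amount grows and shrinks with its input.

```python
def first_allowable(reaction_ms: float, gain: float = 0.5) -> float:
    """Grows with the partner's reaction time (linear form is an assumption)."""
    return gain * reaction_ms


def second_allowable(waiting_ms: float, partner_reaction_ms: float,
                     gain: float = 0.5) -> float:
    """Grows with (own reaction waiting time - partner reaction time).

    A small margin means the user tends to speak again before the partner's
    reply arrives, so little extra delay can be tolerated.
    """
    return gain * max(0.0, waiting_ms - partner_reaction_ms)


def total_delay_upper_bound(own_reaction_ms: float, other_reaction_ms: float,
                            own_waiting_ms: float, other_waiting_ms: float) -> float:
    candidates = [
        first_allowable(other_reaction_ms),                   # own-site first
        first_allowable(own_reaction_ms),                     # other-site first
        second_allowable(own_waiting_ms, other_reaction_ms),  # own-site second
        second_allowable(other_waiting_ms, own_reaction_ms),  # other-site second
    ]
    return min(candidates)


bound = total_delay_upper_bound(
    own_reaction_ms=400.0, other_reaction_ms=600.0,
    own_waiting_ms=1500.0, other_waiting_ms=700.0,
)
```

In this example the other-site second allowable amount is the binding one: the other-site user waits only 700 ms before speaking again, leaving a 300 ms margin over the own-site user's 400 ms reaction time.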
In another aspect of the embodiment of the voice transmitting/receiving device of the present invention, the voice transmitting/receiving device includes a buffer that temporarily stores the conversation data before or after the transmission and reception, and the processing delay amount control means controls the buffer capacity of the buffer on the basis of the determined total amount.
A buffer is preferably provided in order to absorb jitter between data packets and to prevent the degradation of voice quality caused by audio dropouts, data loss, and the like. On the other hand, the size of the buffer capacity is in one-to-one correspondence with the size of the conversation's time delay, and increasing it without limit is not acceptable in terms of overall conversation quality. The buffer capacity is therefore an appropriate actual control target for the processing delay amount control means when controlling the processing delay amount according to the embodiment.
When the buffer capacity is taken as the control target, there is some freedom in how the buffer capacities in the voice transmitting/receiving system are controlled on the basis of the determined total system processing delay amount. For example, when the buffer in each voice transmitting/receiving device includes a reception buffer that temporarily stores received data after reception and a transmission buffer that temporarily stores transmission data before transmission, the processing delay amount control means can determine the distribution of buffer capacity among them relatively freely, provided that the processing delay amount corresponding to the sum of the capacities of the own-site reception and transmission buffers and the other-site reception and transmission buffers equals the total system processing delay amount.
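One way to picture this freedom is a weighted split of the total delay across the four buffers. The sketch below is illustrative only; the buffer names and the weights are hypothetical, and delay is expressed in milliseconds rather than in bytes of buffer capacity.

```python
def allocate_buffer_delays(total_ms: float,
                           weights: dict[str, float]) -> dict[str, float]:
    """Split the system total across buffers in proportion to the given weights."""
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive value")
    return {name: total_ms * w / weight_sum for name, w in weights.items()}


# Hypothetical ratio that favours the reception (jitter) buffers.
allocation = allocate_buffer_delays(
    120.0,
    {"own_rx": 2.0, "own_tx": 1.0, "other_rx": 2.0, "other_tx": 1.0},
)
```

Any choice of weights keeps the sum at the determined total, so the conversation's tempo is unaffected by how the delay is shared.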
In another aspect of the embodiment of the voice transmitting/receiving device of the present invention, the processing delay amount control means controls, on the basis of the determined total amount, the coding rate used to encode the conversation data.
Establishing a conversation between users at remote locations preferably involves a coding process such as encoding.
In this case, a higher or lower coding rate, meaning for example the encoding bit rate used when encoding the input voice at the own site, corresponds to higher or lower voice quality and to a larger or smaller time delay, respectively.
A coding rate of this kind is therefore an appropriate actual control target for the processing delay amount control means when controlling the processing delay amount according to the embodiment.
In another aspect of the embodiment of the voice transmitting/receiving device of the present invention, the device further includes transmission state acquisition means for acquiring the transmission state of the network, and the processing delay amount determination means determines the total amount on the basis of the acquired conversation feature quantity and the acquired transmission state.
According to this aspect, the transmission state of the network acquired by the transmission state acquisition means is reflected in the determination of the total system processing delay amount, so the total system processing delay amount can be determined more accurately.
In this aspect, the transmission state acquisition means may acquire the transmission delay amount of the network as the transmission state of the network.
The transmission delay amount of the network defines the time delay of the conversation in the same dimension as the processing delay amount. It is therefore suitable as a transmission state to be reflected in the determination of the total system processing delay amount. The transmission delay amount of the network preferably refers to the RTT (Round Trip Time).
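The document does not state how the transmission delay enters the determination. One plausible reading, shown here purely as an assumption, is that the one-way network delay (approximated as half the RTT) consumes part of the delay the users can tolerate and the remainder is available for processing. The function name and the RTT/2 approximation are hypothetical.

```python
def processing_budget_ms(tolerable_delay_ms: float, rtt_ms: float) -> float:
    """Delay left for processing after the network has taken its share.

    Assumes a symmetric path, so that one-way delay is about RTT / 2.
    """
    one_way_ms = rtt_ms / 2.0
    return max(0.0, tolerable_delay_ms - one_way_ms)


budget = processing_budget_ms(tolerable_delay_ms=200.0, rtt_ms=80.0)
```

When the network delay alone exceeds what the users tolerate, the budget is clamped to zero rather than going negative.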
In another aspect of the embodiment of the voice transmitting/receiving device of the present invention, the device further includes statistical processing means for statistically processing the acquired conversation feature quantity, and the processing delay amount determination means determines the total amount on the basis of the statistically processed conversation feature quantity.
According to this aspect, the statistical processing means applies statistical processing to the acquired conversation feature quantity. That is, conversation feature quantities acquired over a fixed or variable past period are reflected in the conversation feature quantity used to determine the total system processing delay amount. The reliability of the conversation feature quantity can therefore be improved, and conversation quality can be maintained stably.
The practical form of the statistical processing is not particularly limited, but in one preferred form it may be a process of taking the arithmetic mean of the conversation feature quantities acquired over a fixed past period. In doing so, measures such as excluding obviously abnormal values from the samples may be taken.
In another aspect of the embodiment of the voice transmitting/receiving device of the present invention, the device further includes storage means for storing the acquired conversation feature quantity.
According to this aspect, the acquired conversation feature quantity is stored, in a volatile or nonvolatile manner, in a storage device that may take various forms such as an HDD (Hard Disk Drive), flash memory, an FDD (Floppy (registered trademark) Disc Drive), a DVD, or a BDD (Blu-ray Disc Drive), so the total system processing delay amount can be determined smoothly. In addition, the initial value of the total system processing delay amount for the next conversation of the same kind can be determined on the basis of the stored conversation feature quantity, which makes it possible to shorten the time needed for the total system processing delay amount to converge to its optimum value.
An embodiment of the voice transmitting/receiving system of the present invention is a voice transmitting/receiving system that includes voice transmitting/receiving devices installed at a plurality of sites, each accommodated in a network, and that can establish a conversation, via the voice transmitting/receiving devices, between users present at the respective sites. Each voice transmitting/receiving device comprises: communication means for transmitting and receiving, via the network, conversation data that represents the content of the conversation and includes at least voice data, to and from the voice transmitting/receiving device installed at another site, that is, a site among the plurality of sites other than the own site; conversation feature quantity acquisition means for acquiring, for each of the own-site user present at the own site and the other-site user present at the other site, a predetermined conversation feature quantity related to the timing of utterances in the conversation; processing delay amount determination means for determining, on the basis of the acquired conversation feature quantities, the total amount, over the entire voice transmitting/receiving system, of the processing delay amount of the conversation data, a larger or smaller processing delay amount corresponding to a larger or smaller time delay of the conversation and to higher or lower voice quality of the conversation, respectively; and processing delay amount control means for controlling the processing delay amount so that the determined total amount is satisfied.
Because the embodiment of the voice transmitting/receiving system includes the voice transmitting/receiving device according to the embodiment described above, optimum conversation quality can be obtained.
An embodiment of the server device of the present invention is a server device in a voice transmitting/receiving system. The system includes the server device and voice transmitting/receiving devices, each accommodated in a network. The voice transmitting/receiving devices are installed at a plurality of bases, and each comprises communication means for transmitting and receiving, via the network, conversation data that represents the content of a conversation and includes at least voice data, to and from the voice transmitting/receiving device installed at another base other than the own base among the plurality of bases. The system allows users present at the respective bases to hold a conversation with one another via the voice transmitting/receiving devices. The server device comprises: conversation feature amount acquisition means for acquiring, from the plurality of voice transmitting/receiving devices via the network, a predetermined conversation feature amount related to the timing of utterances in the conversation for each of an own-base user present at the own base and an other-base user present at the other base; processing delay amount determination means for determining, based on the acquired conversation feature amounts, the total amount, over the entire voice transmitting/receiving system, of a processing delay amount of the conversation data, the magnitude of which corresponds both to the magnitude of the time delay of the conversation and to the level of the voice quality of the conversation; and notification means for notifying the plurality of voice transmitting/receiving devices of the determined total amount via the network.
Because the embodiment of the server device includes the conversation feature amount acquisition means and the processing delay amount determination means of the voice transmitting/receiving device embodiment described above, optimum conversation quality can be obtained. In addition, because the server device handles both the acquisition of the conversation feature amounts and the determination of the total system processing delay amount, the load on the voice transmitting/receiving devices that make up the voice transmitting/receiving system can be reduced significantly, which is very useful in practice. In this case, for example, each voice transmitting/receiving device only needs means similar to the processing delay amount control means to control its processing delay amount so that the optimum value of the total system processing delay amount announced by the notification means of the server device is satisfied. Alternatively, the server device may execute appropriate commands to control the voice transmitting/receiving devices from above. Furthermore, if the server device can also specify how the processing delay amount is to be distributed among the voice transmitting/receiving devices based on the total system processing delay amount, the load on the voice transmitting/receiving devices can be reduced further.
As described above, the embodiment of the voice transmitting/receiving device of the present invention includes the communication means, the conversation feature amount acquisition means, the processing delay amount determination means, and the processing delay amount control means, so optimum conversation quality can be obtained.
As described above, the embodiment of the voice transmitting/receiving system of the present invention includes the embodiment of the voice transmitting/receiving device of the present invention, so optimum conversation quality can be obtained.
As described above, the embodiment of the server device of the present invention includes the conversation feature amount acquisition means, the processing delay amount determination means, and the notification means, so optimum conversation quality can be obtained.
These effects and other advantages of the present invention will become apparent from the embodiments described below.
Hereinafter, various preferred embodiments of the present invention will be described with reference to the drawings as appropriate.
<First embodiment>
<Configuration of the embodiment>
First, the configuration of the remote conference system 1 according to the first embodiment of the present invention will be described with reference to FIG. 1. FIG. 1 is a schematic configuration diagram conceptually showing the configuration of the remote conference system 1.
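In FIG. 1, the remote conference system 1 is a voice conference system, an example of the "voice transmitting/receiving system" according to the present invention, accommodated in a wide area network 20 (an IP (Internet Protocol) network) that connects a base X (an example of the "own base" according to the present invention) and a base Y (an example of the "other base" according to the present invention), which are remote from each other.
The remote conference system 1 consists of a voice transmitting/receiving device 10A (an example of the "voice transmitting/receiving device" according to the present invention), which is installed at the base X and used by a user A present at the base X (an example of the "own-base user" according to the present invention), and a voice transmitting/receiving device 10B (another example of the "voice transmitting/receiving device" according to the present invention), which is installed at the base Y and used by a user B present at the base Y (an example of the "other-base user" according to the present invention). Using the remote conference system 1, user A and user B can hold a smooth voice conference through the exchange of voice information.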
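Next, the configuration of one of the voice transmitting/receiving devices constituting the remote conference system 1 will be described, together with its operation, with reference to FIG. 2. FIG. 2 is a block diagram conceptually showing the configuration of the voice transmitting/receiving device 10A. In FIG. 2, parts that are the same as in FIG. 1 are given the same reference numerals, and their description is omitted as appropriate.
In the remote conference system 1, the hardware configuration of the voice transmitting/receiving device 10A is assumed to be equivalent to that of the voice transmitting/receiving device 10B.
In FIG. 2, the voice transmitting/receiving device 10A includes a voice input unit 100, a voice output unit 200, a conversation feature amount detection unit 300, a conversation feature amount statistical processing unit 400, a storage device 500, a processing delay amount determination unit 600, an RTT measurement unit 700, a processing delay information communication unit 800, and a buffer control unit 900.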
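The voice input unit 100 includes a voice input section 110, an encoder 120, a transmission buffer 130, and a voice data transmission section 140, and can transmit the speech of user A as voice data to the voice transmitting/receiving device 10B via the network 20.
The voice input section 110 is an input interface whose input terminal (reference numeral omitted) is connected to a microphone (not shown), and is configured to capture the speech of user A, input through the microphone, as an analog voice signal.
The encoder 120 is a digital conversion device that encodes the analog voice signal input through the voice input section 110 at a predetermined encoding rate (encode bit rate) and converts it into digital voice data. Encoding schemes conforming to various known standards can be adopted for the encoder 120; one such standard is, for example, MPEG (Moving Picture Experts Group).
The transmission buffer 130 is a volatile storage device that temporarily accumulates the digital voice data obtained through the encoder 120, up to a data amount corresponding to a predetermined transmission buffer capacity Da_s.
The voice data transmission section 140 is a transmission interface whose output terminal (reference numeral omitted) is connected to the network 20, and is configured to transmit the digital voice data sequentially output from the transmission buffer 130 to the voice transmitting/receiving device 10B via the network 20. That is, the voice data transmission section 140 is an example of the "communication unit" according to the present invention, and the transmitted digital voice data is an example of the "conversation data" according to the present invention.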
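The voice output unit 200 includes a voice data reception section 210, a reception buffer 220, a decoder 230, and a voice output section 240, and can output the speech of user B through an output device such as a speaker.
The voice data reception section 210 is a reception interface whose output terminal (reference numeral omitted) is connected to the network 20, and is configured to sequentially capture the digital voice data that corresponds to the speech of user B and is transmitted via the network 20. That is, the voice data reception section 210 is another example of the "communication unit" according to the present invention, and the received digital voice data is another example of the "conversation data" according to the present invention.
The reception buffer 220 is a volatile storage device that temporarily accumulates the digital voice data obtained through the voice data reception section 210, up to a data amount corresponding to a predetermined reception buffer capacity Da_r.
The decoder 230 is an analog conversion device that decodes the digital voice data sequentially output from the reception buffer 220 and converts it into analog voice data. The function of the decoder 230 pairs with the encoding function of the encoder 120, and the two are naturally configured to perform data conversion conforming to the same standard.
The voice output section 240 is an output interface whose output terminal (reference numeral omitted) is connected to a speaker (not shown), and is configured to output the speech of user B through the speaker.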
The conversation feature amount detection unit 300 is an example of the "conversation feature amount acquisition means" according to the present invention, and is configured to detect, as the conversation feature amounts of user A, a user A reaction time R_A and a user A reaction waiting time T_A, both described later.
The conversation feature amount statistical processing unit 400 is an example of the "statistical processing means" according to the present invention, and is configured to statistically process the conversation feature amounts detected as appropriate by the conversation feature amount detection unit 300. It can hold a fixed number of the most recently detected conversation feature amount samples, and it outputs the arithmetic mean of the sample values it holds.
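The averaging over a fixed number of past samples described for the conversation feature amount statistical processing unit 400 can be sketched as follows. This is only an illustrative sketch: the class name `FeatureAverager`, the window size, and the sample values are assumptions, since the source does not state how many samples are retained.

```python
from collections import deque


class FeatureAverager:
    """Keeps the most recent `window` samples and reports their arithmetic mean."""

    def __init__(self, window: int = 8):  # window size is an assumed value
        # A bounded deque silently discards the oldest sample once it is full.
        self.samples = deque(maxlen=window)

    def add(self, value: float) -> float:
        """Store a newly detected sample and return the updated average."""
        self.samples.append(value)
        return sum(self.samples) / len(self.samples)


# Hypothetical reaction-time samples in seconds.
averager = FeatureAverager(window=3)
for reaction_time in (0.5, 0.7, 0.9, 1.1):
    dat_ra = averager.add(reaction_time)  # value that would be stored as DAT_RA
```

With a window of three, the first sample has already been dropped by the time the fourth arrives, so the stored average tracks recent behaviour of the speaker rather than the whole call.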
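The storage device 500 is a nonvolatile storage device such as an HDD or a flash memory, and is an example of the "storage means" according to the present invention. Its storage area holds reaction time data DAT_RA, which represents the average value of the user A reaction time R_A output from the conversation feature amount statistical processing unit 400, and reaction waiting time data DAT_TA, which represents the average value of the user A reaction waiting time T_A.
The processing delay amount determination unit 600 is a control device that includes an allowable processing delay estimation unit 610 and a negotiation unit 620 and is configured to determine the buffer capacity Da_s of the transmission buffer 130 and the buffer capacity Da_r of the reception buffer 220; it is an example of the "processing delay amount determination means" according to the present invention. The processing delay amount determination unit 600 is configured to execute the buffer capacity control described later in accordance with a control program stored in a ROM (Read Only Memory).
The allowable processing delay estimation unit 610 is a processor that executes various processes for determining the maximum allowable processing delay amount dmax, described later, which is the largest processing delay amount allowed for the remote conference system 1 (that is, an example of the "total amount, over the entire voice transmitting/receiving system, of the processing delay amount of the conversation data" according to the present invention). The allowable processing delay estimation unit 610 has a first calculation model and a second calculation model, which are calculation models constructed theoretically in advance, and executes these processes based on them.
The negotiation unit 620 is a processor that negotiates with the voice transmitting/receiving device 10B at the base Y over the distribution of buffer capacity, which is an example of the "processing delay amount" according to the present invention. Through this negotiation with the voice transmitting/receiving device 10B, the negotiation unit 620 also determines the control target values of the transmission buffer capacity Da_s and the reception buffer capacity Da_r.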
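The RTT measurement unit 700 is a processor configured to measure the RTT (round-trip time), which is the transmission delay amount of the network 20. The RTT measurement unit 700 measures the RTT using the SR (Sender Report) and RR (Receiver Report) of RTCP (Real-time Transport Control Protocol).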
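For reference, RFC 3550 describes how a sender can derive a round-trip time from a Receiver Report: it subtracts the report's LSR (last SR timestamp) and DLSR (delay since last SR) fields from the arrival time of the report, all expressed in units of 1/65536 second. The sketch below applies that arithmetic to the worked numbers given in RFC 3550. The function name is an assumption, and the source does not state how the RTT measurement unit 700 actually processes the SR and RR packets.

```python
def rtt_from_receiver_report(arrival: int, lsr: int, dlsr: int) -> float:
    """Return the round-trip time in seconds.

    All arguments are 32-bit values in units of 1/65536 s:
      arrival - time the Receiver Report arrived (middle 32 bits of the NTP time)
      lsr     - "last SR timestamp" field copied from the Receiver Report
      dlsr    - "delay since last SR" field copied from the Receiver Report
    """
    # Mask to 32 bits so the subtraction stays correct across timestamp wrap-around.
    rtt_units = (arrival - lsr - dlsr) & 0xFFFFFFFF
    return rtt_units / 65536.0


# Worked example from RFC 3550, section 6.4.1.
rtt = rtt_from_receiver_report(arrival=0xB7108000, lsr=0xB7052000, dlsr=0x00054000)
```

Note that this gives the network round trip as seen from one endpoint; the description defines its RTT as the sum of the two one-way transfer times between the devices.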
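The RTT defined here is the sum of two times: the time from when digital voice data is transmitted through the voice data transmission section 140 of the voice transmitting/receiving device 10A until that data is received through the voice data reception section 210 of the voice transmitting/receiving device 10B, and the time from when digital voice data is transmitted through the voice data transmission section 140 of the voice transmitting/receiving device 10B until that data is received through the voice data reception section 210 of the voice transmitting/receiving device 10A. In an ordinary conversation, such as when user A and user B talk face to face at the same base, no time delay corresponding to this RTT exists. In the remote conference system 1, the RTT therefore serves as an index value that defines the time delay amount of a conversation between users via the remote conference system 1.
The processing delay information communication unit 800 is a communication interface whose output terminal is connected to the network 20 and which exchanges, with the voice transmitting/receiving device 10B, the various data involved in setting the buffer capacity target values described above. It transmits to the voice transmitting/receiving device 10B the user A reaction time R_A and the processing delay amount d set based on each calculation model, and acquires from the voice transmitting/receiving device 10B the user B reaction time R_B and the processing delay amount d set based on each calculation model.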
The buffer control unit 900 is a processor configured to variably control the buffer capacities of the transmission buffer 130 and the reception buffer 220, and is an example of the "processing delay amount control means" according to the present invention.
<Operation of the embodiment>
Next, the buffer capacity control executed by the processing delay amount determination unit 600 will be described in detail, as the operation of this embodiment, with reference to FIG. 3. FIG. 3 is a flowchart of the buffer capacity control. In the voice transmitting/receiving device 10A, the transmission and reception of conversation data by the voice input unit 100 and the voice output unit 200 described above are performed as appropriate, separately from this buffer capacity control.
In FIG. 3, the buffer capacity control first executes a transmission process for the user A reaction time R_A and a reception process for the user B reaction time R_B (step S101). The user A reaction time R_A and the user B reaction time R_B are each an example of the "conversation feature amount" according to the present invention; the former is an example of the "own-base user reaction time" and the latter an example of the "other-base user reaction time".
In step S101, the reaction time data DAT_RA is read from the storage device 500 and transmitted by the processing delay information communication unit 800 to the voice transmitting/receiving device 10B via the network 20. Meanwhile, the processing delay information communication unit 800 acquires reaction time data DAT_RB, corresponding to the user B reaction time R_B, from the voice transmitting/receiving device 10B via the network 20 and sends it to the allowable processing delay estimation unit 610.
Here, the reaction time will be described with reference to FIG. 4. FIG. 4 is a timing chart illustrating the concept of the reaction time.
In FIG. 4, suppose that user A and user B are conversing in an ideal environment in which no significant time delay arises in the transmission of speech between them. Such an ideal environment is, for example, one in which the conversation takes place face to face at the same base.
Suppose that user A finishes speaking at an arbitrary time T0 (see the hatched portion for user A). User B perceives the end of user A's utterance at time T0 and then starts speaking at time T1, after a delay specific to user B has elapsed from the utterance end time T0 (see the hatched portion for user B). This delay, from the end of one party's utterance to the start of the other party's utterance, is the reaction time. The reaction time varies widely with the user's personality, preferences, mental load, physical load, and other circumstances that can change from case to case.
The user A reaction time R_A is the reaction time of user A at the base X, and the user B reaction time R_B is the reaction time of user B at the base Y. In a conversation between remote locations, as in the remote conference system 1, the end of one party's utterance means the end of the voice output through voice output means such as a speaker (that is, an example of the "utterance output" according to the present invention). This is because a conversation is not established unless the content of the utterance is recognized by the other party, even if the act of speaking itself has ended.
The reaction time is calculated in each voice transmitting/receiving device by a reaction time calculation process executed by the conversation feature amount detection unit 300. This process will be described with reference to FIG. 5, which is a flowchart of the reaction time calculation process. The process in FIG. 5 is assumed to be executed at the base X.
In FIG. 5, it is first determined whether there is voice output of user B (step S201). If there is none (step S201: NO), the process returns to step S201 and the determination is repeated. If there is voice output of user B (step S201: YES), the current time is stored as the final voice output time Top (step S202).
Once the final voice output time Top has been updated, it is determined whether the voice output of user B has ceased (step S203). If the voice output of user B is continuing (step S203: NO), the process returns to step S202 and Top continues to be updated. If there is no voice output of user B (step S203: YES), it is determined whether there is voice input from user A (step S204).
Because user B does not necessarily speak continuously, the determination that there is no voice output of user B is made on the basis of whether the length of the silent interval has exceeded a preset reference value, so that the utterance is not misjudged to have ended while user B is still in the middle of a series of utterances.
If there is no voice input from user A (step S204: NO), that is, while user A is presumed to be considering a reply after the utterance output of user B has ended, step S203 is executed repeatedly. When voice input from user A starts (that is, when user A starts speaking) (step S204: YES), the time value corresponding to the difference between the time T at that moment and the final voice output time Top updated in step S202 is determined as the user A reaction time R_A (step S205). Once the user A reaction time R_A has been calculated, the process returns to step S201 and the series of steps is repeated. The reaction time calculation process is executed as described above.
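The flow of steps S201 to S205 can be sketched as a loop over periodic activity samples. This is a minimal sketch under assumptions that are not in the source: the function name, the `(time, B output active, A input active)` frame format, the millisecond time base, and the premise that the "B output active" flag has already been smoothed by the silent-interval reference value mentioned above.

```python
from typing import Iterable, List, Optional, Tuple

# (time in ms, user B voice output active, user A voice input active)
Frame = Tuple[int, bool, bool]


def reaction_times(frames: Iterable[Frame]) -> List[int]:
    """Return every user A reaction time found in the frame sequence, in ms."""
    results: List[int] = []
    top: Optional[int] = None  # final voice output time Top; None while waiting in S201
    for now, b_output, a_input in frames:
        if b_output:
            top = now                   # S201/S202: B is audible, keep updating Top
        elif top is not None and a_input:
            results.append(now - top)   # S204: YES, then S205: R_A = T - Top
            top = None                  # return to S201 and wait for B's next utterance
        # Otherwise B is silent and A has not started yet: stay in the S203/S204 loop.
    return results


# Hypothetical trace: B speaks until 100 ms, A answers at 400 ms.
trace = [(0, True, False), (100, True, False), (200, False, False),
         (300, False, False), (400, False, True), (500, False, True)]
r_a_samples = reaction_times(trace)
```

Because Top is taken from the last frame in which B was audible, the resolution of the result is limited by the frame interval chosen for the trace.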
The user B reaction time R_B is calculated in the same way in the voice transmitting/receiving device 10B.
Each time the user A reaction time R_A is calculated by the reaction time calculation process, it is sent to the conversation feature amount statistical processing unit 400 and subjected to statistical processing. As described above, the statistical processing is arithmetic averaging over a predetermined number of past samples, although the statistical processing is not limited to arithmetic averaging.
The statistically processed user A reaction time R_A is stored in the storage device 500 as the reaction time data DAT_RA. The reaction time data DAT_RA is updated as appropriate each time the user A reaction time R_A is calculated and the statistical processing is executed.
Returning to FIG. 3, when the transmission and reception of the reaction time data are complete, the user A reaction waiting time T_A is acquired (step S102).
The reaction waiting time is the time from when one user finishes an utterance until that same user starts speaking again. The reaction waiting time varies widely with the user's personality, preferences, mental load, physical load, and other circumstances that can change from case to case.
The reaction waiting time is calculated in each voice transmitting/receiving device by a reaction waiting time calculation process executed by the conversation feature amount detection unit 300. This process will be described with reference to FIG. 6, which is a flowchart of the reaction waiting time calculation process. The process in FIG. 6 is assumed to be executed at the base X.
In FIG. 6, it is first determined whether there is voice input from user A (step S301). If there is none (step S301: NO), the process returns to step S301 and the determination is repeated. If there is voice input from user A (step S301: YES), the current time is stored as the final voice input time Tip (step S302).
Once the final voice input time Tip has been updated, it is determined whether there is no voice output of user B (step S303). If there is voice output of user B (step S303: NO), user B's response is regarded as having arrived in a time shorter than the user A reaction waiting time T_A, and the process returns to step S301.
If there is no voice output of user B (step S303: YES), it is determined whether there is voice input from user A (step S304). If there is none (step S304: NO), the process returns to step S303. That is, the process waits until voice input from user A resumes.
If there is voice input from user A (step S304: YES), it is determined whether the time value corresponding to the difference between the current time T and the final voice input time Tip updated in step S302 is larger than a reference value T0 (step S305). Because user A does not necessarily speak continuously, speech may be interrupted temporarily even within a series of utterances. The reference value T0 (a threshold, not the utterance end time T0 of FIG. 4) is a criterion for determining whether the voice input of user A belongs to such a continuing series of utterances.
That is, if the time value T - Tip is equal to or less than the reference value T0 (step S305: NO), the process returns to step S301. If T - Tip is larger than the reference value T0 (step S305: YES), T - Tip is determined as the user A reaction waiting time T_A (step S306). The reaction waiting time calculation process is executed as described above.
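Steps S301 to S306 can be sketched in the same frame-based style. This is an illustrative sketch only: the function name, the frame format, the millisecond time base, and the parameter name `min_gap` (standing in for the reference value T0 of step S305) are assumptions, and the process is assumed to return to step S301 after step S306.

```python
from typing import Iterable, List, Optional, Tuple

# (time in ms, user B voice output active, user A voice input active)
Frame = Tuple[int, bool, bool]


def reaction_waiting_times(frames: Iterable[Frame], min_gap: int) -> List[int]:
    """Return every user A reaction waiting time found in the frames, in ms."""
    results: List[int] = []
    tip: Optional[int] = None  # final voice input time Tip; None while waiting in S301
    for now, b_output, a_input in frames:
        if tip is not None and b_output:
            tip = None                     # S303: NO, B answered first, back to S301
        if a_input:
            if tip is not None and now - tip > min_gap:
                results.append(now - tip)  # S305: YES, then S306: T_A = T - Tip
            # S302, and also S305: NO (same utterance continues): refresh Tip.
            tip = now
    return results


# Hypothetical trace: A stops at 100 ms, hears nothing, and resumes at 600 ms;
# later B replies at 800 ms, so the second pause is not counted.
trace = [(0, False, True), (100, False, True), (200, False, False),
         (300, False, False), (400, False, False), (500, False, False),
         (600, False, True), (700, False, False), (800, True, False),
         (900, False, True)]
t_a_samples = reaction_waiting_times(trace, min_gap=250)
```

Consecutive frames of A's speech are only 100 ms apart in this trace, which is below `min_gap`, so they are treated as one continuing utterance rather than as a wait.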
The user B reaction waiting time T_B is calculated in the same way in the voice transmitting/receiving device 10B.
Each time the user A reaction waiting time T_A is calculated by the reaction waiting time calculation process, it is sent to the conversation feature amount statistical processing unit 400 and subjected to statistical processing. As described above, the statistical processing is arithmetic averaging over a predetermined number of past samples, although the statistical processing is not limited to arithmetic averaging.
The statistically processed user A reaction waiting time T_A is stored in the storage device 500 as the reaction waiting time data DAT_TA. The reaction waiting time data DAT_TA is updated as appropriate each time the user A reaction waiting time T_A is calculated and the statistical processing is executed.
Returning again to FIG. 3, when the user A reaction waiting time T_A has been acquired, the processing delay amount d is set based on the first calculation model and the second calculation model (step S103). The processing delay amount d is set for each of the first and second calculation models. Here, the processing delay amount d means the total processing delay amount of the remote conference system 1.
The first calculation model will now be described with reference to FIG. 7. FIG. 7 is a timing chart illustrating the concept of the first calculation model. In FIG. 7, parts that are the same as in FIG. 4 are given the same reference numerals, and their description is omitted as appropriate.
In FIG. 7, the processing delay amount d in the remote conference system 1 is defined by equation (1) below.
d = da + db + dproc ... (1)
In equation (1), da is the delay amount that corresponds one-to-one with the buffer capacity of the voice transmitting/receiving device 10A. It is related to the reception buffer delay amount da_r, which corresponds to the reception buffer capacity Da_r, and the transmission buffer delay amount da_s, which corresponds to the transmission buffer capacity Da_s, by da = da_r + da_s.
Likewise, db is the delay amount that corresponds one-to-one with the buffer capacity of the voice transmitting/receiving device 10B. It is related to the reception buffer delay amount db_r, which corresponds to the reception buffer capacity Db_r, and the transmission buffer delay amount db_s, which corresponds to the transmission buffer capacity Db_s, by db = db_r + db_s.
Further, dproc in equation (1) is the processing delay of the encoders and decoders in the voice transmitting/receiving devices 10A and 10B.
More specifically, dproc is the sum of the encoder processing delay denca and the decoder processing delay ddeca in the voice transmitting/receiving device 10A and the encoder processing delay dencb and the decoder processing delay ddecb in the voice transmitting/receiving device 10B. dproc is constant as long as the encoding rate of the encoders and decoders is constant.
In FIG. 7, suppose that user A finishes speaking at time T0. In a configuration in which the base X and the base Y, which are remote from each other, converse using the remote conference system 1, the time at which the end of user A's utterance is recognized differs from that in the ordinary conversation of FIG. 4.
That is, the end of user A's utterance is recognized at the base Y at time T0', which is later than time T0 by the sum of the transmission buffer delay amount da_s of the voice transmitting/receiving device 10A, the reception buffer delay amount db_r of the voice transmitting/receiving device 10B, the encoder processing delay denca of the voice transmitting/receiving device 10A, the decoder processing delay ddecb of the voice transmitting/receiving device 10B, and the one-way delay OWDa of the network 20 from the voice transmitting/receiving device 10A to the voice transmitting/receiving device 10B. Equation (2) below therefore holds.
T0' = T0 + da_s + db_r + denca + ddecb + OWDa ... (2)
For the one-way delay OWDa of the network 20, equation (3) below holds.
RTT = OWDa + OWDb ... (3)
Here, OWDb is the one-way delay of the network 20 from the voice transmitting/receiving device 10B to the voice transmitting/receiving device 10A.
Next, suppose that user B starts speaking at time T1', after the user B reaction time R_B has elapsed from time T0'. The time at which the start of user B's utterance is recognized also differs between the base X and the base Y: it is recognized at the base X at time T2, which is later than time T1' by the sum of the transmission buffer delay amount db_s of the voice transmitting/receiving device 10B, the reception buffer delay amount da_r of the voice transmitting/receiving device 10A, the encoder processing delay dencb of the voice transmitting/receiving device 10B, the decoder processing delay ddeca of the voice transmitting/receiving device 10A, and the one-way delay OWDb. Equation (4) below therefore holds.
T2 = T0' + R_B + db_s + da_r + dencb + ddeca + OWDb ... (4)
That is, even if user B speaks after the user B reaction time R_B following the end of user A's utterance, with the same feel as in the ordinary conversation illustrated in FIG. 4, the influence of the remote conference system 1 and the network 20 means that user A recognizes the start of user B's utterance only after the delay time TL defined by equation (5) below has elapsed.
TL = T2 - T0 = R_B + RTT + d ... (5)
The first calculation model uses the index value Z of equation (6) below.
Z = TL / R_B ... (6)
The index value Z takes a value larger than 1 and approaches 1 asymptotically as the environment approaches that of an ordinary conversation (such as the one illustrated in FIG. 4). The larger the index value Z becomes, the more user A feels that the conversation is not proceeding smoothly, and the less comfortable the conversation may become.
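Equation (5) follows from substituting equations (1) to (3) into equation (4). The short sketch below checks this numerically and evaluates the index value Z of equation (6). All numeric values are illustrative assumptions in milliseconds and do not come from the source.

```python
# Illustrative values in milliseconds (assumptions, not taken from the source).
da_s, da_r = 20, 20      # device 10A transmission / reception buffer delays
db_s, db_r = 20, 20      # device 10B transmission / reception buffer delays
denca, ddeca = 5, 5      # device 10A encoder / decoder processing delays
dencb, ddecb = 5, 5      # device 10B encoder / decoder processing delays
owd_a, owd_b = 30, 40    # one-way network delays A->B and B->A
r_b = 800                # user B reaction time R_B
t0 = 0                   # time at which user A stops speaking

t0_dash = t0 + da_s + db_r + denca + ddecb + owd_a           # equation (2)
rtt = owd_a + owd_b                                           # equation (3)
t2 = t0_dash + r_b + db_s + da_r + dencb + ddeca + owd_b      # equation (4)

d = (da_s + da_r) + (db_s + db_r) + (denca + ddeca + dencb + ddecb)  # equation (1)
tl = t2 - t0                                                  # left side of equation (5)
z = tl / r_b                                                  # equation (6)

# The right side of equation (5) gives the same delay time.
assert tl == r_b + rtt + d
```

With these assumed numbers, TL is 970 ms and Z is about 1.21, so user A waits roughly 21% longer for a reply than in a face-to-face conversation.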
A maximum value F (for example, F = 1.2) is set experimentally in advance for the index value Z. The maximum value F is an adapted value above which the degradation in conversation quality felt by user A can be judged to be no longer negligible. Conversely, as long as the index value Z is equal to or less than the maximum value F, user A does not perceive any practically significant lack of smoothness in the conversation with user B.
Specifically, expression (7) below is set as a conditional expression.
(R_B + RTT + d) / R_B ≤ F ... (7)
Rearranging expression (7) gives expression (8) below.
d ≤ (F - 1)R_B - RTT ... (8)
Expression (8) defines the range that the processing delay amount d can take. The processing delay amount d in expression (8) is an example of the "own-base first allowable processing delay amount" according to the present invention.
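As a small worked example of expression (8), the sketch below evaluates the upper bound on d for assumed values of the other party's reaction time and the RTT. The function name and the numbers are hypothetical; only the formula and the example value F = 1.2 come from the text.

```python
def first_model_limit(reaction_time_other: float, rtt: float, f_max: float = 1.2) -> float:
    """Upper bound on d from expression (8): d <= (F - 1) * R_B - RTT."""
    return (f_max - 1.0) * reaction_time_other - rtt


# Assumed values in milliseconds: R_B = 800, RTT = 70.
limit_ms = first_model_limit(reaction_time_other=800.0, rtt=70.0)
```

With these assumed inputs the bound is about 90 ms. A negative result would mean that the network delay alone already pushes Z above F; the description does not say how that case is handled.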
The processing delay amount d given by expression (8) is the value set for user A. For user B, the processing delay amount d is set by expression (9) below through the same process in the voice transmitting/receiving device 10B. The processing delay amount d in expression (9) is an example of the "other-base first allowable processing delay amount" according to the present invention.
d ≤ (F - 1)R_A - RTT ... (9)
Next, the second calculation model will be described with reference to FIG. 8. FIG. 8 is a timing chart illustrating the concept of the second calculation model. In FIG. 8, parts that are the same as in FIG. 7 are given the same reference numerals, and their description is omitted as appropriate.
FIG. 8 adds the concept of the user A reaction waiting time T_A to FIG. 7.
The user A reaction waiting time T_A is the time from time T0, when user A finishes speaking, to time T1', when user A starts speaking again. As described for the first calculation model, after user A finishes speaking, user A recognizes the start of user B's utterance at time T2, after the delay time TL has elapsed from time T0.
At time T2, however, user A may already have started speaking on the assumption that user B has not responded, so that user A and user B are both speaking at once. To ensure smooth conversation at the base X, the user A reaction waiting time T_A must therefore be taken into account. The second calculation model is a calculation model that takes the user A reaction waiting time T_A into account.
To avoid the simultaneous-speech situation described above, the delay time TL need only be equal to or less than the user A reaction waiting time T_A, so expression (10) below holds.
R_B + RTT + d ≤ T_A ... (10)
Rearranging expression (10) gives expression (11) below.
d ≤ (T_A - R_B) - RTT ... (11)
Expression (11) defines the range that the processing delay amount d can take. The processing delay amount d in expression (11) is an example of the "own-base second allowable processing delay amount" according to the present invention.
The processing delay amount d given by expression (11) is the value set for user A. For user B, the processing delay amount d is calculated by expression (12) below through the same process in the voice transmitting/receiving device 10B. The processing delay amount d in expression (12) is an example of the "other-base second allowable processing delay amount" according to the present invention.
d ≤ (T_B - R_A) - RTT ... (12)
When the processing delay amount d has been set by the first calculation model and the second calculation model, the maximum allowable processing delay amount dmax is determined (step S104). The maximum allowable processing delay amount dmax is a processing delay amount at which smooth conversation can be ensured for both user A and user B. The processing delay amounts set based on the first and second calculation models must therefore be compared. More specifically, the maximum allowable processing delay amount dmax must satisfy all of expressions (8), (9), (11), and (12).
Accordingly, in step S104, the processing delay amount d set for user B based on the first and second calculation models is acquired from the voice transmitting/receiving device 10B via the processing delay information communication unit 800. Conversely, the processing delay amount d set for user A based on the first and second calculation models is transmitted to the voice transmitting/receiving device 10B via the processing delay information communication unit 800. In other words, the conditional expressions that determine dmax are shared between the voice transmitting/receiving devices 10A and 10B.
Next, a processing delay amount d that satisfies all of expressions (8), (9), (11), and (12) is determined. Specifically, the processing delay amount d bounded by whichever of the four conditional expressions has the smallest right-hand side satisfies all four.
On the other hand, from the standpoint of preventing voice quality degradation caused by audio dropouts, coding errors, and the like, a larger processing delay amount is preferable. The maximum allowable processing delay amount dmax is therefore finally determined as the maximum value within the range that satisfies all four expressions.
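The selection of dmax in step S104 can be sketched as taking the minimum of the four right-hand sides. The following is a minimal sketch, not the patented implementation: the function and parameter names are hypothetical, the right-hand sides of expressions (8) and (9) are passed in already evaluated because their form is not given in this passage, and clamping a negative result to zero is an added assumption.

```python
def second_model_bound(reaction_wait_ms, peer_reaction_ms, rtt_ms):
    """Right-hand side of expressions (11)/(12): (T - R) - RTT."""
    return (reaction_wait_ms - peer_reaction_ms) - rtt_ms


def max_allowable_delay(first_bound_8_ms, first_bound_9_ms,
                        t_a_ms, r_a_ms, t_b_ms, r_b_ms, rtt_ms):
    """Largest d that satisfies expressions (8), (9), (11), and (12)."""
    bounds = [
        first_bound_8_ms,                             # expression (8), supplied by the caller
        first_bound_9_ms,                             # expression (9), supplied by the caller
        second_model_bound(t_a_ms, r_b_ms, rtt_ms),   # expression (11): (T_A - R_B) - RTT
        second_model_bound(t_b_ms, r_a_ms, rtt_ms),   # expression (12): (T_B - R_A) - RTT
    ]
    # The tightest condition decides. Assumption: a negative bound means
    # there is no delay budget at all, so the result is clamped to zero.
    return max(0.0, min(bounds))
```

For example, with assumed first-model bounds of 700 ms and 650 ms, T_A = 1200 ms, R_A = 300 ms, T_B = 1000 ms, R_B = 400 ms, and RTT = 200 ms, expression (12) is the tightest condition and the function returns 500.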
Once the maximum allowable processing delay amount dmax has been determined, distribution processing is executed (step S105). Distribution processing allocates the determined dmax among the buffer capacities of the voice transmitting/receiving devices 10A and 10B. It is executed by the negotiation unit 620.
Specifically, the negotiation unit 620 checks the load status of the voice transmitting/receiving device 10B via the processing delay information communication unit 800. If neither device 10B nor device 10A has a processing load problem, the negotiation unit 620 decides that device 10A will bear 50% of dmax and notifies device 10B of that decision. As a result, absent any processing load problem, device 10A normally bears 50% of dmax.
Furthermore, the negotiation unit 620 divides the delay amount to be borne by device 10A between the transmission buffer 130 and the reception buffer 220. Normally the two buffers are again given equal shares: the processing delay amount for the transmission buffer 130 is set to 25% of dmax, and that for the reception buffer 220 is likewise set to 25% of dmax. The distribution ratio thus set is conveyed to the buffer control unit 900.
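The default split in step S105 (half of dmax per device, then half of each device's share per buffer) can be sketched as follows. This is a minimal sketch with hypothetical names. The text specifies only the equal split used when neither device reports a load problem, so the adjustable share parameters are an assumption about how an unequal split might be expressed.

```python
def distribute_delay(dmax_ms, own_device_share=0.5, send_share=0.5):
    """Split dmax between the two devices, then between the own buffers.

    own_device_share: fraction of dmax borne by the own device (10A).
    send_share: fraction of the own device's part given to the send buffer.
    Both default to the equal split described for the no-load-problem case.
    """
    for share in (own_device_share, send_share):
        if not 0.0 <= share <= 1.0:
            raise ValueError("shares must lie between 0 and 1")
    own_ms = dmax_ms * own_device_share
    send_ms = own_ms * send_share
    return {
        "own_send_buffer_ms": send_ms,           # 25% of dmax by default
        "own_recv_buffer_ms": own_ms - send_ms,  # 25% of dmax by default
        "peer_device_ms": dmax_ms - own_ms,      # remainder borne by device 10B
    }
```

With dmax = 400 ms and the defaults, each own-side buffer is assigned 100 ms and the remaining 200 ms is left to the other device.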
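On receiving the distribution ratio, the buffer control unit 900 controls the capacities of the transmission buffer 130 and the reception buffer 220 so as to obtain the processing delay amounts corresponding to that ratio (step S106). Once the buffer capacities have been controlled, processing returns to step S101 and the series of steps is repeated. Buffer capacity control is executed as described above.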
Thus, with the buffer capacity control of this embodiment, the maximum allowable processing delay amount dmax is determined from the processing delay amounts d set under the first and second calculation models, and the processing delay amount d of the remote conference system 1 is controlled to this dmax through control of the reception buffer capacity Da_r and the transmission buffer capacity Da_s. The first and second calculation models are each constructed so as to reflect the users' conversation feature amounts, so the dmax that is determined suits the actual circumstances of users A and B, the parties to the conversation, and guarantees at least a minimum level of smoothness for both.
Consequently, the loss of comfort that results from impaired conversational smoothness is prevented. At the same time, dmax is the largest value within the range in which such loss of comfort can be prevented, so degradation of voice quality due to audio dropouts, coding errors, and the like is also prevented. In short, this embodiment provides conversation quality optimized for the circumstances of the users taking part in the conversation.
In this embodiment, two users, A and B, take part in the conversation. Needless to say, the same concept can also be applied to conferences and conversations with three or more participants to provide conversation quality optimized for the users' circumstances.
This embodiment assumes an audio conference, but imaging means may also be installed at each site so that image or video data is transmitted and received together with the voice data. In that case, a conference format such as a so-called TV conference can be adopted.
Because the voice transmitting/receiving device of this embodiment includes the storage unit 500, it can also retain past conversation feature amounts for a plurality of users. There is of course no guarantee that past conversation feature amounts can be used unchanged in the next session, but using them as initial values, for example, is advantageous because it suppresses variation in conversation quality during the initial operation of the conversation feature amount acquisition unit 300 and the processing delay amount determination unit 600.
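One way to picture the use of stored conversation feature amounts as initial values is a running mean that is seeded from storage. This is a minimal sketch only: the class name, the choice of a running mean, and the weight given to the stored value are all assumptions, since the text says no more than that past values may serve as initial values.

```python
class ReactionTimeEstimator:
    """Running mean of a conversation feature, optionally seeded from storage."""

    def __init__(self, stored_ms=None, stored_weight=5):
        # Assumption: a stored value counts as `stored_weight` earlier samples,
        # so the first few live measurements cannot swing the estimate far.
        if stored_ms is None:
            self._sum, self._count = 0.0, 0
        else:
            self._sum, self._count = stored_ms * stored_weight, stored_weight

    def update(self, sample_ms):
        """Add one measured sample and return the current estimate."""
        self._sum += sample_ms
        self._count += 1
        return self._sum / self._count
```

With a stored value of 400 ms weighted as four samples, a first live measurement of 900 ms moves the estimate only to 500 ms, whereas an unseeded estimator would jump straight to 900 ms.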
In this embodiment, the processing delay amount is controlled through the buffer capacity, but the control quantity correlated with the processing delay amount may instead be the coding rate of the voice data (for example, the encoding bit rate of the encoder 120). A higher coding rate yields relatively higher sound quality but also increases the amount of data transmitted, so the processing delay amount increases. Controlling the coding rate instead of, or in addition to, the buffer capacity described above can therefore achieve the same effect.
<Second embodiment>
Next, a second embodiment of the present invention will be described with reference to FIG. 9. FIG. 9 is a schematic configuration diagram conceptually showing the configuration of a remote conference system 2 according to the second embodiment of the present invention. In the figure, parts that are the same as in FIG. 1 are given the same reference numerals, and their description is omitted as appropriate.
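In FIG. 9, the remote conference system 2 differs from the remote conference system 1 of the first embodiment in that it includes a server device 30 accommodated in the network 20.
The server device 30 is a computer system that mediates between the voice transmitting/receiving devices 10A and 10B, and is an example of the "server device" according to the present invention.
The server device 30 includes the processing delay amount determination unit 600 of the first embodiment. All of the calculations, namely the processing delay amount d based on the first and second calculation models, the maximum allowable processing delay amount dmax, and the buffer capacity of each voice transmitting/receiving device, are executed by the server device 30.
The server device 30 also includes a notification unit that notifies each voice transmitting/receiving device of the determined buffer capacity. In each voice transmitting/receiving device that receives this notification, the buffer control unit 900 controls its transmission and reception buffers according to the notified buffer capacity.
Thus the second embodiment, like the first, provides conversation quality that does not impair comfort. In particular, relatively heavy processing, such as the various calculations based on the calculation models and the determination of the buffer capacity distribution ratio, is borne by the server device 30 instead of the voice transmitting/receiving devices. This lightens the load on those devices and enables smoother operational control of the conversation.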
The present invention is not limited to the embodiments described above and may be modified as appropriate within a scope that does not depart from the gist or concept of the invention as read from the claims and the specification as a whole. Voice transmitting/receiving devices, voice transmitting/receiving systems, and server devices incorporating such modifications also fall within the technical scope of the present invention.
The present invention is applicable to devices and systems that establish conversation between users at remote locations.
DESCRIPTION OF REFERENCE NUMERALS
1: remote conference system; 10A, 10B: voice transmitting/receiving device; 20: network; 100: voice input unit; 110: voice input section; 120: encoder; 130: transmission buffer; 140: voice data transmission section; 200: voice output unit; 210: voice data reception section; 220: reception buffer; 230: decoder; 240: voice output section; 300: conversation feature amount detection unit; 400: conversation feature amount statistical processing unit; 500: storage unit; 600: processing delay amount determination unit; 610: allowable processing delay estimation unit; 620: negotiation unit; 700: RTT measurement unit; 800: processing delay information communication unit; 900: buffer control unit.
Claims (15)
1. A voice transmitting/receiving device for use in a voice transmission/reception system that comprises voice transmitting/receiving devices installed at each of a plurality of sites, each accommodated in a network, and that is capable of establishing a conversation between users present at the plurality of sites via the voice transmitting/receiving devices, the voice transmitting/receiving device comprising:
communication means for transmitting and receiving, via the network, conversation data that includes at least voice data and represents the content of the conversation, to and from the voice transmitting/receiving device installed at another site, other than the own site, among the plurality of sites;
conversation feature amount acquisition means for acquiring, for each of an own-site user present at the own site and an other-site user present at the other site, a predetermined conversation feature amount related to utterance timing in the conversation;
processing delay amount determination means for determining, based on the acquired conversation feature amount, the total amount, across the entire voice transmission/reception system, of a processing delay amount of the conversation data whose magnitude corresponds both to the magnitude of the time delay of the conversation and to the level of the voice quality of the conversation; and
processing delay amount control means for controlling the processing delay amount so that the determined total amount is satisfied.
2. The voice transmitting/receiving device according to claim 1, wherein the conversation feature amount acquisition means acquires, as the conversation feature amount, at least one of an own-site user reaction time, which is the time from when output of the other-site user's utterance ends at the own site until the own-site user starts speaking, and an other-site user reaction time, which is the time from when output of the own-site user's utterance ends at the other site until the other-site user starts speaking.
3. The voice transmitting/receiving device according to claim 2, wherein the processing delay amount determination means determines the total amount within a range not exceeding the minimum of an own-site first allowable processing delay amount, whose magnitude varies with the acquired other-site user reaction time, and an other-site first allowable processing delay amount, whose magnitude varies with the acquired own-site user reaction time.
4. The voice transmitting/receiving device according to claim 2, wherein the communication means transmits own-site user reaction time data corresponding to the acquired own-site user reaction time to the voice transmitting/receiving device installed at the other site via the network.
5. The voice transmitting/receiving device according to claim 2, wherein the conversation feature amount acquisition means further acquires, as the conversation feature amount, an own-site user reaction waiting time, which is the time from when the own-site user finishes speaking at the own site until the own-site user starts speaking again.
6. The voice transmitting/receiving device according to claim 5, wherein the total amount is determined within a range not exceeding the minimum of an own-site second allowable processing delay amount, whose magnitude varies with the difference between the acquired own-site user reaction waiting time and the acquired other-site user reaction time, and an other-site second allowable processing delay amount, whose magnitude varies with the difference between an other-site user reaction waiting time, acquired as the time from when the other-site user finishes speaking at the other site until the other-site user starts speaking again, and the acquired own-site user reaction time.
7. The voice transmitting/receiving device according to claim 6, wherein the processing delay amount determination means determines the total amount within a range not exceeding the minimum of an own-site first allowable processing delay amount, whose magnitude varies with the acquired other-site user reaction time, an other-site first allowable processing delay amount, whose magnitude varies with the acquired own-site user reaction time, the own-site second allowable processing delay amount, and the other-site second allowable processing delay amount.
8. The voice transmitting/receiving device according to claim 1, comprising a buffer that temporarily stores the conversation data before or after the transmission and reception, wherein the processing delay amount control means controls a buffer capacity of the buffer based on the determined total amount.
9. The voice transmitting/receiving device according to claim 1, wherein the processing delay amount control means controls, based on the determined total amount, a coding rate at which the conversation data is encoded.
10. The voice transmitting/receiving device according to claim 1, further comprising transmission state acquisition means for acquiring a transmission state of the network, wherein the processing delay amount determination means determines the total amount based on the acquired conversation feature amount and the acquired transmission state.
11. The voice transmitting/receiving device according to claim 10, wherein the transmission state acquisition means acquires a transmission delay amount of the network as the transmission state of the network.
12. The voice transmitting/receiving device according to claim 1, further comprising statistical processing means for statistically processing the acquired conversation feature amount, wherein the processing delay amount determination means determines the total amount based on the statistically processed conversation feature amount.
13. The voice transmitting/receiving device according to claim 1, further comprising storage means for storing the acquired conversation feature amount.
14. A voice transmission/reception system that comprises voice transmitting/receiving devices installed at each of a plurality of sites, each accommodated in a network, and that is capable of establishing a conversation between users present at the plurality of sites via the voice transmitting/receiving devices, wherein each voice transmitting/receiving device comprises:
communication means for transmitting and receiving, via the network, conversation data that includes at least voice data and represents the content of the conversation, to and from the voice transmitting/receiving device installed at another site, other than the own site, among the plurality of sites;
conversation feature amount acquisition means for acquiring, for each of an own-site user present at the own site and an other-site user present at the other site, a predetermined conversation feature amount related to utterance timing in the conversation;
processing delay amount determination means for determining, based on the acquired conversation feature amount, the total amount, across the entire voice transmission/reception system, of a processing delay amount of the conversation data whose magnitude corresponds both to the magnitude of the time delay of the conversation and to the level of the voice quality of the conversation; and
processing delay amount control means for controlling the processing delay amount so that the determined total amount is satisfied.
15. A server device for use in a voice transmission/reception system that comprises the server device, accommodated in a network, and voice transmitting/receiving devices installed at each of a plurality of sites, each voice transmitting/receiving device including communication means for transmitting and receiving, via the network, conversation data that includes at least voice data and represents the content of a conversation, to and from the voice transmitting/receiving device installed at another site, other than the own site, among the plurality of sites, the system being capable of establishing the conversation between users present at the plurality of sites via the voice transmitting/receiving devices, the server device comprising:
conversation feature amount acquisition means for acquiring, from the plurality of voice transmitting/receiving devices via the network, a predetermined conversation feature amount related to utterance timing in the conversation for each of an own-site user present at the own site and an other-site user present at the other site;
processing delay amount determination means for determining, based on the acquired conversation feature amount, the total amount, across the entire voice transmission/reception system, of a processing delay amount of the conversation data whose magnitude corresponds both to the magnitude of the time delay of the conversation and to the level of the voice quality of the conversation; and
notification means for notifying the plurality of voice transmitting/receiving devices of the determined total amount via the network.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2010/062558 WO2012014275A1 (en) | 2010-07-26 | 2010-07-26 | Audio transmitting/receiving device, audio transmitting/receiving system and server device |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2010/062558 WO2012014275A1 (en) | 2010-07-26 | 2010-07-26 | Audio transmitting/receiving device, audio transmitting/receiving system and server device |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2012014275A1 (en) | 2012-02-02 |
Family
ID=45529523
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2010/062558 Ceased WO2012014275A1 (en) | 2010-07-26 | 2010-07-26 | Audio transmitting/receiving device, audio transmitting/receiving system and server device |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2012014275A1 (en) |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPH10200580A (en) * | 1997-01-16 | 1998-07-31 | Matsushita Electric Ind Co Ltd | Audio packet playback method |
| JP2002164921A (en) * | 2000-11-27 | 2002-06-07 | Oki Electric Ind Co Ltd | Quality controller for voice packet communication |
| JP2005303531A (en) * | 2004-04-08 | 2005-10-27 | Mitsubishi Electric Corp | Audio data receiving apparatus and audio data transmitting apparatus |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 10855286; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 10855286; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: JP |