
WO2012014275A1 - Audio transmission/reception device, audio transmission/reception system, and server device - Google Patents

Info

Publication number
WO2012014275A1
Authority
WO
WIPO (PCT)
Prior art keywords
conversation
processing delay
voice
delay amount
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/JP2010/062558
Other languages
English (en)
Japanese (ja)
Inventor
隆真 亀谷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Pioneer Corp
Original Assignee
Pioneer Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Pioneer Corp
Priority to PCT/JP2010/062558
Publication of WO2012014275A1
Anticipated expiration
Ceased (current legal status)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/22 Arrangements for supervision, monitoring or testing
    • H04M3/2227 Quality of service monitoring
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00 Data switching networks
    • H04L12/02 Details
    • H04L12/16 Arrangements for providing special services to substations
    • H04L12/18 Arrangements for providing special services to substations for broadcast or conference, e.g. multicast
    • H04L12/1813 Arrangements for providing special services to substations for broadcast or conference, e.g. multicast for computer conferences, e.g. chat rooms
    • H04L12/1827 Network arrangements for conference optimisation or adaptation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00 Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/40 Support for services or applications
    • H04L65/403 Arrangements for multi-party communication, e.g. for conferences

Definitions

  • The present invention relates to the technical field of voice transmission/reception apparatuses, voice transmission/reception systems, and server apparatuses that allow, for example, conversations between users at remote locations to proceed smoothly.
  • Patent Document 1 proposes a real-time audio reproducing device with this kind of purpose. The device is said to ensure overall conversation quality by keeping voice loss within the maximum voice drop rate at which a conversation can be sustained, and voice delay within the maximum delay time at which a conversation can be sustained.
  • Patent Document 2 discloses a configuration that includes a buffer monitoring unit and a buffer control unit which control, during silent periods, a buffer for absorbing delay fluctuation, based on the number of buffer underflows that occur during voiced periods.
  • JP 2002-223247 A Japanese Patent Laid-Open No. 11-215812
  • The longer the delay time, the more the voice quality of the conversation can be improved, because sound interruptions and errors can be suppressed.
  • However, if the delay time is large, it takes a long time for a user to recognize the start of the other party's utterance after finishing his or her own, so the smooth progress of the conversation is likely to be hindered. The technical idea of keeping the delay time within the maximum delay time at which a conversation can be sustained, as in the device of Patent Document 1, is therefore useful in that it finds a compromise between voice quality and smooth progress.
  • The maximum delay time at which a conversation can be sustained can, however, vary with the personality or the physical or mental load of each user participating in the conversation, or with the user's particular circumstances at the time. For example, the maximum delay time is short for an impatient user, long for a leisurely user, short for an irritated user, and long for a relaxed user; it may also become longer or shorter depending on the user's physical condition. Apart from that, it naturally becomes shorter if the user is in a hurry for some reason.
  • The maximum delay time described in Patent Document 1 is therefore not necessarily the maximum delay time at which a conversation is truly sustained. Even if the apparatus of Patent Document 1 is applied to a conversation, comfort may be reduced for some users. Conversely, the delay time may be too short, so that voice quality is lowered unnecessarily. That is, the apparatus of Patent Document 1 has the technical problem that the delay time is difficult to optimize, because it has no means of adapting to the situation on the user side.
  • The present invention has been made in view of such problems, and its object is to provide an audio transmission/reception apparatus, an audio transmission/reception system, and a server apparatus that can provide optimal conversation quality in accordance with the circumstances of the users participating in a conversation.
  • The voice transmission/reception apparatus is a voice transmission/reception apparatus in a voice transmission/reception system that includes voice transmission/reception apparatuses installed at a plurality of bases, each accommodated in a network, and that can establish a conversation between users present at the respective bases through those apparatuses. The apparatus comprises the following.
  • Communication means for transmitting and receiving conversation data representing the content of the conversation, including at least voice data, to and from the voice transmission/reception apparatus installed at another base other than the own base among the plurality of bases.
  • Conversation feature quantity acquisition means for acquiring, for each of the own-base user present at the own base and the other-base user present at the other base, a predetermined conversation feature quantity related to utterance timing in the conversation.
  • Processing delay amount determining means for determining, based on the acquired conversation feature quantity, the total amount over the entire voice transmission/reception system of the processing delay amount of the conversation data, whose magnitude corresponds both to the magnitude of the time delay of the conversation and to the level of the voice quality of the conversation.
  • Processing delay amount control means for controlling the processing delay amount so that the determined total amount is satisfied.
  • The voice transmission/reception system includes voice transmission/reception apparatuses installed at a plurality of bases, each accommodated in a network, and can establish a conversation between users present at the respective bases through those apparatuses. Each voice transmission/reception apparatus comprises the following.
  • Communication means for transmitting and receiving conversation data representing the content of the conversation, including at least voice data, to and from the voice transmission/reception apparatus installed at another base other than the own base among the plurality of bases.
  • Conversation feature quantity acquisition means for acquiring, for each of the own-base user present at the own base and the other-base user present at the other base, a predetermined conversation feature quantity related to utterance timing in the conversation.
  • Processing delay amount determining means for determining, based on the acquired conversation feature quantity, the total amount over the entire voice transmission/reception system of the processing delay amount of the conversation data, whose magnitude corresponds both to the magnitude of the time delay of the conversation and to the level of the voice quality of the conversation.
  • Processing delay amount control means for controlling the processing delay amount so that the determined total amount is satisfied.
  • The server apparatus is a server apparatus accommodated in a network, in a voice transmission/reception system that includes voice transmission/reception apparatuses installed at a plurality of bases and that can establish a conversation between users present at the respective bases via those apparatuses.
  • Each voice transmission/reception apparatus comprises communication means for transmitting and receiving conversation data representing the content of the conversation, including at least voice data, to and from the voice transmission/reception apparatus installed at a base other than its own.
  • The server apparatus comprises conversation feature quantity acquisition means for acquiring a predetermined conversation feature quantity from the plurality of voice transmission/reception apparatuses via the network.
  • It further comprises processing delay amount determining means for determining, based on the acquired conversation feature quantity, the total amount over the entire voice transmission/reception system of the processing delay amount of the conversation data, whose magnitude corresponds both to the magnitude of the time delay of the conversation and to the level of the voice quality of the conversation.
  • It further comprises notifying means for notifying the plurality of voice transmission/reception apparatuses of the determined total amount via the network.
  • FIG. 2 is a block diagram conceptually showing a configuration of an audio transmission / reception device in the remote conference system of FIG. 1. It is a flowchart of the buffer capacity control performed in the audio
  • An embodiment of the voice transmission/reception apparatus of the present invention is a voice transmission/reception apparatus in a voice transmission/reception system that includes voice transmission/reception apparatuses installed at a plurality of bases, each accommodated in a network, and that can establish a conversation between users present at the respective bases via those apparatuses.
  • The apparatus comprises communication means for transmitting and receiving, via the network, conversation data representing the content of the conversation, including at least voice data, to and from the voice transmission/reception apparatus installed at another base other than the own base among the plurality of bases.
  • It further comprises conversation feature quantity acquisition means for acquiring, for each of the own-base user present at the own base and the other-base user present at the other base, a predetermined conversation feature quantity related to the timing of utterance in the conversation, and processing delay amount determining means for determining, based on the acquired conversation feature quantity, the total amount of the processing delay amount of the conversation data, whose magnitude corresponds both to the magnitude of the time delay of the conversation and to the level of the voice quality of the conversation, in the entire voice transmission/reception system.
  • The voice transmission/reception apparatus according to the embodiment is one of the voice transmission/reception apparatuses that make up a voice transmission/reception system, that is, a system capable of establishing a conversation between users present at a plurality of different bases.
  • The voice transmission/reception apparatus according to the embodiment is accommodated in a network either at all times or only when some condition is satisfied (the kind of condition is not limited in any way).
  • The “network” is a concept encompassing various data communication networks, for example a WAN (Wide Area Network), a LAN (Local Area Network), or the Internet, connected as appropriate via a WAN or LAN line, a telephone line, ADSL (Asymmetric Digital Subscriber Line), an optical fiber cable, or the like.
  • The voice transmission/reception apparatus has communication means, by whose action it can transmit and receive conversation data via the network to and from a voice transmission/reception apparatus installed at another base different from its own (that is, the base on the other side of the conversation).
  • The transmission and reception of various information, data, or data files, including the conversation data, between voice transmission/reception apparatuses installed at different bases may, for example, go through an appropriate server apparatus.
  • P2P: Peer To Peer
  • The components of the voice transmission/reception system may likewise be chosen freely.
  • The “own base” means the base where the voice transmission/reception apparatus according to the embodiment is installed; it does not refer to one specific base.
  • The “conversation data” in the embodiment is a concept covering data that is necessary or meaningful for establishing a conversation between users at different bases, and is defined as including at least voice data.
  • The conversation data may also include image data, video data, and the like.
  • “Conversation” means an act of communication between users that involves voice, and it arises in diverse situations.
  • The “conversation” in the embodiment preferably includes, in addition to everyday conversation, the remarks of each participant in a conference or meeting.
  • The data transmitted to the voice transmission/reception apparatus installed at the other site may be, for example, encoded data obtained by passing the analog voice of the own-site user, collected by sound collecting means such as a microphone, through an encoder or the like.
  • This data is received by the voice transmission/reception apparatus installed at the other site (that is, it becomes received data at the other site), is decoded by a decoder or the like, and can finally be provided to the other-site user as analog voice via output means such as a speaker. The same applies when the uttering side and the receiving side are switched.
  • By repeating the transmission and reception of conversation data in this way, the voice transmission/reception apparatus makes it possible to establish a conversation between users at different bases.
  • The voice transmission/reception apparatus includes processing delay amount determining means for determining the total amount of processing delay of the conversation data in the entire voice transmission/reception system (hereinafter referred to as the “total system processing delay amount” as appropriate), and processing delay amount control means for controlling the processing delay amount so that the total system processing delay amount is satisfied.
  • The “processing delay amount of the conversation data” is the delay that arises when the voice transmission/reception apparatus at each of the plurality of bases processes conversation data. It means a controllable delay whose magnitude corresponds both to the time delay of the conversation between the own-site user and the other-site user and to the level of the voice quality of that conversation.
  • The total system processing delay amount is the sum of the processing delay amounts on the uttering side and the receiving side. To obtain high voice quality with few sound interruptions and coding errors, for example, a larger total system processing delay amount is better.
  • The time delay of the conversation, on the other hand, affects conversation quality in a dimension different from voice quality. More specifically, if the time delay becomes too large, it becomes difficult for users to share the same time axis, and the real-time nature of the conversation is reduced. This loss of real-time nature itself reduces comfort and causes discomfort to the user. It also causes a secondary deterioration in conversation quality through reduced smoothness, for example a user noticing only after starting to speak that the other party is already speaking, and it easily leads to a kind of vicious circle.
  • The voice quality and the comfort of a conversation are thus in a trade-off relationship, so the total system processing delay amount, which affects both, needs to be optimized.
  • The conversation data according to the embodiment may include image data and video data that reveal a user's facial expression.
  • However, transmitting and receiving this kind of image or video data is itself a high-load process. If voice and image or video are output in synchronization, the time delay of the conversation increases. If they are instead controlled independently, synchronization between facial expression and voice is lost, so the effect of suppressing the loss of comfort is limited. That is, the problems described above cannot be solved simply by adding images or video.
  • The conversation feature quantity acquisition means acquires a predetermined conversation feature quantity for each of the own-site user present at the own site and the other-site user present at the other site.
  • The processing delay amount determining means determines the total system processing delay amount based on the acquired conversation feature quantity.
  • The conversation feature quantity means a physical quantity, a control quantity, or a standardized index value related to the timing of utterance in a conversation.
  • The timing of utterance in a conversation falls roughly into two kinds: the timing of starting to speak after the other party's utterance ends, and the timing of starting to speak again after one's own utterance ends, having judged that the other party is not responding.
  • These are important elements that define the basic rhythm of a conversation, and they can be significantly related to the optimum total system processing delay amount. For example, the optimum total system processing delay amount can be said to be small for a user whose utterance timing is early, and large for a user whose utterance timing is late.
  • The timing of utterances varies widely with each user's personality and preferences. Even for the same user, it can change with the physical or mental load or the time available at that moment, or with various other circumstances.
  • With the conversation feature quantity acquisition means, this widely varying utterance timing can be tracked continually and with high accuracy.
  • Once the total system processing delay amount has been determined, the processing delay amount control means controls the processing delay amount so that the determined total is satisfied.
  • The operation of the processing delay amount control means may include determining the processing delay amount to be borne by each base.
  • The total system processing delay amount is the total processing delay to be secured across the entire system, including the voice transmission/reception apparatus on the uttering side and the one on the receiving side.
  • As long as this total is unchanged, the rhythm, tempo, and so on of the conversation do not change, so the conversation quality related to comfort does not change.
  • Cases may therefore arise in which the processing delay amount can be distributed between the uttering side and the receiving side as appropriate. In such cases, each side's processing delay amount can be chosen freely within the total system processing delay amount.
  • The determined total system processing delay amount may be shared with the apparatus at the other site, and the distribution ratio may be decided by negotiation between the two according to a predetermined algorithm.
  • Alternatively, the processing delay amount at each base may simply be an equal share of the total system processing delay amount.
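
The distribution of the total among the bases can be illustrated with a minimal sketch. It assumes the total system processing delay amount is expressed in seconds and is divided according to per-site weights. The function name `split_total_delay` and its `weights` parameter are hypothetical and do not come from the source, which leaves the distribution algorithm open. Equal weights give the equal division mentioned above.

```python
from typing import List, Optional


def split_total_delay(total_s: float, n_sites: int = 2,
                      weights: Optional[List[float]] = None) -> List[float]:
    """Split a total system processing delay (seconds) across sites.

    Hypothetical helper: with no weights, every site gets an equal share.
    """
    if total_s < 0:
        raise ValueError("total delay must be non-negative")
    if weights is None:
        weights = [1.0] * n_sites
    if len(weights) != n_sites or any(w < 0 for w in weights):
        raise ValueError("need one non-negative weight per site")
    scale = sum(weights)
    if scale <= 0:
        raise ValueError("at least one weight must be positive")
    # Each share is proportional to its weight, so the shares add up to total_s.
    return [total_s * w / scale for w in weights]
```

Whatever weights are negotiated, the shares always add up to the determined total, which is the property the text relies on: the conversation tempo depends on the total, not on how it is divided.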
  • With the voice transmission/reception apparatus, the voice quality and the comfort of the conversation can always be balanced according to the user's state at that time. That is, optimal conversation quality is provided.
  • If the processing delay amount or the total system processing delay amount is fixed in advance, whether experimentally, empirically, theoretically, or by simulation, the user's actual circumstances at the time are not taken into account at all. Such a value can accommodate a user's basic personality insofar as it can be estimated in advance, but it has almost no adaptability to uncertainties and disturbances. The total system processing delay amount therefore tends to deviate from the true optimum, which differs from user to user and changes even for the same user. As a result, either the total is held down unnecessarily when it could have been increased to improve voice quality, or it is too large, so that the tempo or rhythm of the conversation departs from the user's own and comfort is reduced.
  • By contrast, the total system processing delay amount here is set based on the conversation feature quantities acquired for both the own-site user and the other-site user. The largest total that causes neither user discomfort can therefore be set, which is extremely useful in practice.
  • “Acquisition” by the conversation feature quantity acquisition means denotes finally settling on a value to be used as a reference for control; the process by which this is done is not limited in any way. The means may obtain the conversation feature quantity from outside, for example through a network, or may obtain it through internal processing such as calculation, derivation, estimation, identification, or selection. The process for acquiring the own-site user's conversation feature quantity may also differ from that for the other-site user's.
  • The conversation feature quantity acquisition means preferably repeats the acquisition every time an utterance occurs, or at regular or irregular intervals. This is more effective because it can track the latest situation on the user side. Applying statistical processing based on previously acquired conversation feature quantities is also effective, because it prevents sudden changes in the conversation feature quantity and improves the accuracy of its estimation.
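
One possible form of the statistical processing mentioned above is an exponential moving average over repeated measurements. The sketch below is an assumption for illustration only: the source does not name a specific method, and the class name `SmoothedFeature` and the smoothing factor `alpha` are hypothetical.

```python
from typing import Optional


class SmoothedFeature:
    """Exponentially smoothed conversation feature (e.g. a reaction time in seconds).

    Hypothetical helper: alpha close to 1 follows new samples quickly,
    alpha close to 0 suppresses sudden changes more strongly.
    """

    def __init__(self, alpha: float = 0.2) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.value: Optional[float] = None

    def update(self, sample: float) -> float:
        """Fold one new measurement into the running estimate and return it."""
        if self.value is None:
            # The first measurement initialises the estimate.
            self.value = sample
        else:
            self.value = self.alpha * sample + (1.0 - self.alpha) * self.value
        return self.value
```

Calling `update` once per utterance keeps the estimate current while damping a single unusually fast or slow reaction.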
  • The conversation feature quantity acquisition means, the processing delay amount determining means, and the processing delay amount control means of the voice transmission/reception apparatus may each, or together, take various forms, for example various processing units such as a CPU (Central Processing Unit) or an MPU (Micro Processing Unit), various processors, various controllers, or various functional modules.
  • The conversation feature quantity acquisition means acquires, as the conversation feature quantity, at least one of an own-site user reaction time and an other-site user reaction time. The own-site user reaction time is the time from when the utterance output of the other-site user ends at the own site until the own-site user starts to speak. The other-site user reaction time is the time from when the utterance output of the own-site user ends at the other site until the other-site user starts to speak.
  • That is, the own-site user reaction time, the other-site user reaction time, or both are acquired as the conversation feature quantity.
  • The “reaction time” is the time from when one party's utterance output ends until the other party starts to speak, and is a conversation feature quantity that correlates with the “timing of starting to speak after the other party's utterance ends” described above.
  • “Utterance output” means output via output means such as a speaker; its start and end times are later than the start and end times of the other party's actual speech. The reaction time is therefore each user's pure reaction time, free of the effects of the processing delay and the various other delays described above, and it defines the basic tempo and rhythm of conversation for that user. It is thus suitable as a reference value for determining the optimum total system processing delay amount.
  • Because the reaction time is specific to each user, the own-site and other-site user reaction times naturally differ; they may be nearly equal or may differ greatly.
  • Because the total system processing delay amount is determined from at least one of the own-site and other-site user reaction times, and preferably both, the best conversation quality can be provided while keeping at least the required level of comfort.
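
The measurement of the own-site user reaction time can be sketched as follows. This is a minimal illustration under stated assumptions: the apparatus is assumed to log two kinds of timestamped events on its own clock, and the `Event` type and the event names `remote_output_end` and `local_utterance_start` are hypothetical, not taken from the source.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Event:
    t: float    # seconds on the own-site clock
    kind: str   # "remote_output_end" or "local_utterance_start" (hypothetical names)


def own_site_reaction_times(events: List[Event]) -> List[float]:
    """Return one reaction time per (remote output end -> local utterance start) pair."""
    times: List[float] = []
    pending_end: Optional[float] = None
    for ev in sorted(events, key=lambda e: e.t):
        if ev.kind == "remote_output_end":
            # The other party's speech has just finished playing at this site.
            pending_end = ev.t
        elif ev.kind == "local_utterance_start" and pending_end is not None:
            # Both timestamps are taken locally, so network and processing
            # delays do not enter the measured interval.
            times.append(ev.t - pending_end)
            pending_end = None
    return times
```

Because the interval starts at the end of the speaker output rather than at the end of the other party's actual speech, the result matches the "pure" reaction time described above.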
  • The processing delay amount determining means determines the total amount within a range at or below the minimum of the own-site first allowable processing delay amount, which varies with the acquired other-site user reaction time, and the other-site first allowable processing delay amount, which varies with the acquired own-site user reaction time.
  • The total system processing delay amount is thus determined within a range at or below the minimum of the own-site first allowable processing delay amount and the other-site first allowable processing delay amount (the smaller of the two if the conversation is between two parties). Conversation quality that causes no discomfort to either the own-site user or the other-site user can therefore be provided.
  • The calculation of the own-site first allowable processing delay amount and the other-site first allowable processing delay amount may be performed in any component of the voice transmission/reception system. That is, as long as the total system processing delay amount is finally determined within a range at or below their minimum, the voice transmission/reception apparatus according to the embodiment need not perform all of these calculations itself.
  • For example, the voice transmission/reception apparatus need only acquire, via the network, the other-site user reaction time detected at the other site and calculate the own-site first allowable processing delay amount. In this case, the other site acquires the own-site user reaction time transmitted from the own site and calculates the other-site first allowable processing delay amount.
  • When processing is distributed among the apparatuses in this way, the load is kept from concentrating on a single voice transmission/reception apparatus.
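
The bound described above can be sketched for a two-party conversation. The text only states that each first allowable processing delay amount varies with the peer's reaction time; the linear relation and the `gain` factor below are assumptions made purely for illustration, and the function names are hypothetical.

```python
def first_allowable_delay(peer_reaction_time_s: float, gain: float = 0.5) -> float:
    """First allowable processing delay for one site, in seconds.

    ASSUMPTION: a simple proportional mapping from the peer's reaction time.
    The source does not specify the actual relation.
    """
    if peer_reaction_time_s < 0:
        raise ValueError("reaction time must be non-negative")
    return gain * peer_reaction_time_s


def total_delay_upper_bound(own_reaction_s: float, other_reaction_s: float,
                            gain: float = 0.5) -> float:
    """Upper bound on the total system processing delay for two parties."""
    # Each site derives its own bound from the *other* user's reaction time.
    own_first = first_allowable_delay(other_reaction_s, gain)
    other_first = first_allowable_delay(own_reaction_s, gain)
    # The total must not exceed the smaller of the two bounds.
    return min(own_first, other_first)
```

In a distributed arrangement, each site would evaluate `first_allowable_delay` locally on the reaction time it receives from its peer, and only the final `min` needs both results.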
  • The communication means transmits, via the network, own-site user reaction time data corresponding to the acquired own-site user reaction time to the voice transmission/reception apparatus installed at the other site.
  • The acquired own-site user reaction time is thus transmitted to the other site as own-site user reaction time data. Part of the calculation of the various reference values ultimately used to determine the total system processing delay amount can therefore be left to the apparatus at the other site, based on the own-site user reaction time.
  • The processing load can thereby be distributed.
  • The conversation feature quantity acquisition means further acquires, as the conversation feature quantity, an own-site user reaction waiting time, which is the time from when the own-site user finishes speaking at the own site until the own-site user starts to speak again.
  • That is, the own-site user reaction waiting time is acquired as a conversation feature quantity.
  • The “reaction waiting time” is the time from when a user's own utterance ends until that user starts to speak again, and is a conversation feature quantity that correlates with the “timing of starting to speak again” described above. The reaction waiting time reflects each user's personality, preferences, and circumstances, and defines the basic tempo and rhythm of conversation for that user. It is therefore suitable as a reference value for determining the optimum total system processing delay amount.
  • The total amount may be determined within a range at or below the minimum of the own-site second allowable processing delay amount, which varies with the difference between the acquired own-site user reaction waiting time and the acquired other-site user reaction time, and the other-site second allowable processing delay amount, which varies with the difference between the other-site user reaction waiting time and the own-site user reaction time.
  • The total system processing delay amount is thus determined within a range at or below the minimum of the own-site second allowable processing delay amount and the other-site second allowable processing delay amount (the smaller of the two if the conversation is between two parties). Conversation quality that causes no discomfort to either the own-site user or the other-site user can therefore be provided.
  • The second allowable processing delay amount, which varies with the difference between the reaction waiting time and the reaction time, is an extremely useful reference value in practice for avoiding simultaneous speech by both parties, which arises when one user speaks again after wrongly judging that the other user has not reacted to his or her utterance.
  • The calculation of the own-site second allowable processing delay amount and the other-site second allowable processing delay amount may be performed in any component of the voice transmission/reception system. That is, as long as the total system processing delay amount is finally determined within a range at or below their minimum, the voice transmission/reception apparatus according to the embodiment need not perform all of these calculations itself.
  • For example, the voice transmission/reception apparatus may only acquire, via the network, the other-site user reaction time detected at the other site, and calculate the own-site second allowable processing delay amount using it together with the own-site user reaction waiting time acquired at the own site.
  • In this case, the other site acquires the own-site user reaction time transmitted from the own site and calculates the other-site second allowable processing delay amount using the other-site user reaction waiting time acquired at the other site.
  • the processing delay amount determination means may determine the total amount within a range equal to or less than the minimum of the following: the own-site first allowable processing delay amount, which changes depending on the acquired other-site user reaction time; the other-site first allowable processing delay amount, which changes depending on the acquired own-site user reaction time; the own-site second allowable processing delay amount; and the other-site second allowable processing delay amount.
  • the total system processing delay amount is determined within a range equal to or smaller than the minimum of the own-site first allowable processing delay amount, the other-site first allowable processing delay amount, the own-site second allowable processing delay amount, and the other-site second allowable processing delay amount, each calculated as a reference value reflecting the state of the user. Therefore, it is possible to more reliably suppress a decrease in conversation quality accompanying a decrease in comfort.
  • the voice transmitting/receiving apparatus includes a buffer for temporarily storing the conversation data before and after the transmission/reception, and the processing delay amount control means controls the buffer capacity of the buffer based on the determined total amount.
  • the size of the buffer capacity has a one-to-one relationship with the size of the amount of time delay of conversation, and unconditionally increasing it is not allowed in terms of overall conversation quality. That is, the buffer capacity is appropriate as an actual control target of the processing delay amount control means when controlling the processing delay amount according to the embodiment.
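  • The one-to-one relationship between buffer capacity and delay can be illustrated with a minimal sketch. It assumes constant-bit-rate encoded audio, which the text does not state; the function names are hypothetical and chosen only for illustration.

```python
def buffer_delay_ms(capacity_bytes: int, bitrate_bps: int) -> float:
    """Time needed to drain a full buffer at a constant bit rate, in ms."""
    if bitrate_bps <= 0:
        raise ValueError("bitrate_bps must be positive")
    return capacity_bytes * 8 * 1000 / bitrate_bps


def capacity_for_delay(delay_ms: float, bitrate_bps: int) -> int:
    """Inverse mapping: buffer capacity (bytes) that yields the given delay."""
    if bitrate_bps <= 0:
        raise ValueError("bitrate_bps must be positive")
    return int(delay_ms * bitrate_bps / 8000)


# Example values (assumed): a 4000-byte buffer holding 64 kbit/s audio.
example_delay = buffer_delay_ms(4000, 64000)
example_capacity = capacity_for_delay(example_delay, 64000)
```

  • Under this assumption a larger buffer always means a proportionally longer delay, which is why the capacity can act as the control target.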
  • When the buffer capacity is taken as the control target, there is a degree of freedom as to how the buffer capacity provided in the voice transmission/reception system is controlled based on the determined total system processing delay amount.
  • For example, when the buffer in each voice transmission/reception device comprises a reception buffer that temporarily stores received data after reception and a transmission buffer that temporarily stores transmission data before transmission, the processing delay amount control means can determine the distribution ratio of the buffer capacity among these buffers relatively freely, provided that the processing delay amount corresponding to the sum of the buffer capacities of the reception buffer and transmission buffer on the own-site side and the reception buffer and transmission buffer on the other-site side equals the total system processing delay amount.
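  • The freedom in distributing the buffer capacity can be sketched as a weighted split of the total delay budget. The weighting scheme, the function name, and the buffer labels below are assumptions for illustration; the text only requires that the shares sum to the total system processing delay amount.

```python
def distribute_buffer_delay(total_delay_ms, weights):
    """Split a total delay budget among named buffers in proportion to weights."""
    if total_delay_ms < 0:
        raise ValueError("total_delay_ms must be non-negative")
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive value")
    return {name: total_delay_ms * w / weight_sum for name, w in weights.items()}


# Hypothetical labels: own-site send/receive and other-site send/receive buffers.
shares = distribute_buffer_delay(
    160.0,
    {"own_send": 1, "own_recv": 3, "other_send": 1, "other_recv": 3},
)
```

  • Any other ratio would be equally valid as long as the shares add up to the same total.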
  • the processing delay amount control means controls a coding rate for coding the conversation data based on the determined total amount.
  • an encoding process such as encoding is preferably required.
  • the level of the encoding rate which means the encoding bit rate for encoding the input voice at the local site, corresponds to the level of the voice quality and the amount of time delay.
  • this type of encoding rate is appropriate as an actual control target of the processing delay amount control means in controlling the processing delay amount according to the embodiment.
  • the voice transmission/reception apparatus further includes a transmission state acquisition unit that acquires a transmission state of the network, and the processing delay amount determination unit determines the total amount based on the acquired conversation feature amount and the acquired transmission state.
  • the total system processing delay amount can be determined more accurately.
  • the transmission state acquisition means may acquire the transmission delay amount of the network as the transmission state of the network.
  • the network transmission delay amount defines the conversation time delay in the same dimension as the processing delay amount. Therefore, it is suitable as a transmission state to be reflected in the determination of the total system processing delay amount. Note that the transmission delay amount of such a network preferably indicates RTT (Round Trip Time).
  • the voice transmitting/receiving apparatus further includes statistical processing means for statistically processing the acquired conversation feature amount, and the processing delay amount determining means determines the total amount based on the statistically processed conversation feature amount.
  • the statistical processing is performed on the acquired conversation feature by the statistical processing means. That is, the conversation feature value acquired over a past fixed or indefinite period is reflected in the conversation feature value to be reflected in the determination of the total system processing delay amount. For this reason, the reliability of the conversation feature amount can be improved, and the conversation quality can be stably maintained.
  • the practical aspect of the statistical processing is not particularly limited, but may be a process of adding and averaging conversation feature amounts acquired over a certain period in the past as a suitable form. At this time, a measure such as excluding the apparent abnormal value from the sample may be taken.
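  • One possible form of such averaging with exclusion of apparent abnormal values is sketched below. The median-based exclusion rule and the `max_ratio` threshold are assumptions made for illustration; the text does not specify how abnormal values are identified.

```python
from statistics import mean, median


def robust_average(samples, max_ratio=3.0):
    """Average the samples after dropping values far from the median.

    A sample is kept when it lies within [median / max_ratio, median * max_ratio].
    """
    if not samples:
        raise ValueError("samples must not be empty")
    mid = median(samples)
    if mid <= 0:
        return mean(samples)  # no meaningful ratio test; fall back to a plain mean
    kept = [s for s in samples if mid / max_ratio <= s <= mid * max_ratio]
    return mean(kept)


# Example reaction times in seconds; 9.0 is treated as an abnormal value.
averaged = robust_average([0.5, 0.6, 0.7, 9.0])
```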
  • the voice transmitting / receiving apparatus further includes storage means for storing the acquired conversation feature quantity.
  • the acquired conversation feature amount is stored in storage means, which may take various forms such as an HDD (Hard Disk Drive), a flash memory, an FDD (Floppy (registered trademark) Disk Drive), a DVD, or a BDD (Blu-ray Disc Drive). Therefore, the total system processing delay amount can be determined smoothly. In addition, since the initial value of the total system processing delay amount for the next similar conversation can be determined based on the stored conversation feature amount, the time until the total system processing delay amount converges to its optimum value can also be shortened.
  • Embodiments according to the voice transmission / reception system of the present invention include voice transmission / reception devices installed at a plurality of bases, each accommodated in a network, and users existing at the plurality of bases via the voice transmission / reception devices, respectively.
  • a voice transmission / reception system capable of establishing a conversation between each other, wherein the voice transmission / reception device is connected to the voice transmission / reception device installed at another base other than the base among the plurality of bases.
  • a processing delay amount determination for determining a total amount of the processing delay amount of the conversation data corresponding to the size of the time delay amount of the conversation and the speech quality of the conversation, respectively, in the entire voice transmission / reception system.
  • a processing delay amount control means for controlling the processing delay amount so that the determined total amount is satisfied.
  • the embodiment according to the voice transmission / reception system includes the voice transmission / reception device according to the above-described embodiment, it is possible to obtain the optimum conversation quality.
  • Embodiments according to the server device of the present invention include a server device, each of which is accommodated in a network, and a voice transmitting / receiving device installed at a plurality of bases and installed at another base other than the base among the plurality of bases
  • a voice transmission / reception device including communication means for transmitting / receiving conversation data representing the content of the conversation including at least voice data via the network, and via the voice transmission / reception device
  • the server device in a voice transmission / reception system capable of establishing a conversation between users respectively present at a plurality of bases, and present at the local base from the plurality of voice transmission / reception devices via the network
  • the size of the conversation data corresponds to the magnitude of the time delay amount of the conversation and corresponds to the level of the voice quality of the conversation.
  • the processing delay amount determining means for determining the total amount of the
  • the embodiment according to the server apparatus includes the conversation feature amount acquisition unit and the processing delay amount determination unit according to the embodiment of the voice transmission / reception apparatus described above, it is possible to obtain the optimum conversation quality.
  • When the server device takes charge of the conversation feature amount acquisition process and the total system processing delay amount determination process in this way, the burden on the voice transmission/reception devices that make up the voice transmission/reception system can be reduced remarkably, which is beneficial. In this case, for example, a unit similar to the processing delay amount control means in each voice transmission/reception device may control the processing delay amount so that the optimum value of the total system processing delay amount notified by the notification means provided in the server device is satisfied.
  • an appropriate command may be executed on the server device side to control the audio transmission / reception device.
  • Since the server device side can also define how the processing delay amount is distributed among the voice transmission/reception devices based on the total system processing delay amount, the burden on the voice transmission/reception device side can be reduced further.
  • Since the communication means, the conversation feature quantity acquisition means, the processing delay amount determination means, and the processing delay amount control means are provided, the optimum conversation quality can be obtained.
  • According to the embodiment of the voice transmission/reception system of the present invention, since the embodiment of the voice transmission/reception apparatus of the present invention is provided, the optimum conversation quality can be obtained.
  • Since the conversation feature quantity acquisition means, the processing delay amount determination means, and the notification means are provided, the optimum conversation quality can be obtained.
  • FIG. 1 is a schematic configuration diagram conceptually showing the configuration of the remote conference system 1.
  • A remote conference system 1 is an audio conference system, as an example of the "audio transmission/reception system" according to the present invention, accommodated in a network 20 (an IP (Internet Protocol) network), which is a wide-area network connecting a base X (an example of the "own base" according to the present invention) and a base Y (an example of the "other base" according to the present invention) that are separated from each other.
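  • The remote conference system 1 includes a voice transmitting/receiving apparatus 10A, installed at the site X and used by the user A (an example of the "own site user" according to the present invention), and a voice transmitting/receiving apparatus 10B, installed at the site Y and used by the user B.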
  • the user A and the user B can perform a smooth audio conference by exchanging audio information by the remote conference system 1.
  • FIG. 2 is a block diagram conceptually showing the configuration of the audio transmitting / receiving apparatus 10A.
  • the same parts as those in FIG. 1 are denoted by the same reference numerals, and the description thereof is omitted as appropriate.
  • the hardware configuration of the voice transmitting / receiving apparatus 10A is the same as that of the voice transmitting / receiving apparatus 10B.
  • The voice transmission/reception device 10A includes a voice input unit 100, a voice output unit 200, a conversation feature amount detection unit 300, a conversation feature amount statistical processing unit 400, a storage device 500, a processing delay amount determination unit 600, an RTT measurement unit 700, a processing delay information communication unit 800, and a buffer control unit 900.
  • the voice input unit 100 includes a voice input unit 110, an encoder 120, a transmission buffer 130, and a voice data transmission unit 140.
  • The voice input unit 100 is a unit capable of transmitting the voice of the user A as voice data to the voice transmission/reception device 10B via the network 20.
  • The voice input unit 110 is an input interface whose input terminal (not shown) is connected to a microphone (not shown), and is configured to be able to capture the speech of the user A, input via the microphone, as an analog voice signal.
  • the encoder 120 is a digital conversion device that encodes an analog audio signal input via the audio input unit 110 at a predetermined encoding rate (encoding bit rate) and converts the encoded audio signal into digital audio data.
  • The encoding may follow a standard such as MPEG (Moving Picture Experts Group).
  • the transmission buffer 130 is a volatile storage device that temporarily accumulates digital audio data obtained through the encoder 120 by a data amount corresponding to a predetermined transmission buffer capacity Da_s.
  • The audio data transmission unit 140 is a transmission interface whose output terminal (not shown) is connected to the network 20, and is configured to sequentially transmit the digital audio data output from the transmission buffer 130 to the voice transmitting/receiving apparatus 10B via the network 20. That is, the audio data transmission unit 140 is an example of the "communication unit" according to the present invention, and the transmitted digital audio data is an example of the "conversation data" according to the present invention.
  • the audio output unit 200 includes an audio data receiving unit 210, a reception buffer 220, a decoder 230, and an audio output unit 240, and is a unit that can output the voice of user B via an output device such as a speaker.
  • The audio data receiving unit 210 is a reception interface whose terminal (not shown) is connected to the network 20, and is configured to be able to sequentially capture the digital audio data, corresponding to the user B's speech, transmitted via the network 20. That is, the audio data receiving unit 210 is another example of the "communication unit" according to the present invention, and the received digital audio data is another example of the "conversation data" according to the present invention.
  • the reception buffer 220 is a volatile storage device that temporarily accumulates digital audio data obtained via the audio data reception unit 210 by a data amount corresponding to a predetermined reception buffer capacity Da_r.
  • the decoder 230 is an analog conversion device that decodes digital audio data sequentially output from the reception buffer 220 and converts it into analog audio data.
  • the function of the decoder 230 is paired with the encoding function of the encoder 120, and both are naturally configured to perform data conversion according to the same standard.
  • the audio output unit 240 is an output interface in which an output terminal (not shown) is connected to a speaker (not shown), and is configured to be able to output user B's speech through the speaker.
  • The conversation feature amount detection unit 300 is configured to be able to detect the user A reaction time R_A and the user A reaction waiting time T_A, both described later, as conversation feature amounts of the user A, and is an example of the "conversation feature amount acquisition means" according to the present invention.
  • the conversation feature quantity statistical processing unit 400 is an example of the “statistic processing means” according to the present invention configured to be able to statistically process the conversation feature quantity appropriately detected by the conversation feature quantity detection unit 300.
  • the conversation feature quantity statistical processing unit 400 is configured to be able to hold the detected conversation feature quantity with respect to a certain past sample, and is configured to perform an averaging process on the sample values of the held samples and output them.
  • the storage device 500 is a nonvolatile storage device such as an HDD or a flash memory, and is an example of the “storage unit” according to the present invention.
  • The storage area of the storage device 500 stores reaction time data DAT_R_A, which represents the average value of the user A reaction time R_A output via the conversation feature amount statistical processing unit 400, and reaction waiting time data DAT_T_A, which represents the average value of the user A reaction waiting time T_A.
  • The processing delay amount determination unit 600 includes an allowable processing delay estimation unit 610 and a negotiation unit 620, and is a control device, as an example of the "processing delay amount determination means" according to the present invention, configured to determine the buffer capacity Da_s of the transmission buffer 130 and the buffer capacity Da_r of the reception buffer 220.
  • the processing delay amount determination unit 600 is configured to be able to execute buffer capacity control, which will be described later, according to a control program stored in a ROM (Read Only Memory).
  • The allowable processing delay estimation unit 610 is a processor that executes various types of processing for determining the maximum allowable processing delay amount dmax, described later, which is the maximum processing delay amount allowed for the remote conference system 1 and is an example of the "total amount" according to the present invention.
  • the allowable processing delay estimation unit 610 includes a first calculation model and a second calculation model that are theoretically constructed in advance, and is configured to execute the processing based on these calculation models.
  • the negotiation unit 620 is a processor that negotiates the distribution of the buffer capacity as an example of the “processing delay amount” according to the present invention with the voice transmitting / receiving apparatus 10B at the site Y.
  • the negotiation unit 620 also determines control target values for the transmission buffer capacity Da_s and the reception buffer capacity Da_r through negotiation with the voice transmitting / receiving apparatus 10B.
  • the RTT measurement unit 700 is a processor configured to be able to measure the RTT that is the transmission delay amount of the network 20.
  • the RTT measuring unit 700 measures RTT using SR (Sender Report) or RR (Receiver Report) of RTCP (Real-time Transport Control Protocol).
  • The RTT defined here is the time from when digital audio data is transmitted from the voice transmitting/receiving apparatus 10A via the audio data transmission unit 140 until the corresponding response is received via the audio data receiving unit 210.
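  • As an illustration of RTT measurement from RTCP reports, the sketch below follows the round-trip calculation described in RFC 3550, where a sender subtracts the LSR (last SR timestamp) and DLSR (delay since last SR) fields of a received report block from the report's arrival time. All three values are in the 32-bit "middle" NTP format with units of 1/65536 second. The function name is hypothetical, and the text does not say that the RTT measurement unit 700 uses exactly this computation.

```python
def rtcp_round_trip_seconds(arrival_ntp32: int, lsr: int, dlsr: int) -> float:
    """Round-trip time from an RTCP report block, per the RFC 3550 formula.

    arrival_ntp32: arrival time of the report, middle 32 bits of the NTP timestamp
    lsr:           LSR field of the report block (same format)
    dlsr:          DLSR field of the report block, in 1/65536 s units
    """
    # Modulo 2**32 arithmetic keeps the result correct across timestamp wrap-around.
    rtt_units = (arrival_ntp32 - lsr - dlsr) & 0xFFFFFFFF
    return rtt_units / 65536.0


# Worked example taken from RFC 3550: A = 0xb710:8000, LSR = 0xb705:2000, DLSR = 0x0005:4000.
example_rtt = rtcp_round_trip_seconds(0xB7108000, 0xB7052000, 0x00054000)
```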
  • the processing delay information communication unit 800 is a communication interface whose output terminal is connected to the network 20 and performs transmission / reception of various data related to the above-described buffer amount target value formulation with the audio transmission / reception device 10B.
  • the processing delay information communication unit 800 transmits the user A reaction time R A and the processing delay amount d set based on each calculation model to the voice transmission / reception device 10B, and the user B reaction time R B from the voice transmission / reception device 10B. And the processing delay amount d set based on each calculation model is acquired.
  • the buffer control unit 900 is a processor as an example of the “processing delay amount control unit” according to the present invention configured to be able to variably control the buffer capacities of the transmission buffer 130 and the reception buffer 220.
  • FIG. 3 is a flowchart of buffer capacity control.
  • In the voice transmitting/receiving apparatus 10A, the above-described conversation data transmission/reception by the voice input unit 100 and the voice output unit 200 is executed as appropriate, separately from the buffer capacity control.
  • a transmission process for the user A reaction time R A and a reception process for the user B reaction time R B are executed (step S101).
  • The user A reaction time R_A and the user B reaction time R_B are each an example of the "conversation feature amount" according to the present invention; the former is an example of the "own-site user reaction time" and the latter is an example of the "other-site user reaction time".
  • In step S101, the reaction time data DAT_R_A is read from the storage device 500 and transmitted to the voice transmitting/receiving apparatus 10B via the network 20 by the processing delay information communication unit 800.
  • Meanwhile, the reaction time data DAT_R_B corresponding to the user B reaction time R_B is acquired from the voice transmitting/receiving apparatus 10B via the network 20 by the processing delay information communication unit 800, and is sent to the allowable processing delay estimation unit 610.
  • FIG. 4 is a timing chart for explaining the concept of reaction time.
  • In FIG. 4, it is assumed that the user A and the user B are talking in an ideal environment where no significant time delay occurs in the mutual voice transmission.
  • an ideal environment refers to, for example, an environment where conversations are made in a face-to-face state at the same site.
  • the utterance of the user A is finished at an arbitrary time T0 (see the hatched portion of the user A).
  • user B perceives the end of user A's utterance at time T0.
  • the utterance is started at time T1 after a delay time unique to the user B (see the hatched portion of the user B).
  • the delay time from the end time of one utterance to the start time of the other utterance is the reaction time.
  • the reaction time varies in various ways according to the user's personality, taste, mental load state, physical load state, and other various circumstances that can be specifically changed each time.
  • the user A reaction time R A is the reaction time of the user A at the site X
  • the user B reaction time R B is the reaction time of the user B at the site Y.
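  • Here, the end of one user's utterance is taken to be the time at which the utterance is output to the other user through a voice output means such as a speaker (i.e., the "speech output" according to the present invention). This is because a conversation is not established unless the utterance content has been recognized by the other user at the end of the utterance.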
  • the reaction time is calculated by a reaction time calculation process executed by the conversation feature amount detection unit 300 in each voice transmitting / receiving device.
  • the reaction time calculation process will be described with reference to FIG.
  • FIG. 5 is a flowchart of the reaction time calculation process.
  • the reaction time calculation process in FIG. 5 is a process executed at the site X.
  • First, it is determined whether or not there is a voice output from the user B (step S201). If there is no voice output from user B (step S201: NO), the process returns to step S201, and the series of processes is repeated. When there is a voice output from user B (step S201: YES), the time at that point is recorded as the final voice output time Top (step S202).
  • When the final voice output time Top is updated, it is determined whether or not the voice output from the user B has stopped (step S203). If user B's voice output continues (step S203: NO), the process returns to step S202, and the update of the final voice output time Top is continued. On the other hand, if there is no voice output from user B (step S203: YES), it is determined whether there is a voice input from user A (step S204).
  • The determination that there is no voice output from user B is made based on whether or not the length of the silent section exceeds a preset reference value, so as to avoid a misjudgment that the speech has ended while user B is still in the middle of a series of utterances.
  • When there is no voice input by the user A (step S204: NO), that is, while it is estimated that the user A is thinking about how to respond after the utterance output of the user B has ended, step S203 is repeatedly executed.
  • When user A's voice input is started, that is, when the utterance is started (step S204: YES), the time value corresponding to the difference between the time T at that point and the final voice output time Top updated in step S202 is determined as the user A reaction time R_A (step S205). When the user A reaction time R_A has been calculated, the process returns to step S201, and the series of processes is repeated.
  • the reaction time calculation process is executed as described above.
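  • The flow of steps S201 to S205 can be sketched as a small loop over sampled voice-activity flags. This is an illustrative reading, not the implementation in the text: the frame representation, the millisecond timestamps, and the function name are assumptions, and the output flag is assumed to have already been smoothed against short silent sections.

```python
def reaction_times_ms(frames):
    """Return user A reaction times from (time_ms, b_output_active, a_input_active) frames."""
    results = []
    top = None  # final voice output time Top; None until user B's voice has been output
    for time_ms, b_output_active, a_input_active in frames:
        if b_output_active:
            top = time_ms                   # S201/S202: keep updating Top while B is heard
        elif top is not None and a_input_active:
            results.append(time_ms - top)   # S204 YES -> S205: R_A = T - Top
            top = None                      # return to S201 and wait for B's next utterance
    return results


example_frames = [
    (0, False, False),
    (100, True, False),   # user B's voice is output
    (200, True, False),
    (300, False, False),  # silence while user A thinks
    (400, False, False),
    (500, False, True),   # user A starts speaking
    (600, False, True),
]
example_reaction_times = reaction_times_ms(example_frames)
```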
  • The user B reaction time R_B is calculated in the same manner in the voice transmitting/receiving apparatus 10B.
  • The user A reaction time R_A calculated by the reaction time calculation process is sent to the conversation feature amount statistical processing unit 400 each time it is calculated, and is subjected to statistical processing.
  • The statistical processing is an averaging process over a predetermined number of past samples. Note that the mode of the statistical processing is not limited to averaging.
  • The user A reaction time R_A that has been statistically processed by the conversation feature amount statistical processing unit 400 is stored in the storage device 500 as the reaction time data DAT_R_A.
  • The reaction time data DAT_R_A is updated as appropriate each time the user A reaction time R_A is calculated and statistically processed.
  • the reaction waiting time is a time value from the time when one user's utterance ends to the time when the one user starts speaking again.
  • the reaction waiting time varies in various ways according to the user's personality, taste, mental load state, physical load state, and other various circumstances that can be specifically changed each time.
  • the reaction waiting time is calculated by a reaction waiting time calculation process executed by the conversation feature amount detection unit 300 in each voice transmitting / receiving device.
  • the reaction waiting time calculation process will be described with reference to FIG.
  • FIG. 6 is a flowchart of the reaction waiting time calculation process. Note that the reaction waiting time calculation process in FIG. 6 is a process executed at the site X.
  • First, it is determined whether or not there is a voice input from the user A (step S301). If there is no voice input by user A (step S301: NO), the process returns to step S301, and the series of processes is repeated. When there is a voice input from the user A (step S301: YES), the time at that point is recorded as the final voice input time Tip (step S302).
  • When the final voice input time Tip is updated, it is determined whether or not there is no voice output from the user B (step S303). If there is a voice output from user B (step S303: NO), the process returns to step S301 on the assumption that the response from user B was returned in a time shorter than the user A reaction waiting time T_A.
  • If there is no voice output from user B (step S303: YES), it is determined whether there is a voice input from user A (step S304). If there is no voice input by user A (step S304: NO), the process returns to step S303. That is, the process is in a standby state until the voice input of the user A is resumed.
  • If there is a voice input by the user A (step S304: YES), it is determined whether or not the time value corresponding to the difference between the current time T and the final voice input time Tip updated in step S302 is greater than the reference value T0 (step S305).
  • The reference value T0 is a criterion for determining whether or not the voice input of the user A is merely part of a continuing series of utterances.
  • When the time value corresponding to T-Tip is equal to or less than the reference value T0 (step S305: NO), the process returns to step S301.
  • When the time value corresponding to T-Tip is larger than the reference value T0 (step S305: YES), that time value is determined as the user A reaction waiting time T_A (step S306).
  • the reaction waiting time calculation process is executed as described above.
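  • The flow of steps S301 to S306 can be sketched in the same style. The frame representation, the millisecond timestamps, and the function name are assumptions for illustration; `t0_ms` stands for the reference value T0 used to tell a resumed utterance from a continuing one.

```python
def reaction_waiting_times_ms(frames, t0_ms):
    """Return user A reaction waiting times from (time_ms, a_input_active, b_output_active) frames."""
    results = []
    tip = None  # final voice input time Tip; None when no wait is in progress
    for time_ms, a_input_active, b_output_active in frames:
        if b_output_active:
            tip = None  # S303 NO: user B responded first, so restart from S301
        if a_input_active:
            # S304 YES -> S305: a gap longer than T0 means user A spoke again after waiting
            if tip is not None and time_ms - tip > t0_ms:
                results.append(time_ms - tip)  # S306: T_A = T - Tip
            tip = time_ms  # S301 YES -> S302: update Tip while user A is speaking
    return results


example_wait_frames = [
    (0, True, False),
    (100, True, False),    # user A stops speaking after this frame
    (200, False, False),
    (700, False, False),
    (1200, True, False),   # user A speaks again without any response from user B
    (1300, True, False),
    (1400, False, True),   # user B responds
    (1500, False, False),
    (2500, True, False),
]
example_waiting_times = reaction_waiting_times_ms(example_wait_frames, t0_ms=500)
```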
  • The user B reaction waiting time T_B is calculated in the same manner in the voice transmitting/receiving apparatus 10B.
  • The user A reaction waiting time T_A calculated by the reaction waiting time calculation process is sent to the conversation feature amount statistical processing unit 400 each time it is calculated, and is subjected to statistical processing.
  • The statistical processing is an averaging process over a predetermined number of past samples. Note that the mode of the statistical processing is not limited to averaging.
  • The statistically processed user A reaction waiting time T_A is stored in the storage device 500 as the reaction waiting time data DAT_T_A.
  • The reaction waiting time data DAT_T_A is updated as appropriate each time the user A reaction waiting time T_A is calculated and statistically processed.
  • processing delay amount d is set based on the first calculation model and the second calculation model (step S103).
  • the processing delay amount d is set for each of the first calculation model and the second calculation model.
  • the processing delay amount d means the total processing delay amount of the remote conference system 1.
  • FIG. 7 is a timing chart for explaining the concept of the first calculation model.
  • the same reference numerals are given to the same portions as those in FIG. 4, and the description thereof will be omitted as appropriate.
  • the processing delay amount d in the remote conference system 1 is defined by the following equation (1).
  • d = da + db + dproc (1)
  • da is a delay amount that has a one-to-one correspondence with the buffer capacity of the voice transmitting/receiving apparatus 10A, and is the sum of the reception buffer delay amount da_r corresponding to the reception buffer capacity Da_r and the transmission buffer delay amount da_s corresponding to the transmission buffer capacity Da_s.
  • da = da_r + da_s
  • db is a delay amount that has a one-to-one correspondence with the buffer capacity of the voice transmitting/receiving apparatus 10B, and is the sum of the reception buffer delay amount db_r corresponding to the reception buffer capacity Db_r and the transmission buffer delay amount db_s corresponding to the transmission buffer capacity Db_s.
  • db = db_r + db_s
  • dproc is the processing delay of the encoder and decoder in each of the audio transmitting / receiving apparatuses 10A and 10B.
  • dproc is the sum of the encoder processing delay denca and the decoder processing delay ddeca in the voice transmitting/receiving apparatus 10A, and the encoder processing delay dencb and the decoder processing delay ddecb in the voice transmitting/receiving apparatus 10B. Note that dproc has a constant value if the encoding rate in the encoder and decoder is constant.
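  • The composition of the processing delay amount d in equation (1) can be written out as a short sketch. The class name, field names, and millisecond values are hypothetical and serve only to make the sum concrete.

```python
from dataclasses import dataclass


@dataclass
class SiteDelays:
    """Per-apparatus delay components, in milliseconds (illustrative values only)."""
    recv_buffer_ms: float   # da_r or db_r
    send_buffer_ms: float   # da_s or db_s
    encoder_ms: float       # denca or dencb
    decoder_ms: float       # ddeca or ddecb


def total_processing_delay_ms(site_a: SiteDelays, site_b: SiteDelays) -> float:
    """d = da + db + dproc, following equation (1)."""
    da = site_a.recv_buffer_ms + site_a.send_buffer_ms
    db = site_b.recv_buffer_ms + site_b.send_buffer_ms
    dproc = (site_a.encoder_ms + site_a.decoder_ms
             + site_b.encoder_ms + site_b.decoder_ms)
    return da + db + dproc


example_d = total_processing_delay_ms(
    SiteDelays(recv_buffer_ms=40, send_buffer_ms=20, encoder_ms=5, decoder_ms=5),
    SiteDelays(recv_buffer_ms=60, send_buffer_ms=20, encoder_ms=5, decoder_ms=5),
)
```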
  • In FIG. 7, it is assumed that the utterance of the user A ends at time T0.
  • Unlike in the normal conversation shown in FIG. 4, the point in time at which the end of the user A's utterance is recognized differs between the base X and the base Y.
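  • At the base Y, the end of the utterance of the user A is recognized at time T0′, which is later than time T0 by the sum of the transmission buffer delay amount da_s of the voice transmitting/receiving apparatus 10A, the reception buffer delay amount db_r of the voice transmitting/receiving apparatus 10B, the encoder processing delay denca of the voice transmitting/receiving apparatus 10A, the decoder processing delay ddecb of the voice transmitting/receiving apparatus 10B, and the one-way delay OWDa of the network 20 from the voice transmitting/receiving apparatus 10A to the voice transmitting/receiving apparatus 10B, as expressed by the following equation (2).
  • T0′ = T0 + da_s + db_r + denca + ddecb + OWDa (2)
  • Note that the following equation (3) holds for the one-way delay OWDa of the network 20.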
  • OWDb is a one-way delay of the network 20 from the voice transmitting / receiving apparatus 10B to the voice transmitting / receiving apparatus 10A.
  • the user B starts an utterance in 'User B reaction time time through the R B T1 from' time T0.
  • the time when the user B's utterance start is recognized is also different between the base X and the base Y, and the start of the user B's utterance at the base X is recognized from the transmission buffer delay of the voice transmitting / receiving apparatus 10B from the time T1 ′.
  • T2 T0 ′ + R B + db_s + da_r + dencb + ddeca + OWDb (4) That is, in the same sense as in the normal conversation is illustrated in Figure 4, even if the user B has performed speech at a user B reaction time R B after the end user utterance A, the teleconference system 1 and the network 20 Due to the influence, the user A recognizes the start of the utterance of the user B after the delay time TL defined by the following equation (5) has elapsed.
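  • Substituting equation (2) into equation (4) and grouping the terms according to equation (1) gives the time that elapses between the end of the user A's utterance and the moment the user A hears the user B's reply. The following is a worked sketch derived only from equations (1), (2), and (4); equation (5) is not reproduced in this text, so whether it has exactly this form is an assumption. In the notation below, d_{a,s} stands for da_s, d_{enc,a} for denca, and so on.

```latex
\begin{aligned}
T_2 - T_0 &= (d_{a,s} + d_{b,r} + d_{enc,a} + d_{dec,b} + OWD_a) + R_B
             + (d_{b,s} + d_{a,r} + d_{enc,b} + d_{dec,a} + OWD_b) \\
          &= (d_{a,r} + d_{a,s}) + (d_{b,r} + d_{b,s})
             + (d_{enc,a} + d_{dec,a} + d_{enc,b} + d_{dec,b})
             + R_B + OWD_a + OWD_b \\
          &= d_a + d_b + d_{proc} + R_B + OWD_a + OWD_b \\
          &= d + R_B + OWD_a + OWD_b
\end{aligned}
```

  • In other words, the reply appears to arrive later than in a face-to-face conversation by the processing delay amount d plus the two one-way network delays.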
  • The index value Z takes a value larger than 1 and gradually approaches 1 as the environment approaches that of a normal conversation (the environment illustrated in FIG. 4). As the index value Z increases, the user A feels that the conversation is not proceeding smoothly and may find it less comfortable.
  • The maximum value F is a fitted value above which the decrease in conversation quality felt by the user A is judged to be non-negligible.
  • As long as the index value Z is equal to or less than the maximum value F, the user A does not feel any practically non-negligible lack of smoothness in the conversation with the user B.
  • The above expression (8) defines the range that the processing delay amount d can take.
  • The processing delay amount d in the above equation (8) is an example of the “own base first allowable processing delay amount” according to the present invention.
  • The processing delay amount d shown in the above equation (8) is a value set for the user A. For the user B, a processing delay amount d is set by the following equation (9) through the same process in the voice transmitting / receiving apparatus 10B.
  • The processing delay amount d in the following equation (9) is an example of the “other site first allowable processing delay amount” according to the present invention.
  • FIG. 8 is a timing chart illustrating the concept of the second calculation model.
  • In the figure, the same parts as those in FIG. 7 are denoted by the same reference numerals, and the description thereof is omitted as appropriate.
  • FIG. 7 illustrates the concept of the user A reaction waiting time TA.
  • The user A reaction waiting time TA is the length of time from time T0, at which the user A's utterance ends, to time T1′, at which the user A starts speaking again.
  • Meanwhile, the start of the user B's utterance is recognized by the user A at time T2, after the delay time TL has elapsed from time T0.
  • The second calculation model is a calculation model that takes the user A reaction waiting time TA into account.
  • Since the delay time TL must not be longer than the user A reaction waiting time TA, the following expression (10) holds.
  • The above expression (11) defines the range that the processing delay amount d can take.
  • The processing delay amount d in the equation (11) is an example of the “own base second allowable processing delay amount” according to the present invention.
  • The processing delay amount d shown in the above equation (11) is a value set for the user A. For the user B, the processing delay amount d is calculated by the following expression (12) through the same process in the voice transmitting / receiving apparatus 10B.
  • The processing delay amount d shown in the following expression (12) is an example of the “other site second allowable processing delay amount” according to the present invention.
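  • As a hedged illustration of how the second calculation model can bound d: expressions (10) to (12) are not reproduced in this text, so the form below is an assumption. It takes the delay time to be TL = d + RB + OWDa + OWDb, which is the sum obtained by substituting equation (2) into equation (4), and requires that TL not exceed the user A reaction waiting time TA.

```latex
T_L \le T_A
\;\Longleftrightarrow\;
d + R_B + OWD_a + OWD_b \le T_A
\;\Longleftrightarrow\;
d \le T_A - R_B - OWD_a - OWD_b
```

  • Under this reading, a user who waits only briefly before speaking again (small TA) leaves less room for buffering than a user who waits longer.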
  • Next, the maximum allowable processing delay amount dmax is determined (step S104).
  • The maximum allowable processing delay amount dmax is a processing delay amount that can ensure the smoothness of conversation for both the user A and the user B. Therefore, the processing delay amounts set based on the first calculation model and the second calculation model need to be compared. More specifically, the maximum allowable processing delay amount dmax must be a processing delay amount that satisfies all of the above formulas (8), (9), (11), and (12).
  • In step S104, the processing delay amount d set for the user B based on the first and second calculation models is acquired from the voice transmitting / receiving apparatus 10B via the processing delay information communication unit 800.
  • Likewise, the processing delay amount d set for the user A based on the first and second calculation models is transmitted to the voice transmitting / receiving apparatus 10B via the processing delay information communication unit 800. That is, the conditional expressions used to determine the maximum allowable processing delay amount dmax are shared between the voice transmitting / receiving apparatus 10A and the voice transmitting / receiving apparatus 10B.
  • Next, the processing delay amount d satisfying all of the above formulas (8), (9), (11), and (12) is determined. Specifically, the processing delay amount d defined by whichever of these four conditional expressions has the smallest right-hand side satisfies all four formulas.
  • The maximum allowable processing delay amount dmax is determined as the maximum value in the range that satisfies these four formulas.
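  • The selection in step S104 can be sketched as taking the smallest of the four right-hand sides. This assumes that each of formulas (8), (9), (11), and (12) has the form "d is at most some bound", which the text suggests but does not show. The function name, dictionary keys, and numeric values are hypothetical.

```python
from typing import Iterable


def max_allowable_delay(upper_bounds: Iterable[float]) -> float:
    """Return the largest d that satisfies every constraint d <= bound."""
    bounds = list(upper_bounds)
    if not bounds:
        raise ValueError("at least one upper bound is required")
    # The tightest (smallest) bound is the largest value meeting all of them.
    return min(bounds)


# Hypothetical right-hand sides in milliseconds, one per formula.
bounds_ms = {
    "own_base_first (8)": 180.0,
    "other_site_first (9)": 150.0,
    "own_base_second (11)": 210.0,
    "other_site_second (12)": 165.0,
}
dmax = max_allowable_delay(bounds_ms.values())
print(dmax)
```

  • Because both apparatuses exchange their bounds, each side can evaluate the same minimum and arrive at the same dmax.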
  • Next, the distribution process is executed (step S105).
  • The distribution process is a process of distributing the determined maximum allowable processing delay amount dmax to the buffer capacities of the voice transmitting / receiving apparatus 10A and the voice transmitting / receiving apparatus 10B. This distribution process is executed by the negotiation unit 620.
  • First, the negotiation unit 620 confirms the load status with the voice transmitting / receiving apparatus 10B via the processing delay information communication unit 800. If neither the voice transmitting / receiving apparatus 10B nor the voice transmitting / receiving apparatus 10A has a processing load problem, the negotiation unit 620 decides that the voice transmitting / receiving apparatus 10A bears 50% of the maximum allowable processing delay amount dmax and transmits this decision to the voice transmitting / receiving apparatus 10B. As a result, when there is no processing load problem, 50% of the maximum allowable processing delay amount dmax is normally borne on the voice transmitting / receiving apparatus 10A side.
  • Next, the negotiation unit 620 distributes the delay amount to be borne by the voice transmitting / receiving apparatus 10A between the transmission buffer 130 and the reception buffer 220.
  • The burden rates of the transmission buffer 130 and the reception buffer 220 are also set equal here. That is, the processing delay amount related to the transmission buffer 130 is set to 25% of the maximum allowable processing delay amount dmax, and the processing delay amount related to the reception buffer 220 is also set to 25% of the maximum allowable processing delay amount dmax.
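  • The default split described above can be sketched as follows. The text only specifies the equal split for the no-load-problem case and only for the voice transmitting / receiving apparatus 10A, so the adjustable share parameters and the use of the same transmission/reception split for 10B are assumptions made for illustration. All names are hypothetical.

```python
def distribute_delay(dmax: float, share_a: float = 0.5, tx_share: float = 0.5) -> dict:
    """Split dmax between two apparatuses, then between their tx and rx buffers.

    share_a  : fraction of dmax borne by apparatus 10A (0.5 in the text's default case)
    tx_share : fraction of each apparatus's budget given to its transmission buffer
    """
    if not (0.0 <= share_a <= 1.0 and 0.0 <= tx_share <= 1.0):
        raise ValueError("shares must lie between 0 and 1")

    budget_a = dmax * share_a
    budget_b = dmax - budget_a  # the remainder is borne by apparatus 10B

    def split(budget: float) -> dict:
        tx = budget * tx_share
        return {"tx_buffer": tx, "rx_buffer": budget - tx}

    return {"10A": split(budget_a), "10B": split(budget_b)}


# With the defaults, each buffer receives 25% of dmax.
plan = distribute_delay(120.0)
print(plan)
```

  • Computing each remainder by subtraction rather than by a second multiplication keeps the four shares summing to dmax even when the ratios are not exactly representable.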
  • The set distribution ratio is transmitted to the buffer control unit 900.
  • On receiving the distribution ratio, the buffer control unit 900 controls the capacities of the transmission buffer 130 and the reception buffer 220 so that a processing delay amount corresponding to the set distribution ratio is obtained (step S106).
  • The process then returns to step S101, and the series of processes is repeated.
  • The buffer capacity control is executed as described above.
  • In summary, the maximum allowable processing delay amount dmax is determined from the processing delay amounts d set based on the first calculation model and the second calculation model, and the processing delay amount d of the remote conference system 1 is controlled to the maximum allowable processing delay amount dmax through the control of the reception buffer capacity Da_r and the transmission buffer capacity Da_s.
  • The first calculation model and the second calculation model are each constructed so as to reflect the users' conversation feature amounts, so the determined maximum allowable processing delay amount dmax guarantees at least a minimum level of smoothness for both parties, in accordance with the actual circumstances of the user A and the user B who are parties to the conversation.
  • In addition, the maximum allowable processing delay amount dmax is the maximum value within the range in which this kind of decrease in comfort can be prevented, so voice quality deterioration due to sound interruptions, coding errors, and the like is also prevented.
  • That is, the optimum conversation quality in accordance with the circumstances of the users participating in the conversation is provided.
  • The users participating in the conversation here are the user A and the user B. Needless to say, the same concept can also be applied to a meeting or conversation with three or more participants to provide the optimum conversation quality according to the users' circumstances.
  • An audio conference is assumed here. However, it is also possible to transmit and receive image data or video data together with audio data by arranging imaging means at each site. In this case, for example, a conference form called a TV conference or the like can be adopted.
  • The voice transmitting / receiving apparatus according to the present example is provided with a memory.
  • The past conversation feature quantities can be used as they are the next time.
  • This is preferable because variation in conversation quality during the initial operation of the conversation feature quantity acquisition unit 300 and the processing delay amount determination unit 600 can be suppressed.
  • The control of the processing delay amount is realized here through the control of the buffer capacity.
  • Another control amount correlated with the processing delay amount is the encoding rate of the audio data (for example, the encoding bit rate of the encoder 120).
  • As the encoding rate increases, the sound quality becomes relatively high, but the amount of data transmitted increases and the processing delay amount increases. Therefore, by controlling the encoding rate instead of or in addition to the buffer capacity described above, it is possible to obtain the same effect as described above.
  • FIG. 9 is a schematic configuration diagram conceptually showing the configuration of the remote conference system 2 according to the second example of the present invention. In the figure, the same reference numerals are given to the same portions as those in FIG. 1, and the description thereof will be omitted as appropriate.
  • The remote conference system 2 differs from the remote conference system 1 according to the first embodiment in that it includes a server device 30 accommodated in the network 20.
  • The server device 30 is a computer system that mediates between the voice transmitting / receiving apparatuses 10A and 10B, and is an example of the “server device” according to the present invention.
  • The server device 30 includes the processing delay amount determination unit 600 of the first embodiment. All calculation processes relating to the processing delay amounts d based on the first and second calculation models, the maximum allowable processing delay amount dmax, and the buffer capacity of each voice transmitting / receiving apparatus are executed by the server device 30.
  • The server device 30 includes a notification unit that notifies each voice transmitting / receiving apparatus of the determined buffer capacity.
  • The buffer control unit 900 receives this notification and controls each transmission / reception buffer according to the notified buffer capacity.
  • As a result, conversation quality that does not impair comfort is provided.
  • In particular, relatively high-load processing, such as the various calculation processes based on the calculation models and the process of determining the buffer capacity distribution ratio, is borne by the server device 30 instead of the voice transmitting / receiving apparatuses. This reduces the burden on those apparatuses and enables smoother control of the conversation.
  • The present invention can be applied to an apparatus or system that establishes a conversation between users at remote locations.

Abstract

The invention relates to an audio transmission/reception device, an audio transmission/reception system, and a server device that provide optimum conversation quality suited to the situation of a user engaged in a conversation. The audio transmission/reception device (10A) comprises: transmission means (100, 200) for sending and receiving conversation data over a network (20) to and from an audio transmission/reception device (10B) installed at a base other than the own base, among multiple bases, the conversation data representing the content of a conversation and containing at least audio data; conversation feature quantity acquisition means (300) for acquiring conversation feature quantities for an own-base user (A) located at the own base and an other-base user (B) located at the other base, respectively; processing delay amount determination means (600) for determining a total processing delay amount for the entire audio transmission/reception system on the basis of the acquired conversation feature quantities; and processing delay amount adjustment means (600) for adjusting the processing delay amounts so as to satisfy the determined total amount.
PCT/JP2010/062558 2010-07-26 2010-07-26 Audio transmission/reception device, audio transmission/reception system, and server device Ceased WO2012014275A1

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2010/062558 WO2012014275A1 2010-07-26 2010-07-26 Audio transmission/reception device, audio transmission/reception system, and server device

Publications (1)

Publication Number Publication Date
WO2012014275A1 2012-02-02

Family

ID=45529523

Country Status (1)

Country Link
WO (1) WO2012014275A1 (fr)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10200580A (ja) * 1997-01-16 1998-07-31 Matsushita Electric Ind Co Ltd Voice packet reproduction method
JP2002164921A (ja) * 2000-11-27 2002-06-07 Oki Electric Ind Co Ltd Quality control device for voice packet communication
JP2005303531A (ja) * 2004-04-08 2005-10-27 Mitsubishi Electric Corp Voice data receiving device and voice data transmitting device

Legal Events

Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 10855286; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 10855286; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: JP)