HK1130378B - Jitter buffer adjustment - Google Patents
Description
Technical Field
The invention relates to jitter buffer adjustment.
Background
To transmit speech, speech frames may be encoded at a transmitter, transmitted over a network, and decoded again at a receiver for presentation to a user.
The normal transmission of speech frames may be switched off during periods when the transmitter has no active speech to transmit. This is called the Discontinuous Transmission (DTX) mechanism. Discontinuous transmission saves transmission resources when there is no useful information to be transmitted. In a normal conversation, for example, only one of the people involved is usually speaking at a given moment, which means that the signal in one direction contains active speech on average only during about 50% of the time. During the remaining periods, the transmitter may generate a set of comfort noise parameters describing the background noise present at the transmitter. These comfort noise parameters may be transmitted to the receiver. The transmission of comfort noise parameters is typically done at a lower bit rate and/or at longer transmission intervals than speech frames. The receiver may then use the received comfort noise parameters to synthesize an artificial noise-like signal with characteristics approximating those of the background noise present at the transmitter.
In adaptive multi-rate (AMR) speech codecs and adaptive multi-rate wideband (AMR-WB) speech codecs, new speech frames are generated, for example, at 20 ms intervals during active speech periods. Once the end of an active speech period is detected, the discontinuous transmission mechanism keeps the encoder active for seven more frames, forming a delayed release period (hangover period). This period is used at the receiving end to prepare a background noise estimate that can serve as a basis for generating comfort noise during non-speech periods. After the delayed release period, the transmission switches to a comfort noise state, during which updated comfort noise parameters are transmitted at 160 ms intervals in silence descriptor (SID) frames. At the start of a new session, the transmitter is set to the active state. This means that even if the audio signal does not contain speech, at least the first seven frames of the new session are encoded and transmitted as speech.
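The frame-type decision described above can be sketched as follows. This is an illustrative simplification, not the AMR reference implementation; the frame-type labels and the function name are assumptions made for the example.

```python
HANGOVER_FRAMES = 7   # delayed release period after active speech
SID_INTERVAL = 8      # one SID frame every 8 * 20 ms = 160 ms

def classify_frames(vad_flags):
    """Return the transmitted frame type for each 20 ms frame:
    'SPEECH', 'SID' (comfort-noise update) or 'NO_DATA'."""
    types = []
    hangover = HANGOVER_FRAMES   # a new session starts in the active state
    sid_countdown = 0
    for active in vad_flags:
        if active:
            types.append('SPEECH')
            hangover = HANGOVER_FRAMES
            sid_countdown = 0
        elif hangover > 0:
            # hangover frames are still encoded and sent as speech
            types.append('SPEECH')
            hangover -= 1
        else:
            if sid_countdown == 0:
                types.append('SID')
                sid_countdown = SID_INTERVAL
            else:
                types.append('NO_DATA')
            sid_countdown -= 1
    return types
```

For two active frames followed by silence, the sketch emits seven further speech frames (the hangover), then a SID frame, then SID updates every eighth frame.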
An audio signal comprising speech frames and, in case of DTX, comfort noise parameters may be transmitted from a transmitter to a receiver via, for example, a packet switched network (e.g., the internet).
The nature of packet-switched communications typically introduces variations in the time of transmission of packets, commonly referred to as jitter, which appears to the receiver as packets arriving at unstable intervals. In addition to packet loss situations, network jitter is a major obstacle, particularly for conversational voice services provided over packet switched networks.
More specifically, the audio playback components of an audio receiver operating in real time require a continuous input to maintain good sound quality; even very short interruptions must be avoided. Thus, if packets containing audio frames arrive after the time at which those frames need to be decoded and further processed, the packets and the contained audio frames are considered lost due to late arrival. The audio decoder then performs error concealment to compensate for the audio signal carried in the lost frames. Nevertheless, extensive error concealment clearly reduces the sound quality.
Typically, a jitter buffer is therefore used to hide the unstable packet arrival times and thereby provide a continuous input to the decoder and the subsequent audio playback components. The jitter buffer stores incoming audio frames for a predetermined amount of time, which may be specified, for example, when the first packet of the packet stream is received. However, the jitter buffer introduces an extra delay component, since received packets are stored before further processing; this increases the end-to-end delay. The jitter buffer can be characterized, for example, by the average buffering delay and the resulting proportion of delayed frames among all received frames.
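The two characteristics just mentioned can be sketched from per-frame timestamps as follows; the function name and the millisecond units are illustrative assumptions, not part of the text.

```python
def buffer_metrics(arrivals, deadlines):
    """Given arrivals[i] (when frame i reached the buffer, ms) and
    deadlines[i] (when the decoder needed frame i, ms), return the
    average buffering delay and the proportion of late frames."""
    assert len(arrivals) == len(deadlines) and arrivals
    late = sum(1 for a, d in zip(arrivals, deadlines) if a > d)
    # frames that arrived in time wait in the buffer until decoded
    waits = [d - a for a, d in zip(arrivals, deadlines) if a <= d]
    avg_buffering_delay = sum(waits) / len(waits) if waits else 0.0
    late_rate = late / len(arrivals)
    return avg_buffering_delay, late_rate
```

A frame arriving after its deadline counts toward the late rate rather than toward the buffering delay, mirroring the "considered lost due to late arrival" behavior described above.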
Jitter buffers using fixed playback timing are inevitably a compromise between the lowest end-to-end delay and the smallest number of delayed frames, and finding the optimal balance is not an easy task. While in certain specific environments and applications the amount of expected jitter may be estimated to stay within certain limits, jitter typically varies from zero to hundreds of milliseconds, even within the same session. Using fixed playback timing with an initial buffering delay large enough to cover the expected worst-case jitter may keep the number of delayed frames in check, but it also risks introducing an end-to-end delay so long that natural conversation becomes impossible. Therefore, in most audio transmission applications operating over packet-switched networks, fixed buffering is not the optimal choice.
Adaptive jitter buffer management may be used to dynamically control the balance between a sufficiently short delay and a sufficiently small number of delayed frames. In this method, the incoming packet stream is continuously monitored and the buffering delay is adjusted according to observed changes in its delay behavior. When the transmission delay appears to be increasing or the jitter becomes more severe, the buffering delay is increased to match the network conditions. In the opposite case, the buffering delay can be reduced, thereby minimizing the total end-to-end delay.
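A minimal sketch of such an adaptation rule follows. The safety margin, the shrink condition, and the minimum delay are illustrative assumptions, not values taken from the text.

```python
def adapt_buffer_delay(current_delay_ms, recent_jitters_ms,
                       margin_ms=10.0, min_delay_ms=20.0):
    """Grow the buffering delay when observed jitter approaches it,
    shrink it when the network has clearly calmed down."""
    observed = max(recent_jitters_ms)
    if observed + margin_ms > current_delay_ms:
        # jitter got worse: buffer more to avoid late frames
        return observed + margin_ms
    if observed + margin_ms < 0.5 * current_delay_ms:
        # jitter is well below the current delay: buffer less
        return max(min_delay_ms, observed + margin_ms)
    return current_delay_ms
```

The asymmetry (grow eagerly, shrink only when jitter falls well below the current delay) is a common design choice to avoid oscillating around the operating point.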
Disclosure of Invention
The present invention proceeds from the consideration that control of the end-to-end delay is one of the challenges in adaptive jitter buffer management. In the usual case, the receiver does not have any information about the end-to-end delay. Thus, adaptive jitter buffer management typically performs adjustments only by attempting to keep the number of delayed frames below a desired threshold. While this approach may be used to maintain voice quality at an acceptable level in most transmission environments, in some cases, the adjustments may increase the end-to-end delay above an acceptable level, thereby preventing natural conversation.
A method is proposed which comprises determining, at a first device, a desired amount of adjustment of a jitter buffer by using an estimated delay as a parameter, the delay comprising at least an end-to-end delay in at least one direction of a call. For the call, a voice signal is transmitted in packets between the first device and a second device via a packet-switched network. The method further comprises adjusting the jitter buffer based on the determined amount of adjustment.
Furthermore, an apparatus is proposed, which comprises a control component configured to determine a desired amount of adjustment of the jitter buffer at the first device by using an estimated delay as a parameter, said delay comprising at least an end-to-end delay in at least one direction in the call. For the call, voice signals are sent in packets between the first device and the second device via the packet-switched network. The apparatus also includes an adjustment component configured to adjust the jitter buffer based on the determined amount of adjustment.
The control component and the adjustment component may be implemented in software and/or hardware. The apparatus may be, for example, an audio receiver or an audio transceiver. It may further be implemented, for example, in the form of a chip or integrated into a more comprehensive device.
Furthermore, an electronic device is proposed, which comprises the above-mentioned apparatus, an audio input component (e.g. a microphone) and an audio output component (e.g. a loudspeaker).
Furthermore, a system is proposed which comprises the electronic device as well as a further electronic device. The further electronic device is configured to exchange voice signals for a call with the first electronic device via a packet-switched network.
Finally, a computer program product is proposed, in which program code is stored on a computer-readable medium. The program code, when executed by a processor, carries out the proposed method.
The computer program product may be, for example, a stand-alone storage device, or a memory integrated into an electronic device, etc.
The invention is to be understood to cover such computer program code also separately from the computer program product and the computer-readable medium.
By taking into account the end-to-end delay in at least one direction when adjusting the jitter buffer, the performance of the adaptive jitter buffer may be improved. An optimal trade-off between end-to-end delay and the number of late frames can be found if, in addition to the number of frames arriving after their planned decoding time, the end-to-end delay in at least one direction is also taken into account. Frames arriving after their scheduled decoding time are typically discarded by the buffer, since the decoder has already replaced them with error concealment on account of their late arrival. From the decoder perspective, these frames may be considered discarded frames; the proportion of such frames is referred to as the late loss rate.
The estimated delay considered may be, for example, an estimated one-way end-to-end delay or an estimated two-way end-to-end delay. The one-way end-to-end delay may be, for example, the delay between the time the user of one device starts talking and the time the user of another device starts listening to speech. The bi-directional end-to-end delay will be referred to as response time in the following.
In a telephony scenario, user interactivity may be considered more important from the user's perspective than the one-way end-to-end delay. Interactivity is measured as the response time, that is, the time a user who has stopped talking must wait before hearing a response; it therefore includes the other user's reaction time in addition to the bidirectional transmission and processing delays. For one embodiment, it is proposed to use the estimated response time as the specific estimated delay for selecting the optimal adjustment of the adaptive jitter buffer. The estimated response time may be, for example, the time between the end of a segment of speech uttered by a user of the first device and the beginning of presentation by the first device of a segment of speech uttered by a user of the second device.
In one embodiment of the invention, determining the amount of adjustment comprises determining it such that, as long as the estimated delay remains below a first delay threshold, the number of frames arriving after their planned decoding time is kept below a first late-loss threshold. In addition, the amount of adjustment is determined such that when the estimated delay exceeds the first delay threshold, for example lying between the first delay threshold and a second, higher delay threshold, the number of frames arriving after their planned decoding time is kept below a second, higher late-loss threshold.
The first delay threshold, the second delay threshold, the first late-loss threshold, and the second late-loss threshold may be predetermined values. Alternatively, however, one or more of these values may be variable. The second late-loss threshold may be calculated, for example, as a function of the estimated delay: the longer the estimated delay, the higher the second late-loss threshold. The idea is that as the delay becomes higher (resulting in reduced interactivity), a higher late loss rate may be allowed, to avoid increasing the delay further by raising the buffering time merely to keep the late loss rate low.
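One possible shape for such a variable second late-loss threshold is a linear ramp between the two delay thresholds. The specific function and constants below are assumptions made for illustration (the delay thresholds echo the example values given later in the text).

```python
def llr2_for_delay(delay_ms, thr1_ms=400.0, thr2_ms=800.0,
                   llr2_min=0.005, llr2_max=0.03):
    """Allow a higher late loss rate as the estimated delay grows,
    so the buffer is not enlarged further just to keep late losses low."""
    if delay_ms <= thr1_ms:
        return llr2_min
    if delay_ms >= thr2_ms:
        return llr2_max
    # linear interpolation between the two delay thresholds
    frac = (delay_ms - thr1_ms) / (thr2_ms - thr1_ms)
    return llr2_min + frac * (llr2_max - llr2_min)
```

Any monotonically non-decreasing function of the estimated delay would serve the same purpose; the linear ramp is simply the easiest to reason about.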
Any available mechanism may be used to estimate the delay. The estimation may be based on available information or dedicated measurements.
For example, an approach based on an external time reference may be used, such as the Network Time Protocol (NTP) based method described in RFC 3550, "RTP: A Transport Protocol for Real-Time Applications".
If the estimated response time is to be used as the estimated delay, the response time can also be roughly estimated by taking into account the basic structure of the call. A call is typically divided into a number of talk turns, during which one party is speaking and the other is listening. This structure of the call can be used to estimate the response time.
The response time may be estimated as the time period between a time at which the user of the first device is detected, at the first device, to switch from speaking to listening and a time at which the user of the second device is detected, at the first device, to switch from listening to speaking.
The electronic device typically knows its own transmit and receive states, and this knowledge can be used as a basis for detecting these changes and thus estimating the response time.
The time at which the user of the second device is detected to switch from listening to speaking may be, for example, the time when the first device, after having received at least one segment of the speech signal not containing active speech, receives via the packet-switched network the first segment of the speech signal containing active speech. The decoder of the first device may provide for this purpose, for example, an indication of the current type of content of the received speech signal, an indication of the presence of a particular type of content, or an indication of a change of content. The type of content indicates the current reception state of the first device and the current transmission state of the second device. The reception of comfort noise frames indicates, for example, that the user of the second device is listening, while the reception of speech frames indicates that the user of the second device is speaking.
The estimated time when the user of the first device is detected to switch from speaking to listening may be a time when the first device starts generating comfort noise parameters. The encoder of the first device may provide a corresponding indication.
Alternatively, if the electronic device uses voice activity detection (VAD), the time at which the user of the first device is detected to switch from speaking to listening may be the time when the VAD component of the first device sets its flag to a value indicating that the current segment of the speech signal to be sent via the packet-switched network does not contain active speech. The VAD component of the first device may provide a corresponding indication. If a DTX delayed release period is used, the flag set by the VAD component may provide faster and more accurate information about the end of a speech segment than an indication that comfort noise generation has started.
For example, in the case of Voice Over IP (VOIP), the VOIP client may know its own transmission state according to the current state of voice activity detection and the state of discontinuous transmission operation.
It should be noted that the proposed method of roughly estimating the response time can also be used for purposes other than controlling the adaptive jitter buffer; the response time is in itself a useful quality-of-service metric.
The invention can be used in any application using an adaptive jitter buffer for speech signals; VOIP using the AMR or AMR-WB codec is one example.
It is to be understood that the presented exemplary embodiments may also be implemented in any suitable combination.
Other objects and features of the present invention will become apparent from the following detailed descriptions considered in conjunction with the accompanying drawings. It is to be understood, however, that the drawings are designed solely for purposes of illustration and not as a definition of the limits of the invention, for which reference should be made to the appended claims. It should be further understood that the drawings are not drawn to scale and that they are intended to conceptually illustrate the structures and procedures described herein.
Drawings
FIG. 1 is a schematic block diagram of a system according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a call structure;
FIG. 3 is a flow chart illustrating operations in the system of FIG. 1 for estimating a current response time in a call;
FIG. 4 is a flow chart illustrating operations in the system of FIG. 1 for adjusting jitter buffer based on current response time; and
FIG. 5 is a schematic block diagram of an electronic device according to another embodiment of the invention.
Detailed Description
Fig. 1 is a schematic block diagram of an exemplary system that supports adaptive jitter buffer adjustment based on an estimated response time in accordance with an embodiment of the present invention.
The system comprises a first electronic device 110, a second electronic device 150 and a packet switched communication network 160 connecting both devices 110, 150. The packet switched communication network 160 may be or include, for example, the internet.
The electronic device 110 comprises an audio receiver 111, a playback component 118 linked to an output of said audio receiver 111, an audio transmitter 122, a microphone 121 linked to an input of said audio transmitter 122, and a response time (T_resp) estimation component 130 linked to both said audio receiver 111 and said audio transmitter 122. The T_resp estimation component 130 is also connected to a timer 131. Inside the electronic device 110, the interface of the device 110 to said packet-switched communication network 160 (not shown) is linked to an input of said audio receiver 111 and to an output of said audio transmitter 122.
The audio receiver 111, the audio transmitter 122, the T_resp estimation component 130 and the timer 131 may be implemented, for example, as a single chip 140 or chipset.
Inside the audio receiver 111, the input of the audio receiver 111 is connected on the one hand to a jitter buffer 112 and on the other hand to a network analyzer 113. The jitter buffer 112 is connected to the output of said audio receiver 111 via a decoder 114 and an adjusting component 115 and thus to a playback component 118. A control signal output of the network analyzer 113 is connected to a first control input of a control component 116 and a control signal output of the jitter buffer 112 is connected to a second control input of the control component 116. The control signal output of the control component 116 is also connected to the control input of the regulating component 115.
The playback component 118 may include, for example, a speaker.
Inside the audio transmitter 122, the input of the audio transmitter 122 of the electronic device 110 is connected to an encoder 124 via an analog-to-digital converter (ADC) 123. The encoder 124 may include, for example, a speech encoder 125, a voice activity detection (VAD) component 126, and a comfort noise parameter generator 127.
The T_resp estimation component 130 is arranged to receive inputs from the decoder 114 and the encoder 124. An output of the T_resp estimation component 130 is connected to the control component 116.
Electronic device 110 may be considered to represent an exemplary embodiment of an electronic device in accordance with the present invention, and chip 140 may be considered to represent an exemplary embodiment of a device in accordance with the present invention.
It is to be understood that various other components of the electronic device 110, within and outside the audio receiver 111 and the audio transmitter 122, are not depicted, and that any of the illustrated links may equally be links via other components not shown. The electronic device 110 also includes an interface to the network 160, as described above. In addition, it may include a separate discontinuous transmission control component, a channel encoder, and a packetizer for the transmit chain, as well as a depacketizer, a channel decoder, and a digital-to-analog converter for the receive chain. Further, the audio receiver 111 and the audio transmitter 122 may be implemented as an integrated transceiver. The T_resp estimation component 130 and the timer 131 may also be integrated in the audio receiver 111, the audio transmitter 122, or such an audio transceiver.
Although not mandatory, the electronic device 150 may be implemented in the same manner as the electronic device 110. The electronic device 150 should be configured to receive and transmit audio packets in a discontinuous transmission using a codec compatible with the codec employed by the electronic device 110. To illustrate these transceiving functions, the electronic device 150 is shown to include an audio Transceiver (TRX) 151.
The encoding and decoding of the audio signal in the electronic device 110, 150 may be based on, for example, an AMR codec or an AMR-WB codec.
Electronic device 110 and electronic device 150 may be used by users of VOIP conversations conducted via packet-switched communication network 160.
During an ongoing VOIP session, the microphone 121 registers audio signals in the environment of the electronic device 110, in particular the speech uttered by user A. The microphone 121 forwards the registered analog audio signal to the audio transmitter 122. In the audio transmitter 122, the analog audio signal is converted into a digital signal by the ADC 123 and supplied to the encoder 124. In the encoder 124, the VAD component 126 detects whether the current audio signal contains active speech. The VAD flag is set to "1" if active speech is detected, and to "0" otherwise. If the VAD flag is set to "1", the speech encoder 125 encodes the current audio frame as an active speech frame. Otherwise, the comfort noise parameter generator 127 generates SID frames. A SID frame includes 35 bits of comfort noise parameters describing the background noise at the transmitting end in the absence of active speech. The active speech frames and SID frames are then channel encoded, packetized, and transmitted to the electronic device 150 via the packet-switched communication network 160. Active speech frames are sent at 20 ms intervals, while SID frames are sent at 160 ms intervals.
In the electronic device 150, the audio transceiver 151 processes the received packets so that the corresponding reconstructed audio signal can be presented to user B. Moreover, the audio transceiver 151 processes audio signals registered in the environment of the electronic device 150, in particular the speech uttered by user B, in a manner similar to the way the audio transmitter 122 processes audio signals registered in the environment of the electronic device 110. The resulting packets are sent to the electronic device 110 via the packet-switched communication network 160. The electronic device 110 receives the packets, depacketizes them, and channel decodes the contained audio frames.
The jitter buffer 112 is then used to store the received audio frames while they wait to be decoded and played back. The jitter buffer 112 may have the capability to arrange the received frames into the correct decoding order and to provide the arranged frames (or information about lost frames) to the decoder 114 in sequence upon request. In addition, the jitter buffer 112 provides information about its status to the control component 116. The network analyzer 113 computes a parameter set describing the current reception characteristics, based on frame reception statistics and the timing of received frames, and provides this parameter set to the control component 116. The control component 116 determines the need to change the buffering delay based on the received information and provides a corresponding time scaling command to the adjustment component 115. In general, the optimal average buffering delay is the one that minimizes the buffering time without any frames reaching the decoder 114 after their scheduled decoding time. According to the invention, however, the control component 116 also takes into account the information received from the T_resp estimation component 130, as will be further described below.
When the playback component 118 requests new data, the decoder 114 retrieves audio frames from the buffer 112, decodes them, and forwards the decoded frames to the adjustment component 115. When an encoded speech frame is received, the speech frame is decoded to obtain a decoded speech frame. When a SID frame is received, comfort noise is generated based on the contained comfort noise parameters and distributed over a series of digital noise frames forming the decoded frames. The adjustment component 115 performs the scaling commanded by the control component 116, i.e. the lengthening or shortening of the received decoded frames. The decoded, and possibly time-scaled, frames are provided to the playback component 118 for presentation to user A.
Fig. 2 is a diagram showing the structure of a call between a user A and a user B, based on the assumption that while user A of device 110 is speaking, user B of device 150 is listening, and vice versa.
When user A speaks (201), user B hears the speech (202) after a delay T_AtoB, where T_AtoB is the transmission time from user A to user B.
When user B notices that user A has stopped talking, user B responds after a reaction time T_react.
When user B speaks (203), user A hears the speech (204) after a delay T_BtoA, where T_BtoA is the transmission time from user B to user A.
The time period that user A experiences from the moment user A stops talking to the moment user A starts hearing the voice of user B is called the response time T_resp from user A to user B and back to user A. The response time T_resp can be expressed as:

T_resp = T_AtoB + T_react + T_BtoA.
It should be noted that this is merely a simplified model of the overall response time. For example, the model does not explicitly show the buffering delays or the algorithmic and processing delays in the speech processing components employed; these are assumed to be included in the transmission times T_AtoB and T_BtoA. Although the buffering delay in user A's device is a significant part of the response time, this delay component can easily be obtained in user A's device. In any case, the relevant aspect is the bidirectional nature of the response time. It should also be noted that the response time need not be symmetrical: the response time A-B-A may differ from the response time B-A-B due to different routing and/or link behavior. Furthermore, the reaction times of user A and user B are likely to differ as well.
From the user's point of view, the interactivity of a call, represented by the response time T_resp, is an important aspect. That is, the response time T_resp should not become too large.
The T_resp estimation component 130 of the electronic device 110 is used to estimate the current response time T_resp.
FIG. 3 is a flow chart illustrating the operation of the T_resp estimation component 130 for determining the response time T_resp.
The encoder 124 is configured to provide an indication to the T_resp estimation component 130 when the content of the audio signal changes from active speech to background noise.
When the comfort noise parameter generator 127 begins generating comfort noise parameters after active speech, the encoder 124 may send a corresponding indication, which signals that user A has stopped talking.
However, in some codecs, such as the AMR and AMR-WB codecs, the discontinuous transmission (DTX) mechanism uses a DTX delayed release period. That is, when a speech burst has been encoded by the speech encoder 125, the encoder switches from speech mode to comfort noise mode only after seven further frames without active speech. In this case, the change from "speaking" to "listening" can be detected earlier by monitoring the state of the VAD flag, which indicates voice activity in the current frame.
The decoder 114 is configured to provide an indication to the T_resp estimation component 130 when the decoder 114 receives a first frame containing active speech after having received only frames with comfort noise parameters. This change indicates that user B has switched from "listening" to "speaking".
In order to determine the response time T_resp, the T_resp estimation component 130 monitors whether an indication is received from the encoder 124 signaling the start of comfort noise parameter generation (step 301). Alternatively, the T_resp estimation component 130 monitors whether the VAD flag provided by the VAD component 126 changes from "1" to "0", which indicates the end of a speech burst (step 302). This alternative is indicated in FIG. 3 with a dashed line. Both alternatives are suited to notify the T_resp estimation component 130 that user A has switched from "speaking" to "listening".
If the start of comfort noise generation or the end of a speech burst is detected, the T_resp estimation component 130 starts the timer 131 (step 303).
While the timer 131 counts the elapsed time from 0, the T_resp estimation component 130 monitors whether it receives an indication from the decoder 114 that user B has switched from "listening" to "speaking" (step 304).
When this switch is detected, the T_resp estimation component 130 stops the timer 131 (step 305) and reads the counted time (step 306).
The counted time is provided to the control component 116 as the response time T_resp.
The blocks in FIG. 3 may equally be regarded as subcomponents of the T_resp estimation component 130. That is, blocks 301 or 302 and block 304 may be considered detection components, while blocks 303, 305 and 306 may be considered timer access components configured to perform the specified functions.
The proposed mechanism provides useful results when users A and B talk alternately, not simultaneously. Some situations therefore need special care to avoid corrupting the estimate, for example the case in which one user responds before the other user has finished his or her talk turn. In this regard, the decoder 114 may also be configured to indicate when it begins receiving frames of a new speech burst. The T_resp estimation component 130 may then consider an indication in step 301 or 302 that user A starts listening only if the last information received from the decoder 114 does not indicate that user B has already started speaking.
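The timer-based estimation of FIG. 3 can be sketched as a small state machine; the method names and the representation of time as plain seconds are illustrative assumptions.

```python
class RespTimeEstimator:
    """Start timing when user A is detected to stop speaking
    (steps 301/302/303); stop and read the time when active speech
    from user B is first received (steps 304/305/306)."""

    def __init__(self):
        self._t_stop_talking = None
        self.last_resp_time = None

    def on_local_speech_end(self, now):
        """Encoder starts comfort noise, or the VAD flag drops to 0."""
        self._t_stop_talking = now          # step 303: start the timer

    def on_remote_speech_start(self, now):
        """First active speech frame received after comfort noise."""
        if self._t_stop_talking is not None:
            # steps 305/306: stop the timer and read the elapsed time
            self.last_resp_time = now - self._t_stop_talking
            self._t_stop_talking = None
```

Guarding `on_remote_speech_start` on a pending start time reflects the edge case discussed above: an indication that user B is speaking is ignored unless user A was last detected to be listening.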
Although the proposed operation provides only a rough estimate of the response time T_resp, the estimate is still useful information for adaptive jitter buffer management. It should be noted, however, that the response time T_resp may also be estimated or measured in some other way, for example based on the method described in RFC 3550 cited above.
FIG. 4 is a flow chart illustrating the operation of the control component 116 for adjusting the jitter buffer based on the response time T_resp.
In the control component 116, a first, lower predetermined threshold THR1 and a second, higher predetermined threshold THR2 are set for the response time T_resp. Additionally, a first, lower predetermined threshold LLR1 and a second, higher predetermined threshold LLR2 are set for the late loss rate (LLR) of the received frames. As described above, the late loss rate is the proportion of frames that arrive after their scheduled decoding time. That is, the late loss rate may correspond to the number of frames that the playback component 118 requests from the decoder 114 but that the decoder 114 cannot retrieve from the buffer 112 because of their late arrival; such frames are considered discarded by the decoder 114 and are typically replaced by error concealment.
According to ITU-T Recommendation G.114, end-to-end delays below 200 ms are not considered to degrade call quality, while end-to-end delays above 400 ms are considered to result in unacceptable call quality due to reduced interactivity. Based on this recommendation, the threshold THR1 may be set to, for example, 400 ms, while the threshold THR2 may be set to, for example, 800 ms. Further, the thresholds for the late loss rate may be set to, for example, LLR1 = 0% and LLR2 = 1.5%.
However, the second, higher threshold LLR2 may also be computed by the control component 116 as a function of the received estimated response time Tresp. That is, a higher threshold LLR2 may be used for a higher estimated response time Tresp, thus accepting a higher loss rate to achieve better interactivity.
When the control component 116 receives an estimated response time Tresp, the control component 116 first determines whether the response time Tresp is lower than the threshold THR1 (step 401).
If the response time Tresp is below the threshold THR1, the control component 116 selects a scaling value suitable for maintaining the late loss rate below the predetermined threshold LLR1 (step 402). Note that since the response time includes the buffering time, the scaling operation will change the value of the response time. To take this correlation into account, the response time estimate Tresp may be initialized when the reception of a conversation begins and updated upon each scaling operation.
When the estimated response time Tresp is above the threshold THR1 and below the threshold THR2 (step 403), the control component 116 selects a scaling value suitable for maintaining the late loss rate below the predetermined threshold LLR2 (step 405).
Alternatively, when the response time satisfies THR1 < Tresp < THR2, the control component 116 may first compute the threshold LLR2 for the late loss rate as a function of the estimated response time Tresp, i.e., LLR2 = f(Tresp). This option is indicated in FIG. 4 by a dashed line (step 404). The control component 116 then selects a scaling value suitable for maintaining the late loss rate below the computed threshold LLR2 (step 405).
The estimated response time Tresp is not allowed to increase beyond the threshold THR2.
The scaling value selected in step 402 or step 405 is provided to the adjustment component 115 via a scaling command. The adjustment component 115 may then scale the received frames based on the received scaling value (step 406).
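The threshold logic of steps 401 to 405 may be sketched as follows, for illustration only. The numeric values match the examples given above; the linear form of f(Tresp) in step 404 is an assumption, since the disclosure does not specify the function, and all names are hypothetical.

```python
# Illustrative sketch of the FIG. 4 control logic; times in seconds.
THR1, THR2 = 0.4, 0.8        # example delay thresholds (cf. G.114)
LLR1, LLR2_MAX = 0.0, 0.015  # example late-loss-rate targets (0% and 1.5%)

def llr2_for(t_resp):
    # Optional step 404: LLR2 = f(Tresp). Assumed here to grow linearly
    # with the estimated response time, trading loss for interactivity.
    frac = (min(t_resp, THR2) - THR1) / (THR2 - THR1)
    return LLR2_MAX * max(frac, 0.0)

def select_loss_target(t_resp, use_adaptive_llr2=False):
    """Steps 401-405: choose the late-loss-rate target that the selected
    scaling value should maintain."""
    if t_resp < THR1:                 # step 401 -> step 402
        return LLR1
    if use_adaptive_llr2:             # step 404 (dashed path)
        return llr2_for(t_resp)
    return LLR2_MAX                   # step 403 -> step 405
```

With these example values, a response time of 200 ms keeps the tight 0% target, while 600 ms relaxes the target to 1.5%, or to 0.75% when the assumed adaptive LLR2 is used.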
The blocks in FIG. 4 may be equivalently viewed as sub-components of the control component 116. That is, blocks 401 and 403 may be considered comparing components, while blocks 402, 404 and 405 may be considered processing components configured to perform the specified functions.
It is to be understood that the proposed operation is merely one general example of jitter buffer management that uses the response time to control the adjustment process. Many variations of this method are possible.
The components 111, 122, 130, and 131 of the electronic device 110 shown in fig. 1 may be implemented as hardware, such as a chip or circuitry on a chipset. The entire collection may be implemented as, for example, an Integrated Circuit (IC). Alternatively, these functions may also be partially or wholly implemented in the form of computer program code.
Fig. 5 is a block diagram showing details of another exemplary embodiment of an electronic device according to the present invention, wherein the functions are implemented by computer program code.
The electronic device 510 includes a processor 520 as well as an audio input component 530, an audio output component 540, an interface 550 and a memory 560, all connected to the processor 520. The audio input component 530 may include, for example, a microphone. The audio output component 540 may include, for example, a speaker. The interface 550 may be, for example, an interface to a packet switched network.
The processor 520 is configured to execute available computer program code.
The memory 560 stores various computer program code. The stored code comprises computer program code designed for encoding audio data, for decoding audio data using an adaptive jitter buffer, and for determining a response time Tresp to be used as an input variable when adjusting the jitter buffer.
When a VoIP session has been established, the processor 520 may retrieve the code from the memory 560 and execute it to implement encoding and decoding operations, including, for example, those described with reference to FIGS. 3 and 4.
It will be appreciated that the same processor 520 may also execute computer program code implementing other functions of the electronic device 510.
Although the exemplary embodiments of FIGS. 1 to 5 have been described as using the estimated response time Tresp as the parameter for adjusting the jitter buffer, it will be appreciated that a similar approach may also use a unidirectional end-to-end delay Dend_to_end as the parameter. In FIG. 1, the response time estimation component 130 may then be an end-to-end delay estimation component. The one-way delay may be measured or estimated, for example, by using the NTP-based method described above. The process in FIG. 4 may be used as shown by simply using the estimated end-to-end delay Dend_to_end instead of the estimated response time Tresp; this option is also indicated in FIG. 4 within parentheses. The selected thresholds THR1 and THR2 then need to be set accordingly. Furthermore, in the embodiment of FIG. 5, the option of using the unidirectional end-to-end delay Dend_to_end instead of the response time Tresp has likewise been indicated in parentheses.
The functionality represented by the control component 116 in FIG. 1, or by the computer program code of FIG. 5, can be equivalently viewed as means for determining a desired amount of adjustment of a jitter buffer at a first device by using an estimated delay as a parameter, the delay comprising at least the end-to-end delay in at least one direction in a call for which call voice signals are transmitted in packets between the first device and a second device via a packet-switched network. The functionality represented by the adjustment component 115 of FIG. 1, or by the computer program code of FIG. 5, may be equivalently viewed as means for adjusting the jitter buffer based on the determined amount of adjustment. The functionality represented by the Tresp estimation component 130 in FIG. 1, or by the computer program code of FIG. 5, can equally be seen as means for estimating the delay.
While there have been shown, described, and pointed out fundamental novel features of the invention as applied to preferred embodiments thereof, it will be understood that various omissions and substitutions and changes in the form and details of the devices and methods described may be made by those skilled in the art without departing from the spirit of the invention. For example, it is expressly intended that all combinations of those elements and/or method steps which perform substantially the same function in substantially the same way to achieve the same results are within the scope of the invention. Moreover, it should be recognized that structures and/or elements and/or method steps shown and/or described in connection with any disclosed form or embodiment of the invention may be incorporated in any other disclosed or described or suggested form or embodiment as a general matter of design choice. Accordingly, the invention is to be limited only as indicated by the scope of the appended claims. Furthermore, the functionally defined terms in the claims are intended to include not only the structures described herein that perform the recited function, but also equivalent structures.
Claims (23)
1. A method for jitter buffer adjustment, the method comprising:
determining, at a first device, a desired amount of adjustment of a jitter buffer by using an estimated delay as a parameter, the delay comprising at least an end-to-end delay in at least one direction in a call for which call voice signals are sent in packets between the first device and a second device via a packet switched network; and
adjusting the jitter buffer based on the determined amount of adjustment;
wherein determining the adjustment amount comprises:
determining the amount of adjustment such that the number of frames arriving at the first device after a scheduled decoding time is kept below a first threshold, as long as the estimated delay is below a further threshold; and
determining the amount of adjustment such that the number of frames arriving at the first device after a scheduled decoding time is kept below a second threshold, when the estimated delay exceeds the further threshold.
2. The method of claim 1, wherein the estimated delay is an estimated response time in a call, the estimated response time being a time between an end of a segment of speech uttered by a user of the first device and a beginning of presentation by the first device of a segment of speech uttered by a user of the second device.
3. The method of claim 1, further comprising determining the second threshold as a function of the estimated delay.
4. The method of claim 1, further comprising estimating the delay.
5. The method of claim 2, further comprising estimating the response time, wherein the response time is estimated by considering a basic structure of a call.
6. The method of claim 2, further comprising estimating the response time, wherein the response time is estimated as a time period between:
detecting, at the first device, a time when a user of the first device switches from speaking to listening; and
a time when a user of the second device switches from listening to speaking is detected at the first device.
7. The method according to claim 6, wherein the time when it is detected that the user of the second device switches from listening to speaking is the time when the first device receives a first segment of a speech signal containing active speech via the packet switched network after having received at least one segment of the speech signal not containing active speech.
8. The method of claim 6, wherein the time when the user of the first device is detected to switch from speaking to listening is a time when the first device starts generating comfort noise parameters.
9. The method according to claim 6, wherein the time when it is detected that the user of the first device switches from speaking to listening is a time when a voice activity detection component of the first device sets a flag to a value that indicates that a current segment of a speech signal to be sent via the packet switched network does not contain voice.
10. An apparatus for jitter buffer adjustment, the apparatus comprising:
a control component configured to determine, at a first device, a desired amount of adjustment of a jitter buffer by using an estimated delay as a parameter, the delay comprising at least an end-to-end delay in at least one direction in a call for which call voice signals are sent in packets between the first device and a second device via a packet-switched network; and
a jitter buffer adjustment component configured to adjust the jitter buffer based on the determined adjustment amount;
wherein the control component is configured to determine the adjustment amount by:
determining the amount of adjustment such that the number of frames arriving at the first device after a scheduled decoding time is kept below a first threshold, as long as the estimated delay is below a further threshold; and
determining the amount of adjustment such that the number of frames arriving at the first device after a scheduled decoding time is kept below a second threshold, when the estimated delay exceeds the further threshold.
11. The apparatus of claim 10, wherein the estimated delay is an estimated response time in a call, the estimated response time being a time between an end of a segment of speech uttered by a user of the first device and a beginning of presentation by the first device of a segment of speech uttered by a user of the second device.
12. The apparatus of claim 10, wherein the control component is further configured to determine the second threshold as a function of the estimated delay.
13. The apparatus of claim 10, the apparatus further comprising an estimation component configured to estimate the delay.
14. The apparatus of claim 11, the apparatus further comprising a response time estimation component configured to estimate the response time by considering a basic structure of a call.
15. The apparatus of claim 11, the apparatus further comprising a response time estimation component configured to estimate the response time as a time period between:
detecting, at the first device, a time when a user of the first device switches from speaking to listening; and
a time when a user of the second device switches from listening to speaking is detected at the first device.
16. The apparatus of claim 15, wherein the response time estimating component is configured to estimate a time when a user of the second device is detected to switch from listening to speaking as a time when the first device receives a first segment of a voice signal containing active speech via the packet switched network after at least one segment of the voice signal not containing active speech has been received.
17. The apparatus of claim 16, the apparatus further comprising a decoder configured to indicate to the response time estimation component that it receives a first segment of a speech signal containing active speech after having received at least one segment of the speech signal not containing active speech.
18. The apparatus of claim 15, wherein the response time estimation component is configured to estimate the time when the user of the first device is detected to switch from speaking to listening as the time when the first device begins generating comfort noise parameters.
19. The apparatus of claim 18, the apparatus further comprising an encoder configured to indicate to the response time estimation component: the moment it starts generating comfort noise parameters.
20. The apparatus of claim 15, wherein the response time estimating component is configured to estimate the time when the user of the first device is detected to switch from speaking to listening as the time when a voice activity detection component of the first device sets a flag to a value indicating that a current segment of a speech signal to be transmitted via the packet switched network does not contain voice.
21. The device of claim 20, the device further comprising the voice activity detection component.
22. An electronic device for jitter buffer adjustment, the electronic device comprising:
an audio receiver; and
an audio transmitter;
wherein the audio receiver comprises the apparatus of claim 10.
23. A system for jitter buffer adjustment, the system comprising:
the first electronic device of claim 22; and
a second electronic device configured to exchange voice signals for a call with the first electronic device via a packet-switched network.
Applications Claiming Priority (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US11/508,562 US7680099B2 (en) | 2006-08-22 | 2006-08-22 | Jitter buffer adjustment |
| US11/508,562 | 2006-08-22 | ||
| PCT/IB2007/053225 WO2008023303A2 (en) | 2006-08-22 | 2007-08-14 | Jitter buffer adjustment |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1130378A1 HK1130378A1 (en) | 2009-12-24 |
| HK1130378B true HK1130378B (en) | 2014-03-07 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| EP2055055B1 (en) | Adjustment of a jitter memory | |
| JP4426454B2 (en) | Delay trade-off between communication links | |
| CA2578737C (en) | Method and apparatus for an adaptive de-jitter buffer | |
| JP5442771B2 (en) | Data transmission method in communication system | |
| US7283585B2 (en) | Multiple data rate communication system | |
| CN1943189B (en) | Method and apparatus for increasing perceived interactivity in communications systems | |
| AU2007349607C1 (en) | Method of transmitting data in a communication system | |
| JP2006238445A (en) | Method and apparatus for processing network jitter in a Voiceover IP communication network using a virtual jitter buffer and time scale modification | |
| WO2008023302A1 (en) | Discontinuous transmission of speech signals | |
| EP3220603B1 (en) | Jitter buffer apparatus and method | |
| EP2158753B1 (en) | Selection of audio signals to be mixed in an audio conference | |
| HK1130378B (en) | Jitter buffer adjustment | |
| JP2009076952A (en) | TV conference apparatus and TV conference method | |
| JP4050961B2 (en) | Packet-type voice communication terminal | |
| JP2006253843A (en) | Voice data interpolator and voice data interpolation method | |
| AU2012200349A1 (en) | Method of transmitting data in a communication system |