WO2008147272A1

WO2008147272A1 - A conference bridge and a method for managing packets arriving therein

Info

Publication number: WO2008147272A1
Application number: PCT/SE2007/050395
Authority: WO
Inventors: Anders Eriksson; Tommy Falk
Original assignee: Telefonaktiebolaget Lm Ericsson (Publ)
Priority date: 2007-06-01
Filing date: 2007-06-01
Publication date: 2008-12-04

Abstract

A conference bridge for managing arriving packets of multiple packetized media streams of a non-synchronous packet network includes queue memories (22) arranged to queue arrived packets for each stream. A control unit (24) monitors the queued packets to detect arrival of temporally related packets of the streams. A mixer (16) mixes selected temporally related packets once it has been detected that they have arrived.

Description

A CONFERENCE BRIDGE AND A METHOD FOR MANAGING PACKETS ARRIVING THEREIN

TECHNICAL FIELD

The present invention relates to managing packets of multiple packetized media streams arriving in a conference bridge of a non-synchronous packet network.

BACKGROUND

Conferencing capability allows for group communication and collaboration among geographically dispersed participants (also called users below). Historically, conferencing has been achieved in the Public Switched Telephone Network (PSTN) by means of a centralized conference bridge. In such a circuit switched network, the mixing of real-time media streams from several users can usually be performed without causing any substantial additional delay. In e.g. a voice teleconference, the individual audio samples from the participants are synchronized and arrive at regular time intervals. This means that the samples can be scheduled to be processed at regular time intervals and no additional delay is added except the time needed for the processing. The processing for a voice teleconference usually consists of determining which talkers are active and summing the speech contributions from the active talkers.

Current trends point towards the migration of voice communication services from the circuit-switched PSTN to non-synchronous packet-based Internet Protocol (IP) networks. This shift is motivated by a desire to provide data and voice services on a single, packet-based network infrastructure. In a non-synchronous packet network, the audio samples (or coded parameters representing the audio samples) from the participants in e.g. a voice teleconference usually do not arrive at regular time intervals due to the jitter in the transport network.

In order to synchronize the speech contributions from the participants and thus make it possible to mix samples corresponding to the same time from all participants, jitter buffers are typically implemented in the conference bridge on the incoming speech to cater for the varying delay of the packets.

SUMMARY

A drawback of the prior art jitter buffer approach is that the jitter buffers introduce an undesirable extra delay in the conference bridge.

An object of the present invention is to reduce the delay in a conference bridge.

This object is achieved in accordance with the attached claims.

Briefly, the present invention queues arrived packets for each stream. The queued packets are monitored to detect arrival of temporally related packets of the streams. Selected temporally related packets are mixed once it has been detected that they have arrived.

An advantage of the present invention is that it reduces the overall delay in the network. Instead of introducing a fixed delay in the conference bridge, the jitter in the incoming packets is forwarded to be handled by the receiving terminal.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention, together with further objects and advantages thereof, may best be understood by making reference to the following description taken together with the accompanying drawings, in which:

Fig. 1 is a simple block diagram of a network with a conference bridge;

Fig. 2 is a more detailed block diagram of a typical prior art non-synchronous packet network with a conference bridge;

Fig. 3 is a time diagram illustrating jitter buffering in a typical prior art conference bridge;

Fig. 4 is a time diagram illustrating time dependence of the jitter from a first user to the conference bridge;

Fig. 5 is a time diagram illustrating time dependence of the jitter from the conference bridge to a second user;

Fig. 6 is a time diagram illustrating time dependence of the combined jitter from the first user to the second user;

Fig. 7 is a time diagram illustrating an embodiment of the method in accordance with the present invention;

Fig. 8 is a block diagram of an embodiment of a conference bridge in accordance with the present invention;

Fig. 9 is a time diagram illustrating another embodiment of the method in accordance with the present invention;

Fig. 10 is a time diagram illustrating still another embodiment of the method in accordance with the present invention;

Fig. 11 is a block diagram of another embodiment of a conference bridge in accordance with the present invention;

Fig. 12 is a flow chart illustrating the principles of the method in accordance with the present invention;

Fig. 13 is a flow chart illustrating an embodiment of the method in accordance with the present invention; and

Fig. 14 is a flow chart illustrating a further embodiment of the method in accordance with the present invention.

DETAILED DESCRIPTION

In the following description elements having the same or similar functions will be provided with the same reference designations in the drawings.

Fig. 1 is a simple block diagram of a network with a conference bridge 10. Users A-E each have bidirectional connections to conference bridge 10. In one direction each user sends audio or voice packets to conference bridge 10. In the opposite direction each user receives combined or mixed packets from the other users or participants in the conference. For example, user A sends packets A to conference bridge 10 and receives a mix BCDE of packets from the other users. The purpose of the conference bridge is to manage received packets and perform the mixing that is relevant for each user.

Fig. 2 is a more detailed block diagram of a typical prior art non-synchronous packet network with a conference bridge 10. In order to simplify the description, Fig. 2 only illustrates how packets from users A-D are mixed and forwarded to user E. The other users are managed in a similar way.

Packets from users A-D reach respective jitter buffers 12 in conference bridge 10, where they are delayed. When the packets are released from jitter buffers 12, they are forwarded to respective decoders 14. Decoders 14 decode the packets into samples that are forwarded to a selecting and mixing unit 16. After mixing, the resulting samples are encoded into packets in an encoder 18 and forwarded to user E. A clock unit 20 releases packets from jitter buffers 12 at regular time instants separated by a time interval T, which corresponds to the length of a speech frame, typically 20-40 ms. The added delay in the jitter buffers is typically 1-3 time intervals T.

Fig. 3 is a time diagram illustrating jitter buffering in the conference bridge of Fig. 2. This example assumes a jitter buffer delay of one time interval T. At time instant k all temporally related packets from users A-D have arrived simultaneously at conference bridge 10. The feature that the packets are temporally related means that they are based on samples generated (at the users) at approximately the same absolute or global time, i.e. they represent approximately simultaneous events. Due to the delay in jitter buffers 12, mixing is not performed until time instant k+1, as illustrated by the arrow in the lower left corner of Fig. 3. At time instant k+1 all temporally related packets have also arrived, but this time they were not synchronous. However, the buffering until time instant k+2 makes them synchronous. Similar comments apply to the temporally related packets arriving between instants k+2 and k+3. It is noted that so far the extra delay provided by the jitter buffers was not actually needed, since all packets arrived in time for mixing at the next periodic time instant. This, however, is changed at time instant k+4, since the packet from user A would have arrived too late for mixing without the extra delay. Thus, for such situations the extra delay provided for late packets by jitter buffers 12 is actually useful.
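The periodic release described above can be sketched as follows. This is a minimal illustration, not taken from the drawings: it assumes that frame n nominally arrives at time n·T, that times are expressed in milliseconds, and that the function names and parameters are hypothetical.

```python
def release_time(frame_index, frame_ms, buffer_intervals=1):
    """Periodic instant at which the clock unit releases a frame for mixing.

    Assumption: frame n nominally arrives at n * frame_ms and is held for
    `buffer_intervals` whole frame intervals T.
    """
    return (frame_index + buffer_intervals) * frame_ms


def arrives_in_time(arrival_ms, frame_index, frame_ms, buffer_intervals=1):
    """True if a packet has arrived by the instant at which its frame is mixed."""
    return arrival_ms <= release_time(frame_index, frame_ms, buffer_intervals)


# A packet of frame 4 (nominal time 80 ms with T = 20 ms) that is 12 ms late:
late_arrival = 92
with_buffer = arrives_in_time(late_arrival, 4, 20, buffer_intervals=1)
without_buffer = arrives_in_time(late_arrival, 4, 20, buffer_intervals=0)
```

Under these assumptions the late packet is mixed only when the extra interval is present, which corresponds to the situation described for time instant k+4.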

The jitter buffer approach can be described mathematically as follows: The transmission time from, for example, user A to user E without a jitter buffer may be expressed as:

$$T_{A\to E}(k) = T_{A\to\text{bridge}} + \delta_{A\to\text{bridge}}(k) + T_{\text{bridge}\to E} + \delta_{\text{bridge}\to E}(k) \qquad (1)$$

where

$T_{A\to\text{bridge}}$ is the (constant) time delay from user A to the bridge,

$T_{\text{bridge}\to E}$ is the (constant) time delay from the bridge to user E,

$\delta_{A\to\text{bridge}}(k)$ is the jitter in transmission time from user A to the bridge,

$\delta_{\text{bridge}\to E}(k)$ is the jitter in transmission time from the bridge to user E.

The jitter can be assumed to obey some statistical distribution.

With the inclusion of a jitter buffer 12 in the conference bridge, the transmission time for packets from user A to user E will be:

$$T_{A\to E}(k) = T_{A\to\text{bridge}} + T_{\text{jitterbuffer}} + T_{\text{bridge}\to E} + \delta_{\text{bridge}\to E}(k) \qquad (2)$$

Since the jitter buffer in the conference bridge is designed to compensate for the possible jitter $\delta_{A\to\text{bridge}}(k)$, one can assume that:

$$T_{\text{jitterbuffer}} > \delta_{A\to\text{bridge}}(k) \qquad (3)$$

most of the time. Thus, the transmission time for packets from A to E will, on average, be larger with the inclusion of a jitter buffer. Similar reasoning for the other paths leads to jitters $\delta_{B\to\text{bridge}}(k)$, $\delta_{C\to\text{bridge}}(k)$ and $\delta_{D\to\text{bridge}}(k)$. Thus, the jitter buffers have to satisfy

$$T_{\text{jitterbuffer}} > \max\left(\delta_{A\to\text{bridge}}(k),\ \delta_{B\to\text{bridge}}(k),\ \delta_{C\to\text{bridge}}(k),\ \delta_{D\to\text{bridge}}(k)\right) \qquad (4)$$

most of the time, i.e. the media stream with the largest jitter will determine the jitter buffer delay for all streams. It is also known to have adaptive jitter buffer delays.
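As a worked illustration of equations (1), (2) and (4), the following sketch evaluates the transmission time with and without a jitter buffer in the bridge. All numeric values are invented for the example (in milliseconds), and the function names are hypothetical.

```python
def delay_without_buffer(t_a_bridge, jitter_in, t_bridge_e, jitter_out):
    """Equation (1): both jitter terms appear in the end-to-end time."""
    return t_a_bridge + jitter_in + t_bridge_e + jitter_out


def delay_with_buffer(t_a_bridge, t_jitterbuffer, t_bridge_e, jitter_out):
    """Equation (2): the incoming jitter is replaced by the fixed buffer delay."""
    return t_a_bridge + t_jitterbuffer + t_bridge_e + jitter_out


def buffer_lower_bound(jitter_per_stream):
    """Equation (4): the buffer must exceed the largest jitter of any stream."""
    return max(max(values) for values in jitter_per_stream.values())


# Hypothetical observed jitter per incoming stream, in ms.
observed = {"A": [5, 12, 3], "B": [2, 8, 1], "C": [0, 4, 6], "D": [7, 2, 18]}
bound = buffer_lower_bound(observed)              # stream D dominates
t_jb = 20                                         # chosen to exceed the bound

plain = delay_without_buffer(30, 5, 40, 10)       # packet of A with 5 ms jitter
buffered = delay_with_buffer(30, t_jb, 40, 10)    # same packet, buffered bridge
```

With these numbers the stream with the largest jitter (D) sets the buffer delay for every stream, so the packet from A is delayed by the full buffer time even though its own jitter was small.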

From equation (1) it is noted that there is a jitter $\delta_{\text{bridge}\to E}(k)$ in transmission time from the bridge to user E. This jitter is compensated by a playback jitter buffer at user E. The playback jitter buffer synchronizes the packets of the single mixed stream arriving at user E before decoding to generate a continuous stream of decoded samples at user E.

The basic concept of the invention is to exclude the jitter buffering in the conference bridge and asynchronously perform the mixing of the signals once all the packets that should be mixed have arrived. As discussed above, the purpose of the jitter buffers in a conference bridge is to enable the mixing of the correct samples from the included participants by compensating for jitter in the transmission time. However, unlike the situation for the playback in the terminal, the conference bridge does not have to produce synchronous output (for e.g. playback in a loudspeaker), but can produce asynchronous output. Instead, all the necessary synchronization for playback may be performed at the jitter buffer in the receiving terminal. The present invention is based on two observations, namely:

1. The jitter buffer at user E is of the same magnitude as the jitter buffers in the conference bridge.

2. The jitters $\delta_{A\to\text{bridge}}(k)$ and $\delta_{\text{bridge}\to E}(k)$ are typically uncorrelated. This means that it is unlikely that the two jitters are large at the same time k. This is illustrated in Figs. 4-6. Thus, on average the combined jitter $\delta_{A\to\text{bridge}}(k) + \delta_{\text{bridge}\to E}(k)$ will not result in any significantly increased delay.

These observations imply that by omitting the jitter buffers in the conference bridge, the jitters resulting at user E will on average be of the same magnitude as before, but that larger jitters will occur more frequently. However, since the jitter buffer at user E is already designed to cope with such jitter, no changes are necessary at user E.
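The effect of the two observations can be illustrated numerically. The sketch below rests on a simplifying assumption that is not stated in the description: each jitter buffer is sized to the peak jitter it has to absorb. The jitter sequences are invented, in milliseconds, and the function names are hypothetical.

```python
def extra_delay_with_bridge_buffer(jitter_in, jitter_out):
    """Bridge buffer absorbs the incoming jitter, playback buffer the outgoing."""
    bridge_buffer = max(jitter_in)
    playback_buffer = max(jitter_out)
    return bridge_buffer + playback_buffer


def extra_delay_without_bridge_buffer(jitter_in, jitter_out):
    """Only the playback buffer remains; it must absorb the combined jitter."""
    combined = [a + b for a, b in zip(jitter_in, jitter_out)]
    return max(combined)


# The peaks of the two (uncorrelated) sequences occur at different instants k.
jitter_a_to_bridge = [0, 15, 0, 2, 0, 1]
jitter_bridge_to_e = [1, 0, 14, 0, 3, 0]

buffered = extra_delay_with_bridge_buffer(jitter_a_to_bridge, jitter_bridge_to_e)
unbuffered = extra_delay_without_bridge_buffer(jitter_a_to_bridge, jitter_bridge_to_e)
```

Because the peak of a sum never exceeds the sum of the peaks, the unbuffered figure cannot be larger than the buffered one under this sizing assumption. The two coincide only when both jitters peak at the same instant.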

Fig. 7 is a time diagram illustrating an embodiment of the method in accordance with the present invention. The packets received at the conference bridge are exactly the same as in Fig. 3. However, since there are no jitter buffers in the conference bridge, the temporally related packets are mixed as soon as they have all been received. As can be seen in Fig. 7, this eliminates the jitter buffer delay and results in a mixed stream with asynchronously transmitted packets. For example, the packets that arrive early in the time interval between k+1 and k+2 can immediately be decoded and mixed as soon as all packets have been received, instead of waiting until time instant k+3 as in the prior art of Fig. 3.

Fig. 8 is a block diagram of an embodiment of a conference bridge in accordance with the present invention. Instead of forwarding packets that arrive in the conference bridge to jitter buffers, as in the prior art, they are forwarded to queue memories or FIFOs 22. Queue memories 22 are controlled by a control unit 24. Control unit 24 monitors queue memories 22 over monitor lines 26 to determine whether temporally related packets from all streams have arrived in the queue memories. As soon as all temporally related packets representing a given time interval have arrived, control unit 24 releases these packets to decoders 14 for decoding and subsequent mixing in unit 16.
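The interplay of queue memories 22 and control unit 24 may be sketched as follows. This is a simplified illustration under several assumptions that are not part of the description: each packet carries a frame index identifying its time interval, packets of one stream arrive in order, payloads are already decoded sample lists, and a single mix of all streams is produced (in practice each user receives a mix of the other users). The class and method names are hypothetical.

```python
from collections import deque


class QueueingBridge:
    """Per-stream FIFOs plus a control step that releases complete frames."""

    def __init__(self, streams):
        self.queues = {s: deque() for s in streams}  # queue memories
        self.next_frame = 0  # frame index the control step is waiting for

    def on_packet(self, stream, frame, samples):
        """Queue one arrived packet and return the (frame, mix) pairs released."""
        self.queues[stream].append((frame, samples))
        mixes = []
        # Release as soon as every queue holds the packet of the awaited frame.
        while all(q and q[0][0] == self.next_frame for q in self.queues.values()):
            packets = [q.popleft()[1] for q in self.queues.values()]
            mix = [sum(column) for column in zip(*packets)]  # sample-wise sum
            mixes.append((self.next_frame, mix))
            self.next_frame += 1
        return mixes


bridge = QueueingBridge(["A", "B"])
first = bridge.on_packet("A", 0, [1, 2])     # B still missing: nothing released
second = bridge.on_packet("A", 1, [5, 5])    # A runs ahead: still waiting for B
third = bridge.on_packet("B", 0, [10, 20])   # frame 0 is complete and is mixed
fourth = bridge.on_packet("B", 1, [1, 1])    # frame 1 is complete and is mixed
```

No clock drives the release in this sketch. Mixing is triggered by the arrival of the last missing packet, so the output is asynchronous.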

In the above description it has been assumed that packets from all users A-D should be mixed. However, if one or several users are silent, packets from these users may be discarded instead of mixed with packets from active talkers. The detection may be performed by one or more voice activity detectors (VADs), typically included in unit 16. This situation can be handled in different ways, as illustrated by Figs. 9 and 10.

In the embodiment illustrated in Fig. 9 all temporally related packets are collected and thereafter analyzed with regard to speech activity. Only packets including speech activity are then mixed. In Fig. 9 packets without speech activity have been illustrated by empty boxes. Thus, in this embodiment mixing is not performed until all temporally related packets have arrived. Sometimes this means that packets that are later determined to include no speech activity have to be awaited before previously arrived packets with speech activity can be mixed. This situation has been illustrated, for example, between time instants k+1 and k+2, where a non-speech packet from user D arrives after the speech packets from users A-C. Fig. 9 also illustrates that different users may be silent at different times. Thus, user D is silent between time instants k and k+2, while user A is silent between time instants k+2 and k+5. The embodiment illustrated in Fig. 9 can be implemented by a conference bridge in accordance with Fig. 8, provided with voice activity detection for each stream in unit 16.
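The selection step of this embodiment can be sketched as below. The energy threshold used here is only a toy stand-in for a real voice activity detector, and the function names, threshold value and sample values are hypothetical.

```python
def has_speech(samples, threshold=100):
    """Toy energy detector: mean squared amplitude above threshold squared."""
    if not samples:
        return False
    return sum(s * s for s in samples) > threshold * threshold * len(samples)


def select_and_mix(frame_packets):
    """Mix only the active packets of one complete frame.

    `frame_packets` maps each stream to its decoded samples; all temporally
    related packets are assumed to have arrived before this is called.
    Returns the names of the mixed streams and the sample-wise sum.
    """
    active = {s: p for s, p in frame_packets.items() if has_speech(p)}
    mix = [sum(column) for column in zip(*active.values())]
    return sorted(active), mix


frame = {
    "A": [400, -300],
    "B": [200, 250],
    "C": [-150, 500],
    "D": [3, -2],        # silent user: arrives, is examined, is not mixed
}
mixed_streams, mix = select_and_mix(frame)
```

Note that the silent packet from D still has to be awaited before `select_and_mix` can be called, which is the waiting cost of this embodiment.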

In the embodiment illustrated in Fig. 10, mixing is normally performed as soon as all active packets have arrived. This is accomplished by storing and maintaining a list of active streams, typically in unit 16. For example, the three active packets from users A-C can be mixed as soon as they have all arrived between time instants k+1 and k+2, since user D is not in the list of active streams, and thus the later arriving packet from user D can be ignored in the mixing. The reason is that the previous packet from user D did not include any speech. Thus, by storing and maintaining a list of previously active streams or users, this embodiment needs to wait only for packets from users in the list of active streams before mixing can be started. The fact that a stream is in the list does, however, not necessarily mean that the next arriving packet from this stream will be mixed, since the next packet may be inactive. This is illustrated between instants k+2 and k+3, where the arriving packet from stream A is inactive, thereby enabling updating of the list. The active packet from stream D between instants k+2 and k+3 is not in the list when it arrives. However, this packet may actually be included in mixing if the list is updated with the status of all packets that have been received when the packets are released from the queue memories, in this case when the inactive packet from user A has arrived. If the active packet from user D had arrived after the inactive packet from user A, it would thus not have been included in the mixing. Although late packets from streams that are not in the list are ignored for mixing purposes, they are still examined when they arrive to determine whether their inactive/active status has changed to update the list. Similarly, although arriving inactive packets from streams in the list will not be mixed, they will be used to update the list.

Comparing the embodiments of Figs. 9 and 10, it is appreciated that mixing can often be started earlier in the embodiment of Fig. 10. The trade-off is that the update of the active/inactive status of streams may occasionally be delayed one time interval T (due to late arriving active packets from streams not yet in the list), which may lead to exclusion of an actually active packet from mixing.

Fig. 11 is a block diagram of an embodiment of a conference bridge in accordance with the present invention that is suitable for implementing the method illustrated in Fig. 10. This embodiment differs from the embodiment of Fig. 8 in that selecting and mixing unit 16 has been modified into a selecting and mixing unit 30, which includes a unit 28 for maintaining a list of active streams. Unit 28 forwards a current list of active streams to control unit 24, which uses this list to release arrived temporally related packets to decoders 14 as soon as the last packet from a stream in the list has arrived.
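The list-driven release of Figs. 10 and 11 may be sketched as follows. The sketch makes assumptions beyond the description: packets carry a frame index, packets of one stream arrive in order, payloads are decoded sample lists, a peak-amplitude test stands in for a real VAD, and an empty list releases a frame on the first packet that arrives for it. All names are hypothetical.

```python
def has_speech(samples, threshold=10):
    """Toy activity test (peak amplitude); a stand-in for a real VAD."""
    return max((abs(s) for s in samples), default=0) >= threshold


class ActiveListBridge:
    def __init__(self, initially_active=()):
        self.active = set(initially_active)  # list of active streams
        self.pending = {}                    # frame -> {stream: samples}
        self.next_frame = 0

    def _update(self, stream, samples):
        if has_speech(samples):
            self.active.add(stream)
        else:
            self.active.discard(stream)

    def on_packet(self, stream, frame, samples):
        """Return the (frame, mixed streams, mix) tuples released by this arrival."""
        if frame < self.next_frame:
            # Late packet of an already mixed frame: only updates the list.
            self._update(stream, samples)
            return []
        self.pending.setdefault(frame, {})[stream] = samples
        mixes = []
        while True:
            got = self.pending.get(self.next_frame, {})
            # Wait only for the streams that are currently in the list.
            if not got or not self.active.issubset(got):
                break
            for s, p in got.items():          # update the list on release
                self._update(s, p)
            chosen = sorted(s for s in got if s in self.active)
            mix = [sum(col) for col in zip(*(got[s] for s in chosen))]
            mixes.append((self.next_frame, chosen, mix))
            del self.pending[self.next_frame]
            self.next_frame += 1
        return mixes
```

For example, with A, B and C initially in the list, frame 0 is released on the arrival of the third of their packets, and a silent packet from D arriving afterwards only confirms that D stays off the list. If D then sends an active packet for frame 1 before an inactive packet from A arrives, the release triggered by A's packet removes A from the list, adds D, and mixes B, C and D.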

In the embodiments illustrated in Figs. 8 and 11, voice activity detection is assumed to be performed on decoded signals (samples). However, it is also possible to perform voice activity detection directly on the coded speech parameters (before decoders 14). This can for example be performed using the techniques described in [1] combined with the relevant parts of a standard VAD (e.g. 3GPP 26.094), or as exemplified by [2].

The master clock for the mixing may either be a reference clock in the mixer itself or may be derived from the included participants (e.g. the median time).

In the description above it has been assumed that packets arrive with reasonable delays to be included in the mixing. However, if this is not the case, concealment strategies may be applied. For example, if an expected packet from user A has not arrived within a predetermined time-out period, for example 3-5 time intervals T, a concealment packet (for example the last received packet from user A) may be included in the mixing instead. In this case the late packet is typically discarded if it eventually arrives. The time-out period for late arriving packets should be set according to the statistics of the jitter and the desired lost frame rate. A concealment unit is typically provided in selecting and mixing unit 16, 30.
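The time-out rule can be sketched as a single decision function. The sketch assumes that the waiting time is measured from the previous mix and that the last received packet of a stream serves as the concealment packet. The function name, parameters and values are hypothetical.

```python
def packets_for_mix(expected, arrived, last_received, waited_ms, timeout_ms):
    """Decide what to mix for one frame.

    Returns None while expected packets are missing and the time-out has not
    expired. Otherwise returns a dict stream -> samples in which each missing
    packet is replaced by the last packet received from that stream, if any.
    """
    missing = [s for s in expected if s not in arrived]
    if missing and waited_ms < timeout_ms:
        return None  # keep waiting for the late packets
    chosen = {s: arrived[s] for s in expected if s in arrived}
    for s in missing:
        if s in last_received:
            chosen[s] = last_received[s]  # concealment packet
    return chosen


T = 20                       # frame interval in ms
timeout = 4 * T              # within the 3-5 intervals mentioned above
arrived = {"B": [1, 1]}
last = {"A": [9, 9], "B": [2, 2]}

still_waiting = packets_for_mix(["A", "B"], arrived, last, 40, timeout)
concealed = packets_for_mix(["A", "B"], arrived, last, 80, timeout)
```

A caller would additionally discard the real packet from A if it arrives after the concealed mix has been produced.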

In order to handle possible drift between the clocks of the sample circuits of the respective user terminals, a mechanism for estimating and correcting for the clock drift may be included. The clock drift preferably is handled at the mixing point, since otherwise an increasing time difference between users with clock drift would be introduced in the mixed signal. Methods for determining clock drift can be found in e.g. [3, 4], which are hereby incorporated by reference.

The functionality of the conference bridge of the present invention is typically implemented by a microprocessor or micro/signal processor combination and corresponding software.

Fig. 12 is a flow chart illustrating the principles of the method in accordance with the present invention. Step S1 queues packets that have arrived in the conference bridge for each of the streams. Step S2 monitors the queued packets to detect arrival of temporally related packets of the streams. Step S3 mixes selected temporally related packets once it has been detected that they have arrived. The same steps are performed during the next time interval T.

Fig. 13 is a flow chart illustrating an embodiment of the method in accordance with the present invention. This embodiment is suitable for the approach described in Fig. 9. Step S1 queues packets that have arrived in the conference bridge for each of the streams. Step S2 monitors the queued packets to detect arrival of temporally related packets of the streams. Step S4 tests whether all temporally related packets of the streams have arrived. If so, active temporally related packets are selected in step S5 and mixed in step S6. Otherwise the procedure returns to step S4. The same steps are performed during the next time interval T.

Fig. 14 is a flow chart illustrating a further embodiment of the method in accordance with the present invention. This embodiment is suitable for the approach described in Fig. 10. Step S1 queues packets that have arrived in the conference bridge for each of the streams. Step S7 selects streams eligible for mixing from a list of currently active streams. Step S8 monitors the queued packets to detect arrival of temporally related packets of the streams in the list. Step S9 tests whether all temporally related packets of the streams in the list have arrived. If so, the list of active streams is updated in step S10 and then the received temporally related packets from streams in the list are selected and mixed in step S11. Otherwise the procedure returns to step S8. The same steps are performed during the next time interval T.

It will be understood by those skilled in the art that various modifications and changes may be made to the present invention without departure from the scope thereof, which is defined by the appended claims.

REFERENCES

[1] US 2002/0184010 A1

[2] US 2003/0135370 A1

[3] Tõnu Trump, "Maximum Likelihood Trend Estimation in Exponential Noise", IEEE Transactions on Signal Processing, Vol. 49, No. 9, September 2001, pp. 2087-2095.

[4] Tõnu Trump, "Compensation for clock skew in voice over packet networks by speech interpolation", Proceedings of ISCAS 2004, pp. 608-611.

Claims

1. A method of managing packets of multiple packetized media streams arriving in a conference bridge of a non-synchronous packet network, including the steps of queuing (S1) arrived packets for each stream; monitoring (S2) the queued packets to detect arrival of temporally related packets of the streams; mixing (S3) selected temporally related packets once it has been detected that they have arrived.

2. The method of claim 1, including the step of selecting (S5) packets to be included in the mixing (S6) when all temporally related packets of the streams have arrived (S4).

3. The method of claim 1, including the steps of selecting (S11) packets to be included in the mixing from a list of currently active (S10) streams; starting mixing (S11) as soon as the temporally related packets of the streams in the list have arrived (S8, S9).

4. The method of any of the preceding claims, including the step of replacing an expected packet to be included in the mix by an error concealment packet if the expected packet has not arrived within a predetermined time period after a previous mix.

5. The method of claim 2 or 3, including the step of determining the active/inactive status of packets by voice activity detection.

6. A conference bridge for managing arriving packets of multiple packetized media streams of a non-synchronous packet network, including queue memories (22) arranged to queue arrived packets for each stream; a control unit (24) arranged to monitor the queued packets to detect arrival of temporally related packets of the streams; a mixer (16, 30) arranged to mix selected temporally related packets once it has been detected that they have arrived.

7. The conference bridge of claim 6, including a packet selector (16) arranged to select packets to be included in the mix when all temporally related packets of the streams have arrived.

8. The conference bridge of claim 6, including a packet selector (28) arranged to determine packets to be included in the mix from a list of currently active streams; a mixer (30) arranged to start mixing as soon as the temporally related packets of the currently active streams in the list have arrived.

9. The conference bridge of any of the preceding claims 6-8, including an error concealer (16, 30) arranged to replace an expected packet to be included in the mix by an error concealment packet if the expected packet has not arrived within a predetermined time period after a previous mix.

10. The conference bridge of claim 7 or 8, including at least one voice activity detector for determining the active/inactive status of packets.