WO2001019062A1

WO2001019062A1 - Suppression of residual acoustic echo

Info

Publication number: WO2001019062A1
Application number: PCT/IL2000/000527
Authority: WO
Inventors: Adoram Erell
Original assignee: DSPC Technologies Ltd
Current assignee: DSPC Technologies Ltd
Priority date: 1999-09-07
Filing date: 2000-09-04
Publication date: 2001-03-15
Anticipated expiration: 2002-03-07
Also published as: GB2369001A; GB0203162D0; DE10084970T1; AU6862200A

Abstract

A communication device having residual echo suppression includes an echo canceller (22) and a residual echo system (50, 52). The echo canceller cancels (30) most of an echo of a far end signal found in a microphone signal and provides an echo canceled signal with a residual echo therein. The residual echo system suppresses (60) the residual echo when the echo canceller cancels a significant amount of echo. The residual echo system can either transmit (42) comfort noise or can transmit (38) the echo canceled signal at a low rate when the echo canceller cancels a significant amount of echo. If the former, the comfort noise parameters are updated (40) from background noise but not when the echo canceller cancels a significant amount of echo.

Description

SUPPRESSION OF RESIDUAL ACOUSTIC ECHO

FIELD OF THE INVENTION

The present invention relates to echo cancellation generally and to the cancellation of residual echoes, in particular.

BACKGROUND OF THE INVENTION

When a telephone operates in hands-free mode (i.e. with the speaker open), the user speaks into the microphone and hears the voice of his conversant through the loudspeaker. Unfortunately, in duplex transmission (both sides can talk at the same time), the microphone also "hears" the output of the loudspeaker.

This acoustic coupling between the near end signal (the signal picked up by the microphone) and the far end signal (as reproduced by the loudspeaker) creates an unpleasant echo. Echo cancellers are used to eliminate the echo due to the acoustic coupling.

Fig. 1, to which reference is now made, illustrates the situation for a digital telephone. Fig. 1 shows a telephone having a loudspeaker 10 and a microphone 12. A frame of the far end signal, which eventually is provided to loudspeaker 10, is first decoded by a decoder 14, after which its volume is adjusted by digital volume control 16. The resultant signal is converted by a codec 18 to an analog signal whose volume is further adjusted by analog volume control 20. As mentioned above, microphone 12 picks up both the near end signal, as spoken by the user, and the output of loudspeaker 10. This is noted by the arrow labeled "acoustic echo". The output of microphone 12 is digitally converted by codec 18, formed into a frame, and provided to an echo canceller 22 which also receives the frame of the far end signal as produced by digital volume control 16. Echo canceller 22 uses the received signal in order to remove the portion of the microphone signal which is an echo of the far end signal. The result is passed to an encoder 26 for transmission.

The telephone of Fig. 1 operates in discontinuous transmission mode (DTX) as shown in Fig. 2, to which reference is now made. The echo cancellation (step 30) is followed by voice activity detection (step 32). If speech was found (as checked in step 34), the frame is encoded (step 36) and then transmitted (step 38). If there was no speech, parameters of "comfort noise" are transmitted so that the receiver can generate a comfort noise frame. The parameters are first updated (step 40) from the current, non-speech frame, and then the parameters are transmitted (step 42). The comfort noise parameters are calculated from the background noise and are continually updated from each background noise frame. However, it takes a few frames until the telephone has sufficient knowledge of the background noise. Thus, for the first few frames of operation the transmitter transmits as if the signal had speech content, even if the signal was one of non-speech.

Echo cancellation involves subtracting a synthesized estimate of the echo, where the synthesized echo is estimated by adaptive filtering techniques. Since the estimated echo is never an exact replica of the true echo, the subtraction is usually followed by some kind of further suppression of the residual echo. Methods originally designed for analog communication, such as those required by the G.167 standard of the Telecommunication Industry Association (TIA), use center clipping to suppress the low-level residual echo. A review of the pros and cons of center clipping can be found in PCT Publication WO 99/03093. Generally, the disadvantage of center clipping is that it is a relatively "brutal" processing of the signal, creating discontinuities and large variations in the background noise level and overall sound quality.
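As an illustration of why center clipping is described as "brutal", the following minimal sketch zeroes every sample whose magnitude does not exceed a threshold. The function name, the threshold value and the sample frame are assumptions made for illustration only; they are not taken from G.167 or from the cited publications.

```python
def center_clip(samples, threshold):
    """Zero every sample whose magnitude is at or below `threshold`.

    Samples above the threshold pass unchanged, so the output jumps
    abruptly between silence and full-level signal.
    """
    return [s if abs(s) > threshold else 0.0 for s in samples]


# Hypothetical frame: low-level residual echo around a louder burst.
frame = [0.02, -0.03, 0.50, -0.40, 0.01]
clipped = center_clip(frame, threshold=0.05)
```

In this sketch the low-level samples, which would also carry the background noise, are removed together with the residual echo. This is the source of the background-noise variations mentioned above.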

To overcome some of these shortcomings, WO 99/03093 suggests a smooth addition of synthetic comfort noise to the center-clipped signal. Another method, which totally replaces the residual echo by synthetic noise, is presented in US Patent 5,687,229. A disadvantage of the noise-generation method is that it requires a relatively complex signal processing module to continuously monitor the background noise and to generate a faithful comfort noise to replace or mask the residual echo.

For a digital communication network, comfort noise is commonly used for the purpose of discontinuous transmission (DTX) as explained hereinabove with respect to Fig. 2. In DTX, when there is no speech in the up link channel (to the mobile station), the base station generates comfort noise based on noise information sent from the mobile station. Thus, PCT Publication WO 96/42142 describes a method where the already existing comfort-noise mechanism is extended to an echo-suppressor device in the base station. In this arrangement, the comfort noise generation extends, in addition to non-speech periods, also to periods of residual echo. The method presented in WO 96/42142 is similar to those of US Patent 5,222,251 and US Patent 5,835,851, both of which were formulated for the older half-duplex systems. The method in WO 96/42142 involves echo-suppression in the base station rather than in the mobile station. In US 5,222,251, on the other hand, the full echo, rather than the residual echo, is replaced by one or more "noise codewords". These "noise codewords" can be thought of as comfort noise. In US 5,835,851 the "noise codewords" are variably mixed with the echo rather than replacing it, to avoid cutting of the near-end speech.

None of WO 96/42142, US 5,222,251 and US 5,835,851 describes a full solution for the elimination of residual noise in the mobile station. For example, the criterion used in US 5,222,251 and US 5,835,851 to decide when to replace the residual echo by comfort noise is problematic for a residual-echo suppressor in the handset. In US 5,222,251 and US 5,835,851, which aim at half-duplex systems, the criterion for replacing or mixing-in generated noise is the existence of a signal in the down-link channel, which means that an echo exists in the up-link channel. This criterion cannot be used for full-duplex communication because echo can be accompanied by near-end speech in a double-talk situation.

The criterion presented in WO 96/42142 for the base station can, in principle, be extended also to the mobile station in a full-duplex conversation. However, this criterion has several drawbacks, which will be explained herein. In WO 96/42142, the decision to process the residual echo after echo cancellation is based on the mutual fulfillment of two conditions:

1. Speech activity is detected in the down-link channel.

2. There is no speech activity in the up-link channel.

The criterion creates two problems:

Condition (2) is difficult to detect, because it is difficult to discriminate between echo-only and double-talk (i.e., echo plus near-end speech). Thus a situation can occur in which the near-end speech is erroneously replaced by comfort noise, or vice-versa: the echo is not replaced by comfort-noise and is therefore not suppressed.

Even if a double-talk situation is properly detected, the residual echo is not suppressed during the double-talk period. During the low-energy segments of the near-end speech, the residual echo is not masked by the near-end speech and can be disturbing.

Another issue is that the method of WO 96/42142 can be implemented only for a digital network with DTX. However, the new generation of CDMA networks does not use DTX, but rather a variable rate method. In the variable rate method there is no comfort-noise generator in the base station, so residual echo cannot be as easily replaced by comfort noise.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will be understood and appreciated more fully from the following detailed description taken in conjunction with the appended drawings in which:

Fig. 1 is a schematic illustration of a prior art hands-free telephone;

Fig. 2 is a flow chart illustration of the operation of the telephone of Fig. 1 in hands-free mode;

Fig. 3 is a schematic illustration of a hands-free telephone capable of suppressing the residual echo, constructed and operative in accordance with a preferred embodiment of the present invention; and

Fig. 4 is a flow chart illustration of the operation of the telephone of Fig. 3 in hands-free mode.

DETAILED DESCRIPTION OF THE PRESENT INVENTION

Reference is now made to Figs. 3 and 4, which illustrate a digital telephone capable of suppressing the residual echo, constructed and operative in accordance with a preferred embodiment of the present invention, and its method of operation, respectively. The telephone of Fig. 3 has elements similar to those of the prior art. These similar elements, labeled with similar reference numerals, will not be further described.

The present invention adds a residual echo detector 50 to the prior art telephone and replaces the encoder of the prior art with a residual echo suppressing encoder 52 that can operate with detector 50.

In accordance with a preferred embodiment of the present invention, residual echo detector 50 takes input signal IN and output signal OUT of echo canceller 22 and determines when a significant amount of echo has been cancelled, as discussed hereinbelow. When this occurs, detector 50 indicates to encoder 52 to suppress the echo in the frame.

Residual echo detector 50 calculates the following criterion:

$E_{IN} / E_{OUT} > \mathrm{ENR\_THRES}$ [1]

where $E_{IN}$ and $E_{OUT}$ are the estimated energies in the input IN and output OUT frames of echo canceller 22 and ENR_THRES is a threshold level, such as 10 dB. As explained hereinbelow, because echo canceller 22 can attenuate the echo but not the near-end signal, criterion [1] is fulfilled only when the microphone signal is predominately echo while, at the same time, a large part of this echo is cancelled in the output.

It is noted that all echo cancellers have adaptive filters which attempt to represent the echo transfer function. Unfortunately, they are initialized with some default values and they take some time, after an echo first appears, to converge to their optimal value. During this time they are not effective. Therefore, in accordance with a preferred embodiment of the present invention, detector 50 also enables suppression whenever the adaptive filter (not shown) of echo canceller 22 has not yet converged. For example, suppression can be enabled for the initial 80 samples (10 msec with a sampling rate of 8 kHz) of the adaptation. It may be safely assumed that during this very short initial interval, the near-end signal does not contain near-end speech, because double-talk (two sides talking at the same time) is not likely to occur at the very beginning of a conversation.
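The detector logic described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names, the sum-of-squares energy estimate, the small `eps` guard and the `samples_adapted` counter are assumptions. Only the 10 dB threshold and the 80-sample convergence window come from the text.

```python
import math

ENR_THRES_DB = 10.0       # example threshold from the text ("such as 10 dB")
CONVERGENCE_SAMPLES = 80  # 10 msec at 8 kHz, the example given in the text


def frame_energy(frame):
    """Estimate frame energy as the sum of squared samples (assumed estimator)."""
    return sum(s * s for s in frame)


def energy_ratio_db(frame_in, frame_out, eps=1e-12):
    """Return E_IN / E_OUT in dB; `eps` guards against division by zero."""
    ratio = (frame_energy(frame_in) + eps) / (frame_energy(frame_out) + eps)
    return 10.0 * math.log10(ratio)


def should_suppress(frame_in, frame_out, samples_adapted):
    """Criterion [1], plus unconditional suppression before convergence."""
    if samples_adapted < CONVERGENCE_SAMPLES:
        return True
    return energy_ratio_db(frame_in, frame_out) > ENR_THRES_DB
```

For example, a microphone frame of constant amplitude 1.0 and a canceller output of constant amplitude 0.1 give an energy ratio of 100, i.e. 20 dB, so the frame is flagged for suppression.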

In a digital network with discontinuous transmission (DTX), each frame for which the above criterion is fulfilled is declared by detector 50 to be of type "noise". In addition, the mechanism for updating the comfort noise variables in the encoder is inhibited. Thus, the encoder sends a comfort-noise indication with comfort-noise parameters of previous noise frames that did not contain echo. This is shown in Fig. 4, which illustrates the operation of the telephone in the present invention. Similar steps to those of the prior art have similar reference numerals.

In step 30, echo canceller 22 performs the echo cancellation. Afterward (step 60), residual echo detector 50 utilizes the above criterion to decide whether or not to suppress the echo. If not, the telephone continues as before, performing voice activity detection (step 32) and transmitting an encoded frame (step 38) if there was voice and comfort noise parameters (step 42) if there was no voice. If there is residual echo to be suppressed, encoder 52 first determines (step 62) whether echo canceller 22 has converged. If it has not, then encoder 52 generates (step 64) default comfort noise parameters and transmits them (step 42). If echo canceller 22 has converged, then encoder 52 utilizes the comfort noise parameters most recently calculated and transmits them (step 42).
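The branching of Fig. 4 can be summarized as a small decision function. The sketch below is an assumed rendering of the flow described in the text: the `Action` names and the function signature are hypothetical, and the step numbers in the comments follow the description of Figs. 2 and 4.

```python
from enum import Enum


class Action(Enum):
    SEND_SPEECH_FRAME = "encode and transmit the frame"
    SEND_UPDATED_NOISE = "update comfort noise parameters, then transmit them"
    SEND_DEFAULT_NOISE = "transmit default comfort noise parameters"
    SEND_STORED_NOISE = "transmit the most recently calculated parameters"


def fig4_decision(suppress, converged, speech_detected):
    """Choose what the encoder transmits for one frame."""
    if suppress:                       # step 60: residual echo to suppress
        if not converged:              # step 62: canceller not yet converged
            return Action.SEND_DEFAULT_NOISE   # steps 64, 42
        return Action.SEND_STORED_NOISE        # step 42, no parameter update
    if speech_detected:                # steps 32, 34: voice activity detection
        return Action.SEND_SPEECH_FRAME        # steps 36, 38
    return Action.SEND_UPDATED_NOISE           # steps 40, 42
```

Note that only the last branch updates the comfort noise parameters, so frames flagged as containing residual echo never contribute to the noise estimate.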

It will be appreciated that, by utilizing the most recently calculated comfort noise parameters, the present invention does not update the comfort noise with data from a signal with too much residual echo. This ensures that the comfort noise calculation is based on background noise and not on echoed signals.

It will be appreciated that criterion [1] of the present invention may replace the residual echo by comfort noise also during double-talk, at least during the very low energy, near-end speech frames for which the input signal is dominated by the echo. This means that during double-talk, the very low energy frames of the near-end speech are also suppressed. However, this is barely felt by the far-end listener.

It will further be appreciated that the part of the residual echo that is most annoying is the high-energy, voiced part. In accordance with an alternative preferred embodiment of the present invention operative for digital networks having a variable-rate mode, the high-energy portion can be reduced by transmitting each frame for which the above criterion is fulfilled (i.e. which has residual echo therein) at a low rate, rather than sending comfort noise parameters. For example, with EVRC and QCELP13 vocoders, the low rate is 1/8. With these vocoders, speech encoded at rate 1/8 sounds weaker and more "whisper like". The effect is particularly strong for voiced speech, which is the high-energy part of the speech. Forcing a low rate of 1/8 has a more dramatic effect on the voiced speech than on the unvoiced, and thus, forcing rate 1/8 whenever criterion [1] is fulfilled will result mainly in attenuating the voiced parts of the residual echo.

It will also be appreciated that, in this second embodiment of the present invention, the low rate will be forced both on the residual echo and on the low-energy near-end speech, in the case of double-talk. It will be appreciated that the effect on the residual echo is strong, whereas it is weak for the near-end speech. For the residual echo, the low rate is forced for all of its segments, including the voiced parts. These voiced parts are the most audible and annoying ones and they are the ones that are most effectively suppressed. For the near-end speech, on the other hand, the low rate is forced only on the low energy frames, which are mostly unvoiced, and the effect of forcing the low rate on these frames is small.
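In the variable-rate embodiment, suppression reduces to overriding the rate decision of the vocoder. The sketch below is hypothetical: `select_rate` and its arguments are illustrative names, rates are represented as plain fractions, and no real EVRC or QCELP13 interface is implied.

```python
EIGHTH_RATE = 0.125  # the "rate 1/8" named in the text


def select_rate(vocoder_rate, criterion_fulfilled):
    """Return the rate to encode at, forcing rate 1/8 when criterion [1] holds.

    `vocoder_rate` is the rate the vocoder would have chosen on its own.
    """
    return EIGHTH_RATE if criterion_fulfilled else vocoder_rate
```

If the vocoder would have chosen rate 1/8 anyway, the override changes nothing.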

Explanation of the Suppression Criterion

Suppression occurs when the ratio of the input and output energies of echo canceller 22 is above a threshold level ENR_THRES, or:

$E_{IN} / E_{OUT} > \mathrm{ENR\_THRES}$ [1]

To understand the criterion, consider the following: quantitatively, the microphone signal (which is the input to echo canceller 22) can be described as the sum of echo and the near end signal:

$\mathrm{IN} = \mathrm{NEAR} + \mathrm{ECHO}$ [2]

Echo cancellers in hands-free environments achieve a significant amount of echo attenuation after a very short initial convergence period. Typically, the echo energy is attenuated by at least a factor of 10 after a few milliseconds, where fluctuations in the echo path do not increase the residual echo above this minimum attenuation $\mathrm{EC\_ATT}_{\min}$ of about 10 dB. Denoting the momentary attenuation by EC_ATT, we write:

$\mathrm{OUT} = \mathrm{NEAR} + \mathrm{ECHO} / \mathrm{EC\_ATT}$ [3]

If the threshold ENR_THRES of criterion [1] is equal to the minimum attenuation, $\mathrm{EC\_ATT}_{\min}$, then, approximately, $E_{IN} / E_{OUT}$ exceeds this threshold when both conditions [4] and [5] hereinbelow are satisfied:

$E_{E/A} > E_{NEAR}$ [4]

$\mathrm{EC\_ATT} > \mathrm{EC\_ATT}_{\min}$ [5]

where $E_{E/A}$ is the energy of the attenuated signal ECHO/EC_ATT and $E_{NEAR}$ is the energy in the near-end signal. Condition [4] means that the input signal is predominately echo and condition [5] means that, indeed, the echo was attenuated by at least $\mathrm{EC\_ATT}_{\min}$. Thus, condition [1] indicates a condition where the residual echo might be audible. Condition [1] does not necessarily indicate that the residual echo is audible, because the actual attenuation may be such that:

$\mathrm{ECHO} / \mathrm{EC\_ATT} \ll \mathrm{NEAR}$ [6]

However, experience with EC algorithms for acoustic echo of speech signals shows that a residual echo is almost always audible. Therefore, criterion [1] almost always indicates that a residual echo needs to be suppressed. Moreover, because of the particular suppression method suggested in the present invention, there is almost no price paid for suppressing non-audible residual-echo in the case that condition [6] is fulfilled.
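The behaviour of criterion [1] can be checked numerically with the energy forms of relations [2] and [3]. The sketch below assumes that the near-end signal and the echo are uncorrelated, so that their energies simply add, and it treats `ec_att` as an energy attenuation factor (100 corresponds to 20 dB). The energy values are made-up illustrative numbers, not measurements.

```python
import math

ENR_THRES_DB = 10.0  # example threshold from the text


def ratio_db(e_near, e_echo, ec_att):
    """E_IN / E_OUT in dB, assuming near-end and echo energies add."""
    e_in = e_near + e_echo             # energy form of [2]
    e_out = e_near + e_echo / ec_att   # energy form of [3]
    return 10.0 * math.log10(e_in / e_out)


# Echo-only frame, well-converged canceller: criterion [1] is fulfilled.
echo_only = ratio_db(e_near=0.001, e_echo=1.0, ec_att=100.0)

# Double-talk with near-end speech as strong as the echo: not fulfilled.
double_talk = ratio_db(e_near=1.0, e_echo=1.0, ec_att=100.0)

# Echo-only frame, canceller barely attenuating (condition [5] fails):
# not fulfilled.
unconverged = ratio_db(e_near=0.001, e_echo=1.0, ec_att=2.0)
```

The last case shows why the detector enables suppression separately during the initial convergence period: before the adaptive filter has converged, the energy ratio alone does not flag the echo.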

For a variable-rate vocoder, the suppression consists of merely "forcing" a low rate. Most likely, if the residual echo were indeed very low, a low rate would have been used anyway, so "suppression" in this case has no effect.

In the case of forcing a comfort noise frame, the same thing happens: If the residual echo is small, the VAD would have indicated a "noise" frame, and comfort noise parameters would have been transmitted anyway. However, without "suppression", the current frame would have been used for updating the comfort noise parameters, whereas with "suppression" it will not be used. However, this has a negligible effect on the overall sound quality.

As previously explained, the above is also true if the input signal includes low energy, near-end speech frames. The frames that will be suppressed will be those for which the echo is high and the near-end speech is low. Since, for acoustic echo situations, the average echo level is the same as or lower than the average near-end speech level, the highest-energy near-end speech frames that may be suppressed will be those with energy that is $\mathrm{EC\_ATT}_{\min}$ times less than the average near-end energy (i.e., in the worst case, frames that are typically 10 dB below the average level will be suppressed). Because this is only a rare worst case (after all, most echo frames are below the near-end average level), and because the far-end speaker hears his own speech at the same time, the above distortion of the near-end speech is barely audible to the far-end.

It will be appreciated that criterion [1] is not fulfilled when conditions [4] and [5] are not both satisfied. Condition [5] is not satisfied only when echo canceller 22 has not yet converged. When condition [4] is not satisfied, the noise additionally fully masks the residual echo, leaving the residual echo inaudible. In the latter case, there is no need for suppression. In the former case, there is need for suppression since the echo canceller is not yet fully operative.
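The worst-case figure of about 10 dB quoted above can be checked with a short derivation. It assumes, as a simplification not stated in the text, that the near-end signal and the echo are uncorrelated so that their energies add, and it writes $T$ for ENR_THRES and $A$ for EC_ATT, both as linear energy ratios.

```latex
\begin{align*}
\frac{E_{IN}}{E_{OUT}}
  = \frac{E_{NEAR} + E_{ECHO}}{E_{NEAR} + E_{ECHO}/A} &> T \\
E_{NEAR} + E_{ECHO} &> T\,E_{NEAR} + \frac{T}{A}\,E_{ECHO} \\
E_{NEAR}\,(T - 1) &< E_{ECHO}\left(1 - \frac{T}{A}\right) \\
E_{NEAR} &< E_{ECHO}\,\frac{1 - T/A}{T - 1} \;\le\; \frac{E_{ECHO}}{T - 1}
\end{align*}
```

Under this assumption, the inequality has a solution only when $A > T$, which corresponds to condition [5]. With $T = 10$ (10 dB), a near-end frame can be suppressed only if its energy is below $E_{ECHO}/9$, i.e. roughly 9.5 dB below the echo energy, in line with the approximate 10 dB figure given in the text.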

The methods and apparatus disclosed herein have been described without reference to specific hardware or software. Rather, the methods and apparatus have been described in a manner sufficient to enable persons of ordinary skill in the art to readily adapt commercially available hardware and software as may be needed to reduce any of the embodiments of the present invention to practice without undue experimentation and using conventional techniques.

It will be appreciated by persons skilled in the art that the present invention is not limited by what has been particularly shown and described herein above. Particularly, the method is not limited to a mobile phone, hands-free operation or acoustic echo. Rather the scope of the invention is defined by the claims that follow:

Claims

1. A communication device comprising: an echo canceller adapted to cancel most of an echo of a far end signal found in a microphone signal, said echo canceller providing an echo canceled signal with a residual echo therein; and a residual echo system adapted to suppress said residual echo when said echo canceller cancels a significant amount of echo.

2. A communication device according to claim 1 and wherein said residual echo system includes a residual echo detector adapted to detect when the energy of said microphone signal is significantly higher than the energy of said echo canceled signal.

3. A communication device according to claim 1 and wherein said residual echo system includes a residual echo suppressing encoder adapted to transmit comfort noise when said echo canceller cancels a significant amount of echo.

4. A communication device according to claim 3 and wherein said encoder is adapted to update said comfort noise from background noise but not when said echo canceller cancels a significant amount of echo.

5. A communication device according to claim 3 wherein a significant amount of echo means 10dB of echo.

6. A communication device according to claim 1 and wherein said residual echo system includes a residual echo suppressing encoder adapted to transmit said echo canceled signal at a low rate when said echo canceller cancels a significant amount of echo.

7. A communication device according to claim 1 wherein significantly higher means 10dB.

8. A residual echo system comprising: a residual echo detector adapted to detect when the energy of a microphone signal is significantly higher than the energy of an echo canceled signal and adapted to produce a detection signal in response; and a residual echo suppressing encoder adapted to transmit comfort noise upon receipt of said detection signal.

9. A system according to claim 8 and wherein said encoder is adapted to update said comfort noise from background noise but not when said encoder receives said detection signal.

10. A residual echo system comprising: a residual echo detector adapted to detect when the energy of a microphone signal is significantly higher than the energy of an echo canceled signal and adapted to produce a detection signal in response; and a residual echo suppressing encoder adapted to transmit said echo canceled signal at a low rate upon receipt of said detection signal.

11. An integrated circuit (IC) having a residual echo system, the system comprising: a residual echo detector adapted to detect when the energy of a microphone signal is significantly higher than the energy of an echo canceled signal and adapted to produce a detection signal in response; and a residual echo suppressing encoder adapted to transmit comfort noise upon receipt of said detection signal.

12. An IC according to claim 11 and wherein said encoder is adapted to update said comfort noise from background noise but not when said encoder receives said detection signal.

13. An IC having a residual echo system, the system comprising: a residual echo detector adapted to detect when the energy of a microphone signal is significantly higher than the energy of an echo canceled signal and adapted to produce a detection signal in response; and a residual echo suppressing encoder adapted to transmit said echo canceled signal at a low rate upon receipt of said detection signal.

14. A method comprising: canceling most of an echo of a far end signal found in a microphone signal and providing an echo canceled signal with a residual echo therein; and suppressing said residual echo when a significant amount of echo is canceled in said canceling.

15. A method according to claim 14 and wherein said suppressing includes detecting when the energy of said microphone signal is significantly higher than the energy of said echo canceled signal.

16. A method according to claim 14 and wherein said suppressing includes transmitting comfort noise when a significant amount of echo is cancelled in said canceling.

17. A method according to claim 16 and wherein said transmitting includes updating said comfort noise from background noise but not when a significant amount of echo is cancelled in said canceling.

18. A method according to claim 14 and wherein said suppressing includes transmitting said echo canceled signal at a low rate when a significant amount of echo is cancelled in said canceling.