HK1097080B - Complex signal activity detection for improved speech/noise classification of an audio signal
Description
This application is a divisional of application No. 99813625.5, filed on 12/11/1999 and entitled "Composite signal activity detection for improved speech/noise classification of audio signals". This application claims priority under 35 USC 119(e)(1) from pending provisional application No. US60/109556, filed 11/23/1998.
Technical Field
The present invention relates to audio signal compression, and more particularly to speech/noise classification when compressing audio signals.
Background
Speech encoders and decoders are typically located in radio transmitters and radio receivers, respectively, and they cooperate to permit speech (voice) communication between a given transmitter and receiver over a radio communication link. The combination of a speech encoder and a speech decoder is often referred to as a speech codec. A mobile radiotelephone (e.g., a cellular radiotelephone) is an example of a conventional communication device that typically includes a radio transmitter having a speech encoder and a radio receiver having a speech decoder.
In a conventional block-based speech coder, the incoming speech signal is divided into blocks called frames. For the typical 4 kHz telephony bandwidth, the frame length is typically 20 ms, or 160 samples. A frame may be further divided into subframes, typically 5 ms or 40 samples in length.
In compressing an incoming audio signal, a speech encoder typically uses advanced lossy compression techniques. The decoder then attempts to reproduce the input audio signal from the received compressed signal information. If certain characteristics of the incoming audio signal are known, the bit rate in the channel can be kept as low as possible. If the audio signal contains information that is relevant to the listener, this information should be retained. However, if the audio signal contains only non-relevant information (such as background noise), bandwidth may be saved by transmitting only a limited amount of information about the signal. For many signals containing only non-relevant information, a very low bit rate can often provide high-quality compression. In extreme cases, the input signal may be synthesized in the decoder without any information updates over the channel until the input audio signal is again determined to include relevant information.
More complex non-speech signals, such as music or a combination of speech and music, require a higher bit rate to be reproduced accurately by the decoder.
Current mobile systems take advantage of the low bit rate needed for background noise by adjusting the transmitted bit rate downward while only background noise is present.
In a conventional Discontinuous Transmission (DTX) scheme, the transmitter stops transmitting encoded speech frames when the speaker pauses. At regular or irregular intervals (e.g., every 100 ms to 500 ms), the transmitter sends parameters suitable for generating conventional comfort noise in the decoder; these parameters are carried in Silence Descriptor (SID) frames. At the receiver, the decoder synthesizes artificial noise from the comfort noise parameters received in the SID frames, using a conventional Comfort Noise Injection (CNI) algorithm.
When comfort noise is generated in the decoder in a conventional DTX system, the noise is often perceived as very static and quite different from the background noise generated in active (non-DTX) mode. The reason for this perception is that DTX SID frames are not transmitted to the receiver as often as normal speech frames. In conventional linear predictive analysis-by-synthesis (LPAS) codecs with a DTX mode, the spectrum and energy of the background noise are typically estimated (e.g., averaged) over several frames, and the estimated parameters are then quantized and transmitted to the decoder in SID frames.
The benefit of transmitting SID frames at a low update rate instead of regular speech frames is twofold: battery life in a mobile radio transceiver is extended due to lower power consumption, and the interference caused by the transmitter is reduced, which increases system capacity.
If a composite signal, such as music, is compressed using a compression model that is too simple and a correspondingly low bit rate, the signal reproduced in the decoder differs significantly from the result obtained using a better (higher quality) compression technique. Such an overly simple model may be selected when the composite signal is misclassified as noise. When such misclassifications occur, not only does the decoder output a poorly reproduced signal, but the misclassification itself disadvantageously results in a switch from a higher quality compression scheme to a lower quality compression scheme. To correct the misclassification, a switch back to the higher quality scheme is required. If such switching between compression schemes occurs frequently, it is typically very audible and can be irritating to the listener.
From the foregoing, it is desirable to reduce the misclassification of subjectively relevant signals while still maintaining a low bit rate (high compression) where appropriate, for example when compressing background noise while the speaker is silent. The use of comfort noise parameters in DTX systems as described above is an example of a strong compression technique, as is conventional low-rate Linear Predictive Coding (LPC) using random excitation.
Conventional classification techniques for determining whether an input audio signal contains relevant information rely primarily on a relatively simple stationarity analysis of the input audio signal. If the input signal is determined to be stationary, it is assumed to be a noise-like signal. However, such stationarity analysis alone can cause composite signals that are fairly stationary but actually contain perceptually relevant information to be misclassified as noise. Disadvantageously, such misclassifications can cause the problems described above.
Disclosure of Invention
There is therefore a need for a classification technique that reliably detects the presence of perceptually relevant information within a composite signal of the type described above.
The present invention provides composite signal activity detection that reliably detects composite non-speech signals that include relevant information that is perceptually important to a listener. Examples of composite non-speech signals that can be reliably detected include music, music on hold, combinations of speech and music, music in the background, and other tonal or harmonic sounds.
Drawings
FIG. 1 schematically illustrates relevant parts of an exemplary speech encoding device according to the present invention;
FIG. 2 illustrates an exemplary embodiment of the composite signal activity detector of FIG. 1;
FIG. 3 illustrates an exemplary embodiment of the voice activity detector of FIG. 1;
FIG. 4 illustrates an exemplary embodiment of the release delay logic block of FIG. 1;
FIG. 5 illustrates an exemplary operational flow of the parameter generator of FIG. 2;
FIG. 6 illustrates an exemplary operational flow of the counter controller of FIG. 2;
FIG. 7 illustrates an exemplary operational flow of a portion of FIG. 2;
FIG. 8 illustrates an exemplary operational flow of another portion of FIG. 2;
FIG. 9 illustrates an exemplary operational flow of a portion of FIG. 3;
FIG. 10 illustrates an exemplary operational flow of the counter controller of FIG. 3;
FIG. 11 illustrates an exemplary operational flow of another portion of FIG. 3;
FIG. 12 illustrates an exemplary operational procedure that may be performed by the embodiments of FIGS. 1-11;
FIG. 13 illustrates another embodiment of the composite signal activity detector of FIG. 2.
Detailed Description
FIG. 1 schematically illustrates relevant parts of an exemplary speech encoding device according to the present invention. The speech encoding device may be provided, for example, in a radio transceiver that communicates audio information over a radio communication channel.
In FIG. 1, an input audio signal is input to a composite signal activity detector (CAD) and to a voice activity detector (VAD). The CAD is responsive to the audio input signal and performs a relevancy analysis to determine whether the input signal includes information that is perceptually relevant to the listener, and outputs a set of signal relevancy parameters to the VAD. The VAD uses these parameters together with the audio input signal to classify the signal as speech or noise, and provides a speech/noise indication as its output. In response to the speech/noise indication and the input audio signal, the CAD generates a set of composite signal flags for output to the release delay logic block, which also receives as an input the speech/noise indication generated by the VAD.
The output of the release delay logic block may suitably be used to control DTX operation (in a DTX system) or the bit rate (in a variable rate (VR) encoder). For example, if the output of the release delay logic block indicates that the input audio signal does not contain relevant information, comfort noise may be generated (in a DTX system) or the bit rate may be reduced (in a VR encoder).
The input signal (which may be preprocessed) is analyzed in the CAD by extracting, for each frame of the signal, information that is relevant to a particular frequency band. This may be done by first filtering the signal with a suitable filter, which may be a band-pass filter or a high-pass filter. The filter weights the frequency bands that contain most of the energy of interest for the analysis. In order to reduce strong low-frequency content such as car noise, the low-frequency region should generally be filtered out. The filtered signal is passed to an open-loop Long Term Prediction (LTP) correlation analysis. The result provided by the LTP analysis is a vector of correlation values or normalized gain values, one value per correlation shift. For example, as in conventional LTP analysis, the shift range may be [20, 147]. Another, low-complexity method of obtaining the required relevancy detection is to use the unfiltered signal in the correlation calculation and to modify the correlation values by an algorithmically similar "filtering" process, as described in detail below.
For each analysis frame, the normalized correlation value (gain value) having the largest magnitude is selected and buffered. The shift (the LTP lag corresponding to the selected correlation value) is not used. The buffered values are further analyzed to derive a vector of signal relevancy parameters, which is fed to the VAD for use in the background noise estimation process. The buffered correlation values are also processed and used to reach a final decision as to whether the signal is relevant (i.e., of perceptual importance) and whether the VAD decision is reliable. The flags VAD_fail_long and VAD_fail_short indicate when the VAD is likely to make a severe misclassification, i.e., a noise classification when perceptually relevant information is actually present.
The VAD scheme attempts to determine whether the signal is a speech signal (possibly degraded by ambient noise) or a noise signal. The signal relevancy parameters from the CAD are used to determine to what extent the VAD background noise estimate and activity signal estimate are updated.
If the VAD is deemed reliable, the release delay logic adjusts the final decision for the signal using previous information about the relevancy of the signal and previous VAD decisions. The output of the release delay logic is the final decision as to whether the signal is relevant or non-relevant. In the non-relevant case, a low bit rate may be used for encoding. In DTX systems, the relevant/non-relevant information is used to determine whether the current frame should be encoded in the normal manner (relevant) or with comfort noise parameters (non-relevant).
In an exemplary embodiment, the CAD is implemented efficiently and with low complexity in a speech coder that uses a linear predictive analysis-by-synthesis (LPAS) architecture. The signal input to the speech coder is conditioned by conventional means (high-pass filtered, scaled, etc.). The conditioned signal s(n) is then filtered by the conventional adaptive noise weighting filter used by the LPAS encoder. The weighted speech signal sw(n) is then passed to an open-loop LTP analysis, which calculates and stores a correlation value for each shift in the range [Lmin, Lmax], where the end values of the range may be, for example, Lmin = 18 and Lmax = 147. For each lag value (shift) l within this range, the correlation Rxx(k, l) for lag value l can be calculated by the following equation:
$$R_{xx}(k, l) = \sum_{n=0}^{K-1} s_w(n-k)\, s_w(n-l) \qquad \text{(Equation 1)}$$
where K is the length of the analysis frame. If the value of k is set to 0, the correlation becomes a function of the lag value l only:
$$R_{xx}(l) = R_{xx}(0, l) \qquad \text{(Equation 2)}$$
The energy can also be defined as:
$$E_{xx}(L) = R_{xx}(L, L) \qquad \text{(Equation 3)}$$
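The correlation and energy terms can be illustrated with a short Python sketch. It assumes that the correlation is the sum, over the K samples of the analysis frame, of sw(n-k) · sw(n-l), and that past samples are kept in a history buffer in front of the frame. The function names, the buffer layout and the toy signal are illustrative assumptions, not part of the described encoder.

```python
from typing import Sequence


def rxx(s: Sequence[float], hist: int, frame_len: int, k: int, l: int) -> float:
    """Correlation Rxx(k, l): sum over n = 0..K-1 of sw(n-k) * sw(n-l).

    Assumed layout: `s` holds `hist` past samples followed by the current
    frame, so sw(n) is stored at s[hist + n]. Requires max(k, l) <= hist.
    """
    return sum(s[hist + n - k] * s[hist + n - l] for n in range(frame_len))


def exx(s: Sequence[float], hist: int, frame_len: int, lag: int) -> float:
    """Energy Exx(L) = Rxx(L, L) of the frame delayed by `lag` samples."""
    return rxx(s, hist, frame_len, lag, lag)


# Toy data (assumption): a period-4 signal, 8 history samples, 4-sample frame.
signal = [1.0, 0.0, -1.0, 0.0] * 3
HIST, K = 8, 4

print("Rxx(0, 4) =", rxx(signal, HIST, K, 0, 4))  # lag equal to the period
print("Rxx(0, 2) =", rxx(signal, HIST, K, 0, 2))  # lag of half a period
print("Exx(4)    =", exx(signal, HIST, K, 4))
```

For this periodic toy signal the correlation is positive at a lag of one period and negative at half a period, which is the behavior the LTP analysis relies on.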
These calculations are conventionally performed as a pre-search for the adaptive codebook search in the LPAS encoder, and therefore add no extra computational cost.
The optimal gain factor g_opt of a single-tap predictor is obtained by minimizing the distortion D in the following equation, where the sum is taken over the K samples of the analysis frame:
$$D = \sum_{n=0}^{K-1} \left( s_w(n) - g\, s_w(n-l) \right)^2 \qquad \text{(Equation 4)}$$
The optimal gain factor g_opt (in effect the normalized correlation value) is the value of g that minimizes D in Equation 4, as given by the following equation:
$$g_{opt} = \frac{R_{xx}(L)}{E_{xx}(L)} \qquad \text{(Equation 5)}$$
where L is the lag value that minimizes the distortion D (Equation 4) and Exx(L) is the energy. The composite signal detector calculates the optimal gain factor (g_opt) for a high-pass filtered version of the weighted signal sw. In one embodiment, the weighted signal is not high-pass filtered prior to the correlation calculation; instead, a simplified formula is used that minimizes D (Equation 4) for the filtered signal sw_f(n).
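As a minimal sketch of the single-tap predictor gain, the following Python code computes the gain Rxx(L)/Exx(L) for each lag, evaluates the squared-error distortion D, and then picks the gain with the largest magnitude over a lag range, as the CAD is described as doing. The helper names, the buffer layout (history samples stored in front of the frame) and the toy signal are assumptions for illustration only.

```python
def optimal_gain(s, hist, frame_len, lag):
    """Gain g = Rxx(lag) / Exx(lag) for one lag; 0.0 if the energy is zero."""
    num = sum(s[hist + n] * s[hist + n - lag] for n in range(frame_len))
    den = sum(s[hist + n - lag] ** 2 for n in range(frame_len))
    return num / den if den > 0.0 else 0.0


def distortion(s, hist, frame_len, lag, g):
    """Squared prediction error D for a given lag and gain g."""
    return sum((s[hist + n] - g * s[hist + n - lag]) ** 2 for n in range(frame_len))


def max_magnitude_gain(s, hist, frame_len, l_min, l_max):
    """Gain with the largest magnitude over all lags in [l_min, l_max]."""
    gains = [optimal_gain(s, hist, frame_len, lag) for lag in range(l_min, l_max + 1)]
    return max(gains, key=abs)


# Toy data (assumption): period-4 signal, 8 history samples, 4-sample frame.
signal = [1.0, 0.5, -1.0, -0.5] * 3
HIST, K = 8, 4

g4 = optimal_gain(signal, HIST, K, 4)
print("gain at lag 4:", g4)
print("D at optimal gain:", distortion(signal, HIST, K, 4, g4))
print("D at a nearby gain:", distortion(signal, HIST, K, 4, 0.9))
print("largest-magnitude gain, lags 1..4:", max_magnitude_gain(signal, HIST, K, 1, 4))
```

Comparing the distortion at the computed gain with the distortion at a nearby gain gives a quick numerical check that the ratio Rxx(L)/Exx(L) is indeed the minimizer for that lag.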
The high-pass filtered signal sw_f(n) is determined using the following equation:
$$s_{w\_f}(n) = h_0\, s_w(n) + h_1\, s_w(n-1) \qquad \text{(Equation 7)}$$
In this case, g_max (the g_opt of the filtered signal) can be obtained by the following equation:
Equation 8
This allows the parameter g_max to be calculated according to Equation 8 using the Rxx and Exx values that have previously been derived from the unfiltered signal sw, without calculating new Rxx values for the filtered signal sw_f.
If the filter coefficients [h0, h1] are chosen to be [1, -1] and the denominator normalization lag Lden is set to 0, then the g_max calculation can be simplified to:
Equation 9
The above equation can be further simplified by setting the denominator lag Lden in Equation 8 to (Lmin + 1) (rather than the optimal value L_opt, i.e., the optimal lag value in Equation 4), and by limiting the maximum search to a maximum lag of Lmax - 1 and a minimum lag of (Lmin + 1). In this case, no additional correlation calculations are required beyond the Rxx(l) values already derived from the open-loop LTP analysis.
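The principle of reusing correlations of the unfiltered signal can be checked numerically. The Python sketch below expands the normalized correlation of the first-order filtered signal sw_f(n) = h0 · sw(n) + h1 · sw(n-1) into two-argument correlations Rxx(k, l) of the unfiltered signal and compares the result with a direct computation on the filtered signal. This is an exact algebraic expansion written for illustration; it is not claimed to be identical to Equation 8, which additionally uses a separate denominator lag Lden. All function names, the buffer layout and the test signal are assumptions.

```python
import math


def rxx(s, hist, frame_len, k, l):
    """Rxx(k, l) = sum over n = 0..K-1 of sw(n-k) * sw(n-l), sw(n) at s[hist + n]."""
    return sum(s[hist + n - k] * s[hist + n - l] for n in range(frame_len))


def gain_from_unfiltered(s, hist, frame_len, lag, h0, h1):
    """Gain of the filtered signal, built only from unfiltered correlations."""
    def r(k, l):
        return rxx(s, hist, frame_len, k, l)

    num = (h0 * h0 * r(0, lag)
           + h0 * h1 * (r(0, lag + 1) + r(1, lag))
           + h1 * h1 * r(1, lag + 1))
    den = (h0 * h0 * r(lag, lag)
           + 2.0 * h0 * h1 * r(lag, lag + 1)
           + h1 * h1 * r(lag + 1, lag + 1))
    return num / den


def gain_direct(s, hist, frame_len, lag, h0, h1):
    """Reference: filter first (Equation 7), then correlate the filtered signal."""
    def sw_f(n):
        return h0 * s[hist + n] + h1 * s[hist + n - 1]

    num = sum(sw_f(n) * sw_f(n - lag) for n in range(frame_len))
    den = sum(sw_f(n - lag) ** 2 for n in range(frame_len))
    return num / den


# Illustrative signal (assumption): two sinusoids; 10 history samples, 20-sample frame.
s = [math.sin(0.3 * i) + 0.5 * math.sin(1.1 * i) for i in range(40)]
a = gain_from_unfiltered(s, 10, 20, 5, 1.0, -1.0)
b = gain_direct(s, 10, 20, 5, 1.0, -1.0)
print(a, b)
```

The history buffer must hold at least lag + 1 past samples, because the filter reaches one sample further back than the lag itself.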
For each frame, the gain value g_max with the largest magnitude is stored. A smoothed version g_f(i) can be derived by filtering the g_max value obtained for each frame according to g_f(i) = b0 · g_max(i) - a1 · g_f(i-1). The filter coefficients b0 and a1 may be expressed as functions of time, g_max(i) and g_f(i-1), i.e., b0 = fb(t, g_max(i), g_f(i-1)) and a1 = fa(t, g_max(i), g_f(i-1)).
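A minimal sketch of the smoothing step, assuming the first-order recursion g_f(i) = b0 · g_max(i) - a1 · g_f(i-1) with constant coefficients. The values b0 = 0.1 and a1 = -0.9 and the zero initial state are illustrative assumptions; the text allows the coefficients to vary with time and state, which this sketch does not model.

```python
def smooth_gain(g_max_values, b0=0.1, a1=-0.9, g_f_init=0.0):
    """Return the smoothed sequence g_f(i) for a sequence of per-frame g_max(i).

    Assumed recursion: g_f(i) = b0 * g_max(i) - a1 * g_f(i-1).
    With b0 = 0.1 and a1 = -0.9 this is a unity-gain one-pole low-pass filter.
    """
    g_f = g_f_init
    smoothed = []
    for g_max in g_max_values:
        g_f = b0 * g_max - a1 * g_f
        smoothed.append(g_f)
    return smoothed


# A step in g_max from 0 to 1: g_f rises gradually toward 1.
values = smooth_gain([1.0] * 200)
print(values[0], values[9], values[-1])
```

Because of the smoothing, a single frame with a high g_max does not push g_f(i) over a detection threshold; the correlation has to persist over several frames.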
By analyzing the state and history of g_f(i), assistance can be provided to the VAD adaptation, and operation indications can be provided to the release delay logic block.
FIG. 2 shows an exemplary embodiment of the composite signal activity detector (CAD) of FIG. 1 described above. The preprocessing section 21 preprocesses the input signal, producing the aforementioned weighted signal sw(n). The signal sw(n) is output to a conventional correlation analyzer 23, which may be, for example, an open-loop Long Term Prediction (LTP) correlation analyzer. The output 22 of the correlation analyzer 23 is conventionally used as an input to an adaptive codebook search 24. As described above, the Rxx and Exx values available in the conventional correlation analyzer 23 are used according to the present invention to calculate g_f(i).
The Rxx and Exx values are input at 25 to a maximum normalized gain calculator 20, which calculates g_max values as described above. The calculator 20 selects the maximum-magnitude g_max value for each frame and stores it in the buffer 26. The buffered values are output to the smoothing filter 27 described above, whose output is g_f(i).
The signal g_f(i) is input to a parameter generator 28. In response to g_f(i), the parameter generator 28 generates a pair of outputs, complex_high and complex_low, which are supplied as signal relevancy parameters to the VAD (see FIG. 1). The parameter generator 28 also generates an output complex_timer, which is input to the counter controller 29 that controls the counter 201. The output of the counter 201, complex_hang_count, is fed to the VAD as a signal relevancy parameter and also to the comparator 203. The output of the comparator 203, VAD_fail_long, is a composite signal flag that is supplied to the release delay logic (see FIG. 1). The signal g_f(i) is also fed to a further comparator 205, whose output 208 is coupled to one input of the AND gate 207.
The composite signal activity detector of FIG. 2 also receives the speech/noise indication from the VAD (see FIG. 1), i.e., the signal sp_vad_prim (e.g., 0 for noise and 1 for speech). This signal is input to the buffer 202, whose output is coupled to the comparator 204. The output 206 of the comparator 204 is coupled to the other input of the AND gate 207. The output of the AND gate 207 is the composite signal flag VAD_fail_short, which is input to the release delay logic block of FIG. 1.
FIG. 13 shows an alternative to the arrangement of FIG. 2, in which the correlation analyzer 23 calculates the g_opt value of Equation 5 from a high-pass filtered version of sw(n), namely the output sw_f(n) of the high-pass filter 131. The maximum-magnitude g_opt value of each frame is then buffered in the buffer 26 of FIG. 2 instead of g_max. The correlation analyzer 23 also receives the signal sw(n) and produces the conventional output 22 as shown in FIG. 2.
FIG. 3 illustrates relevant portions of an exemplary embodiment of the VAD of FIG. 1. As described above with respect to FIG. 2, the VAD receives the signal relevancy parameters complex_high, complex_low and complex_hang_count from the CAD. The parameters complex_high and complex_low are input to respective buffers 30 and 31, whose outputs are input to comparators 32 and 33, respectively. The outputs of the comparators 32 and 33 are provided as respective inputs to the OR gate 34, which outputs a complex_warning signal to the counter controller 35. The counter controller 35 controls the counter 36 in response to the complex_warning signal.
The audio input signal is coupled to an input of the noise estimator 38 and also to an input of the speech/noise determiner 39. As is conventional, the speech/noise determiner 39 also receives background noise estimate information 303 from the noise estimator 38. The speech/noise determiner 39 is responsive to the input audio signal and the noise estimate information 303 to generate the speech/noise indication sp_vad_prim, which is output to the release delay logic and the CAD of FIG. 1.
The signal complex_hang_count is input to a comparator 37, whose output is coupled to a DOWN input of the noise estimator 38. When the DOWN input is activated, the noise estimator 38 is only permitted to update its noise estimate downward or leave it unchanged. In other embodiments, activating the DOWN input may allow the noise estimator to update its estimate upward to indicate more noise, but requires that the speed (strength) of the update be significantly reduced.
The noise estimator 38 also has a DELAY input coupled to an output signal generated by the counter 36, called stat_count. Noise estimators in conventional VADs typically apply a delay period after receiving an indication that the input signal is, for example, non-stationary or tonal, and do not raise the noise estimate during that period. Once the delay period expires, the noise estimate may again be updated upward, so that the overall VAD algorithm does not lock onto an activity indication if the noise level suddenly increases.
According to the present invention, the DELAY input is driven by stat_count to set a lower limit on the aforementioned delay period of the noise estimator (i.e., to require a longer delay than would conventionally be required) when the signal appears too relevant to allow the noise estimate to grow "quickly". The stat_count signal may delay the increase of the noise estimate for a relatively long period of time (e.g., 5 seconds) if the CAD has detected very high relevancy over a relatively long period of time (e.g., 2 seconds).
The speech/noise determiner 39 has an output 301 coupled to an input of the counter controller 35 and also coupled to an input of the noise estimator 38, the latter coupling being conventional. When the speech/noise determiner determines that a given frame of the audio input signal is, for example, a tonal signal or a non-stationary signal, the output 301 indicates this to the counter controller 35, which in turn sets the output stat_count of the counter 36 to a desired value. If the output 301 indicates a stationary signal, the controller 35 may decrement the counter 36.
FIG. 4 illustrates an exemplary embodiment of the release delay logic block of FIG. 1. In FIG. 4, the composite signal flags VAD_fail_short and VAD_fail_long are input to an OR gate 41, whose output serves as one input to another OR gate 43. The speech/noise indication sp_vad_prim from the VAD is input to a conventional VAD release delay logic block 45, whose output is provided as the second input to the OR gate 43. If either of the composite signal flags VAD_fail_short or VAD_fail_long is active, the output of the OR gate 41 causes the OR gate 43 to indicate that the input signal is relevant.
If neither of the composite signal flags is active, the speech/noise decision of the VAD release delay logic 45, namely the signal sp_vad, constitutes the relevant/non-relevant indication. If sp_vad is active, indicating speech, the output of the OR gate 43 indicates that the signal is relevant. Otherwise, if sp_vad is inactive, indicating noise, the output of the OR gate 43 indicates that the signal is non-relevant. The relevant/non-relevant indication from the OR gate 43 may be output, for example, to the DTX control portion of a DTX system or to the bit rate control portion of a VR system.
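The gate arrangement of FIG. 4 can be summarized in a few lines of Python. This is only a sketch of the combinational part: sp_vad is taken as an already computed input, because the internals of the conventional VAD release delay logic 45 are not described here. The function name is an illustrative assumption.

```python
def relevant_indication(vad_fail_short: bool, vad_fail_long: bool, sp_vad: bool) -> bool:
    """Return True if the frame is to be treated as relevant.

    OR gate 41 combines the two composite signal flags; OR gate 43 combines
    that result with the speech/noise decision sp_vad of logic block 45.
    """
    any_flag_active = vad_fail_short or vad_fail_long  # OR gate 41
    return any_flag_active or sp_vad                   # OR gate 43


# Print the full truth table.
for short in (False, True):
    for long_ in (False, True):
        for sp in (False, True):
            print(short, long_, sp, "->", relevant_indication(short, long_, sp))
```

The only combination that yields a non-relevant decision is the one in which both flags are inactive and sp_vad indicates noise, so an active flag always overrides a noise classification.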
FIG. 5 illustrates an exemplary operational flow of the parameter generator 28 of FIG. 2, which generates the complex_high, complex_low and complex_timer signals. The index i in FIG. 5 (and in FIGS. 6-11) denotes the current frame of the audio input signal. As shown in FIG. 5, each of these signals is set to zero if the signal g_f(i) is not greater than the corresponding threshold, namely TH_h for the complex_high signal in steps 51 and 52, TH_l for the complex_low signal in steps 54 and 55, and TH_t for the complex_timer signal in steps 57 and 58. If the signal g_f(i) is greater than the threshold TH_h in step 51, the signal complex_high is set to 1 in step 53. If g_f(i) is greater than the threshold TH_l in step 54, the signal complex_low is set to 1 in step 56. If g_f(i) is greater than the threshold TH_t in step 57, the value of complex_timer is incremented by 1 in step 59. Exemplary thresholds in FIG. 5 include TH_h = 0.6, TH_l = 0.5 and TH_t = 0.7. As can be seen from FIG. 5, complex_timer represents the number of consecutive frames for which g_f(i) is greater than the threshold TH_t.
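The per-frame flow of FIG. 5 can be sketched in Python as follows, using the exemplary thresholds 0.6, 0.5 and 0.7. The class name and method name are illustrative assumptions; only the threshold comparisons and the reset/increment behavior of complex_timer are taken from the description.

```python
from dataclasses import dataclass

TH_H = 0.6  # threshold for complex_high
TH_L = 0.5  # threshold for complex_low
TH_T = 0.7  # threshold for complex_timer


@dataclass
class ParameterGenerator:
    """Holds complex_timer, the only state carried from frame to frame."""
    complex_timer: int = 0

    def update(self, g_f: float):
        """Process one frame; return (complex_high, complex_low, complex_timer)."""
        complex_high = 1 if g_f > TH_H else 0      # steps 51-53
        complex_low = 1 if g_f > TH_L else 0       # steps 54-56
        if g_f > TH_T:                             # step 57
            self.complex_timer += 1                # step 59
        else:
            self.complex_timer = 0                 # step 58
        return complex_high, complex_low, self.complex_timer


gen = ParameterGenerator()
for g in (0.8, 0.75, 0.55, 0.4):
    print(g, gen.update(g))
```

Because complex_timer is reset as soon as g_f(i) falls to or below TH_t, it counts consecutive frames only, which matches the interpretation given at the end of the paragraph above.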
FIG. 6 illustrates an exemplary operational flow of the counter controller 29 and the counter 201 of FIG. 2. If complex_timer is greater than the threshold TH_ct in step 61, the counter controller 29 sets the output signal complex_hang_count of the counter 201 to a value H in step 62. If complex_timer is not greater than the threshold TH_ct in step 61 but is greater than 0 in step 63, the counter controller 29 decrements the output signal complex_hang_count of the counter 201 by 1 in step 64. Exemplary values in FIG. 6 include TH_ct = 100 (corresponding to 2 seconds in one embodiment) and H = 250 (corresponding to 5 seconds in one embodiment).
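A minimal Python sketch of the counter control of FIG. 6, using the exemplary values TH_ct = 100 and H = 250. The description leaves open which quantity is compared with 0 in step 63; this sketch assumes it is complex_hang_count, so that the counter behaves as a hangover that runs down once the long run of high correlation has ended. That reading, and the function name, are assumptions.

```python
TH_CT = 100  # frames; 2 seconds at 20 ms per frame
H = 250      # frames; 5 seconds at 20 ms per frame


def update_hang_count(complex_timer: int, complex_hang_count: int) -> int:
    """Return the new complex_hang_count for the current frame."""
    if complex_timer > TH_CT:        # step 61
        return H                     # step 62
    # Assumption: step 63 tests the counter itself, not complex_timer.
    if complex_hang_count > 0:       # step 63
        return complex_hang_count - 1  # step 64
    return complex_hang_count


# Simulate: 101 high-correlation frames followed by frames without correlation.
count = 0
for timer in list(range(1, 102)) + [0, 0, 0]:
    count = update_hang_count(timer, count)
print(count)
```

With these values, more than 2 seconds of sustained high correlation arms a 5-second hangover, which agrees with the time spans quoted for the embodiment.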
FIG. 7 shows an exemplary operational flow of the comparator 203 of FIG. 2. If complex_hang_count is greater than TH_hc in step 71, VAD_fail_long is set to 1 in step 72; otherwise VAD_fail_long is set to 0 in step 73. In one embodiment, TH_hc = 0.
FIG. 8 shows an exemplary operational flow of the buffer 202, the comparators 204 and 205, and the AND gate 207 of FIG. 2. As shown in FIG. 8, if the last p values of sp_vad_prim immediately preceding the current (i-th) value of sp_vad_prim are all equal to 0 in step 81, and if the signal g_f(i) is greater than the threshold TH_fs in step 82, VAD_fail_short is set to 1 in step 83; otherwise VAD_fail_short is set to 0 in step 84. Exemplary values in FIG. 8 include TH_fs = 0.55 and p = 10.
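The flow of FIG. 8 can be sketched in Python with a fixed-length history of previous sp_vad_prim values. The exemplary values TH_fs = 0.55 and p = 10 are used. The class name is illustrative, and the start-up behavior (no flag until p previous values have been collected) is an assumption not stated in the description.

```python
from collections import deque

TH_FS = 0.55  # threshold on g_f(i)
P = 10        # number of preceding sp_vad_prim values examined


class ShortFailDetector:
    def __init__(self):
        # Buffer 202: the P values of sp_vad_prim preceding the current frame.
        self.history = deque(maxlen=P)

    def update(self, sp_vad_prim: int, g_f: float) -> int:
        """Return VAD_fail_short for the current frame."""
        # Step 81 (comparator 204): were the last P decisions all noise?
        # Assumption: the flag stays 0 until the buffer has been filled.
        all_noise = len(self.history) == P and all(v == 0 for v in self.history)
        # Step 82 (comparator 205) combined by AND gate 207.
        flag = 1 if (all_noise and g_f > TH_FS) else 0
        self.history.append(sp_vad_prim)
        return flag


det = ShortFailDetector()
flags = [det.update(0, 0.2) for _ in range(10)]  # ten noise frames, low correlation
flags.append(det.update(0, 0.6))                 # noise decision but high correlation
print(flags)
```

The flag therefore fires only when the VAD has been reporting noise for a while and the smoothed correlation is nevertheless high, which is the situation in which a VAD misclassification is suspected.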
FIG. 9 illustrates an exemplary operational flow of the buffers 30 and 31, the comparators 32 and 33, and the OR gate 34 of FIG. 3. If the last m values of complex_high immediately preceding the current (i-th) value of complex_high are all equal to 1 in step 91, or if the last n values of complex_low immediately preceding the current (i-th) value of complex_low are all equal to 1 in step 92, complex_warning is set to 1 in step 93. Otherwise, complex_warning is set to 0 in step 94. Exemplary values in FIG. 9 include m = 8 and n = 15.
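A Python sketch of the complex_warning logic of FIG. 9, with m = 8 and n = 15. It assumes that the warning is raised when the buffered values are all equal to 1, which is consistent with the later summary that complex_high and complex_low take effect when g_f(i) stays above a threshold for a number of consecutive frames. The class name and the start-up behavior (no warning until a buffer is full) are also assumptions.

```python
from collections import deque

M = 8   # number of preceding complex_high values examined
N = 15  # number of preceding complex_low values examined


class ComplexWarning:
    def __init__(self):
        self.high = deque(maxlen=M)  # buffer 30
        self.low = deque(maxlen=N)   # buffer 31

    def update(self, complex_high: int, complex_low: int) -> int:
        """Return complex_warning for the current frame."""
        # Assumption: a run of 1s (sustained high correlation) triggers the warning.
        high_run = len(self.high) == M and all(v == 1 for v in self.high)  # step 91
        low_run = len(self.low) == N and all(v == 1 for v in self.low)     # step 92
        warning = 1 if (high_run or low_run) else 0                        # OR gate 34
        self.high.append(complex_high)
        self.low.append(complex_low)
        return warning


cw = ComplexWarning()
print([cw.update(1, 1) for _ in range(10)])
```

The shorter run length for the higher threshold means that strong correlation raises the warning sooner than moderate correlation does.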
FIG. 10 shows an exemplary operational flow of the counter controller 35 and the counter 36 of FIG. 3. If the audio signal is stationary as indicated in step 100 (see 301 of FIG. 3), stat_count is decremented in step 104. Then, if complex_warning = 1 in step 101 and stat_count is less than a value MIN in step 102, stat_count is set to MIN in step 103. If the audio signal is not stationary in step 100, stat_count is set to a value A in step 105. In one embodiment, exemplary values of MIN and A are 5 and 20, respectively, which result in lower limits of 100 ms and 400 ms, respectively, on the delay of the noise estimator 38 (FIG. 3).
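The stat_count control of FIG. 10 can be sketched as a single Python function using the exemplary values MIN = 5 and A = 20. The function name is illustrative, and stopping the decrement at zero is an assumption, since the description does not say what happens when the counter reaches zero.

```python
MIN = 5   # lower limit applied while complex_warning is set (5 frames = 100 ms)
A = 20    # value loaded for a non-stationary frame (20 frames = 400 ms)


def update_stat_count(stat_count: int, stationary: bool, complex_warning: int) -> int:
    """Return the new stat_count for the current frame."""
    if not stationary:                       # step 100, "no" branch
        return A                             # step 105
    # Step 104. Assumption: the counter is not decremented below zero.
    stat_count = max(stat_count - 1, 0)
    if complex_warning == 1 and stat_count < MIN:  # steps 101 and 102
        stat_count = MIN                     # step 103
    return stat_count


# A stationary but strongly correlated signal: the count never drops below MIN.
count = A
for _ in range(30):
    count = update_stat_count(count, stationary=True, complex_warning=1)
print(count)
```

As long as complex_warning stays set, the delay seen by the noise estimator cannot fall below MIN frames, even for a long stationary stretch.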
FIG. 11 illustrates an exemplary operational flow of the comparator 37 and the noise estimator 38 of FIG. 3. If complex_hang_count is greater than the threshold TH_hc1 in step 111, the comparator 37 activates the DOWN input of the noise estimator 38 in step 112, so that the noise estimator 38 is only allowed to update its noise estimate downward (or leave it unchanged). If complex_hang_count is not greater than the threshold TH_hc1 in step 111, the DOWN input of the noise estimator 38 is inactive, so that the noise estimator 38 is allowed to update its noise estimate downward or upward in step 113. In one embodiment, TH_hc1 = 0.
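The effect of the DOWN input can be sketched in Python as a gate on a proposed noise estimate. How the noise estimator 38 arrives at its proposed value is not described here, so the proposed estimate is simply passed in; the function name and the scalar representation of the estimate are assumptions.

```python
TH_HC1 = 0  # threshold on complex_hang_count


def gated_noise_update(previous: float, proposed: float, complex_hang_count: int) -> float:
    """Return the noise estimate after applying the DOWN restriction."""
    down_active = complex_hang_count > TH_HC1   # comparator 37, step 111
    if down_active:
        # Step 112: only a lower (or unchanged) estimate is accepted.
        return min(previous, proposed)
    # Step 113: the estimate may move in either direction.
    return proposed


print(gated_noise_update(10.0, 12.0, 3))  # DOWN active, increase rejected
print(gated_noise_update(10.0, 8.0, 3))   # DOWN active, decrease accepted
print(gated_noise_update(10.0, 12.0, 0))  # DOWN inactive, increase accepted
```

Blocking upward updates while the hangover counter is running keeps a stationary but relevant signal, such as sustained music, from being absorbed into the background noise estimate.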
As previously described, the VAD_fail_short flag can trigger a "relevant" indication at the output of the release delay logic block when g_f(i) is determined to be greater than a predetermined value after a predetermined number of consecutive frames have been classified as noise by the VAD.
Likewise, after g_f(i) has exceeded a predetermined value for a predetermined number of consecutive frames, the VAD_fail_long flag can trigger a "relevant" indication at the output of the release delay logic and hold that indication for a relatively long hold time.
In one embodiment, the signal relevancy parameter complex_hang_count activates the DOWN input of the noise estimator 38 under the same conditions as the composite signal flag VAD_fail_long. If g_f(i) is greater than a first predetermined threshold for a first number of consecutive frames, or greater than a second predetermined threshold for a second number of consecutive frames, the signal relevancy parameters complex_high and complex_low operate such that the DELAY input of the noise estimator 38 is raised (if necessary) to a lower limit, even if a number of consecutive frames have been determined (by the speech/noise determiner 39) to be stationary.
FIG. 12 shows an exemplary operational flow that may be performed by the speech encoder embodiments of FIGS. 1-11. The normalized gain with the largest magnitude for the current frame is calculated in step 121. The gain is analyzed in step 122 to generate the relevancy parameters and the composite signal flags. In step 123, the relevancy parameters are used for background noise estimation in the VAD. If it is determined in step 125 that the audio signal does not contain perceptually relevant information, then in step 126 the bit rate is reduced, for example in a VR system, or comfort noise parameters are encoded, for example in a DTX system.
From the foregoing, it will be apparent to those skilled in the art that the embodiments of FIGS. 1-13 can be readily implemented by suitable modifications in software, hardware, or both, in conventional speech encoding equipment.
Although exemplary embodiments of the present invention have been described in detail hereinabove, it is not intended to limit the scope of the invention, which can be embodied in various forms.
Claims (23)
1. A method of retaining perceptually relevant non-speech information in an audio signal during encoding of the audio signal, comprising:
making a first determination as to whether the audio signal is deemed to include speech or noise information;
making a second determination as to whether the audio signal includes non-speech information that is perceptually relevant to a listener; and
selectively ignoring the first determination indicative of noise information in response to the second determination indicative of perceptually relevant non-speech information,
wherein the second determination comprises comparing a correlation value derived from an open-loop long-term prediction correlation analysis with a predetermined value.
2. The method of claim 1, wherein the selectively ignoring step comprises: ignoring the first determination in response to the correlation value being greater than a predetermined value.
3. The method of claim 1, wherein the selectively ignoring step comprises: ignoring the first determination in response to a predetermined number of correlation values being greater than a predetermined value for a given period of time, one correlation value for each respective frame into which the audio signal is divided.
4. The method of claim 3, wherein the selectively ignoring step comprises: ignoring the first determination in response to a predetermined number of consecutive correlation values being greater than a predetermined value, one correlation value for each respective frame into which the audio signal is divided.
5. The method of claim 1, wherein the correlation value is the highest standard correlation value of a high-pass filtered model of the audio signal.
6. The method of claim 1, wherein the correlation value is a maximum magnitude standard correlation value.
7. The method of claim 1, wherein the correlation value is a smoothed correlation value obtained by filtering a maximum magnitude standard correlation value.
8. The method of claim 1, further comprising the steps of:
generating a set of signal correlation parameters; and
using the set of signal correlation parameters in the first determination as to whether the audio signal is deemed to include speech or noise information.
9. The method of claim 8, wherein each respective correlation parameter in the set of signal correlation parameters is generated by comparing a correlation value to a threshold applicable to the respective correlation parameter.
10. A method of retaining perceptually relevant information in an audio signal, comprising:
making a first determination as to whether the audio signal is deemed to include speech or noise information;
detecting a highest standard correlation value of a high-pass filtered model of the audio signal by using an open-loop long-term prediction correlation analysis;
determining a value representing the highest standard correlation value;
comparing the determined value representing the highest standard correlation value with at least one threshold value, thereby obtaining an indication of whether the audio signal contains perceptually relevant information; and
in response to the indication of whether the audio signal contains perceptually relevant information, adjusting the first determination as to whether the audio signal is deemed to include speech or noise information.
11. The method of claim 10, wherein the detecting step comprises applying the correlation analysis to the audio signal without generating a high-pass filtered model of the audio signal.
12. The method of claim 10, wherein the detecting step comprises high-pass filtering the audio signal and then performing the correlation analysis on the high-pass filtered audio signal.
13. The method of claim 10, wherein the detecting step comprises determining a maximum magnitude standard correlation value.
14. The method of claim 13, wherein the determined value representing the highest standard correlation value is obtained by filtering the maximum magnitude standard correlation value.
15. The method of claim 10, further comprising the steps of:
generating a set of signal correlation parameters; and
using the set of signal correlation parameters in the first determination as to whether the audio signal is deemed to include speech or noise information.
16. The method of claim 15, wherein each respective correlation parameter of the set of signal correlation parameters is generated by comparing the determined value representing the highest standard correlation value with a respective threshold value applicable to the respective correlation parameter.
17. An apparatus for retaining perceptually relevant non-speech information contained within an audio signal in an audio signal encoder, comprising:
a voice activity detector for receiving an audio signal and making a first determination of whether the audio signal is deemed to include speech or noise information;
a signal activity detector for receiving the audio signal and making a second determination of whether the audio signal includes non-speech information that is perceptually relevant to a listener;
a logic block coupled to said voice activity detector and said signal activity detector, said logic block having an output indicative of whether the audio signal includes perceptually relevant information, said logic block being operable to selectively provide, at said output, information indicative of said first determination indicative of noise information, and further being operable to selectively ignore said first determination indicative of noise information in response to said second determination indicative of perceptually relevant non-speech information,
wherein the voice activity detector is operable to compare a correlation value derived from an open-loop long-term prediction analysis with a predetermined value.
18. The apparatus of claim 17, wherein the logic block is operable to ignore the first determination in response to the correlation value being greater than the predetermined value.
19. The apparatus of claim 17, wherein the logic block is operable to ignore the first determination in response to a predetermined number of the correlation values being greater than the predetermined value within a given time period, one correlation value for each respective frame into which the audio signal is divided.
20. The apparatus of claim 19, wherein the logic block is operable to ignore the first determination in response to a predetermined number of consecutive correlation values being greater than the predetermined value, one correlation value for each respective frame into which the audio signal is divided.
21. The apparatus of claim 17, wherein the voice activity detector is operable to derive the correlation value by detecting the highest standard correlation value of a high-pass filtered model of the audio signal.
22. The apparatus of claim 21, wherein the highest standard correlation value represents a maximum magnitude standard correlation value within a frame.
23. The apparatus of claim 22, wherein the correlation value is a smoothed correlation value obtained by filtering the maximum magnitude standard correlation value.
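The decision flow recited in the claims can be illustrated with a minimal sketch. It is not the patented implementation: it assumes that the "standard correlation value" is a normalized autocorrelation of the frame over a range of lags, it omits the high-pass filtering step, and every name and numeric value (`max_normalized_correlation`, `ComplexSignalDetector`, `keep_frame`, the threshold, the smoothing factor, the frame count) is hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence


def max_normalized_correlation(x: Sequence[float], min_lag: int, max_lag: int) -> float:
    """Largest-magnitude normalized autocorrelation of one frame over a lag range.

    Assumption: this stands in for the open-loop long-term prediction
    correlation analysis; a real encoder would reuse its pitch search.
    """
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        a, b = x[lag:], x[:len(x) - lag]
        num = sum(p * q for p, q in zip(a, b))
        den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
        if den > 0.0:  # skip lags where either segment has no energy
            best = max(best, abs(num / den))
    return best


class ComplexSignalDetector:
    """Flags perceptually relevant non-speech from per-frame correlation values."""

    def __init__(self, threshold: float = 0.5, frames_needed: int = 3,
                 smoothing: float = 0.8) -> None:
        # All three defaults are illustrative, not values from the patent.
        self.threshold = threshold
        self.frames_needed = frames_needed
        self.smoothing = smoothing
        self.smoothed = 0.0
        self.run = 0  # consecutive frames whose smoothed value exceeds the threshold

    def update(self, corr: float) -> bool:
        # First-order recursive filter: the "smoothed correlation value".
        self.smoothed = self.smoothing * self.smoothed + (1.0 - self.smoothing) * corr
        self.run = self.run + 1 if self.smoothed > self.threshold else 0
        return self.run >= self.frames_needed


def keep_frame(vad_speech: bool, complex_flag: bool) -> bool:
    """Logic block: a 'noise' decision is ignored when the complex flag is set."""
    return vad_speech or complex_flag
```

In use, each frame's correlation value would be passed to `update`, and the returned flag combined with the voice activity decision through `keep_frame`. Requiring several consecutive frames above the threshold is one way to avoid overriding the noise decision on a single spurious correlation peak.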
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US10955698P | 1998-11-23 | 1998-11-23 | |
| US60/109556 | 1998-11-23 | | |
| US09/434787 | 1999-11-05 | | |
| US09/434,787 US6424938B1 (en) | 1998-11-23 | 1999-11-05 | Complex signal activity detection for improved speech/noise classification of an audio signal |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1097080A1 (en) | 2007-06-15 |
| HK1097080B (en) | 2010-12-31 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CA2348913C (en) | | Complex signal activity detection for improved speech/noise classification of an audio signal |
| EP1145222B1 (en) | | Speech coding with comfort noise variability feature for increased fidelity |
| KR101452014B1 (en) | | Improved voice activity detector |
| US9646621B2 (en) | | Voice detector and a method for suppressing sub-bands in a voice detector |
| US9401160B2 (en) | | Methods and voice activity detectors for speech encoders |
| CN100508028C (en) | | Method and apparatus for adding a release delay frame to a plurality of frames encoded by a vocoder |
| CN107195313B (en) | | Method and apparatus for voice activity detection |
| JPH09152894A (en) | | Sound and silence discriminator |
| EP1112568B1 (en) | | Speech coding |
| RU2237296C2 (en) | | Method for encoding speech with function for altering comfort noise for increasing reproduction precision |
| HK1097080B (en) | | Complex signal activity detection for improved speech/noise classification of an audio signal |
| HK1114939A (en) | | Method and apparatus for robust speech classification |