HK1104369B - A method and encoder for encoding a frame in a communication system - Google Patents
- Publication number: HK1104369B
- Authority
- HK
- Hong Kong
Description
Technical Field
The present invention relates to a method of encoding a signal in an encoder of a communication system.
Background
Today, cellular communication systems have become widespread. Typically, cellular communication systems operate in accordance with a given standard or specification. For example, the standards or specifications may define communication protocols and/or parameters for the connection. Various standards or specifications include, but are not limited to, GSM (global system for mobile communications), GSM/EDGE (enhanced data rates for GSM evolution), AMPS (advanced mobile phone system), WCDMA (wideband code division multiple access), third generation (3G) UMTS (universal mobile telecommunications system), IMT 2000 (international mobile telecommunications 2000), and so on.
In cellular communication systems and general signal processing applications, signals are typically compressed to reduce the amount of data required to represent them. For example, audio signals are typically captured as analog signals, digitized in an analog-to-digital (a/D) converter, and then encoded. In a cellular communication system, the encoded signals may be transmitted over a radio air interface between user equipment, such as mobile terminals, and base stations. Alternatively, in a more general signal processing system, the encoded audio signal may be stored in a storage medium for later use and reproduction.
In a cellular communication system, the encoding process compresses the signal, which is then transmitted over the air interface with a minimum amount of data, while maintaining an acceptable level of signal quality. This is very important because of the limited capacity of radio channels on the radio air interface in cellular communication systems.
An ideal encoding method compresses the audio signal with as few bits as possible, thereby optimizing the channel capacity, while at the same time producing a decoded signal that is as realistic as possible to the original audio signal. In practice, a trade-off is usually made between the bit rate of the compression method and the quality of the decoded speech.
The compression or encoding may be lossy or lossless. In lossy compression, part of the information is lost during compression, and thus it is impossible to completely reconstruct the original signal from the compressed signal; in lossless compression, no information is lost, so the original signal can be completely reconstructed from the compressed signal.
The audio signal may be considered speech, music (or non-speech), or both. The different characteristics of speech and music make it difficult to design coding methods that handle both speech and music with good performance. In general, the optimal coding method for speech signals is not optimal for music or non-speech signals. Therefore, to solve this problem, different encoding methods for speech and music have been developed. However, before a suitable encoding method can be selected, the audio signal has to be classified as speech or music.
Classifying audio signals as speech signals or music/non-speech signals is a difficult task. The accuracy required for the classification depends on the application in which the signal is used. In some applications, such as in speech recognition or in archiving for storage or retrieval, the accuracy is critical.
However, an encoding method that is effective for a portion of an audio signal consisting mainly of speech is not necessarily effective for a portion consisting mainly of music. Conversely, it is possible that a coding method designed for music with strong tonal components is also well suited to speech. Thus, a method that classifies an audio signal purely as speech or music may not be able to select the best compression method for that signal.
The adaptive multi-rate (AMR) codec is a coding method developed by the third generation partnership project (3GPP) for GSM/EDGE and WCDMA communication networks. It is also envisaged that AMR will be used in future packet switched networks. AMR is based on algebraic code excited linear prediction (ACELP) excitation coding. The AMR and AMR-WB (adaptive multi-rate wideband) codecs comprise 8 and 9 active bit rates, respectively, and also include voice activity detection (VAD) and discontinuous transmission (DTX) functions. The sampling rate in the AMR codec is 8 kHz. The sampling rate in the AMR-WB codec is 16 kHz.
The AMR and AMR-WB encoders are described in the 3GPP TS 26.090 and 3GPP TS 26.190 specifications. For more details on AMR-WB and VAD, see the 3GPP TS 26.194 specification.
In another coding method, the extended AMR-WB (AMR-WB+) codec, the coding is based on two different excitation methods: ACELP-type pulse excitation and transform coded excitation (TCX). The ACELP excitation is similar to that used in the original AMR-WB codec, while the TCX excitation is an AMR-WB+-specific modification.
ACELP excitation coding uses a model of how the signal is generated at the source, and the parameters of the model are extracted from the signal. More specifically, ACELP excitation coding is based on a physiological acoustic system, in which the larynx and mouth are modeled as a linear filter and the signal is generated by periodic vibration of air exciting the filter. The encoder analyzes the signal on a frame-by-frame basis and generates, for each frame, a set of parameters representing the modeled signal, which the encoder then outputs. The parameter set may include excitation parameters and the coefficients of the filter, among other parameters. The output of this type of encoder is often referred to as a parametric representation of the input signal. The parameter sets are used to appropriately configure a decoder to reproduce the input signal.
In the AMR-WB + codec, Linear Predictive Coding (LPC) is calculated in each frame of a signal to model the spectral envelope of the signal as a linear filter. The result of LPC (usually called LPC excitation) is then encoded using either ACELP excitation or TCX excitation.
Typically, ACELP excitation uses long-term predictors and fixed codebook parameters, while TCX uses Fast Fourier Transforms (FFTs). Furthermore, in the AMR-WB + codec, the TCX excitation can work with one of 3 different frame lengths (20, 40 and 80 ms).
TCX excitation is widely used in non-speech audio coding. The strength of TCX excitation in non-speech signal coding results from its use of perceptual masking and frequency-domain coding. Although TCX techniques produce high-quality music signals, they are somewhat weak for periodic speech signals. In contrast, a codec based on a physiological acoustic system (e.g., ACELP) provides good quality for speech signals, while the quality of the music signals it provides is poor.
Thus, in general, ACELP excitation is mostly used for coding of speech signals and TCX excitation is mostly used for coding of music and non-speech signals. This is not always the case, however, and in some cases the speech signal contains parts like music signals, or the music signal contains parts like speech, or the audio signal contains both speech and music, and in this case it may not be optimal to select a coding method based on only one of the ACELP excitation or the TCX excitation.
In AMR-WB+, selection of the excitation can be done in a number of ways.
The first, and simplest, method is to analyze the properties of the signal before encoding it, thereby classifying it as a speech or music/non-speech signal, and to select the best excitation method for that signal type from ACELP excitation and TCX excitation. This is the so-called "pre-selection" method. However, this method is not suitable for signals whose characteristics vary between speech and music, resulting in encoded signals that are optimal for neither speech nor music.
Another, more complex approach is to encode the audio signal using both ACELP excitation and TCX excitation and then select the excitation method that yields the better quality synthesized audio signal. Signal quality may be measured using a signal-to-noise ratio. This "analysis-by-synthesis" type of method is also referred to as a "brute force" method because it computes all the different excitations and selects the best one. Although this method provides good results, it is not suitable for practical applications because of the computational complexity of performing the multiple encodings.
It is an object of embodiments of the present invention to provide an improved method for selecting an excitation method for encoding a signal, at least partly reducing some of the above mentioned problems.
Disclosure of Invention
According to a first aspect of the present invention, there is provided a method of encoding a frame in an encoder of a communication system, the method comprising the steps of: calculating a first set of parameters associated with the frame, wherein the first set of parameters includes filter bank parameters; in a first stage, selecting one of a plurality of coding methods based on a predetermined condition associated with the first set of parameters; calculating a second set of parameters associated with the frame; in a second stage, selecting one of the plurality of encoding methods based on the selection result of the first stage and the second parameter set; and encoding the frame using the encoding method selected in the second stage.
Preferably, the plurality of encoding methods includes a first excitation method and a second excitation method.
The first set of parameters may be based on energy levels of one or more frequency bands associated with the frame. For certain predetermined conditions of the first set of parameters, no coding method may be selected in the first stage.
The second set of parameters may include at least one of spectral parameters, LTP parameters, and correlation parameters associated with the frame.
Preferably, the first excitation method is algebraic code excited linear prediction excitation and the second excitation method is transform code excitation.
When the frame is encoded using the second excitation method, the method for encoding may further include selecting a length of the frame encoded using the second excitation encoding method based on the selection of the first and second stages.
The choice of the length of the coding frame may depend on the signal-to-noise ratio of the frame.
Preferably, the encoder is an AMR-WB+ encoder.
The frame may be an audio frame. Preferably, the audio frames comprise speech or non-speech. The non-speech may comprise music.
According to another aspect of the present invention, there is provided an encoder for encoding a frame in a communication system, the encoder comprising: a first calculation module for calculating a first set of parameters associated with the frame, wherein the first set of parameters includes filter bank parameters; a first stage selection module for selecting one of a plurality of encoding methods on the basis of the first set of parameters; a second calculation module for calculating a second set of parameters associated with the frame; a second stage selection module for selecting one of the plurality of encoding methods on the basis of the selection result of the first stage and the second parameter set; and an encoding module for encoding the frame using the encoding method selected by the second stage.
According to a further aspect of the present invention, there is provided a method of encoding a frame in an encoder of a communication system, the method comprising the steps of: calculating a first set of parameters associated with the frame, wherein the first set of parameters includes filter bank parameters; in a first phase, selecting one of a first or a second excitation method based on the first set of parameters; the frame is encoded using the selected excitation method.
Drawings
For a better understanding of the present invention, reference will now be made to the following drawings, in which:
FIG. 1 illustrates a communication network diagram to which embodiments of the present invention may be applied;
FIG. 2 shows a block diagram of an embodiment of the invention; and
FIG. 3 is a block diagram of a VAD filter bank in an embodiment of the present invention.
Detailed Description
The invention is described herein with reference to specific examples. The invention is not limited to these examples.
Fig. 1 illustrates a communication system 100 that supports signal processing using an AMR-WB+ codec according to one embodiment of the present invention.
System 100 includes an analog-to-digital (a/D) converter 104, an encoder 106, a transmitter 108, a receiver 110, a decoder 112, and a digital-to-analog (D/a) converter 114. The a/D converter 104, encoder 106 and transmitter 108 may form part of a mobile terminal. The receiver 110, decoder 112 and D/a converter 114 may form part of a base station.
The system 100 also includes one or more audio sources, such as a microphone, which are not shown in fig. 1. The generated audio signal 102 comprises speech and/or non-speech signals. The a/D converter 104 receives the analog signal 102 and converts it into a digital signal 105. It should be understood that if the audio source produces a digital signal rather than an analog signal, the a/D converter 104 may be omitted.
The digital signal 105 is input to an encoder 106, which encodes and compresses it on a frame-by-frame basis using a selected encoding method to produce encoded frames 107. The encoder 106 may operate with an AMR-WB+ codec or another suitable codec, and is described in detail below.
The encoded frames may be stored in a suitable storage medium, such as a digital sound recorder, for later processing. Alternatively, as shown in fig. 1, the encoded frame is input to the transmitter 108, and the transmitter 108 transmits the encoded frame 109.
The encoded frame 109 is received by a receiver 110, and the receiver 110 processes the encoded frame 109 and inputs the encoded frame 111 to a decoder 112. The decoder 112 decodes and decompresses the encoded frame 111. The decoder 112 also comprises decision means to decide the particular coding method used in the encoder for each received coded frame 111. The decoder 112 selects a decoding method for decoding the encoded frame 111 based on the decision.
The decoded frames are output by the decoder 112 in the form of a decoded signal 113, which is input to a D/a converter 114 that converts the digital decoded signal 113 into an analog signal 116. The analog signal 116 may then be processed further, for example converted to audio by a speaker.
Fig. 2 shows a block diagram of the encoder 106 of fig. 1 in a preferred embodiment. The encoder 106 operates in accordance with the AMR-WB+ codec and selects either ACELP excitation or TCX excitation for encoding the signal. The selection is made by determining the best coding model for the input signal through analysis of parameters generated in the coding modules.
The encoder 106 includes a Voice Activity Detection (VAD) module 202, a Linear Predictive Coding (LPC) analysis module 206, a Long Term Prediction (LTP) analysis module 208, and an excitation generation module 212. The excitation generation module 212 encodes the signal using one of ACELP excitation or TCX excitation.
The encoder 106 also includes an excitation selection module 216 that is coupled to the first stage selection module 204, the second stage selection module 210, and the third stage selection module 214. The excitation selection module 216 determines the excitation method, i.e., ACELP excitation or TCX excitation, used by the excitation generation module 212 for signal encoding.
The first stage selection module 204 is connected between the VAD module 202 and the LPC analysis module 206. The second stage selection module 210 is coupled between the LTP analysis module 208 and the stimulus generation module 212. The third stage selection module 214 couples the excitation generation module 212 and the output of the encoder 106.
The encoder 106 receives the input signal 105 at the VAD module 202, which determines whether the input signal 105 contains active audio or a silence period. The signal is then sent to the LPC analysis module 206, which processes it on a frame-by-frame basis.
The VAD module also calculates filter bank parameters that may be used for excitation selection. During silence periods, the excitation selection state is not updated.
The excitation selection module 216 determines a first excitation method in the first stage selection module 204. The first excitation method is one of ACELP excitation or TCX excitation and is used to encode the signal in the excitation generation module 212. If the excitation method cannot be determined in the first stage selection module 204, the excitation method is considered to be undefined.
The first excitation method is determined by the excitation selection module 216 based on the parameters received from the VAD module 202. Specifically, the input signal 105 is divided into a plurality of frequency bands by the VAD module, the signal in each frequency band having an associated energy level. The first stage selection module 204 receives the plurality of frequency bands and associated energy levels and passes them to the excitation selection module 216, where they are analyzed using a first excitation selection method to classify the signal broadly as speech-like or music-like.
The first excitation selection method may comprise analysing the relationship between the high and low frequency bands of the signal and the energy level variations in these frequency bands. The excitation selection module 216 may also use different analysis windows and decision thresholds in the analysis. Other parameters associated with the signal may also be used in the analysis.
Fig. 3 shows an example of a filter bank 300 utilized by the VAD module 202 to generate the different frequency bands. The energy level associated with each frequency band is generated by statistical analysis. The filter bank structure 300 comprises 3rd order filter units 306, 312, 314, 316, 318 and 320, and 5th order filter units 302, 304, 308, 310 and 313. Here the "order" of a filter unit refers to the maximum delay, in number of samples, used to generate each output sample. For example, y(n) = a*x(n) + b*x(n-1) + c*x(n-2) + d*x(n-3) is an example of a 3rd order filter.
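The notion of filter order used in the example above can be sketched as a small Python function. The coefficients a..d are arbitrary placeholders, not the actual AMR-WB VAD filter-bank coefficients:

```python
def fir3(x, a, b, c, d):
    """Apply the 3rd order FIR filter y(n) = a*x(n) + b*x(n-1) + c*x(n-2) + d*x(n-3).

    Samples before the start of the signal are taken as 0. The maximum
    delay of 3 samples is what makes this a 3rd order filter.
    """
    y = []
    for n in range(len(x)):
        xm = lambda k: x[n - k] if n - k >= 0 else 0.0
        y.append(a * xm(0) + b * xm(1) + c * xm(2) + d * xm(3))
    return y
```

Feeding a unit impulse through such a filter returns its coefficients, which is a convenient way to see that each output sample depends on at most the 3 previous inputs.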
The signal 301 is input into the filter bank and processed by a series of 3rd and/or 5th order filter units, resulting in the filtered signal bands: 322 (4.8-6.4 kHz), 324 (4.0-4.8 kHz), 326 (3.2-4.0 kHz), 328 (2.4-3.2 kHz), 330 (2.0-2.4 kHz), 332 (1.6-2.0 kHz), 334 (1.2-1.6 kHz), 336 (0.8-1.2 kHz), 338 (0.6-0.8 kHz), 340 (0.4-0.6 kHz), 342 (0.2-0.4 kHz) and 344 (0.0-0.2 kHz).
The filtered signal band 322 (4.8-6.4 kHz) results from the signal passing through the 5th order filter unit 302 and the 5th order filter unit 304 in sequence; band 324 (4.0-4.8 kHz) from units 302 and 304 and the 3rd order filter unit 306 in sequence; band 326 (3.2-4.0 kHz) likewise from units 302, 304 and 306; band 328 (2.4-3.2 kHz) from the 5th order filter units 302, 308 and 310 in sequence; band 330 (2.0-2.4 kHz) from units 302, 308 and 310 and the 3rd order filter unit 312 in sequence; band 332 (1.6-2.0 kHz) likewise from units 302, 308, 310 and 312; band 334 (1.2-1.6 kHz) from the 5th order filter units 302, 308 and 313 and the 3rd order filter unit 314 in sequence; band 336 (0.8-1.2 kHz) likewise from units 302, 308, 313 and 314; band 338 (0.6-0.8 kHz) from units 302, 308 and 313 and the 3rd order filter units 316 and 318 in sequence; band 340 (0.4-0.6 kHz) likewise from units 302, 308, 313, 316 and 318; band 342 (0.2-0.4 kHz) from units 302, 308 and 313 and the 3rd order filter units 316 and 320 in sequence; and band 344 (0.0-0.2 kHz) likewise from units 302, 308, 313, 316 and 320.
The parametric analysis performed by the excitation selection module 216, and in particular the classification result for the signal, is used to select a first excitation method from ACELP or TCX for encoding the signal in the excitation generation module 212. However, if the analysis does not yield a clear speech-like or music-like classification, e.g. when the signal has both speech and music characteristics, the excitation method is left undetermined and the selection decision is deferred to the next selection stage. For example, an explicit selection may be made in the second stage selection module 210 after the LPC and LTP analysis.
A first excitation selection method for selecting an excitation method will be exemplified below.
In determining the excitation method, the AMR-WB+ codec utilizes the AMR-WB VAD filter bank, in which the signal energy E(n) in each of the 12 sub-bands covering the frequency band 0-6400 Hz is determined for every 20 ms input signal frame. The energy level of each sub-band may be normalized by dividing the energy level E(n) of each sub-band by that sub-band's bandwidth (in Hz), which yields the normalized energy level EN(n) for each band.
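The normalization step can be sketched as follows, assuming the sub-bands are ordered from lowest to highest frequency and using the 12 band edges listed for the filter bank 300 (this ordering is an assumption for illustration):

```python
# Sub-band boundaries (Hz) of the 12 VAD filter-bank bands from the text,
# lowest band first: 0-0.2, 0.2-0.4, ..., 4.8-6.4 kHz.
BAND_EDGES_HZ = [0, 200, 400, 600, 800, 1200, 1600, 2000,
                 2400, 3200, 4000, 4800, 6400]

def normalized_energies(energies):
    """Divide each sub-band energy E(n) by its bandwidth in Hz, giving EN(n).

    energies: the 12 sub-band energy levels, lowest band first.
    """
    widths = [hi - lo for lo, hi in zip(BAND_EDGES_HZ, BAND_EDGES_HZ[1:])]
    return [e / w for e, w in zip(energies, widths)]
```

Normalizing by bandwidth expresses every band as energy per Hz, so that wide bands (such as 4.8-6.4 kHz) and narrow bands (such as 0-0.2 kHz) become directly comparable.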
In the first stage excitation selection module 204, the standard deviation of the energy levels may be calculated for each of the 12 sub-bands using two windows: a short window stdshort(n) and a long window stdlong(n). In AMR-WB+, the short window is 4 frames long and the long window is 16 frames long. In this algorithm, the 12 energy levels of the current frame, together with those of the previous 3 or 15 frames (yielding the 4- and 16-frame windows), are used to derive the two standard deviation values. A characteristic of this algorithm is that it is executed only when the VAD module 202 determines that the input signal 105 contains active audio. This enables the algorithm to react more accurately after long speech/music pauses, during which the statistical parameters could otherwise become distorted.
Then, for each frame, the standard deviations are averaged over all 12 sub-bands for the long and short windows, yielding the average standard deviation values stdalong and stdashort.
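The two-window measure can be sketched as below. Whether the population or sample standard deviation is used is not stated in the text, so the population form here is an assumption:

```python
import statistics

def avg_window_std(history, window):
    """Average, over the 12 sub-bands, of the per-band standard deviation
    of the normalized energies over the last `window` frames.

    history: list of frames (oldest first), each a list of 12 normalized
    sub-band energies. With window=4 this corresponds to stdashort, with
    window=16 to stdalong.
    """
    recent = history[-window:]
    n_bands = len(recent[0])
    stds = [statistics.pstdev(frame[b] for frame in recent)
            for b in range(n_bands)]
    return sum(stds) / n_bands
```

A stationary signal (constant band energies) gives a value near zero, which the selection logic below associates with music-like content, while rapidly varying energies give larger values, associated with speech.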
For each frame of the audio signal, a relationship between the low frequency band and the high frequency band may be calculated. In the AMR-WB+ codec, LevL is calculated by summing the energy levels of the low frequency sub-bands 2 to 8 and normalizing the sum by the total bandwidth (in Hz) of those sub-bands. Similarly, the sum of the energy levels of the high frequency sub-bands 9 to 12 is calculated and normalized, yielding LevH. In this example the lowest sub-band 1 is not used, since it typically contains a disproportionate amount of energy that would distort the measurement and make the contributions of the other sub-bands very small. From these measurements, the relationship LPH between the low band and the high band is obtained by the following equation:
LPH=LevL/LevH
In addition, an average LPHa value is calculated for each active frame using the current and previous 3 LPH values. The high/low frequency relationship LPHaF for the current frame can then be calculated as a weighted sum of the LPHa values of the current and previous 7 active frames, in which the more recent values are given more weight.
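The chain of measures LPH, LPHa and LPHaF can be sketched as follows. The text only states that more recent values receive more weight, so the linear weights used for LPHaF here are an illustrative assumption:

```python
def lph(lev_l, lev_h):
    """Low/high band relationship LPH = LevL / LevH for one frame."""
    return lev_l / lev_h

def lpha(lph_history):
    """Mean of the current and previous 3 LPH values (last 4 entries)."""
    last4 = lph_history[-4:]
    return sum(last4) / len(last4)

def lphaf(lpha_history):
    """Weighted sum of the current and previous 7 LPHa values.

    Linear weights (newer value -> larger weight) are an assumed,
    illustrative weighting; the sum is normalized by the total weight.
    """
    last8 = lpha_history[-8:]
    weights = list(range(1, len(last8) + 1))
    return sum(w * v for w, v in zip(weights, last8)) / sum(weights)
```

A large LPHaF means the energy is concentrated in the low bands relative to the high bands, which the decision rules below treat as evidence for TCX (music-like) content.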
The average energy level AVL of the filter bank for the current frame may be calculated by subtracting the estimated background-noise energy level from the output of each filter block, multiplying each difference by the highest frequency of the corresponding filter block, and summing the results. In this way the high frequency sub-bands, which contain relatively little energy, are balanced against the low frequency sub-bands, which contain more energy.
The total energy TotE0 of the current frame is calculated by summing the energy levels of all the filter blocks after subtracting the background-noise estimate of each.
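The two frame-energy measures can be sketched as below; the per-band noise estimates and the band top frequencies are caller-supplied values here, since the text does not specify how the background-noise estimate itself is derived:

```python
def avl(band_energies, noise_estimates, band_top_freqs_hz):
    """Average energy level: noise-reduced band energies, each weighted by
    the highest frequency of its filter block, summed over all blocks.
    The frequency weighting balances the low-energy high bands against
    the higher-energy low bands."""
    return sum((e - n) * f for e, n, f in
               zip(band_energies, noise_estimates, band_top_freqs_hz))

def tot_e0(band_energies, noise_estimates):
    """Total frame energy: noise-reduced band energies summed directly."""
    return sum(e - n for e, n in zip(band_energies, noise_estimates))
```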
After the above calculations are completed, a selection between the ACELP and TCX excitation methods may be made using a method in which, whenever a given flag is set, the other flags are cleared to prevent conflicting settings.
First, the average standard deviation value stdalong for the long window is compared with a first threshold TH1 (e.g., 0.4). If stdalong is less than TH1, the TCX MODE flag is set to indicate selection of the TCX excitation algorithm for encoding; otherwise, the calculated high/low frequency relationship measure LPHaF is compared with a second threshold TH2 (e.g., 280).
If the calculated high/low frequency relationship measure LPHaF is greater than the second threshold TH2, the TCX MODE flag is set. Otherwise, the inverse of the difference obtained by subtracting the first threshold TH1 from the standard deviation value stdalong is calculated and added to a first constant C1 (e.g., 5). The sum is compared with the calculated high/low frequency relationship measure LPHaF as follows:
C1+(1/(stdalong-TH1))>LPHaF (1)
If comparison (1) is true, the TCX MODE flag is set to indicate that the TCX excitation algorithm is selected for encoding. Otherwise, the standard deviation value stdalong is multiplied by a first multiplicand M1 (e.g., -90), and the product is added to a second constant C2 (e.g., 120). The sum is compared with the calculated high/low frequency relationship measure LPHaF as follows:
(M1*stdalong)+C2<LPHaF (2)
If the sum is less than the calculated high/low frequency relationship measure LPHaF, i.e. comparison (2) is true, the ACELP MODE flag is set to indicate that the ACELP excitation algorithm is selected for encoding. Otherwise, the UNCERTAIN MODE flag is set, indicating that the excitation method for the current frame has not yet been determined.
Further checks may then be made before confirming the excitation method selected for the current frame.
The further check first determines whether the ACELP MODE flag or the UNCERTAIN MODE flag is set. If either flag is set, and if the calculated average energy level AVL of the filter bank for the current frame is greater than a third threshold TH3 (e.g., 2000), the TCX MODE flag is set, and the ACELP MODE and UNCERTAIN MODE flags are cleared.
Then, if the UNCERTAIN MODE flag is still set, a calculation similar to that described above for the long-window average standard deviation value stdalong is performed for the short-window average standard deviation value stdashort, but with slightly different constants and thresholds in the comparisons.
If the average standard deviation value stdashort for the short window is less than a fourth threshold TH4 (e.g., 0.2), the TCX MODE flag is set to indicate that the TCX excitation algorithm is selected for encoding. Otherwise, the inverse of the difference obtained by subtracting the fourth threshold TH4 from stdashort is calculated and added to a third constant C3 (e.g., 2.5). The sum is compared with the calculated high/low frequency relationship measure LPHaF as follows:
C3+(1/(stdashort-TH4))>LPHaF (3)
If comparison (3) is true, the TCX MODE flag is set to indicate that the TCX excitation algorithm is selected for encoding. Otherwise, the standard deviation value stdashort is multiplied by a second multiplicand M2 (e.g., -90), and the product is added to a fourth constant C4 (e.g., 140). The sum is compared with the calculated high/low frequency relationship measure LPHaF as follows:
M2*stdashort+C4<LPHaF (4)
If the sum is less than the calculated high/low frequency relationship measure LPHaF, i.e. comparison (4) is true, the ACELP MODE flag is set to indicate that the ACELP excitation algorithm is selected for encoding. Otherwise, the UNCERTAIN MODE flag is set, indicating that the excitation method for the current frame has not yet been determined.
In the next stage, the energy levels of the current and previous frames may be checked. If the ratio of the total energy TotE0 of the current frame to the total energy TotE-1 of the previous frame is greater than a fifth threshold TH5 (e.g., 25), the ACELP MODE flag is set, and the TCX MODE and UNCERTAIN MODE flags are cleared.
Finally, if either the TCX MODE or the UNCERTAIN MODE flag is set, and if the calculated average energy level AVL of the filter bank 300 for the current frame is greater than the third threshold TH3 while the total energy TotE0 of the current frame is less than a sixth threshold TH6 (e.g., 60), the ACELP MODE flag is set.
When the first excitation selection method described above has been performed, the first excitation method selected in the first stage selection module 204 is TCX if the TCX MODE flag is set, and ACELP if the ACELP MODE flag is set. However, if the UNCERTAIN MODE flag is set, the first excitation selection method has not determined the first excitation method. In this case, the TCX or ACELP excitation is selected in another excitation selection stage (e.g., the second stage selection module 210), where further analysis may be performed to decide which of the TCX or ACELP excitations will be used.
The first excitation selection method described above may be illustrated by the following pseudo code:
if (stdalong < TH1)
    SET TCX_MODE
else if (LPHaF > TH2)
    SET TCX_MODE
else if ((C1 + (1 / (stdalong - TH1))) > LPHaF)
    SET TCX_MODE
else if ((M1 * stdalong + C2) < LPHaF)
    SET ACELP_MODE
else
    SET UNCERTAIN_MODE

if ((ACELP_MODE or UNCERTAIN_MODE) and (AVL > TH3))
    SET TCX_MODE

if (UNCERTAIN_MODE)
    if (stdashort < TH4)
        SET TCX_MODE
    else if ((C3 + (1 / (stdashort - TH4))) > LPHaF)
        SET TCX_MODE
    else if ((M2 * stdashort + C4) < LPHaF)
        SET ACELP_MODE
    else
        SET UNCERTAIN_MODE

if (UNCERTAIN_MODE)
    if ((TotE0 / TotE-1) > TH5)
        SET ACELP_MODE

if (TCX_MODE or UNCERTAIN_MODE)
    if (AVL > TH3 and TotE0 < TH6)
        SET ACELP_MODE
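The flag-setting pseudo code above can be folded into a single mode-decision function. The sketch below is a hypothetical Python rendering: the default values for TH1 to TH4, C1 to C3 and M1 are illustrative assumptions (only TH5 = 25, TH6 = 60, M2 = -90 and C4 = 140 are quoted in the text), and the variable names mirror the pseudo code.

```python
TCX_MODE, ACELP_MODE, UNCERTAIN_MODE = "TCX", "ACELP", "UNCERTAIN"

def first_stage_select(stda_long, stda_short, lph_af, avl, tot_e0, tot_e_prev,
                       TH1=0.5, TH2=280.0, TH3=2000.0, TH4=0.8, TH5=25.0, TH6=60.0,
                       C1=5.0, C2=90.0, C3=2.0, C4=140.0, M1=-90.0, M2=-90.0):
    # First chain: long-window standard deviation against the LPHaF measure.
    if stda_long < TH1:
        mode = TCX_MODE
    elif lph_af > TH2:
        mode = TCX_MODE
    elif (C1 + 1.0 / (stda_long - TH1)) > lph_af:
        mode = TCX_MODE
    elif (M1 * stda_long + C2) < lph_af:
        mode = ACELP_MODE
    else:
        mode = UNCERTAIN_MODE

    # A high average filter-bank level forces TCX.
    if mode in (ACELP_MODE, UNCERTAIN_MODE) and avl > TH3:
        mode = TCX_MODE

    # Still undecided: repeat the test with the short-window deviation.
    if mode == UNCERTAIN_MODE:
        if stda_short < TH4:
            mode = TCX_MODE
        elif (C3 + 1.0 / (stda_short - TH4)) > lph_af:
            mode = TCX_MODE
        elif (M2 * stda_short + C4) < lph_af:
            mode = ACELP_MODE

    # Energy ratio between the current and previous frame.
    if mode == UNCERTAIN_MODE and (tot_e0 / tot_e_prev) > TH5:
        mode = ACELP_MODE

    # Final override: high average level but low total energy selects ACELP.
    if mode in (TCX_MODE, UNCERTAIN_MODE) and avl > TH3 and tot_e0 < TH6:
        mode = ACELP_MODE
    return mode
```

A very small long-window deviation immediately selects TCX, while a steep decision line on the deviation selects ACELP; the remaining checks only refine an uncertain result.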
After the first stage selection module 204 has completed the above method and the first excitation method for encoding the signal has been selected, the signal is sent from the VAD module 202 to the LPC analysis module 206, which processes the signal on a frame-by-frame basis.
Specifically, the LPC analysis module 206 determines the LPC filter corresponding to a frame by minimizing the residual of the frame. Once determined, the LPC filter can be represented by a set of filter coefficients. The frames processed by the LPC analysis module 206 are sent to the input of the LTP analysis module 208 together with any parameters determined by the module (e.g. the LPC filter coefficients).
The LTP analysis module 208 processes the received frames and parameters. In particular, the LTP analysis module 208 calculates the LTP parameter, which is closely related to the pitch frequency of the frame and is commonly referred to as the "pitch lag" or "lag" parameter, describing the periodicity of the speech signal in terms of speech samples. The LTP analysis module 208 also calculates the LTP gain, which is closely related to the periodicity of the speech signal.
The frames processed by the LTP analysis module 208 are sent to the excitation generation module 212 along with the calculated parameters, where the excitation generation module 212 encodes the frames using one of ACELP or TCX excitation methods. Selection of one of the ACELP or TCX excitation methods is accomplished by the excitation selection module 216 in conjunction with the second stage selection module 210.
The second stage selection module 210 receives the frames processed by the LTP analysis module 208 together with the parameters calculated by the LPC analysis module 206 and the LTP analysis module 208. The excitation selection module 216 analyzes these parameters, in particular the LPC and LTP parameters and the normalized correlation, to determine the better of the ACELP and TCX excitation methods for the current frame. The second stage selection module either verifies the first excitation method chosen by the first stage selection module or, if the first stage left the excitation method undetermined, selects the excitation method at this stage. The selection of the excitation method for frame coding is therefore delayed until after the LTP analysis has been performed.
In the second stage selection module, the normalized correlation may be used, which may be calculated as follows:

NormCorr = ( Σ_{i=0..N-1} x_i · x_{i-T0} ) / sqrt( Σ_{i=0..N-1} x_i² · Σ_{i=0..N-1} x_{i-T0}² )

where N denotes the frame length, T0 denotes the open-loop lag of a frame of length N, x_i denotes the i-th sample of the encoded frame, and x_{i-T0} denotes the encoded frame sample T0 samples before x_i.
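As a minimal sketch, assuming the conventional energy-normalized form of the open-loop correlation (the closed form of the patent's equation is not reproduced in this extraction), the calculation might be implemented as:

```python
import math

def norm_corr(x, t0):
    # Normalized correlation of frame samples x at open-loop lag t0:
    # sum(x[i] * x[i - t0]) normalized by the energies of both terms.
    # The exact normalization is an assumption.
    idx = range(t0, len(x))
    num = sum(x[i] * x[i - t0] for i in idx)
    den = math.sqrt(sum(x[i] ** 2 for i in idx) *
                    sum(x[i - t0] ** 2 for i in idx))
    return num / den if den > 0.0 else 0.0
```

A perfectly periodic signal with period t0 yields a value of 1.0, while uncorrelated noise yields values near 0.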
There are some exceptions in the second stage excitation selection, where the ACELP or TCX selection of the first stage may be changed or re-selected.
In a stable signal, the difference between the maximum and minimum open-loop lag values of the current and previous frames remains below a second predetermined threshold; that is, the lag does not vary greatly between successive frames. The LTP gain range of the AMR-WB+ codec is typically between 0 and 1.2, and the normalized correlation range is typically between 0 and 1.0. For example, a threshold indicating a high LTP gain may be 0.8. High similarity between the LTP gain and the normalized correlation can be observed by examining their difference: if the difference is less than a third threshold, e.g. 0.1, in the current and/or previous frame, the LTP gain is considered to be highly correlated with the normalized correlation.
If the signal is transient in character, in an embodiment of the invention the encoding may be performed using the first excitation method, such as ACELP. Transient sequences can be detected using the spectral distance SD of adjacent frames. For example, if the spectral distance SDn of frame n, calculated from the immittance spectral pair (ISP) coefficients of the current and previous frames, exceeds a first predetermined threshold, the signal is classified as transient. The ISP coefficients are obtained by converting the LPC filter coefficients into ISPs.
A noise-like sequence may be encoded using the second excitation method, such as TCX. Noise-like sequences may be detected by examining the LTP parameters and the average frequency of the frame in the frequency domain. If the LTP parameters are very unstable and/or the average frequency exceeds a predetermined threshold, the frame is determined to contain a noise-like signal.
An example of an algorithm that can be used for the second excitation selection method is described below.
If the VAD flag is set, indicating an active audio signal, and the first excitation method has been left undetermined (e.g. defined as TCX_OR_ACELP) in the first stage selection module, the second excitation method is selected as follows:
if (SDn > 0.2)
    Mode = ACELP_MODE
else
    if (LagDifbuf < 2)
        if (Lagn == HIGH_LIMIT or Lagn == LOW_LIMIT)
            if (Gainn - NormCorrn < 0.1 and NormCorrn > 0.9)
                Mode = ACELP_MODE
            else
                Mode = TCX_MODE
        else if (Gainn - NormCorrn < 0.1 and NormCorrn > 0.88)
            Mode = ACELP_MODE
        else if (Gainn - NormCorrn > 0.2)
            Mode = TCX_MODE
        else
            NoMtcx = NoMtcx + 1
    if (MaxEnergybuf < 60)
        if (SDn > 0.15)
            Mode = ACELP_MODE
        else
            NoMtcx = NoMtcx + 1
The spectral distance SDn of frame n is calculated from the ISP parameters as follows:

SDn = Σ_i |ISPn(i) - ISPn-1(i)|

where ISPn is the ISP coefficient vector of frame n, and ISPn(i) is the i-th component of ISPn.
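A one-line sketch of this measure, assuming the sum-of-absolute-differences form over the ISP vectors (the original equation image is not reproduced in this text):

```python
def spectral_distance(isp_n, isp_prev):
    # Sum of absolute component-wise differences between the ISP
    # vectors of the current frame and the previous frame.
    return sum(abs(a - b) for a, b in zip(isp_n, isp_prev))
```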
LagDifbuf is a buffer containing the open-loop lag values of the previous 10 frames (20 ms).
Lagn contains the two open-loop lag values of the current frame n.
Gainn contains the two LTP gain values of the current frame n.
NormCorrn contains the two normalized correlation values of the current frame n.
MaxEnergybuf is the maximum of a buffer of energy values; the energy buffer contains the last 6 energy values of the current and previous (20 ms) frames.
Iphn indicates the tilt of the spectrum.
NoMtcx is a flag indicating that TCX encoding with a long frame length (80 ms) is to be avoided when TCX excitation is selected.
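Using the definitions above, the second-stage decision for an undetermined frame can be sketched in Python. The scalar arguments stand in for the per-frame buffers described above, the HIGH_LIMIT/LOW_LIMIT defaults are placeholder assumptions, and a returned mode of None means the frame is still undetermined by this path.

```python
TCX_MODE, ACELP_MODE = "TCX", "ACELP"

def second_stage_undetermined(sd_n, lagdif_buf, lag_n, gain_n, norm_corr_n,
                              max_energy_buf, no_mtcx,
                              high_limit=115, low_limit=34):
    mode = None  # stays None if no rule fires (frame still undetermined)
    if sd_n > 0.2:
        # Large spectral distance indicates a transient: use ACELP.
        mode = ACELP_MODE
    else:
        if lagdif_buf < 2:
            if lag_n in (high_limit, low_limit):
                if gain_n - norm_corr_n < 0.1 and norm_corr_n > 0.9:
                    mode = ACELP_MODE
                else:
                    mode = TCX_MODE
            elif gain_n - norm_corr_n < 0.1 and norm_corr_n > 0.88:
                mode = ACELP_MODE
            elif gain_n - norm_corr_n > 0.2:
                mode = TCX_MODE
            else:
                no_mtcx += 1  # avoid long (80 ms) TCX frames later
        if max_energy_buf < 60:
            if sd_n > 0.15:
                mode = ACELP_MODE
            else:
                no_mtcx += 1
    return mode, no_mtcx
```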
If the VAD flag is set, indicating an active audio signal, and the first excitation method has been determined as ACELP in the first stage selection module, the first excitation method determination is verified according to the following algorithm, by which the excitation method may be changed to TCX:
if (LagDifbuf < 2)
    if (NormCorrn < 0.80 and SDn < 0.1)
        Mode = TCX_MODE
    if (Iphn > 200 and SDn < 0.1)
        Mode = TCX_MODE
If the VAD flag is set in the current frame while the VAD flag was zero in at least one frame of the previous superframe (a superframe is 80 ms long and contains four 20 ms frames), and the mode has been selected as TCX, then TCX excitation over a full 80 ms frame (the use of TCX80) is disabled (NoMtcx is set):
if (vadFlagold == 0 and vadFlag == 1 and Mode == TCX_MODE)
    NoMtcx = NoMtcx + 1
If the VAD flag is set and the first excitation method has been determined as undetermined (TCX_OR_ACELP) or as TCX, the excitation method is selected according to the following algorithm:
if (Gainn - NormCorrn < 0.006 and NormCorrn > 0.92 and Lagn > 21)
    DFTSum = 0
    for (i = 1; i < 40; i++)
        DFTSum = DFTSum + mag[i]
    if (DFTSum > 95 and mag[0] < 5)
        Mode = TCX_MODE
    else
        Mode = ACELP_MODE
        NoMtcx = NoMtcx + 1
vadFlagold represents the VAD flag of the previous frame and vadFlag represents the VAD flag of the current frame.
NoMtcx is a flag indicating that TCX encoding with a long frame length (80 ms) is to be avoided when TCX excitation is selected.
mag represents the discrete Fourier transform (DFT) spectral envelope created from the LP filter coefficients Ap of the current frame.
DFTSum represents the sum of the first 40 components of the vector mag, excluding the 1st component mag[0].
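The spectral-envelope check above can be sketched as follows; mag is assumed to be a vector of at least 40 envelope magnitudes, and the returned mode is left unchanged when the guard condition does not hold:

```python
def dft_sum_select(gain_n, norm_corr_n, lag_n, mag, mode=None, no_mtcx=0):
    # Very high correlation together with a long lag triggers the envelope test.
    if gain_n - norm_corr_n < 0.006 and norm_corr_n > 0.92 and lag_n > 21:
        dft_sum = sum(mag[i] for i in range(1, 40))  # components 1..39
        if dft_sum > 95 and mag[0] < 5:
            mode = "TCX"
        else:
            mode = "ACELP"
            no_mtcx += 1  # avoid long (80 ms) TCX frames later
    return mode, no_mtcx
```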
The frames following the second stage selection module 210 are then sent to the excitation generation module 212, and the excitation generation module 212 encodes the frames received from the LTP analysis module 208, as well as the parameters received from the previous modules, using one of the excitation methods selected at either the first stage selection module 204 or the second stage selection module 210. The encoding is controlled by an excitation selection module 216.
The frames output by the excitation generation module 212 are encoded frames represented by the parameters determined by the LPC analysis module 206, the LTP analysis module 208, and the excitation generation module 212. Finally, the encoded frame is output by the third stage selection module 214.
If the frame is encoded using ACELP excitation, the encoded frame passes directly through the third stage selection module 214 and is output as an encoded frame 107. However, if the frame is encoded using TCX excitation, the length of the encoded frame is determined by the number of previously selected ACELP frames in the superframe, where the superframe length is 80 ms and contains 4 × 20 ms frames. In other words, the length of a TCX encoded frame depends on the number of ACELP frames among the preceding frames.
The maximum length of a TCX coded frame is 80 ms, and an 80 ms superframe may consist of a single 80 ms TCX coded frame (TCX80), two 40 ms TCX coded frames (TCX40), or four 20 ms TCX coded frames (TCX20). The decision on how to encode the 80 ms TCX frame is made by the excitation selection module 216 using the third stage selection module 214, and also depends on the number of ACELP frames selected in the superframe.
For example, the third stage selection module 214 may measure the signal-to-noise ratio of the encoded frames from the excitation generation module 212 and therefore select either 2 x40 ms encoded frames or a single 80ms encoded frame.
The third excitation selection phase is only performed if the number of ACELP methods selected in the first and second excitation selection phases within an 80 ms superframe is less than 3 (ACELP < 3). Table 1 below shows the possible method combinations before and after the third excitation selection phase. In the third excitation selection phase, the frame length of the TCX method is selected, for example, according to the SNR.
Table 1. TCX method combinations
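Since the body of Table 1 is not reproduced in this extraction, the following is only an illustrative sketch of a third-stage rule consistent with the constraints stated above: the stage runs only when fewer than three ACELP frames were selected, a full 80 ms TCX frame requires that no ACELP frame occupies the superframe, and among the admissible TCX lengths the one with the best SNR wins. The SNR comparison itself is an assumption, not the patent's rule.

```python
def third_stage_select(acelp_count, snr_tcx20, snr_tcx40, snr_tcx80):
    # With three ACELP frames chosen, only a single 20 ms TCX slot remains,
    # so the third selection stage is skipped entirely.
    if acelp_count >= 3:
        return "TCX20"
    candidates = {"TCX20": snr_tcx20, "TCX40": snr_tcx40, "TCX80": snr_tcx80}
    if acelp_count > 0:
        # Any ACELP frame in the superframe rules out one 80 ms TCX frame.
        del candidates["TCX80"]
    # Pick the admissible TCX frame length with the highest measured SNR.
    return max(candidates, key=candidates.get)
```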
This embodiment selects ACELP excitation for periodic signals with high long-term correlation (which may contain speech) and for transient signals. TCX excitation, on the other hand, is selected for stationary signals, noise-like signals, and tone-like signals, which are better suited to frequency-domain analysis and coding.
Although the selection of the excitation method is delayed in this embodiment, it is still applied to the current frame, so the embodiment provides a signal encoding method of lower complexity than existing methods while also consuming less memory. This improvement is particularly important for mobile devices, which have only limited memory and processing power.
Furthermore, the use of parameters from the VAD module, LPC and LTP analysis modules makes the classification of the signal more accurate and therefore the selection of the best excitation method for signal coding more accurate.
It should be noted that although the codec used in the embodiments of the present invention is AMR-WB+, it will be apparent to those skilled in the art that the techniques described herein can readily be applied, as alternative and additional embodiments of the present invention, to other codecs having a plurality of excitation methods.
Furthermore, although the above embodiments use one or both of ACELP and TCX, as alternative and additional embodiments of the invention, it will be apparent to those skilled in the art that other methods of excitation may be employed.
The above-described encoder may be applied to other terminals, such as computers or other signal processing devices, in addition to mobile terminals.
It is also worth noting here that the embodiments of the present invention are not limited to the above, but that various changes and modifications can be made without departing from the scope of the solution disclosed in the claims of the present invention.
Claims (18)
1. A method of encoding a frame in an encoder of a communication system, the method comprising the steps of:
calculating a first set of parameters associated with the frame, wherein the first set of parameters contains parameters relating to frequency bands and their associated energy levels;
in a first stage, selecting one of an algebraic code-excited linear prediction excitation, a transform code excitation or an undetermined mode based on predetermined conditions associated with the first set of parameters;
calculating a second set of parameters associated with the frame, the second set of parameters including at least one of spectral parameters, long-term prediction parameters, and correlation parameters associated with the frame;
in a second stage, selecting one of an algebraic code-excited linear prediction excitation and a transform code excitation based on the result of the first stage selection and the second set of parameters; and
encoding the frame using one of an algebraic code-excited linear prediction excitation and a transform-coded excitation selected from the second stage.
2. The method according to claim 1, wherein, if an algebraic code excited linear prediction excitation has been selected, the selection in the second stage comprises a re-selection of an algebraic code excited linear prediction excitation or a selection of a transformed code excitation instead according to a first algorithm,
wherein the first algorithm comprises detecting an active audio signal and, if so, performing the following:
if LagDifbuf is less than 2, and if NormCorrn is less than 0.8 and SDn is less than 0.1, setting the value of MODE to TCX_MODE;
if Iphn is greater than 200 and SDn is less than 0.1, setting the value of MODE to TCX_MODE,
wherein LagDifbuf is a buffer containing the open-loop lag values of the previous 10 frames;
NormCorrn contains the two normalized correlation values of the current frame n;
SDn is the spectral distance of frame n; and
Iphn indicates the tilt of the frequency spectrum.
3. The method of claim 1, wherein, if a transformed code excitation or an undetermined mode has been selected, the selection in the second stage comprises re-selecting the transformed code excitation or instead selecting an algebraic code-excited linear prediction excitation according to a second algorithm,
wherein the second algorithm comprises: detecting an active audio signal and, if any, performing the following:
if Gainn - NormCorrn is less than 0.006 and NormCorrn is greater than 0.92 and Lagn is greater than 21, setting the value of DFTSum to 0;
executing a loop DFTSum = DFTSum + mag[i], starting from an initial value of 1 for the variable i and increasing i by 1 after each execution, the loop being executed until the value of i is not less than 40; and
if DFTSum is greater than 95 and mag[0] is less than 5, setting the value of MODE to TCX_MODE, otherwise setting the value of MODE to ACELP_MODE and adding 1 to the NoMtcx value,
wherein Gainn contains the two LTP gain values of the current frame n;
NormCorrn contains the two normalized correlation values of the current frame n;
Lagn contains the two open-loop lag values of the current frame n;
NoMtcx is a flag bit for indicating that transform code excitation encoding with a long frame length is to be avoided when transform code excitation is selected;
mag is the discrete Fourier transform (DFT) spectral envelope created from the LP filter coefficients Ap of the current frame; and
DFTSum is the sum of the first 40 components of the vector mag, excluding the 1st component mag[0].
4. The method as claimed in claim 1, wherein, if the undetermined mode has been selected, the selection in the second stage comprises selecting one of an algebraic code-excited linear prediction excitation and a transformed code excitation according to a third algorithm,
wherein the third algorithm comprises: detecting an active audio signal and, if any, performing the following:
if SDn is greater than 0.2, setting the value of MODE to ACELP_MODE;
otherwise, determining whether LagDifbuf is less than 2, and if so:
in the case where Lagn is equal to HIGH_LIMIT or Lagn is equal to LOW_LIMIT:
determining whether Gainn - NormCorrn is less than 0.1 and NormCorrn is greater than 0.9,
if so, setting the value of MODE to ACELP_MODE,
otherwise setting the value of MODE to TCX_MODE;
further, in the case where Gainn - NormCorrn is less than 0.1 and NormCorrn is greater than 0.88, setting the value of MODE to ACELP_MODE;
further, in the case where Gainn - NormCorrn is greater than 0.2, setting the value of MODE to TCX_MODE;
in other cases, increasing the NoMtcx value by 1; and
determining whether MaxEnergybuf is less than 60, and if so, setting the value of MODE to ACELP_MODE in the case where SDn is greater than 0.15, and in other cases increasing the NoMtcx value by 1,
wherein SDn is the spectral distance of frame n;
LagDifbuf is a buffer containing the open-loop lag values of the previous 10 frames;
Lagn contains the two open-loop lag values of the current frame n;
Gainn contains the two LTP gain values of the current frame n;
NormCorrn contains the two normalized correlation values of the current frame n;
NoMtcx is a flag bit for indicating that transform code excitation encoding with a long frame length is to be avoided when transform code excitation is selected; and
MaxEnergybuf is the maximum of a buffer of energy values, the buffer containing the last 6 energy values of the current and previous frames.
5. The method of claim 1, wherein when the frame is encoded using transform coded excitation, the method further comprises:
selecting a length of a frame to be encoded with transform coded excitation based on the selection at the first stage and the second stage.
6. A method as claimed in claim 5, wherein the selection of the coding frame length depends on the signal-to-noise ratio of the frame.
7. The method of claim 1, wherein the encoder is an adaptive multi-rate-wideband-plus encoder.
8. The method of claim 1, wherein the frame is an audio frame comprising speech and non-speech, wherein the non-speech comprises music.
9. The method of any preceding claim, wherein the first parameter set is a filter bank parameter.
10. An encoder for encoding a frame in a communication system, the encoder comprising:
a first calculation module to calculate a first set of parameters associated with the frame, wherein the first set of parameters contains parameters related to frequency bands and their associated energy levels;
a first stage selection module that selects one of an algebraic code excited linear prediction excitation, a transform code excitation, or an undetermined mode based on predetermined conditions associated with the first set of parameters;
a second calculation module to calculate a second set of parameters associated with the frame, the second set of parameters including at least one of spectral parameters, long-term prediction parameters, and correlation parameters associated with the frame;
a second stage selection module that selects one of an algebraic code-excited linear prediction excitation and a transform code excitation based on the selection of the first stage and the second set of parameters; and
an encoding module that encodes the frame using one of an algebraic code-excited linear prediction excitation and a transform code excitation selected from the second stage.
11. An encoder as defined in claim 10, wherein the second stage selection module comprises: means for, if an algebraic code-excited linear prediction excitation has been selected in the first stage selection module, re-selecting the algebraic code-excited linear prediction excitation or instead selecting a transformed code excitation according to a first algorithm,
wherein the first algorithm comprises detecting an active audio signal and performing the following operations, if any:
if LagDifbuf is less than 2, and if NormCorrn is less than 0.8 and SDn is less than 0.1, setting the value of MODE to TCX_MODE;
if Iphn is greater than 200 and SDn is less than 0.1, setting the value of MODE to TCX_MODE,
wherein LagDifbuf is a buffer containing the open-loop lag values of the previous 10 frames;
NormCorrn contains the two normalized correlation values of the current frame n;
SDn is the spectral distance of frame n; and
Iphn indicates the tilt of the frequency spectrum.
12. An encoder as defined in claim 10, wherein the second stage selection module comprises: means for, if a transformed code excitation or the undetermined mode has been selected in the first stage selection module, re-selecting the transformed code excitation or instead selecting an algebraic code-excited linear prediction excitation according to a second algorithm,
wherein the second algorithm comprises: detecting an active audio signal and, if any, performing the following:
if Gainn - NormCorrn is less than 0.006 and NormCorrn is greater than 0.92 and Lagn is greater than 21, setting the value of DFTSum to 0;
executing a loop DFTSum = DFTSum + mag[i], starting from an initial value of 1 for the variable i and increasing i by 1 after each execution, the loop being executed until the value of i is not less than 40; and
if DFTSum is greater than 95 and mag[0] is less than 5, setting the value of MODE to TCX_MODE, otherwise setting the value of MODE to ACELP_MODE and adding 1 to the NoMtcx value,
wherein Gainn contains the two LTP gain values of the current frame n;
NormCorrn contains the two normalized correlation values of the current frame n;
Lagn contains the two open-loop lag values of the current frame n;
NoMtcx is a flag bit for indicating that transform code excitation encoding with a long frame length is to be avoided when transform code excitation is selected;
mag is the discrete Fourier transform (DFT) spectral envelope created from the LP filter coefficients Ap of the current frame; and
DFTSum is the sum of the first 40 components of the vector mag, excluding the 1st component mag[0].
13. An encoder as defined in claim 10, wherein the second stage selection module comprises: means for selecting one of an algebraic code-excited linear prediction excitation and a transformed code excitation according to a third algorithm if an undetermined mode has been selected in said first stage selection module,
wherein the third algorithm comprises: detecting an active audio signal and, if any, performing the following:
if SDn is greater than 0.2, setting the value of MODE to ACELP_MODE;
otherwise, determining whether LagDifbuf is less than 2, and if so:
in the case where Lagn is equal to HIGH_LIMIT or Lagn is equal to LOW_LIMIT:
determining whether Gainn - NormCorrn is less than 0.1 and NormCorrn is greater than 0.9,
if so, setting the value of MODE to ACELP_MODE,
otherwise setting the value of MODE to TCX_MODE;
further, in the case where Gainn - NormCorrn is less than 0.1 and NormCorrn is greater than 0.88, setting the value of MODE to ACELP_MODE;
further, in the case where Gainn - NormCorrn is greater than 0.2, setting the value of MODE to TCX_MODE;
in other cases, increasing the NoMtcx value by 1; and
determining whether MaxEnergybuf is less than 60, and if so, setting the value of MODE to ACELP_MODE in the case where SDn is greater than 0.15, and in other cases increasing the NoMtcx value by 1,
wherein SDn is the spectral distance of frame n;
LagDifbuf is a buffer containing the open-loop lag values of the previous 10 frames;
Lagn contains the two open-loop lag values of the current frame n;
Gainn contains the two LTP gain values of the current frame n;
NormCorrn contains the two normalized correlation values of the current frame n;
NoMtcx is a flag bit for indicating that transform code excitation encoding with a long frame length is to be avoided when transform code excitation is selected; and
MaxEnergybuf is the maximum of a buffer of energy values, the buffer containing the last 6 energy values of the current and previous frames.
14. An encoder as defined in claim 10, further comprising:
a third stage selection module configured to select a length of a frame to be encoded with transform coded excitation based on selections at the first stage selection module and the second stage selection module.
15. An encoder as defined in claim 14, wherein the third stage selection module is configured to select a length of a frame to be encoded based on a signal-to-noise ratio of the frame.
16. The encoder of claim 10, wherein the encoder is an adaptive multi-rate-wideband-plus-encoder.
17. An encoder as defined in claim 10, wherein the frame is an audio frame comprising speech and non-speech, wherein the non-speech comprises music.
18. The encoder of any of claims 10 to 17, wherein the first parameter set is a filter bank parameter.
Applications Claiming Priority (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB0408856.3 | 2004-04-21 | ||
| GBGB0408856.3A GB0408856D0 (en) | 2004-04-21 | 2004-04-21 | Signal encoding |
| PCT/IB2005/001033 WO2005104095A1 (en) | 2004-04-21 | 2005-04-19 | Signal encoding |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1104369A1 HK1104369A1 (en) | 2008-01-11 |
| HK1104369B true HK1104369B (en) | 2012-07-20 |