
HK1185709B - Coding generic audio signals at low bitrates and low delay - Google Patents


Info

Publication number
HK1185709B
HK1185709B (application HK13112954.4A)
Authority
HK
Hong Kong
Prior art keywords
domain
frequency
time
sound signal
input sound
Prior art date
Application number
HK13112954.4A
Other languages
Chinese (zh)
Other versions
HK1185709A1 (en)
Inventor
Tommy Vaillancourt
Milan Jelinek
Original Assignee
Voiceage Evs Llc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Voiceage Evs Llc filed Critical Voiceage Evs Llc
Priority claimed from PCT/CA2011/001182 external-priority patent/WO2012055016A1/en
Publication of HK1185709A1 publication Critical patent/HK1185709A1/en
Publication of HK1185709B publication Critical patent/HK1185709B/en

Description

Coding generic audio signals at low bit rates and with low delay
Technical Field
The present disclosure relates to mixed time-domain/frequency-domain coding devices and methods of coding an input sound signal, and corresponding encoders and decoders using these mixed time-domain/frequency-domain coding devices and methods.
Background
Prior-art conversational codecs can represent a clean speech signal with very good quality at a bit rate of about 8 kbps, and approach transparency at a bit rate of 16 kbps. However, at bit rates below 16 kbps, low-delay conversational codecs, which most commonly encode the input speech signal in the time domain, are not suitable for generic audio signals such as music and reverberant speech. To overcome this drawback, switched codecs have been introduced, which essentially use a time-domain approach for coding speech-dominated input signals and a frequency-domain approach for coding generic audio signals. However, such switched solutions typically require long processing delays, both for the speech/music classification and for the transform to the frequency domain.
To overcome the above disadvantages, a more unified time-domain and frequency-domain model is proposed.
Disclosure of Invention
The present disclosure relates to a mixed time-domain/frequency-domain coding device for coding an input sound signal, comprising: a calculator of a time-domain excitation contribution in response to the input sound signal; a calculator of a cut-off frequency of the time-domain excitation contribution in response to the input sound signal; a filter responsive to the cut-off frequency for adjusting a frequency range of the time-domain excitation contribution; a calculator of a frequency-domain excitation contribution in response to the input sound signal; and an adder for adding the filtered time-domain excitation contribution and the frequency-domain excitation contribution to form a mixed time-domain/frequency-domain excitation constituting a coded version of the input sound signal.
The present disclosure also relates to an encoder using time-domain and frequency-domain modes, comprising: a classifier for classifying an input sound signal as speech or non-speech; a time-domain only encoder; the above-mentioned mixed time-domain/frequency-domain coding device; and a selector for selecting only one of the time-domain only encoder and the mixed time-domain/frequency-domain coding device for encoding the input sound signal, depending on the classification of the input sound signal.
In this disclosure, a mixed time-domain/frequency-domain coding device is described for coding an input sound signal, comprising: a calculator of a time-domain excitation contribution in response to the input sound signal, wherein the calculator of the time-domain excitation contribution processes the input sound signal in successive frames of the input sound signal and comprises a calculator of a number of sub-frames to be used in a current frame of the input sound signal, and wherein the calculator of the time-domain excitation contribution uses, in the current frame, the number of sub-frames determined by the sub-frame number calculator for the current frame; a calculator of a frequency-domain excitation contribution in response to the input sound signal; and an adder for adding the time-domain excitation contribution and the frequency-domain excitation contribution to form a mixed time-domain/frequency-domain excitation constituting a coded version of the input sound signal.
The present disclosure further relates to a decoder for decoding a sound signal encoded using a mixed time-domain/frequency-domain coding device as described above, comprising: a converter for converting the mixed time-domain/frequency-domain excitation to the time domain; and a synthesis filter responsive to the mixed time-domain/frequency-domain excitation converted to the time domain for synthesizing the sound signal.
The present disclosure also relates to a mixed time-domain/frequency-domain coding method of coding an input sound signal, comprising: calculating a time-domain excitation contribution in response to the input sound signal; calculating a cut-off frequency of the time-domain excitation contribution in response to the input sound signal; adjusting a frequency range of the time-domain excitation contribution in response to the cut-off frequency; calculating a frequency-domain excitation contribution in response to the input sound signal; and adding the adjusted time-domain excitation contribution and the frequency-domain excitation contribution to form a mixed time-domain/frequency-domain excitation constituting a coded version of the input sound signal.
In this disclosure, there is further described a method of encoding using time-domain and frequency-domain modes, comprising: classifying an input sound signal into speech or non-speech; providing a time-domain only coding method; providing the above-mentioned mixed time domain/frequency domain coding method; and selecting only one of the time-domain coding method and the mixed time-domain/frequency-domain coding method for coding the input sound signal depending on the classification of the input sound signal.
The present disclosure still further relates to a mixed time-domain/frequency-domain coding method of coding an input sound signal, comprising: calculating a time-domain excitation contribution in response to the input sound signal, wherein calculating the time-domain excitation contribution comprises processing the input sound signal in successive frames of the input sound signal and calculating a number of sub-frames to be used in a current frame of the input sound signal, and wherein calculating the time-domain excitation contribution further comprises using, in the current frame, the number of sub-frames determined for the current frame; calculating a frequency-domain excitation contribution in response to the input sound signal; and adding the time-domain excitation contribution and the frequency-domain excitation contribution to form a mixed time-domain/frequency-domain excitation constituting a coded version of the input sound signal.
In this disclosure, there is still further described a method of decoding a sound signal encoded using a mixed time-domain/frequency-domain coding method as described above, comprising: converting the mixed time-domain/frequency-domain excitation to the time domain; and synthesizing the sound signal by a synthesis filter in response to the mixed time-domain/frequency-domain excitation converted to the time domain.
The above and other features of the invention will be more apparent upon reading of the following non-limiting description of an exemplary embodiment of the proposed time and frequency domain model, given by way of example only with reference to the accompanying drawings.
Drawings
In the drawings:
FIG. 1 is a schematic block diagram illustrating an overview of an enhanced CELP (Code-Excited Linear Prediction) encoder, e.g., an ACELP (Algebraic Code-Excited Linear Prediction) encoder;
FIG. 2 is a schematic block diagram of a more detailed structure of the enhanced CELP encoder of FIG. 1;
FIG. 3 is a schematic block diagram of an overview of a calculator of cut-off frequency;
FIG. 4 is a schematic block diagram of a more detailed structure of the calculator of the cut-off frequency of FIG. 3;
FIG. 5 is a schematic block diagram of an overview of a frequency quantizer; and
FIG. 6 is a schematic block diagram of a more detailed structure of the frequency quantizer of FIG. 5.
Detailed Description
The proposed more unified time-domain and frequency-domain model can improve the synthesis quality for generic audio signals, such as music and/or reverberant speech, without increasing the processing delay and bit rate. The model operates, for example, in the Linear Prediction (LP) residual domain, where the available bits are dynamically allocated among an adaptive codebook, one or more fixed codebooks (e.g., algebraic codebooks, Gaussian codebooks, etc.), and a frequency-domain coding mode, depending on the characteristics of the input signal.
To achieve a low-delay, low-bit-rate conversational codec that improves the synthesis quality of generic audio signals such as music and/or reverberant speech, the frequency-domain coding mode may be integrated as closely as possible with the CELP (Code-Excited Linear Prediction) time-domain coding mode. For this purpose, the frequency-domain coding mode uses, for example, a frequency transform performed in the LP residual domain. This allows switching from one frame (e.g., a 20 ms frame) to another with little or no artifacts. Furthermore, the two (2) coding modes are integrated closely enough that, if the current coding mode is determined not to be efficient enough, the bit budget can be dynamically reallocated to the other coding mode.
One feature of the proposed more unified time-domain and frequency-domain model is a variable time support of the time-domain component, which varies from a quarter frame to a full frame on a frame-by-frame basis; these units of time support are referred to as sub-frames. As an illustrative example, one frame represents 20 ms of the input signal. This corresponds to 320 samples if the codec's internal sampling frequency is 16 kHz, or 256 samples if the codec's internal sampling frequency is 12.8 kHz. A quarter frame (sub-frame) then represents 80 or 64 samples, depending on the codec's internal sampling frequency. In the following exemplary embodiment, the internal sampling frequency of the codec is 12.8 kHz, giving a frame length of 256 samples. The variable time support allows the dominant temporal events to be captured at a minimum bit rate to create a basic time-domain excitation contribution. At very low bit rates, the time support is typically the entire frame. In that case, the time-domain contribution to the excitation signal consists of the adaptive codebook only, and the corresponding pitch information with the corresponding gain is sent once per frame. When more bits are available, more temporal events can be captured by shortening the time support (and increasing the bit rate allocated to the time-domain coding mode). Finally, when the time support is sufficiently short (down to a quarter frame) and the available bit rate is sufficiently high, the time-domain contribution may include an adaptive codebook contribution, a fixed codebook contribution, or both, with corresponding gains. Parameters describing the codebook indices and gains are then transmitted for each sub-frame.
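The frame and sub-frame sizes quoted above follow directly from the internal sampling frequency; a small sketch (function and variable names are ours, not from any codec source) makes the arithmetic explicit:

```python
def samples_per_frame(fs_hz: int, frame_ms: int = 20) -> int:
    """Number of samples in one frame at the codec's internal sampling rate."""
    return fs_hz * frame_ms // 1000

# 20 ms frames at the two internal sampling frequencies mentioned above
frame_16k = samples_per_frame(16000)    # 320 samples
frame_12k8 = samples_per_frame(12800)   # 256 samples

# A quarter-frame (sub-frame) then holds 80 or 64 samples, respectively
subframe_16k = frame_16k // 4
subframe_12k8 = frame_12k8 // 4
```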
At low bit rates, conversational codecs cannot properly encode the higher frequencies. This can severely degrade the synthesis quality when the input signal contains music and/or reverberant speech. To address this problem, a feature is added that calculates the efficiency of the time-domain excitation contribution. In some cases, regardless of the input bit rate and time support, the time-domain excitation contribution is of no value. In those cases, all bits are reallocated to the frequency-domain coding of the next step. In most cases, however, the time-domain excitation contribution is valuable only up to a certain frequency, called the cut-off frequency. In these cases, the time-domain excitation contribution is filtered out above the cut-off frequency. The filtering operation preserves the valuable information coded in the time-domain excitation contribution and removes the non-valuable information above the cut-off frequency. In an exemplary embodiment, the filtering is performed in the frequency domain by setting the frequency bins above a certain frequency to zero.
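The filtering described above, zeroing frequency bins above the cut-off, can be illustrated with a toy sketch (the helper name and the toy spectrum are hypothetical, not from the codec):

```python
def lowpass_bins(spectrum, cutoff_bin):
    """Zero all frequency bins at or above cutoff_bin (hypothetical helper
    illustrating the frequency-domain filtering of the time-domain
    excitation contribution)."""
    return [v if i < cutoff_bin else 0.0 for i, v in enumerate(spectrum)]

# Toy spectrum: bins above the third one are discarded
spec = [3.0, 2.5, 1.0, 0.4, 0.2, 0.1]
filtered = lowpass_bins(spec, 3)
```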
The variable time support, in combination with the variable cut-off frequency, makes the bit allocation within the integrated time-domain and frequency-domain model very dynamic. The bit budget remaining after quantization of the LP filter may be allocated entirely to the time domain, entirely to the frequency domain, or shared between the two. The bit rate allocation between the time and frequency domains is made as a function of the number of sub-frames used for the time-domain contribution, the available bit budget, and the calculated cut-off frequency.
To create a total excitation that matches the input residual more efficiently, a frequency-domain coding mode is applied. One feature of this disclosure is to frequency-domain encode a vector that contains, up to the cut-off frequency, the difference between the frequency representation (frequency transform) of the input LP residual and the frequency representation (frequency transform) of the filtered time-domain excitation contribution, and that contains, above the cut-off frequency, the frequency representation (frequency transform) of the input LP residual itself. A smooth spectral transition region is inserted between the two segments, just above the cut-off frequency. In other words, the high-frequency part of the frequency representation of the time-domain excitation contribution is first zeroed out. A transition region between the unaltered part of the spectrum and the zeroed-out part of the spectrum is inserted just above the cut-off frequency to ensure a smooth transition between the two parts of the spectrum. This modified spectrum of the time-domain excitation contribution is then subtracted from the frequency representation of the input LP residual. Neglecting the transition region, the resulting spectrum thus corresponds to the difference of the two spectra below the cut-off frequency, and to the frequency representation of the LP residual above the cut-off frequency. As mentioned above, the cut-off frequency may differ from one frame to another.
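The construction of the vector handed to frequency quantization can be sketched as follows; the transition-region shape (a short linear fade of the time-domain contribution) and all names are our assumptions, not the patent's exact windowing:

```python
def build_freq_vector(residual_spec, td_spec, cut_bin, trans_len=2):
    """Sketch of the vector to be frequency-quantized:
    below cut_bin, residual minus time-domain contribution;
    a short cross-fade just above cut_bin; residual alone beyond it."""
    out = []
    for i in range(len(residual_spec)):
        if i < cut_bin:
            out.append(residual_spec[i] - td_spec[i])
        elif i < cut_bin + trans_len:
            # linear fade of the time-domain contribution down to zero
            w = 1.0 - (i - cut_bin + 1) / (trans_len + 1)
            out.append(residual_spec[i] - w * td_spec[i])
        else:
            out.append(residual_spec[i])
    return out

res = [1.0] * 6
td = [0.5] * 6
mixed = build_freq_vector(res, td, cut_bin=2, trans_len=2)
```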
Regardless of the frequency quantization method (frequency-domain coding mode) chosen, there is always a possibility of pre-echo, especially when long windows are used. In this technique, the window used is a square window, so that the extra window length compared to the encoded signal is zero (0), i.e., overlap-add is not used. Although this corresponds to an optimal window for reducing any potential pre-echo, some pre-echo may still be audible on some temporal attacks. There are many techniques to address such pre-echo problems, but the present disclosure proposes a simple feature to eliminate them. This feature is based on a memoryless time-domain coding mode derived from the "transition mode" of ITU-T Recommendation G.718 (reference: ITU-T Recommendation G.718, "Frame error robust narrow-band and wideband embedded variable bit-rate coding of speech and audio from 8-32 kbit/s", June 2008, sections 6.8.1.4 and 6.8.4.2). The idea behind this feature is to exploit the fact that the proposed more unified time-domain and frequency-domain model is integrated in the LP residual domain, so that switching is almost always done without artifacts. When a signal is considered to be generic audio (music and/or reverberant speech) and a temporal attack is detected in a frame, this frame is encoded using only this special memoryless time-domain coding mode. This mode takes the temporal attack into account and thus avoids the pre-echo that frequency-domain encoding of that frame might introduce.
Illustrative embodiments
In the proposed more unified time-domain and frequency-domain model, the above adaptive codebook, one or more fixed codebooks (e.g., algebraic codebooks, Gaussian codebooks, etc.) (i.e., the so-called time-domain codebooks), and the frequency-domain quantization (frequency-domain coding mode) can be considered as a library of codebooks, and bits can be allocated among all or a subset of the available codebooks. This means that, for example, if the input sound signal is clean speech, all bits are allocated to the time-domain coding mode, essentially reducing the coding to a conventional CELP scheme. On the other hand, for some music pieces, all the bits allocated for encoding the input LP residual are sometimes best spent in the frequency domain, e.g., in a transform domain.
As indicated by the previous description, the temporal support of the time-domain and frequency-domain coding modes need not be the same. While the bits spent on different time-domain quantization methods (adaptive and algebraic codebook search) are typically allocated on a subframe basis (quarter-frame, or 5ms time support), the bits allocated to the frequency-domain coding mode are allocated on a frame basis (typically 20ms time support) to improve frequency resolution.
The bit budget allocated to the time-domain CELP coding mode may also be dynamically controlled depending on the input sound signal. In some cases, the bit budget allocated to the time-domain CELP coding mode may be zero, effectively meaning that the entire bit budget is devoted to the frequency-domain coding mode. The choice of working in the LP residual domain for both the time-domain and frequency-domain approaches has two (2) major benefits. First, it is compatible with the CELP coding mode, which has proven effective for speech signal coding; therefore, no artifacts are introduced by switching between the two types of coding modes. Second, the low dynamics of the LP residual relative to the original input sound signal, and its relative flatness, make it easier to use a square window for the frequency transform, thus allowing the use of non-overlapping windows.
Similarly to ITU-T Recommendation G.718, in the non-limiting example where the codec's internal sampling frequency is 12.8 kHz (meaning 256 samples per frame), the length of a sub-frame used in the time-domain CELP coding mode can vary from the typical 1/4 of the frame length (5 ms) to a half frame (10 ms) or the full frame length (20 ms). The sub-frame length decision is based on the available bit rate and on an analysis of the input sound signal, in particular the spectral dynamics of this input sound signal. The sub-frame length decision may be made in a closed-loop manner. To reduce complexity, it may also be made in an open-loop manner. The sub-frame length may vary from frame to frame.
Once the length of the sub-frame is selected in a particular frame, a standard closed-loop pitch analysis is performed and the first contribution to the excitation signal is selected from the adaptive codebook. Then, depending on the available bit budget and the characteristics of the input sound signal (e.g. in case of an input speech signal), a second contribution from one or several fixed codebooks may be added before the transform domain coding. The resulting excitation is referred to as the time-domain excitation contribution. On the other hand, at very low bitrates and in the case of normal audio, it is often better to skip the fixed codebook stage and use all remaining bits for the transform domain coding mode. The transform domain coding mode may be, for example, a frequency domain coding mode. As described above, the subframe length may be one quarter frame, half frame or one frame long. The fixed codebook contribution is only used when the subframe length is equal to the quarter-frame length. In case the sub-frame length is decided to be half or full frame long, then only the adaptive codebook contribution is used to represent the time-domain excitation and all remaining bits are allocated to the frequency-domain coding mode.
Once the computation of the time-domain excitation contribution is completed, its efficiency needs to be evaluated and quantified. If the gain of the time-domain coding is low, it is more efficient to remove the time-domain excitation contribution altogether and to use all the bits for the frequency-domain coding mode instead. On the other hand, in the case of clean input speech, for example, all bits are allocated to the time-domain coding mode and no frequency-domain coding mode is needed. Often, however, the coding in the time domain is effective only up to a certain frequency. This frequency is referred to as the cut-off frequency of the time-domain excitation contribution. The determination of this cut-off frequency ensures that the entire time-domain coding contributes to a better final synthesis than frequency-domain coding alone.
The cut-off frequency is estimated in the frequency domain. To calculate the cut-off frequency, the spectra of both the LP residual and the time-domain excitation contribution are first decomposed into a predetermined number of frequency bands. The number of frequency bands and the number of frequency bins covered by each frequency band may vary from one implementation to another. For each frequency band, a normalized correlation is calculated between the frequency representation of the time-domain excitation contribution and the frequency representation of the LP residual, and the correlation is smoothed between adjacent frequency bands. The per-band correlation is limited to a lower bound of 0.5 and normalized between 0 and 1. The average correlation is then calculated as the mean of the correlations over all bands. For a first estimate of the cut-off frequency, the average correlation is then scaled between 0 and half the sampling rate (half the sampling rate corresponding to a normalized correlation value of 1). The first estimate of the cut-off frequency is then found as the upper bound of the frequency band closest to that value. In the implemented example, sixteen (16) bands at 12.8 kHz are defined for the correlation calculation.
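A minimal sketch of the per-band correlation and the first cut-off estimate, assuming hypothetical band boundaries and omitting the inter-band smoothing mentioned above:

```python
def band_correlation(res_band, td_band):
    """Normalized cross-correlation between a band of the LP residual
    spectrum and the same band of the time-domain contribution spectrum."""
    num = sum(r * t for r, t in zip(res_band, td_band))
    den = (sum(r * r for r in res_band) * sum(t * t for t in td_band)) ** 0.5
    return num / den if den > 0.0 else 0.0

def first_cutoff_estimate(band_corrs, band_upper_hz, fs_hz=12800):
    """Floor the per-band correlations at 0.5, renormalize to [0, 1],
    average, scale to [0, fs/2], and snap to the nearest band upper bound."""
    floored = [max(c, 0.5) for c in band_corrs]
    normalized = [(c - 0.5) / 0.5 for c in floored]   # map [0.5, 1] -> [0, 1]
    avg = sum(normalized) / len(normalized)
    target = avg * fs_hz / 2
    return min(band_upper_hz, key=lambda ub: abs(ub - target))

# Hypothetical 4-band example (the text uses 16 bands at 12.8 kHz)
corrs = [1.0, 1.0, 0.5, 0.5]
band_upper_hz = [1600, 3200, 4800, 6400]
cutoff = first_cutoff_estimate(corrs, band_upper_hz)
```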
Using psychoacoustic properties of the human ear, the reliability of the cut-off frequency estimate is improved by comparing the estimated position of the 8th harmonic of the pitch with the cut-off frequency estimated by the correlation calculation. If this position is higher than the cut-off frequency estimated by the correlation calculation, the cut-off frequency is modified to correspond to the position of the 8th harmonic of the pitch. The final value of the cut-off frequency is then quantized and transmitted. In the implemented example, 3 or 4 bits are used for this quantization, giving 8 or 16 possible cut-off frequencies depending on the bit rate.
Once the cut-off frequency is known, the frequency quantization of the frequency-domain excitation contribution is performed. First, the difference between the frequency representation (frequency transform) of the input LP residual and the frequency representation (frequency transform) of the time-domain excitation contribution is determined. Then, a new vector is created, consisting of this difference up to the cut-off frequency and of a smooth transition to the frequency representation of the input LP residual for the remaining spectrum. Frequency quantization is then applied to the entire new vector. In the implemented example, the quantization consists in coding the signs and positions of the main (most energetic) spectral pulses. The number of pulses to be quantized per frequency band is related to the bit rate available for the frequency-domain coding mode. If there are not enough bits available to cover all the bands, the remaining bands are filled with noise.
Quantizing a frequency band with the quantization method described in the previous paragraph does not guarantee that all frequency bins in this band are quantized. This is especially true at low bit rates, where the number of pulses quantized per band is correspondingly small. To prevent audible artifacts caused by these unquantized bins, some noise is added to fill the gaps. Since the quantized pulses, rather than the inserted noise, should dominate the spectrum at low bit rates, the noise spectral amplitude corresponds to only a fraction of the pulse amplitude. The amplitude of the added noise in the spectrum is higher when the available bit budget is low (allowing more noise) and lower when the available bit budget is high.
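The noise-fill step might look like the following sketch, where `noise_fill` is a hypothetical helper: the fraction `noise_factor` of the mean pulse magnitude (larger at lower bit rates, as described above) and the random sign choice are our assumptions:

```python
import random

def noise_fill(quantized, noise_factor, seed=0):
    """Fill zero (unquantized) bins with low-level noise.
    noise_factor is a fraction of the mean quantized-pulse magnitude,
    so the inserted noise stays below the coded pulses."""
    rng = random.Random(seed)
    pulses = [abs(v) for v in quantized if v != 0.0]
    level = noise_factor * (sum(pulses) / len(pulses)) if pulses else 0.0
    return [v if v != 0.0 else rng.choice((-level, level)) for v in quantized]

# Two coded pulses (mean magnitude 3.0); the gaps get amplitude 0.75 noise
filled = noise_fill([4.0, 0.0, 2.0, 0.0], noise_factor=0.25)
```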
In the frequency-domain coding mode, a gain is calculated for each frequency band to match the energy of the unquantized signal with the energy of the quantized signal. The gains are vector quantized and applied to the quantized signal per frequency band. When the encoder changes its bit allocation from the time-domain only coding mode to the mixed time-domain/frequency-domain coding mode, the per-band excitation spectral energy of the time-domain only coding mode does not match that of the mixed time-domain/frequency-domain coding mode. This energy mismatch can be perceived as a switching artifact, especially at low bit rates. To reduce any audible degradation caused by this bit reallocation, a long-term gain may be calculated for each band and applied to correct the energy of each band for several frames after the switch from the time-domain only coding mode to the mixed time-domain/frequency-domain coding mode.
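The per-band gain matching described above reduces to an energy ratio; a sketch (names ours):

```python
def band_gain(reference_band, quantized_band):
    """Gain that matches the energy of the quantized band to that of the
    reference (unquantized) band, as computed per frequency band above."""
    e_ref = sum(v * v for v in reference_band)
    e_q = sum(v * v for v in quantized_band)
    return (e_ref / e_q) ** 0.5 if e_q > 0.0 else 0.0

ref = [1.0, 2.0, 2.0]   # band energy 9.0
qnt = [0.5, 1.0, 1.0]   # band energy 2.25
g = band_gain(ref, qnt)  # scaling qnt by g restores the band energy
```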
After the frequency-domain coding is completed, the total excitation is found by adding the frequency-domain excitation contribution to the frequency representation (frequency transform) of the time-domain excitation contribution, and the sum of the excitation contributions is then transformed back to the time domain to form the total excitation. Finally, the synthesis signal is computed by filtering the total excitation through an LP synthesis filter. In one embodiment, while the CELP coding memories are updated on a sub-frame basis using only the time-domain excitation contribution, the total excitation is used to update those memories at the frame boundaries. In another possible implementation, the CELP coding memories are updated on a sub-frame basis and at the frame boundaries using only the time-domain excitation contribution. This results in an embedded structure in which the frequency-domain quantized signal constitutes an upper quantization layer independent of the core CELP layer. In this particular case, a fixed codebook is always used in order to update the adaptive codebook content, and the frequency-domain coding mode may be applied to the entire frame. This embedded approach is suitable for bit rates around 12 kbps and higher.
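A toy rendering of the final step, summing the two excitation contributions in the frequency domain and returning to the time domain, using a naive inverse DFT as a stand-in for the codec's actual transform (all names ours):

```python
import cmath

def total_excitation(td_spec, fd_spec):
    """Add the time-domain and frequency-domain excitation contributions
    in the frequency domain, then bring the sum back to the time domain
    with a naive inverse DFT (illustrative stand-in for the real transform)."""
    n = len(td_spec)
    total = [t + f for t, f in zip(td_spec, fd_spec)]
    return [
        sum(total[k] * cmath.exp(2j * cmath.pi * k * i / n)
            for k in range(n)).real / n
        for i in range(n)
    ]

# Two DC-only contributions sum to a constant time-domain excitation
x = total_excitation([1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0])
```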
1) Sound type classification
Fig. 1 is a schematic block diagram illustrating an overview of an enhanced CELP encoder 100, e.g., an ACELP encoder. Of course, other types of enhanced CELP encoders may be implemented using the same concepts. Fig. 2 is a schematic block diagram of a more detailed structure of the enhanced CELP encoder 100.
The CELP encoder 100 includes a preprocessor 102 (FIG. 1) that analyzes an input sound signal 101 (FIGS. 1 and 2). Referring to FIG. 2, the preprocessor 102 contains an LP analyzer 201, a spectrum analyzer 202, an open-loop pitch analyzer 203, and a signal classifier 204 of the input sound signal 101. The analyzers 201 and 202 perform the analyses typically carried out in CELP coding, as described, for example, in ITU-T Recommendation G.718, sections 6.4 and 6.1.4, and are therefore not described further in this disclosure.
The preprocessor 102 performs a first level of analysis to classify the input sound signal 101 between speech and non-speech (generic audio, i.e., music or reverberant speech), in a manner similar to that described in reference [T. Vaillancourt et al., "Inter-tone noise reduction in a low bit rate CELP decoder," Proc. IEEE ICASSP, Taipei, Taiwan, Apr. 2009, pp. 4113-4116], which is incorporated herein by reference in its entirety, or using any other reliable speech/non-speech discrimination method.
After this first level of analysis, the preprocessor 102 performs a second level of analysis of the input signal parameters to allow time-domain CELP coding (no frequency-domain coding) to be used on some sound signals that have strong non-speech characteristics but can still be better coded using a time-domain approach. This second level of analysis allows the CELP encoder 100 to switch to a memoryless time-domain coding mode when significant changes in energy occur, commonly referred to as Transition mode in reference [Eksler, V., and Jelinek, M. (2008), "Transition mode coding for source controlled CELP codecs", Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, March-April 2008, pp. 4001-4004], which is incorporated herein by reference in its entirety.
During this second stage of analysis, the signal classifier 204 calculates and uses the deviation σ_c of the smoothed version C_st of the open-loop pitch correlation from the open-loop pitch analyzer 203, the total energy E_tot of the current frame, and the difference E_diff between the total energy of the current frame and the total energy of the previous frame. The deviation of the smoothed open-loop pitch correlation is calculated first, wherein:
C_st is the smoothed open-loop pitch correlation;
C_ol is the open-loop pitch correlation calculated by the analyzer 203 using, for example, methods known to those of ordinary skill in the art of CELP coding, as described in ITU-T Recommendation G.718, section 6.6;
C̄_st is the average of the smoothed open-loop pitch correlation C_st over the last 10 frames; and
σ_c is the deviation of the smoothed open-loop pitch correlation.
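The formulas themselves are not reproduced in this text; the following sketch shows one plausible reading of the definitions above, where the smoothing factor 0.9 and the use of a standard deviation over the 10-frame history are our assumptions rather than the patent's exact formulas:

```python
def smoothed_correlation(c_ol, c_st_prev, alpha=0.9):
    """One smoothing step for the open-loop pitch correlation C_st.
    The smoothing factor alpha = 0.9 is an assumption, not from the patent."""
    return alpha * c_ol + (1.0 - alpha) * c_st_prev

def correlation_deviation(c_st_history):
    """Deviation sigma_c of the smoothed correlation around its average
    Cst_bar over the last frames (10 in the text); here a standard
    deviation, which is one plausible reading of 'deviation'."""
    mean = sum(c_st_history) / len(c_st_history)
    var = sum((c - mean) ** 2 for c in c_st_history) / len(c_st_history)
    return var ** 0.5
```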
When the signal classifier 204 has classified the frame as non-speech during the first-stage analysis, a check is made by the signal classifier 204 in the second-stage analysis, as described below, to determine whether it is really safe to use the mixed time-domain/frequency-domain coding mode. Sometimes it is indeed better to encode the current frame with a time-domain only coding mode, using one of the time-domain approaches estimated by the preprocessing function. In particular, using the memoryless time-domain coding mode at least reduces any possible pre-echo that the mixed time-domain/frequency-domain coding mode might introduce.
As a first verification of whether mixed time-domain/frequency-domain coding should be used, the signal classifier 204 calculates the difference E_diff between the total energy of the current frame and the total energy of the previous frame. When this difference E_diff between the total energy E_tot of the current frame and the total energy of the previous frame is above 6 dB, this corresponds to a so-called "temporal attack" in the input sound signal. In such a situation, the speech/non-speech decision and the selected coding mode are overridden and the memoryless time-domain coding mode is forced. More specifically, the enhanced CELP encoder 100 contains a time-only/time-frequency coding selector 103 (FIG. 1), which itself contains a speech/generic audio selector 205 (FIG. 2), a temporal attack detector 208 (FIG. 2), and a memoryless time-domain coding mode selector 206. In other words, in response to the determination of a non-speech (generic audio) signal by the selector 205 and the detection of a temporal attack in the input sound signal by the detector 208, the selector 206 forces the closed-loop CELP encoder 207 (FIG. 2) to use the memoryless time-domain coding mode. The closed-loop CELP encoder 207 forms part of the time-domain only encoder 104 of FIG. 1.
As a second verification, when the difference Ediff between the total energy Etot of the current frame and the total energy of the previous frame is less than or equal to 6 dB, but:
the smoothed open-loop pitch correlation Cst is above 0.96;
the smoothed open-loop pitch correlation Cst is above 0.85 and the difference Ediff between the total energy Etot of the current frame and the total energy of the previous frame is less than 0.3 dB;
the deviation σc of the smoothed open-loop pitch correlation is less than 0.1 and the difference Ediff between the total energy Etot of the current frame and the total energy of the previous frame is less than 0.6 dB; or
the total energy Etot of the current frame is lower than 20 dB; and
this is at least the second consecutive frame (cnt ≥ 2) in which the decision of the first-stage analysis would be changed, then the speech/generic audio selector 205 determines that the closed-loop CELP encoder 207 (fig. 2) is to be used, i.e., that the current frame is to be encoded in the time-domain-only coding mode.
Otherwise, the time/time-frequency coding selector 103 selects the mixed time/frequency-domain coding mode by the mixed time/frequency-domain coding device disclosed in the present description.
For example, when the non-speech sound signal is music, this can be summarized with the following pseudo code:
if (generic audio)
    if (Ediff > 6dB)
        coding mode = time domain, memoryless
        cnt = 1
    else if (Cst > 0.96 | (Cst > 0.85 & Ediff < 0.3dB) | (σc < 0.1 & Ediff < 0.6dB) | Etot < 20dB)
        cnt++
        if (cnt >= 2)
            coding mode = time domain
    else
        coding mode = mixed time/frequency domain
        cnt = 0
where Etot is the total energy of the current frame, expressed in dB as:
Etot = 10·log10( (1/N)·Σ x²(i) ), the sum being taken over i = 0, …, N−1
(where x(i) represents the samples of the input sound signal in a frame of length N), and Ediff is the difference between the total energy Etot of the current frame and the total energy of the previous frame.
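The override logic of this section can be sketched in Python as follows. This is a non-normative illustration: the function names are arbitrary, the energy formula assumes the 10·log10 mean-square form given above, and the behavior on the first qualifying frame (keeping the mixed mode) is an assumption not spelled out in the text.

```python
import numpy as np

def total_energy_db(x):
    """Total frame energy in dB, assumed as 10*log10 of the mean square."""
    x = np.asarray(x, dtype=float)
    return 10.0 * np.log10(np.mean(x * x) + 1e-12)

def select_coding_mode(e_diff_db, c_st, sigma_c, e_tot_db, cnt):
    """Two-stage override for a frame already classified as generic audio.

    Returns (coding_mode, updated_cnt)."""
    if e_diff_db > 6.0:
        # Temporal attack: force the memoryless time-domain coding mode.
        return "time_domain_memoryless", 1
    if (c_st > 0.96
            or (c_st > 0.85 and e_diff_db < 0.3)
            or (sigma_c < 0.1 and e_diff_db < 0.6)
            or e_tot_db < 20.0):
        cnt += 1
        if cnt >= 2:
            # Second consecutive such frame: time-domain-only coding.
            return "time_domain", cnt
        return "mixed_time_frequency", cnt
    return "mixed_time_frequency", 0
```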
2) Sub-frame length determination
In typical CELP, input sound signal samples are processed in frames of 10-30 ms, and each frame is divided into several subframes for the adaptive codebook and fixed codebook searches. For example, a 20 ms frame (256 samples at an internal sampling frequency of 12.8 kHz) may be used, divided into four 5 ms subframes. A variable subframe length is a feature allowing the time and frequency domains to be fully integrated into one coding mode. The subframe length may vary from the typical quarter of the frame length up to half or the full frame length. Of course, another number of subframes (subframe length) may also be used.
The decision on the subframe length (number of subframes), or time support, is made by the calculator 210 of the number of subframes on the basis of the available bit rate and of the input signal analysis in the pre-processor 102, in particular the high spectral dynamics of the input sound signal 101 from the analyzer 209 and the open-loop pitch analysis, including the smoothed open-loop pitch correlation, from the analyzer 203. The analyzer 209 determines the high spectral dynamics of the input sound signal 101 in response to information from the spectrum analyzer 202. The spectral dynamics are calculated from the features described in section 6.7.2.2 of ITU-T Recommendation G.718, as the input spectrum without its noise floor, giving a representation of the spectral dynamics of the input. When the average spectral dynamics of the input sound signal 101 in the frequency band between 4.4 kHz and 6.4 kHz, as determined by the analyzer 209, is below 9.6 dB and the last frame was considered to have high spectral dynamics, the input sound signal 101 is no longer considered to have high spectral dynamics content at higher frequencies. In that case, more bits can be allocated to frequencies below, for example, 4 kHz by adding more subframes to the time-domain coding mode or by forcing the use of more pulses in the lower-frequency part of the frequency-domain contribution.
On the other hand, if the average spectral dynamics of the higher frequency content of the input sound signal 101 is greater than, e.g., 4.5dB relative to the average spectral dynamics of the last frame not considered to have high spectral dynamics, as determined by the analyzer 209, then the input sound signal 101 is considered to have high spectral dynamics content above, e.g., 4 kHz. In that case, depending on the available bit rate, some additional bits are used for encoding the high frequencies of the input sound signal 101 in order to allow one or more frequency pulse encodings.
The subframe length determined by the calculator 210 (fig. 2) also depends on the available bit budget. At very low bit rates, e.g. bit rates below 9 kbps, only one subframe may be used for the time-domain coding, otherwise the number of available bits is not sufficient for the frequency-domain coding. At medium bit rates, e.g. between 9 kbps and 16 kbps, one subframe is used when the high frequencies contain highly dynamic spectral content, otherwise two subframes are used. At medium to high bit rates, e.g. bit rates of about 16 kbps and higher, the case of four (4) subframes also becomes available if the smoothed open-loop pitch correlation Cst, as defined above in the sound type classification section, is above 0.8.
While the cases of one or two subframes limit the time-domain coding to the adaptive codebook contribution only (i.e., a coded pitch lag and pitch gain; no fixed codebook is used in that case), the case of four (4) subframes allows both the adaptive and fixed codebook contributions if the available bit budget is sufficient. The case of four (4) subframes is allowed starting from approximately 16 kbps upwards. Due to bit budget constraints, the time-domain excitation consists only of the adaptive codebook contribution at lower bit rates. For higher bit rates, for example starting from 24 kbps, a simple fixed codebook contribution can be added. In all cases, the time-domain coding efficiency is evaluated afterwards to decide up to which frequency such time-domain coding is valuable.
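A minimal sketch of this subframe-count decision follows. The thresholds come from the text above; the fallback of two subframes at high bit rates when the correlation condition fails is an assumption, since the text only states when four subframes become available.

```python
def subframe_count(bitrate_kbps, high_dynamics_high_freq, c_st):
    """Number of subframes per frame as a function of the bit budget,
    the high-frequency spectral dynamics flag, and the smoothed
    open-loop pitch correlation c_st."""
    if bitrate_kbps < 9.0:
        # Very low rates: a single subframe, otherwise nothing is left
        # for the frequency-domain coding.
        return 1
    if bitrate_kbps < 16.0:
        # Medium rates: one subframe when the high frequencies carry
        # highly dynamic spectral content, two otherwise.
        return 1 if high_dynamics_high_freq else 2
    # About 16 kbps and above: four subframes when the smoothed
    # open-loop pitch correlation is high enough (assumed fallback: 2).
    return 4 if c_st > 0.8 else 2
```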
3) Closed-loop pitch analysis
When the mixed time-domain/frequency-domain coding mode is used, closed-loop pitch analysis is followed, if necessary, by a fixed algebraic codebook search. To this end, the CELP encoder 100 (fig. 1) contains the calculator 105 (figs. 1 and 2) of the time-domain excitation contribution. This calculator comprises an analyzer 211 (fig. 2) performing the closed-loop pitch analysis in response to the open-loop pitch analysis from the analyzer 203 and to the subframe length (or number of subframes in a frame) from the calculator 210. Closed-loop pitch analysis is well known to those of ordinary skill in the art, and an example of implementation is described, for example, in [ITU-T Recommendation G.718, section 6.8.4.1.4.1], incorporated herein by reference in its entirety. Closed-loop pitch analysis results in pitch parameters, also referred to as adaptive codebook parameters, which mainly consist of a pitch lag (adaptive codebook index T) and a pitch gain (or adaptive codebook gain b). The adaptive codebook contribution is usually the past excitation at delay T, or an interpolated version thereof. The adaptive codebook index T is encoded and sent to the far decoder. The pitch gain b is also quantized and sent to the far decoder.
When the closed-loop pitch analysis is complete, CELP encoder 100 contains a fixed codebook 212 that is searched to find the best fixed codebook parameters, which typically contain fixed codebook indices and fixed codebook gains. The fixed codebook indices and gains form fixed codebook contributions. The fixed codebook indices are encoded and transmitted to the far decoder. The fixed codebook gain is also quantized and sent to the far decoder. Fixed algebraic codebooks and searches thereof are considered well known to those of ordinary skill in the art of CELP coding and therefore will not be described further in this disclosure.
The adaptive codebook indices and gains and the fixed codebook indices and gains form the time-domain CELP excitation contribution.
4) Frequency transform of the signals of interest
During the frequency-domain coding of the mixed time-domain/frequency-domain coding mode, two signals need to be represented in the transform domain, e.g. in the frequency domain. In one embodiment, the time-to-frequency transform may be implemented using a 256-point type II (or type IV) DCT (Discrete Cosine Transform) giving a resolution of 25 Hz at an internal sampling frequency of 12.8 kHz, but any other transform may be used. In case another transform is used, the frequency resolution (defined above), the number of frequency bands and the number of frequency bins per frequency band (further defined below) may need to be modified accordingly. In this regard, the CELP encoder 100 includes a calculator 107 (fig. 1) that calculates a frequency-domain excitation contribution in response to the input LP residual res(n) resulting from the LP analysis of the input sound signal by the analyzer 201. As illustrated in fig. 2, the calculator 107 may calculate a DCT 213, for example a type II DCT, of the input LP residual res(n). The CELP encoder 100 also includes a calculator 106 (fig. 1) that calculates a frequency transform of the time-domain excitation contribution. As illustrated in fig. 2, the calculator 106 may calculate a DCT 214, for example a type II DCT, of the time-domain excitation contribution. The frequency transform fres of the input LP residual and the frequency transform fexc of the time-domain CELP excitation contribution can be calculated using the following expressions:
fres(k) = (1/√N)·Σ res(n), for k = 0, and fres(k) = √(2/N)·Σ res(n)·cos(π(2n+1)k/(2N)), for 1 ≤ k ≤ N−1, the sums being taken over n = 0, …, N−1;
and:
fexc(k) = (1/√N)·Σ etd(n), for k = 0, and fexc(k) = √(2/N)·Σ etd(n)·cos(π(2n+1)k/(2N)), for 1 ≤ k ≤ N−1, the sums being taken over n = 0, …, N−1,
where res(n) is the input LP residual, etd(n) is the time-domain excitation contribution, and N is the frame length. In one possible implementation, the frame length is 256 samples at an internal sampling frequency of 12.8 kHz. The time-domain excitation contribution is given by the following relation:
etd(n) = b·v(n) + g·c(n)
where v (n) is the adaptive codebook contribution, b is the adaptive codebook gain, c (n) is the fixed codebook contribution, and g is the fixed codebook gain. It should be noted that the time-domain excitation contribution, as described above, may consist of only the adaptive codebook contribution.
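The construction of the time-domain excitation and its orthonormal type II DCT can be sketched as follows. This is an illustrative direct O(N²) transform; a real encoder would use a fast algorithm, and the orthonormal scaling is one common convention for the type II DCT described above.

```python
import numpy as np

def time_domain_excitation(b, v, g, c):
    """e_td(n) = b*v(n) + g*c(n): adaptive plus fixed codebook parts."""
    return b * np.asarray(v, dtype=float) + g * np.asarray(c, dtype=float)

def dct_ii(x):
    """Orthonormal type II DCT of a length-N signal (N = 256 in the text)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    f = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) @ x
    f[0] *= np.sqrt(1.0 / n)      # DC term scaling
    f[1:] *= np.sqrt(2.0 / n)     # remaining bins
    return f
```

Because the transform is orthonormal, it preserves energy, which is what makes the per-band energy comparisons of the following sections meaningful.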
5) Cut-off frequency of time-domain contribution
For generic audio samples, the time-domain excitation contribution (combination of the adaptive and/or fixed algebraic codebooks) does not always contribute as much to the coding improvement as frequency-domain coding. Often it does improve the lower part of the spectrum, while the coding improvement in the upper part of the spectrum is small. The CELP encoder 100 contains a finder and filter 108 (fig. 1) of the cut-off frequency, i.e., the frequency above which the time-domain excitation contribution provides little or no value to the coding improvement. The finder and filter 108 comprises the calculator 215 of the cut-off frequency and the filter 216 of fig. 2. First, a computer 303 (figs. 3 and 4) of the normalized cross-correlation, for each frequency band, between the frequency-transformed input LP residual fres from the calculator 107 and the frequency-transformed time-domain excitation contribution fexc from the calculator 106 (both as defined in the preceding section 4) estimates the cut-off frequency of the time-domain excitation contribution. The last frequency Lf included in each of, for example, sixteen (16) frequency bands is defined in Hz as follows:
for this illustrative example, the number of frequency bins B per band for a 20ms frame with a 12.8kHz sampling frequencybCumulative frequency interval C of each frequency bandBbAnd normalized cross-correlation C for each bandC(i) The definition is as follows:
wherein:
and:
wherein B isbIs the number of frequency bins per frequency band, CBbIs the cumulative frequency interval for each frequency band,is the normalized cross-correlation of each frequency band,is the excitation energy of one frequency band and similarly,is the residual energy of each band.
The cut-off frequency calculator 215 includes a smoother 304 (figs. 3 and 4) of the cross-correlation across frequency bands, performing some operations to smooth the cross-correlation vector between the different frequency bands. More specifically, the smoother 304 computes a new, smoothed cross-correlation vector using the following relationship:
where α = 0.95, Nb = 13, and β = (1 − α)/2.
the cut-off frequency calculator 215 further comprises a new cross-correlation vectorPreceding NbFrequency band (N)b= 13 representing 5575 Hz) on the average (fig. 3 and 4).
The cut-off frequency calculator 215 further comprises a cut-off frequency module 306 (fig. 3), the cut-off frequency module 306 comprising a cross-correlation limiter 406 (fig. 4), a cross-correlation normalizer 407 and a finder 408 of the lowest-difference frequency band. More specifically, the limiter 406 limits the average of the cross-correlation vector to a minimum of 0.5 and the normalizer 407 normalizes the limited average of the cross-correlation vector to between 0 and 1. The finder 408 obtains a first estimate of the cut-off frequency by finding the frequency band whose last frequency Lf has the smallest difference from the normalized average of the cross-correlation vector multiplied by the half-width Fs/2 of the spectrum of the input sound signal:
where Fs = 12800 Hz and ftc1 is the first estimate of the cut-off frequency.
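The per-band correlation and the first cut-off estimate can be sketched as follows. The 16 band edges below are hypothetical placeholders, since the exact table is not reproduced in this text; the limit-then-normalize mapping of the average onto [0, 1] follows the description above.

```python
import numpy as np

# Hypothetical 16-band edge table in Hz (NOT the patent's exact values),
# on a 25 Hz grid for a 256-point DCT at 12.8 kHz.
BAND_EDGES_HZ = [175, 375, 775, 1175, 1575, 1975, 2375, 2775,
                 3175, 3575, 3975, 4375, 4775, 5175, 5575, 6375]
FBIN_HZ = 25.0

def band_cross_correlation(f_res, f_exc):
    """Normalized cross-correlation per band between the transformed LP
    residual and the transformed time-domain excitation contribution."""
    cc, start = [], 0
    for edge in BAND_EDGES_HZ:
        stop = int(edge / FBIN_HZ)
        r, e = f_res[start:stop], f_exc[start:stop]
        denom = np.sqrt(np.sum(r * r) * np.sum(e * e)) + 1e-12
        cc.append(float(np.sum(r * e)) / denom)
        start = stop
    return np.array(cc)

def first_cutoff_estimate(cc, nb=13, fs=12800.0):
    """Average over the first nb bands, floor at 0.5, normalize to [0, 1],
    then pick the band edge closest to the normalized value times fs/2."""
    mean = max(float(np.mean(cc[:nb])), 0.5)
    norm = (mean - 0.5) / 0.5
    target = norm * fs / 2.0
    edges = np.asarray(BAND_EDGES_HZ, dtype=float)
    return float(edges[np.argmin(np.abs(edges - target))])
```

When the excitation matches the residual well in every band, the estimate lands on the highest band edge; with no correlation it collapses to the lowest band.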
Since the normalized average is never really high at low bit rates, or in order to artificially increase it so as to give slightly more weight to the time-domain contribution, a fixed scaling factor may be applied to it, e.g. an amplification at bit rates below 8 kbps; in the exemplary implementation it is then always multiplied by 2.
The accuracy of the cut-off frequency estimate can be increased by adding the following component to the computation. To this end, the calculator 215 of the cut-off frequency comprises an extrapolator 410 (fig. 4) of the 8th harmonic, computed from the minimum, or optimal, pitch lag value of the time-domain excitation contribution over all subframes using the following relation:
h8 = 8·Fs / min(T(i)), the minimum being taken over the subframes i = 0, …, Nsub−1,
where Fs = 12800 Hz, Nsub is the number of subframes, and T(i) is the adaptive codebook index, or pitch lag, of subframe i.
The cut-off frequency calculator 215 also contains a finder 409 (fig. 4) of the frequency band in which the 8th harmonic is located. More specifically, for all i < Nb, the finder 409 searches for the highest frequency band whose last frequency Lf(i) does not exceed the extrapolated 8th harmonic. The index of that band is denoted igth; it indicates the frequency band in which the 8th harmonic is likely to be located.
The cut-off frequency calculator 215 finally contains a selector 411 (fig. 4) of the final cut-off frequency ftc. More specifically, the selector 411 retains, using the following relationship, the higher frequency between the first estimate ftc1 of the cut-off frequency from the finder 408 and the last frequency Lf(igth) of the frequency band in which the 8th harmonic is located:
ftc=max(Lf(igth),ftc1)
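The harmonic-based refinement can be sketched as follows. The 8th-harmonic formula 8·Fs/T follows from the pitch lag T being expressed in samples; the band table passed in is a hypothetical placeholder, as before.

```python
def final_cutoff(ftc1, pitch_lags, band_edges, fs=12800.0):
    """Keep the higher of the first cut-off estimate and the last
    frequency of the band containing the extrapolated 8th harmonic."""
    h8 = 8.0 * fs / float(min(pitch_lags))   # 8th harmonic of the pitch
    band_last = band_edges[0]
    for edge in band_edges:
        if edge <= h8:
            band_last = edge                 # last frequency of that band
    return max(band_last, ftc1)
```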
as illustrated in figures 3 and 4 of the drawings,
the calculator of cut-off frequencies 215 further comprises a decider 307 (fig. 3) of the number of frequency intervals to be zeroed, itself comprising an analyzer 415 (fig. 4) of the parameters, and a selector 416 (fig. 4) of the frequency intervals to be zeroed; and
the filter 216 (fig. 2) operating in the frequency domain contains a zeroer 308 (fig. 3) of the frequency intervals determined to be zeroed. The zeroer can zero all frequency intervals (zeroer 417 in fig. 4), or just at the cut-off frequency f, supplemented by a smooth transition zonetcSome higher frequency bins above. The transition region being at the cut-off frequency ftcAbove but below the return-to-zero interval, it makes ftcThe spectral transition between the invariant spectrum below and the zeroing interval at higher frequencies is smoothed.
For the illustrative example, when the cut-off frequency ftc from the selector 411 is below or equal to 755 Hz, the analyzer 415 considers the cost of keeping the time-domain excitation contribution too high for the coding improvement it brings. The selector 416 then selects all the frequency bins of the frequency representation of the time-domain excitation contribution to be zeroed, the zeroer 417 forces all those frequency bins to zero, and the cut-off frequency ftc is also forced to zero. All the bits allocated to the time-domain excitation contribution are then reallocated to the frequency-domain coding mode. Otherwise, the analyzer 415 forces the selector 416 to select the frequency bins above the cut-off frequency ftc, which are zeroed by the zeroer 418.
Finally, the cut-off frequency calculator 215 includes a quantizer 309 that quantizes the cut-off frequency ftc into its quantized form ftcQ. If three (3) bits are associated with the cut-off frequency parameter, the set of possible output values may be defined (in Hz) as:
ftcQ ∈ {0, 1175, 1575, 1975, 2375, 2775, 3175, 3575}
many mechanisms can be used to stabilize the final cut-off frequency ftcTo prevent quantization form ftcQSwitching between 0 and 1175 in the improper signal segment. To achieve this, the analyzer 415 in this exemplary embodiment is responsive to the long-term average pitch gain G from the closed-loop pitch analyzer 211 (FIG. 2)lt412. Open-loop correlation C from open-loop tone analyzer 203ol413 and smooth open-loop correlation Cst. To prevent switching to full frequency coding, the analyzer 415 does not allow frequency-only coding, i.e., cannot convert f to full frequency coding, when the following condition is satisfiedtcQSet to 0:
ftc > 2375 Hz, or
ftc > 1175 Hz, Col > 0.7 and Glt ≥ 0.6, or
ftc ≥ 1175 Hz, Cst > 0.8 and Glt ≥ 0.4, or
ftcQ(t−1) != 0, Col > 0.5, Cst > 0.5 and Glt ≥ 0.6,
where Col is the open-loop pitch correlation 413 and Cst corresponds to a smoothed version 414 of the open-loop pitch correlation, defined as Cst = 0.9·Col + 0.1·Cst. Further, Glt (item 412 in fig. 4) corresponds to the long-term average of the pitch gains obtained by the closed-loop pitch analyzer 211 within the time-domain excitation contribution. The long-term average 412 of the pitch gain is defined as Glt = 0.9·Glt + 0.1·Ḡ, where Ḡ is the average pitch gain over the current frame. To further reduce the frequency of switching between frequency-only coding and mixed time-domain/frequency-domain coding, a release delay may be added.
6) Frequency domain coding
Creating a difference vector
Once the cut-off frequency of the time-domain excitation contribution is defined, the frequency-domain encoding is performed. The CELP encoder 100 includes a subtractor or calculator 109 (figs. 1, 2, 5 and 6) that forms a difference vector fd by subtracting, from zero up to the cut-off frequency of the time-domain excitation contribution, the frequency transform fexc 501 (figs. 5 and 6) (or other frequency representation) of the time-domain excitation contribution from the DCT 214 (fig. 2) from the frequency transform fres 502 (figs. 5 and 6) (or other frequency representation) of the input LP residual from the DCT 213 (fig. 2). Before the subtraction of the respective spectral portions, a reduction factor 603 (fig. 6) is applied to the frequency transform fexc 501 over the next transition region of ftrans = 2 kHz (80 frequency bins in this exemplary implementation) following the cut-off frequency ftc. The result of that subtraction constitutes the second part of the vector fd, representing the frequency range from ftc to ftc + ftrans. The frequency transform fres 502 of the input LP residual is used as the remaining, third part of the vector fd. The reduction applied through the factor 603 can be realized with any type of fade-out function and can be limited to only a few frequency bins; it can even be omitted when the available bit budget is determined to be sufficient to prevent energy-oscillation artifacts when the cut-off frequency ftc changes. For example, for a 256-point DCT at 12.8 kHz, where one frequency bin fbin corresponds to a resolution of 25 Hz, the difference vector can be established as follows:
fd(k) = fres(k) − fexc(k), for 0 ≤ k ≤ ftc/fbin;
fd(k) = fres(k) − fexc(k), with the reduction factor 603 applied to fexc(k), for ftc/fbin < k ≤ (ftc + ftrans)/fbin; and
fd(k) = fres(k), otherwise,
where fres, fexc and ftc are as defined in the preceding sections 4 and 5.
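A sketch of the three-part difference vector follows. The linear fade-out over the transition region is an assumption; the text only requires some fade shape there.

```python
import numpy as np

def difference_vector(f_res, f_exc, ftc_hz, fbin_hz=25.0, ftrans_hz=2000.0):
    """fd = f_res - f_exc up to the cut-off, faded subtraction over the
    transition region, and f_res alone above it."""
    n = len(f_res)
    k_tc = int(ftc_hz / fbin_hz)                     # last fully subtracted bin
    k_tr = min(int((ftc_hz + ftrans_hz) / fbin_hz), n - 1)
    fd = np.array(f_res, dtype=float)
    for k in range(n):
        if k <= k_tc:
            fd[k] -= f_exc[k]
        elif k <= k_tr:
            fade = 1.0 - (k - k_tc) / float(k_tr - k_tc + 1)
            fd[k] -= fade * f_exc[k]                 # assumed linear fade-out
    return fd
```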
Searching for frequency pulses
The CELP encoder 100 includes a frequency quantizer 110 (figs. 1 and 2) of the difference vector fd. Several methods may be used to quantize the difference vector fd. In all cases, frequency pulses have to be searched and quantized. In a possible, simple approach, the frequency-domain coding comprises searching the spectrum of the difference vector fd for the most energetic pulses. The search method can be as simple as splitting the spectrum into frequency bands and allocating a certain number of pulses per band. The number of pulses per band depends on the available bit budget and on the position of the band within the spectrum. Typically, more pulses are allocated to the low frequencies.
Quantized difference vector
Depending on the available bit rate, the quantization of the frequency pulses may be performed using different techniques. In one embodiment, a simple search and quantization scheme may be used to encode the position and sign of the pulse at bit rates below 12 kbps. This scheme is described below.
For example, for frequencies below 3175 Hz, this simple search and quantization scheme uses a means based on Factorial Pulse Coding (FPC) described, for example, in [Mittal, U., Ashley, J.P., and Cruz-Zeno, E.M. (2007), "Low Complexity Factorial Pulse Coding of MDCT Coefficients using Approximation of Combinatorial Functions", IEEE Proceedings on Acoustics, Speech and Signal Processing, vol. 1, April, pp. 289-292], incorporated herein by reference in its entirety.
More specifically, the selector 504 (figs. 5 and 6) determines that not all of the spectrum is quantized using FPC. As illustrated in fig. 5, FPC encoding and pulse position and sign encoding are performed in the encoder 506. As illustrated in fig. 6, the encoder 506 includes a searcher 609 of frequency pulses. The search is performed across all frequency bands below 3175 Hz. The FPC encoder 610 then processes the found frequency pulses. The encoder 506 also comprises a finder 611 of the most energetic pulses for frequencies equal to or above 3175 Hz, and a quantizer 612 of the position and sign of the found most energetic pulses. If more than one (1) pulse is allowed in a band, the amplitude of the previously found pulse is divided by 2 and the search is performed again over the entire band. Whenever a pulse is found, its position and sign are stored for the quantization and bit-stuffing stage. The following pseudo code illustrates this simple search and quantization scheme:
for k=0:NBD
    for i=0:NP
        Pmax=0
        for j=CBb(k):CBb(k)+Bb(k)
            if fd(j)^2 > Pmax
                Pmax = fd(j)^2
                pp(i) = j
                ps(i) = sign(fd(j))
            end
        end
    end
end
where NBD is the number of frequency bands (in the illustrative example, NBD = 16), NP is the number of pulses to be coded in frequency band k, Bb is the number of frequency bins per frequency band, CBb is the cumulative frequency bin count of each band as previously defined in section 5, pp is a vector containing the positions of the found pulses, ps is a vector containing the signs of the found pulses, and Pmax is the energy of the found pulse.
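The same search can be written in Python (an illustrative transcription of the pseudo code above, with the amplitude-halving of already-found pulses, described in the surrounding text, made explicit):

```python
import numpy as np

def search_band_pulses(fd, start, stop, n_pulses):
    """Find the n_pulses most energetic bins of fd[start:stop].

    Each found pulse has its amplitude halved in a working copy so the
    same bin is selected again only if it remains dominant."""
    work = np.array(fd[start:stop], dtype=float)
    positions, signs = [], []
    for _ in range(n_pulses):
        j = int(np.argmax(work * work))       # most energetic bin
        positions.append(start + j)
        signs.append(1 if work[j] >= 0.0 else -1)
        work[j] /= 2.0                        # halve before the next pass
    return positions, signs
```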
At bit rates above 12 kbps, the selector 504 determines that all of the spectrum is to be quantized using FPC. As illustrated in fig. 5, FPC encoding is performed in the encoder 505. As illustrated in fig. 6, the encoder 505 includes a searcher 607 of frequency pulses. The search is performed over the entire frequency band. The FPC processor 610 then FPC-encodes the found frequency pulses.
The quantized difference vector fdQ is then obtained by adding, at each found position pp, the nb_pulses pulses with their signs ps. For each frequency band, this can be written in pseudo code as follows:
for j=0,…,j<nb_pulses
    fdQ(pp(j)) += ps(j)
Noise filling
All frequency bands are quantized with more or less precision; the quantization method described in the previous section does not guarantee that all frequency bins within a band are quantized. This is particularly the case at low bit rates, where the number of pulses quantized per band is correspondingly small. To prevent the accidental occurrence of audible artifacts caused by these unquantized bins, the noise filler 507 (fig. 5) adds some noise to fill these gaps. This noise is added over the entire spectrum at bit rates below, for example, 12 kbps, but for higher bit rates it may be applied only above the cut-off frequency ftc of the time-domain excitation contribution. For simplicity, the noise intensity is a function of the available bit rate only: at high bit rates the noise level is low, while at low bit rates it is higher.
The noise filler 507 comprises an adder 613 (fig. 6) that adds the noise to the quantized difference vector fdQ after the intensity, or energy level, of this added noise has been determined in the estimator 6 and before the per-band gains are determined in the calculator 615. In an exemplary embodiment, the noise level is directly related to the coding bit rate. For example, at 6.60 kbps the noise level N'L is 0.4 times the amplitude of the spectral pulses encoded in the particular frequency band, and it is progressively stepped down to 0.2 times the amplitude of the spectral pulses encoded in the band at 24 kbps. The noise is added only over a certain number of consecutive frequency bins having very low energy, e.g., when the number Nz of consecutive very-low-energy bins reaches half the number of bins in the frequency band. For a specific frequency band i, the noise is injected as follows:
for j = CBb(i), …, j < CBb(i)+Bb(i)
    for k = j, …, k < j+Nz
        fdQ(k) = N'L · rand
    j += Nz
where, for frequency band i, CBb is the cumulative number of bins per frequency band, Bb is the number of bins in the particular frequency band i, N'L is the noise level, and rand is a random number generator limited between −1 and 1.
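The noise filling can be sketched as follows. The run-detection condition is simplified relative to the text (a fixed near-zero threshold instead of an energy criterion), and the uniform noise shape is an assumption.

```python
import numpy as np

def fill_noise(fdq, start, stop, noise_level, nz, rng):
    """Inject noise at up to +/- noise_level into runs of nz consecutive
    (near-)zero bins of fdq[start:stop]."""
    out = np.array(fdq, dtype=float)
    j = start
    while j + nz <= stop:
        if np.all(np.abs(out[j:j + nz]) < 1e-9):
            out[j:j + nz] = noise_level * rng.uniform(-1.0, 1.0, nz)
            j += nz                            # skip past the filled run
        else:
            j += 1
    return out
```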
7) Per band gain quantization
The frequency quantizer 110 comprises a per-band gain calculator/quantizer 508 (fig. 5), comprising a calculator 615 (fig. 6) and a quantizer 616 (fig. 6) of the per-band gains. Once the quantized difference vector fdQ is found, including the noise filling if needed, the per-band gain is computed for each band by the calculator 615. The gain Gb(i) of a specific band i is defined, in the logarithmic domain, as the ratio between the energy of the unquantized difference vector fd and the energy of the quantized difference vector fdQ:
Gb(i) = log10( Sd(i) / SdQ(i) )
where Sd(i) = Σ fd(k)² and SdQ(i) = Σ fdQ(k)², the sums being taken over the bins k = CBb(i), …, CBb(i)+Bb(i)−1 of band i, with CBb and Bb as defined in section 5 above.
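A sketch of the per-band gain computation follows. The log10 energy-ratio form matches the definition above; the small epsilon guarding against empty or all-zero bands is an implementation assumption.

```python
import numpy as np

def band_gain(fd, fdq, start, stop):
    """Gb(i) = log10(Sd(i) / SdQ(i)) for the band spanning bins
    [start, stop) of the unquantized (fd) and quantized (fdq) vectors."""
    sd = float(np.sum(np.square(fd[start:stop]))) + 1e-12
    sdq = float(np.sum(np.square(fdq[start:stop]))) + 1e-12
    return float(np.log10(sd / sdq))
```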
In the embodiment of figs. 5 and 6, the per-band gain quantizer 616 vector-quantizes the per-band gains. Prior to the vector quantization, at low bit rates, the last gain (corresponding to the last band) is quantized separately and all the remaining fifteen (15) gains are divided by the quantized last gain. The normalized fifteen (15) remaining gains are then vector-quantized. At higher bit rates, the average of the per-band gains is first quantized and then removed from all the per-band gains, e.g., of the sixteen (16) bands, before vector quantization. The vector quantization used may be a standard minimization, in the logarithmic domain, of the distance between the vector containing the per-band gains and the entries of a specific codebook.
In the frequency-domain coding mode, a gain is computed in the calculator 615 for each frequency band so as to match the energy of the unquantized vector fd to the energy of the quantized vector fdQ. The gains are vector-quantized in the quantizer 616 and applied per frequency band to the quantized vector fdQ by the multiplier 509 (figs. 5 and 6).
Alternatively, the FPC coding scheme used at rates below 12 kbps can also be applied to the entire spectrum by selecting only some of the frequency bands to be quantized. Before performing the selection of the frequency bands, the per-band energy Ed of the unquantized vector fd is quantized. The energy is calculated as follows:
Ed(i) = log10(Sd(i)), with Sd(i) = Σ fd(k)², the sum being taken over the bins k = CBb(i), …, CBb(i)+Bb(i)−1 of band i,
where CBb and Bb are as defined in section 5 above.
To quantize the band energies into Ed', the average energy over the first 12 of the sixteen used bands is first quantized and subtracted from all sixteen (16) band energies. All bands are then vector-quantized in groups of 3 or 4 bands. The vector quantization used may be a standard minimization, in the logarithmic domain, of the distance between the vector containing the per-band gains and the entries of a specific codebook. If not enough bits are available, only the first 12 bands may be quantized and the last 4 bands extrapolated using the average of the first 3 bands, or by any other method.
Once the band energies of the unquantized difference vector are quantized, they may be sorted in descending order, in a manner that can be repeated at the decoder side. During the sorting, all energy bands below 2 kHz are always retained, and then only the most energetic bands are passed to the FPC for encoding the pulse amplitudes and signs. With this approach, the FPC scheme encodes smaller vectors but covers a wider frequency range. In other words, fewer bits are needed to cover the important energy events across the entire spectrum.
After the pulse quantization process, a noise filling similar to that described above is needed. Then, a gain adjustment factor Ga is calculated for each frequency band to match the energy EdQ of the quantized difference vector fdQ to the quantized energy Ed' of the unquantized difference vector fd. This per-band gain adjustment factor is then applied to the quantized difference vector fdQ as follows:
fdQ(k) = Ga(i)·fdQ(k), for the bins k of band i, with Ga(i) = 10^((Ed'(i) − EdQ(i))/2),
where EdQ(i) is the per-band energy of the quantized difference vector fdQ, computed as Ed(i) above, and Ed' is the quantized per-band energy of the unquantized difference vector fd.
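The gain adjustment can be sketched as follows. The amplitude factor 10^((E'd − EdQ)/2) is an assumed form, chosen so that the adjusted log10 band energy equals the target E'd.

```python
import numpy as np

def apply_gain_adjustment(fdq, e_target_log10, start, stop):
    """Scale the bins [start, stop) of fdq so the band's log10 energy
    matches the quantized target energy e_target_log10."""
    sdq = float(np.sum(np.square(fdq[start:stop]))) + 1e-12
    edq = np.log10(sdq)
    ga = 10.0 ** ((e_target_log10 - edq) / 2.0)   # amplitude-domain factor
    out = np.array(fdq, dtype=float)
    out[start:stop] *= ga
    return out
```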
After the frequency-domain encoding stage is completed, the frequency-quantized difference vector fdQ is summed, by the adder 111 (figs. 1, 2, 5 and 6), with the filtered frequency-transformed time-domain excitation contribution fexcF to form the total mixed time-domain/frequency-domain excitation. When the enhanced CELP encoder 100 changes its bit allocation from the time-domain-only coding mode to the mixed time-domain/frequency-domain coding mode, the per-band excitation spectral energy of the time-domain-only coding mode does not match the per-band excitation spectral energy of the mixed time-domain/frequency-domain coding mode. This energy mismatch can create switching artifacts that are more audible at low bit rates. To reduce any audible degradation caused by such bit reallocation, a long-term gain can be computed for each band and applied to the summed excitation, after the reallocation, to correct the energy of each band over several frames. The sum of the frequency-quantized difference vector fdQ and the frequency-transformed and filtered time-domain excitation contribution fexcF is then transformed back to the time domain in a converter 112 (figs. 1, 5 and 6) containing, for example, an IDCT (inverse DCT) 220.
Finally, the total excitation signal from the IDCT220 is filtered by the LP synthesis filter 113 (fig. 1 and 2) to calculate a synthesized signal.
The sum of the frequency-quantized difference vector fdQ and the frequency-transformed and filtered time-domain excitation contribution fexcF forms the mixed time-domain/frequency-domain excitation that is sent to the far decoder (not shown). The far decoder also includes a converter 112 that transforms the mixed time-domain/frequency-domain excitation back to the time domain using, for example, an IDCT (inverse DCT) 220. Finally, the total excitation signal from the IDCT 220 is filtered through the LP synthesis filter 113 (figs. 1 and 2) to compute the synthesis signal from the mixed time-domain/frequency-domain excitation.
In one embodiment, while the CELP coding memories are updated on a subframe basis using only the time-domain excitation contribution, the total excitation is used to update those memories at the frame boundaries. In another possible implementation, the CELP coding memories are updated on a subframe basis and at the frame boundaries using only the time-domain excitation contribution. This results in the frequency-domain quantized signal constituting a quantization upper layer, embedded on top of and independent of the core CELP layer. This may be advantageous in certain applications. In this particular case, a fixed codebook is always used to maintain good perceptual quality, and the number of subframes is always four (4) for the same reason. However, the frequency-domain analysis may be applied over the entire frame. This embedded approach is suitable for bit rates around 12 kbps and higher.
The foregoing disclosure relates to non-limiting, illustrative embodiments, which may be modified at will within the scope of the appended claims.

Claims (60)

1. A mixed time-domain/frequency-domain coding device that codes an input sound signal, comprising:
a calculator for calculating a time-domain excitation contribution in response to the input sound signal;
a calculator of a cut-off frequency of the time-domain excitation contribution in response to the input sound signal;
a filter responsive to the cut-off frequency to adjust a frequency range of the time-domain excitation contribution;
a calculator for calculating a frequency-domain excitation contribution in response to the input sound signal; and
an adder for adding the filtered time-domain excitation contribution and the frequency-domain excitation contribution to form a mixed time-domain/frequency-domain excitation constituting a coded version of the input sound signal.
2. A mixed time-domain/frequency-domain coding device according to claim 1, wherein the time-domain excitation contribution comprises: only the adaptive codebook contribution, or the adaptive codebook contribution and the fixed codebook contribution.
3. A mixed time-domain/frequency-domain coding device according to claim 1 or 2, wherein the calculator of time-domain excitation contribution uses code-excited linear prediction coding of the input sound signal.
4. A mixed time-domain/frequency-domain coding device according to any one of claims 1 to 3, comprising a calculator of a number of subframes to be used in a current frame, wherein the calculator of time-domain excitation contribution uses, in the current frame, the number of subframes determined by the subframe number calculator for said current frame.
5. A mixed time-domain/frequency-domain coding device according to claim 4, wherein the calculator of the number of sub-frames in the current frame is responsive to at least one of a high spectral dynamic and an available bit budget of the input sound signal.
6. A mixed time-domain/frequency-domain coding device according to any one of claims 1 to 5, comprising a calculator of a frequency transform of the time-domain excitation contribution.
7. A mixed time-domain/frequency-domain coding device according to any one of claims 1 to 6, wherein the calculator of frequency-domain excitation contribution performs a frequency transform on an LP residual obtained from an LP analysis of the input sound signal to generate a frequency representation of the LP residual.
8. A mixed time-domain/frequency-domain coding device according to claim 7, wherein the calculator of the cut-off frequency comprises a calculator that calculates, for each of a plurality of frequency bands, a cross-correlation between the frequency representation of the LP residual and a frequency representation of the time-domain excitation contribution, and the coding device comprises a finder of an estimate of the cut-off frequency in response to the cross-correlation.
9. A mixed time-domain/frequency-domain coding device according to claim 8, further comprising a smoother for smoothing the cross-correlation across the frequency bands to generate a cross-correlation vector, a calculator for calculating an average of the cross-correlation vector over the frequency bands, and a normalizer for normalizing the average of the cross-correlation vector, wherein the finder of the estimate of the cut-off frequency determines a first estimate of the cut-off frequency by finding a last frequency of one of the frequency bands which minimizes a difference between said last frequency and the normalized average of the cross-correlation vector multiplied by a spectral width value.
10. A mixed time-domain/frequency-domain coding device according to claim 9, wherein the calculator of cut-off frequency comprises a finder of one of the frequency bands in which the harmonic calculated from the time-domain excitation contribution is located, and a selector of the cut-off frequency as the higher frequency of the first estimate of the cut-off frequency and the last frequency of the frequency band in which the harmonic is located.
11. A mixed time-domain/frequency-domain coding device according to any one of claims 1 to 10, wherein the filter comprises a zeroer of frequency bins which forces the frequency bins of the plurality of frequency bands above the cutoff frequency to zero.
12. A mixed time-domain/frequency-domain coding device according to any one of claims 1 to 11, wherein the filter comprises a zeroer of frequency bins which forces all frequency bins of the plurality of frequency bands to zero when the cut-off frequency is below a given value.
13. A mixed time-domain/frequency-domain coding device according to any one of claims 1 to 12, wherein the calculator of frequency-domain excitation contribution comprises a calculator of a difference between a frequency representation of an LP residual of the input sound signal and a filtered frequency representation of the time-domain excitation contribution.
14. A mixed time-domain/frequency-domain coding device according to claim 7, wherein the calculator of the frequency-domain excitation contribution comprises a calculator of a difference between the frequency representation of the LP residual and a frequency representation of the time-domain excitation contribution up to the cut-off frequency, to form a first portion of a difference vector.
15. A mixed time-domain/frequency-domain coding device according to claim 14, comprising a downscaling factor applied to the frequency representation of the time-domain excitation contribution in a determined frequency range after the cut-off frequency, to form a second portion of the difference vector.
16. A mixed time-domain/frequency-domain coding device according to claim 15, wherein the difference vector is formed by a frequency representation of the LP residual for a third remaining portion over the determined frequency range.
17. A mixed time-domain/frequency-domain coding device according to any one of claims 14 to 16, comprising a quantizer of the difference vector.
18. A mixed time-domain/frequency-domain coding device according to claim 17, wherein the adder adds the quantized difference vector and a frequency-transformed version of the filtered time-domain excitation contribution in the frequency domain to form the mixed time-domain/frequency-domain excitation.
19. A mixed time-domain/frequency-domain coding device according to any one of claims 1 to 18, wherein the adder adds the time-domain excitation contribution and the frequency-domain excitation contribution in the frequency domain.
20. A mixed time-domain/frequency-domain coding device according to any one of claims 1 to 19, comprising means for dynamically allocating a bit budget between the time-domain excitation contribution and the frequency-domain excitation contribution.
21. An encoder using time-domain and frequency-domain models, comprising:
a classifier for classifying an input sound signal into speech or non-speech;
a time-domain only encoder;
a mixed time-domain/frequency-domain coding device according to any one of claims 1 to 20; and
a selector for selecting, for encoding the input sound signal, only one of the time-domain encoder and the mixed time-domain/frequency-domain encoding device, depending on the classification of the input sound signal.
22. An encoder as defined in claim 21, wherein the time-domain only encoder is a code-excited linear prediction encoder.
23. Encoder as claimed in claim 21 or 22, comprising a selector of a memoryless time-domain coding mode, which forces the memoryless time-domain coding mode to be used for encoding the input sound signal in the time-domain only encoder when the classifier classifies the input sound signal as non-speech and a temporal attack is detected in the input sound signal.
24. An encoder according to any one of claims 21 to 23, wherein the mixed time-domain/frequency-domain coding device uses variable-length subframes in the calculation of the time-domain contribution.
25. A mixed time-domain/frequency-domain coding device that codes an input sound signal, comprising:
a calculator of a time-domain excitation contribution in response to an input sound signal, wherein the calculator of the time-domain excitation contribution processes the input sound signal in successive frames of said input sound signal, and comprises a calculator of a number of sub-frames to be used in a current frame of the input sound signal, wherein the calculator of the time-domain excitation contribution uses in the current frame the number of sub-frames determined for said current frame by the sub-frame number calculator;
a calculator for calculating a frequency-domain excitation contribution in response to the input sound signal; and
an adder for adding the time-domain excitation contribution and the frequency-domain excitation contribution to form a mixed time-domain/frequency-domain excitation constituting a coded version of the input sound signal.
26. A mixed time-domain/frequency-domain coding device according to claim 25, wherein the calculator of the number of sub-frames in the current frame is responsive to at least one of a high spectral dynamic and an available bit budget of the input sound signal.
27. A decoder for decoding a sound signal encoded using the mixed time-domain/frequency-domain coding device of any one of claims 1 to 20, comprising:
a converter for converting the mixed time-domain/frequency-domain excitation into the time domain; and
a synthesis filter for synthesizing a sound signal in response to the mixed time-domain/frequency-domain excitation converted into the time domain.
28. A decoder according to claim 27, wherein the converter uses an inverse discrete cosine transform.
29. A decoder as claimed in claim 27 or 28, wherein the synthesis filter is an LP synthesis filter.
30. A decoder for decoding a sound signal encoded using the mixed time-domain/frequency-domain coding device of claim 25 or 26, comprising:
a converter for converting the mixed time-domain/frequency-domain excitation into the time domain; and
a synthesis filter for synthesizing a sound signal in response to the mixed time-domain/frequency-domain excitation converted into the time domain.
31. A mixed time-domain/frequency-domain coding method of encoding an input sound signal, comprising:
calculating a time-domain excitation contribution in response to the input sound signal;
calculating a cut-off frequency of the time-domain excitation contribution in response to the input sound signal;
adjusting a frequency range of the time-domain excitation contribution in response to the cut-off frequency;
calculating a frequency-domain excitation contribution in response to the input sound signal; and
the adjusted time-domain excitation contribution and the frequency-domain excitation contribution are added to form a mixed time-domain/frequency-domain excitation constituting a coded version of the input sound signal.
32. A mixed time-domain/frequency-domain coding method according to claim 31, wherein the time-domain excitation contribution comprises: only the adaptive codebook contribution, or the adaptive codebook contribution and the fixed codebook contribution.
33. A mixed time-domain/frequency-domain coding method according to claim 31 or 32, wherein calculating the time-domain excitation contribution comprises using code-excited linear prediction coding of the input sound signal.
34. A mixed time-domain/frequency-domain coding method according to any one of claims 31 to 33, comprising calculating a number of subframes to be used in a current frame, wherein calculating the time-domain excitation contribution comprises using the number of subframes determined for said current frame in the current frame.
35. A mixed time-domain/frequency-domain coding method according to claim 34, wherein calculating the number of sub-frames in the current frame is responsive to at least one of an available bit budget and a high spectral dynamics of the input sound signal.
36. A mixed time-domain/frequency-domain coding method according to any one of claims 31 to 35, comprising calculating a frequency transform of the time-domain excitation contribution.
37. A mixed time-domain/frequency-domain coding method according to any one of claims 31 to 36, wherein calculating the frequency-domain excitation contribution comprises frequency-transforming an LP residual obtained from an LP analysis of the input sound signal to generate a frequency representation of the LP residual.
38. A mixed time-domain/frequency-domain coding method according to claim 37, wherein calculating the cut-off frequency comprises calculating a cross-correlation between a frequency representation of the LP residual and a frequency representation of the time-domain excitation contribution for each of a plurality of frequency bands, and the coding method comprises finding an estimate of the cut-off frequency in response to the cross-correlation.
39. A mixed time-domain/frequency-domain coding method according to claim 38, comprising smoothing the cross-correlation across the frequency bands to generate a cross-correlation vector, calculating an average of the cross-correlation vector over the frequency bands, and normalizing the average of the cross-correlation vector, wherein finding the estimate of the cutoff frequency comprises determining a first estimate of the cutoff frequency by finding a last frequency of one of the frequency bands that minimizes a difference between the last frequency and the normalized average of the cross-correlation vector multiplied by a spectral width value.
40. A mixed time-domain/frequency-domain coding method according to claim 39, wherein calculating the cut-off frequency comprises finding one of the frequency bands in which a harmonic calculated from the time-domain excitation contribution is located, and selecting the cut-off frequency as a higher frequency of the first estimate of the cut-off frequency and a last frequency of the frequency band in which the harmonic is located.
41. A mixed time-domain/frequency-domain coding method according to any one of claims 31 to 40, wherein adjusting the frequency range of the time-domain excitation contribution comprises zeroing frequency bins to force frequency bins of the plurality of frequency bands above the cutoff frequency to zero.
42. A mixed time-domain/frequency-domain coding method according to any one of claims 31 to 41, wherein adjusting the frequency range of the time-domain excitation contribution comprises zeroing frequency bins to force all frequency bins of the plurality of frequency bands to zero when the cut-off frequency is below a given value.
43. A mixed time-domain/frequency-domain coding method according to any one of claims 31 to 42, wherein calculating the frequency-domain excitation contribution comprises calculating a difference between a frequency representation of an LP residual of the input sound signal and a filtered frequency representation of the time-domain excitation contribution.
44. A mixed time-domain/frequency-domain coding method according to claim 37, wherein calculating the frequency-domain excitation contribution comprises calculating a difference between the frequency representation of the LP residual and a frequency representation of the time-domain excitation contribution up to the cut-off frequency, to form a first portion of a difference vector.
45. A mixed time-domain/frequency-domain coding method according to claim 44, comprising applying a downscaling factor to the frequency representation of the time-domain excitation contribution in a determined frequency range after the cut-off frequency, to form a second portion of the difference vector.
46. A mixed time-domain/frequency-domain coding method according to claim 45, comprising forming the difference vector using the frequency representation of the LP residual for a third remaining portion over the determined frequency range.
47. A mixed time-domain/frequency-domain coding method according to any one of claims 44 to 46, comprising quantizing the difference vector.
48. A mixed time-domain/frequency-domain coding method according to claim 47, wherein adding the adjusted time-domain excitation contribution and the frequency-domain excitation contribution to form the mixed time-domain/frequency-domain excitation comprises: the quantized difference vector and the frequency transformed version of the adjusted time-domain excitation contribution are added in the frequency domain.
49. A mixed time-domain/frequency-domain coding method according to any one of claims 31 to 48, wherein adding the adjusted time-domain excitation contribution and the frequency-domain excitation contribution to form the mixed time-domain/frequency-domain excitation comprises adding the time-domain excitation contribution and the frequency-domain excitation contribution in the frequency domain.
50. A mixed time-domain/frequency-domain coding method according to any one of claims 31 to 49, comprising dynamically allocating a bit budget between the time-domain excitation contribution and the frequency-domain excitation contribution.
51. A method of encoding using time-domain and frequency-domain models, comprising:
classifying an input sound signal into speech or non-speech;
providing a time-domain only coding method;
providing a mixed time-domain/frequency-domain coding method according to any one of claims 31 to 50; and
depending on the classification of the input sound signal, only one of the time-domain coding method and the mixed time-domain/frequency-domain coding method is selected for coding the input sound signal.
52. A coding method according to claim 51, wherein the time-domain only coding method is a code excited linear prediction coding method.
53. A method of encoding as defined in claim 51 or 52, comprising selecting a memoryless time-domain coding mode, which forces the memoryless time-domain coding mode to be used for encoding the input sound signal using the time-domain only coding method when the input sound signal is classified as non-speech and a temporal attack is detected in the input sound signal.
54. A method of encoding as defined in any one of claims 51 to 53, wherein the mixed time-domain/frequency-domain coding method comprises using variable-length sub-frames in the calculation of the time-domain contribution.
55. A mixed time-domain/frequency-domain coding method of encoding an input sound signal, comprising:
calculating a time-domain excitation contribution in response to an input sound signal, wherein calculating the time-domain excitation contribution comprises processing the input sound signal in successive frames of the input sound signal and calculating a number of sub-frames to be used in a current frame of the input sound signal, wherein calculating the time-domain excitation contribution further comprises using the number of sub-frames calculated for the current frame in the current frame;
calculating a frequency-domain excitation contribution in response to the input sound signal; and
the time-domain excitation contribution and the frequency-domain excitation contribution are added to form a mixed time-domain/frequency-domain excitation constituting a coded version of the input sound signal.
56. A mixed time-domain/frequency-domain coding method according to claim 55, wherein calculating the number of sub-frames in the current frame is responsive to at least one of an available bit budget and a high spectral dynamics of the input sound signal.
57. A method of decoding a sound signal encoded using the mixed time-domain/frequency-domain coding method of any one of claims 31 to 50, comprising:
converting the mixed time-domain/frequency-domain excitation into the time domain; and
synthesizing the sound signal with a synthesis filter in response to the mixed time-domain/frequency-domain excitation converted into the time domain.
58. The method of decoding according to claim 57, wherein converting the mixed time-domain/frequency-domain excitation into the time domain comprises using an inverse discrete cosine transform.
59. A method of decoding according to claim 57 or 58, wherein the synthesis filter is an LP synthesis filter.
60. A method of decoding a sound signal encoded using the mixed time-domain/frequency-domain coding method of claim 55 or 56, comprising:
converting the mixed time-domain/frequency-domain excitation into the time domain; and
synthesizing the sound signal with a synthesis filter in response to the mixed time-domain/frequency-domain excitation converted into the time domain.
HK13112954.4A 2010-10-25 2011-10-24 Coding generic audio signals at low bitrates and low delay HK1185709B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US40637910P 2010-10-25 2010-10-25
US61/406,379 2010-10-25
PCT/CA2011/001182 WO2012055016A1 (en) 2010-10-25 2011-10-24 Coding generic audio signals at low bitrates and low delay

Publications (2)

Publication Number Publication Date
HK1185709A1 HK1185709A1 (en) 2014-02-21
HK1185709B true HK1185709B (en) 2015-12-24

Similar Documents

Publication Publication Date Title
US9015038B2 (en) Coding generic audio signals at low bitrates and low delay
KR101078625B1 (en) Systems, methods, and apparatus for gain factor limiting
AU2012246799B2 (en) Method of quantizing linear predictive coding coefficients, sound encoding method, method of de-quantizing linear predictive coding coefficients, sound decoding method, and recording medium
EP1338003B1 (en) Gains quantization for a celp speech coder
EP1869670B1 (en) Method and apparatus for vector quantizing of a spectral envelope representation
US6574593B1 (en) Codebook tables for encoding and decoding
EP2313887B1 (en) Variable bit rate lpc filter quantizing and inverse quantizing device and method
US10706865B2 (en) Apparatus and method for selecting one of a first encoding algorithm and a second encoding algorithm using harmonics reduction
US20120065965A1 (en) Apparatus and method for encoding and decoding signal for high frequency bandwidth extension
CN103620675A (en) Device for quantizing linear predictive coding coefficients, audio coding device, device for dequantizing linear predictive coding coefficients, audio decoding device and electronic device thereof
CN105247614A (en) Audio Encoders and Decoders
JP2017528751A (en) Signal encoding method and apparatus, and signal decoding method and apparatus
EP4275204B1 (en) Method and device for unified time-domain / frequency domain coding of a sound signal
HK40107881A (en) Coding generic audio signals at low bitrates and low delay
HK1185709B (en) Coding generic audio signals at low bitrates and low delay
HK40103944A (en) Method and device for unified time-domain / frequency domain coding of a sound signal