
HK1220541B - Adaptive bandwidth extension and apparatus for the same - Google Patents


Info

Publication number
HK1220541B
HK1220541B
Authority
HK
Hong Kong
Prior art keywords
band
low
audio
signal
spectral envelope
Prior art date
Application number
HK16108371.4A
Other languages
Chinese (zh)
Other versions
HK1220541A1 (en)
Inventor
高扬
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Priority claimed from US14/478,839 external-priority patent/US9666202B2/en
Application filed by 华为技术有限公司
Publication of HK1220541A1 publication Critical patent/HK1220541A1/en
Publication of HK1220541B publication Critical patent/HK1220541B/en


Description

Adaptive bandwidth extension method and device
This application claims priority to U.S. Patent Application No. 14/478,839, entitled "Adaptive Bandwidth Extension and Apparatus for the Same," filed September 5, 2014, which claims the benefit of U.S. Provisional Patent Application No. 61/875,690, entitled "Adaptive Selection of Shift Band for Bandwidth Extension Based on Spectral Energy Level," filed September 10, 2013, both of which are incorporated herein by reference in their entirety.
Technical Field
The present invention relates generally to the field of speech processing, and more particularly to adaptive bandwidth extension methods and apparatus.
Background
In modern audio/speech digital signal communication systems, a digital signal is compressed at an encoder, and the compressed information (bitstream) can be packetized and transmitted frame by frame over a communication channel to a decoder. The encoder and decoder together are called a codec. Speech/audio compression may be used to reduce the number of bits that represent the speech/audio signal, thereby reducing the bit rate required for transmission. Speech/audio compression techniques can be broadly classified into time-domain coding and frequency-domain coding. Time-domain coding is typically used to encode speech or audio signals at low bit rates. Frequency-domain coding is commonly used to encode audio or speech signals at high bit rates. Bandwidth extension (BWE) may be part of either time-domain coding or frequency-domain coding, and is used to generate the high-band signal at a very low bit rate or even a zero bit rate.
However, speech coders are lossy coders, i.e., the decoded signal is different from the original signal. Thus, one of the goals of speech coding is to minimize distortion (or perceptual loss) at a given bit rate, or to minimize the bit rate to achieve a given distortion.
Speech coding differs from other forms of audio coding in that speech is a much simpler signal than most other audio signals, and much more statistical information is available about its characteristics. Thus, some auditory information that is relevant in audio coding may be unnecessary in the speech coding context. In speech coding, the most important criterion is preservation of the intelligibility and "pleasantness" of speech with a limited amount of transmitted data.
The intelligibility of speech includes, besides the actual literal content, speaker identity, emotion, intonation, and timbre, all of which are important for perfect intelligibility. The pleasantness of degraded speech is a more abstract concept: it is a different property from intelligibility, since degraded speech may be completely intelligible yet subjectively annoying to the listener.
The redundancy of speech waveforms relates to different types of speech signal, such as voiced and unvoiced speech signals. Voiced sounds, such as 'a' and 'b', are essentially produced by vibration of the vocal cords and are oscillatory. Thus, over short periods they are well modeled by sums of periodic signals such as sinusoids. In other words, a voiced speech signal is essentially periodic. However, this periodicity may vary over the duration of a speech segment, and the shape of the periodic wave usually changes gradually from segment to segment. Low bit rate speech coding can benefit greatly from exploiting this periodicity. The period of voiced speech is also called the pitch, and pitch prediction is often referred to as long-term prediction (LTP). In contrast, unvoiced sounds, such as 's' and 'sh', are more noise-like: an unvoiced speech signal resembles random noise and has less predictability.
Traditionally, all parametric speech coding methods exploit the redundancy inherent in speech signals to reduce the amount of information transmitted and to estimate the parameters of speech samples of a signal within short intervals. This redundancy is mainly due to the fact that the speech waveform repeats at a quasi-periodic rate and the spectral envelope of the speech signal changes slowly.
The redundancy of the speech waveform can be considered with reference to several different types of speech signals, e.g. voiced and unvoiced. Although voiced speech signals are substantially periodic, such periodicity may vary over the duration of a speech segment, and the shape of the periodic wave typically varies gradually from segment to segment. Low bit rate speech coding can greatly benefit from studying this periodicity. Voiced speech periods are also referred to as pitch, and pitch prediction is often referred to as long-term prediction (LTP). With unvoiced speech, the signal is more like a random noise and has less predictability.
In either case, parametric coding may be used to reduce the redundancy of speech segments by separating the excitation component of the speech signal from the spectral envelope component. The slowly varying spectral envelope can be represented by Linear Predictive Coding (LPC), also known as Short-Term Prediction (STP). Low bit rate speech coding can also benefit greatly from exploiting such short-term prediction. The coding advantage arises from the slow variation of the parameters: they rarely differ significantly from the values held over a few milliseconds. Accordingly, at sampling rates of 8 kHz, 12.8 kHz, or 16 kHz, speech coding algorithms employ a nominal frame duration in the range of ten to thirty milliseconds, with a frame duration of 20 milliseconds being the most common choice.
Audio coding based on filter bank techniques is widely used, for example in frequency-domain coding. In signal processing, a filter bank is an array of band-pass filters that separates the input signal into multiple components, each band-pass filter carrying a single subband of the original signal. The decomposition process performed by the filter bank is called analysis, and the output of filter bank analysis is referred to as a subband signal, with as many subbands as there are filters in the filter bank. The reconstruction process is called filter bank synthesis. In digital signal processing, the term "filter bank" also commonly applies to a bank of receivers, which differ in that they also down-convert the subbands to a low center frequency that can be re-sampled at a reduced rate. The same result can sometimes be obtained by undersampling the band-pass subbands. The output of filter bank analysis may take the form of complex coefficients, each containing a real and an imaginary element representing, respectively, the cosine and sine terms of each subband of the filter bank.
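As a concrete illustration of filter bank analysis and synthesis, the following Python sketch uses an STFT as the bank of band-pass filters: each frequency bin plays the role of one subband and yields complex coefficients whose real and imaginary parts correspond to the cosine and sine terms. The signal, sample rate, and segment length are illustrative assumptions, not values taken from this patent.

```python
# Filter bank analysis/synthesis sketched with an STFT; assumed toy values.
import numpy as np
from scipy.signal import stft, istft

fs = 16000                                   # sample rate (Hz), assumed
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)              # toy one-second input signal

# Analysis: split x into subband signals (complex coefficients per band).
f, frames, X = stft(x, fs=fs, nperseg=512)   # X.shape = (bands, frames)
print(X.shape, X.dtype)                      # complex coefficients

# Each subband is implicitly down-converted and decimated by the hop size,
# so it can be processed at a much lower rate than the original signal.

# Synthesis (filter bank combining): reconstruct the time-domain signal.
_, x_rec = istft(X, fs=fs, nperseg=512)
n = min(len(x), len(x_rec))
print(np.max(np.abs(x[:n] - x_rec[:n])))     # near-perfect reconstruction
```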
Code-excited linear prediction (CELP) techniques have been adopted in well-known standards such as G.723.1, G.729, G.718, Enhanced Full Rate (EFR), Selectable Mode Vocoder (SMV), Adaptive Multi-Rate (AMR), Variable-Rate Multimode Wideband (VMR-WB), and Adaptive Multi-Rate Wideband (AMR-WB). CELP is generally understood as a combination of coded excitation, long-term prediction, and short-term prediction techniques. CELP mainly encodes speech signals by exploiting specific human voice characteristics or a human vocal production model. CELP speech coding is a very popular algorithmic principle in the field of speech compression, although the CELP details in different codecs can vary substantially. Owing to its popularity, the CELP algorithm has been applied in various standards, such as ITU-T, MPEG, 3GPP, and 3GPP2. Variants of CELP include algebraic CELP, relaxed CELP, low-delay CELP, and vector sum excited linear prediction, among others. CELP is a generic term for a class of algorithms, not the name of a particular codec.
The CELP algorithm is based on four main ideas. First, a source-filter model of speech production through linear prediction (LP) is used. The source-filter model of speech production models speech as a combination of a sound source, such as the vocal cords, and a linear acoustic filter, the vocal tract (and radiation characteristic). In implementations of the source-filter model of speech production, the sound source, or excitation signal, is often modeled as a periodic impulse train for voiced speech, or white noise for unvoiced speech. Second, an adaptive and a fixed codebook are used as the input (excitation) of the LP model. Third, a search is performed in closed loop in a "perceptually weighted domain". Fourth, vector quantization (VQ) is applied.
Disclosure of Invention
Embodiments of the present invention describe a method of decoding an encoded audio bitstream and generating a bandwidth extension at a decoder. The method includes decoding the audio bitstream to produce a decoded low-band audio signal and generating a low-band excitation spectrum corresponding to a low band. A subband region is selected from within the low band using a parameter indicative of energy information of the spectral envelope of the decoded low-band audio signal. A high-band excitation spectrum for a high band is generated by copying the subband excitation spectrum from the selected subband region to a high subband region corresponding to the high band. An extended high-band audio signal is generated by applying a high-band spectral envelope to the generated high-band excitation spectrum. The extended high-band audio signal is added to the decoded low-band audio signal to generate an audio output signal having an extended frequency bandwidth.
According to an alternative embodiment of the present invention, a decoder for decoding an encoded audio bitstream and generating a bandwidth extension comprises a low-band decoding unit for decoding the audio bitstream to produce a decoded low-band audio signal and for generating a low-band excitation spectrum corresponding to a low band. The decoder also includes a bandwidth extension unit coupled to the low-band decoding unit. The bandwidth extension unit includes a subband selection unit and a copying unit. The subband selection unit is configured to select a subband region from within the low band using a parameter indicative of energy information of the spectral envelope of the decoded low-band audio signal. The copying unit is configured to generate a high-band excitation spectrum for a high band by copying the subband excitation spectrum from the selected subband region to a high subband region corresponding to the high band.
According to an alternative embodiment of the present invention, a decoder for speech processing includes a processor and a computer-readable storage medium storing a program for execution by the processor. The program includes instructions to decode the audio bitstream to produce a decoded low-band audio signal and to generate a low-band excitation spectrum corresponding to a low band. The program further includes instructions to select a subband region from within the low band using a parameter indicative of energy information of the spectral envelope of the decoded low-band audio signal, and to generate a high-band excitation spectrum for the high band by copying the subband excitation spectrum from the selected subband region to a high subband region corresponding to the high band. The program further includes instructions to generate an extended high-band audio signal by applying a high-band spectral envelope to the generated high-band excitation spectrum, and to add the extended high-band audio signal to the decoded low-band audio signal to generate an audio output signal having an extended frequency bandwidth.
An alternative embodiment of the present invention describes a method of decoding an encoded audio bitstream and generating a bandwidth extension at a decoder. The method includes decoding the audio bitstream to produce a decoded low-band audio signal and generating a low-band spectrum corresponding to a low band, and selecting a subband region from within the low band using a parameter indicative of energy information of the spectral envelope of the decoded low-band audio signal. The method further includes generating a high-band spectrum by copying the subband spectrum from the selected subband region to a high subband region, and generating an extended high-band audio signal from the generated high-band spectrum by applying a high-band spectral envelope energy. The method also includes adding the extended high-band audio signal to the decoded low-band audio signal to generate an audio output signal having an extended frequency bandwidth.
Drawings
For a more complete understanding of the present invention, and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
FIG. 1 illustrates operations performed during encoding of original speech using a conventional CELP encoder;
FIG. 2 illustrates operations performed during decoding of original speech using a CELP decoder in implementing an embodiment of the present invention described below;
FIG. 3 illustrates operations performed during encoding of original speech in a conventional CELP encoder;
FIG. 4 shows a basic CELP decoder, corresponding to the encoder in FIG. 3, used in implementing an embodiment of the present invention as described below;
fig. 5A and 5B illustrate an example of encoding/decoding with bandwidth extension (BWE), where fig. 5A illustrates operations at an encoder with BWE side information and fig. 5B illustrates operations at a decoder with BWE;
fig. 6A and 6B illustrate another example of encoding/decoding with BWE without transmitting side information, where fig. 6A illustrates operations at an encoder and fig. 6B illustrates operations at a decoder;
FIG. 7 shows an example of an idealized excitation spectrum for voiced speech or harmonic music when using a CELP type codec;
FIG. 8 illustrates an example of a conventional bandwidth extension of a decoded excitation spectrum for voiced speech or harmonic music when using a CELP type codec;
FIG. 9 illustrates an example of bandwidth extension of a decoded excitation spectrum applied to voiced speech or harmonic music when an embodiment of the present invention uses a CELP-type codec;
FIG. 10 illustrates operations at a decoder for implementing sub-band shifting or copying for BWE in an embodiment of the present invention;
FIG. 11 illustrates an alternative embodiment of a decoder for implementing sub-band shifting or copying for BWE;
FIG. 12 illustrates operations performed by a decoder according to embodiments of the present invention;
FIGS. 13A and 13B illustrate a decoder for implementing bandwidth extension according to an embodiment of the present invention;
FIG. 14 illustrates a communication system according to an embodiment of the present invention; and
FIG. 15 illustrates a block diagram of a processing system that may be used to implement the apparatus and methods disclosed herein.
Detailed Description
In modern audio/speech digital signal communication systems, digital signals are compressed at an encoder, and the compressed information or bit stream may be packetized and transmitted frame by frame over a communication channel to a decoder. The decoder receives and decodes the compressed information to obtain an audio/speech digital signal.
The present invention relates generally to speech/audio signal coding and speech/audio signal bandwidth extension. In particular, embodiments of the present invention may be used to improve the ITU-T AMR-WB speech coder standard in the field of bandwidth extension.
Some frequencies are more important than others. The important frequencies are coded with high resolution: subtle differences at these frequencies matter, so a coding scheme that preserves these differences is needed. Less important frequencies, on the other hand, do not need to be exact, and a coarser coding scheme may be used even though some of the finer details will be lost in coding. A typical coarser coding scheme is based on the concept of bandwidth extension (BWE). This technique is also referred to as High Band Extension (HBE), SubBand Replica (SBR), or Spectral Band Replication (SBR). Although the names differ, they all have the same meaning: some subbands (usually the high bands) are encoded/decoded with a very low bit rate (even a zero bit rate), or with a bit rate significantly lower than that of normal encoding/decoding approaches.
In the SBR technique, the spectral fine structure in the high frequency band can be copied from the low frequency band, and some random noise can be added. Then, a spectral envelope in the high frequency band is formed by using the side information transmitted from the encoder to the decoder. Band shifting or copying from low band to high band is typically the first step of BWE technology.
Embodiments of the present invention describe techniques for improving BWE through adaptive selection of the shifted band based on the energy levels of the spectral envelope.
Fig. 1 illustrates operations performed during encoding of original speech using a conventional CELP encoder.
Fig. 1 shows a conventional initial CELP encoder, in which the weighted error 109 between the synthesized speech 102 and the original speech 101 is minimized, typically by an analysis-by-synthesis method, meaning that the encoding (analysis) is performed by perceptually optimizing the decoded (synthesized) signal in a closed loop.
The rationale behind all speech coders is the fact that the speech signal is a highly correlated waveform. By way of illustration, speech may be represented using an autoregressive (AR) model, as shown in equation (11) below:

X(n) = a1·X(n−1) + a2·X(n−2) + … + aL·X(n−L) + e(n) (11)

In equation (11), each sample is represented as a linear combination of the previous L samples plus a white noise term e(n). The weighting coefficients a1, a2, …, aL are referred to as Linear Prediction Coefficients (LPC). For each frame, the weighting coefficients are selected so that the spectrum {X1, X2, …, XN} generated using the above model best matches the spectrum of the input speech frame.
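As a concrete illustration of equation (11), the following Python sketch estimates the coefficients a1, …, aL for one frame from its autocorrelation using the Levinson-Durbin recursion. The frame content and the prediction order L are illustrative assumptions.

```python
# LPC estimation for one frame via Levinson-Durbin; assumed toy frame.
import numpy as np

def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations for the predictor of eq. (11)."""
    a = np.zeros(order)
    err = r[0]                                # prediction error energy
    for i in range(order):
        # reflection coefficient for stage i
        k = (r[i + 1] - np.dot(a[:i], r[i::-1][:i])) / err
        a_new = a.copy()
        a_new[i] = k
        a_new[:i] = a[:i] - k * a[i - 1::-1][:i]
        a = a_new
        err *= 1.0 - k * k
    return a, err                             # a[i-1] plays the role of ai

frame = np.random.default_rng(0).standard_normal(160)  # 20 ms at 8 kHz
r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
a, err = levinson_durbin(r, order=10)         # order L = 10, assumed
print(a, err)
```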
Alternatively, the speech signal may also be represented by a combination of a harmonic model and a noise model. The harmonic part of the model is actually a fourier series representation of the periodic component of the signal. In general, for voiced signals, the harmonic plus noise model of speech consists of a mixture of harmonics and noise. The proportion of harmonics and noise in voiced speech depends on a number of factors, including speaker characteristics (e.g., to what extent the speaker's voice is normal or like breathing sounds); speech segment characteristics (e.g., to what extent the speech segment is periodic) and frequency. Higher frequencies of voiced speech have a higher proportion of noise-type components.
Linear prediction models and harmonic noise models are two main methods for simulating and encoding speech signals. Linear prediction models are particularly good at simulating the spectral envelope of speech, while harmonic noise models are good at simulating the fine structure of speech. The two methods can be combined to take advantage of their relative strengths.
As indicated previously, the input signal at the handset microphone is filtered and sampled prior to CELP encoding, for example at 8000 samples per second. Each sample is then quantized, for example with 13 bits per sample. The sampled speech is segmented into segments or frames of 20 ms (160 samples in this example).
The speech signal is analyzed and its LP model, excitation signal and pitch are extracted. The LP model represents the spectral envelope of speech. It is converted to a set of Line Spectral Frequency (LSF) coefficients, which is an alternative representation of the linear prediction parameters, since LSF coefficients have good quantization properties. The LSF coefficients may be scalar quantized or, more efficiently, they may be vector quantized using a previously trained LSF vector codebook.
The code excitation comprises a codebook of code vectors having all independently selected components such that each code vector may have an approximately 'white' spectrum. For each sub-frame of the input speech, each codevector is filtered by a short-term linear prediction filter 103 and a long-term prediction filter 105, and the output is compared to the speech samples. At each sub-frame, the codevector that outputs the best match (minimized error) to the input speech is selected to represent that sub-frame.
The coded excitation 108 typically comprises a pulse-like signal or a noise-like signal, which is mathematically constructed or saved in a codebook. The codebook is available to both the encoder and the receiving decoder. The coded excitation 108, which may be a random or fixed codebook, may be a vector quantization dictionary that is (implicitly or explicitly) hard-coded into the codec. Such a fixed codebook may be algebraic code-excited linear prediction, or it may be stored explicitly.
A codevector from the codebook is scaled by an appropriate gain to make its energy equal to the energy of the input speech. Accordingly, the output of the coded excitation 108 is multiplied by a gain Gc 107 before entering the linear filters.
The short-term linear prediction filter 103 shapes the 'white' spectrum of the codevector to resemble the spectrum of the input speech. Equivalently, in the time domain, the short-term linear prediction filter 103 introduces short-term correlations (correlation with previous samples) into the white sequence. The filter that shapes the excitation is an all-pole model of the form 1/A(z) (short-term linear prediction filter 103), where A(z) is called the prediction filter and may be obtained by linear prediction (e.g., the Levinson-Durbin algorithm). In one or more embodiments, an all-pole filter may be used because it is a good representation of the human vocal tract and is easy to compute.
The short-term linear prediction filter 103 is obtained by analyzing the original signal 101 and is represented by a set of coefficients:

A(z) = 1 − a1·z^−1 − a2·z^−2 − … − aL·z^−L (12)
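The following Python sketch illustrates this short-term synthesis step: a spectrally 'white' excitation is passed through the all-pole filter 1/A(z) of equation (12). The coefficient values are illustrative assumptions, not coefficients from any real codec.

```python
# Shaping a white excitation with the all-pole filter 1/A(z); assumed values.
import numpy as np
from scipy.signal import lfilter

a1, a2 = 0.9, -0.64                          # assumed stable LPC coefficients
A = np.array([1.0, -a1, -a2])                # A(z) = 1 - a1*z^-1 - a2*z^-2

excitation = np.random.default_rng(1).standard_normal(160)  # white codevector

# y(n) = a1*y(n-1) + a2*y(n-2) + e(n): filter e(n) through 1/A(z).
synthesized = lfilter([1.0], A, excitation)
print(synthesized[:5])
```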
as previously described, regions of voiced speech exhibit long periods. This period, called the pitch, is introduced into the synthesized spectrum by the pitch filter 1/(b (z)). The output of the long-term prediction filter 105 depends on the pitch and the pitch gain. In one or more embodiments, the pitch may be estimated from the original signal, the residual signal, or the weighted original signal. In one embodiment, the long-term prediction function (b (z)) may be expressed as follows using equation (13).
B(z) = 1 − Gp·z^−Pitch (13)
The weighting filter 110 is related to the above short-term prediction filter. One typical form of the weighting filter may be represented as in equation (14):

W(z) = A(z/α) / (1 − β·z^−1) (14)

where β < α, 0 < β < 1, and 0 < α ≤ 1.
In another embodiment, the weighting filter W(z) may be derived from the LPC filter using bandwidth expansion, as shown for one embodiment in equation (15):

W(z) = A(z/γ1) / A(z/γ2) (15)

In equation (15), γ1 > γ2; these are the factors by which the poles are moved toward the origin.
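To illustrate equation (15), the following Python sketch derives the numerator and denominator coefficients of W(z) by bandwidth-expanding A(z), i.e., scaling the i-th coefficient by γ^i, which moves the poles and zeros toward the origin. The A(z) coefficients and γ values are illustrative assumptions.

```python
# Deriving W(z) = A(z/g1)/A(z/g2) from A(z); all values assumed.
import numpy as np

def bandwidth_expand(A, gamma):
    """Return the coefficients of A(z/gamma): A_i -> A_i * gamma**i."""
    return A * gamma ** np.arange(len(A))

A = np.array([1.0, -0.9, 0.64])              # assumed A(z) coefficients
gamma1, gamma2 = 0.94, 0.6                   # assumed, with gamma1 > gamma2

W_num = bandwidth_expand(A, gamma1)          # numerator   A(z/gamma1)
W_den = bandwidth_expand(A, gamma2)          # denominator A(z/gamma2)
print(W_num, W_den)
```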
Accordingly, for each frame of speech, the LPC coefficients and the pitch are computed and the filters are updated. For each subframe of speech, the codevector that produces the 'best' filtered output is selected to represent the subframe. The corresponding quantized value of the gain must be transmitted to the decoder for proper decoding. The LPC coefficients and pitch values must also be quantized and sent every frame so that the filters can be reconstructed at the decoder. Accordingly, the coded excitation index, quantized gain index, quantized long-term prediction parameter index, and quantized short-term prediction parameter index are transmitted to the decoder.
Fig. 2 illustrates operations performed during decoding of the original speech using a CELP decoder in implementing an embodiment of the present invention, as described below.
The speech signal is reconstructed at the decoder by passing the received codevectors through corresponding filters. Thus, each block except for post-processing has the same definition as described for the encoder of fig. 1.
The encoded CELP bitstream is received and unpacketized 80 at a receiving device. For each received subframe, the received coded excitation index, quantized gain index, quantized long-term prediction parameter index, and quantized short-term prediction parameter index are used to find the corresponding parameters through the corresponding decoders, e.g., the gain decoder 81, the long-term prediction decoder 82, and the short-term prediction decoder 83. For example, the positions and amplitudes of the excitation pulses, and hence the algebraic code vector of the coded excitation 402, may be determined from the received coded excitation index.
Referring to fig. 2, the decoder is a combination of several blocks, including the coded excitation 201, long-term prediction 203, and short-term prediction 205. The initial decoder also includes a post-processing block 207 after the synthesized speech 206. The post-processing may further comprise short-term post-processing and long-term post-processing.
Fig. 3 shows a conventional CELP encoder.
Fig. 3 shows a basic CELP encoder that uses an additional adaptive codebook to improve long-term linear prediction. The excitation is produced by adding the contributions of the adaptive codebook 307 and the coded excitation 308, where the coded excitation 308 may be a random or fixed codebook as previously described. The entries in the adaptive codebook comprise time-delayed versions of the excitation, which makes it possible to encode periodic signals, such as voiced sounds, efficiently.
Referring to fig. 3, the adaptive codebook 307 contains the past synthesized excitation 304, or the past excitation repeated over a pitch period. When the pitch delay is large or long, it can be encoded as an integer value. When the pitch delay is small or short, it is usually encoded as a more precise fractional value. The periodicity information of the pitch is used to generate the adaptive component of the excitation. This excitation component is then scaled by a gain Gp 305 (also called the pitch gain).
Long-term prediction is very important for voiced speech coding because voiced speech has strong periodicity. Adjacent pitch cycles of voiced speech resemble one another, which means that, mathematically, the pitch gain Gp in the excitation expression below is very high, close to 1. The resulting excitation can be expressed in equation (16) as a combination of the individual excitations:
e(n) = Gp·ep(n) + Gc·ec(n) (16)
where ep(n) is one subframe of the sample sequence indexed by n, coming from the adaptive codebook 307, which comprises the past excitation 304 fed back through a feedback loop (fig. 3). ep(n) may be adaptively low-pass filtered, since the low-frequency region is typically more periodic and more harmonic than the high-frequency region. ec(n) comes from the coded excitation codebook 308 (also called the fixed codebook) and is the current excitation contribution. Furthermore, ec(n) may be enhanced, for example by high-pass filtering enhancement, pitch enhancement, dispersion enhancement, formant enhancement, and others.
For voiced speech, the contribution of ep(n) from the adaptive codebook 307 may be dominant, and the pitch gain Gp 305 has a value of about 1. The excitation is usually updated for each subframe. A typical frame size is 20 milliseconds and a typical subframe size is 5 milliseconds.
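The following Python sketch illustrates equation (16): one subframe of the adaptive codebook contribution ep(n) is read from the past excitation at the pitch lag, scaled by Gp, and added to the scaled fixed codebook contribution Gc·ec(n). The lag, gains, and pulse positions are illustrative assumptions.

```python
# Total excitation per equation (16); lag, gains, and pulses are assumed.
import numpy as np

subframe = 40                                 # 5 ms at 8 kHz
pitch_lag = 57                                # decoded pitch delay, assumed
Gp, Gc = 0.9, 0.3                             # decoded gains, assumed

past_excitation = np.random.default_rng(2).standard_normal(200)
ec = np.zeros(subframe)
ec[[4, 14, 24, 34]] = [1.0, -1.0, 1.0, -1.0]  # toy algebraic pulses

# ep(n): one subframe read from the past excitation, pitch_lag samples back.
ep = past_excitation[-pitch_lag:-pitch_lag + subframe]

e = Gp * ep + Gc * ec                         # equation (16)
past_excitation = np.concatenate([past_excitation, e])  # feedback loop
print(e[:5])
```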
As illustrated in fig. 3, the fixed codebook excitation 308 is multiplied by a gain Gc 306 before entering the linear filters. The two scaled excitation components from the fixed codebook excitation 308 and the adaptive codebook 307 are added together before filtering through the short-term linear prediction filter 303. The two gains (Gp and Gc) are quantized and transmitted to the decoder. Accordingly, the coded excitation index, adaptive codebook index, quantized gain indices, and quantized short-term prediction parameter index are transmitted to the receiving audio device.
A CELP bitstream encoded using the apparatus shown in fig. 3 is received at a receiving device. Fig. 4 shows a corresponding decoder of the receiving device.
Fig. 4 shows a basic CELP decoder corresponding to the encoder in fig. 3. Fig. 4 includes a post-processing block 408 that receives the synthesized speech 407 from the main decoder. The decoder is similar to that of fig. 2, with the addition of the adaptive codebook 401.
For each subframe received, the received coded excitation index, quantized coded excitation gain index, quantized pitch index, quantized adaptive codebook gain index and quantized short-term prediction parameter index are used to find corresponding parameters by corresponding decoders, e.g., gain decoder 81, pitch decoder 84, adaptive codebook gain decoder 85 and short-term prediction decoder 83.
In various embodiments, the CELP decoder is a combination of several blocks and includes the coded excitation 402, the adaptive codebook 401, short-term prediction 406, and post-processing 408. Except for post-processing, each block has the same definition as described for the encoder of fig. 3. The post-processing may further comprise short-term post-processing and long-term post-processing.
As mentioned before, CELP is mainly used to encode speech signals by benefiting from specific human voice characteristics or human vocal production models. To encode speech signals more efficiently, the speech signals may be classified into different classes, with each class encoded in a different manner. The voiced/unvoiced classification or unvoiced decision may be the most important and fundamental of all these classifications. For each class, the spectral envelope is often represented by an LPC or STP filter, but the excitation of the LPC filter may differ: unvoiced signals may be coded with a noise-like excitation, whereas voiced signals may be coded with an impulse-like excitation.
The coded excitation block (reference numeral 308 in fig. 3 and 402 in fig. 4) shows the position of the Fixed Codebook (FCB) in general CELP coding. The codevector selected from the FCB is scaled by a gain, commonly denoted Gc 306.
Fig. 5A and 5B illustrate an example of encoding/decoding using bandwidth extension (BWE). Fig. 5A shows the operation at the encoder with BWE side information, while fig. 5B shows the operation at the decoder with BWE.
The low band signal 501 is encoded by using the low band parameters 502. The low band parameters 502 are quantized and the generated quantization indices may be transmitted through a bitstream channel 503. The high band signal extracted from the audio/speech signal 504 is encoded by using the high band side parameters 505 and using a small number of bits. The quantized high-band side parameters (side information indices) are transmitted through the bitstream channel 506.
Referring to fig. 5B, at the decoder, the low-band bitstream 507 is used to produce a decoded low-band signal 508. The high-band side bitstream 510 is used to decode the high-band side parameters 511. The high-band signal 512 is generated from the low-band signal 508 with the help of the high-band side parameters 511. The final audio/speech signal 509 is produced by combining the low-band signal 508 and the high-band signal 512.
Fig. 6A and 6B illustrate another example of encoding/decoding using BWE without transmitting side information. Fig. 6A shows the operation at the encoder, and fig. 6B shows the operation at the decoder.
Referring to fig. 6A, a low band signal 601 is encoded by using low band parameters 602. The low band parameters 602 are quantized to generate quantization indices, which may be transmitted over a bitstream channel 603.
Referring to fig. 6B, at the decoder, the low band bitstream 604 is used to produce a decoded low band signal 605. The high band signal 607 is generated from the low band signal 605 without transmitting side information. The final audio/speech signal 606 is generated by combining the low band signal 605 and the high band signal 607.
Fig. 7 shows an example of an idealized excitation spectrum of voiced speech or harmonic music when using a CELP type codec.
After removing the LPC spectral envelope, the idealized excitation spectrum 702 is almost flat. The idealized low-band excitation spectrum 701 may be used as a reference for low-band excitation coding. The idealized highband excitation spectrum 703 is not available at the decoder. In theory, the energy level of the idealized or unquantized high-band excitation spectrum may be almost the same as that of the low-band excitation spectrum.
In practice, the synthesized or decoded excitation spectrum does not look as good as the idealized excitation spectrum shown in fig. 7.
Fig. 8 shows an example of a decoded excitation spectrum of voiced speech or harmonic music when using a CELP type codec.
After removing the LPC spectral envelope 804, the decoded excitation spectrum 802 is almost flat. The decoded low-band excitation spectrum 801 is available at the decoder. The quality of the decoded low-band excitation spectrum 801 degrades, or becomes more distorted, especially in regions where the envelope energy is low. There are several reasons for this; two main ones are that closed-loop CELP coding emphasizes high-energy regions more than low-energy regions, and that waveform matching is easier for low-frequency signals than for high-frequency signals, because high-frequency signals change faster. For low bit rate CELP coding, such as AMR-WB, the high band is typically not coded but is generated at the decoder with BWE techniques. In this case, the high-band excitation spectrum 803 can simply be copied from the low-band excitation spectrum 801, while the high-band spectral energy envelope is predicted or estimated from the low-band spectral energy envelope. Conventionally, the high-band excitation spectrum 803 above 6400 Hz is generated by copying from the subbands just below 6400 Hz. This would be a good approach if the spectral quality were equivalent from 0 Hz to 6400 Hz. For low bit rate CELP codecs, however, the spectral quality can vary considerably from 0 Hz to 6400 Hz. The subband copied from the end region of the low band, just below 6400 Hz, may have poor quality, which then introduces extra noise into the high-band region from 6400 Hz to 8000 Hz.
The bandwidth of the extended high band is typically much smaller than that of the encoded low band. Thus, in various embodiments, the best sub-band in the low band is selected and copied into the high band region.
High-quality subbands may exist anywhere within the entire low-frequency band. The most likely location of the high quality sub-band is the region corresponding to the region of high spectral energy, i.e. the spectral formant region.
Fig. 9 shows an example of a decoded excitation spectrum of voiced speech or harmonic music when using a CELP type codec.
After removing the LPC spectral envelope 904, the decoded excitation spectrum 902 is almost flat. The decoded low-band excitation spectrum 901 is available at the decoder, but not at the high-band 903. The quality of the decoded low-band excitation spectrum 901 becomes worse or more distorted especially in regions where the energy of the spectral envelope 904 is lower.
In the case illustrated in fig. 9, in one embodiment, the high-quality subband is located around the first speech formant region (about 2000 Hz in this example embodiment). In various embodiments, high-quality subbands may be located anywhere between 0 and 6400 Hz.
After the position of the best subband is determined, the subband is copied from the low band to the high band, as further shown in fig. 9, thereby generating the high-band excitation spectrum 903 by copying from the selected subband. The perceptual quality of the high band 903 in fig. 9 sounds much better than the high band 803 in fig. 8 because of the improved excitation spectrum.
In one or more embodiments, if a low-band spectral envelope is available at the decoder in the frequency domain, the best subband may be determined by searching for the highest subband energy from among all subband candidates.
Alternatively, in one or more embodiments, if a frequency domain spectral envelope is not available, the high energy location may also be determined from any parameter that reflects the spectral energy envelope or spectral formant peak. The best subband position for BWE corresponds to the highest spectral peak position.
The search range for the starting point of the best subband may depend on the codec bit rate. For example, for a very low bit rate codec, the search range may be from 0 Hz to 6400 − 1600 = 4800 Hz (0 Hz to 4800 Hz), assuming that the bandwidth of the high band is 1600 Hz. In another example, for a medium bit rate codec, the search range may be from 2000 Hz to 6400 − 1600 = 4800 Hz (2000 Hz to 4800 Hz), again assuming a high-band bandwidth of 1600 Hz.
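A minimal Python sketch of this selection step follows: it slides a window of the high-band width across the low-band spectral envelope within the bit-rate-dependent search range and returns the starting frequency with the highest envelope energy. The FFT resolution, sample rate, and ranges are illustrative assumptions.

```python
# Best-subband search over the low-band envelope; sizes are assumed.
import numpy as np

def best_subband_start(envelope, fs, hb_width_hz, lo_hz, hi_hz):
    """envelope: per-bin energies of the decoded low-band spectral envelope."""
    hz_per_bin = (fs / 2) / len(envelope)       # bins span 0 .. fs/2
    width = int(hb_width_hz / hz_per_bin)       # copy-window width in bins
    lo, hi = int(lo_hz / hz_per_bin), int(hi_hz / hz_per_bin)
    # cumulative sum gives each candidate window's energy in O(1)
    csum = np.concatenate(([0.0], np.cumsum(envelope)))
    energies = [csum[s + width] - csum[s] for s in range(lo, hi + 1)]
    return (lo + int(np.argmax(energies))) * hz_per_bin  # start in Hz

envelope = np.abs(np.random.default_rng(3).standard_normal(256)) ** 2
# medium bit rate case: search from 2000 Hz to 6400 - 1600 = 4800 Hz
print(best_subband_start(envelope, fs=12800, hb_width_hz=1600,
                         lo_hz=2000, hi_hz=4800))
```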
Since the spectral envelope varies slowly from one frame to the next, the optimal subband starting point for the highest spectral formant energy also typically varies slowly. To avoid the optimal subband starting point fluctuating or changing frequently from one frame to the next, some smoothing may be applied within the same voiced region in the time domain, unless the spectral peak energy changes dramatically from one frame to the next or a new voiced region begins, as sketched below.
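The following Python sketch shows one possible form of such smoothing; the smoothing factor, energy-jump threshold, and reset conditions are illustrative assumptions rather than values prescribed by this patent.

```python
# Smoothing the selected subband start across frames; thresholds assumed.
prev_start = None

def smoothed_start(candidate, peak_energy, prev_energy, new_region,
                   alpha=0.75, jump_ratio=4.0):
    """Low-pass the start unless the spectral peak energy changes sharply
    or a new voiced region begins (then snap to the new candidate)."""
    global prev_start
    sharp_change = min(peak_energy, prev_energy) > 0 and (
        peak_energy / prev_energy > jump_ratio or
        prev_energy / peak_energy > jump_ratio)
    if prev_start is None or new_region or sharp_change:
        prev_start = candidate                 # reset to the new candidate
    else:
        prev_start = alpha * prev_start + (1 - alpha) * candidate
    return prev_start

print(smoothed_start(2000.0, 1.0, 0.0, new_region=True))   # snap: 2000.0
print(smoothed_start(2100.0, 1.1, 1.0, new_region=False))  # smooth: 2025.0
```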
FIG. 10 illustrates operations at a decoder for implementing subband shifting or copying for BWE in an embodiment of the present invention.
The time domain low band signal 1002 is decoded by using the received bit stream 1001. The low-band time-domain excitation 1003 is typically available at the decoder. Sometimes, low band frequency domain excitation is also available. If not, the low-band time-domain excitation 1003 may be transformed to the frequency domain to obtain a low-band frequency-domain excitation.
The spectral envelope of a voiced speech or music signal is usually represented by LPC parameters. Sometimes a direct frequency-domain spectral envelope is available at the decoder. In any case, the energy distribution information 1004 can be extracted from the LPC parameters, from a direct frequency-domain spectral envelope, or from parameters in the DFT or FFT domain. Using the low-band energy distribution information 1004, the best subband is selected from the low band by searching for the relatively high energy peak. The selected subband is then copied from the low band to the high-band region. The predicted or estimated high-band spectral envelope is then applied to the high-band region; equivalently, the time-domain high-band excitation 1005 is passed through a predicted or estimated high-band filter representing the high-band frequency-domain envelope. The output of the high-band filter is the high-band signal 1006. The final speech/audio output signal 1007 is obtained by combining the low-band signal 1002 and the high-band signal 1006.
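The following Python sketch ties these steps together in the frequency domain: transform the decoded low-band excitation, copy the selected subband up to the high band, apply an estimated high-band envelope, and transform back. The selected subband, envelope shape, and sizes are illustrative assumptions, and the decoded excitation simply stands in for the decoded low-band signal in the final combination.

```python
# End-to-end BWE sketch in the frequency domain; all values assumed.
import numpy as np

N, fs = 512, 16000                              # FFT size and rate, assumed
lowband = np.random.default_rng(4).standard_normal(N)  # stand-in low band

X = np.fft.rfft(lowband)                        # low-band excitation spectrum
def bins(hz): return int(hz * N / fs)           # Hz -> bin index

width = bins(1600)                              # high-band width: 1600 Hz
src = slice(bins(2000), bins(2000) + width)     # selected best subband
dst = slice(bins(6400), bins(6400) + width)     # high band: 6400-8000 Hz

X_hb = np.zeros_like(X)
X_hb[dst] = X[src]                              # copy subband excitation up
X_hb[dst] *= np.linspace(1.0, 0.5, width)       # estimated high-band envelope

highband = np.fft.irfft(X_hb, n=N)              # extended high-band signal
output = lowband + highband                     # combine low and high bands
print(output[:4])
```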
FIG. 11 illustrates an alternative embodiment of a decoder for implementing subband shifting or copying for BWE.
Unlike fig. 10, fig. 11 assumes that a frequency-domain low-band spectrum is available. The best subband in the low band is selected by simply searching for the relatively high energy peak in the frequency domain. The selected subband is then copied from the low band to the high band. After the estimated high-band spectral envelope is applied, the high-band spectrum 1103 is formed. The final frequency-domain speech/audio spectrum is obtained by combining the low-band spectrum 1102 and the high-band spectrum 1103. The final time-domain speech/audio output signal is generated by converting the frequency-domain speech/audio spectrum into the time domain.
When filter bank analysis and synthesis covering the desired spectral range are available at the decoder, the SBR algorithm can implement the band shift by copying the filter bank analysis output coefficients of the selected low-band subband to the high-band region.
Fig. 12 illustrates operations performed at a decoder according to an embodiment of the present invention.
Referring to fig. 12, a method of decoding an encoded audio bitstream at a decoder includes receiving an encoded audio bitstream. In one or more embodiments, the received audio bitstream has been CELP encoded. In particular, only the low frequency band is encoded by CELP. CELP produces relatively higher spectral quality in the higher spectral energy regions than in the lower spectral energy regions. Accordingly, embodiments of the present invention include decoding an audio bitstream to generate a decoded low-band audio signal and a low-band excitation spectrum corresponding to a low-band (block 1210). A subband region is selected from within the low band using energy information of a spectral envelope of the decoded low band audio signal (block 1220). A high-band excitation spectrum for the high-band is generated by copying the subband excitation spectrum from the selected subband region to a high subband region corresponding to the high-band (block 1230). An audio output signal is generated using the high-band excitation spectrum (block 1240). In particular, the generated high-band excitation spectrum is used to generate an extended high-band audio signal by applying a high-band spectral envelope. The extended high-band audio signal is added to the decoded low-band audio signal to generate an audio output signal having an extended frequency bandwidth.
As previously described using fig. 10 and 11, embodiments of the present invention may be applied in different ways depending on whether a frequency domain spectral envelope is available. For example, if a frequency-domain spectral envelope is available, the subband having the highest subband energy may be selected. On the other hand, if the frequency-domain spectral envelope is not available, the energy distribution of the spectral envelope may be determined from Linear Predictive Coding (LPC) parameters, Discrete Fourier Transform (DFT) domain, or Fast Fourier Transform (FFT) domain parameters. Similarly, if spectral formant-peak information is available (or calculable), it may be used in some embodiments. If only the low-band time-domain excitation is available, the low-band frequency-domain excitation may be calculated by transforming the low-band time-domain excitation into the frequency domain.
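As an illustration of the case where no direct frequency-domain envelope is available, the following Python sketch recovers an energy distribution from LPC parameters by evaluating the magnitude response of 1/A(z) on a frequency grid; the A(z) coefficients are illustrative assumptions.

```python
# Envelope energy distribution from LPC parameters; coefficients assumed.
import numpy as np
from scipy.signal import freqz

A = np.array([1.0, -0.9, 0.64])                 # assumed quantized A(z)
w, h = freqz([1.0], A, worN=256)                # 1/A(z) on 256 frequency bins
envelope_energy = np.abs(h) ** 2                # per-bin envelope energy
print(int(np.argmax(envelope_energy)))          # formant-peak bin location
```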
In various embodiments, the spectral envelope may be calculated using any known method known to one of ordinary skill in the art. For example, in the frequency domain, the spectral envelope may be a simple set of energies, representing the energies of a set of subbands. Similarly, in another example, the spectral envelope may be represented in the time domain by LPC parameters. The LPC parameters may have many forms in various embodiments, such as reflection coefficients, LPC coefficients, LSP coefficients, LSF coefficients.
Fig. 13A and 13B illustrate a decoder implementing bandwidth extension according to an embodiment of the present invention.
Referring to fig. 13A, a decoder for decoding an encoded audio bitstream includes a low-band decoding unit 1310 for decoding the audio bitstream to generate a low-band excitation spectrum for the low band.
The decoder also includes a bandwidth extension unit 1320 coupled to the low band decoding unit 1310 and including a sub-band selection unit 1330 and a copy unit 1340. The subband selection unit 1330 is configured to select subband regions from within the low frequency band using energy information of the spectral envelope of the decoded audio bitstream. The copying unit 1340 serves to generate a high-band excitation spectrum of a high-band by copying the sub-band excitation spectrum from the selected sub-band region to a high sub-band region corresponding to the high-band.
The high band signal generator 1350 is coupled to the copying unit 1340. The high band signal generator 1350 is configured to generate a high band time domain signal using the predicted high band spectral envelope. The output generator is coupled to the high band signal generator 1350 and the low band decoding unit 1310. The output generator 1360 is operative to generate an audio output signal by combining the low-band time-domain signal and the high-band time-domain signal obtained by decoding the audio bitstream.
Fig. 13B shows an alternative embodiment of a decoder implementing bandwidth extension.
Similar to fig. 13A, the decoder of fig. 13B also includes a low band decoding unit 1310 and a bandwidth extension unit 1320, the bandwidth extension unit 1320 being coupled to the low band decoding unit 1310 and including a sub-band selection unit 1330 and a duplication unit 1340.
Referring to fig. 13B, the decoder further includes a high-band spectrum generator 1355 coupled to the copying unit 1340. The high-band spectrum generator 1355 is configured to generate a high-band spectrum for the high band from the high-band excitation spectrum using the high-band spectral envelope energy.
The output spectrum generator 1365 is coupled to the high-band spectrum generator 1355 and the low-band decoding unit 1310. The output spectrum generator is used to generate a frequency domain audio spectrum by combining a low-band spectrum obtained by decoding the audio bitstream from the low-band decoding unit 1310 and a high-band spectrum from the high-band spectrum generator 1355.
The inverse transform signal generator 1370 is used to generate a time domain audio signal by inverse transforming the frequency domain audio spectrum into the time domain.
The various components illustrated in fig. 13A and 13B may be implemented in hardware in one or more embodiments. In some embodiments, they are implemented in software and are intended to function in a signal processor.
Accordingly, embodiments of the present invention may be used to improve bandwidth extension at a decoder that decodes a CELP encoded audio bitstream.
Fig. 14 illustrates a communication system 10 according to an embodiment of the present invention.
Communication system 10 has audio access devices 7 and 8 coupled to network 36 via communication links 38 and 40. In one embodiment, audio access devices 7 and 8 are Voice over IP (VoIP) devices and network 36 is a wide area network (WAN), a public switched telephone network (PSTN), and/or the internet. In another embodiment, the communication links 38 and 40 are wired and/or wireless broadband connections. In another alternative embodiment, audio access devices 7 and 8 are cellular or mobile phones, links 38 and 40 are wireless mobile phone channels, and network 36 represents a mobile phone network.
Audio access device 7 converts sound, such as music or human voice, into an analog audio input signal 28 using microphone 12. The microphone interface 16 converts the analog audio input signal 28 into a digital audio signal 33 for input into the encoder 22 of the codec 20. According to an embodiment of the invention, the encoder 22 produces an encoded audio signal TX for transmission to the network 36 via the network interface 26. The decoder 24 within the codec 20 receives the encoded audio signal RX from the network 36 via the network interface 26 and converts the encoded audio signal RX into a digital audio signal 34. The speaker interface 18 converts the digital audio signal 34 into the audio signal 30 suitable for driving the loudspeaker 14.
In the embodiment of the present invention, when the audio access device 7 is a VOIP device, some or all of the components in the audio access device 7 are implemented in the handset. However, in some embodiments, the microphone 12 and speaker 14 are separate units, and the microphone interface 16, speaker interface 18, codec 20, and network interface 26 are implemented within a personal computer. The codec 20 may be implemented in software running on a computer or a dedicated processor or by dedicated hardware, for example on an Application Specific Integrated Circuit (ASIC). The microphone interface 16 is implemented by an analog-to-digital (a/D) converter, as well as other interface circuitry located within the handset and/or computer. Similarly, the speaker interface 18 is implemented by digital to analog converters and other interface circuits located within the handset and/or computer. In other embodiments, the audio access device 7 may be implemented and partitioned in other ways known in the art.
In embodiments of the present invention where the audio access device 7 is a cellular or mobile phone, the elements within the audio access device 7 are implemented within the cellular phone. The codec 20 is implemented by software running on a processor within the handset or by dedicated hardware. In further embodiments of the present invention, the audio access device may be implemented in other devices such as peer-to-peer wired and wireless digital communication systems, e.g., walkie-talkies and wireless handsets. In applications such as consumer audio equipment, the audio access device may comprise a codec having only, for example, the encoder 22 or the decoder 24, in a digital microphone system or a music playback device. In other embodiments of the present invention, the codec 20 may be used in a cellular base station that accesses the PSTN, without the microphone 12 and the loudspeaker 14.
The speech processing for improving bandwidth extension described in the various embodiments of the present invention may be implemented in, for example, the encoder 22 or the decoder 24. Such speech processing may be implemented in hardware or software in various embodiments. For example, the encoder 22 or the decoder 24 may be part of a digital signal processing (DSP) chip.
Fig. 15 illustrates a block diagram of a processing system that may be used to implement the apparatus and methods disclosed herein. A particular device may utilize all of the illustrated components or only a subset of the components, and the degree of integration between devices may vary. Further, a device may include multiple instances of a component, such as multiple processing units, processors, memories, transmitters, receivers, etc. The processing system may include a processing unit equipped with one or more input/output devices, such as a speaker, a microphone, a mouse, a touch screen, keys, a keyboard, a printer, a display, and so forth. The processing unit may include a Central Processing Unit (CPU), memory, mass storage device, video adapter, and an I/O interface connected to the bus.
The bus may be one or more of any type of several bus architectures, including a memory bus or memory controller, a peripheral bus, a video bus, and the like. The CPU may comprise any type of electronic data processor. The memory may comprise any type of system memory, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous DRAM (SDRAM), read-only memory (ROM), combinations thereof, and the like. In an embodiment, the memory may include ROM for use at boot-up, and DRAM for program and data storage for use while executing programs.
The mass storage device may comprise any type of storage device for storing data, programs, and other information and for making the data, programs, and other information accessible via the bus. The mass storage device may comprise one or more of the following: a solid state drive, hard disk drive, magnetic disk drive, optical disk drive, and the like.
The video adapter and the I/O interface provide interfaces to couple external input and output devices to the processing unit. As illustrated, examples of input and output devices include a display coupled to the video adapter and a mouse/keyboard/printer coupled to the I/O interface. Other devices may be coupled to the processing unit, and additional or fewer interface cards may be utilized. For example, a serial interface such as Universal Serial Bus (USB) (not shown) may be used to provide an interface for a printer.
The processing unit also contains one or more network interfaces that may include wired links, such as ethernet cables or the like, and/or wireless links to access nodes or different networks. The network interface allows the processing unit to communicate with remote units via a network. For example, the network interface may provide wireless communication via one or more transmitters/transmit antennas and one or more receivers/receive antennas. In one embodiment, the processing unit is coupled to a local or wide area network for data processing and communication with remote devices, such as other processing units, the internet, remote storage facilities, or the like.
While the invention has been described with reference to illustrative embodiments, this description is not intended to be construed in a limiting sense. Various modifications and combinations of the illustrative embodiments, as well as other embodiments of the invention, will be apparent to persons skilled in the art upon reference to the description. For example, the various embodiments described above may be combined with each other.
Although the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims. For example, many of the features and functions discussed above can be implemented by software, hardware, firmware, or a combination thereof. Moreover, the scope of the present application is not intended to be limited to the particular embodiments of the process, machine, manufacture, composition of matter, means, methods and steps described in the specification. As one of ordinary skill in the art will readily appreciate from the disclosure, processes, machines, manufacture, compositions of matter, means, methods, or steps, presently existing or later to be developed that perform substantially the same function or achieve substantially the same result as the corresponding embodiments described herein may be utilized according to the present disclosure. Accordingly, the appended claims are intended to include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or steps.

Claims (15)

1. A method of decoding an encoded audio bitstream and generating a band extension at a decoder, the method comprising:
decoding the audio bitstream to produce a decoded low-band audio signal and generating a low-band excitation spectrum corresponding to a low-band;
selecting a subband region from within the low band using a parameter indicative of energy information of a spectral envelope of the decoded low band audio signal; wherein the selected subband region is the subband region corresponding to the highest spectral envelope energy;
generating a high-band excitation spectrum of a high-band by copying the sub-band excitation spectrum from within the selected sub-band region to a high sub-band region corresponding to the high-band;
generating an extended highband audio signal using the generated highband excitation spectrum and highband spectral envelope; and
adding the extended high-band audio signal to the decoded low-band audio signal to generate an audio output signal having an extended frequency bandwidth.
2. The method of claim 1, wherein the parameter indicative of energy information of the spectral envelope comprises a parameter reflecting the highest energy of the spectral envelope or a spectral formant peak.
3. The method of claim 1 or 2, wherein the decoding method employs a bandwidth extension technique to generate the high frequency band.
4. The method of claim 1 or 2, wherein employing the high-band spectral envelope comprises employing a high-band filter representing a prediction of the high-band spectral envelope.
5. The method of claim 1 or 2, further comprising:
generating the audio output signal by inverse transforming the frequency domain audio spectrum into the time domain.
6. The method according to claim 1 or 2, wherein copying the subband excitation spectrum from within the selected subband region to the high subband region corresponding to the high frequency band comprises copying the output low-band coefficients from a filter bank analysis to the high subband region.
7. The method of claim 1 or 2, wherein the audio bitstream comprises voiced speech or harmonic music.
8. A decoder for decoding an encoded audio bitstream and generating a bandwidth extension, the decoder comprising:
a low band decoding unit for decoding the audio bitstream to produce a decoded low band audio signal and to generate a low band excitation spectrum corresponding to a low band; and
a bandwidth extension unit coupled to the low-band decoding unit and including a subband selection unit and a copying unit, the subband selection unit being for selecting a subband region from within the low band using a parameter indicating energy information of a spectral envelope of the decoded low-band audio signal, and the copying unit being for generating a high-band excitation spectrum of a high band by copying a subband excitation spectrum from within the selected subband region to a high subband region corresponding to the high band, wherein the selected subband region is the subband region corresponding to the highest spectral envelope energy.
9. The decoder defined in claim 8 wherein the parameter indicative of energy information of the spectral envelope comprises a parameter reflecting spectral envelope energy or spectral formant peaks.
10. The decoder according to claim 8 or 9, further comprising:
a high band signal generator coupled to the replication unit, the high band signal generator to generate a high band time domain signal using the predicted high band spectral envelope; and
an output generator coupled to the high band signal generator and the low band decoding unit, wherein the output generator is configured to generate an audio output signal by combining a low band time domain signal obtained by decoding the audio bitstream with the high band time domain signal.
11. The decoder of claim 10, wherein the highband signal generator is configured to use a predicted highband filter representing the predicted highband spectral envelope.
12. The decoder according to claim 8 or 9, further comprising:
a high-band spectrum generator coupled to the replication unit, the high-band spectrum generator to generate a high-band spectrum of the high-band using the estimated high-band spectral envelope and the high-band excitation spectrum; and
an output spectral generator coupled to the high-band spectral generator and the low-band decoding unit, wherein the output spectral generator is configured to generate a frequency-domain audio spectrum by combining a low-band spectrum obtained by decoding the audio bitstream with the high-band spectrum.
13. The decoder of claim 12, further comprising:
an inverse transform signal generator for generating a time domain audio signal by inverse transforming the frequency domain audio spectrum into a time domain.
14. A speech processing decoder, comprising:
a processor; and
a computer readable storage medium storing a program for execution by the processor, the program including instructions for:
decoding the audio bitstream to produce a decoded low-band audio signal and generating a low-band excitation spectrum corresponding to a low band;
selecting a subband region from within the lowband using a parameter indicative of energy information of a spectral envelope of the decoded lowband audio signal, wherein the selected subband region is a subband region corresponding to a highest spectral envelope energy;
generating a high-band excitation spectrum of a high band by copying the subband excitation spectrum from the selected subband region to a high subband region corresponding to the high band;
generating an extended high-band audio signal using the generated high-band excitation spectrum and a high-band spectral envelope; and
adding the extended high-band audio signal to the decoded low-band audio signal to generate an audio output signal having an extended frequency bandwidth.
15. The decoder defined in claim 14 wherein the parameter indicative of energy information of the spectral envelope comprises a parameter reflecting spectral envelope energy or spectral formant peaks.
HK16108371.4A 2013-09-10 2014-09-09 Adaptive bandwidth extension and apparatus for the same HK1220541B (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US201361875690P 2013-09-10 2013-09-10
US61/875,690 2013-09-10
US14/478,839 2014-09-05
US14/478,839 US9666202B2 (en) 2013-09-10 2014-09-05 Adaptive bandwidth extension and apparatus for the same
PCT/CN2014/086135 WO2015035896A1 (en) 2013-09-10 2014-09-09 Adaptive bandwidth extension and apparatus for the same

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
HK18100183.7A Division HK1240702B (en) 2013-09-10 2016-07-15 Adaptive bandwidth extension and apparatus for the same

Related Child Applications (1)

Application Number Title Priority Date Filing Date
HK18100183.7A Addition HK1240702B (en) 2013-09-10 2016-07-15 Adaptive bandwidth extension and apparatus for the same

Publications (2)

Publication Number Publication Date
HK1220541A1 HK1220541A1 (en) 2017-05-05
HK1220541B true HK1220541B (en) 2018-05-04
