
HK1060430B - Method and apparatus for encoding and decoding of unvoiced speech - Google Patents

Publication number: HK1060430B (application HK04103354.0A)
Authority: HK (Hong Kong)
Prior art keywords: sub-frame, random noise, filter, speech
Other languages: Chinese (zh)
Other versions: HK1060430A1 (en)
Inventor: 黄鹏俊
Original assignee: 高通股份有限公司 (Qualcomm Incorporated)
Priority claimed from US09/690,915 (external-priority patent US6947888B1)


Description

Method and apparatus for encoding and decoding of unvoiced speech
Technical Field
The disclosed embodiments relate to the field of speech processing. More particularly, the disclosed embodiments relate to a novel and improved method and apparatus for low bit rate encoding of unvoiced speech segments.
Background
Transmission of voice by digital techniques has become widespread, particularly in long-distance and digital radiotelephone applications. This, in turn, has created interest in determining the least amount of information that can be sent over a channel while maintaining the perceived quality of the reconstructed speech. If speech is transmitted by simple sampling and digitizing, a data rate on the order of 64 kilobits per second (kbps) is required to achieve the speech quality of conventional analog telephones. However, through the use of speech analysis, followed by appropriate coding, transmission, and re-synthesis at the receiver, a significant reduction in the data rate can be achieved.
Devices that employ techniques to compress speech by extracting parameters related to a model of human speech generation are called speech coders. A speech coder divides the incoming speech signal into blocks of time, or analysis frames. A speech coder typically comprises an encoder and a decoder, or a codec. The encoder analyzes the incoming speech frame to extract certain relevant parameters and then quantizes the parameters into a binary representation, i.e., into a set of bits or a binary data packet. The data packets are transmitted over a communication channel to a receiver and a decoder. The decoder processes the data packets, dequantizes them to produce the parameters, and then re-synthesizes the speech frames using the dequantized parameters.
The function of a speech coder is to compress the digitized speech signal into a low bit rate signal by removing all of the natural redundancies inherent in speech. The digital compression is achieved by representing the input speech frame with a set of parameters and employing quantization to represent the parameters with a set of bits. If the input speech frame has a number of bits Ni and the data packet produced by the speech coder has a number of bits No, the compression factor achieved by the speech coder is Cr = Ni/No. The challenge is to retain high voice quality of the decoded speech while achieving the target compression factor. The performance of a speech coder depends on (1) how well the speech model, or the combination of the analysis and synthesis processes described above, performs, and (2) how well the parameter quantization process performs at the target bit rate of No bits per frame. The goal of the speech model is thus to capture the essence of the speech signal, or the target voice quality, with a small set of parameters for each frame.
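As a small illustration of the compression factor Cr = Ni/No, the sketch below computes it for a hypothetical frame; the sample count, bit width, and packet size are assumed purely for illustration and are not taken from this document.

```python
# Illustrative sketch of the compression factor Cr = Ni/No.
# All numeric values below are assumptions for the example.

def compression_factor(n_in_bits: int, n_out_bits: int) -> float:
    """Cr = Ni/No: bits entering the encoder per frame over bits in the packet."""
    return n_in_bits / n_out_bits

# A 20 ms frame of 160 samples at an assumed 16 bits/sample:
n_in = 160 * 16          # 2560 bits entering the encoder per frame
n_out = 40               # e.g. a 40-bit packet (2 kbps at 50 frames/s)
print(compression_factor(n_in, n_out))  # 64.0
```

A 64x compression factor like this is typical of the gap between raw digitized speech and a quarter-rate coded stream.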
A speech coder may be implemented as a time-domain coder, which attempts to capture the time-domain speech waveform by employing high time-resolution processing to encode small segments of speech (typically 5 millisecond (ms) subframes) at a time. For each subframe, a high-precision representative from a codebook space is found by means of various search algorithms known in the art. Alternatively, the speech coder may be implemented as a frequency-domain coder, which attempts to capture the short-term speech spectrum of the input speech frame with a set of parameters (analysis) and to re-create the speech waveform from the spectral parameters with a corresponding synthesis process. The parameter quantizer preserves the parameters by representing them with stored representations of code vectors in accordance with known quantization techniques described in A. Gersho & R.M. Gray, Vector Quantization and Signal Compression (1992).
One well-known time-domain speech coder is the Code Excited Linear Predictive (CELP) coder described in L.B. Rabiner & R.W. Schafer, Digital Processing of Speech Signals 396-453 (1978). In a CELP coder, the short-term correlations, or redundancies, in the speech signal are removed by a linear prediction (LP) analysis, which finds the coefficients of a short-term formant filter. Applying the short-term prediction filter to the incoming speech frame generates an LP residual signal, which is further modeled and quantized with long-term prediction filter parameters and a subsequent stochastic codebook. CELP coding thus divides the task of encoding the time-domain speech waveform into the separate tasks of encoding the LP short-term filter coefficients and encoding the LP residual. Time-domain coding can be performed at a fixed rate (i.e., using the same number of bits No for each frame) or at a variable rate (in which different bit rates are used for different types of frame contents). Variable-rate coders attempt to use only the amount of bits needed to encode the codec parameters to a level adequate to obtain a target quality. An exemplary variable rate CELP coder is described in U.S. Patent No. 5,414,796, which is assigned to the assignee of the presently disclosed embodiments and incorporated herein by reference in its entirety.
Time-domain coders such as the CELP coder typically rely upon a high number of bits No per frame to preserve the accuracy of the time-domain speech waveform. Such coders typically deliver excellent voice quality provided the number of bits No per frame is relatively large (e.g., 8 kbps or above). However, at low bit rates (4 kbps and below), time-domain coders fail to retain high quality and robust performance due to the limited number of available bits. At low bit rates, the limited codebook space clips the waveform-matching capability of conventional time-domain coders, which are so successfully deployed in higher-rate commercial applications.
Generally, the CELP scheme applies a Short Term Prediction (STP) filter and a Long Term Prediction (LTP) filter. An analysis by synthesis (AbS) method is used at the encoder to find the LTP delay and gain and the optimal random codebook gain and index. Current state-of-the-art coders, such as the Enhanced Variable Rate Coder (EVRC), can achieve good quality of synthesized speech at data rates of about 8 kbits per second.
It is also known that unvoiced speech does not exhibit periodicity. The bandwidth consumed by encoding the LTP filter in conventional CELP schemes is not used as efficiently for unvoiced speech as it is for voiced speech, where the periodicity of the speech is strong and LTP filtering is meaningful. Therefore, a more efficient (i.e., lower bit rate) coding scheme is desirable for unvoiced speech.
For coding at lower bit rates, various methods of spectral, or frequency-domain, coding of speech have been developed, in which the speech signal is analyzed as a time-varying evolution of spectra. See, e.g., R.J. McAulay & T.F. Quatieri, "Sinusoidal Coding," in Speech Coding and Synthesis ch. 4 (W.B. Kleijn & K.K. Paliwal eds., 1995). In spectral coders, the objective is to model, or predict, the short-term speech spectrum of each input frame of speech with a set of spectral parameters, rather than to precisely mimic the time-varying speech waveform. The spectral parameters are then encoded, and an output frame of speech is created with the decoded parameters. The resulting synthesized speech does not match the original input speech waveform but offers similar perceived quality. Examples of frequency-domain coders that are well known in the art include multiband excitation coders (MBEs), sinusoidal transform coders (STCs), and harmonic coders (HCs). Such frequency-domain coders offer a high-quality parametric model having a compact set of parameters that can be accurately quantized with the low number of bits available at low bit rates.
Nevertheless, low bit rate coding imposes the critical constraint of a limited coding resolution, or a limited codebook space, which limits the effectiveness of a single coding mechanism, rendering the coder unable to represent various types of speech segments under various background conditions with equal accuracy. For example, conventional low bit rate frequency-domain coders do not transmit phase information for speech frames. Instead, the phase information is reconstructed by using a random, artificially generated initial phase value and linear interpolation techniques. See, e.g., H. Yang et al., "Quadratic Phase Interpolation for Voiced Speech Synthesis in the MBE Model," 29 Electronic Letters 856-57 (May 1993). Because the phase information is artificially generated, even if the amplitudes of the sinusoids are perfectly preserved by the quantization-dequantization process, the output speech produced by the frequency-domain coder will not be aligned with the original input speech (i.e., the major pulses will not be synchronous). It has therefore proven difficult to adopt any closed-loop performance measure, such as signal-to-noise ratio (SNR) or perceptual SNR, in frequency-domain coders.
One effective technique to encode speech efficiently at low bit rates is multimode coding. Multimode coding techniques have been employed together with an open-loop mode decision process to perform low-rate speech coding. One such multimode coding technique is described in Amitava Das et al., "Multimode and Variable-Rate Coding of Speech," in Speech Coding and Synthesis ch. 7 (W.B. Kleijn & K.K. Paliwal eds., 1995). Conventional multimode coders apply different modes, or encoding-decoding algorithms, to different types of input speech frames. Each mode, or encoding-decoding process, is customized to represent a certain type of speech segment, such as voiced speech, unvoiced speech, or background noise (nonspeech), in the most efficient manner. An external, open-loop mode decision mechanism examines the input speech frame and makes a decision regarding which mode to apply to the frame. The open-loop mode decision is typically performed by extracting a number of parameters from the input frame, evaluating the parameters as to certain temporal and spectral characteristics, and basing a mode decision upon the evaluation. The mode decision is thus made without knowing in advance the exact condition of the output speech, i.e., how close the output speech will be to the input speech in terms of voice quality or other performance measures. An exemplary open-loop mode decision for a speech codec is described in U.S. Patent No. 5,414,796, which is assigned to the assignee of the presently disclosed embodiments and incorporated herein by reference in its entirety.
Multimode coding can be fixed-rate, using the same number of bits No for each frame, or variable-rate, in which different bit rates are used for different modes. The goal in variable-rate coding is to use only the amount of bits needed to encode the codec parameters to a level adequate to obtain the target quality. As a result, the same target voice quality as that of a fixed-rate, higher-rate coder can be obtained at a significantly lower average rate using variable-bit-rate (VBR) techniques. An exemplary variable rate speech coder is described in U.S. Patent No. 5,414,796, which is assigned to the assignee of the presently disclosed embodiments and incorporated herein by reference in its entirety.
There is presently a surge of research interest and strong commercial need to develop a high-quality speech coder operating at medium to low bit rates (i.e., in the range of 2.4 to 4 kbps and below). The application areas include wireless telephony, satellite communications, Internet telephony, various multimedia and voice-streaming applications, voice mail, and other voice storage systems. The driving forces are the need for high capacity and the demand for robust performance under packet loss situations. Various recent speech coding standardization efforts are another direct driving force propelling research and development of low-rate speech coding algorithms. A low-rate speech coder creates more channels, or users, per allowable application bandwidth, and a low-rate speech coder coupled with an additional layer of suitable channel coding can fit the overall bit budget of coder specifications and deliver robust performance under channel error conditions.
Multimode VBR speech coding is therefore an effective mechanism to encode speech at low bit rates. Conventional multimode schemes require the design of efficient coding schemes, or modes, for various segments of speech (unvoiced, voiced, transition) as well as a mode for background noise, or silence. The overall performance of the speech coder depends on how well each mode performs, and the average rate of the coder depends on the bit rates of the different modes for unvoiced, voiced, and other segments of speech. In order to achieve the target quality at a low average rate, it is necessary to design efficient, high-performance modes, some of which must work at low bit rates. Typically, voiced and unvoiced speech segments are captured at high bit rates, while background noise and silence segments are represented with modes working at significantly lower rates. Thus, there is a need for a high-performance, low-bit-rate coding technique that accurately captures a high percentage of unvoiced segments of speech while using only a minimal number of bits per frame.
Disclosure of Invention
Embodiments of the present disclosure are directed to a high-performance, low-bit-rate coding technique that accurately captures unvoiced segments of speech while using only a minimal number of bits per frame. In one embodiment, a method of decoding unvoiced speech segments includes recovering a set of quantized gains using received indices for a plurality of subframes; generating a random noise signal comprising random numbers for each of the plurality of subframes; selecting, for each of the plurality of subframes, a predetermined percentage of the highest-amplitude random numbers of the random noise signal; scaling the selected highest-amplitude random numbers by the recovered gains for each subframe to produce a scaled random noise signal; band-pass filtering and shaping the scaled random noise signal; and selecting a second filter based on a received filter selection indication and further shaping the scaled random noise signal with the selected filter.
Drawings
The features, objects, and advantages of embodiments of the present disclosure will become more apparent from the detailed description set forth below when taken in conjunction with the drawings. In the drawings, like reference numerals correspond to like parts throughout. In the drawings:
FIG. 1 is a block diagram of a communication channel terminated at each end by speech coders;
FIG. 2A is a block diagram of an encoder that can be used in a high performance low bit rate speech encoder;
FIG. 2B is a block diagram of a decoder that can be used in a high performance low bit rate speech encoder;
FIG. 3 depicts a high performance low bit rate unvoiced speech encoder that can be used in the encoder of FIG. 2A;
FIG. 4 depicts a high performance low bit rate unvoiced speech decoder that can be used in the decoder of FIG. 2B;
FIG. 5 is a flow chart depicting the encoding steps of a high performance low bit rate encoding technique for unvoiced speech;
FIG. 6 is a flow chart depicting the decoding steps of a high performance low bit rate encoding technique for unvoiced speech;
FIG. 7A is a graph of the frequency response of a low pass filter applied in an energy analysis;
FIG. 7B is a graph of the frequency response of a high pass filter applied in an energy analysis;
FIG. 8A is a graph of the frequency response of band pass filtering applied in perceptual filtering;
FIG. 8B is a graph of the frequency response of the initial shaping filter applied in the perceptual filtering;
FIG. 8C is a graph of the frequency response of one shaping filter that may be applied in the final perceptual filtering;
FIG. 8D is a graph of the frequency response of another shaping filter that may be applied in the final perceptual filtering.
Detailed Description
Embodiments of the present disclosure provide a method and apparatus for high-performance, low-bit-rate coding of unvoiced speech. The unvoiced speech signal is digitized and converted into frames of samples. Each frame of unvoiced speech is filtered by a short-term prediction filter to generate short-term signal blocks, and each frame is divided into a plurality of subframes. A gain is then calculated for each subframe. These gains are quantized and transmitted. A block of random noise is then generated and filtered by the methods described in detail below. The filtered random noise is scaled by the quantized subframe gains to form a quantized signal representing the short-term signal. At the decoder, a frame of random noise is generated and filtered in the same manner as the random noise at the encoder. The filtered random noise at the decoder is then scaled by the received subframe gains and passed through a short-term prediction filter to form a frame of synthesized speech representing the original samples.
The disclosed embodiments present a novel coding technique for a variety of unvoiced speech. At a rate of 2 kbps, the quality of the synthesized unvoiced speech is perceptually equivalent to that produced by a conventional CELP scheme requiring a much higher data rate. In accordance with embodiments of the present disclosure, a high percentage (approximately twenty percent) of unvoiced speech segments can be encoded at this low rate.
In FIG. 1, a first encoder 10 receives digitized speech samples s(n) and encodes the samples s(n) for transmission over a transmission medium 12, or communication channel 12, to a first decoder 14. The decoder 14 decodes the encoded speech samples and synthesizes an output speech signal sSYNTH(n). For transmission in the opposite direction, a second encoder 16 encodes digitized speech samples s(n), which are transmitted over a communication channel 18. A second decoder 20 receives and decodes the encoded speech samples, generating a synthesized output speech signal sSYNTH(n).
The speech samples s(n) represent speech signals that have been digitized and quantized in accordance with any of various methods known in the art, including, e.g., pulse code modulation (PCM), companded mu-law, or A-law. As known in the art, the speech samples s(n) are organized into frames of input data, wherein each frame comprises a predetermined number of digitized speech samples s(n). In an exemplary embodiment, a sampling rate of 8 kHz is employed, with each 20 ms frame comprising 160 samples. In the embodiments described below, the rate of data transmission may be varied on a frame-to-frame basis from 8 kbps (full rate) to 4 kbps (half rate) to 2 kbps (quarter rate) to 1 kbps (eighth rate). Alternatively, other data rates may be used. As used herein, the terms "full rate" or "high rate" generally refer to data rates that are greater than or equal to 8 kbps, and the terms "half rate" or "low rate" generally refer to data rates that are less than or equal to 4 kbps. Varying the data transmission rate is beneficial because lower bit rates may be employed selectively for frames containing relatively less speech information. As understood by those skilled in the art, other sampling rates, frame sizes, and data transmission rates may be used.
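The frame-size and per-frame bit-budget arithmetic implied above (8 kHz sampling, 20 ms frames, and the four listed transmission rates) works out as follows:

```python
# Frame-size and bit-budget arithmetic for the rates named in the text.

SAMPLE_RATE_HZ = 8000
FRAME_MS = 20

samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000   # 8000 * 0.020 = 160
frames_per_second = 1000 // FRAME_MS                    # 50 frames/s

rates_bps = {"full": 8000, "half": 4000, "quarter": 2000, "eighth": 1000}
bits_per_frame = {name: bps // frames_per_second for name, bps in rates_bps.items()}

print(samples_per_frame)   # 160
print(bits_per_frame)      # {'full': 160, 'half': 80, 'quarter': 40, 'eighth': 20}
```

So a quarter-rate (2 kbps) unvoiced frame, as targeted by this disclosure, must fit in a 40-bit packet.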
The first encoder 10 and the second decoder 20 together comprise a first speech coder, or speech codec. Similarly, the second encoder 16 and the first decoder 14 together comprise a second speech coder. It is understood by those of skill in the art that speech coders may be implemented with a digital signal processor (DSP), an application-specific integrated circuit (ASIC), discrete gate logic, firmware, or any conventional programmable software module and a microprocessor. The software module could reside in RAM memory, flash memory, registers, or any other form of writable storage medium known in the art. Alternatively, any conventional processor, controller, or state machine could be substituted for the microprocessor. Exemplary ASICs designed specifically for speech coding are described in U.S. Patent No. 5,727,123, which is assigned to the assignee of the presently disclosed embodiments and incorporated herein by reference in its entirety, and in U.S. Patent No. 5,784,532, entitled "Application Specific Integrated Circuit (ASIC) for Performing Rapid Speech Compression in a Mobile Telephone System," which is assigned to the assignee of the presently disclosed embodiments and incorporated herein by reference in its entirety.
Fig. 2A is a block diagram of an encoder (10, 16) depicted in fig. 1 to which embodiments of the present disclosure may be applied. A speech signal s (n) is filtered by a short-term prediction filter 200. The speech itself s (n) and/or the linear prediction residual signal r (n) at the output of the short-term prediction filter 200 provide input to a speech classifier 202.
The output of the speech classifier 202 provides an input to the switch 203, enabling the switch 203 to select the corresponding mode encoder (204, 206) based on the classified mode of the speech. Those skilled in the art will appreciate that the speech classifier 202 is not limited to voiced and unvoiced speech classifications, but may also classify transitions, background noise (silence), or other types of speech.
The voiced speech encoder 204 encodes voiced speech by any conventional method, such as CELP or prototype waveform interpolation (PWI).
The unvoiced speech encoder 206 encodes unvoiced speech at a low bit rate in accordance with the embodiments described below. The unvoiced speech encoder 206 is described in detail with reference to FIG. 3 in accordance with one embodiment.
After encoding by encoder 204 or encoder 206, a multiplexer 208 forms a packet bitstream comprising the data packets, the speech mode, and other encoded parameters for transmission.
Fig. 2B is a block diagram of a decoder (14, 20) depicted in fig. 1 to which embodiments of the present disclosure may be applied.
The demultiplexer 210 receives a packet bitstream, demultiplexes the data from the bitstream, and recovers the data packets, the speech mode, and other encoded parameters.
The output of the demultiplexer 210 provides an input to a switch 211 that enables the switch 211 to select a corresponding mode decoder (212, 214) based on the classified mode of speech. Those skilled in the art will appreciate that switch 211 is not limited to voiced and unvoiced speech patterns, and may also recognize transitions, background noise (silence), or other types of speech.
The voiced speech decoder 212 decodes voiced speech by performing the inverse operations of the voiced speech encoder 204.
In one embodiment, unvoiced speech decoder 214 decodes unvoiced speech transmitted at a low bit rate, as described in detail below with reference to fig. 4.
After decoding by decoder 212 or decoder 214, the synthesized linear prediction residual signal is filtered by a short-term prediction filter 216. The synthesized speech at the output of the short-term prediction filter 216 is passed to a post-filter processor 218, which produces the final output speech.
FIG. 3 is a detailed block diagram of the high performance low bit rate unvoiced speech encoder 206 depicted in FIG. 2A. FIG. 3 details the apparatus and sequence of operations of one embodiment of an unvoiced speech encoder.
The digitized speech samples s (n) are input to a Linear Predictive Coding (LPC) analyzer 302 and an LPC filter 304. The LPC analyzer 302 generates Linear Prediction (LP) coefficients of the digitized speech samples. The LPC filter 304 produces a speech residual signal r (n) which is input to a gain calculation component 306 and a non-scaled band energy analyzer 314.
The gain calculation component 306 partitions each frame of digitized speech samples into subframes, calculates a set of codebook gains (hereinafter referred to as gains or indices) for each subframe, divides the gains into subgroups, and normalizes the gains of each subgroup. The speech residual signal r(n), n = 0, …, N-1, is segmented into K subframes, where N is the number of residual samples in a frame. In one embodiment, K = 10 and N = 160. A gain g(i), i = 0, …, K-1, is calculated for each subframe as follows:

[Gain equations not reproduced in the source text.]
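The per-subframe gain calculation can be sketched as follows. Because the exact gain equations are not reproduced in this text, the common RMS-energy definition of a subframe gain is assumed here purely for illustration; the function name and toy input are likewise illustrative.

```python
import math

def subframe_gains(residual, num_subframes=10):
    """Split a residual frame into K subframes and compute one gain per subframe.

    The patent's exact gain equation is not reproduced in this text; this
    sketch assumes the common RMS-energy definition as an illustration.
    """
    n = len(residual)
    assert n % num_subframes == 0
    length = n // num_subframes          # e.g. 160 / 10 = 16 samples per subframe
    gains = []
    for i in range(num_subframes):
        block = residual[i * length:(i + 1) * length]
        gains.append(math.sqrt(sum(x * x for x in block) / length))
    return gains

frame = [((-1) ** k) * 0.5 for k in range(160)]  # toy residual frame
print(subframe_gains(frame))  # ten gains, each 0.5 for this toy input
```

With K = 10 and N = 160 this yields the ten gains per frame described above.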
the gain quantizer 308 quantizes the K gains, and gain codebook indices for the gains are successively transmitted. The quantization may be performed with a conventional linear or vector quantization scheme or with any other variant. One embodied scheme is multi-level vector quantization.
The residual signal r(n) output from the LPC filter 304 is passed through a low pass filter and a high pass filter in the non-scaled band energy analyzer 314, which computes three energy values for r(n): E1, the total energy of the residual signal r(n); Elp1, the low band energy of r(n); and Ehp1, the high band energy of r(n). In one embodiment, the frequency responses of the low pass filter and the high pass filter of the non-scaled band energy analyzer 314 are shown in FIGS. 7A and 7B, respectively. The energy values E1, Elp1, and Ehp1 are calculated as follows:

[Energy equations not reproduced in the source text.]
The energy values E1, Elp1, and Ehp1 are later used to select the shaping filter in the final shaping filter 316 that makes the processed random noise signal most closely resemble the original residual signal.
For each of the K subframes, the random number generator 310 generates unit-variance random numbers evenly distributed between -1 and +1. The random number selector 312 discards most of the low-amplitude random numbers in each subframe, retaining a portion of the highest-amplitude random numbers. In one embodiment, the retained portion of random numbers is 25%.
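The selection step can be sketched as follows. Zeroing the discarded low-amplitude samples (rather than, say, compacting the retained ones) is an assumption about how the selection is realized, and the subframe length and keep fraction follow the embodiment described above.

```python
import random

def select_top_amplitudes(noise, keep_fraction=0.25):
    """Keep only the highest-amplitude fraction of a noise subframe.

    Sketch of the random number selector described above; zeroing the
    discarded samples is an assumption about how 'selection' is realized.
    """
    k = max(1, int(len(noise) * keep_fraction))
    threshold = sorted((abs(x) for x in noise), reverse=True)[k - 1]
    return [x if abs(x) >= threshold else 0.0 for x in noise]

rng = random.Random(0)                                   # seeded for reproducibility
subframe = [rng.uniform(-1.0, 1.0) for _ in range(16)]   # noise on [-1, +1]
kept = select_top_amplitudes(subframe)
print(sum(1 for x in kept if x != 0.0))  # 4 of 16 samples survive
```

Because the decoder runs an identical seeded generator and selector, no noise samples need to be transmitted; only the gains and the filter selection indication are sent.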
The random numbers output from the random number selector 312 for each subframe are then multiplied in multiplier 307 by the respective quantized gain for that subframe output from the gain quantizer 308. The scaled random signal output of multiplier 307 is then processed by perceptual filtering.
To improve the perceptual quality and preserve the naturalness of the quantized unvoiced speech, a two-step perceptual filtering process is performed on the scaled random signal.
In the first step of the perceptual filtering process, the scaled random signal is passed through two fixed filters in the perceptual filter 318. The first fixed filter of the perceptual filter 318 is a band pass filter 320, which removes low-end and high-end frequencies from the scaled random signal. In one embodiment, the frequency response of the band pass filter 320 is depicted in FIG. 8A. The second fixed filter of the perceptual filter 318 is a perceptual shaping filter 322, through which the output of the band pass filter 320 is passed. In one embodiment, the frequency response of the perceptual shaping filter 322 is depicted in FIG. 8B.
The outputs of the band pass filter 320 and the perceptual shaping filter 322 are calculated as follows:

[Filtering equations not reproduced in the source text.]
The energies of the signals output by elements 320 and 322 are denoted E2 and E3, respectively, and are calculated as follows:

[Energy equations not reproduced in the source text.]
In the second step of the perceptual filtering process, the signal output from the perceptual shaping filter 322 is scaled, on the basis of E1 and E2, to have the same energy as the original residual signal r(n) output from the LPC filter 304.
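The energy-matching operation of this second step can be sketched as follows, assuming the standard scale factor sqrt(E_target/E_signal); the source's exact scaling equation is not reproduced in this text, so this is an illustration rather than the patented formula.

```python
import math

def match_energy(signal, target_energy):
    """Scale `signal` so that its energy equals `target_energy`.

    Sketch of the second perceptual-filtering step: matching the shaped
    random signal to the energy E1 of the original residual. The
    sqrt(E_target/E_signal) scale factor is an assumed standard choice.
    """
    e = sum(x * x for x in signal)
    if e == 0.0:
        return list(signal)
    scale = math.sqrt(target_energy / e)
    return [scale * x for x in signal]

shaped = [0.1, -0.2, 0.3, -0.4]
out = match_energy(shaped, target_energy=1.0)
print(round(sum(x * x for x in out), 6))  # 1.0
```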
In the scaled band energy analyzer 324, the scaled and filtered random signal output by element 322 is subjected to the same band energy analysis previously performed on the original residual signal r(n) by the non-scaled band energy analyzer 314.
The low band energy of the scaled and filtered random signal is denoted Elp2, and its high band energy is denoted Ehp2. These are compared with the high and low band energies of r(n) to determine the next shaping filter to use in the final shaping filter 316. Based on this comparison, either no additional filtering is applied, or one of two fixed shaping filters is selected to produce the closest match between the random signal and r(n). Thus, the final filter shaping (or the absence of additional filtering) is determined by comparing the band energies of the original signal with the band energies of the random signal.
The ratio Rl of the low band energy of the original signal to the low band energy of the scaled pre-filtered random signal is calculated as follows:
Rl = 10*log10(Elp1/Elp2).
The ratio Rh of the high band energy of the original signal to the high band energy of the scaled pre-filtered random signal is calculated as follows:
Rh = 10*log10(Ehp1/Ehp2).
If the ratio Rl is less than -3, the high pass final shaping filter (filter 2) is applied to further shape the signal.
If the ratio Rh is less than -3, the low pass final shaping filter (filter 3) is applied to further shape the signal.
Otherwise, the signal receives no further filtering.
The output from the final shaping filter 316 is the quantized random residual signal, which is scaled to have the same energy as the signal input to the final shaping filter.
Fig. 8C shows the frequency response of the high-pass final shaping filter (filter 2). Fig. 8D shows the frequency response of the low-pass final shaping filter (filter 3).
A filter selection indication is generated to indicate which filter (filter 2, filter 3, or no filter) was selected for the final filtering. The filter selection indication is then transmitted so that the decoder can replicate the final filtering. In one embodiment, the filter selection indication comprises two bits.
FIG. 4 is a detailed block diagram of the high performance low bit rate unvoiced speech decoder 214 depicted in FIG. 2B. FIG. 4 details the apparatus and sequence of operations of one embodiment of an unvoiced speech decoder. The unvoiced speech decoder receives unvoiced data packets and synthesizes unvoiced speech from the packets by performing the inverse operations of the unvoiced speech encoder 206 depicted in FIG. 2A.
The unvoiced data packets are input to the gain dequantizer 406. The gain dequantizer 406 performs the inverse operation of the gain quantizer 308 in the unvoiced speech encoder depicted in FIG. 3. The output of the gain dequantizer 406 is the K quantized unvoiced gains.
The random number generator 402 and the random number selector 404 perform exactly the same operations as the random number generator 310 and the random number selector 312 in the unvoiced speech encoder of FIG. 3.
The random numbers output from the random number selector 404 for each subframe are then multiplied in multiplier 405 by the respective dequantized gain for that subframe output from the gain dequantizer 406. The scaled random signal output of multiplier 405 is then processed by perceptual filtering.
A two-step perceptual filtering process identical to the perceptual filtering process of the unvoiced speech encoder of FIG. 3 is then performed. The perceptual filter 408 performs exactly the same operations as the perceptual filter 318 in the unvoiced speech encoder of FIG. 3. The scaled random signal is passed through two fixed filters in the perceptual filter 408: the band pass filter 407 and the initial shaping filter 409, which are identical to the band pass filter 320 and the perceptual shaping filter 322 used in the perceptual filter 318 of the unvoiced speech encoder of FIG. 3. The outputs of the band pass filter 407 and the initial shaping filter 409 are computed as in the unvoiced speech encoder of FIG. 3.
The signal is then filtered by the final shaping filter 410. The final shaping filter 410 is the same as the final shaping filter 316 in the unvoiced speech encoder of fig. 3. The final shaping filter 410 performs high-pass final shaping filtering, low-pass final shaping filtering, or no final filtering, as determined by the filter selection indication generated at the unvoiced speech encoder of fig. 3 and received in the data bit packet at the decoder 214. The quantized signal output from the final shaping filter 410 is scaled to have the same energy as the signal output of the band-pass filter 407.
The quantized random signal is filtered by the LPC synthesis filter 412 to produce a synthesized speech signal.
A post-filter 414 may then be applied to the synthesized speech signal to produce the final output speech.
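The LPC synthesis step above can be sketched as a direct-form all-pole filter. This is a minimal illustrative sketch, not the patent's implementation; the function name and the coefficient sign convention A(z) = 1 - sum(a_k * z^-k) are assumptions:

```python
def lpc_synthesize(residual, lpc_coeffs):
    """All-pole LPC synthesis: s(n) = r(n) + sum_k a_k * s(n-k).

    `residual` is the quantized random excitation; `lpc_coeffs` are the
    quantized LPC coefficients a_1..a_p, assuming the convention
    A(z) = 1 - sum_k a_k z^-k.
    """
    out = [0.0] * len(residual)
    for n, r in enumerate(residual):
        acc = r
        # Feed back previously synthesized samples through the predictor.
        for k, a in enumerate(lpc_coeffs, start=1):
            if n - k >= 0:
                acc += a * out[n - k]
        out[n] = acc
    return out
```

For a single-pole example, an impulse excitation decays geometrically: `lpc_synthesize([1.0, 0.0, 0.0], [0.5])` yields `[1.0, 0.5, 0.25]`.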
FIG. 5 is a flow chart depicting the encoding steps of a high performance low bit rate encoding technique for unvoiced speech.
In step 502, a frame of unvoiced digitized speech samples is provided to an unvoiced speech encoder (not shown). A new frame is provided every 20 milliseconds. In an embodiment in which unvoiced speech is sampled at a rate of 8 kilohertz, a frame contains 160 samples. The flow of control proceeds to step 504.
In step 504, the frame of data is filtered by the LPC filter to produce a frame of residual signal. The flow of control proceeds to step 506.
Steps 506-516 describe the method steps for gain calculation and quantization of the residual signal frame.
In step 506, the residual signal frame is decomposed into sub-frames. In one embodiment, each frame is decomposed into ten subframes of sixteen samples each. The flow of control proceeds to step 508.
At step 508, a gain is calculated for each subframe. In one embodiment, ten subframe gains are calculated. The flow of control proceeds to step 510.
At step 510, the subframe gains are decomposed into subgroups. In one embodiment, the 10 subframe gains are decomposed into two subgroups of five subframe gains each. The flow of control proceeds to step 512.
At step 512, the gain of each subgroup is normalized to produce a normalization factor for each subgroup. In one embodiment, two normalization factors are generated for two subgroups of five gains each. The flow of control proceeds to step 514.
In step 514, the normalization factor generated in step 512 is converted to a logarithmic domain or exponential form and then quantized. In one embodiment, a quantized normalization factor, which will be referred to as index 1 hereinafter, is generated. Control passes to step 516.
In step 516, the normalized gain for each subgroup generated in step 512 is quantized. In one embodiment, the two subgroups are quantized to produce two quantized gain values, which will be referred to as index 2 and index 3 in the following. Control proceeds to step 518.
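The gain computation and normalization of steps 506-514 can be sketched as follows. The RMS gain definition and the max-based normalization factor are assumptions made for illustration; the patent does not fix these formulas, and the quantization codebooks of steps 514-516 are omitted:

```python
import math

def subframe_gains(residual, n_sub=10):
    """Split a residual frame (e.g. 160 samples) into n_sub subframes
    and compute a gain for each. RMS is one plausible gain definition;
    the patent does not specify the exact formula."""
    L = len(residual) // n_sub
    return [math.sqrt(sum(x * x for x in residual[i * L:(i + 1) * L]) / L)
            for i in range(n_sub)]

def normalize_subgroups(gains, n_groups=2):
    """Split the subframe gains into n_groups subgroups and normalize
    each subgroup by its own factor, returning (normalized_subgroups,
    factors). Using the subgroup maximum as the factor is a hypothetical
    choice."""
    size = len(gains) // n_groups
    groups, factors = [], []
    for g in range(n_groups):
        sub = gains[g * size:(g + 1) * size]
        factor = max(sub) or 1.0  # guard against an all-zero subgroup
        factors.append(factor)
        groups.append([x / factor for x in sub])
    return groups, factors
```

With a 160-sample frame, `subframe_gains` yields the ten subframe gains of the described embodiment, and `normalize_subgroups` yields the two subgroups of five gains plus their two normalization factors.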
Steps 518-520 describe the method steps for generating a randomly quantized non-speech signal.
At step 518, a random noise signal is generated for each sub-frame. A predetermined percentage of the highest amplitude random numbers generated are selected for each subframe. The unselected numbers are set to zero. In one embodiment, the percentage of random numbers selected is 25%. The flow of control proceeds to step 520.
In step 520, the selected random numbers are scaled by the quantization gain of each sub-frame generated in step 516. The flow of control proceeds to step 522.
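Steps 518-520 can be sketched as follows. The Gaussian noise source and the function name are illustrative assumptions; the selection of the highest-amplitude 25% of the random numbers and the scaling by the subframe gain follow the embodiment described above:

```python
import random

def sparse_scaled_noise(n_samples, gain, keep_fraction=0.25, rng=None):
    """Generate random noise for one subframe, keep only the
    highest-amplitude `keep_fraction` of the samples (zeroing the rest),
    and scale the survivors by the subframe's quantized gain."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    noise = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    n_keep = max(1, int(n_samples * keep_fraction))
    # Amplitude of the n_keep-th largest sample becomes the cutoff.
    threshold = sorted((abs(x) for x in noise), reverse=True)[n_keep - 1]
    return [gain * x if abs(x) >= threshold else 0.0 for x in noise]
```

For a 16-sample subframe with 25% selection, exactly four samples survive, matching the ten-subframes-of-sixteen-samples embodiment.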
Steps 522-528 describe the method steps for perceptually filtering the random signal. The perceptual filtering of steps 522-528 improves perceptual quality and preserves the natural character of the randomly quantized unvoiced speech signal.
At step 522, the random quantized unvoiced speech signal is band-pass filtered to eliminate high-end and low-end components. The flow of control proceeds to step 524.
At step 524, a fixed preliminary shaping filter is applied to the random quantized unvoiced speech signal. The flow of control proceeds to step 526.
At step 526, the random signal and the initial residual signal are analyzed for low band energy and high band energy. Control passes to step 528.
At step 528, the energy analysis of the initial residual signal is compared to the energy analysis of the random signal to determine if further filtering of the random signal is necessary. Based on this analysis, either no filtering is performed or one of two predetermined final filters is selected to further filter the random signal. The two predetermined final filters are a high-pass final shaping filter and a low-pass final shaping filter. A filter selection indication is generated to indicate to the decoder which last filter (or no filter) was applied. In one embodiment, the filter selection indication information is 2 bits. Control passes to step 530.
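The filter decision of steps 526-528 can be sketched as follows. The decision thresholds and the 2-bit codes are hypothetical; the patent specifies only that a high-pass final shaping filter, a low-pass final shaping filter, or no filter is selected based on the band-energy comparison and signalled with a 2-bit indication:

```python
def select_final_filter(resid_low, resid_high, noise_low, noise_high):
    """Pick the final shaping filter by comparing the band-energy split
    of the original residual against that of the shaped random signal.

    Inputs are low-band and high-band energies from the two band energy
    analyzers. The 0.1 margin and the bit codes are assumptions."""
    resid_ratio = resid_high / (resid_low + resid_high)
    noise_ratio = noise_high / (noise_low + noise_high)
    if resid_ratio > noise_ratio + 0.1:   # residual is noticeably brighter
        return "highpass", 0b01
    if resid_ratio < noise_ratio - 0.1:   # residual is noticeably darker
        return "lowpass", 0b10
    return "none", 0b00                   # energies already match
```

For example, a residual with most of its energy in the high band while the shaped noise is flat selects the high-pass final shaping filter.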
In step 530, the index for the quantized normalization factor generated in step 514, the indices for the quantized subgroup gains generated in step 516, and the filter selection indication generated in step 528 are transmitted. In one embodiment, index 1, index 2, index 3, and a 2-bit final filter selection indication are transmitted. Including the bits needed to transmit the quantized LPC parameter indices, the bit rate of one embodiment is 2 kilobits per second. (Quantization of the LPC parameters is outside the scope of the disclosed embodiments.)
FIG. 6 is a flow chart depicting the decoding steps of a high performance low bit rate encoding technique for unvoiced speech.
In step 602, a normalization factor index, quantized subgroup gain indices, and a final filter selection indication are received for an unvoiced frame. In one embodiment, index 1, index 2, index 3, and a 2-bit final filter selection indication are received. The flow of control proceeds to step 604.
At step 604, the normalization factor index is used to recover the normalization factor from the lookup table. The normalization factor is converted from a logarithmic domain or exponential form to a linear form. The flow of control proceeds to step 606.
At step 606, the gain index is used to recover the gain from the lookup table. The recovered gain is scaled by the recovered normalization factor to recover the quantized gain for each subset of the original frame. The flow of control proceeds to step 608.
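The gain recovery of steps 604-606 can be sketched as follows. The codebook contents and the base-2 logarithmic domain are assumptions; the patent specifies only the lookup, log-to-linear conversion, and rescale structure:

```python
def recover_gains(norm_index, gain_indices, norm_table, gain_table):
    """Decoder-side gain recovery: look up the normalization factor and
    the subgroup gains from codebook tables, convert the factor from the
    log domain to linear form, and rescale the gains by it.

    `norm_table` and `gain_table` stand in for the decoder's lookup
    tables; their contents here are hypothetical."""
    factor = 2.0 ** norm_table[norm_index]  # log2-domain factor, assumed
    return [factor * gain_table[i] for i in gain_indices]
```

For instance, a stored log-domain factor of 1.0 converts to a linear factor of 2.0, which then rescales each looked-up subgroup gain.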
In step 608, a random noise signal is generated for each sub-frame, exactly as in encoding. A predetermined percentage of the highest amplitude random numbers generated are selected for each subframe. The unselected numbers are set to zero. In one embodiment, the percentage of random numbers selected is 25%. The flow of control proceeds to step 610.
In step 610, the selected random number is scaled by the quantization gain of each sub-frame recovered in step 606.
Steps 612-616 describe the method steps for perceptually filtering the random signal.
At step 612, the randomly quantized unvoiced speech signal is band-pass filtered to eliminate high-end and low-end components. The bandpass filter is identical to the bandpass filter used in encoding. The flow of control proceeds to step 614.
In step 614, a fixed preliminary shaping filter is applied to the random quantized unvoiced speech signal. The fixed preliminary shaping filter is identical to the fixed preliminary shaping filter used in encoding. The flow of control proceeds to step 616.
In step 616, based on the filter selection indication information, either no filtering is performed or one of two predetermined final filters is selected to further filter the random signal in the final shaping filtering. The two predetermined filters of the final shaping filter are a high-pass final shaping filter (filter 2) and a low-pass final shaping filter (filter 3), which are identical to the high-pass final shaping filter and the low-pass final shaping filter of the encoder. The quantized random signal output from the final shaping filter is scaled to have the same energy as the signal output of the band pass filter. The quantized random signal is filtered by an LPC synthesis filter to produce a synthesized speech signal. A subsequent post-filter may be applied to the synthesized speech signal to produce the final decoded output speech.
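The energy-matching scaling described above, in which the final shaping filter output is scaled to the energy of the band-pass filter output, can be sketched directly (function name assumed):

```python
import math

def match_energy(signal, reference):
    """Scale `signal` so its total energy equals that of `reference`,
    as the quantized random signal output from the final shaping filter
    is scaled to the energy of the band-pass filter output."""
    e_sig = sum(x * x for x in signal)
    e_ref = sum(x * x for x in reference)
    if e_sig == 0.0:
        return list(signal)  # nothing to scale
    g = math.sqrt(e_ref / e_sig)
    return [g * x for x in signal]
```

The returned signal always carries the reference energy regardless of how the final filtering reshaped its spectrum.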
FIG. 7A is a graph of the frequency response (normalized frequency versus amplitude) of the low-pass filter in the band energy analyzers (314, 324), which is used to analyze the low-band energy in the residual signal r(n) output from the LPC filter (304) of the encoder and in the scaled and filtered random signal output from the preliminary shaping filter (322) of the encoder.
FIG. 7B is a graph of the frequency response (normalized frequency versus amplitude) of the high-pass filter in the band energy analyzers (314, 324), which is used to analyze the high-band energy in the residual signal r(n) output from the LPC filter (304) of the encoder and in the scaled and filtered random signal output from the preliminary shaping filter (322) of the encoder.
FIG. 8A is a graph of the frequency response (normalized frequency versus amplitude) of the band-pass filter (320, 407) used to shape the scaled random signal output from the multipliers (307, 405) of the encoder and decoder.
FIG. 8B is a graph of the frequency response (normalized frequency versus amplitude) of the preliminary shaping filter (322, 409) used to shape the scaled random signal output from the band-pass filters (320, 407) of the encoder and decoder.
FIG. 8C is a graph of the frequency response (normalized frequency versus amplitude) of the high-pass final shaping filter in the final shaping filters (316, 410), used to shape the scaled and filtered random signal output from the preliminary shaping filters (322, 409) of the encoder and decoder.
FIG. 8D is a graph of the frequency response (normalized frequency versus amplitude) of the low-pass final shaping filter in the final shaping filters (316, 410), used to shape the scaled and filtered random signal output from the preliminary shaping filters (322, 409) of the encoder and decoder.
The previous description of the preferred embodiments is provided to enable any person skilled in the art to make or use the embodiments of the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without the use of the inventive faculty. Thus, the embodiments of the present disclosure are not intended to be limited to the embodiments shown herein but are to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (65)

1. A method for encoding a unvoiced segment of speech, the method comprising:
dividing a linear prediction residual signal frame into a plurality of sub-frames;
establishing a set of subframe gains by computing a codebook gain for each of a plurality of subframes;
decomposing the set of sub-frame gains into sub-sets of sub-frame gains;
normalizing the sub-frame gain sub-set to produce a plurality of normalization factors, wherein each factor of the plurality of normalization factors is associated with one of the normalized sub-sets of sub-frame gains;
converting each of the plurality of normalization factors into an exponential form and quantizing the converted plurality of normalization factors;
quantizing the normalized sub-frame gain sub-groups to generate a plurality of quantized codebook gains, wherein each of the quantized codebook gains refers to a codebook gain index for each of the plurality of sub-groups;
generating a random noise signal including a random number for each of a plurality of subframes;
selecting a predetermined percentage of the highest amplitude random numbers of the random noise signals associated with each sub-frame;
scaling the selected highest amplitude random number by the quantized codebook gain for each subframe to produce a scaled random noise signal;
band-pass filtering and shaping the scaled random noise signal to produce a band-pass filtered and shaped random noise signal;
analyzing the energy of the linear prediction residual signal frame and the energy of the band-pass filtered and shaped random noise signal to produce an energy analysis;
selecting a second filter based on the energy analysis and further shaping the band-pass filtered and shaped random noise signal with the selected filter; and
generating a second filter selection indication to identify the selected filter.
2. The method of claim 1, wherein the step of dividing one frame of the linear prediction residual signal into a plurality of sub-frames comprises dividing one frame of the linear prediction residual signal into 10 sub-frames.
3. The method of claim 2, wherein the step of decomposing the set of subframe gains into sub-sets of subframe gains comprises dividing a set of ten subframe gains into two groups of five subframe gains each.
4. The method of claim 1, wherein the linear prediction residual signal frame comprises 160 samples per 20-millisecond frame, sampled at eight kilohertz.
5. The method of claim 1, wherein the predetermined percentage of highest amplitude random numbers is twenty-five percent.
6. The method of claim 3, wherein a normalization factor is generated for each of the two subsets.
7. The method of claim 1, wherein quantizing the subframe gains is performed using multi-level vector quantization.
8. A method for encoding a unvoiced segment of speech, the method comprising:
dividing a frame of linear prediction residual signal into sub-frames, each sub-frame having a codebook gain associated therewith;
quantizing the codebook gain to produce a codebook gain index;
scaling a predetermined percentage of the highest amplitude random noise associated with each subframe by a codebook gain index associated with the subframe;
performing primary first filtering on the scaled random noise;
comparing the energy of the first filtered random noise with the energy of the linear prediction residual signal;
performing second filtering of the first filtered random noise based on the comparison; and
generating a second filter selection indication to identify the second filtering that was performed.
9. The method of claim 8, wherein the step of dividing one frame of the linear prediction residual signal into sub-frames comprises dividing one frame of the linear prediction residual signal into 10 sub-frames.
10. The method of claim 8, wherein the linear prediction residual signal frame comprises 160 samples per 20-millisecond frame, sampled at eight kilohertz.
11. The method of claim 8, wherein the predetermined percentage is twenty-five percent.
12. The method of claim 8 wherein quantizing codebook gains to produce codebook gain indices is performed using multi-stage vector quantization.
13. A speech encoder for encoding a unvoiced segment of speech, the encoder comprising:
means for dividing a frame of linear prediction residual signals into a plurality of sub-frames;
means for establishing a set of subframe gains by calculating a codebook gain for each of a plurality of subframes;
means for decomposing the set of sub-frame gains into sub-sets of sub-frame gains;
means for normalizing the sub-frame gain sub-groups to produce a plurality of normalization factors, wherein each factor of the plurality of normalization factors is associated with one of the normalized sub-groups of sub-frame gains;
means for converting each of the plurality of normalization factors into an exponential form and quantizing the converted plurality of normalization factors;
means for quantizing the normalized sub-frame gain sub-groups to produce a plurality of quantized codebook gains, wherein each of the quantized codebook gains is a codebook gain index for each of the plurality of sub-groups;
means for generating a random noise signal comprising a random number for each of a plurality of sub-frames;
means for selecting a predetermined percentage of the highest amplitude random numbers of the random noise signal associated with each sub-frame;
means for scaling the selected highest amplitude random number by the quantized codebook gain for each subframe to produce a scaled random noise signal;
means for band-pass filtering and shaping the scaled random noise signal to produce a band-pass filtered and shaped random noise signal;
means for analyzing the energy of the frame of linear prediction residual signals and the energy of the band-pass filtered and shaped random noise signal to produce an energy analysis;
means for selecting a second filter based on the energy analysis and further shaping the band-pass filtered and shaped random noise signal with the selected filter; and
means for generating a second filter selection indication to identify the selected filter.
14. The speech coder of claim 13, wherein the means for partitioning a frame of linear prediction residual signal into a plurality of sub-frames comprises means for partitioning a frame of linear prediction residual signal into 10 sub-frames.
15. The speech coder of claim 14, wherein the means for dividing the set of sub-frame gains into sub-groups comprises means for dividing a set of ten sub-frame gains into two groups of five sub-frame gains each.
16. The speech coder of claim 13, wherein the means for selecting a predetermined percentage of the highest amplitude random numbers comprises means for selecting twenty-five percent of the highest amplitude random numbers.
17. The speech coder of claim 15, wherein the means for normalizing the subsets comprises means for generating a normalization factor for each of the two subsets.
18. The speech coder of claim 13, wherein the means for quantizing the sub-frame gains comprises means for performing multi-level vector quantization.
19. A speech encoder for encoding unvoiced segments of speech, the encoder comprising:
means for dividing a frame of linear prediction residual signals into sub-frames, each sub-frame having a codebook gain associated therewith;
means for quantizing the codebook gains to produce codebook gain indices;
means for scaling a predetermined percentage of the highest amplitude random noise associated with each subframe by a codebook gain index associated with the subframe;
means for performing a first filtering of the scaled random noise;
means for comparing the energy of the first filtered random noise with the energy of the linear prediction residual signal;
means for performing second filtering of the first filtered random noise based on the comparison; and
means for generating a second filter selection indication to identify the second filtering performed.
20. The speech coder of claim 19, wherein the means for dividing a frame of linear prediction residual signal into sub-frames comprises means for dividing a frame of linear prediction residual signal into 10 sub-frames.
21. The speech coder of claim 19, wherein the means for scaling a predetermined percentage of the highest amplitude random noise comprises a means for scaling twenty-five percent of the highest amplitude random noise.
22. The speech coder of claim 19, wherein the means for quantizing the codebook gains to produce codebook gain indices comprises means for performing multi-stage vector quantization.
23. A speech encoder for encoding unvoiced segments of speech, the encoder comprising:
a gain calculation component configured to divide a frame of a linear prediction residual signal into a plurality of sub-frames, establish a set of sub-frame gains by calculating a codebook gain for each of the plurality of sub-frames, divide the set of sub-frame gains into sub-frame gain sub-groups, normalize the sub-frame gain sub-groups to produce a plurality of normalization factors, wherein each of the plurality of normalization factors is associated with a sub-group of the normalized sub-frame gain sub-groups, and convert each of the plurality of normalization factors into an exponential form;
a gain quantizer configured to quantize the converted plurality of normalization factors to produce quantized normalization factor indices, and quantize the normalized sub-frame gain subsets to produce a plurality of quantization codebook gains, wherein each of the quantization codebook gains refers to a codebook gain index for each of the plurality of subsets;
a random number generator configured to generate a random noise signal including a random number for each of a plurality of sub-frames;
a random number selector configured to select a predetermined percentage of the highest amplitude random number of the random noise signal for each of the plurality of sub-frames;
a multiplier configured to scale the selected highest amplitude random number with the quantized codebook gain for each subframe to produce a scaled random noise signal;
a band pass filter for removing low and high end frequencies from the scaled random noise signal to produce a band pass filtered random noise signal;
a first shaping filter for perceptually filtering the band-pass filtered random noise signal to produce a perceptually filtered random noise signal;
a non-scaled band energy analyzer configured to analyze the energy of the linearly predicted residual signal;
a scaled band energy analyzer configured to analyze the energy of the perceptually filtered random noise signal and to produce an associated energy analysis of the linear prediction residual signal energy compared to the energy of the perceptually filtered random noise signal;
a second shaping filter configured to select a second filter based on the correlation energy analysis, further shape the perceptually filtered random noise signal with the selected filter, and generate a second filter selection indication to identify the selected filter.
24. The speech coder of claim 23, wherein the bandpass filter and the first shaping filter are fixed filters.
25. The speech coder of claim 23, wherein the second shaping filter is configured with two fixed shaping filters.
26. The speech coder of claim 23, wherein the second shaping filter configured to generate a second filter selection indication to identify the selected filter is further configured to generate a two-bit filter selection indication.
27. The speech coder of claim 23, wherein the gain computation component configured to divide one frame of linear prediction residual signal into a plurality of subframes is further configured to divide the frame of linear prediction residual signal into ten subframes.
28. The speech coder of claim 23, wherein the gain computation component is further configured to divide a set of ten subframe gains into two groups of five subframe gains each.
29. The speech coder of claim 23, wherein the random number selector configured to select a predetermined percentage of the highest amplitude random numbers is further configured to select twenty-five percent of the highest amplitude random numbers.
30. The speech coder of claim 23, wherein the gain computation component is further configured to generate two normalization factors for two subsets of five subframe codebook gains each.
31. The speech coder of claim 23, wherein the gain quantizer is further configured to perform multi-level vector quantization.
32. A speech coder for coding an unvoiced speech segment, the coder comprising:
a gain calculation component configured to divide a frame of linear prediction residual signal into a plurality of sub-frames, each sub-frame having a codebook gain associated therewith;
a gain quantizer configured to quantize the codebook gain to produce a codebook gain index;
a random number selector and multiplier configured to scale a predetermined percentage of the highest amplitude random noise associated with each subframe by a codebook gain index associated with the subframe;
a first perceptual filter configured to first filter the scaled random noise;
a band energy analyzer configured to compare an energy of the first filtered random noise with an energy of the linear prediction residual signal;
a second shaping filter configured to second filter the first filtered random noise based on the comparison and to generate a second filter selection indication to identify the second filtering performed.
33. The speech coder of claim 32, wherein the gain computation component configured to divide a linear prediction residual signal frame into subframes is further configured to divide a linear prediction residual signal frame into ten subframes.
34. The speech coder of claim 32, wherein the random noise selector and multiplier configured to scale a predetermined percentage of the highest amplitude random noise are further configured to scale twenty-five percent of the highest amplitude random noise.
35. The speech coder of claim 32, wherein the gain quantizer configured to quantize codebook gains to produce codebook gain indices is further configured to perform multi-level vector quantization.
36. The speech coder of claim 32, wherein the first perceptual filter configured to first filter the scaled random noise is further configured to filter the scaled random noise with a fixed band-pass filter and a fixed shaping filter.
37. The speech coder of claim 32, wherein the second shaping filter configured to second filter the first filtered random noise is further configured to have two fixed filters.
38. The speech coder of claim 32, wherein the second shaping filter configured to generate a second filter selection indication is further configured to generate a two-bit filter selection indication.
39. A method for decoding an unvoiced speech segment, the method comprising:
recovering a set of quantization gains using the received normalization factor indices and quantization sub-set gain indices for the plurality of subframes;
generating a random noise signal including a random number for each of a plurality of subframes;
selecting a predetermined percentage of the highest amplitude random number of the random noise signals associated with each sub-frame;
scaling the selected highest amplitude random number with the recovered quantization gain for each subframe to produce a scaled random noise signal;
band-pass filtering and shaping the scaled random noise signal to produce a band-pass filtered and shaped random noise signal; and
a second filter is selected based on a received filter selection indication and the band-pass filtered and shaped random noise signal is further shaped with the selected filter.
40. The method of claim 39 further comprising further filtering the random noise further shaped by the second filter with a linear predictive coding synthesis filter.
41. The method of claim 39, wherein the plurality of subframes comprises a division of ten subframes per frame of encoded unvoiced speech.
42. The method of claim 39, wherein the plurality of subframes comprises subframe gains divided into subgroups.
43. The method of claim 41, wherein sub-grouping comprises dividing a group of ten subframe gains into two groups of five subframe gains.
44. The method of claim 41, wherein the encoded unvoiced speech frame comprises 160 samples per 20-millisecond frame, sampled at eight kilohertz.
45. The method of claim 39, wherein the predetermined percentage is twenty-five percent.
46. The method of claim 43, wherein two normalization factors are recovered, one for each of the two subgroups of five subframe gains.
47. The method of claim 39 wherein recovering a set of quantization gains is performed using multi-level vector quantization.
48. A method for decoding an unvoiced speech segment, the method comprising:
recovering the quantization gains divided into the sub-frame gains from the received normalization factor index and quantization sub-group gain index associated with each sub-frame;
scaling a predetermined percentage of the highest amplitude random noise associated with each sub-frame by a normalization factor index and a quantization sub-set gain index associated with each sub-frame;
first filtering the scaled random noise;
performing, on the first filtered random noise, second filtering determined by a received filter selection indication.
49. The method of claim 48, comprising further filtering the second filtered random noise with a linear predictive coding synthesis filter.
50. The method of claim 48, wherein the sub-frame gains comprise a division of ten sub-frame gains per frame of the encoded unvoiced speech.
51. The method of claim 50, wherein the encoded unvoiced speech frames comprise 160 samples per 20-millisecond frame, sampled at eight kilohertz.
52. The method of claim 48, wherein the predetermined percentage is twenty-five percent.
53. The method of claim 48, wherein the recovered quantization gain is quantized by multi-stage vector quantization.
54. A decoder for decoding an unvoiced speech segment, the decoder comprising:
means for recovering a set of quantization gains using the normalization factor indices and the quantization sub-set gain indices of the received plurality of subframes;
means for generating a random noise signal comprising a random number for each of a plurality of sub-frames;
means for selecting a highest amplitude random number of a predetermined percentage of the random noise signal associated with each sub-frame;
means for scaling the selected highest amplitude random number with the recovered quantization gain for each subframe to produce a scaled random noise signal;
means for band-pass filtering and shaping the scaled random noise signal to produce a band-pass filtered and shaped random noise signal; and
means for selecting a second filter based on a received filter selection indication and further shaping the band-pass filtered and shaped random noise signal with the selected filter.
55. The decoder of claim 54, wherein the decoder includes means for further filtering the random noise further shaped by the second filter with a linear predictive coding synthesis filter.
56. The decoder of claim 54, wherein the means for selecting a predetermined percentage of the highest amplitude random numbers of the random noise signal associated with each subframe further comprises means for selecting twenty-five percent of the highest amplitude random numbers.
57. A decoder for decoding an unvoiced speech segment, the decoder comprising:
a gain dequantizer configured to recover a set of quantization gains using the normalization factor index and the quantization sub-set gain index for the received plurality of subframes;
a random number generator configured to generate a random noise signal including a random number for each of a plurality of sub-frames;
a random number selector configured to select a predetermined percentage of the highest amplitude random numbers of the random noise signal associated with each sub-frame;
a random number multiplier configured to scale the selected highest amplitude random number by the recovered quantization gain for each sub-frame to produce a scaled random noise signal;
a band pass filter and a first shaping filter for filtering and shaping the scaled random noise signal to produce a band pass filtered and shaped random noise signal; and
a second shaping filter configured to select a second filter based on a received filter selection indication and to further shape the band-pass filtered and shaped random noise signal with the selected filter.
58. The decoder of claim 57 wherein the decoder further comprises a linear predictive coding synthesis filter configured to further filter random noise further shaped by the second filter.
59. The decoder of claim 57, wherein the random number selector configured to select a predetermined percentage of the highest amplitude random numbers of the random noise signal is further configured to select twenty-five percent of the highest amplitude random numbers.
60. A decoder for decoding an unvoiced speech segment, the decoder comprising:
means for recovering the quantization gains divided into the sub-frame gains from the received normalization factor index and quantization sub-group gain index associated with each sub-frame;
means for scaling a predetermined percentage of the highest amplitude random noise associated with each sub-frame by a normalization factor index and a quantization sub-group gain index associated with each sub-frame;
means for first filtering the scaled random noise; and
means for performing a second filtering of the first filtered random noise determined by a received filter selection indication.
61. The decoder of claim 60, wherein the decoder includes means for further filtering the second filtered random noise with a linear predictive coding synthesis filter.
62. The decoder of claim 60, wherein the means for scaling a predetermined percentage of the highest amplitude random noise associated with each sub-frame further comprises means for scaling twenty-five percent of the highest amplitude random noise associated with each sub-frame.
63. A decoder for decoding an unvoiced speech segment, the decoder comprising:
a gain dequantizer configured to recover quantized gains decomposed into sub-frame gains from the received normalization factor index and the quantized sub-group gain index associated with each sub-frame;
a random number selector and multiplier configured to scale a predetermined percentage of the highest amplitude random noise associated with each sub-frame by a normalization factor index and a quantization sub-group gain index associated with the sub-frame;
a first shaping filter configured to perform a first perceptual filtering of the scaled random noise; and
a second shaping filter configured to perform a second filtering of the first filtered random noise determined by a received filter selection indication.
64. The decoder of claim 63, wherein the decoder comprises a linear predictive coding synthesis filter configured to further filter the second filtered random noise.
65. The decoder of claim 63, wherein the random number selector and multiplier configured to scale a predetermined percentage of the highest amplitude random noise associated with each subframe is further configured to scale twenty-five percent of the highest amplitude random noise associated with each subframe.
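Taken together, the decoder claims describe a pipeline: dequantize the subframe gains, generate random noise, keep and scale the highest-amplitude samples, pass the result through shaping filters, and finish with linear predictive coding (LPC) synthesis. The sketch below strings these stages together purely for illustration; the shaping FIR taps, LPC coefficients, gain, and subframe length are all placeholder values, not taken from the patent:

```python
import numpy as np

def lpc_synthesis(excitation, a):
    """All-pole LPC synthesis filter 1/A(z), with a = [1, a1, ..., ap]."""
    out = np.zeros_like(excitation)
    for n in range(len(excitation)):
        acc = excitation[n]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * out[n - k]
        out[n] = acc
    return out

def decode_subframe(gain, n=40, shaping=(1.0, -0.5), seed=1):
    """Hypothetical decode path: random noise -> keep top 25% by magnitude
    -> scale by the recovered gain -> FIR shaping (standing in for the
    band-pass/shaping stage) -> LPC synthesis with placeholder coefficients."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(n)
    keep = np.argsort(np.abs(noise))[-n // 4:]
    exc = np.zeros(n)
    exc[keep] = noise[keep] * gain
    shaped = np.convolve(exc, shaping)[:n]
    return lpc_synthesis(shaped, np.array([1.0, -0.9]))

out = decode_subframe(gain=1.5)
print(out.shape)  # → (40,)
```

The key design point the claims capture is that no excitation waveform is transmitted at all: only gain indices and a filter selection indication cross the channel, and the decoder regenerates a perceptually similar noise excitation locally, which is what makes the very low bit rate possible for unvoiced segments.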
HK04103354.0A 2000-10-17 2001-10-06 Method and apparatus for encoding and decoding of unvoiced speech HK1060430B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US09/690,915 2000-10-17
US09/690,915 US6947888B1 (en) 2000-10-17 2000-10-17 Method and apparatus for high performance low bit-rate coding of unvoiced speech
PCT/US2001/042575 WO2002033695A2 (en) 2000-10-17 2001-10-06 Method and apparatus for coding of unvoiced speech

Publications (2)

Publication Number Publication Date
HK1060430A1 HK1060430A1 (en) 2004-08-06
HK1060430B true HK1060430B (en) 2007-08-03

Similar Documents

Publication Publication Date Title
CN1302459C (en) A low-bit-rate coding method and apparatus for unvoiced speed
CN100350453C (en) Robust speech classification method and device
CN1223989C (en) Frame Erasure Compensation Method in Variable Rate Speech Coder and Device Using the Method
CN1266674C (en) Closed-loop multimode mixed-domain linear prediction (MDLP) speech coder
US8346544B2 (en) Selection of encoding modes and/or encoding rates for speech compression with closed loop re-decision
JP4489960B2 (en) Low bit rate coding of unvoiced segments of speech.
US8090573B2 (en) Selection of encoding modes and/or encoding rates for speech compression with open loop re-decision
CN1432176A (en) Method and apparatus for predictive quantization of voiced speech
CN1279510C (en) Method and apparatus for subsampling phase spectrum information
CN1188832C (en) Multipulse interpolative coding of transition speech frames
CN1262991C (en) Method and apparatus for tracking the phase of a quasi-periodic signal
HK1060430B (en) Method and apparatus for encoding and decoding of unvoiced speech
HK1064196B (en) Method and apparatus for subsampling phase spectrum information
HK1067444B (en) Method and apparatus for robust speech classification
HK1055833B (en) Closed-loop multimode mixed-domain linear prediction speech coder and method of processing frames
HK1091584A (en) Low bit-rate coding of unvoiced segments of speech