HK1055833B - Closed-loop multimode mixed-domain linear prediction speech coder and method of processing frames
Description
Technical Field
The present invention relates generally to the field of speech processing, and more particularly to a closed-loop, multimode, mixed-domain speech codec and a method of processing frames.
Background
Voice transmission by digital techniques has become widespread, particularly in long-distance and digital radio telephone applications. This, in turn, has created interest in determining the least amount of information that can be sent over a channel while maintaining the perceived quality of the reconstructed speech. If speech is simply sampled and digitized, a data rate on the order of 64 kilobits per second (kbps) is required to achieve the speech quality of a conventional analog telephone. Through the use of speech analysis, followed by appropriate coding, transmission, and resynthesis at the receiver, however, a significant reduction in the data rate can be achieved.
Devices that employ techniques to compress speech by extracting parameters that relate to a model of human speech generation are called speech codecs. A speech codec divides the incoming speech signal into blocks of time, or analysis frames. A speech codec typically comprises an encoder and a decoder. The encoder analyzes the incoming speech frame to extract certain relevant parameters and then quantizes the parameters into a binary representation, i.e., into a set of bits or a binary data packet. The data packets are transmitted over a communication channel to a receiver and a decoder. The decoder processes the data packets, dequantizes them to produce the parameters, and resynthesizes the speech frame using the dequantized parameters.
The function of the speech codec is to compress the digitized speech signal into a low-bit-rate signal by removing the natural redundancies inherent in speech. The digital compression is achieved by representing the input speech frame with a set of parameters and employing quantization to represent the parameters with a set of bits. If the input speech frame has a number of bits N_i and the data packet produced by the speech codec has a number of bits N_o, the compression factor achieved by the speech codec is C_r = N_i / N_o. The challenge is to retain high voice quality of the decoded speech while achieving the target compression factor. The performance of a speech codec depends on (1) how well the speech model, or the combination of the analysis and synthesis processes described above, performs, and (2) how well the parameter quantization process performs at the target bit rate of N_o bits per frame. The goal of the speech model is thus to capture the essence of the speech signal, or the target voice quality, with a small set of parameters for each frame.
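As a worked example (the figures are assumed for illustration and are not taken from this disclosure): a 20 ms frame of 160 samples represented with 16 bits per sample contains N_i = 2560 bits; if the codec produces an 80-bit packet for that frame (4 kbps), the compression factor is C_r = 2560 / 80 = 32.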
The speech codec may be implemented as a time-domain codec, which attempts to capture the time-domain speech waveform by employing high-time-resolution processing to encode small segments of speech (typically 5 millisecond (ms) subframes) at a time. For each subframe, a high-precision representative is found from a codebook space by means of various search algorithms known in the art. Alternatively, the speech codec may be implemented as a frequency-domain codec, which attempts to capture the short-term speech spectrum of the input speech frame with a set of parameters (analysis) and employs a corresponding synthesis process to recreate the speech waveform from the spectral parameters. The parameter quantizer preserves the parameters by representing them with stored representations of code vectors in accordance with known quantization techniques described in A. Gersho & R.M. Gray, Vector Quantization and Signal Compression (1992).
A well-known time-domain speech codec is the Code Excited Linear Predictive (CELP) codec described in L.B. Rabiner & R.W. Schafer, Digital Processing of Speech Signals 396-453 (1978), which is fully incorporated herein by reference. In a CELP codec, the short-term correlations, or redundancies, in the speech signal are removed by a linear prediction (LP) analysis, which finds the coefficients of a short-term formant filter. Applying the short-term prediction filter to the incoming speech frame generates an LP residual signal, which is further modeled and quantized with long-term prediction filter parameters and a subsequent stochastic codebook. Thus, CELP coding divides the task of encoding the time-domain speech waveform into the separate tasks of encoding the LP short-term filter coefficients and encoding the LP residual. Time-domain coding can be performed at a fixed rate (i.e., using the same number of bits, N_o, for each frame) or at a variable rate (in which different bit rates are used for different types of frame contents). Variable-rate codecs attempt to use only the amount of bits needed to encode the codec parameters to a level adequate to obtain a target quality. An exemplary variable-rate CELP codec is described in U.S. Patent No. 5,414,796, which is assigned to the assignee of the present invention and fully incorporated herein by reference.
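As a rough illustration of the short-term prediction step described above, the sketch below performs autocorrelation LP analysis and inverse filtering on one frame (a minimal sketch only; the LP order, the synthetic test frame, and the use of NumPy are assumptions for illustration, not details of this disclosure or of any particular CELP standard):

import numpy as np

def levinson_durbin(r, order):
    # Solve for LP coefficients a[0..order] (a[0] = 1) from autocorrelations r[0..order].
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err

def lp_residual(frame, order=10):
    # Short-term analysis filter: R[n] = s[n] + sum_j a[j] * s[n - j], the prediction error.
    r = np.array([np.dot(frame[: len(frame) - k], frame[k:]) for k in range(order + 1)])
    a, _ = levinson_durbin(r, order)
    return np.convolve(frame, a)[: len(frame)], a

# Example on a synthetic 20 ms frame (160 samples at 8 kHz) of a 100 Hz tone plus a little noise.
rng = np.random.default_rng(0)
s = np.sin(2 * np.pi * 100 * np.arange(160) / 8000.0) + 0.01 * rng.standard_normal(160)
residual, a = lp_residual(s)
print("residual energy / frame energy =", float(np.sum(residual ** 2) / np.sum(s ** 2)))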
Time-domain codecs such as the CELP codec typically rely upon a high number of bits, N_o, per frame to preserve the accuracy of the time-domain speech waveform. Such codecs typically deliver excellent voice quality provided the number of bits N_o per frame is relatively large (e.g., 8 kbps or above). At low bit rates (4 kbps and below), however, time-domain codecs fail to retain high quality and robust performance due to the limited number of available bits. At low bit rates, the limited codebook space clips the waveform-matching capability of conventional time-domain codecs, which are so successfully deployed in higher-rate commercial applications.
There is presently a surge of research interest and strong commercial need to develop a high-quality speech codec operating at medium to low bit rates (i.e., in the range of 2.4 to 4 kbps and below). The application areas include wireless telephony, satellite communications, Internet telephony, various multimedia and voice-streaming applications, voice mail, and other voice storage systems. The driving forces are the need for high capacity and the demand for robust performance under packet-loss situations. Various recent speech coding standardization efforts are another direct driving force propelling research and development of low-rate speech codec algorithms. A low-rate speech codec creates more channels, or users, per allowable application bandwidth, and a low-rate speech codec coupled with an additional layer of suitable channel coding can fit the overall bit budget of codec specifications and deliver robust performance under channel error conditions.
For coding at lower bit rates, various methods of spectral, or frequency-domain, coding of speech have been developed, in which the speech signal is analyzed as a time-varying evolution of spectra. See, e.g., R.J. McAulay & T.F. Quatieri, Sinusoidal Coding, in Speech Coding and Synthesis ch. 4 (W.B. Kleijn & K.K. Paliwal eds., 1995). In spectral codecs, the objective is to model, or predict, the short-term speech spectrum of each input frame of speech with a set of spectral parameters, rather than to precisely mimic the time-varying speech waveform. The spectral parameters are then encoded, and an output frame of speech is created with the decoded parameters. The resulting synthesized speech does not match the original input speech waveform, but offers similar perceived quality. Examples of frequency-domain codecs that are well known in the art include multiband excitation codecs (MBEs), sinusoidal transform codecs (STCs), and harmonic codecs (HCs). Such frequency-domain codecs offer a high-quality parametric model having a compact set of parameters that can be accurately quantized with the low number of bits available at low bit rates.
Nevertheless, low-bit-rate coding imposes the critical constraint of limited coding resolution, or limited codebook space, which limits the effectiveness of a single coding mechanism, rendering the codec unable to represent various types of speech segments under various background conditions with equal accuracy. For example, conventional low-bit-rate frequency-domain codecs do not transmit phase information for speech frames. Instead, the phase information is reconstructed by using a random, artificially generated initial phase value and linear interpolation techniques. See, e.g., H. Yang et al., Quadratic Phase Interpolation for Voiced Speech Synthesis in the MBE Model, 29 Electronic Letters 856-57 (May 1993). Because the phase information is artificially generated, even if the amplitudes of the sinusoids are perfectly preserved by the quantization-dequantization process, the output speech produced by the frequency-domain codec is not aligned with the original input speech (i.e., the major pulses are not in synchrony). It has therefore proven difficult to adopt any closed-loop performance measure, such as the signal-to-noise ratio (SNR) or perceptual SNR, in frequency-domain codecs.
Multimode coding techniques have been employed to perform low-rate speech coding in conjunction with an open-loop mode decision process. One such multimode coding technique is described in Amitava Das et al., Multimode and Variable-Rate Coding of Speech, in Speech Coding and Synthesis ch. 7 (W.B. Kleijn & K.K. Paliwal eds., 1995). Conventional multimode codecs apply different modes, or encoding-decoding algorithms, to different types of input speech frames. Each mode, or encoding-decoding process, is customized to represent a certain type of speech segment, such as voiced speech, unvoiced speech, or background noise (nonspeech), in the most efficient manner. An external, open-loop mode decision mechanism examines the input speech frame and makes a decision regarding which mode to apply to the frame. The open-loop mode decision is typically performed by extracting a number of parameters from the input frame, evaluating the parameters as to certain temporal and spectral characteristics, and basing a mode decision upon the evaluation. The mode decision is thus made without knowing in advance the exact condition of the output speech, i.e., how close the output speech will be to the input speech in terms of voice quality or other performance measures.
Based upon the foregoing, it would be desirable to provide a low-bit-rate frequency-domain codec that more precisely estimates phase information. It would further be advantageous to provide a multimode, mixed-domain codec that time-domain encodes certain speech frames and frequency-domain encodes other speech frames based upon the contents of the speech frames. It would still further be desirable to provide a mixed-domain codec that can time-domain encode certain speech frames and frequency-domain encode other speech frames in accordance with a closed-loop coding mode decision mechanism. Thus, there is a need for a closed-loop, multimode, mixed-domain speech codec that ensures time synchrony between the output speech produced by the codec and the original speech input to the codec.
Disclosure of Invention
The present invention relates to a closed-loop, multi-mode, mixed-domain speech codec that ensures time synchronization between output speech produced by the codec and original speech input to the codec. Accordingly, in one aspect of the present invention, a multi-mode, mixed-domain speech processor advantageously includes a codec having at least one time-domain codec mode and at least one frequency-domain codec mode, and a closed-loop mode selection device coupled to the codec and configured to select a codec mode for the codec based on the content of frames processed by the speech processor.
According to one aspect of the present invention, there is provided a multimodal mixed domain speech processor comprising: a codec having at least one time-domain coding mode and at least one frequency-domain coding mode; and closed-loop mode selection means coupled to the codec and configured to implement at least one time-domain codec mode if the output of the frequency-domain codec mode is distorted outside acceptable limits; wherein said closed-loop mode selection means comprises a comparison circuit coupled to the codec for comparing unencoded frames with frames encoded in at least one frequency domain codec mode and for generating a performance measure based on the comparison, wherein the codec applies the at least one time domain codec mode only if the performance measure is below a predetermined threshold, and otherwise the codec applies the at least one frequency domain codec mode.
According to another aspect of the present invention, there is provided a multimode, mixed-domain speech processor comprising: a codec having at least one time-domain coding mode and at least one frequency-domain coding mode, wherein the at least one frequency-domain coding mode represents the short-term spectrum of each frame by a set of sinusoids having frequency, phase, and amplitude parameters, the phase being modeled by a polynomial expression and an initial phase value, where the polynomial expression is θ(k,n) = B1(k)·n² + B2(k)·n + B3(k), θ(k,n) is the phase of the sinusoid, k = 1, 2, …, L, L is the total number of sinusoids, n = 1, 2, …, N, N is the number of samples per frame, and B_i(k) are the estimated coefficients, the initial phase value being either (1) the final estimated phase value of the previous frame if the previous frame was coded in one of the at least one frequency-domain coding mode, or (2) a phase value obtained from the short-term spectrum of the previous frame if the previous frame was coded in one of the at least one time-domain coding mode; and closed-loop mode selection means coupled to the codec and configured to select a codec mode for the codec based on the short-term spectrum of each frame encoded in the at least one frequency-domain codec mode, wherein the closed-loop mode selection means selects the at least one time-domain codec mode if the short-term spectrum is distorted outside acceptable limits; wherein said closed-loop mode selection means comprises a comparison circuit coupled to the codec for comparing unencoded frames with frames encoded in the at least one frequency-domain codec mode and for generating a performance measure based on the comparison, wherein the codec applies the at least one time-domain codec mode only if the performance measure is below a predetermined threshold, and otherwise the codec applies the at least one frequency-domain codec mode.
In another aspect of the present invention, a method of processing frames advantageously includes the steps of: applying an open-loop coding mode selection process to successive input frames to select either a time-domain coding mode or a frequency-domain coding mode based upon the speech content of each input frame; frequency-domain encoding an input frame if the speech content of the input frame represents steady-state voiced speech; time-domain encoding the input frame if the speech content of the input frame represents anything other than steady-state voiced speech; comparing the frequency-domain-encoded frame with the input frame to obtain a performance measure; and time-domain encoding the input frame if the performance measure falls below a predefined threshold.
In another aspect of the invention, a multimode, mixed-domain speech processor advantageously includes: means for applying an open-loop coding mode selection process to an input frame to select either a time-domain coding mode or a frequency-domain coding mode based upon the speech content of the input frame; means for frequency-domain encoding the input frame if the speech content of the input frame represents steady-state voiced speech; means for time-domain encoding the input frame if the speech content of the input frame represents anything other than steady-state voiced speech; means for comparing the frequency-domain-encoded frame with the input frame to obtain a performance measure; and means for time-domain encoding the input frame if the performance measure falls below a predefined threshold.
Drawings
FIG. 1 is a block diagram of a communication channel terminated at each end by a speech codec;
FIG. 2 is a block diagram of an encoder that may be used in a multimode mixed-domain linear prediction (MDLP) speech codec;
FIG. 3 is a block diagram of a decoder that may be used in a multi-mode MDLP speech codec;
FIG. 4 is a flow chart illustrating the steps of MDLP encoding that may be performed by the MDLP encoder used in the encoder of FIG. 2;
FIG. 5 is a flow chart illustrating a speech codec decision process;
FIG. 6 is a block diagram of a closed-loop multimode MDLP speech codec;
FIG. 7 is a block diagram of a spectral codec that may be used in the FIG. 6 codec or the FIG. 2 encoder;
FIG. 8 is an amplitude-frequency graph showing the amplitude of a sine wave in a harmonic codec;
FIG. 9 is a flowchart illustrating a mode decision process in a multi-mode MDLP codec;
FIG. 10A is a graph of speech signal amplitude versus time, and FIG. 10B is a graph of linear prediction (LP) residual amplitude versus time;
FIG. 11A is a graph of rate/mode versus frame index under a closed-loop coding decision; FIG. 11B is a graph of perceptual signal-to-noise ratio (PSNR) versus frame index under a closed-loop decision; and FIG. 11C is a graph of both rate/mode and PSNR versus frame index in the absence of a closed-loop coding decision.
Detailed Description
In FIG. 1, a first encoder 10 receives digitized speech samples s(n) and encodes the samples s(n) for transmission over a transmission medium 12, or communication channel 12, to a first decoder 14. The decoder 14 decodes the encoded speech samples and synthesizes an output speech signal s_SYNTH(n). For transmission in the opposite direction, a second encoder 16 encodes digitized speech samples s(n), which are transmitted over a communication channel 18. A second decoder 20 receives and decodes the encoded speech samples, generating a synthesized output speech signal s_SYNTH(n).
The speech samples s(n) represent speech signals that have been digitized and quantized in accordance with any of various methods known in the art, including, e.g., pulse code modulation (PCM), companded μ-law, or A-law. As known in the art, the speech samples s(n) are organized into frames of input data, wherein each frame comprises a predetermined number of digitized speech samples s(n). In an exemplary embodiment, a sampling rate of 8 kHz is employed, with each 20 ms frame comprising 160 samples. In the embodiments described below, the rate of data transmission may advantageously be varied on a frame-to-frame basis from 8 kbps (full rate) to 4 kbps (half rate) to 2 kbps (quarter rate) to 1 kbps (eighth rate). Alternatively, other data rates may be used. As used herein, the terms "full rate" or "high rate" generally refer to data rates that are greater than or equal to 8 kbps, and the terms "half rate" or "low rate" generally refer to data rates that are less than or equal to 4 kbps. Varying the data transmission rate is beneficial because lower bit rates may be selectively employed for frames containing relatively less speech information. Those skilled in the art will appreciate that other sampling rates, frame sizes, and data transmission rates may be used.
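For orientation, the following minimal sketch restates the frame and rate figures above as bits available per 20 ms frame (illustration only; the dictionary of rates simply mirrors the numbers given in the text):

FRAME_MS = 20
RATES_BPS = {"full": 8000, "half": 4000, "quarter": 2000, "eighth": 1000}

for name, bps in RATES_BPS.items():
    bits_per_frame = bps * FRAME_MS // 1000  # e.g. full rate: 8000 bps -> 160 bits per frame
    print(f"{name:>7} rate: {bps} bps -> {bits_per_frame} bits per 20 ms frame")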
The first encoder 10 and the second decoder 20 together form a first speech codec, or speech codec. Similarly, the second encoder 16 and the first decoder 14 together form a second speech codec. As is well known to those skilled in the art, a speech codec may be implemented by a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), discrete gate logic, firmware, or any conventional programmable software module and microprocessor. The software modules may reside in RAM memory, flash memory, registers, or any other form of writable storage medium known in the art. Alternatively, any conventional processor, controller or state machine may be substituted for the microprocessor. An exemplary ASIC designed specifically for speech codec is described in U.S. patent No. 5,727,123 (assigned to the assignee of the present invention and incorporated herein by reference in its entirety) and U.S. patent application serial No. 08/197,417 (assigned to the assignee of the present invention and incorporated herein by reference in its entirety) entitled vocoder Application Specific Integrated Circuit (ASIC), filed on 16.1994.
In accordance with one embodiment, as illustrated in FIG. 2, a multimode, mixed-domain linear prediction (MDLP) encoder 100 that may be used in a speech codec includes a mode decision module 102, a pitch estimation module 104, a linear prediction (LP) analysis module 106, an LP analysis filter 108, an LP quantization module 110, and an MDLP residual encoder 112. Input speech frames s(n) are provided to the mode decision module 102, the pitch estimation module 104, the LP analysis module 106, and the LP analysis filter 108. The mode decision module 102 produces a mode index I_M and a mode M based upon the periodicity of each input speech frame s(n) and other extracted parameters such as energy, spectral tilt, zero-crossing rate, etc. Various methods of classifying speech frames according to periodicity are described in U.S. Patent Application Serial No. 08/815,354, entitled "Method and Apparatus for Performing Reduced Rate Variable Rate Speech Coding," filed on 11/3/1997, assigned to the assignee of the present invention and fully incorporated herein by reference. Such methods are also incorporated into the Telecommunications Industry Association Interim Standards TIA/EIA IS-127 and TIA/EIA IS-733.
The pitch estimation module 104 produces a pitch index I_P and a lag value P_0 based upon each input speech frame s(n). The LP analysis module 106 performs linear prediction analysis on each input speech frame s(n) to generate an LP parameter a. The LP parameter a is provided to the LP quantization module 110. The LP quantization module 110 also receives the mode M, thereby performing the quantization process in a mode-dependent manner. The LP quantization module 110 produces an LP index I_LP and a quantized LP parameter â. The LP analysis filter 108 receives the quantized LP parameter â in addition to the input speech frame s(n). The LP analysis filter 108 generates an LP residual signal R[n], which represents the error between the input speech frame s(n) and the speech reconstructed from the quantized linear prediction parameter â. The LP residual R[n], the mode M, and the quantized LP parameter â are provided to the MDLP residual encoder 112. Based upon these values, the MDLP residual encoder 112 produces a residual index I_R and a quantized residual signal in accordance with the steps described below with reference to the flow chart of FIG. 4.
In FIG. 3, a decoder 200 that may be used in a speech codec includes an LP parameter decoding module 202, a residual decoding module 204, a mode decoding module 206, and an LP synthesis filter 208. The mode decoding module 206 receives and decodes the mode index I_M, generating therefrom the mode M. The LP parameter decoding module 202 receives the mode M and the LP index I_LP. The LP parameter decoding module 202 decodes the received values to produce the quantized LP parameter â. The residual decoding module 204 receives the residual index I_R, the pitch index I_P, and the mode index I_M. The residual decoding module 204 decodes the received values to generate the quantized residual signal. The quantized residual signal and the quantized LP parameter â are provided to the LP synthesis filter 208, which synthesizes a decoded output speech signal ŝ[n] therefrom.
Operation and implementation of the various modules of the encoder 100 of FIG. 2 and the decoder 200 of FIG. 3, with the exception of the MDLP residual encoder 112, are known in the art and described in the aforementioned U.S. Patent No. 5,414,796 and in L.B. Rabiner & R.W. Schafer, Digital Processing of Speech Signals 396-453 (1978).
In accordance with one embodiment, an MDLP encoder (not shown) performs the steps shown in the flow chart of FIG. 4. The MDLP encoder may be the MDLP residual encoder 112 of FIG. 2. In step 300 the MDLP encoder checks whether the mode M is full rate (FR), quarter rate (QR), or eighth rate (ER). If the mode M is FR, QR, or ER, the MDLP encoder proceeds to step 302. In step 302 the MDLP encoder applies the corresponding rate (FR, QR, or ER, depending on the value of M) to the residual index I_R. Time-domain coding, which for the FR mode is high-precision, high-rate coding and is advantageously CELP coding, is applied to an LP residual frame or, alternatively, to a speech frame. The frame is then transmitted (after further signal processing, including digital-to-analog conversion and modulation). In one embodiment the frame is an LP residual frame representing the prediction error. In an alternative embodiment the frame is a speech frame representing speech samples.
If, on the other hand, the mode M was not FR, QR, or ER in step 300 (i.e., if the mode M is half rate (HR)), the MDLP encoder proceeds to step 304. In step 304 spectral coding, which is advantageously harmonic coding, is applied at half rate to the LP residual or, alternatively, to the speech signal. The MDLP encoder then proceeds to step 306. In step 306 a performance measure D is obtained by decoding the encoded speech and comparing it with the original input frame. The MDLP encoder then proceeds to step 308. In step 308 the performance measure D is compared with a predefined threshold T. If the performance measure D is greater than the threshold T, the quantized parameters associated with the half-rate, spectrally encoded frame are modulated and transmitted. If, on the other hand, the performance measure D is not greater than the threshold T, the MDLP encoder proceeds to step 310. In step 310 the frame is re-encoded in the time domain at full rate. Any conventional high-rate, high-precision coding algorithm, advantageously CELP coding, may be used. The FR-mode quantized parameters associated with the frame are then modulated and transmitted.
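The FIG. 4 decision can be summarized with the sketch below (illustrative only; the function names and encoder interfaces are assumptions standing in for the time-domain and spectral coders described above, not an implementation of this disclosure):

def mdlp_encode(frame, mode, threshold, time_domain_encode, spectral_encode,
                spectral_decode, performance_measure):
    # Steps 300/302: FR, QR, and ER frames are time-domain coded at the rate given by the mode.
    if mode in ("FR", "QR", "ER"):
        return time_domain_encode(frame, rate=mode)
    # Step 304: HR frames are spectrally (e.g. harmonically) coded at half rate.
    packet = spectral_encode(frame, rate="HR")
    # Steps 306/308: decode, compare with the original frame, and test the performance measure D.
    d = performance_measure(frame, spectral_decode(packet))
    if d > threshold:
        return packet                              # spectral coding is acceptable
    # Step 310: closed-loop fallback, re-encode in the time domain at full rate.
    return time_domain_encode(frame, rate="FR")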
As illustrated in the flow chart of FIG. 5, a closed-loop, multimode MDLP speech codec in accordance with one embodiment follows a set of steps in processing speech samples for transmission. In step 400 the speech codec receives digital samples of a speech signal in successive frames. Upon receiving a given frame, the speech codec proceeds to step 402. In step 402 the speech codec detects the energy of the frame. The energy is a measure of the speech activity of the frame. Speech detection is performed by summing the squares of the amplitudes of the digitized speech samples and comparing the resultant energy against a threshold value. In one embodiment the threshold value adapts based on the changing level of background noise. An exemplary variable-threshold voice activity detector is described in the aforementioned U.S. Patent No. 5,414,796. Some unvoiced speech sounds may be extremely low-energy samples that could be mistakenly encoded as background noise. To prevent this from occurring, the spectral tilt of low-energy samples may be used to distinguish unvoiced speech from background noise, as described in the aforementioned U.S. Patent No. 5,414,796.
After detecting the energy of the frame, the speech codec proceeds to step 404. In step 404, the speech codec determines whether the energy of the detected frame is sufficient to classify the frame as containing speech information. If the energy of the detected frame is below the predetermined threshold level, the speech codec proceeds to step 406. In step 406, the speech codec encodes the frame as background noise (i.e., non-speech or silence). In one embodiment, the background noise frames are time-domain encoded at 1/8 rate or 1 kbps. If the energy of the detected frame meets or exceeds the predetermined threshold level in step 404, the frame is classified as speech and the speech codec proceeds to step 408.
In step 408, the speech codec determines whether the frame is periodic. Various known methods of periodicity determination include, e.g., the use of zero crossings and the use of normalized autocorrelation functions (NACFs). In particular, using zero crossings and NACFs to detect periodicity is described in U.S. Patent Application Serial No. 08/815,354, entitled "Method and Apparatus for Performing Reduced Rate Variable Rate Speech Coding," filed on 11/3/1997, assigned to the assignee of the present invention and fully incorporated herein by reference. In addition, the above methods used to distinguish voiced from unvoiced speech are incorporated into the Telecommunications Industry Association Interim Standards TIA/EIA IS-127 and TIA/EIA IS-733. If the frame is determined not to be periodic in step 408, the speech codec proceeds to step 410. In step 410 the speech codec encodes the frame as unvoiced speech. In one embodiment unvoiced speech frames are time-domain encoded at quarter rate, or 2 kbps. If the frame is determined to be periodic in step 408, the speech codec proceeds to step 412.
In step 412, the speech codec determines whether the frame is sufficiently periodic using known periodicity detection methods of the prior art as described previously, for example, in the above-mentioned U.S. patent application serial No. 08/815,354. If it is determined that the frame is not sufficiently periodic, the speech codec proceeds to step 414. In step 414, the frame is time-domain coded as transitional speech (i.e., transitioning from unvoiced to voiced speech). In one embodiment, the transitional speech frame is time-domain encoded at full rate or 8 kbps.
If, in step 412, the speech codec determines that the frame is sufficiently periodic, the speech codec proceeds to step 416. In step 416, the speech codec encodes the frame as voiced speech. In one embodiment, the voiced speech frames are spectrally encoded at half rate or 4 kbps. The voiced speech frame is advantageously spectrally encoded with a harmonic codec as described below with reference to fig. 7. Alternatively, other spectral codecs known in the art may be used, such as a sinusoidal transform codec or a multi-band excitation codec. The speech codec then proceeds to step 418. In step 418, the speech codec decodes the encoded voiced speech frames. The speech codec then goes to step 420. In step 420, the decoded voiced speech frame is compared to the corresponding input speech samples for the frame to obtain a measure of the synthesized speech distortion and to determine whether the half-rate voiced speech spectral codec model operates within acceptable limits. The speech codec then proceeds to step 422.
In step 422, the speech codec determines whether the error between the decoded voiced speech frame and the corresponding input speech sample for the frame is below a predetermined threshold. According to one embodiment, this determination is accomplished using the manner described below with reference to FIG. 6. If the coding distortion is below the predetermined threshold, the speech codec proceeds to step 424. In step 424, the speech codec treats the frame as voiced speech and sends it using the parameters of step 416. If the coding error meets or exceeds the predetermined threshold in step 422, the speech codec proceeds to step 414 to time-domain encode the frame of digitized speech samples received in step 400 as transition speech at full rate.
It is noted that steps 400 through 412 together constitute an open-loop coding mode decision, whereas steps 418 through 424 together constitute a closed-loop coding mode decision.
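A compact sketch of this decision logic follows (illustration only; the detector and encoder methods on the hypothetical codec object stand in for the energy, periodicity, and coding steps described above):

def classify_and_encode(frame, codec):
    # Open-loop decision (steps 402-412), then the closed-loop check for voiced frames.
    if codec.energy(frame) < codec.energy_threshold:           # steps 402-406
        return codec.encode_time_domain(frame, rate="eighth")  # background noise / silence
    if not codec.is_periodic(frame):                           # steps 408-410
        return codec.encode_time_domain(frame, rate="quarter") # unvoiced speech
    if not codec.is_sufficiently_periodic(frame):              # steps 412-414
        return codec.encode_time_domain(frame, rate="full")    # transition speech
    packet = codec.encode_spectral(frame, rate="half")         # step 416: voiced speech
    decoded = codec.decode_spectral(packet)                    # step 418
    if codec.distortion(frame, decoded) < codec.distortion_threshold:  # steps 420-424
        return packet
    return codec.encode_time_domain(frame, rate="full")        # fall back to step 414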
As shown in FIG. 6, in one embodiment, a closed-loop multi-mode MDLP speech codec includes an analog-to-digital converter (A/D)500 coupled to a frame buffer 502, the frame buffer 502 in turn coupled to a control processor 504. An energy calculator 506, a voiced speech detector 508, a background noise encoder 510, a high-rate time-domain encoder 512, and a low-rate spectral encoder 514 are also coupled to the control processor 504. A spectral decoder 516 is coupled to the spectral encoder 514 and an error calculator 518 is coupled to the spectral decoder 516 and the control processor 504. A threshold comparator 520 is coupled to the error calculator 518 and the control processor 504, and a buffer 522 is coupled to the spectral encoder 514, the spectral decoder 516, and the threshold comparator 520.
In the embodiment shown in fig. 6, the components of the speech codecs are advantageously implemented in the speech codec as firmware or other software driven modules, and the speech codec itself preferably resides in a DSP or an ASIC. Those skilled in the art will appreciate that these speech codec components may be equivalently implemented in many other known ways. Control processor 504 may advantageously be a microprocessor, but may alternatively be implemented using a controller, state machine, or discrete logic circuitry.
In the multimode codec shown in FIG. 6, a speech signal is provided to the A/D 500. The A/D 500 converts the analog signal into frames of digitized speech samples S(n). The digitized speech samples are provided to the frame buffer 502. The control processor 504 takes the digitized speech samples from the frame buffer 502 and provides them to the energy calculator 506. The energy calculator 506 computes the energy E of the speech samples in accordance with the following equation:

E = Σ S²(n), n = 0, 1, ..., 159,

where the frame is 20 ms in length and the sampling rate is 8 kHz. The computed energy E is sent back to the control processor 504.
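A minimal sketch of this calculation follows (the example tone and the use of NumPy are assumed conveniences for illustration):

import numpy as np

def frame_energy(samples):
    # E = sum of squared sample amplitudes over one 20 ms frame (160 samples at 8 kHz).
    return float(np.sum(np.asarray(samples, dtype=float) ** 2))

# Example with an assumed unit-amplitude 100 Hz tone.
frame = np.sin(2 * np.pi * 100 * np.arange(160) / 8000.0)
print(frame_energy(frame))   # roughly 80 for a unit-amplitude sinusoid over two full periods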
Control processor 504 compares the calculated speech energy to a speech activity threshold. If the calculated energy is below the voice activity threshold, the control processor 504 directs the digitized voice samples from the frame buffer 502 to the background noise encoder 510. The background noise encoder 510 encodes these frames using the minimum number of bits necessary to maintain an estimate of the background noise.
If the calculated energy is greater than or equal to the voice activity threshold, the control processor 504 directs the digitized voice samples from the frame buffer 502 to the voiced-speech detector 508. The voiced-speech detector 508 determines whether the periodicity of these speech frames allows for efficient coding using some low bit-rate spectral coding. Methods of determining the level of periodicity in a speech frame are known in the art, and include, for example, the use of normalized autocorrelation functions (NACFs) and zero crossings. These and other methods are described in the aforementioned U.S. patent application serial No. 08/815,354.
The voiced-speech detector 508 provides a signal to the control processor 504 indicating whether the speech frame contains speech with sufficient periodicity for efficient encoding by the spectral encoder 514. If voiced-speech detector 508 determines that the speech frame lacks sufficient periodicity, control processor 504 directs the digitized speech samples to high-rate encoder 512, and high-rate encoder 512 time-domain encodes the speech at a predetermined maximum data rate. In one embodiment, the predetermined maximum data rate is 8kbps and the high-rate encoder 512 is a CELP codec.
If voiced-speech detector 508 initially determines that the speech signal has sufficient periodicity for spectral encoder 514 to effectively encode, control processor 504 directs the digitized speech samples from frame buffer 502 to spectral encoder 514. An exemplary spectral encoder will be described in detail below with reference to fig. 7.
The spectral encoder 514 extracts the estimated pitch frequency F_0, the amplitudes A_l of the harmonics of the pitch frequency, and voicing information V_c. The spectral encoder 514 provides these parameters to the buffer 522 and the spectral decoder 516. The spectral decoder 516 may advantageously be analogous to the decoder in conventional CELP encoders. The spectral decoder 516 generates synthesized speech samples in accordance with a spectral decoding format (described below with reference to FIG. 7) and provides the synthesized speech samples to the error calculator 518. The control processor 504 sends the speech samples S(n) to the error calculator 518.
The error calculator 518 computes the mean squared error (MSE) between each speech sample S(n) and each corresponding synthesized speech sample S_SYNTH(n) in accordance with the following equation:

MSE = (1/N) Σ (S(n) − S_SYNTH(n))², n = 1, 2, ..., N,

where N is the number of samples per frame.
the calculated MSE is provided to a threshold comparator 520, which threshold comparator 520 determines whether the distortion level is within acceptable limits, i.e. whether the distortion level is below a predetermined threshold.
If the computed MSE is within acceptable limits, the threshold comparator 520 provides a signal to the buffer 522, and the spectrally encoded data is output from the speech codec. If, on the other hand, the MSE is not within acceptable limits, the threshold comparator 520 provides a signal to the control processor 504, which in turn directs the digitized samples from the buffer 522 to the high-rate, time-domain encoder 512. The time-domain encoder 512 encodes the frames at a predetermined maximum rate, and the contents of the buffer 522 are discarded.
In the embodiment shown in fig. 6, the type of spectral codec used is a harmonic codec as will be described below with reference to fig. 7, but any type of spectral codec may be selected, such as a sinusoidal transform codec or a multi-band excitation codec. The use of multi-band excitation codecs is described, for example, in U.S. Pat. No. 5,195,166, and the use of sinusoidal transform codecs is described, for example, in U.S. Pat. No. 4,865,068.
In the multimode codec of FIG. 6, the high-rate, time-domain encoder 512 advantageously uses CELP coding at full rate, or 8 kbps, for transition frames and for voiced frames whose coding distortion meets or exceeds the threshold. Alternatively, any other known form of high-rate, time-domain coding may be used for such frames. In this way, transition frames (and voiced frames lacking sufficient periodicity) are coded with high precision, so that the input and output waveforms are well matched and the phase information is well preserved. In one embodiment, after a predetermined number of consecutive voiced frames have been processed with the half-rate spectral coding, the multimode codec switches from half-rate spectral coding to full-rate CELP coding for one frame, regardless of the determination made by the threshold comparator 520.
It is noted that the energy calculator 506, the voiced speech detector 508, and the control processor 504 together comprise an open-loop coding decision. In contrast, the spectral encoder 514, the spectral decoder 516, the error calculator 518, the threshold comparator 520, the buffer 522, and the control processor 504 together comprise a closed-loop coding decision.
In the embodiment described with reference to fig. 7, the substantially periodic voiced frames are encoded at a low bit rate using spectral coding (advantageously harmonic coding). Spectral codecs are generally defined by algorithms that attempt to preserve the temporal evolution of the spectral characteristics of speech in some perceptual sense by modeling and encoding each frame of speech in the frequency domain. The basic parts of such an algorithm are: (1) spectral analysis or parameter estimation; (2) quantizing the parameters; and (3) synthesizing an output speech waveform using the decoded parameters. The goal is therefore to preserve the important features of the short-term speech spectrum with a set of spectral parameters, encode these parameters, and then synthesize the output speech with the decoded spectral parameters. Typically, the output speech is synthesized as a weighted sum of sinusoids. The amplitude, frequency and phase of the sine wave are the spectral parameters estimated during the analysis.
Although "analysis by synthesis" is a well-known technique in CELP codec, it is not used in spectral codec. The main reason that is not applied to the spectral codec by the analysis-by-synthesis is the loss of initial phase information. Although the speech model works well from a perceptual standpoint, the Mean Square Energy (MSE) of the synthesized speech may still be high. Thus, another advantage of accurately generating the initial phase is the resulting ability to directly compare the reproduced speech to the speech samples to allow a determination of whether such speech models can accurately encode the speech frame.
In spectral coding, the output speech frame is synthesized as

S[n] = S_v[n] + S_uv[n], n = 1, 2, ..., N,
where N is the number of samples per frame and S_v and S_uv are the voiced and unvoiced components, respectively. A "sine-wave summation" synthesis process constructs the voiced component as

S_v[n] = Σ A(k,n)·cos(θ(k,n)), k = 1, 2, ..., L,

where L is the total number of sinusoids, the f_k are the frequencies of interest in the short-term spectrum, A(k,n) is the amplitude of the k-th sinusoid, and θ(k,n) is the phase of the k-th sinusoid. The amplitude, frequency, and phase parameters are estimated from the short-term spectrum of the input frame by a spectral analysis process. The unvoiced component can be created together with the voiced component in a single "sine-wave summation" synthesis, or it can be computed separately by a dedicated "unvoiced synthesis" process and then added back to S_v.
In the embodiment of FIG. 7, a particular type of spectral coding, called harmonic coding, is used to spectrally encode sufficiently periodic voiced frames at a low bit rate. Harmonic codecs characterize a frame as the sum of sinusoids, analyzing small segments of the frame. Each sinusoid in the sum has a frequency that is an integer multiple of the pitch F_0 of the frame. In an alternate embodiment in which the particular type of spectral coding employed is not harmonic coding, the sinusoid frequencies for each frame are instead taken from a set of real numbers between 0 and 2π. In the embodiment of FIG. 7, the amplitude and the phase of each sinusoid in the sum are advantageously chosen so that the sum best matches the signal over one period, as illustrated in the graph of FIG. 8. Harmonic codecs typically employ an external classification, labeling each input speech frame as voiced or unvoiced. For a voiced frame, the sinusoid frequencies are constrained to be harmonics of the estimated pitch (F_0), i.e., f_k = k·F_0. For an unvoiced frame, the peaks of the short-term spectrum are used to determine the sinusoids. The amplitude and the phase are interpolated to model their evolution over the frame, as:
A(k,n) = C1(k)·n + C2(k)
θ(k,n) = B1(k)·n² + B2(k)·n + B3(k)
where the coefficients [C_i(k), B_i(k)] are estimated from the instantaneous values of amplitude, frequency, and phase at the frequency positions f_k (= k·F_0) of the short-term Fourier transform (STFT) of the windowed input speech frame. The parameters to be transmitted per sinusoid are the amplitude and the frequency. The phase is not transmitted, but is instead modeled in accordance with any of several known techniques, such as the quadratic phase model.
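The interpolation model above can be illustrated with the following sketch of sine-wave summation synthesis over one frame (the harmonic count, amplitudes, and coefficient values are assumed toy values for illustration, not estimates from real speech):

import numpy as np

def synthesize_voiced(C1, C2, B1, B2, B3, num_samples):
    # Sum of L sinusoids with A(k,n) = C1[k]*n + C2[k] and
    # theta(k,n) = B1[k]*n**2 + B2[k]*n + B3[k] (quadratic phase model).
    n = np.arange(num_samples, dtype=float)
    sv = np.zeros(num_samples)
    for k in range(len(C1)):
        amplitude = C1[k] * n + C2[k]
        phase = B1[k] * n ** 2 + B2[k] * n + B3[k]
        sv += amplitude * np.cos(phase)
    return sv

# Toy example: three harmonics of a 100 Hz pitch over one 160-sample frame at 8 kHz.
fs, f0, L, N = 8000.0, 100.0, 3, 160
C1 = np.zeros(L)                                   # amplitudes held constant over the frame
C2 = np.array([1.0, 0.5, 0.25])
B1 = np.zeros(L)                                   # no frequency change across the frame
B2 = 2 * np.pi * f0 * np.arange(1, L + 1) / fs     # phase slope corresponds to harmonic k*F0
B3 = np.zeros(L)                                   # initial phases
print(synthesize_voiced(C1, C2, B1, B2, B3, N)[:5])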
As illustrated in FIG. 7, a harmonic codec includes a pitch extractor 600 coupled to windowing logic 602 and to discrete Fourier transform (DFT) and harmonic analysis logic 604. The pitch extractor 600, which receives the speech samples S(n) as input, is also coupled to the DFT and harmonic analysis logic 604. The DFT and harmonic analysis logic 604 is coupled to a residual encoder 606. The pitch extractor 600, the DFT and harmonic analysis logic 604, and the residual encoder 606 are each coupled to a parameter quantizer 608. The parameter quantizer 608 is in turn coupled to a channel encoder 610, which is coupled to a transmitter 612. The transmitter 612 is coupled by a standard radio-frequency (RF) interface, such as a code division multiple access (CDMA) over-the-air interface, to a receiver 614. The receiver 614 is coupled to a channel decoder 616, which is coupled to a dequantizer 618. The dequantizer 618 is coupled to a "sine-wave summation" speech synthesizer 620. Also coupled to the "sine-wave summation" speech synthesizer 620 is a phase estimator 622, which receives previous-frame information as input. The "sine-wave summation" speech synthesizer 620 is configured to produce the synthesized speech output S_SYNTH(n).
The pitch extractor 600, the windowing logic 602, the DFT and harmonic analysis logic 604, the residual encoder 606, the parameter quantizer 608, the channel encoder 610, the channel decoder 616, the dequantizer 618, the "sine-wave summation" speech synthesizer 620, and the phase estimator 622 can be implemented in a variety of different ways known to those of skill in the art, including, e.g., as firmware or software modules. The transmitter 612 and the receiver 614 can be implemented with any equivalent standard RF components.
In the harmonic codec shown in FIG. 7, the pitch extractor 600 receives the input samples S(n) and extracts the pitch frequency information F_0. The windowing logic 602 then multiplies the samples by an appropriate windowing function so that small segments of the speech frame can be analyzed. Using the pitch information provided by the pitch extractor 600, the DFT and harmonic analysis logic 604 computes the DFT of the samples to generate sample points of the complex spectrum, from which the harmonic amplitudes A_l are extracted, as illustrated in the graph of FIG. 8, in which L denotes the total number of harmonics. The DFT is provided to the residual encoder 606, which extracts the voicing information V_c.
It should be noted that, as illustrated in FIG. 8, the parameter V_c denotes a point on the frequency axis above which the spectrum is characteristic of an unvoiced speech signal and no longer contains harmonics. Below the point V_c, in contrast, the spectrum is harmonic and characteristic of a voiced speech signal.
The A_l, F_0, and V_c components are provided to the parameter quantizer 608, which quantizes the information. The quantized information is provided in the form of packets to the channel encoder 610, which encodes the packets at a low bit rate such as, e.g., half rate, or 4 kbps. The packets are provided to the transmitter 612, which modulates the packets and transmits the resultant signal over the air to the receiver 614. The receiver 614 receives and demodulates the signal, passing the encoded packets to the channel decoder 616. The channel decoder 616 decodes the packets and provides the decoded packets to the dequantizer 618. The dequantizer 618 dequantizes the information. The information is then provided to the "sine-wave summation" speech synthesizer 620.
"sine wave accumulation" vocoder 620 is configured to perform as described above with respect to S [ n ]]The formula (a) synthesizes a plurality of sinusoids that model the short-term speech spectrum. Frequency f of these sine waveskIs the fundamental frequency F0Multiple or harmonic of, and fundamental frequency F0Is the frequency of the pitch period of the quasi-periodic (i.e., transitional) voiced speech segments.
The "sine wave accumulate" vocoder 620 also receives phase information from the phase estimator 622. The phase estimator 622 receives information of the previous frame, i.e., a of the immediately previous frame1、F0And VcAnd (4) parameters. Phase estimator 622 also receives N samples of the previous frame reconstruction, where N is the length of the frame (i.e., N is the number of samples per frame). The phase estimator 622 determines the initial phase of the frame based on the information of the previous frame. The determined initial phase is provided to a "sine wave accumulation" vocoder 620. Based on the information of the current frame, and the initial phase calculation performed by phase estimator 622 based on the information of the past frame, a "sine wave accumulation" speech synthesizer 620 generates synthesized speech frames, as described above.
As described above, the harmonic codec synthesizes, or reconstructs, speech frames by using previous-frame information and predicting that the phase changes linearly from frame to frame. In the synthesis model described above, commonly referred to as the "quadratic phase model," the coefficient B3(k) represents the initial phase of the current voiced frame being synthesized. In determining the phase, conventional harmonic codecs either set the initial phase to zero or generate an initial phase value randomly, or with some pseudo-random generation method. To predict the phase more accurately, the phase estimator 622 uses one of two possible methods of determining the initial phase, depending on whether the immediately preceding frame was determined to be a voiced speech frame (i.e., a sufficiently periodic frame) or a transition frame. If the previous frame was a voiced speech frame, the final estimated phase value of that frame is used as the initial phase value of the current frame. If, on the other hand, the previous frame was classified as a transition frame, the initial phase value of the current frame is obtained from the spectrum of the previous frame, which is obtained by taking the DFT of the decoder output for the previous frame. Hence, the phase estimator 622 makes use of accurate phase information that is already available (because the previous frame, being a transition frame, was processed at full rate).
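The initial-phase rule just described can be sketched as follows (an illustrative helper with assumed names; the DFT-based phase extraction shown is one plausible reading of obtaining the phase from the spectrum of the previous decoded frame, not a statement of the exact method of this disclosure):

import numpy as np

def initial_phases(prev_mode, prev_final_phases, prev_decoded, f0, fs, num_harmonics):
    # Choose the initial phase B3(k) for each harmonic of the current voiced frame.
    # prev_mode: "V" if the previous frame was spectrally coded as voiced speech,
    #            "T" if it was time-domain coded as a transition frame.
    if prev_mode == "V":
        # Continue the phase track: reuse the final estimated phases of the previous frame.
        return np.asarray(prev_final_phases, dtype=float)
    # Previous frame was time-domain coded: measure the harmonic phases from a DFT
    # of its decoded output at the harmonic frequencies k*F0.
    x = np.asarray(prev_decoded, dtype=float)
    n = np.arange(len(x))
    phases = [np.angle(np.sum(x * np.exp(-2j * np.pi * k * f0 * n / fs)))
              for k in range(1, num_harmonics + 1)]
    return np.array(phases)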
In one embodiment, a closed-loop, multimode MDLP speech codec follows the speech processing steps depicted in the flow chart of FIG. 9. The speech codec encodes the LP residual of each input speech frame by choosing the most appropriate coding mode. Certain modes encode the LP residual, or the speech signal, in the time domain, while other modes represent the LP residual, or the speech signal, in the frequency domain. The set of modes is full rate, time domain (T mode) for transition frames; half rate, frequency domain (V mode) for voiced frames; quarter rate, time domain (U mode) for unvoiced frames; and eighth rate, time domain (N mode) for noise frames.
Those skilled in the art will appreciate that either the speech signal or the corresponding LP residual may be encoded by following the steps shown in FIG. 9. The waveform characteristics of noise, unvoiced, transition, and voiced speech can be seen as a function of time in the graph of FIG. 10A. The waveform characteristics of noise, unvoiced, transition, and voiced LP residual can be seen as a function of time in the graph of FIG. 10B.
In step 700, an open-loop mode decision is made regarding which one of the four modes (T, V, U, or N) is to be applied to the input speech residual s(n). If T mode is to be applied, the speech residual is processed in T mode (i.e., at full rate, in the time domain) in step 702. If U mode is to be applied, the speech residual is processed in U mode (i.e., at quarter rate, in the time domain) in step 704. If N mode is to be applied, the speech residual is processed in N mode (i.e., at eighth rate, in the time domain) in step 706. If V mode is to be applied, the speech residual is processed in V mode (i.e., at half rate, in the frequency domain) in step 708.
In step 710, the speech encoded in step 708 is decoded and compared with the input speech residual s(n), and a performance measure D is computed. In step 712, the performance measure D is compared with a predefined threshold T. If the performance measure D is greater than or equal to the threshold T, the spectrally encoded speech residual of step 708 is approved for transmission in step 714. If, on the other hand, the performance measure D is less than the threshold T, the input speech residual s(n) is processed in T mode in step 716. In an alternate embodiment, no performance measure is computed and no threshold is defined. Instead, after a predetermined number of speech residual frames have been processed in V mode, the next frame is processed in T mode.
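A compact sketch of this mode decision follows (illustration only; the open-loop classifier, the per-mode encoders, and the performance measure are placeholders standing in for the processes described above):

def process_residual(frame, open_loop_mode, threshold, encoders, decode_v, performance):
    # Steps 700-706: route the frame to T, U, or N processing as decided open loop.
    if open_loop_mode in ("T", "U", "N"):
        return encoders[open_loop_mode](frame)
    # Step 708: V mode, half rate in the frequency domain.
    packet = encoders["V"](frame)
    # Steps 710-712: decode, compare with the input, and compute the performance measure D.
    d = performance(frame, decode_v(packet))
    if d >= threshold:
        return packet                      # step 714: transmit the spectrally encoded frame
    return encoders["T"](frame)            # step 716: full-rate, time-domain "refresh"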
Advantageously, the decision logic shown in FIG. 9 uses the high-bit-rate T mode only when necessary, exploiting the periodicity of voiced speech segments with the low-bit-rate V mode while switching to full rate whenever the V mode does not perform adequately, thereby preventing any degradation in quality. Accordingly, an extremely high voice quality approaching that of full-rate coding can be produced at an average rate that is significantly lower than full rate. Moreover, the target voice quality can be controlled by the performance measure selected and the threshold selected.
The "update" of the T mode also improves the performance of subsequent applications of the V mode by keeping the analog phase tracking close to that of the input speech. When the performance of the V-mode is not adequate, the closed loop performance check of steps 710 and 712 switches to the T-mode, thereby improving the performance of subsequent V-mode processing by "refreshing" the initial phase values to bring the analog phase tracking closer to the phase tracking of the original input speech again. By way of example as shown in the graphs of fig. 11A-C, the fifth frame from the beginning in the V mode performs inappropriately as can be seen by the PSNR distortion measure used. Thus, without closed loop decision and update, the analog phase tracking would deviate significantly from the phase tracking of the original input speech, resulting in a dramatic decrease in PSNR as shown in fig. 11C. Moreover, the performance of subsequent frames processed in V-mode is degraded. However, in the closed-loop determination, the fifth frame is switched to the T-mode process as shown in fig. 11A. As can be seen from the improvement in PSNR shown in fig. 11B, the performance of the fifth frame is significantly improved by the update. In addition, the performance of subsequent frames processed in the V mode is also improved.
The decision logic shown in FIG. 9 also improves the quality of the V-mode representation by providing a highly accurate initial phase estimate, ensuring that the resulting V-mode-synthesized speech residual signal is accurately time-aligned with the original input speech residual s(n). The initial phase for the first V-mode-processed speech residual segment is derived from the immediately preceding decoded frame in the following manner. For each harmonic, the initial phase is set equal to the final estimated phase of the previous frame if the previous frame was processed in V mode. For each harmonic, the initial phase is set equal to the actual harmonic phase of the previous frame if the previous frame was processed in T mode. The actual harmonic phase of the previous frame is derived by taking a DFT of the past decoded residual using the entire previous frame. Alternatively, the actual harmonic phase of the previous frame may be derived by taking a DFT of the past decoded residual in a pitch-synchronous manner, processing a number of pitch periods of the previous frame.
Thus, a novel closed-loop multimode mixed-domain linear prediction (MDLP) speech codec has been described. Those of skill in the art will understand and appreciate that the various illustrative logical blocks and algorithm steps described in connection with the embodiments disclosed herein may be implemented or performed with a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), discrete gate or transistor logic, discrete hardware components such as registers and FIFO, a processor executing a set of firmware instructions, or any conventional programmable software module and processor. Advantageously, the processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. The software modules may reside in RAM memory, flash memory, registers, or any other form of writable storage medium known in the art. Those of skill would further appreciate that the data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be advantageously represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
Preferred embodiments of the present invention are thus disclosed and described. However, it will be apparent to those skilled in the art that many changes can be made to the embodiments disclosed herein without departing from the spirit and scope of the invention. Accordingly, the invention is not to be restricted except in light of the following claims.
Claims (25)
1. A multi-modal mixed-domain speech processor, comprising:
a codec having at least one time-domain coding mode and at least one frequency-domain coding mode; and
closed-loop mode selection means coupled to the codec and configured to implement at least one time-domain codec mode if the output of the frequency-domain codec mode is distorted outside acceptable limits;
wherein said closed-loop mode selection means comprises a comparison circuit coupled to the codec for comparing unencoded frames with frames encoded in at least one frequency domain codec mode and for generating a performance measure based on the comparison, wherein the codec applies the at least one time domain codec mode only if the performance measure is below a predetermined threshold, and otherwise the codec applies the at least one frequency domain codec mode.
2. The speech processor of claim 1, wherein the codec encodes speech frames.
3. The speech processor of claim 1, wherein the codec encodes the linear prediction residual of a speech frame.
4. The speech processor of claim 1, wherein the at least one time-domain coding mode comprises a coding mode that codes frames at a first coding rate, and wherein the at least one frequency-domain coding mode comprises a coding mode that codes frames at a second coding rate, the second coding rate being less than the first coding rate.
5. The speech processor of claim 1, wherein the at least one frequency domain coding mode comprises a harmonic coding mode.
6. The speech processor of claim 1, wherein the codec applies the at least one time-domain coding mode to the immediately following frame after a predetermined number of consecutive frames have been coded in the at least one frequency-domain coding mode.
7. A multi-modal mixed-domain speech processor, comprising:
a codec having at least one time-domain coding mode and at least one frequency-domain coding mode, wherein the at least one frequency-domain coding mode represents the short-term spectrum of each frame by a plurality of sinusoids each having a set of parameters including frequency, phase, and amplitude, and the phase is modeled by a polynomial expression and an initial phase value, where the polynomial expression is θ(k, n) = B1(k)·n² + B2(k)·n + B3(k), θ(k, n) is the phase of the k-th sinusoid, k = 1, 2, …, L, L is the total number of sinusoids, n = 1, 2, …, N, N is the number of samples per frame, and Bi(k) are estimated coefficients; the initial phase value is either (1) the final estimated phase value of the previous frame if the previous frame was encoded in one of the at least one frequency-domain coding mode, or (2) a phase value obtained from the short-term spectrum of the previous frame if the previous frame was encoded in one of the at least one time-domain coding mode; and
closed-loop mode selection means coupled to the codec and configured to select a codec mode for the codec based on the short-term spectrum of each frame encoded in at least one frequency-domain codec mode, wherein the closed-loop mode selection means selects at least one time-domain codec mode if the short-term spectrum is distorted outside acceptable limits;
wherein said closed-loop mode selection means comprises a comparison circuit coupled to the codec for comparing unencoded frames with frames encoded in at least one frequency domain codec mode and for generating a performance measure based on the comparison, wherein the codec applies the at least one time domain codec mode only if the performance measure is below a predetermined threshold, and otherwise the codec applies the at least one frequency domain codec mode.
8. The speech processor of claim 7, wherein the sine wave frequency of each frame is an integer multiple of the pitch frequency of the frame.
9. The speech processor of claim 7, wherein the frequency of the sinusoid for each frame is extracted from a set of real numbers between 0 and 2 pi.
10. A method of processing a frame, comprising the steps of:
applying an open-loop codec mode selection process to each successive input frame to select a time-domain codec mode or a frequency-domain codec mode based on speech content of the input frame;
performing frequency-domain coding on the input frame if the speech content of the input frame represents steady-state voiced speech;
performing time-domain coding on the input frame if the speech content of the input frame represents anything other than steady-state voiced speech;
comparing the frame encoded and decoded in the frequency domain with the input frame to obtain a performance measure; and
performing time-domain coding on the input frame if the performance measure is below a predetermined threshold.
11. The method of claim 10, wherein the frames are linear prediction residue frames.
12. The method of claim 10, wherein the frames are speech frames.
13. The method of claim 10, wherein the time-domain coding step comprises coding the frames at a first coding rate, and wherein the frequency-domain coding step comprises coding the frames at a second coding rate, the second coding rate being less than the first coding rate.
14. The method of claim 10, wherein the frequency domain coding step comprises harmonic coding.
15. The method of claim 10, wherein the frequency-domain coding step comprises representing the short-term spectrum of each frame with a plurality of sinusoids each having a set of parameters including frequency, phase, and amplitude, wherein the phase is modeled by a polynomial expression and an initial phase value, the polynomial expression being θ(k, n) = B1(k)·n² + B2(k)·n + B3(k), where θ(k, n) is the phase of the k-th sinusoid, k = 1, 2, …, L, L is the total number of sinusoids, n = 1, 2, …, N, N is the number of samples per frame, and Bi(k) are estimated coefficients; the initial phase value is either (1) the final estimated phase value of the previous frame if the previous frame was coded in a frequency-domain coding mode, or (2) a phase value obtained from the short-term spectrum of the previous frame if the previous frame was coded in a time-domain coding mode.
16. The method of claim 15, wherein the frequencies of the sinusoids for each frame are integer multiples of the pitch frequency of the frame.
17. The method of claim 15, wherein the frequency of the sinusoid for each frame is extracted from a set of real numbers between 0 and 2 pi.
18. A multi-modal mixed-domain speech processor, comprising:
means for applying an open-loop codec mode selection process to the input frame to select a time-domain codec mode or a frequency-domain codec mode based on the speech content of the input frame;
means for performing frequency domain coding and decoding on the input frame if the speech content of the input frame is represented as steady-state voiced speech;
means for time-domain coding the input frame if the speech content of the input frame is represented as anything other than steady state voiced speech;
means for comparing the frames encoded and decoded in the frequency domain with the input frames to obtain a performance measure; and
means for time-domain coding the input frame if the performance measure is below a predetermined threshold.
19. The speech processor of claim 18, wherein the input frame is a linear prediction residue frame.
20. The speech processor of claim 18, wherein the input frame is a speech frame.
21. The speech processor of claim 18, wherein the means for time-domain coding comprises means for coding frames at a first coding rate, and the means for frequency-domain coding comprises means for coding frames at a second coding rate, the second coding rate being less than the first coding rate.
22. The speech processor of claim 18, wherein the means for frequency domain coding comprises a harmonic codec.
23. The speech processor of claim 18, wherein the means for frequency-domain coding comprises means for representing the short-term spectrum of each frame with a plurality of sinusoids each having a set of parameters including frequency, phase, and amplitude, wherein the phase is modeled by a polynomial expression and an initial phase value, the polynomial expression being θ(k, n) = B1(k)·n² + B2(k)·n + B3(k), where θ(k, n) is the phase of the k-th sinusoid, k = 1, 2, …, L, L is the total number of sinusoids, n = 1, 2, …, N, N is the number of samples per frame, and Bi(k) are estimated coefficients; the initial phase value is either (1) the final estimated phase value of the previous frame if the previous frame was coded in a frequency-domain coding mode, or (2) a phase value obtained from the short-term spectrum of the previous frame if the previous frame was coded in a time-domain coding mode.
24. The speech processor of claim 23, wherein the frequencies of the sinusoids for each frame are integer multiples of the pitch frequency of the frame.
25. The speech processor of claim 23, wherein the frequency of the sinusoid in each frame is extracted from a set of real numbers between 0 and 2 pi.
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/US2000/005140 WO2001065544A1 (en) | 2000-02-29 | 2000-02-29 | Closed-loop multimode mixed-domain linear prediction speech coder |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1055833A1 (en) | 2004-01-21 |
| HK1055833B (en) | 2007-01-12 |
Similar Documents
| Publication | Title |
|---|---|
| CN1266674C (en) | Closed-loop multimode mixed-domain linear prediction (MDLP) speech coder |
| CN100350453C (en) | Robust speech classification method and device |
| CN1223989C (en) | Frame erasure compensation method in a variable rate speech coder and device using the method |
| EP1279167B1 (en) | Method and apparatus for predictively quantizing voiced speech |
| US6640209B1 (en) | Closed-loop multimode mixed-domain linear prediction (MDLP) speech coder |
| CN1188832C (en) | Multipulse interpolative coding of transition speech frames |
| US7085712B2 (en) | Method and apparatus for subsampling phase spectrum information |
| US6449592B1 (en) | Method and apparatus for tracking the phase of a quasi-periodic signal |
| CN1262991C (en) | Method and apparatus for tracking the phase of a quasi-periodic signal |
| HK1055833B (en) | Closed-loop multimode mixed-domain linear prediction speech coder and method of processing frames |
| HK1055834B (en) | Method and apparatus for tracking the phase of a quasi-periodic signal |
| JP2011090311A (en) | Closed-loop multimode mixed-domain linear prediction speech coder |
| HK1055173A (en) | Method and apparatus for predictively quantizing voiced speech |
| HK1055174B (en) | Frame erasure compensation method in a variable rate speech coder and apparatus using the same |
| HK1064196B (en) | Method and apparatus for subsampling phase spectrum information |
| HK1067444B (en) | Method and apparatus for robust speech classification |