
HK1133730B - Signal modification method for efficient coding of speech signals - Google Patents


Info

Publication number
HK1133730B
HK1133730B (application HK10100712.5A)
Authority
HK
Hong Kong
Prior art keywords
signal
pitch
frame
speech
long
Prior art date
Application number
HK10100712.5A
Other languages
Chinese (zh)
Other versions
HK1133730A1 (en)
Inventor
Mikko Tammi
Milan Jelinek
Claude Laflamme
Vesa Ruoppila
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CA002365203A external-priority patent/CA2365203A1/en
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Publication of HK1133730A1 publication Critical patent/HK1133730A1/en
Publication of HK1133730B publication Critical patent/HK1133730B/en


Description

Signal modification method for efficient coding of speech signals
This application is a divisional application of the invention patent application with application number 02827607.8, filed on December 13, 2002, and entitled "Signal modification method for efficient coding of speech signals".
Technical Field
The present invention relates generally to the encoding and decoding of sound signals in communication systems. More particularly, the present invention relates to a signal modification technique applicable, in particular but not exclusively, to code-excited linear prediction (CELP) coding.
Background
In various application areas such as videoconferencing, multimedia and wireless communication, there is an increasing demand for efficient digital narrowband and wideband speech coding techniques with a good compromise between subjective quality and bit rate. Until recently, a telephone bandwidth limited to the range of 200-3400 Hz has mainly been used in speech coding applications. However, wideband speech applications provide increased intelligibility and naturalness of communication compared to the traditional telephone bandwidth. A bandwidth in the range of 50-7000 Hz has been found sufficient to deliver a good quality giving an impression of face-to-face communication. For general audio signals, this bandwidth gives an acceptable subjective quality, but is still lower than the quality of FM radio or CD, which operate in the ranges of 20-16000 Hz and 20-20000 Hz, respectively.
The speech encoder converts the speech signal into a digital bit stream, which is transmitted over a communication channel or stored in a storage medium. The speech signal is digitized, i.e. sampled and quantized, typically with 16 bits per sample. The speech coder plays the role of representing these digital samples with a smaller number of bits while maintaining a good subjective speech quality. A speech decoder or synthesizer operates on the transmitted or stored bit stream and converts it back to a sound signal.
Code-Excited Linear Prediction (CELP) coding is one of the best techniques for achieving a good compromise between subjective quality and bit rate. This coding technique is the basis of several speech coding standards in both wireless and wireline applications. In CELP coding, the sampled speech signal is processed in successive blocks of N samples called frames, where N is a predetermined number of samples typically corresponding to 10-30 ms. A linear prediction (LP) filter is computed and transmitted every frame. The computation of the LP filter typically requires a lookahead, that is, a 5-10 ms speech segment from the subsequent frame. The N-sample frame is divided into smaller blocks called subframes. The number of subframes is typically three or four, resulting in 4-10 ms subframes. In each subframe, an excitation signal is usually obtained from two components: the past excitation and the innovative, fixed-codebook excitation. The component formed from the past excitation is often referred to as the adaptive-codebook or pitch excitation. The parameters characterizing the excitation signal are coded and transmitted to the decoder, where the reconstructed excitation signal is used as the input to the LP filter.
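As a rough illustration of the two-component excitation described above, the following sketch builds one subframe of CELP excitation from an adaptive-codebook (pitch) contribution and a fixed-codebook contribution. The function name, the integer-only pitch delay, and the assumption that the delay is at least one subframe long are simplifications for illustration, not part of the patent.

```python
import numpy as np

def celp_subframe_excitation(past_exc, pitch_delay, pitch_gain,
                             fixed_code, fixed_gain):
    """Build one subframe of CELP excitation from its two components:
    the adaptive-codebook part, taken from the past excitation at the
    given integer pitch delay, and the scaled fixed-codebook part."""
    n = len(fixed_code)
    # Assumes pitch_delay >= n, so the whole delayed segment lies
    # inside the past excitation buffer (no fractional delays here).
    start = len(past_exc) - pitch_delay
    adaptive = past_exc[start:start + n]
    return pitch_gain * adaptive + fixed_gain * np.asarray(fixed_code)
```

In a real CELP coder the delay may be fractional and the adaptive contribution is obtained by interpolation; this integer version only shows the structure excitation = pitch gain × past excitation + fixed gain × innovation.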
In conventional CELP coding, long-term prediction, which maps the past excitation to the present excitation, is generally performed on a subframe basis. Long-term prediction is characterized by a delay parameter and a pitch gain, which are usually computed, coded, and transmitted to the decoder for every subframe. At low bit rates, these parameters consume a substantial portion of the available bit budget. Their cost can be reduced by means of signal modification techniques [1-7].
[1] W.B. Kleijn, P. Kroon, and D. Nahumi, "The RCELP speech-coding algorithm," European Transactions on Telecommunications, Vol. 4, No. 5, pp. 573-582, 1994.
[2] W.B. Kleijn, R.P. Ramachandran, and P. Kroon, "Interpolation of the pitch-predictor parameters in analysis-by-synthesis speech coders," IEEE Transactions on Speech and Audio Processing, Vol. 2, No. 1, pp. 42-54, 1994.
[3] Y. Gao, A. Benyassine, J. Thyssen, H. Su, and E. Shlomot, "EX-CELP: A speech coding paradigm," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Salt Lake City, Utah, USA, pp. 689-692, 7-11 May 2001.
[4] US Patent 5,704,003, "RCELP coder," Lucent Technologies Inc. (W.B. Kleijn and D. Nahumi), filing date: 19 September 1995.
[5] European Patent Application 0602826A2, "Time shifting for analysis-by-synthesis coding," AT&T Corp. (B. Kleijn), filing date: 1 December 1993.
[6] Patent Application WO 00/11653, "Speech encoder with continuous warping combined with long term prediction," Conexant Systems Inc. (Y. Gao), filing date: 24 August 1999.
[7] Patent Application WO 00/11654, "Speech encoder adaptively applying pitch preprocessing with continuous warping," Conexant Systems Inc. (H. Su and Y. Gao), filing date: 24 August 1999.
These techniques improve the performance of long-term prediction at low bit rates by adjusting the signal to be coded. This is done by adapting the evolution of the pitch cycles in the speech signal to fit the long-term prediction delay, enabling the transmission of only one delay parameter per frame. Signal modification builds on the premise that the difference between the modified speech signal and the original speech signal can be rendered inaudible. CELP coders utilizing signal modification are often referred to as generalized analysis-by-synthesis or relaxed CELP (RCELP) coders.
Signal modification techniques adjust the pitch of the signal to a predetermined delay contour. Long-term prediction then maps the past excitation signal to the present subframe using this delay contour and scaling by a gain parameter. The delay contour is obtained straightforwardly by interpolating between two open-loop pitch estimates, the first obtained in the previous frame and the second in the current frame. The interpolation gives a delay value for every time instant of the frame. Once the delay contour is available, the pitch in the subframe currently to be coded is adjusted to follow this artificial contour by warping, i.e. by changing the time scale of the signal.
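The interpolation step just described can be sketched as follows. Linear interpolation is used here purely for illustration (the illustrative embodiment later argues for a non-linear alternative); the function name and the convention that the contour reaches the current frame's estimate exactly at the right frame boundary are assumptions.

```python
def linear_delay_contour(prev_pitch, curr_pitch, frame_len):
    """Interpolate a delay value for every sample instant of the frame,
    moving linearly from the previous frame's open-loop pitch estimate
    (left frame boundary) to the current frame's estimate (right
    frame boundary)."""
    return [prev_pitch + (curr_pitch - prev_pitch) * (t + 1) / frame_len
            for t in range(frame_len)]
```

For example, with a previous-frame estimate of 50 samples and a current-frame estimate of 58 samples, the contour rises by one sample of delay per instant across an 8-sample frame.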
In discontinuous warping [1, 4, 5], signal segments are time-shifted without changing the segment length. Discontinuous warping requires a procedure for handling the resulting overlapping or missing signal portions. In continuous warping [2, 3, 6, 7], a signal segment is contracted or expanded. This is done by forming a time-continuous approximation of the signal segment and resampling it to the desired length with unequal sampling intervals determined from the delay contour. To reduce artifacts in these operations, the allowed changes in the time scale are kept small. Moreover, the LP residual signal or the weighted speech signal, rather than the speech signal itself, is typically warped to reduce the resulting distortions. The use of these signals also facilitates the detection of pitch pulses and of the low-power regions between them, and hence the determination of the signal segments to be warped. The actual modified speech signal is then generated by inverse filtering.
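A minimal sketch of the contraction/expansion idea follows: a piecewise-linear (time-continuous) approximation of the segment is resampled to the desired length. For simplicity the sampling instants here are uniform in the new time scale; in the techniques cited above they would be derived from the delay contour. The function name and the requirement target_len >= 2 are assumptions of this sketch.

```python
def warp_segment(segment, target_len):
    """Contract or expand a signal segment to target_len samples by
    resampling a piecewise-linear approximation of it."""
    n = len(segment)
    out = []
    for k in range(target_len):
        # Position of the k-th new sample on the original time scale.
        x = k * (n - 1) / (target_len - 1)
        i = int(x)
        frac = x - i
        if i >= n - 1:
            out.append(segment[-1])
        else:
            # Linear interpolation between neighbouring original samples.
            out.append((1 - frac) * segment[i] + frac * segment[i + 1])
    return out
```

Expanding a 4-sample ramp to 7 samples, for instance, simply inserts the midpoints between the original samples.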
After signal modification for the current subframe, encoding may be performed in any conventional manner, except that a predetermined delay profile is used to generate the adaptive codebook excitation. In practice, the same signal modification techniques can be used in either narrowband or wideband CELP coding.
The signal modification technique can also be applied to other types of speech coding methods, such as waveform interpolation coding and sinusoidal coding [8].
[8] US Patent 6,223,151, "Method and apparatus for pre-processing speech signals prior to coding by transform-based speech coders," Telefonaktiebolaget LM Ericsson (W.B. Kleijn and T. Eriksson), filing date: 10 February 1999.
Disclosure of Invention
The invention relates to a method for determining a long-term prediction delay parameter characterizing long-term prediction in a technique using signal modification for digitally encoding a sound signal, comprising: dividing the sound signal into a series of successive frames; locating a feature of the sound signal in a previous frame; locating a corresponding feature of the sound signal in the current frame; and determining a long-term prediction delay parameter for the current frame such that long-term prediction maps the signal feature of the previous frame to the corresponding signal feature of the current frame.
The invention also relates to a device for determining a long-term prediction delay parameter characterizing long-term prediction in a technique using signal modification for digitally encoding a sound signal, comprising: a divider for dividing the sound signal into a series of successive frames; a detector for locating a feature of the sound signal in a previous frame; a detector for locating a corresponding feature of the sound signal in the current frame; and a calculator for determining a long-term prediction delay parameter for the current frame, the calculation being performed such that long-term prediction maps the signal feature of the previous frame to the corresponding signal feature of the current frame.
According to the present invention, there is provided a signal modification method for implementation in a technique for digitally encoding a sound signal, comprising: dividing the sound signal into a series of successive frames; dividing each frame of the sound signal into a plurality of signal segments; and warping at least a portion of the signal segments of the frame, the warping comprising constraining the warped signal segments within the frame.
According to the present invention, there is provided a signal modification device for implementation in a technique for digitally encoding a sound signal, comprising: a first divider for dividing the sound signal into a series of successive frames; a second divider for dividing each frame of the sound signal into a plurality of signal segments; and a signal segment deforming member provided with at least a part of the signal segments of said frame, this deforming member comprising a limiter for limiting the deformed signal segments within said frame.
The invention also relates to a method for searching for pitch pulses in a sound signal, comprising: dividing the sound signal into a series of successive frames; dividing each frame into a plurality of subframes; generating a residual signal by filtering the sound signal through a linear prediction analysis filter; locating the last pitch pulse of the sound signal of the previous frame from the residual signal; extracting a pitch pulse prototype of a given length around the position of the last pitch pulse of the sound signal of the previous frame using the residual signal; and using the pitch pulse prototype to locate pitch pulses in the current frame.
The invention also relates to a device for searching for pitch pulses in a sound signal, comprising: a divider for dividing the sound signal into a series of successive frames; a divider for dividing each frame into a plurality of subframes; a linear predictive analysis filter for filtering the sound signal to produce a residual signal; a detector for locating a last pitch pulse of the sound signal of a previous frame in response to the residual signal; an extractor for extracting pitch pulse prototypes of a given length around a last pitch pulse position of the sound signal of a previous frame in response to the residual signal; and a detector for locating a pitch pulse in the current frame using the pitch pulse prototype.
According to the present invention, there is also provided a method for searching for a pitch pulse in an acoustic signal, comprising: dividing the sound signal into a series of successive frames; dividing each frame into a plurality of subframes; generating a weighted sound signal by processing the sound signal via a weighting filter, wherein the weighted sound signal is indicative of the periodicity of the signal; locating a last pitch pulse of the sound signal of a previous frame based on the weighted sound signal; extracting a pitch pulse prototype of a given length around a last pitch pulse position of the sound signal of a previous frame using the weighted sound signal; pitch pulse prototypes are used to locate pitch pulses in the current frame.
Also, according to the present invention, there is provided a device for searching for a pitch pulse in an acoustic signal, comprising: a divider for dividing the sound signal into a series of successive frames; a divider for dividing each frame into a plurality of subframes; a weighting filter for processing the sound signal to produce a weighted sound signal, the weighted sound signal being indicative of the periodicity of the signal; a detector for locating a last pitch pulse of the sound signal of a previous frame in response to the weighted sound signal; an extractor for extracting pitch pulse prototypes of a given length around a last pitch pulse position of the sound signal of a previous frame in response to the weighted sound signal; and a detector for locating a pitch pulse in the current frame using the pitch pulse prototype.
The invention also relates to a method for searching for pitch pulses in a sound signal, comprising: dividing the sound signal into a series of successive frames; dividing each frame into a plurality of subframes; generating a synthesized weighted sound signal by filtering a synthesized speech signal generated during a last sub-frame of a previous frame of the sound signal through a weighting filter; locating a last pitch pulse of the sound signal of a previous frame based on the synthesized weighted sound signal; extracting a pitch pulse prototype of a given length around a last pitch pulse position of the sound signal of a previous frame using the synthesized weighted sound signal; and using the pitch pulse prototype to locate a pitch pulse in the current frame.
The invention also relates to a device for searching for pitch pulses in a sound signal, comprising: a divider for dividing the sound signal into a series of successive frames; a divider for dividing each frame into a plurality of subframes; a weighting filter for filtering a synthesized speech signal generated during a last sub-frame of a previous frame of the sound signal to generate a synthesized weighted sound signal; a detector for locating a last pitch pulse of the sound signal of a previous frame in response to the synthesized weighted sound signal; an extractor for extracting a pitch pulse prototype of a given length around a last pitch pulse position of the sound signal of a previous frame in response to the synthesized weighted sound signal; and a detector for locating a pitch pulse in the current frame using the pitch pulse prototype.
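The prototype-based pulse search recited in the preceding method and device aspects can be illustrated with a small cross-correlation sketch: the prototype extracted around the last pulse of the previous frame is slid over a window of candidate positions in the current frame, and the position of maximum correlation is retained. The function name, the centering convention, and the search window are assumptions of this sketch, not details from the patent.

```python
import numpy as np

def find_pitch_pulse(residual, prototype, predicted_pos, search_range):
    """Locate a pitch pulse by sliding the pitch pulse prototype over a
    window around the predicted position and picking the position of
    maximum cross-correlation with the residual signal."""
    m = len(prototype)
    best_pos, best_corr = predicted_pos, -np.inf
    for pos in range(predicted_pos - search_range,
                     predicted_pos + search_range + 1):
        start = pos - m // 2          # prototype centered on pos
        if start < 0 or start + m > len(residual):
            continue
        corr = float(np.dot(prototype, residual[start:start + m]))
        if corr > best_corr:
            best_pos, best_corr = pos, corr
    return best_pos
```

In practice the predicted position would be derived from the previous pulse position plus the interpolated pitch, and the correlation may be normalized; both refinements are omitted here.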
According to the present invention, there is also provided a method for forming an adaptive codebook excitation during decoding of a sound signal, the sound signal being divided into successive frames and encoded in advance by using a technique for signal modification of a digitally encoded sound signal, the method comprising:
receiving, for each frame, a long-term prediction delay parameter characterizing long-term prediction in the digital sound signal encoding technique;
restoring a delay profile using the long-term prediction delay parameter received during the current frame and the long-term prediction delay parameter received during the previous frame, wherein the delay profile with long-term prediction maps the signal feature of the previous frame to a corresponding signal feature of the current frame;
forming an adaptive codebook excitation in an adaptive codebook in response to the delay profile.
Also, according to the present invention, there is provided a device for forming an adaptive codebook excitation during decoding of a sound signal which is divided into successive frames and encoded in advance by using a technique for signal modification of a digitally encoded sound signal, the device comprising:
a receiver for receiving a long-term prediction delay parameter for each frame, the long-term prediction delay parameter characterizing long-term prediction in the digital sound signal encoding technique;
a calculator for calculating a delay profile in response to a long-term prediction delay parameter received during a current frame and a long-term prediction delay parameter received during a previous frame, wherein the delay profile with long-term prediction maps signal features of the previous frame to corresponding signal features of the current frame; and
an adaptive codebook for forming an adaptive codebook excitation in response to the delay profile.
The above and other objects, advantages and features of the present invention will become more apparent upon reading of the following non-limiting description of illustrated embodiments thereof, given by way of example only with reference to the accompanying drawings.
Drawings
FIG. 1 is a diagrammatic example of an original and modified residual signal for one frame;
FIG. 2 is a functional block diagram of an illustrative embodiment of a signal modification method in accordance with the present invention;
FIG. 3 is a schematic block diagram of a diagrammatic example of a speech communication system showing the use of a speech encoder and decoder;
FIG. 4 is a schematic block diagram of an illustrative embodiment of a speech encoder utilizing a signal modification method;
FIG. 5 is a functional block diagram of an illustrative embodiment of a pitch pulse search;
FIG. 6 is a graphical example of localized pitch pulse positions and corresponding pitch period segments for a frame;
FIG. 7 is a graphical example of the determination of the delay parameter when the number of pitch pulses is three (c = 3);
FIG. 8 is a graphical example of delayed interpolation (thick line) on a speech frame compared to linear interpolation (thin line);
FIG. 9 is a diagrammatic example of a delay profile over 10 frames selected in accordance with the delay interpolation (thick line) and linear interpolation (thin line) of FIG. 8 when the correct pitch value is 52 samples;
FIG. 10 is a functional block diagram of a signal modification method for adjusting a speech frame to a selected delay profile in accordance with an illustrative embodiment of the present invention;
FIG. 11 is an illustrative example of updating the target signal using the determined optimal shift δ and replacing the signal segment w_s(k) with interpolated values illustrated as gray dots;
FIG. 12 is a functional block diagram of rate determination logic in accordance with an illustrative embodiment of the present invention;
FIG. 13 is a schematic block diagram of an illustrative embodiment of a speech decoder using delay profiles formed in accordance with an illustrative embodiment of the present invention.
Detailed Description
Although the illustrative embodiment of the present invention will be described with reference to speech signals and the 3GPP AMR wideband speech codec (AMR-WB) standard (ITU-T G.722.2), it should be kept in mind that the concepts of the present invention can equally be applied to other types of sound signals as well as to other speech and audio encoders.
Fig. 1 illustrates an example of a modified residual signal 12 within one frame. As shown in fig. 1, the time shift in the modified residual signal 12 is constrained such that this modified residual signal is time-synchronous with the original, unmodified residual signal at the frame boundaries occurring at times t_{n-1} and t_n, where n refers to the index of the current frame.
More specifically, the time shift is explicitly controlled by the delay contour used for interpolating the delay parameter over the current frame. The delay parameter and contour are determined subject to the time-synchrony constraints at the frame boundaries. When linear interpolation is used to enforce time synchrony, the resulting delay parameters tend to oscillate over several frames. This often causes annoying artifacts in the modified signal, whose pitch follows the artificially oscillating delay contour. These oscillations can be substantially reduced by using a properly chosen non-linear interpolation technique for the delay parameters.
A functional block diagram of an illustrative embodiment of a signal modification method in accordance with the present invention is provided in fig. 2.
The method begins with a "pitch period search" block 101 that locates the individual pitch pulses and pitch periods. The search of block 101 uses an open-loop pitch estimate interpolated over the frame. Based on the located pitch pulses, the frame is divided into pitch period segments, each containing one pitch pulse and confined within the frame boundaries t_{n-1} and t_n.
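The segmentation step just described can be sketched as follows. The choice of placing segment boundaries midway between neighbouring pulses, clamped to the frame boundaries, is an assumption of this illustration; the function and variable names are likewise hypothetical.

```python
def pitch_period_segments(pulse_positions, frame_start, frame_end):
    """Split a frame into pitch period segments, one per located pitch
    pulse. Segment boundaries are placed midway between neighbouring
    pulses and clamped to the frame boundaries, so each segment
    contains exactly one pulse and stays inside the frame."""
    segments = []
    last = len(pulse_positions) - 1
    for i, p in enumerate(pulse_positions):
        left = frame_start if i == 0 else (pulse_positions[i - 1] + p) // 2
        right = frame_end if i == last else (p + pulse_positions[i + 1]) // 2
        segments.append((left, right))
    return segments
```

With pulses at samples 20, 70 and 120 inside a frame spanning samples 0-140, this yields three segments that tile the frame, each holding one pulse.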
The function of the "delay curve selection" block 103 is to determine the delay parameter for the long-term predictor and to form a delay contour for interpolating this delay parameter over the frame. The delay parameter and contour are determined subject to the constraints at the frame boundaries t_{n-1} and t_n. When signal modification is enabled for the current frame, the delay parameter determined in block 103 is encoded and transmitted to the decoder.
The actual signal modification operation is performed in a "pitch-synchronous signal modification" block 105. Block 105 first forms a target signal based on the delay contour determined in block 103, for subsequently matching the individual pitch period segments to this target signal. The pitch period segments are then shifted one by one to maximize their correlation with the target signal. To keep the complexity low, no continuous time warping is applied while searching for the optimal shifts and shifting the segments.
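The shift search of block 105 can be sketched as a plain integer search: each segment is tried at a few shifts around its nominal position and the shift of maximum correlation with the target is kept. The function name, the shift range convention, and the use of an unnormalized correlation are assumptions of this sketch.

```python
import numpy as np

def best_segment_shift(target, segment, position, max_shift):
    """Try integer shifts of a pitch period segment around its nominal
    position and return the shift maximizing the correlation between
    the shifted segment and the target signal."""
    n = len(segment)
    best, best_corr = 0, -np.inf
    for d in range(-max_shift, max_shift + 1):
        start = position + d
        if start < 0 or start + n > len(target):
            continue  # shifted segment must stay inside the target
        corr = float(np.dot(segment, target[start:start + n]))
        if corr > best_corr:
            best, best_corr = d, corr
    return best
```

This matches the text's emphasis on low complexity: only a handful of integer shifts are evaluated, with no continuous time warping.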
The illustrated embodiments of the signal modification methods disclosed in this specification are typically enabled on purely voiced speech frames. For example, transitional frames such as speech onsets are not modified because of the high risk of causing artifacts. In purely voiced frames, the pitch period typically changes slowly, so a small displacement is sufficient to adapt the signal to a long-term prediction model. Because only small, discreet signal adjustments are made, the potential for causing artifacts is minimized.
The signal modification method constitutes an efficient classifier for purely voiced segments, and thus a rate determination mechanism to be used in source-controlled coding of speech signals. Each of blocks 101, 103 and 105 of fig. 2 yields several indicators on the periodicity of the signal and the suitability of signal modification in the current frame. These indicators are analyzed in logic blocks 102, 104 and 106 to determine the appropriate coding mode and bit rate for the current frame; more specifically, these logic blocks monitor the success of the operations performed in blocks 101, 103 and 105.
If block 102 detects that the operation performed in block 101 is successful, the signal modification method continues in block 103. When this block 102 detects a failure in the operations performed in block 101, the signal modification process is terminated and the original speech frame is left intact for encoding (see block 108 (no signal modification) corresponding to the normal mode).
If block 104 detects that the operation performed in block 103 is successful, the signal modification method continues at block 105. Conversely, when this block 104 detects a failure in the operations performed in block 103, the signal modification process terminates and the original speech frame is left intact for encoding (see block 108 (no signal modification) corresponding to the normal mode).
If block 106 detects that the operation performed in block 105 was successful, then a low bit rate mode with signal modification is used (see block 107). Conversely, when this block 106 detects a failure in the operations performed in block 105, the signal modification process terminates and the original speech frame is left intact for encoding (see block 108 (no signal modification) corresponding to the normal mode). The operation of blocks 101-108 is described in detail below in this specification.
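The cascade of blocks 102, 104 and 106 described above reduces to a simple decision chain, sketched below. The function name and the string mode labels are illustrative only.

```python
def select_coding_mode(pitch_search_ok, contour_ok, modification_ok):
    """Mirror the cascade of logic blocks 102, 104 and 106: a failure
    at any stage aborts signal modification and selects the normal
    mode (block 108); success at all stages selects the low bit rate
    mode with signal modification (block 107)."""
    if pitch_search_ok and contour_ok and modification_ok:
        return "low bit rate mode with signal modification"  # block 107
    return "normal mode, no signal modification"             # block 108
```

Because the checks are ordered, a failure in the pitch period search (block 101) means blocks 103 and 105 are never reached, exactly as in fig. 2.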
Fig. 3 is a schematic block diagram of a diagrammatic example of a speech communication system for describing the use of a speech encoder and decoder. The voice communication system of fig. 3 supports the transmission and reproduction of voice signals over the communication channel 205. The communication channel 205 typically comprises at least a portion of a radio frequency link, although it may comprise, for example, a wired, optical, or fiber optic line link. The radio frequency link often supports multiple, simultaneous voice communications requiring shared bandwidth resources, such as may be found in cellular telephones. Although not shown, the communication channel 205 may be replaced with a memory device for recording and storing the encoded voice signal for later playback.
On the transmitter side, a microphone 201 produces an analog speech signal 210, which is provided to an analog-to-digital (A/D) converter 202. The function of the a/D converter 202 is to convert the analog voice signal 210 into a digital voice signal 211. The speech encoder 203 encodes the digital speech signal 211 to produce a set of encoding parameters 212, which are encoded into binary form and provided to the channel encoder 204. The channel encoder 204 adds redundancy to the binary representation of the encoded parameters before sending them to the bitstream 213 over the communication channel 205.
At the receiver side, the channel decoder 206 is provided with a redundant binary representation of the above-mentioned coding parameters from the received bit stream 214 in order to detect and correct channel errors occurring in the transmission. The speech decoder 207 converts the channel error corrected bitstream 215 from the channel decoder 206 back into a set of coding parameters for creating a synthesized digital speech signal 216. The synthesized voice signal 216 reconstructed by the voice decoder 207 is converted into an analog voice signal 217 by a digital-to-analog (D/a) converter 208, and is played back through the speaker unit 209.
Fig. 4 is a schematic block diagram illustrating operations performed by an illustrative embodiment of speech encoder 203 (fig. 3) incorporating signal modification functionality. This description provides a novel implementation of the signal modification functionality of block 603 in fig. 4. Other operations performed by the speech encoder 203 are well known to those of ordinary skill in the art, and have been described in, for example, publication [10],
[10] 3GPP TS 26.190, "AMR Wideband Speech Codec: Transcoding Functions," 3GPP Technical Specification.
This publication is incorporated herein by reference. When not otherwise stated, the implementation of the speech encoding and decoding operations in the illustrated embodiments and examples of the present invention will conform to the AMR wideband speech codec (AMR-WB) standard.
The speech encoder 203 shown in fig. 4 encodes the digitized speech signal using one or more coding modes. When multiple encoding modes are used and the signal modification function is disabled in one of these modes, this particular mode will work according to well established standards known to those of ordinary skill in the art.
Although not shown in fig. 4, the speech signal is sampled at a rate of 16 kHz and digitized. The digital speech signal is then divided into successive frames of a given length, and each of these frames is divided into a given number of successive subframes. The digital speech signal is further preprocessed as described in the AMR-WB standard. This preprocessing comprises high-pass filtering, pre-emphasis filtering using the filter P(z) = 1 − 0.68z^-1, and down-sampling from the 16 kHz sampling rate to 12.8 kHz. The subsequent operations of fig. 4 assume that the input speech signal s(t) has been preprocessed and down-sampled to the 12.8 kHz sampling rate.
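The preprocessing chain above can be sketched in a few lines of numpy. This is an illustrative sketch, not the AMR-WB reference implementation: only the pre-emphasis filter P(z) = 1 − 0.68z^-1 is taken directly from the text, while the decimation filter used for the 16 kHz to 12.8 kHz rate change (factor 4/5) is a generic windowed-sinc low-pass chosen here for simplicity.

```python
import numpy as np

def preemphasis(s, mu=0.68):
    """Pre-emphasis filter P(z) = 1 - mu*z^-1 applied to the signal s."""
    out = np.empty_like(s)
    out[0] = s[0]
    out[1:] = s[1:] - mu * s[:-1]
    return out

def downsample_16k_to_12k8(s):
    """Rational rate change 16 kHz -> 12.8 kHz (factor 4/5): upsample by 4,
    low-pass filter, decimate by 5. The 65-tap windowed-sinc low-pass below
    (cutoff pi/5 in the upsampled domain) is an illustrative stand-in for
    the filter specified by the standard."""
    up = np.zeros(len(s) * 4)
    up[::4] = s
    n = np.arange(-32, 33)
    h = np.sinc(n / 5.0) * np.hamming(len(n)) * (4.0 / 5.0)
    filtered = np.convolve(up, h, mode="same")
    return filtered[::5]

fs_in = 16000
t = np.arange(fs_in // 10) / fs_in           # 100 ms of a 100 Hz sinusoid
s = np.sin(2 * np.pi * 100 * t)
y = downsample_16k_to_12k8(preemphasis(s))
print(len(s), len(y))                        # 1600 input samples -> 1280 output samples
```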
The speech encoder 203 comprises an LP (linear prediction) analysis and quantization module 601 for computing and quantizing, in response to the input preprocessed digital speech signal s(t) 617, the parameters a_0, a_1, a_2, ..., a_{nA} of the LP filter 1/A(z), where n_A is the order of the filter and A(z) = a_0 + a_1 z^-1 + a_2 z^-2 + ... + a_{nA} z^-nA. A binary representation 616 of these quantized LP filter parameters is supplied to the multiplexer 614 and then multiplexed into the bitstream 615. The unquantized and quantized LP filter parameters can be interpolated to obtain the corresponding LP filter parameters for every subframe.
The speech encoder 203 also includes a pitch estimator 602 for computing an open-loop pitch estimate 619 for the current frame in response to the LP filter parameters 618 from the LP analysis and quantization module 601. These open loop pitch estimates 619 are interpolated over the frame for use in the signal modification module 603.
The operations performed in the LP analysis and quantization module 601 and the pitch estimator 602 may be implemented in accordance with the AMR-WB standard described above.
The signal modification module 603 of fig. 4 performs a signal modification operation that adjusts the speech signal to a determined delay profile d(t) before the closed-loop pitch search of the adaptive codebook excitation signal. In the illustrated embodiment, the delay profile d(t) defines a long-term prediction delay for every sample of the frame. The delay profile is fully characterized over the frame t ∈ (t_{n−1}, t_n] by the delay parameter 620 d_n = d(t_n) and its previous value d_{n−1} = d(t_{n−1}), which are equal to the value of the delay profile at the frame boundaries. The delay parameter 620 is determined as part of the signal modification operation, encoded, and then supplied to the multiplexer 614, where it is multiplexed into the bitstream 615.
The delay profile d(t), defining the long-term prediction delay parameter for every sample of the frame, is supplied to the adaptive codebook 607. In response to the delay profile d(t), the adaptive codebook 607 forms the adaptive codebook excitation u_b(t) = u(t − d(t)) for the current frame from the past excitation u(t). The delay profile thus maps the past sample u(t − d(t)) of the excitation signal to the current sample of the adaptive codebook excitation u_b(t).
The signal modification procedure also produces a modified residual signal used for composing the modified target signal 621 of the closed-loop search of the fixed codebook excitation u_c(t). The modified residual signal is obtained in the signal modification module 603 by warping the pitch period segments of the LP residual signal, and is supplied to module 604 for computing the modified target signal. In module 604, the modified residual signal is filtered through the LP synthesis filter 1/A(z) to obtain the modified speech signal. The modified target signal 621 of the fixed codebook excitation search is then formed in module 604 in accordance with the operation of the AMR-WB standard, but with the original speech signal replaced by its modified version.
After the adaptive codebook excitation u_b(t) and the modified target signal 621 have been obtained for the current frame, the encoding can further proceed using conventional means.
The function of the closed-loop fixed codebook excitation search is to determine the fixed codebook excitation signal u_c(t) for the current subframe. To illustrate the operation of this closed-loop codebook search schematically, the fixed codebook excitation signal u_c(t) is gain scaled through an amplifier 610. In the same manner, the adaptive codebook excitation u_b(t) is gain scaled through an amplifier 609. The gain-scaled adaptive and fixed codebook excitations u_b(t) and u_c(t) are summed by an adder 611 to form the total excitation signal u(t). This total excitation signal u(t) is processed through the LP synthesis filter 1/A(z) 612 to produce a synthesized speech signal 625, which is subtracted from the modified target signal 621 by an adder 605 to produce an error signal 626. An error weighting and minimization module 606 is responsive to the error signal 626 for calculating, in a conventional manner, the gain parameters of the amplifiers 609 and 610 for every subframe. The error weighting and minimization module 606 also computes, in accordance with conventional methods and in response to the error signal 626, the input 627 of the fixed codebook 608. The quantized gain parameters 622 and 623 and the parameters 624 characterizing the fixed codebook excitation signal u_c(t) are supplied to the multiplexer 614 and multiplexed into the bitstream 615. The operations described above are performed in the same manner whether signal modification is enabled or disabled.
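The excitation summation and LP synthesis filtering described above can be illustrated with a small sketch. The gain values, filter coefficients, and four-sample subframe below are arbitrary toy values; only the structure u(t) = g_b·u_b(t) + g_c·u_c(t) followed by filtering through 1/A(z) follows the text.

```python
import numpy as np

def total_excitation(u_b, u_c, g_b, g_c):
    """u(t) = g_b*u_b(t) + g_c*u_c(t), as formed by adder 611."""
    return g_b * np.asarray(u_b) + g_c * np.asarray(u_c)

def lp_synthesis(u, a, mem):
    """All-pole filtering through 1/A(z) with A(z) = 1 + sum_k a[k] z^-(k+1).
    `mem` holds the last len(a) output samples, oldest first."""
    na = len(a)
    buf = np.concatenate([np.asarray(mem, dtype=float), np.zeros(len(u))])
    for t in range(len(u)):
        past = buf[t:na + t][::-1]          # y(t-1), ..., y(t-na)
        buf[na + t] = u[t] - np.dot(a, past)
    return buf[na:]

# toy subframe: one adaptive and one fixed pulse through a 2nd-order filter
u = total_excitation([1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], g_b=0.5, g_c=0.25)
syn = lp_synthesis(u, a=np.array([-0.9, 0.2]), mem=[0.0, 0.0])
# syn is the response of the cascade: 0.5, 0.7, 0.53, 0.337
```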
It should be noted that, when the signal modification functionality is disabled, the adaptive codebook 607 operates in accordance with conventional methods. In that case, a separate delay parameter is searched for every subframe in the adaptive codebook 607 to refine the open-loop pitch estimate 619. These delay parameters are encoded, supplied to the multiplexer 614, and multiplexed into the bitstream 615. Likewise, the target signal 621 of the fixed codebook search is formed in accordance with conventional methods.
The speech decoder shown in fig. 13 operates in accordance with conventional methods, except when signal modification is enabled. The operation with signal modification disabled and enabled differs essentially only in the manner in which the adaptive codebook excitation signal u_b(t) is formed. In both operating modes, the decoder decodes the received parameters from their binary representations. Typically, the received parameters comprise excitation, gain, delay, and LP parameters. The decoded excitation parameters are used in module 701 to form the fixed codebook excitation signal u_c(t) for every subframe. This signal is supplied through an amplifier 702 to an adder 703. Similarly, the adaptive codebook excitation signal u_b(t) of the current subframe is supplied through an amplifier 704 to the adder 703. In the adder 703, the gain-scaled adaptive and fixed codebook excitation signals u_b(t) and u_c(t) are summed to form the total excitation signal u(t) of the current subframe. This excitation signal u(t) is processed through the LP synthesis filter 1/A(z) 708, which uses the LP parameters of the current subframe interpolated in module 707, to produce the synthesized speech signal.
When signal modification is enabled, the speech decoder recovers the delay profile d(t) using the received delay parameter d_n and its previously received value d_{n−1}, as is done in the encoder. This delay profile d(t) defines the long-term prediction delay parameter for every time instant of the current frame. Using the delay profile d(t), the adaptive codebook excitation u_b(t) = u(t − d(t)) is formed from the past excitation for the current subframe, as in the encoder.
The remaining description discloses the detailed operation of the signal modification procedure 603 and its use as part of the mode determination mechanism.
Search of pitch pulses and pitch period segments
The signal modification method operates pitch and frame synchronously, that is, it shifts each detected pitch period segment individually while constraining the shift at the frame boundaries. This requires means for locating the pitch pulses and the corresponding pitch period segments of the current frame. In the illustrated embodiment of the signal modification method, the pitch period segments are determined based on detected pitch pulses that are searched as shown in FIG. 5.
The pitch pulse search can operate on the residual signal r(t), the weighted speech signal w(t), and/or the weighted synthesized speech signal. The residual signal r(t) is obtained by filtering the speech signal s(t) through the LP filter A(z), which has been interpolated for the subframes. In the illustrated embodiment, the order of the LP filter A(z) is 16. The weighted speech signal w(t) is obtained by processing the speech signal s(t) through the weighting filter

W(z) = A(z/γ1) / (1 − γ2 z^-1),   (1)

where the coefficients γ1 = 0.92 and γ2 = 0.68. The weighted speech signal w(t) is often used in open-loop pitch estimation (block 602), because the weighting filter defined by equation (1) attenuates the formant structure in the speech signal s(t) while preserving the periodicity also on sinusoidal signal segments. This facilitates the pitch pulse search, since possible signal periodicity becomes clearly apparent in the weighted signal. It should be noted that the weighted speech signal w(t) is also needed over the look-ahead portion in order to search the last pitch pulse in the current frame. This is done by forming the weighted speech signal of the look-ahead portion using the weighting filter of equation (1) of the last subframe of the current frame.
The pitch pulse search procedure of FIG. 5 begins in block 301 by locating the last pitch pulse of the previous frame from the residual signal r(t). A pitch pulse typically stands out clearly as the maximum absolute value of the low-pass filtered residual signal within a pitch period of approximately p(t_{n−1}) samples. For the low-pass filtering used in locating the last pitch pulse of the previous frame, a normalized Hamming window H_5(z) = (0.08z^-2 + 0.54z^-1 + 1 + 0.54z + 0.08z^2)/2.24 with a length of five samples is used. The position of this last pitch pulse is denoted by T_0. The illustrated embodiment of the signal modification method does not require an exact position for this pitch pulse, but rather a rough estimate of the location of the high-energy segment in the pitch period.
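A minimal numpy sketch of this pulse-location step: the residual is low-pass filtered with the 5-tap normalized Hamming window quoted above, and the largest absolute value within the last pitch period is taken as the rough pulse position. The synthetic test signal and the exact search-window convention are illustrative assumptions, not taken from the text.

```python
import numpy as np

def locate_last_pulse(r_prev, pitch):
    """Rough position of the last pitch pulse in the previous-frame residual:
    argmax of |low-pass filtered residual| over the last `pitch` samples.
    The symmetric 5-tap filter is H5(z) = (0.08 z^-2 + 0.54 z^-1 + 1
    + 0.54 z + 0.08 z^2) / 2.24."""
    h5 = np.array([0.08, 0.54, 1.0, 0.54, 0.08]) / 2.24
    lp = np.convolve(r_prev, h5, mode="same")    # symmetric, zero-phase
    start = len(r_prev) - int(round(pitch))
    return start + int(np.argmax(np.abs(lp[start:])))

# toy residual: strong impulses every 40 samples on top of weak noise
rng = np.random.default_rng(0)
r = 0.05 * rng.standard_normal(256)
r[np.arange(24, 256, 40)] += 1.0                 # pulses at 24, 64, ..., 224
T0 = locate_last_pulse(r, pitch=40)
print(T0)  # -> 224, the last impulse position
```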
After the last pitch pulse of the previous frame has been located at T_0, a pitch pulse prototype of length 2l + 1 samples is extracted in block 302 of FIG. 5 around this rough position estimate as

m(k) = ŵ(T_0 + k − l),  for k = 0, 1, ..., 2l.   (2)

This pitch pulse prototype is then used for locating the pitch pulses in the current frame.
The weighted synthesized speech signal ŵ(t) (or the weighted speech signal w(t)) may be used for the pulse prototype instead of the residual signal r(t). This facilitates the pitch pulse search, because the periodic structure of the signal is well preserved in the weighted speech domain. The weighted synthesized speech signal ŵ(t) is obtained by filtering the synthesized speech signal of the last subframe of the previous frame through the weighting filter of equation (1). If the pitch pulse prototype extends beyond the end of the previously synthesized frame, the weighted speech signal w(t) of the current frame is used for this excess part. If the previously synthesized speech frame already contains well-developed pitch pulses, the pitch pulse prototype has a high correlation with the pitch pulses of the weighted speech signal w(t). The use of synthesized speech in extracting the prototype thus provides additional information for monitoring the performance of the encoding and for selecting an appropriate coding mode in the current frame, as will be explained in more detail later in this description.
The choice of l = 10 samples provides a good compromise between the complexity and the performance of the pitch pulse search. The value of l may also be determined proportionally to the open-loop pitch estimate.
Given the position T_0 of the last pulse in the previous frame, it can be predicted that the first pitch pulse of the current frame occurs approximately at time T_0 + p(T_0). Here, p(t) denotes the interpolated open-loop pitch estimate at time (position) t. This prediction is performed in block 303.
In block 305, the predicted pitch pulse position T_0 + p(T_0) is refined as

T_1 = T_0 + p(T_0) + arg max_j c(j),   (3)

where c(j) is the weighted correlation between the pulse prototype m(k), k = 0, 1, ..., 2l, extracted in block 302, and the weighted speech signal w(t) in the neighborhood of the predicted position:

c(j) = γ(j) Σ_{k=0}^{2l} w(T_0 + p(T_0) + j − l + k) m(k).   (4)
thus, the improvement is limited to [ -j [ ]max,jmax]Which maximizes the weighted correlation c (j) between the pulse prototype and one of the residual signal, the weighted speech signal or the weighted synthesized speech signal mentioned above. According to one illustrative example, limit jmaxAnd as min {20,<p(0)/4>is proportional to the open-loop pitch estimate of, where the operator is<·>Indicating a rounding off of less than the nearest integer. Weighting function in equation (4)
γ(j)=1-|j|/p(T0+p(T0))(5)
The pulse position predicted using open-loop pitch estimation is preferred because γ (j) gets its maximum value of 1 at j ═ 0. Denominator p (T) in equation (5)0+p(T0) Is an open loop pitch estimate of the predicted pitch pulse position.
After the first pitch pulse position T_1 has been found using equation (3), the next pitch pulse can be predicted to be at time T_2 = T_1 + p(T_1) and then refined as above. This pitch pulse search, comprising the prediction 303 and the refinement 305, is repeated until either the prediction or the refinement procedure yields a pitch pulse position outside the current frame. These conditions are checked in logic block 304 for the prediction of the next pitch pulse position (block 303), and in logic block 306 for the refinement of this position (block 305). It should be noted that logic block 304 terminates the search only if a predicted pulse position is so far into the subsequent frame that the refinement step cannot bring it back into the current frame. This procedure yields c pitch pulse positions inside the current frame, denoted by T_1, T_2, ..., T_c.
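The predict-and-refine loop of blocks 303 through 306 can be sketched as follows. The sketch assumes a constant open-loop pitch p(t), applies the γ(j) weighting of equation (5) to the correlation, and terminates as soon as a predicted position leaves the frame, which simplifies the termination logic of blocks 304 and 306; the synthetic signal, prototype shape, and jmax value are toy assumptions.

```python
import numpy as np

def refine(w, proto, T_pred, pitch, jmax):
    """Refine a predicted pulse position by maximizing the weighted
    correlation; gamma(j) = 1 - |j|/pitch favours small corrections."""
    l = (len(proto) - 1) // 2
    best_j, best_c = 0, -np.inf
    for j in range(-jmax, jmax + 1):
        lo, hi = T_pred + j - l, T_pred + j + l + 1
        if lo < 0 or hi > len(w):
            continue                       # window would leave the frame
        c = (1.0 - abs(j) / pitch) * np.dot(w[lo:hi], proto)
        if c > best_c:
            best_j, best_c = j, c
    return T_pred + best_j

def find_pulses(w, T0, pitch, proto, jmax=5):
    """Predict each next pulse at T + p(T), then refine it; constant pitch
    assumed, and the loop stops once a prediction leaves the frame."""
    pulses, T = [], T0
    while True:
        T_pred = T + pitch
        if T_pred >= len(w):
            return pulses
        T = refine(w, proto, T_pred, pitch, jmax)
        pulses.append(T)

# synthetic weighted speech: pulses jittered around a 50-sample pitch
positions = [5, 57, 106, 158, 207]
shape = np.array([0.3, 0.8, 1.0, 0.8, 0.3])   # pulse shape, also the prototype
w = 0.01 * np.random.default_rng(1).standard_normal(256)
for p in positions:
    w[p - 2:p + 3] += shape
found = find_pulses(w, T0=positions[0], pitch=50, proto=shape)
print(found)  # -> [57, 106, 158, 207]
```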
According to one illustrative example, the pitch pulses are located with integer resolution, except for the last pitch pulse of the frame, denoted by T_c. Because the exact distance between the last pulses of two successive frames is needed for determining the delay parameter to be transmitted, the last pulse is located using a fractional resolution of 1/4 sample for j in equation (4). The fractional resolution is obtained by upsampling w(t) in the neighborhood of the predicted last pitch pulse before evaluating the correlation of equation (4). According to one illustrative example, Hamming-windowed sinc interpolation of length 33 is used for the upsampling. The fractional resolution of the last pitch pulse position helps to maintain good long-term prediction performance despite the time synchrony constraint set at the frame end. This comes at the cost of the additional bit rate needed for transmitting the delay parameter with higher accuracy.
After the pitch pulses of the current frame have been located, the pitch period segments are extracted and the optimal displacement of each segment is determined. This operation is performed using the weighted speech signal w(t), as described later in this description, whereas the actual displacement of the individual pitch period segments is carried out on the LP residual signal r(t) to reduce the resulting distortion. Since displacing a segment distorts the signal, particularly near the segment boundaries, the boundaries have to be placed in low-power portions of the residual signal r(t). In one illustrative example, the segment boundaries are placed approximately in the middle of two successive pitch pulses, but always confined inside the current frame. The segment boundaries are selected inside the current frame such that each segment contains exactly one pitch pulse. Segments containing more than one pitch pulse, or "empty" segments without any pitch pulse, hamper the subsequent correlation-based matching with the target signal and should be prevented in the pitch period segmentation. The s-th extracted segment of l_s samples is denoted w_s(k), k = 0, 1, ..., l_s − 1. The starting instant of this segment is t_s, selected such that w_s(0) = w(t_s). The number of segments in the current frame is denoted by c.
The segment boundary between two successive pitch pulses T_s and T_{s+1} inside the current frame is selected using the following procedure. First, the central instant between the two pulses is computed as Λ = ⌊(T_s + T_{s+1})/2⌋. The candidate positions for the segment boundary lie in the region [Λ − ε_max, Λ + ε_max], where ε_max corresponds to five samples. The energy of each candidate boundary position is computed as
Q(ε′) = r²(Λ + ε′ − 1) + r²(Λ + ε′),  ε′ ∈ [−ε_max, ε_max].   (6)
The position giving the smallest energy is selected, because this choice typically results in the smallest distortion in the modified speech signal. The value of ε′ minimizing equation (6) is denoted ε. The starting instant of the new segment is then chosen as t_s = Λ + ε. This also defines the length of the previous segment, since the previous segment ends at instant Λ + ε − 1.
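The boundary-selection rule of equation (6) is small enough to state directly in code. The residual below is artificial; ε_max = 5 samples follows the text.

```python
import numpy as np

def segment_boundary(r, Ts, Ts1, eps_max=5):
    """Place a segment boundary near the midpoint of two consecutive pitch
    pulses Ts and Ts1, at the candidate of least residual energy
    Q(e) = r^2(L + e - 1) + r^2(L + e), as in equation (6)."""
    L = (Ts + Ts1) // 2
    cands = list(range(-eps_max, eps_max + 1))
    Q = [r[L + e - 1] ** 2 + r[L + e] ** 2 for e in cands]
    eps = cands[int(np.argmin(Q))]
    return L + eps                 # starting instant t_s of the new segment

r = np.ones(100)
r[46:49] = 0.0                     # a low-energy valley between two pulses
print(segment_boundary(r, Ts=24, Ts1=66))  # -> 47, inside the valley
```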
Fig. 6 shows an illustrative example of pitch period segmentation. Note in particular that the first and last segments, w_1(k) and w_4(k), are extracted such that no empty segments occur and the frame boundaries are not exceeded.
Determination of delay parameters
In general, the main advantage of signal modification is that only one delay parameter per frame needs to be encoded and transmitted to the decoder (not shown). However, special attention needs to be paid to the determination of this single parameter. The delay parameter not only defines the evolution of the pitch period length over the frame together with its previous value, but also affects the temporal asynchrony in the generated modification signal.
In the methods described in documents [1] and [4] to [7],
[1] W. B. Kleijn, P. Kroon, and D. Nahumi, "The RCELP speech-coding algorithm," European Transactions on Telecommunications, Vol. 4, No. 5, pp. 573-582, 1994.

[4] US Patent 5,704,003, "RCELP coder," Lucent Technologies Inc. (W. B. Kleijn and D. Nahumi), Filing Date: 19 September 1995.

[5] European Patent Application 0602826 A2, "Time shifting for analysis-by-synthesis coding," AT&T Corp. (W. B. Kleijn), Filing Date: 1 December 1993.

[6] Patent Application WO 00/11653, "Speech encoder with continuous warping combined with long term prediction," Conexant Systems Inc. (Y. Gao), Filing Date: 24 August 1999.

[7] Patent Application WO 00/11654, "Speech encoder adaptively applying pitch preprocessing with continuous warping," Conexant Systems Inc. (H. Su and Y. Gao), Filing Date: 24 August 1999.
No time synchronization is required at the frame boundaries in these methods, and the open-loop pitch estimate can therefore be used directly for determining the delay parameter to be transmitted. This choice typically results in time asynchrony at the frame boundaries, which translates into an accumulating time shift in subsequent frames, since the continuity of the signal must be preserved. Although human hearing is insensitive to small changes in the time scale of the synthesized speech signal, an increasing time asynchrony complicates the implementation of the encoder. In practice, a long signal buffer is needed to accommodate a signal whose time scale may have been expanded, and control logic has to be implemented for limiting the accumulated time shift during encoding. Moreover, the time asynchrony of several samples typical of RCELP coding may cause a mismatch between the LP parameters and the modified residual signal. This mismatch may produce perceptual artifacts in the modified speech signal synthesized by LP filtering the modified residual signal.
In contrast, the illustrated embodiment of the signal modification method according to the present invention maintains time synchronization at frame boundaries. Thus, a tightly constrained displacement occurs at the end of the frame, and each new frame starts at a good time to match the original speech frame.
To ensure time synchrony at the frame end, the delay profile d(t) maps, through long-term prediction, the last pitch pulse at the end of the previous synthesized speech frame to the last pitch pulse of the current frame. The delay profile defines an interpolated long-term prediction delay parameter for every sample of the current nth frame, from time t_{n−1} + 1 up to t_n. Only the delay parameter d_n = d(t_n) at the frame end is transmitted to the decoder, implying that d(t) must have a form fully specified by the transmitted values. The long-term prediction delay parameter has to be selected such that the resulting delay profile fulfills the pulse mapping. In mathematical form, this mapping can be expressed as follows: let κ_c be a temporary time variable, and T_0 and T_c the last pitch pulse positions in the previous and current frames, respectively. The delay parameter d_n now has to be selected such that, after executing the pseudo-code presented in Table 1, the variable κ_c is very close to T_0, minimizing the error |κ_c − T_0|. The pseudo-code starts from the value κ_0 := T_c and iterates backwards c times by updating κ_i := κ_{i−1} − d(κ_{i−1}). If the final value κ_c equals T_0, long-term prediction can be utilized with maximum efficiency without time asynchrony at the frame end.
TABLE 1 Loop for searching for optimal delay parameters
% initialization
κ_0 := T_c;
% loop
for i = 1 to c
    κ_i := κ_{i−1} − d(κ_{i−1});
end;
FIG. 7 illustrates the operation of the delay selection loop for the case c = 3. The loop starts from the value κ_0 = T_c and iterates backwards first to κ_1 = κ_0 − d(κ_0). The loop then continues twice more, producing κ_2 = κ_1 − d(κ_1) and κ_3 = κ_2 − d(κ_2). The final value κ_3 is then compared against T_0 in terms of the error e_n = |κ_3 − T_0|. The resulting error is a function of the delay profile, which is adjusted in the delay selection algorithm as described below.
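The loop of Table 1 translates directly into code; the sketch below evaluates the mapping error e_n = |κ_c − T_0| for a given delay profile. The constant-delay profile in the example is a toy case in which the mapping is exact.

```python
def pulse_mapping_error(d_profile, Tc, T0, c):
    """Loop of Table 1: start from the last pulse of the current frame and
    iterate backwards c times through the delay profile; the error
    |kappa_c - T0| measures how well the profile maps back to the last
    previous-frame pulse T0."""
    kappa = float(Tc)
    for _ in range(c):
        kappa = kappa - d_profile(kappa)
    return abs(kappa - T0)

# toy case: a constant delay of 52 samples over the frame maps exactly
d = lambda t: 52.0
print(pulse_mapping_error(d, Tc=226, T0=70, c=3))  # -> 0.0 (226 - 3*52 = 70)
```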
Signal modification methods, such as those described in the following documents [1], [4], [6] and [7],
linearly interpolate the delay parameter over the frame between d_{n−1} and d_n. However, linear interpolation tends to produce an oscillating delay profile when time synchrony is required at the frame end. The pitch periods in the modified speech signal then contract and expand in turn, easily creating annoying artifacts. The evolution and amplitude of these oscillations depend on the position of the last pitch pulse: the farther the last pitch pulse is from the frame end in relation to the pitch period, the more the oscillations tend to be amplified. Since time synchrony at the frame end is an essential requirement of the illustrated embodiment of the signal modification method, the linear interpolation familiar from prior methods cannot be used without degrading the speech quality. Instead, the illustrated embodiment of the signal modification method uses the piecewise linear delay profile

d(t) = (1 − α(t)) d_{n−1} + α(t) d_n,  t_{n−1} < t ≤ t_{n−1} + σ_n,
d(t) = d_n,  t_{n−1} + σ_n < t ≤ t_n,   (7)

where

α(t) = (t − t_{n−1})/σ_n.   (8)

This delay profile greatly reduces the oscillations. Here, t_{n−1} and t_n are the end instants of the previous and current frames, respectively, and d_{n−1} and d_n are the corresponding delay parameter values. Note that t_{n−1} + σ_n is the instant after which the delay profile remains constant.
In one illustrative example, the parameter σ_n is varied as a function of d_{n−1}, with the frame length N = 256 samples. To avoid oscillations, it is beneficial to reduce the value of σ_n as the length of the pitch period increases. On the other hand, to avoid rapid changes in the delay profile d(t) at the frame start t_{n−1} < t < t_{n−1} + σ_n, the parameter σ_n must always be at least half the frame length. Rapid changes in d(t) tend to degrade the quality of the modified speech signal.
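The piecewise linear delay profile of equations (7) and (8) can be sketched as follows; the parameter values used in the example are those quoted for FIG. 8 (d_{n−1} = 50, d_n = 53, σ_n = 172, N = 256).

```python
def delay_profile(t, t_prev_end, d_prev, d_n, sigma_n):
    """Piecewise linear delay profile: interpolate linearly from d_{n-1} to
    d_n over the first sigma_n samples of the frame (alpha(t) =
    (t - t_{n-1}) / sigma_n), then hold d_n constant until the frame end."""
    if t <= t_prev_end + sigma_n:
        alpha = (t - t_prev_end) / sigma_n
        return (1.0 - alpha) * d_prev + alpha * d_n
    return d_n

# FIG. 8 parameters: d_{n-1} = 50, d_n = 53, sigma_n = 172, N = 256
vals = [delay_profile(t, 0, 50.0, 53.0, 172) for t in (0, 86, 172, 200, 256)]
print(vals)  # -> [50.0, 51.5, 53.0, 53.0, 53.0]
```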
Note that, depending on the coding mode of the previous frame, d_{n−1} is either the delay value at the end of the previous frame (signal modification enabled) or the delay value of its last subframe (signal modification disabled). Since the past value d_{n−1} of the delay parameter is known at the decoder, the delay profile is unambiguously defined by d_n, and the decoder is able to form the delay profile using equation (7).
The only parameter that can be varied when searching the optimal delay profile is the frame-end delay d_n, which is limited to the range [34, 231]. In general, there is no simple explicit method for solving the optimal d_n; instead, several values must be tested to find the best solution. The search is, however, straightforward. The value of d_n can first be predicted as

d_n^(0) = (T_c − T_0)/c.   (10)
In the illustrated embodiment, the search is performed in three stages, in which the resolution is increased and the range to be examined is narrowed at every stage. The delay parameters giving the smallest error e_n = |κ_c − T_0| in the procedure of Table 1 in these three stages are denoted d_n^(1), d_n^(2), and d_n^(3), respectively. In the first stage, the neighborhood of the value d_n^(0) predicted using equation (10) is searched with a resolution of four samples, either in the range [d_n^(0) − 11, d_n^(0) + 12] or otherwise in the range [d_n^(0) − 15, d_n^(0) + 16]. The second stage restricts the range to [d_n^(1) − 3, d_n^(1) + 3] and uses integer resolution. Finally, the third stage examines the range [d_n^(2) − 3/4, d_n^(2) + 3/4] with a resolution of 1/4 sample for d_n^(2) smaller than 92 1/2 samples; otherwise, the range [d_n^(2) − 1/2, d_n^(2) + 1/2] is examined with a resolution of 1/2 sample. This third stage yields the optimal delay parameter d_n to be transmitted to the decoder. The procedure is a compromise between search accuracy and complexity. Naturally, those of ordinary skill in the art can readily implement the search for the delay parameter under the time synchrony constraint using alternative means without departing from the spirit and nature of the present invention.
The delay parameter d_n ∈ [34, 231] can thus be encoded using nine bits per frame, with a resolution of 1/4 sample for d_n < 92 1/2 and 1/2 sample for d_n > 92 1/2.
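One possible indexing scheme consistent with the stated resolutions is sketched below: quarter-sample steps from 34 up to 92 1/2 and half-sample steps from there to 231 give 234 + 277 = 511 codes, fitting the 9-bit budget. The exact split-point convention is an assumption; the text only states the range and the two resolutions.

```python
def encode_delay(d_n):
    """Index d_n in [34, 231]: 1/4-sample steps below 92.5, 1/2-sample
    steps above (511 codes < 2^9, so the index fits in 9 bits)."""
    if d_n <= 92.5:
        return round((d_n - 34.0) * 4)
    return round((92.5 - 34.0) * 4) + round((d_n - 92.5) * 2)

def decode_delay(idx):
    split = round((92.5 - 34.0) * 4)      # 234 quarter-sample steps
    if idx <= split:
        return 34.0 + idx / 4.0
    return 92.5 + (idx - split) / 2.0

assert decode_delay(encode_delay(57.25)) == 57.25
assert decode_delay(encode_delay(120.5)) == 120.5
print(encode_delay(231))  # -> 511, the largest 9-bit index
```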
FIG. 8 illustrates the delay interpolation for d_{n−1} = 50, d_n = 53, σ_n = 172, and frame length N = 256. The interpolation method used in the illustrated embodiment of the signal modification method is shown with a bold line, while the linear interpolation corresponding to prior methods is shown with a thin line. Both interpolated profiles behave essentially alike in the delay selection loop of Table 1, but the disclosed piecewise linear interpolation results in a smaller absolute difference |d_{n−1} − d_n|. This feature reduces potential oscillations in the delay profile d(t), and thus annoying artifacts in the modified speech signal whose pitch follows this delay profile.
To further clarify the behavior of the piecewise linear interpolation, FIG. 9 shows, with a bold line, an example of the resulting delay profile d(t) over ten frames. The corresponding delay profile obtained using conventional linear interpolation is indicated with a thin line. The example was constructed using an artificial speech signal having a constant delay parameter of 52 samples as input to the signal modification procedure. A delay parameter d_0 = 54 samples was deliberately used as the initial value for the first frame, to simulate the effect of a pitch estimation error typical in speech coding. The procedure of Table 1 was then used to search the delay parameter d_n for both the linear interpolation and the piecewise linear interpolation method disclosed herein, with all the required parameters selected in accordance with the illustrative embodiment of the signal modification method. The resulting delay profiles d(t) show that the piecewise linear interpolation converges rapidly, whereas the conventional linear interpolation fails to reach the correct value within the ten-frame period. Such prolonged oscillations in the delay profile d(t) often cause annoying artifacts in the modified speech signal, degrading the overall perceptual quality.
Modification of signals
After the delay parameter d_n has been determined and the pitch period segmentation has been done, the actual signal modification procedure can be started. In the illustrated embodiment of the signal modification method, the speech signal is modified by shifting the individual pitch period segments one by one, adjusting them to the delay profile d(t). The displacement of a segment is determined by correlating the segment, in the weighted speech domain, with a target signal. The target signal is composed using the synthesized weighted speech signal of the previous frame and the preceding, already shifted segments of the current frame. The actual displacement is performed on the residual signal r(t).
Signal modification needs to be done carefully to maximize the performance of long-term prediction and at the same time maintain the perceptual quality of the modified speech signal. The time synchronization required at the frame boundary must also be taken into account during the modification.
A block diagram of an illustrative embodiment of the signal modification method is shown in FIG. 10. The modification starts by extracting a new segment w_s(k) of l_s samples from the weighted speech signal w(t) in block 401. This segment is defined by its length l_s and starting instant t_s, such that w_s(k) = w(t_s + k), k = 0, 1, ..., l_s − 1. The segmentation is done as described earlier.
If no new segments can be selected or extracted (block 402), the signal modification operation is complete (block 403). Otherwise, the signal modification operation continues with block 404.
To find the best displacement for the current segment ws(k), a target signal w̃(t) is established at block 405. For the first segment w1(k) in the current frame, this target signal is obtained by the following recursion:
where, for t ≤ tn−1, the target signal w̃(t) equals the weighted synthesized speech signal available in the previous frame. The parameter δ1 is the maximum displacement allowed for the first segment of length l1. Equation (11) can be interpreted as a simulation of long-term prediction, using the delay profile, over the signal portion in which the shifted current segment may be located. The computation of the target signal for the subsequent segments follows the same principle and is presented later in this section.
A search procedure for finding the best displacement of the current segment can be initiated after the target signal has been formed. This procedure is based on the correlation cs(δ′), calculated at block 404, between the segment ws(k) starting at time ts and the target signal w̃(t):
where δs determines the maximum displacement allowed for the current segment ws(k), and ⌈·⌉ denotes rounding towards positive infinity. A normalized correlation could be used instead of equation (12), although at an increased complexity. In the illustrated embodiment, the following values are used for δs:
As described later in this section, the values of δs are further restricted for the first and last segments in the frame.
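At integer resolution, the search over equation (12) amounts to evaluating one dot product per candidate shift. The sketch below assumes a particular sign convention for the shift and uses illustrative names; the actual bounds δs follow the rules given above.

```python
def best_integer_shift(seg, target, t_s, delta_max):
    """Maximize c(d) = sum_k seg[k] * target[t_s + k - d] over integer shifts
    d in [-delta_max, delta_max] (the sign convention is assumed here)."""
    best_d, best_c = 0, float("-inf")
    for d in range(-delta_max, delta_max + 1):
        c = sum(seg[k] * target[t_s + k - d] for k in range(len(seg)))
        if c > best_c:
            best_d, best_c = d, c
    return best_d, best_c
```

For example, if the target contains an exact copy of the segment displaced by two samples, the search recovers that displacement.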
The correlation (12) is evaluated at integer resolution, but a higher accuracy improves the performance of long-term prediction. To keep the complexity low, it is not reasonable to directly upsample the signals ws(k) or w̃(t) in equation (12). Instead, fractional resolution is achieved in a computationally efficient manner by upsampling the correlation cs(δ′) and determining the optimal displacement from it.
The displacement δ maximizing cs(δ′) is first searched at integer resolution. At fractional resolution, the maximum must then lie in the open interval (δ − 1, δ + 1), bounded to [−δs, δs]. In block 406, the correlation cs(δ′) is upsampled in this interval to a resolution of 1/8 sample using sinc interpolation with a Hamming window of length 65 samples. The displacement δ corresponding to the maximum of the upsampled correlation is the best displacement at fractional resolution. After finding this optimal displacement, the weighted speech segment ws(k) is recomputed at block 407 with the resolved fractional precision; that is, the precise new start time of the segment is updated to ts := ts − δ + δl, where δl = ⌈δ⌉. At the same point (block 407), the residual segment rs(k) corresponding to ws(k) is recomputed from the residual signal r(t), again using sinc interpolation as described above. Since the fractional part of the best displacement is incorporated into the residual and weighted speech segments, the rounded-up displacement δl can be used in all subsequent computations.
FIG. 11 illustrates the recomputation of the segment ws(k) according to block 407 of fig. 10. In this example, the best displacement found at a resolution of 1/8 sample by maximizing the correlation is δ = 13/8. Thus the rounded-up integer displacement δl becomes 2 and the fractional correction becomes 3/8. As a result, the start time of the segment is updated to ts := ts + 3/8. In FIG. 11, the recomputed samples of ws(k) are indicated by the gray dots.
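The refinement to 1/8-sample resolution can be sketched by upsampling the correlation itself with a windowed-sinc interpolator, as the text describes, rather than the signals. The tap count and window are modelled on the 65-sample Hamming-windowed sinc mentioned above; the helper names and exact windowing details are assumptions for illustration.

```python
import math

def windowed_sinc_value(samples, x, half_len=32):
    """Evaluate `samples` at fractional position x with a Hamming-windowed sinc
    interpolator of 2*half_len + 1 = 65 taps (truncated at the array edges)."""
    total = 0.0
    for n in range(max(0, int(x) - half_len), min(len(samples), int(x) + half_len + 1)):
        u = x - n
        s = 1.0 if u == 0 else math.sin(math.pi * u) / (math.pi * u)
        w = 0.54 + 0.46 * math.cos(math.pi * u / half_len) if abs(u) <= half_len else 0.0
        total += samples[n] * s * w
    return total

def fractional_peak(corr, d_int, res=8):
    """Refine the integer-resolution peak d_int of the correlation `corr` to
    1/res-sample resolution inside the open interval (d_int - 1, d_int + 1)."""
    best_x, best_v = float(d_int), corr[d_int]
    for i in range(-res + 1, res):
        if i == 0:
            continue
        x = d_int + i / res
        v = windowed_sinc_value(corr, x)
        if v > best_v:
            best_x, best_v = x, v
    return best_x
```

Upsampling the short correlation sequence in this way is far cheaper than upsampling the speech or residual signals themselves, which is the motivation stated in the text.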
If the logic block 106 described later allows the signal modification to continue, the last task is to update the modified residual signal r̃(t) by copying the current residual segment rs(k) into it (block 411):
r̃(ts + k) = rs(k),   k = 0, 1, ..., ls − 1      (14)
Since the displacements of successive segments are related to each other, adjacently positioned segments either overlap or leave a gap between them. A weighted average is used directly for overlapping segments. Gaps are filled by copying neighbouring samples from the adjacent segments. Since the number of overlapping or missing samples is typically small, and segment boundaries occur in low-energy regions of the residual signal, perceptual artifacts are usually not caused. It should be noted that, instead of the continuous signal warping described in [2], [6] and [7],
[2] W.B. Kleijn, R.P. Ramachandran, and P. Kroon, "Interpolation of the pitch-predictor parameters in analysis-by-synthesis speech coders," IEEE Transactions on Speech and Audio Processing, Vol. 2, No. 1, pp. 42-54, 1994
[6] Patent Application WO 00/11653, "Speech encoder with continuous warping combined with long term prediction," Conexant Systems Inc. (Y. Gao), filing date: 24 August 1999
[7] Patent Application WO 00/11654, "Speech encoder adaptively applying pitch preprocessing with continuous warping," Conexant Systems Inc. (H. Su and Y. Gao), filing date: 24 August 1999
the signal is here modified discontinuously, by shifting individual pitch period segments, in order to reduce complexity.
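The overlap-averaging and gap-filling step can be sketched as follows. The plain averaging of overlaps and the nearest-neighbour copy for gaps are simple stand-ins consistent with the description above; the function name and data layout are illustrative.

```python
def place_segments(length, segments):
    """Write shifted segments (start, samples) into a modified residual of
    `length` samples: overlapping samples are averaged, and gaps are filled
    by repeating the nearest preceding written sample."""
    out = [0.0] * length
    filled = [False] * length
    for start, seg in segments:
        for k, v in enumerate(seg):
            i = start + k
            if 0 <= i < length:
                out[i] = 0.5 * (out[i] + v) if filled[i] else v
                filled[i] = True
    last = 0.0
    for i in range(length):
        if filled[i]:
            last = out[i]
        else:
            out[i] = last  # fill gap from the neighbouring segment
    return out
```

Because segment boundaries fall in low-energy regions of the residual, such crude overlap and gap handling is perceptually harmless, as the text notes.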
The processing of the subsequent pitch period segments follows the procedure disclosed above, except that the target signal w̃(t) in block 405 is formed differently than for the first segment. The samples of w̃(t) are first replaced with the modified weighted speech samples:
w̃(ts + k) = ws(k),   k = 0, 1, ..., ls − 1      (15)
This process is illustrated in fig. 11. The target signal samples following the updated segment are then also updated:
w̃(k) = w̃(k − d(k)),   k = ts + ls, ..., ts + ls + ls+1 + δs+1 − 2      (16)
Forming the target signal w̃(t) by taking the delay contour d(t) into account ensures a higher correlation between successive pitch period segments in the modified speech signal, and thus a more accurate long-term prediction. While processing the last segment of the frame, the target signal w̃(t) does not need to be updated. The displacements of the first and last segments in a frame are special cases that need to be performed with particular care. Before shifting the first segment, it should be ensured that there is no high-power region in the residual signal r(t) close to the frame boundary tn−1, because shifting such a segment may cause artifacts. The high-power region is searched by squaring the residual signal r(t) as follows:
E0(k) = r2(k),   k ∈ [tn−1 − ζ0, tn−1 + ζ0]      (17)
where ζ0 = ⟨p(tn−1)/2⟩. If the maximum of E0(k) is detected close to the frame boundary, within the range [tn−1 − 2, tn−1 + 2], the allowed displacement is limited to 1/4 sample. If the proposed displacement |δ| for the first segment is smaller than this limit, the signal modification process remains enabled in the current frame, but the first segment itself is kept intact.
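The boundary check of equation (17) reduces to locating the maximum of the squared residual in a window around the frame boundary. A minimal sketch, with illustrative names, taking the window half-width ζ0 as half the pitch period:

```python
def first_segment_shift_allowed(residual, t_boundary, pitch):
    """Return True if a full shift of the first segment is allowed, i.e. the
    maximum of E0(k) = r(k)^2 over [t_boundary - z, t_boundary + z], with
    z = pitch // 2, does NOT fall within [t_boundary - 2, t_boundary + 2]."""
    z = pitch // 2
    k_max = max(range(t_boundary - z, t_boundary + z + 1),
                key=lambda k: residual[k] ** 2)
    return not (t_boundary - 2 <= k_max <= t_boundary + 2)
```

When this returns False, the shift of the first segment would have to stay below the 1/4-sample limit described above.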
The last segment in the frame is processed in a similar manner. As described above, the delay contour d(t) is chosen such that, in principle, no displacement is needed for the last segment. However, since the target signal is repeatedly updated during signal modification, taking into account the correlation between successive segments via equations (15) and (16), it is possible that the last segment has to be shifted slightly. In the illustrated embodiment, this displacement is always limited to less than 3/2 samples. If a high-power region occurs at the end of the frame, no displacement is allowed at all. This condition is verified using the squared residual signal:
E1(k) = r2(k),   k ∈ [tn − ζ1 + 1, tn + 1]      (18)
where ζ1 = p(tn). If the maximum of E1(k) occurs for k ≥ tn − 4, no displacement is allowed for the last segment. As for the first segment, when the proposed displacement |δ| < 1/4, the current frame can still be accepted for modification, but the last segment is kept intact.
It should be noted that, in contrast to known signal modification methods, the displacement is not carried over to the next frame, and each new frame starts in good synchrony with the original input signal. As another fundamental difference from RCELP coding, the illustrated embodiment of the signal modification method processes a complete speech frame before its subframes are encoded. Admittedly, subframe-wise modification would allow the target signal for each subframe to be composed using previously encoded subframes, which may improve performance. This approach cannot be used in the context of the illustrated embodiment of the signal modification method, because the time asynchrony allowed at the frame end is strictly limited. Nevertheless, updating the target signal with equations (15) and (16) gives practically the same result as subframe-wise processing, since the modification is only enabled for smoothly evolving voiced speech frames.
Mode determination logic incorporated into signal modification processes
The illustrated embodiment of the signal modification method according to the invention incorporates an efficient classification and mode determination mechanism, as shown in fig. 2. Each of the operations performed in blocks 101, 103 and 105 yields several indicators quantifying the attainable performance of long-term prediction in the current frame. If any of these indicators falls outside its allowed limits, the signal modification process is terminated by one of the logic blocks 102, 104 or 106. In this case, the original signal is kept intact.
The pitch pulse search procedure 101 produces several indicators on the periodicity of the current frame. The logic block 102 analyzing these indicators is therefore the most important component of the classification logic. Logic block 102 compares the distance between detected pitch pulse positions against the interpolated open-loop pitch estimate using the following condition, and terminates the signal modification process if this condition is not met:
|Tk − Tk−1 − p(Tk)| < 0.2 p(Tk),   k = 1, 2, ..., c      (19)
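Condition (19) can be checked directly from the detected pulse positions and the interpolated open-loop pitch estimate. A sketch with illustrative names, where `pitch_at` stands for the interpolated estimate p(t):

```python
def pulses_track_pitch(pulse_positions, pitch_at):
    """Equation (19): every inter-pulse distance T_k - T_{k-1} must stay
    within 20% of the interpolated open-loop pitch estimate p(T_k)."""
    for k in range(1, len(pulse_positions)):
        t_k, t_prev = pulse_positions[k], pulse_positions[k - 1]
        p = pitch_at(t_k)
        if abs(t_k - t_prev - p) >= 0.2 * p:
            return False
    return True
```

A frame whose pulse spacing drifts more than 20% from the open-loop estimate fails the test and is left unmodified.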
the choice of the delay profile d (t) in block 103 also gives additional information about the evolution of the pitch period and the periodicity of the current speech frame. This information is checked in a logic block 104. As long as condition | d is satisfiedn-dn-1If < 0.2, the signal modification process continues from this block 104. This condition means that only small delay changes are allowed for classifying the current frame as a purely voiced frame. The logic block 104 also checks the selected delay parameter value dnDifference of (c) | κc-T0To evaluate the success of the delay selection loop of table 1. If the difference is greater than one sample, the signal modification process is terminated.
To ensure a good quality for the modified speech signal, it is advantageous to limit the displacements applied to successive pitch period segments in block 105. This is achieved in logic block 106 by applying the following criterion to all segments of the frame:
where δ(s) and δ(s−1) are the displacements of the s-th and (s−1)-th pitch period segments, respectively. If the threshold is exceeded, the signal modification process is interrupted and the original signal is retained.
When a frame subjected to signal modification is coded at a low bit rate, the shape of the pitch period segments must remain similar over the frame. This allows the signal to be modeled reliably by long-term prediction, and thus coded at a low bit rate without degrading the subjective quality. The similarity of successive segments can be quantified simply by the normalized correlation between the current segment ws(k), after its update in block 407 of fig. 10, and the optimally displaced target signal:
This normalized correlation gs is also referred to as the pitch gain.
If signal modification is beneficial in the current frame, shifting the pitch period segments in block 105 so as to maximize their correlation with the target signal enhances the periodicity and yields a high pitch prediction gain. The success of the procedure is examined in logic block 106 using the following criterion:
gs≥0.84
If this condition is not met for all segments, the signal modification process is terminated (block 409) and the original signal is kept intact. When the condition is satisfied (block 106), the signal modification continues in block 411. The pitch gain gs is calculated in block 408 between the recomputed segment ws(k) from block 407 and the target signal w̃(t) from block 405. In general, a slightly lower gain threshold can be allowed for female voices with equal coding performance. The gain threshold can be changed in different operating modes of the encoder to adjust the proportion of frames using the signal modification mode, and thus the resulting average bit rate.
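The gain test can be sketched as a normalized correlation followed by the 0.84 threshold applied to every segment of the frame; the names and data layout are illustrative.

```python
import math

def pitch_gain(seg, target):
    """Normalized correlation g_s between a shifted segment and its target."""
    num = sum(a * b for a, b in zip(seg, target))
    den = math.sqrt(sum(a * a for a in seg) * sum(b * b for b in target))
    return num / den if den > 0.0 else 0.0

def frame_modification_enabled(segment_target_pairs, threshold=0.84):
    """Modification is kept only if g_s >= threshold for all segments."""
    return all(pitch_gain(s, t) >= threshold for s, t in segment_target_pairs)
```

A single poorly matching segment is enough to disable modification for the whole frame, which is exactly the all-segments criterion stated above.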
Mode determination logic for source-controlled variable bit rate speech codecs
This section discloses the use of the signal modification procedure as a part of the general rate determination mechanism in a source-controlled variable bit rate speech codec. This functionality is inherent in the illustrated embodiment of the signal modification method, since it provides several indicators on signal periodicity and on the expected coding performance of long-term prediction in the current frame. These indicators include the evolution of the pitch period, the fitness of the selected delay contour for describing this evolution, and the pitch prediction gain attainable by signal modification. If the logic blocks 102, 104 and 106 shown in fig. 2 enable signal modification, long-term prediction is able to model the modified speech frame efficiently, facilitating its encoding at a low bit rate without degrading the subjective quality. In this case, the adaptive codebook excitation has a dominant contribution in describing the excitation signal, and the bit rate allocated to the fixed codebook excitation can therefore be reduced. When a logic block 102, 104 or 106 disables signal modification, the frame is likely to contain a non-stationary speech segment, such as a voiced onset or a rapidly evolving voiced speech signal. These frames typically require a high bit rate for maintaining good subjective quality.
Fig. 12 depicts the signal modification procedure 603 as a part of the rate determination logic controlling four encoding modes. In this illustrative embodiment, the mode set comprises dedicated modes for inactive speech frames (block 508), unvoiced speech frames (block 507), stable voiced frames (block 506), and other types of frames (block 505). It should be noted that all of these modes, except the mode for stable voiced frames 506, are implemented in accordance with techniques well known to those of ordinary skill in the art.
The rate determination logic is based on signal classification in three steps in logic blocks 501, 502 and 504, where the operation of blocks 501 and 502 is well known to those of ordinary skill in the art.
First, a voice activity detector (VAD) 501 discriminates between active and inactive speech frames. If an inactive speech frame is detected, the speech signal is processed in accordance with mode 508.
If an active speech frame is detected in block 501, the frame is subjected to a second classifier 502 dedicated to making a voicing decision. If the classifier 502 rates the current frame as an unvoiced speech signal, the classification chain ends and the speech signal is processed in accordance with mode 507. Otherwise, the speech frame is passed on to the signal modification module 603.
The signal modification module then determines, in logic block 504, whether signal modification of the current frame is enabled or disabled. This decision is in fact made as an integral part of the signal modification process in the logic blocks 102, 104 and 106, as described above with reference to fig. 2. When signal modification is enabled, the frame is deemed a stable voiced, or purely voiced, speech segment.
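The three-step classification of fig. 12 amounts to a short decision cascade. The predicates below are placeholders for the VAD, the voicing classifier and the signal modification gate; the mode labels are illustrative.

```python
def select_mode(frame, vad, is_unvoiced, modification_enabled):
    """Rate determination cascade: blocks 501 -> 502 -> 504 of fig. 12."""
    if not vad(frame):
        return "inactive"       # mode 508
    if is_unvoiced(frame):
        return "unvoiced"       # mode 507
    if modification_enabled(frame):
        return "stable_voiced"  # mode 506, coded with signal modification
    return "generic"            # mode 505, e.g. 12.65 kbps AMR-WB
```

Transition frames and onsets fall through every test and end up in the generic high-rate mode, matching the behaviour described in the text.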
When the rate determination mechanism selects mode 506, the signal modification mode is enabled and the speech frame is encoded in accordance with the teachings of the previous section. Table 2 discloses the bit allocation used in the illustrative embodiment for mode 506. Since the frames to be coded in this mode are characteristically very periodic, a substantially lower bit rate suffices for sustaining good subjective quality compared with, for instance, transition frames. The signal modification also allows the delay information to be coded efficiently using only nine bits per 20-ms frame, saving a considerable portion of the bit budget for other parameters. The good performance of long-term prediction allows only 13 bits per 5-ms subframe to be used for the fixed codebook excitation without sacrificing the subjective speech quality. The fixed codebook comprises one track with two pulses, each having 64 possible positions.
Table 2. Bit allocation in the voiced 6.2 kbps mode for a 20-ms frame comprising four subframes

Parameter            Bits/frame
LP parameters        34
Pitch delay          9
Pitch filter         4 = 1 + 1 + 1 + 1
Gains                24 = 6 + 6 + 6 + 6
Algebraic codebook   52 = 13 + 13 + 13 + 13
Mode bits            1
Total                124 bits = 6.2 kbps
Table 3. Bit allocation in the 12.65 kbps mode in accordance with the AMR-WB standard

Parameter            Bits/frame
LP parameters        46
Pitch delay          30 = 9 + 6 + 9 + 6
Pitch filter         4 = 1 + 1 + 1 + 1
Gains                28 = 7 + 7 + 7 + 7
Algebraic codebook   144 = 36 + 36 + 36 + 36
Mode bits            1
Total                253 bits = 12.65 kbps
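The totals in Tables 2 and 3 can be verified mechanically: a 20-ms frame carrying 124 bits corresponds to 6.2 kbps, and one carrying 253 bits to 12.65 kbps. The per-parameter figures below are read from the tables, with the AMR-WB gains taken as 7 bits per subframe (28 per frame) so that the 253-bit total holds.

```python
def frame_bitrate_kbps(bits_per_frame, frame_ms=20):
    """Average bit rate implied by a fixed per-frame allocation."""
    return bits_per_frame / frame_ms  # bits per millisecond equals kbps

voiced_6_2 = {"LP parameters": 34, "pitch delay": 9, "pitch filter": 4,
              "gains": 24, "algebraic codebook": 52, "mode bits": 1}
amr_wb_12_65 = {"LP parameters": 46, "pitch delay": 30, "pitch filter": 4,
                "gains": 28, "algebraic codebook": 144, "mode bits": 1}
```

The comparison makes concrete the claim in the text: the voiced mode halves the bit rate mainly by shrinking the fixed codebook (52 vs 144 bits) and the delay information (9 vs 30 bits).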
The other coding modes 505, 507 and 508 are implemented following known techniques, and signal modification is disabled in all of these modes. Table 3 shows the bit allocation of mode 505, adopted from the AMR-WB standard.
The specifications [11] and [12] related to the AMR-WB standard are included herein as references regarding the comfort noise and VAD functionalities in blocks 508 and 501, respectively.
[11] 3GPP TS 26.192, "AMR Wideband Speech Codec: Comfort Noise Aspects," 3GPP Technical Specification
[12] 3GPP TS 26.193, "AMR Wideband Speech Codec: Voice Activity Detector (VAD)," 3GPP Technical Specification
In summary, the present specification has described a frame-synchronized signal modification method for purely voiced speech frames, a classification mechanism for detecting the frames to be modified, and the use of these methods in a source-controlled CELP speech codec to enable high-quality coding at low bit rates.
The signal modification method incorporates a classification mechanism for determining the frames to be modified. It differs from existing signal modification and preprocessing methods both in its operation and in the properties of the modified signal. The classification functionality embedded in the signal modification process is used as a rate determination mechanism in a source-controlled CELP speech codec.
Signal modification is done pitch- and frame-synchronously; that is, the pitch period segments are adapted within the current frame such that the subsequent speech frame starts in good temporal alignment with the original signal. The pitch period segments are confined within the frame boundaries. This feature prevents time shifts from propagating over frame boundaries, simplifying the encoder implementation and reducing the risk of artifacts in the modified speech signal. Because time shifts do not accumulate over successive frames, the disclosed signal modification method does not need long buffers for accommodating expanded signals, nor complicated logic for controlling the accumulated time shifts. In source-controlled speech coding, it simplifies multi-mode operation between modes with signal modification enabled and disabled, since every new frame starts in time alignment with the original signal.
Of course, many other modifications and variations are possible. Such other modifications and variations will now become apparent to those skilled in the art from the foregoing detailed description of the invention and the associated drawings. It should also be apparent that such other variations may be effected without departing from the spirit and scope of the invention.

Claims (19)

1. A method of forming a delay profile characterizing long-term prediction in a technique using signal modification for digitally encoding a speech signal, the method comprising:
dividing the speech signal into a series of successive frames;
locating a pitch pulse of the speech signal in a previous frame; and
locating a pitch pulse of the speech signal in the current frame;
the method is characterized in that:
mapping the pitch pulses of the previous frame to the pitch pulses of the current frame using the long-term prediction delay parameter of the previous frame and the long-term prediction delay parameter of the current frame, to form a piecewise linear delay profile.
2. The method of claim 1, comprising:
the delay profile is formed based on the distance of consecutive pitch pulses between the last pitch pulse of the previous frame and the last pitch pulse of the current frame.
3. The method of any preceding claim, wherein locating pitch pulses comprises deriving a linear prediction residual signal from a speech signal.
4. The method of claim 1, wherein locating pitch pulses comprises deriving a weighted speech signal from a speech signal.
5. The method of claim 1 or 2, wherein locating the pitch pulse comprises deriving a synthesized weighted speech signal from the speech signal.
6. The method of claim 1, wherein forming the delay profile comprises selecting a long-term prediction delay parameter by iterating over a function of a temporal variable and a position of a pitch pulse of the speech signal in a previous frame and a current frame.
7. The method of claim 6, comprising searching for long-term prediction delay parameter values in a plurality of stages, and starting with a long-term prediction delay parameter value predicted for the end of the current frame, each successive stage having an increased resolution and a more focused search range.
8. The method of claim 7, comprising predicting the long-term prediction delay parameter value as a difference equal to a quotient of twice a difference between the long-term prediction delay parameter value at the end of the previous frame and a pitch pulse position of the speech signal in the previous frame and the current frame divided by the number of iterations of the function.
9. The method of claim 1, comprising modifying the speech signal by shifting pitch period segments one by one to adjust them to the delay profile.
10. The method of claim 9, comprising determining segment displacement by correlating segments in the weighted speech domain with the target signal.
11. The method according to one of claims 4 and 10, comprising composing the target signal using the synthesized weighted speech signal of the previous frame and any previous displaced segments in the current frame.
12. An apparatus (603) for forming a delay profile characterizing long-term prediction in a technique using signal modification for digitally encoding a speech signal, the apparatus comprising:
a divider for dividing the speech signal into a series of successive frames;
a detector for locating a pitch pulse of the speech signal in a previous frame;
a detector for locating a pitch pulse of the speech signal in the current frame;
the method is characterized in that:
a selector arranged to map the pitch pulses of the previous frame to the pitch pulses of the current frame using the long-term prediction delay parameter of the previous frame and the long-term prediction delay parameter of the current frame, to form a piecewise linear delay profile.
13. The apparatus of claim 12, wherein the selector is a calculator for calculating the long-term prediction delay parameter based on the distance of consecutive pitch pulses between the last pitch pulse of the previous frame and the last pitch pulse of the current frame.
14. An apparatus according to claim 12 or 13, wherein the selector is a selector for selecting the long-term prediction delay parameter by iterating a function of the temporal variable and the positions of the pitch pulses of the speech signal in the previous frame and the current frame.
15. The apparatus of claim 14, comprising:
a searcher for searching the long-term prediction delay parameter value in a plurality of stages, starting from the long-term prediction delay parameter value predicted for the end of the current frame, each successive stage having an increased resolution and a more focused search range.
16. The apparatus of claim 15, comprising:
a predictor for predicting the long-term prediction delay parameter value as a difference equal to a quotient of twice a difference between the long-term prediction delay parameter value at the end of the previous frame and a pitch pulse position of the speech signal in the previous frame and the current frame divided by the number of iterations of the function.
17. The apparatus of claim 12, comprising:
a modifier for modifying the speech signal by shifting the pitch period segments one by one to adjust them to the delay profile.
18. The apparatus of claim 17, comprising:
a determiner for determining a segment displacement by correlating the segments in the weighted speech domain with the target signal.
19. The apparatus of claim 18, comprising:
a composer for composing the target signal using the weighted speech signal synthesized in the previous frame and any previous displaced segments in the current frame.
HK10100712.5A 2001-12-14 2010-01-22 Signal modification method for efficient coding of speech signals HK1133730B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CA2,365,203 2001-12-14
CA002365203A CA2365203A1 (en) 2001-12-14 2001-12-14 A signal modification method for efficient coding of speech signals

Publications (2)

Publication Number Publication Date
HK1133730A1 HK1133730A1 (en) 2010-04-01
HK1133730B true HK1133730B (en) 2014-04-17
