
HK1129487B - Method for time warping of a decoded audio signal after packet loss - Google Patents

Method for time warping of a decoded audio signal after packet loss

Info

Publication number
HK1129487B
Authority
HK
Hong Kong
Prior art keywords
signal
band
frame
speech signal
lag
Prior art date
Application number
HK09107382.2A
Other languages
Chinese (zh)
Other versions
HK1129487A1 (en)
Inventor
Robert W. Zopf
Juin-Hwey Chen
Jes Thyssen
Original Assignee
Broadcom Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Broadcom Corporation
Priority claimed from PCT/US2007/076019 (WO2008022207A2)
Publication of HK1129487A1
Publication of HK1129487B

Description

Method for time warping of a decoded audio signal after packet loss
Technical Field
The present invention relates to systems and methods for concealing the quality-degrading effects of packet loss in a speech or audio coder.
Background
In digital transmission of voice or audio signals over packet networks, the encoded voice/audio signal is typically divided into frames and then encapsulated into packets, where each packet may contain one or more frames of encoded voice/audio data. The packets are then transmitted over a packet network. Sometimes packets are lost, and sometimes packets arrive too late to be useful and are therefore also regarded as lost. Such packet loss causes significant degradation of audio quality unless special techniques are used to conceal its effects.
Packet Loss Concealment (PLC) methods based on extrapolation of the audio signal exist for block-independent coders and for full-band predictive coders. Such PLC methods include the techniques disclosed in U.S. Patent Application No. 11/234,291, entitled "Packet Loss Concealment for Block-Independent Speech Codecs," and U.S. Patent Application No. 10/183,608, entitled "Method and System for Frame Erasure Concealment for Predictive Speech Coding Based on Extrapolation of Speech Waveform." However, the techniques described in these applications cannot be used directly in sub-band predictive coders, such as the ITU-T Recommendation G.722 wideband speech coder, because they do not address certain structural issues specific to sub-band coders. Furthermore, for each sub-band the G.722 coder uses an Adaptive Differential Pulse Code Modulation (ADPCM) predictive coder with gradient-based, sample-by-sample backward adaptation of the quantizer step size and the predictor coefficients, which poses particular challenges that existing PLC techniques do not solve. Therefore, there is a need for a PLC method specifically designed for sub-band predictive coders such as G.722.
Disclosure of Invention
The present invention conceals the quality-degrading effects of packet loss in a sub-band predictive coder. It specifically addresses the sub-band-specific structural issues that arise when applying audio waveform extrapolation techniques to sub-band predictive coders, as well as the PLC difficulties specific to backward-adaptive ADPCM coders in general and the G.722 sub-band ADPCM coders in particular.
In particular, described herein is a method, in a decoder that decodes a series of frames representing an encoded audio signal, for transitioning between a lost frame in the series and one or more received frames following the lost frame. According to the method, an output audio signal associated with the lost frame is synthesized; an extrapolated signal is generated based on the synthesized output audio signal; a time lag is calculated between the extrapolated signal and a decoded audio signal associated with the received frames, wherein the time lag represents a phase difference between the extrapolated signal and the decoded audio signal; and the decoded audio signal is time-warped based on the time lag, wherein time-warping the decoded audio signal comprises stretching or shrinking the decoded audio signal in the time domain.
The invention also describes a system. The system includes a decoder, an audio signal synthesizer, and time warp logic. The decoder is for decoding a received frame of a series of frames representing an encoded audio signal. The audio signal synthesizer is configured to synthesize an output audio signal associated with a lost frame of the series of frames. The time-warping logic is to generate an extrapolated signal based on the synthesized output audio signal, calculate a time lag between the extrapolated signal and a decoded audio signal associated with one or more received frames in the series of frames that immediately follow a lost frame, and time-warp the decoded audio signal based on the time lag. The time lag represents a phase difference between the extrapolated signal and the decoded audio signal, and time-warping the decoded audio signal comprises stretching or shrinking the decoded audio signal in the time domain.
A computer program product is also described. The computer program product includes a computer-readable medium having computer program logic recorded thereon for causing a processor to transition between a lost frame in a series of frames representing an encoded audio signal and one or more received frames immediately following the lost frame. The computer program logic includes a first module for causing the processor to synthesize an output audio signal associated with the lost frame; a second module for causing the processor to generate an extrapolated signal based on the synthesized output audio signal; a third module for causing the processor to calculate a time lag between the extrapolated signal and a decoded audio signal associated with the received frames, wherein the time lag represents a phase difference between the extrapolated signal and the decoded audio signal; and a fourth module for causing the processor to time-warp the decoded audio signal based on the time lag, wherein time-warping the decoded audio signal comprises stretching or shrinking the decoded audio signal in the time domain.
Further features and advantages of the invention, as well as the structure and operation of various embodiments of the invention, are described in further detail below with reference to the accompanying drawings. It is noted that the invention is not limited to the specific embodiments described herein. The embodiments presented herein are for illustrative purposes only. Further embodiments will be apparent to those skilled in the art based on the teachings contained herein.
Drawings
The accompanying drawings, which are incorporated herein and form a part of the specification, illustrate one or more embodiments of the present invention and, together with the description, further serve to explain the uses, advantages, and principles of the invention and to enable a person skilled in the pertinent art to make and use the invention.
FIG. 1 is a schematic diagram of an encoder structure of a conventional ITU-T G.722 sub-band predictive encoder;
FIG. 2 is a schematic diagram of a decoder architecture of a conventional ITU-T G.722 sub-band predictive encoder;
FIG. 3 is a block diagram of a decoder/PLC system according to an embodiment of the present invention;
FIG. 4 is a flow chart of a method of processing frames in a decoder/PLC system to produce an output speech signal according to an embodiment of the present invention;
FIG. 5 is a timing diagram of different types of frames that may be processed by a decoder/PLC system according to an embodiment of the present invention;
FIG. 6 is a timeline diagram of the amplitudes of an original speech signal and an extrapolated speech signal;
FIG. 7 is a flowchart of a method of calculating a time lag between a decoded speech signal and an extrapolated speech signal according to an embodiment of the present invention;
FIG. 8 is a flowchart of a two-stage method of calculating a time lag between a decoded speech signal and an extrapolated speech signal in accordance with an embodiment of the present invention;
FIG. 9 is a schematic diagram of the manner in which an extrapolated speech signal may be shifted relative to a decoded speech signal during a time lag calculation according to an embodiment of the present invention;
FIG. 10A is a timeline diagram showing a decoded speech signal that leads the extrapolated speech signal, and the relative effect on a re-encoding operation, according to an embodiment of the present invention;
FIG. 10B is a timeline diagram showing a decoded speech signal that lags the extrapolated speech signal, and the relative effect on a re-encoding operation, according to an embodiment of the present invention;
FIG. 10C is a timeline diagram showing a decoded speech signal and an extrapolated speech signal that are synchronized at the frame boundary, and the relative effect on a re-encoding operation, in accordance with an embodiment of the present invention;
FIG. 11 is a flowchart of a method for performing re-phasing of the internal states of the sub-band ADPCM decoders after a packet loss, according to an embodiment of the present invention;
FIG. 12A is a diagram illustrating the application of time warping to a decoded speech signal that leads the extrapolated speech signal, according to an embodiment of the present invention;
FIGS. 12B and 12C are diagrams each illustrating the application of time warping to a decoded speech signal that lags the extrapolated speech signal, according to an embodiment of the present invention;
FIG. 13 is a flow chart of a method of performing time warping to shrink a signal along a time axis according to an embodiment of the present invention;
FIG. 14 is a flow chart of a method of performing time warping to stretch a signal along a time axis according to an embodiment of the present invention;
FIG. 15 is a block diagram of logic for processing frames received beyond a predetermined number of received frames after a packet loss within a decoder/PLC system in accordance with an embodiment of the present invention;
FIG. 16 is a block diagram of the logic for performing waveform extrapolation in a decoder/PLC system to generate an output speech signal associated with a lost frame in accordance with an embodiment of the present invention;
FIG. 17 is a block diagram of logic for updating sub-band ADPCM decoder states in a decoder/PLC system in accordance with an embodiment of the present invention;
FIG. 18 is a block diagram of logic for performing rephasing and time warping in a decoder/PLC system according to an embodiment of the present invention;
FIG. 19 is a block diagram of logic for performing constrained and controlled decoding of good frames received after a packet loss in a decoder/PLC system in accordance with an embodiment of the present invention;
FIG. 20 is a block diagram of a simplified low-band ADPCM encoder for updating the internal state of a low-band ADPCM decoder during packet loss according to an embodiment of the invention;
FIG. 21 is a block diagram of a simplified high-band ADPCM encoder for updating internal states of a high-band ADPCM decoder during packet loss, according to an embodiment of the present invention;
FIGS. 22A, 22B and 22C are each a time line schematic diagram of applying a time warp to a decoded speech signal according to an embodiment of the present invention;
FIG. 23 is a block diagram of another decoder/PLC system according to an embodiment of the present invention;
FIG. 24 is a block diagram of a computer system implementing an embodiment of the invention.
The features and advantages of the present invention will become more apparent from the detailed description set forth below when taken in conjunction with the drawings. The drawing in which an element first appears is indicated by the leftmost digit(s) of the corresponding reference number.
Detailed Description
A. Introduction
Exemplary embodiments of the present invention will be described in detail below with reference to the accompanying drawings. Other embodiments are possible, and modifications to the exemplary embodiments may be made within the spirit and scope of the invention. The following detailed description, therefore, is not to be taken in a limiting sense. Rather, the scope of the invention is defined by the claims.
Those skilled in the art will readily appreciate that the present invention, as described below, may be implemented in hardware, software, firmware, and/or the entities illustrated in the figures. Any actual software code with specialized control hardware used to implement the present invention does not limit the invention. Thus, the operation and behavior of the present invention are described with the understanding that modifications and variations of the embodiments are possible given the level of detail presented herein.
It should be understood that although the detailed description of the invention presented herein is directed to the processing of speech signals, the invention may also be used in relation to the processing of other types of audio signals. Accordingly, the terms "speech" and "speech signal" are used herein for convenience of description only and are not limiting. Those skilled in the art will appreciate that this term may be replaced with the more commonly used terms "audio" and "audio signal". Furthermore, although speech and audio signals are described herein as being partitioned into frames, one skilled in the art will appreciate that such signals may also be partitioned into other discrete signal segments, including but not limited to sub-frames. Thus, operations described herein as being performed on frames also include similar operations performed on other segments of a speech or audio signal, such as sub-frames.
In addition, although the following description discusses the loss of frames of an audio signal transmitted over a packet network (referred to as packet loss), the present invention is not limited to Packet Loss Concealment (PLC). For example, in wireless networks, frames of an audio signal may also be lost or erased due to channel impairments. This situation is called "frame erasure." When it occurs, to avoid a substantial degradation in output speech quality, the decoder in the wireless system needs to perform "frame erasure concealment" (FEC) to conceal the quality degradation caused by the dropped frames. From the perspective of a PLC or FEC algorithm, packet loss and frame erasure amount to the same problem: certain transmitted frames are unavailable for decoding, so the algorithm must generate a waveform to fill the gap corresponding to the lost frames and thereby conceal the resulting degradation. Since the terms FEC and PLC refer to the same kind of technique, they may be used interchangeably. For convenience, the term "packet loss concealment," or PLC, is used herein to refer to both.
B. Review of sub-band predictive coding
To facilitate a better understanding of the embodiments of the invention described in the following sections, a review of the basic principles of sub-band predictive coding is made. In general, a sub-band predictive encoder may separate an input speech signal into N sub-bands, where N ≧ 2. Without loss of generality, the dual-band predictive coding system of the ITU-T G.722 encoder is described herein as an example. Those skilled in the art can readily generalize this description to other N-band sub-band predictive encoders.
Fig. 1 is a simplified encoder structure 100 of a g.722 sub-band predictive encoder. The encoder architecture 100 includes a Quadrature Mirror Filter (QMF) analysis filterbank 110, a low-band Adaptive Differential Pulse Code Modulation (ADPCM) encoder 120, a high-band ADPCM encoder 130, and a bitstream multiplexer 140. The QMF analysis filter bank 110 separates the input speech signal into a low-band speech signal and a high-band speech signal. The low-band ADPCM encoder 120 encodes the low-band speech signal into a low-band bitstream. High-band ADPCM encoder 130 encodes the high-band speech signal into a high-band bitstream. The bitstream multiplexer 140 multiplexes the low-band bitstream and the high-band bitstream into a single output bitstream. In the packet transmission application discussed herein, the output bitstream is packaged into packets and then transmitted to the sub-band predictive decoder 200, as shown in FIG. 2.
As shown in fig. 2, the decoder 200 includes a bitstream demultiplexer 210, a low-band ADPCM decoder 220, a high-band ADPCM decoder 230, and a QMF synthesis filter bank 240. The bitstream demultiplexer 210 separates the input bitstream into a low-band bitstream and a high-band bitstream. The low-band ADPCM decoder 220 decodes the low-band bitstream into a decoded low-band speech signal. The high-band ADPCM decoder 230 decodes the high-band bitstream into a decoded high-band speech signal. QMF synthesis filter bank 240 then combines the decoded low-band speech signal and the decoded high-band speech signal into a full-band output speech signal.
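To make the structure concrete, the following is a minimal sketch of the two-band analysis/synthesis signal flow of FIGS. 1 and 2. It substitutes a trivial 2-tap (Haar) filter pair for G.722's 24-tap QMF filters and omits the sub-band ADPCM coding entirely, so it illustrates only the split/recombine pipeline, not G.722 itself.

```python
import numpy as np

def qmf_analysis(x):
    """Split x into low/high bands, each at half the sampling rate.
    A 2-tap (Haar) stand-in for G.722's 24-tap QMF pair."""
    x = x[:len(x) // 2 * 2]          # even number of samples
    lo = (x[0::2] + x[1::2]) / 2     # low-band signal at fs/2
    hi = (x[0::2] - x[1::2]) / 2     # high-band signal at fs/2
    return lo, hi

def qmf_synthesis(lo, hi):
    """Recombine the two sub-band signals into a full-band signal."""
    y = np.empty(2 * len(lo))
    y[0::2] = lo + hi
    y[1::2] = lo - hi
    return y

# Encoder side (FIG. 1): split, encode each band, multiplex.
fs = 16000
x = np.sin(2 * np.pi * 440 * np.arange(fs // 100) / fs)  # 10 ms test tone
lo, hi = qmf_analysis(x)
# ... low-band and high-band ADPCM encoding/decoding would go here ...

# Decoder side (FIG. 2): demultiplex, decode each band, recombine.
y = qmf_synthesis(lo, hi)
assert np.allclose(x, y)  # the Haar pair gives perfect reconstruction
```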
Further details regarding the structure and operation of encoder 100 and decoder 200 may be found in ITU-T recommendation g.722, which is incorporated herein by reference in its entirety.
C. Packet loss concealment for a sub-band predictive coder based on full-band speech waveform extrapolation
A high quality PLC system and method in accordance with one embodiment of the present invention will now be described. A general introduction to the system and method is provided in this section, and further details regarding specific implementations of the system and method are described in section D below. The example systems and methods are for an ITU-T recommendation G.722 speech coder. However, those skilled in the art will appreciate that many of the concepts described herein with respect to this particular embodiment may be used to perform PLC in other types of sub-band predictive speech encoders as well as other types of speech and audio encoders.
As described in more detail herein, this embodiment performs PLC in the 16 kHz output domain of the G.722 speech decoder. Periodic waveform extrapolation is used to fill the waveform gap associated with lost frames of the speech signal, with the extrapolated waveform mixed with filtered noise according to the signal characteristics before the loss. To update the states of the sub-band ADPCM decoders, the extrapolated 16 kHz signal is passed through a QMF analysis filter bank to generate sub-band signals, which are then processed by simplified sub-band ADPCM encoders. To provide a smooth transition from the extrapolated waveform associated with lost frames to the normally decoded waveform associated with the good frames received after the packet loss, additional processing is performed after each packet loss. The states of the sub-band ADPCM decoders are phase-aligned with the first good frame received after the packet loss, and the normally decoded waveform associated with that first good frame is time-warped to align it with the extrapolated waveform before the two are overlap-added, thereby securing a smooth transition. For long packet losses, the system and method gradually mute the output signal.
Fig. 3 is a high-level block diagram of a g.722 speech decoder 300 implementing such PLC functions. Although the decoder/PLC system 300 described herein includes a g.722 decoder, those skilled in the art will appreciate that many of the concepts described herein can be applied generally to any N-band sub-band predictive coding system. Similarly, the predictive coder for each sub-band need not be the ADPCM coder shown in fig. 3, but can be any general predictive coder, and can be either forward adaptive or backward adaptive.
As shown in fig. 3, the decoder/PLC system 300 includes a bitstream demultiplexer 310, a low-band ADPCM decoder 320, a high-band ADPCM decoder 330, a switch 336, a QMF synthesis filter bank 340, a full-band speech signal synthesizer 350, a sub-band ADPCM decoder state update module 360, and a decoding constraint and control module 370.
The terms "lost frame" or "bad frame" as used herein refer to a frame of a speech signal that is not received at the decoder/PLC 300 or is deemed unsuitable for normal decoding operations. The "received frame" or "good frame" is a frame of a speech signal that is normally received at the decoder/PLC system 300. The "current frame" is the frame currently being processed by the decoder/PLC 300 to generate the output speech signal, and the "previous frame" is the frame previously processed by the decoder/PLC system 300 to generate the output speech signal. The terms "current frame" and "previous frame" may both be used to refer to the received frame as well as the frame that is performing PLC operations and the lost frame.
The manner in which the decoder/PLC system 300 operates will be described with reference to the flow diagram 400 of fig. 4. As shown in fig. 4, the method of flowchart 400 begins at step 402 where the decoder/PLC system 300 determines the frame type of the current frame. The decoder/PLC system 300 distinguishes six different types of frames, represented by types 1 through 6, respectively. Fig. 5 provides a timeline 500 for different frame types. A frame of type 1 is any received frame after the eighth received frame after the packet loss. The type 2 frames are the first and second lost frames associated with packet loss. The frame of type 3 is any one of the third to sixth lost frames associated with packet loss. A frame of type 4 is any one lost frame after the sixth lost frame associated with packet loss. A frame of type 5 is any received frame that is received immediately following a packet loss. Finally, the frame of type 6 is any one of the second to eighth received frames received after the packet loss. Those skilled in the art will readily appreciate that other schemes of classifying frame types may also be used in accordance with alternative embodiments of the present invention. For example, in systems with different frame sizes, the number of frames in each frame type is different from that described above. Also, the number of frames in each frame type may be different for different codecs (i.e., non-g.722 codecs).
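For illustration, this six-way classification can be expressed as a function of two counters, sketched below for the frame counts given above; the function and variable names are illustrative only.

```python
def classify_frame(is_received, lost_count, good_count):
    """Return the frame type (1-6) per the scheme of FIG. 5.

    lost_count: position of this frame within the current packet loss
                (1 = first lost frame), valid when is_received is False.
    good_count: number of received frames since the last packet loss
                (1 = first received frame), valid when is_received is True.
                Pass a value greater than 8 if no loss has occurred yet.
    """
    if not is_received:
        if lost_count <= 2:
            return 2          # first or second lost frame
        if lost_count <= 6:
            return 3          # third through sixth lost frame
        return 4              # beyond the sixth lost frame
    if good_count == 1:
        return 5              # first received frame after the loss
    if good_count <= 8:
        return 6              # second through eighth received frame
    return 1                  # normal decoding resumes
```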
The manner in which the decoder/PLC system 300 processes the current frame to generate the output speech signal is determined by the frame type of the current frame. This is represented in fig. 4 by a series of decision steps 404, 406, 408 and 410. Specifically, if it is determined at step 402 that the current frame is a type 1 frame, then a first sequence of processing steps is performed to generate an output speech signal, as shown at decision step 404. If it is determined at step 402 that the current frame is a type 2, type 3 or type 4 frame, then a second sequence of processing steps is performed to produce an output speech signal, as shown at decision step 406. If it is determined at step 402 that the current frame is a type 5 frame, a third sequence of processing steps is performed to generate an output speech signal, as shown at decision step 408. Finally, if it is determined at step 402 that the current frame is a type 6 frame, then a fourth sequence of processing steps is performed to generate an output speech signal, as shown at decision step 410. The processing steps associated with each of the different frame types will be described below.
After the processing steps for each sequence are performed, it is determined whether there are additional frames to process in decision step 430. If there are additional frames to process, then processing returns to step 402. However, if there are no additional frames to process, then the process ends at step 432 as shown.
1. Processing type 1 frames
As shown at step 412 of flowchart 400, if the current frame is a type 1 frame, then the decoder/PLC system 300 performs normal g.722 decoding of the current frame. Thus, the modules 310, 320, 330 and 340 of the decoder/PLC system 300 properly perform the same functions as the modules 210, 220, 230 and 240, respectively, of the conventional g.722 decoder 200. Specifically, the bitstream demultiplexer 310 separates the input bitstream into a low-band bitstream and a high-band bitstream. The low-band ADPCM decoder 320 decodes the low-band bitstream into a decoded low-band speech signal. High-band ADPCM decoder 330 decodes the high-band bitstream into a decoded high-band speech signal. QMF synthesis filter bank 340 then recombines the decoded low-band speech signal and the decoded high-band speech signal into a full-band speech signal. In processing the type 1 frames, the switch 336 is connected to the upper position labeled "type 1" to take the output signal of the QMF synthesis filter bank 340 as the final output speech signal of the decoder/PLC system 300 for the type 1 frames.
After step 412 is completed, the decoder/PLC system 300 updates the various state memories and performs some processing that facilitates PLC operations for subsequent frame losses, as shown at step 414. The state memories include PLC related low-band ADPCM decoder state memories, PLC related high-band ADPCM decoder state memories, and full-band PLC related state memories. As part of this step, full-band speech signal synthesizer 350 stores the output signal of QMF synthesis filter bank 340 in an internal signal buffer in preparation for possible speech waveform extrapolation during subsequent lost frame processing. The sub-band ADPCM decoder state update module 360 and the decoding constraint and control module 370 are inactive during processing of type 1 frames. More details regarding type 1 frame processing will be provided below with reference to a specific implementation of the decoder/PLC system 300 described in section D.
2. Processing type 2, type 3 and type 4 frames
In processing frames of type 2, type 3 and type 4, the input bitstream associated with the lost frame is not available. Thus, modules 310, 320, 330 and 340 cannot perform their usual functions and are inactive. Instead, the switch 336 is connected to the lower position labeled "type 2-6," and the full-band speech signal synthesizer 350 becomes active to synthesize the output speech signal of the decoder/PLC system 300. The full-band speech signal synthesizer 350 synthesizes the output speech signal by extrapolating the previously stored output speech signal associated with the last few received frames before the packet loss. This is represented by step 416 of flowchart 400.
After the full-band speech signal synthesizer 350 has completed the waveform synthesis task, the sub-band ADPCM decoder state update module 360 appropriately updates the internal states of the low-band ADPCM decoder 320 and the high-band ADPCM decoder 330 in preparation for a possible good frame in the next frame, as shown in step 418. The manner in which steps 416 and 418 are performed will now be described in more detail.
a. Waveform extrapolation
There are many prior-art techniques for performing the waveform extrapolation function of step 416. The technique used in the implementation of the decoder/PLC system 300 described below in section D is a modified version of the technique described in U.S. Patent Application No. 11/234,291, filed September 26, 2005, entitled "Packet Loss Concealment for Block-Independent Speech Codecs." A high-level description of the technique is provided here, with more detail presented in section D.
To implement the waveform extrapolation function, the full-band speech signal synthesizer 350 analyzes the stored output speech signal from the QMF synthesis filter bank 340 during the processing of received frames to extract the pitch period, a short-term predictor, and a long-term predictor. These parameters are then stored for subsequent use.
Full-band speech signal synthesizer 350 extracts the pitch period by performing a two-stage search. In the first stage, a low-resolution pitch period (or "coarse pitch") is determined by searching a decimated version of the input speech signal or a filtered version thereof. In the second stage, the coarse pitch is refined to full resolution by searching the neighborhood of the coarse pitch using the undecimated signal. This two-stage search requires significantly less computational complexity than a single-stage full search on the undecimated signal. Before the speech signal or its filtered version is decimated, the undecimated signal typically must pass through an anti-aliasing low-pass filter. To reduce complexity, a common prior-art approach is to use a low-order Infinite Impulse Response (IIR) filter, such as an elliptic filter. However, the poles of a good low-order IIR filter are typically very close to the unit circle, so double-precision arithmetic is required when performing the filtering operations corresponding to the all-pole portion of the filter in a 16-bit fixed-point implementation.
In contrast to the prior art, the full-band speech signal synthesizer 350 uses a Finite Impulse Response (FIR) filter as the anti-aliasing low-pass filter. With a FIR filter, only single-precision 16-bit fixed-point arithmetic is required, and the filter can operate at the lower sampling rate of the decimated signal. This approach therefore significantly reduces the computational complexity of the anti-aliasing low-pass filter. For example, in the implementation of decoder/PLC system 300 described in section D, the undecimated signal has a sampling rate of 16 kHz, while the decimated signal used for coarse pitch extraction has a sampling rate of only 2 kHz. Following the prior art, a 4th-order elliptic filter could be used. The all-pole portion of the elliptic filter requires double-precision fixed-point arithmetic and must operate at the 16 kHz sampling rate. Hence, even though the all-zero portion can operate at the 2 kHz sampling rate, the complete 4th-order elliptic filter and downsampling operation requires a computational complexity of 0.66 WMOPS (weighted million operations per second). In contrast, even if a relatively high-order FIR filter of order 60 is used instead, the entire FIR filtering and downsampling operation requires only 0.18 WMOPS, 73% less than the 4th-order elliptic filter, because the 60th-order FIR filter operates at the very low 2 kHz sampling rate.
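The following sketch illustrates the FIR-based decimation and the two-stage pitch search described above. It is schematic, not the section D algorithm: the windowed-sinc design, tap count, pitch range, and normalized-autocorrelation search are illustrative assumptions, and a fixed-point implementation would evaluate the FIR only at the retained (2 kHz) output samples, which is where the complexity saving comes from.

```python
import numpy as np

def fir_lowpass(num_taps, cutoff):
    """Windowed-sinc linear-phase FIR low-pass (cutoff relative to Nyquist)."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = np.sinc(cutoff * n) * np.hamming(num_taps)
    return h / h.sum()                      # unity gain at DC

def decimate(x, factor=8, num_taps=61):
    """Anti-alias with an FIR filter, then keep every factor-th sample."""
    h = fir_lowpass(num_taps, 1.0 / factor)
    y = np.convolve(x, h, mode="same")
    return y[::factor]                      # e.g. 16 kHz -> 2 kHz

def best_lag(sig, lo, hi):
    """Lag in [lo, hi] maximizing the normalized autocorrelation of sig."""
    best, best_r = lo, -np.inf
    for lag in range(lo, hi + 1):
        a, b = sig[lag:], sig[:len(sig) - lag]
        r = a @ b / (np.sqrt((a @ a) * (b @ b)) + 1e-12)
        if r > best_r:
            best, best_r = lag, r
    return best

def two_stage_pitch(x, fs=16000, factor=8, fmin=50, fmax=400):
    """Coarse search at fs/factor, then refine around it at full rate."""
    xd = decimate(x, factor)
    coarse = best_lag(xd, (fs // factor) // fmax, (fs // factor) // fmin)
    center = coarse * factor                # map back to the 16 kHz domain
    return best_lag(x, max(center - factor, 2), center + factor)
```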
At the beginning of the first lost frame of a packet loss, the full-band speech signal synthesizer 350 uses cascaded long-term and short-term synthesis filters to generate a signal referred to as the "ringing signal," obtained when the input to the cascaded synthesis filters is set to zero. The full-band speech signal synthesizer 350 then analyzes certain signal parameters (e.g., pitch prediction gain and normalized autocorrelation) to determine the degree of "voicing" in the stored output speech signal. If the previous output speech signal is highly voiced, the speech signal is extrapolated in a periodic manner to produce a replacement waveform for the current bad frame. The periodic waveform extrapolation is performed using a refined version of the pitch period extracted from the last received frame. If the previous output speech signal is unvoiced or noise-like, scaled random noise is passed through a short-term synthesis filter to produce the replacement signal for the current bad frame. If the degree of voicing lies between the two extremes, the two components are mixed in proportion to the degree of voicing. The extrapolated signal is then overlap-added with the ringing signal to ensure that there is no waveform discontinuity at the beginning of the first bad frame of the packet loss. In addition, the waveform extrapolation is extended beyond the end of the current bad frame by at least the length of the overlap-add period, so that the extra samples of the extrapolated signal at the beginning of the next frame can serve as the ringing signal for the overlap-add at the beginning of that frame.
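A minimal sketch of the periodic extrapolation, voicing-based mixing, and overlap-add described above follows. The noise branch here is white noise scaled to the recent signal level, standing in for "scaled random noise passed through a short-term synthesis filter," and the voicing measure is assumed to be computed elsewhere.

```python
import numpy as np

def plc_extrapolate(history, pitch, n, voicing):
    """Generate n replacement samples continuing past `history`.

    voicing in [0, 1]: 1 = purely periodic extrapolation, 0 = noise only,
    intermediate values mix the two in proportion to the voicing.
    """
    buf = list(history)
    for _ in range(n):                   # copy from one pitch period back;
        buf.append(buf[-pitch])          # the buffer grows as we extrapolate
    periodic = np.array(buf[len(history):])

    level = np.sqrt(np.mean(np.square(history[-pitch:])))  # recent RMS
    noise = np.random.randn(n) * level   # stand-in for filtered noise

    return voicing * periodic + (1.0 - voicing) * noise

def overlap_add(ringing, new, win_len):
    """Cross-fade from the ringing signal into the new segment."""
    w = np.linspace(0.0, 1.0, win_len)
    out = new.copy()
    out[:win_len] = (1.0 - w) * ringing[:win_len] + w * new[:win_len]
    return out
```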
In a bad frame that is not the first bad frame of a packet loss (i.e., a frame of type 3 or type 4), the operation of the full-band speech signal synthesizer 350 is essentially the same as described in the previous paragraph, except that the synthesizer does not need to compute the ringing signal; the extra samples of the extrapolated signal computed in the previous frame beyond the end of that frame are used as the ringing signal for the overlap-add operation, thereby ensuring that there is no waveform discontinuity at the beginning of the frame.
For long packet losses, the full-band speech signal synthesizer 350 gradually attenuates the output speech signal of the decoder/PLC system 300. For example, in the implementation of the decoder/PLC system described in section D, the output speech signal generated during a packet loss decays, or "fades," to zero in a linear fashion, starting 20 ms into the loss and ending at 60 ms. This is done because the uncertainty about the shape and form of the "real" waveform increases with time. In practice, many PLC schemes start to produce buzzy output when the extrapolated segment extends much beyond about 60 ms.
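The 20 ms/60 ms linear fade described above can be written directly as a gain function of the time elapsed since the start of the loss:

```python
def plc_fade_gain(t_ms, start_ms=20.0, end_ms=60.0):
    """Linear PLC attenuation gain: unity up to start_ms into the loss,
    ramping linearly to zero at end_ms (the values given in section D)."""
    if t_ms <= start_ms:
        return 1.0
    if t_ms >= end_ms:
        return 0.0
    return (end_ms - t_ms) / (end_ms - start_ms)
```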
In an alternative embodiment directed to PLC in background noise, an embodiment of the invention tracks the level of the background (ambient) noise and, for long erasures, attenuates the output to that level rather than to zero. This eliminates the audible dropouts in the background noise that would otherwise occur when the PLC system mutes the output.
Further alternative embodiments of the present invention address the aforementioned problem of PLC in background noise by implementing a Comfort Noise Generation (CNG) function. When such an embodiment begins to attenuate the output speech signal of the decoder/PLC system 300 during a long packet loss, it also begins to mix in comfort noise generated by the CNG. By mixing in, and eventually substituting, comfort noise as the output speech signal of the decoder/PLC system 300 is attenuated and finally muted, the dropout effect described above is eliminated and a faithful reproduction of the ambient environment of the signal is provided. This approach is well proven and generally accepted in other applications. For example, in a sub-band acoustic echo canceller (SBAEC), or in acoustic echo cancellation (AEC) in general, the signal is attenuated and replaced with comfort noise when residual echo is detected. This is commonly referred to as non-linear processing (NLP). This embodiment of the invention is premised on the observation that PLC presents a very similar situation. As in AEC, the approach provides a much improved experience, far less annoying than audible dropouts.
b. Updating internal states of low-band and high-band ADPCM decoders
After the full-band speech signal synthesizer 350 has completed the waveform synthesis task performed in step 416, the sub-band ADPCM decoder state update module 360 updates the internal states of the low-band ADPCM decoder 320 and the high-band ADPCM decoder 330 appropriately in step 418, in preparation for a possible good frame in the next frame. There are many possible ways to update the internal states of the low-band ADPCM decoder 320 and the high-band ADPCM decoder 330. Since the G.722 encoder of FIG. 1 and the G.722 decoder of FIG. 2 maintain the same types of internal state, a straightforward way to update the internal states of decoders 320 and 330 is to pass the output signal of the full-band speech signal synthesizer 350 through the standard G.722 encoder of FIG. 1, starting from the internal state left after the last sample of the previous frame. Then, after the current bad frame of the extrapolated speech signal has been encoded, the internal state left after the last sample of the current bad frame is used to update the internal states of the low-band ADPCM decoder 320 and the high-band ADPCM decoder 330.
However, the aforementioned approach carries the full complexity of the two sub-band encoders. To save complexity, the decoder/PLC system 300 described in section D implements an approximation of this approach. For the high-band ADPCM encoder, it was recognized that the high-band adaptive quantization step size Δ_H(n) is not needed when processing the first received frame after a packet loss; instead, the quantization step size is reset to a running average from before the packet loss (as described elsewhere in this application). Accordingly, the unquantized difference signal (or prediction error signal) e_H(n) is used for the adaptive prediction update in the high-band ADPCM encoder, and the quantization of e_H(n) is avoided entirely.
For the low-band ADPCM encoder, the scheme is slightly different. Because of the importance of the pitch modulation of the low-band adaptive quantization step size Δ_L(n), the implementation of decoder/PLC system 300 described below in section D faithfully updates this parameter during lost frames. The standard G.722 low-band ADPCM encoder uses a 6-bit quantization of the difference signal (or prediction error signal) e_L(n). However, according to the G.722 standard, only a subset of 8 of the magnitude quantization indices is used to update the low-band adaptive quantization step size Δ_L(n). By using the unquantized difference signal e_L(n) for the adaptive prediction update in the low-band ADPCM encoder, while using the quantized difference signal only to maintain an identical update of the low-band adaptive quantization step size Δ_L(n), the embodiment described in section D can apply a less complex quantization of the difference signal.
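The following sketch illustrates the idea of driving backward adaptation with the unquantized difference signal. A generic sign-sign adapted all-zero predictor stands in for the actual G.722 pole/zero recursions, which are considerably more involved; only the structure (predict, form the unquantized error, adapt, skip the quantizer) reflects the text.

```python
import numpy as np

def update_decoder_state(x, coeffs, step=0.004):
    """Re-encode the extrapolated sub-band signal x to track decoder state.

    A schematic 6-tap all-zero predictor with sign-sign adaptation stands
    in for the G.722 pole/zero sections.  The prediction error e is used
    directly (unquantized), as described above, so no quantizer is needed.
    """
    hist = np.zeros(len(coeffs))      # past difference-signal samples
    for xn in x:
        pred = coeffs @ hist          # predicted sample
        e = xn - pred                 # unquantized difference signal
        coeffs += step * np.sign(e) * np.sign(hist)  # sign-sign adaptation
        hist = np.roll(hist, 1)
        hist[0] = e
    return coeffs                     # state carried into the next frame
```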
It should be readily appreciated by those skilled in the art that wherever this application refers to the high-band adaptive quantization step size Δ_H(n), that step size may equivalently be represented by the corresponding high-band log scale factor; similarly, wherever this application refers to the low-band adaptive quantization step size Δ_L(n), that step size may equivalently be represented by the corresponding low-band log scale factor.
Another difference between the low-band and high-band ADPCM encoders used in the embodiment of section D and the standard G.722 sub-band ADPCM encoders is an adaptive reset of the decoders based on signal properties and the duration of the packet loss. This function will now be described.
As described above, for long packet losses the full-band speech signal synthesizer 350 mutes the output speech waveform after a predetermined time. In the implementation of the decoder/PLC system 300 described in section D below, the output signal of the full-band speech signal synthesizer 350 is passed through a G.722 QMF analysis filter bank to obtain the sub-band signals used to update the internal states of the low-band ADPCM decoder 320 and the high-band ADPCM decoder 330 during lost frames. Consequently, once the output signal of the full-band speech signal synthesizer 350 has decayed to zero, the sub-band signals used to update the internal states of the sub-band ADPCM decoders also become zero. A constant-zero input, however, drives the adaptive predictor in each decoder away from the adaptive predictor in the encoder, because it unnaturally causes the predictor sections to keep adapting in the same direction. This is evident in the conventional high-band ADPCM decoder, which often produces high-frequency chirping when processing good frames after a long packet loss. For the conventional low-band ADPCM decoder, this problem occasionally causes an unnatural energy increase, because the predictor attains too high a filter gain.
Based on the foregoing, the implementation of decoder/PLC system 300 described in section D below resets the sub-band ADPCM decoders once the PLC output waveform has been attenuated to zero. This almost completely eliminates the high-frequency chirping after long erasures. The uncertainty of the synthesized waveform generated by the full-band speech signal synthesizer 350 increases with the duration of the packet loss, and at some point it no longer makes sense to keep updating the sub-band ADPCM decoders 320 and 330 from it.
However, even with the sub-band ADPCM decoders 320 and 330 being reset when the output of the full-band speech signal synthesizer 350 has been fully muted, some problems remain in the form of occasional chirping (from the high-band ADPCM decoder 330) and occasional unnatural energy increases (from the low-band ADPCM decoder 320). These problems are addressed in the implementation described in section D by making the reset of each sub-band ADPCM decoder adaptive. A reset still takes place when the waveform has been muted, but one or both of the sub-band ADPCM decoders 320 and 330 may also be reset earlier.
As will be described in section D, the decision to reset early is based on monitoring certain properties of the signals that control the adaptation of the pole sections of the adaptive predictors of the sub-band ADPCM decoders 320 and 330 during bad frames (i.e., while the sub-band ADPCM decoders 320 and 330 are being updated from the output signal of the full-band speech signal synthesizer). For the low-band ADPCM decoder 320, the partial reconstructed signal p_Lt(n) drives the adaptation of the all-pole filter section, while the partial reconstructed signal p_H(n) drives the adaptation of the all-pole filter section of the high-band ADPCM decoder 330. Essentially, each signal is monitored for being either largely constant, or predominantly positive or negative, over the 10 ms duration of the current lost frame. It should be noted that in the implementation described in section D, the adaptive reset is limited to the first 30 ms of a packet loss.
3. Processing type 5 and type 6 frames
In processing type 5 and type 6 frames, the input bitstream associated with the current frame is again available, and modules 310, 320, 330 and 340 therefore become active again. However, the decoding operations performed by the low-band ADPCM decoder 320 and the high-band ADPCM decoder 330 are constrained and controlled by the decoding constraint and control module 370 to reduce artifacts and distortions at the transition from lost frames to received frames, thereby improving the performance of the decoder/PLC system 300 after packet loss. This is shown in step 420 of flowchart 400 for type 5 frames and in step 426 for type 6 frames.
For type 5 frames, additional modifications are performed on the output speech signal to ensure a smooth transition between the synthesized signal produced by the full-band speech signal synthesizer 350 and the output signal produced by the QMF synthesis filter bank 340. Thus, the output signal of the QMF synthesis filter bank 340 is not used directly as the output speech signal of the decoder/PLC system 300. Instead, the full-band speech signal synthesizer 350 modifies the output of the QMF synthesis filter bank 340 and uses the modified version as the output speech signal of the decoder/PLC system 300. Accordingly, while processing type 5 or type 6 frames, switch 336 remains connected to the lower position labeled "type 2-6" to receive the output speech signal from the full-band speech signal synthesizer 350.
In this regard, if there is a misalignment between the synthesized signal produced by the full-band speech signal synthesizer 350 and the output signal produced by the QMF synthesis filter bank 340, the operations performed by the full-band speech signal synthesizer 350 include time warping and rephasing. These operations are shown at step 422 of flowchart 400 and are described in more detail below.
Likewise, for type 5 frames, the output speech signal generated by the full-band speech signal synthesizer 350 is overlap-added with the ringing signal from the preceding lost frame. This ensures a smooth transition from the synthesized waveform associated with the previous frame to the output waveform associated with the current type 5 frame. This is shown at step 424 of flowchart 400.
After the output speech signal has been generated for the type 5 or type 6 frame, the decoder/PLC system 300 updates the various state memories and performs some processing to facilitate PLC operations performed in a similar manner as step 414 for the subsequent lost frame, as shown at step 428.
a. Constraint and control of sub-band ADPCM decoding
As introduced above, the decoding constraint and control module 370 constrains and controls the decoding operations performed by the low-band ADPCM decoder 320 and the high-band ADPCM decoder 330 during the processing of type 5 and type 6 frames to improve the performance of the decoder/PLC system 300 after packet loss. The various constraints and controls employed by the decoding constraint and control module 370 will now be described. Further details are provided in section D below with reference to a specific implementation of the decoder/PLC system 300.
i. Adaptive quantization step size setting for the high-band ADPCM decoder
For type 5 frames, the decoding constraint and control module 370 sets the adaptive quantization step size Δ_H(n) of the high-band ADPCM decoder 330 to a running average of its values over the good frames received before the packet loss. This improves the performance of the decoder/PLC system 300 in background noise by reducing the energy drops, otherwise audible in segments of background noise, that packet loss would cause.
ii. Adaptive quantization step size setting for the low-band ADPCM decoder
For type 5 frames, the decoding constraint and control module 370 implements an adaptive strategy to set the adaptive quantization step size Δ_L(n) of the low-band ADPCM decoder 320. In an alternative embodiment, the method may also be applied to the high-band ADPCM decoder 330. As introduced in the previous section, for the high-band ADPCM decoder 330, setting the adaptive quantization step size Δ_H(n) to a running average of its values over the good frames before the packet loss benefits the performance of the decoder/PLC system 300 in background noise. However, the same method applied to the low-band ADPCM decoder 320 occasionally produces large unnatural energy increases in voiced speech. This is because Δ_L(n) is modulated by the pitch period in voiced speech, and setting Δ_L(n) to the running average from before the lost frames can therefore cause a very large abnormal increase in Δ_L(n) in the first good frame after the loss.
Thus, in cases where Δ_L(n) is modulated by the pitch period, it is preferable to use the Δ_L(n) obtained from the sub-band ADPCM decoder state update module 360 rather than the running average of Δ_L(n) before the packet loss. Recall that the sub-band ADPCM decoder state update module 360 updates the low-band ADPCM decoder 320 by passing the output signal of the full-band speech signal synthesizer 350 through a G.722 QMF analysis filter bank to obtain the low-band signal. If the full-band speech signal synthesizer 350 is doing its job well, as can be expected for voiced speech, the signal used to update the low-band ADPCM decoder 320 is likely to match the signal at the encoder very closely, and the parameter Δ_L(n) is then likely to be very close to the encoder step size. For stationary signals such as background noise, on the other hand, it is preferable to set Δ_L(n) to the running average of Δ_L(n) before the packet loss.
In view of the foregoing, the decoding constraint and control module 370 employs an adaptive strategy to set Δ_L(n) for the first good frame after a packet loss. If the speech signal before the packet loss was fairly stationary, e.g. stable background noise, Δ_L(n) is set to the running average of Δ_L(n) before the packet loss. If the speech signal before the packet loss exhibited pitch modulation of Δ_L(n), as would be expected for voiced speech, Δ_L(n) is set to the value obtained by the low-band ADPCM decoder update based on the output of the full-band speech signal synthesizer 350. For intermediate cases, Δ_L(n) is set to a linear weighting of these two values, based on the degree of variation of Δ_L(n) before the packet loss.
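This adaptive strategy reduces to a linear interpolation between the two candidate step sizes, as sketched below; the stability measure that selects the weighting is an assumption of this sketch.

```python
def set_low_band_step(avg_step_before_loss, reencoded_step, stability):
    """Choose Delta_L(n) for the first good frame after a loss.

    stability in [0, 1] measures how stationary Delta_L(n) was before the
    loss (1 = stable, background-noise-like; 0 = strongly pitch-modulated,
    i.e. voiced).  The two endpoints and the linear weighting follow the
    text; the stability measure itself is an illustrative assumption.
    """
    return (stability * avg_step_before_loss
            + (1.0 - stability) * reencoded_step)
```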
iii. Adaptive low-pass filtering of the adaptive quantization step size for the high-band ADPCM decoder
During the processing of the first few good frames (frames of type 5 and type 6) after a packet loss, a temporary loss of synchronization between the G.722 encoder and the G.722 decoder can cause the adaptive quantization step size Δ_H(n) of the high-band ADPCM decoder to produce overly strong local fluctuations of the high-frequency content. This creates a fluttering effect that borders on actual chirping. To reduce this risk, the decoding constraint and control module 370 applies an adaptive low-pass filter to the high-band quantization step size Δ_H(n) during the first few good frames. The period over which the low-pass filtering is applied is adaptive, and the degree of smoothing is reduced in a quadratic fashion over that period to provide a smooth transition. For segments where the speech signal was very stable before the packet loss, the duration is longer (80 ms in the implementation of the decoder/PLC system 300 described below in section D). For speech signals that were less stable before the packet loss, the duration is shorter (40 ms in the implementation of the decoder/PLC system 300 described below in section D), while for unstable segments no low-pass filtering is applied.
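A sketch of such an adaptive low-pass filter follows. The first-order filter form and the exact decay law are illustrative assumptions; only the quadratic reduction of smoothing and the 80/40/0 ms durations come from the text.

```python
def smooth_high_band_step(step_raw, step_lp, t_ms, duration_ms):
    """Low-pass filter Delta_H(n) for duration_ms after a packet loss.

    step_raw is the step size from normal decoding, step_lp the previous
    filtered value.  duration_ms is 80 for very stable pre-loss signals,
    40 for less stable ones, 0 (no filtering) for unstable segments.  The
    smoothing coefficient decays quadratically to zero, so the filter
    fades back to plain decoding by the end of the period.
    """
    if duration_ms <= 0 or t_ms >= duration_ms:
        return step_raw
    a = (1.0 - t_ms / duration_ms) ** 2   # quadratically decaying smoothing
    return a * step_lp + (1.0 - a) * step_raw
```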
iv. Adaptive safety margin on the all-pole filter section during the first few good frames
Because of the inevitable divergence between the G.722 decoder and encoder during and after a packet loss, the decoding constraint and control module 370 enforces certain constraints on the adaptive predictor of the low-band ADPCM decoder 320 during the first few good frames (type 5 and type 6 frames) after a packet loss. The standard G.722 encoder and decoder enforce a minimum "safety" margin of 1/16 on the pole section of each sub-band predictor. However, it has been found that the all-pole section of the two-pole, six-zero predictive filter of the low-band ADPCM decoder often causes abnormal energy increases after a packet loss, typically perceived as a waveform pop. Evidently, the packet loss can result in a reduced safety margin, corresponding to an all-pole filter section of higher gain that produces a waveform of very high energy.
The decoding constraint and control module 370 greatly reduces these abnormal energy increases after packet loss by adaptively enforcing a stricter constraint on the all-pole filter section of the adaptive predictor of the low-band ADPCM decoder 320. An increased minimum safety margin is enforced during the first few good frames after the packet loss and is gradually reduced to the standard G.722 minimum safety margin. Furthermore, a running average of the safety margin before the packet loss is monitored, and the increased minimum safety margin applied during the first few good frames is controlled so as not to exceed this running average.
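A sketch of the margin schedule follows. The standard 1/16 floor and the cap at the pre-loss running average come from the text; the initial margin and the ramp length are illustrative assumptions.

```python
def min_safety_margin(good_frame_idx, avg_margin_before_loss,
                      start=3.0 / 16.0, ramp_frames=8, floor=1.0 / 16.0):
    """Minimum safety margin enforced on the low-band all-pole section.

    Elevated right after a loss, ramping linearly back down to the
    standard G.722 minimum of 1/16, and capped by the running average of
    the margin before the loss so it never exceeds it.
    """
    t = min(good_frame_idx / ramp_frames, 1.0)  # 0 right after the loss
    margin = start + t * (floor - start)        # linear ramp down
    return max(min(margin, avg_margin_before_loss), floor)
```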
v. DC removal on internal signals of the high-band ADPCM decoder
During the first few good frames (type 5 and type 6 frames) after a packet loss, the G.722 decoder is observed to often produce very annoying, clearly audible high-frequency chirping distortion. The distortion arises because the high-band ADPCM decoder loses synchronization with the high-band ADPCM encoder due to the packet loss and consequently produces biased predictions. The loss of synchronization that leads to the chirping manifests itself in the signals controlling the adaptation of the pole section of the predictor, namely the partial reconstructed signal p_H(n) and the reconstructed high-band signal r_H(n), having a constant sign over extended periods. This causes the pole section of the predictor to drift, since the adaptation is sign-based and therefore keeps updating in the same direction.
To prevent this, the decoding constraint and control module 370 adds a DC removal operation to these signals, replacing p_H(n) and r_H(n) with high-pass filtered versions p_H,HP(n) and r_H,HP(n) during the first few good frames after a packet loss. This was found to effectively eliminate the chirping. The DC removal is implemented by subtracting from p_H(n) and r_H(n) their respective running means, which are updated continuously during both good and bad frames. In the implementation of the decoder/PLC system 300 described in section D below, this replacement is applied for the first 40 ms after a packet loss.
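Per sample, the DC removal amounts to subtracting a continuously updated running mean, as in this sketch (the leak factor is an illustrative assumption):

```python
def dc_removed(x, mean, beta=1.0 / 256.0):
    """One step of the DC removal applied to p_H(n) and r_H(n).

    `mean` is the running average of the signal, updated every sample in
    both good and bad frames; the high-pass output is the input minus
    that average.  Returns the high-pass sample and the new mean state.
    """
    mean = (1.0 - beta) * mean + beta * x   # update running average
    return x - mean, mean
```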
b. Rephasing and time warping
As introduced above, in step 422 of flowchart 400, if there is a misalignment between the synthesized speech signal produced by the full-band speech signal synthesizer 350 during the packet loss and the speech signal produced by the QMF synthesis filter bank 340 during the first received frames after the packet loss, the full-band speech signal synthesizer 350 performs techniques referred to as "rephasing" and "time warping."
As described above, when processing lost frames, if the decoded speech signal associated with the received frames before the packet loss is nearly periodic, as for a vowel, the full-band speech signal synthesizer 350 extrapolates the speech waveform based on the pitch period. As also described above, this waveform extrapolation is continued beyond the end of the lost frames to obtain extra samples for overlap-adding with the decoded speech signal of the next frame, thereby ensuring a smooth transition and avoiding any discontinuity. However, the actual pitch period of the decoded speech signal generally does not follow the pitch track used for the waveform extrapolation during the lost frames. As a result, the extrapolated speech signal is generally not perfectly aligned with the decoded speech signal associated with the first good frame.
This is illustrated in FIG. 6, a timeline 600 showing the amplitude of the decoded speech signal 602 before the packet loss and during the first received frame after the packet loss (for convenience, the decoded speech signal is also shown during the lost frames, although it should be appreciated that the decoder/PLC system 300 cannot decode this portion of the original signal), together with the amplitude of the extrapolated speech signal 604 generated during the lost frames and during the first received frame after the packet loss. As shown in FIG. 6, the two signals are out of phase in the first received frame.
This out-of-phase condition causes two problems in the decoder/PLC system 300. First, as can be seen in FIG. 6, the decoded speech signal 602 and the extrapolated speech signal 604 are out of phase in the overlap-add region of the first received frame after the packet loss and will partially cancel each other, causing an audible artifact. Second, the state memories of the sub-band ADPCM decoders 320 and 330 exhibit some degree of pitch modulation and are therefore sensitive to the phase of the speech signal. This is especially pronounced near a pitch pulse, where the signal level rises and falls sharply within a pitch period. Because the sub-band ADPCM decoders 320 and 330 are sensitive to the phase of the speech signal, and because the extrapolated speech signal 604 is used to update the state memories of these decoders during the packet loss (as described above), the phase difference between the extrapolated speech signal 604 and the decoded speech signal 602 produces a mismatch between the internal states of the sub-band ADPCM encoders and decoders, resulting in significant artifacts in the received frames after the packet loss.
As will be described in more detail below, time warping is used to solve the first problem, the destructive interference in the overlap-add region. Specifically, time warping is used to stretch or shrink the time axis of the decoded speech signal associated with the first received frame after a packet loss so as to align it with the extrapolated speech signal used to conceal the preceding lost frames. Although time warping is described herein with reference to a sub-band predictive encoder with memory, the general technique is also applicable to other encoders, including but not limited to encoders with and without memory, predictive and non-predictive encoders, and sub-band and full-band encoders.
As will also be described in more detail herein, rephasing is used to address the second problem, the mismatch between the internal states of the sub-band ADPCM encoders and decoders caused by the misalignment between the extrapolated waveform and the first frame received after the packet loss. Rephasing is the process of setting the internal states of the sub-band ADPCM decoders 320 and 330 to the point in time where the extrapolated speech waveform is in phase with the last input signal sample before the first received frame after the packet loss. Although rephasing is described in the context of a backward-adaptive system, it can also be used to perform PLC in a forward-adaptive predictive encoder or in any encoder with memory.
i. Time lag calculation
Both the rephasing and time warping techniques require calculating the number of samples by which the extrapolated speech signal and the decoded speech signal associated with the first received frame after a packet loss are misaligned. This misalignment is referred to as the "lag," which, as labeled in FIG. 6, can be thought of as the number of samples by which the decoded speech signal lags the extrapolated speech signal. In the case of FIG. 6, the lag is negative.
One general method of performing the lag calculation is shown in flowchart 700 of FIG. 7, although other methods may be used. One particular way of performing this method is described in section D below.
As shown in fig. 7, the method of flowchart 700 begins at step 702 where the speech waveform generated by full-band speech signal synthesizer 350 during the previous lost frame is extrapolated into the first received frame after the packet loss.
At step 704, the time lag is calculated. At a conceptual level, the time lag is calculated by maximizing the correlation between the extrapolated speech signal and the decoded speech signal associated with the first received frame after a packet loss. As shown in fig. 9, the extrapolated speech signal (denoted 904) is shifted over a range of -MAXOS to +MAXOS relative to the decoded speech signal associated with the first received frame (denoted 902), where MAXOS denotes the maximum offset, and the shift that maximizes the correlation is used as the time lag. This can be done by searching for the peak of the normalized cross-correlation function R(k) between the signals over a time lag range of ±MAXOS around zero:

R(k) = [Σ_{i=0}^{LSW-1} es(i-k)·x(i)] / sqrt([Σ_{i=0}^{LSW-1} es²(i-k)]·[Σ_{i=0}^{LSW-1} x²(i)]), k = -MAXOS, ..., MAXOS, (1)

where es is the extrapolated speech signal, x is the decoded speech signal associated with the first received frame after the packet loss, MAXOS is the maximum offset allowed, LSW is the lag search window length, and i = 0 represents the first sample in the lag search window. The time lag that maximizes this function corresponds to the relative time offset between the two waveforms.
In one embodiment, the number of samples over which the correlation is computed (also referred to as the lag search window) is determined adaptively based on the pitch period. For example, in the embodiment described in section D below, the window size in number of samples (at the 16 kHz sampling rate) for the coarse lag search is

LSW = ⌊1.5·ppfe⌋, (2)

where ppfe is the pitch period and ⌊x⌋ denotes the floor function, which returns the largest integer less than or equal to x.
If the time lag calculated in step 704 is zero, the extrapolated speech signal and the decoded speech signal associated with the first received frame are in phase; a positive value indicates that the decoded speech signal associated with the first received frame lags (is delayed relative to) the extrapolated speech signal, and a negative value indicates that it leads the extrapolated speech signal. If the time lag equals zero, rephasing and time warping need not be performed. In the example implementation presented in section D below, the time lag is also set to zero if the last received frame before the packet loss is unvoiced (as indicated by the degree of voicing calculated for that frame, described above with respect to type 2, type 3, and type 4 processing), or if the first received frame after the packet loss is unvoiced.
To reduce the complexity of the correlation computation, the lag search may be performed in multiple stages. This approach is illustrated by flowchart 800 of FIG. 8: at step 802 a coarse lag search is first performed using down-sampled representations of the signals, and at step 804 a refined lag search is then performed using representations of the signals at a higher sampling rate. For example, the coarse lag search may be performed after down-sampling the signals to 4 kHz, and the refined lag search performed with the signals at 8 kHz. To further reduce complexity, the down-sampling may be performed by simply sub-sampling the signals, ignoring any aliasing effects.
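As a rough illustration of the search that steps 802 and 804 both perform, the following C sketch slides the extrapolated signal against the decoded signal and keeps the shift that maximizes the normalized cross-correlation of equation (1) above. Since the energy of x does not depend on the shift, only the es energy is used for normalization, and ratios are compared by cross-multiplication so that no division or square root is needed. The function name and calling convention are assumptions made for this example; es must point into a buffer with max_os valid samples before and after the search window.

static int find_lag(const float *es, const float *x, int lsw, int max_os)
{
    float best_c2 = -1.0f;  /* best squared correlation so far */
    float best_e  = 1.0f;   /* energy paired with best_c2      */
    int best_k = 0;

    for (int k = -max_os; k <= max_os; k++) {
        float c = 0.0f, e = 0.0f;
        for (int i = 0; i < lsw; i++) {
            c += x[i] * es[i - k];        /* correlation term      */
            e += es[i - k] * es[i - k];   /* energy of shifted es  */
        }
        /* c*c/e > best_c2/best_e  <=>  c*c*best_e > best_c2*e */
        if (c * c * best_e > best_c2 * e) {
            best_c2 = c * c;
            best_e  = (e > 0.0f) ? e : 1.0f;
            best_k  = k;
        }
    }
    return best_k;  /* positive: decoded signal lags the extrapolation */
}

In the two-stage scheme, such a routine would first be run on the 4 kHz sub-sampled signals over the full ±MAXOS range, and then on the 8 kHz signals over a small range around the scaled coarse estimate.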
One question is which signal to correlate with the extrapolated speech signal in the first received frame. A "brute force" approach is to fully decode the first received frame to obtain the decoded speech signal and then compute the correlation at 16 kHz. To decode the first received frame, the internal states of the sub-band ADPCM decoders 320 and 330 obtained by re-encoding the extrapolated speech signal up to the frame boundary (as described above) may be used. However, the G.722 decoding would then have to be re-run, because the rephasing algorithm described below provides a better set of states for the sub-band ADPCM decoders 320 and 330. This approach is very wasteful in terms of computational complexity, since it performs two complete decoding operations. To address this, embodiments of the present invention implement a lower-complexity approach.
In the lower-complexity approach, the G.722 bit stream received in the first received frame is only partially decoded to obtain the low-band quantized difference signal d_Lt(n). In normal G.722 decoding, the bits received from the bit-stream de-multiplexer 310 are converted by the sub-band ADPCM decoders 320 and 330 into the difference signals d_Lt(n) and d_H(n), scaled by backward-adaptive scale factors, and passed through backward-adaptive pole-zero predictors to obtain the sub-band speech signals, which the QMF synthesis filter bank 340 then combines to generate the output speech signal. At each sample in this process, the coefficients of the adaptive predictors in the sub-band ADPCM decoders 320 and 330 are updated. These updates account for a significant portion of the decoder complexity. Since the signal is needed only for the lag computation, the two-pole, six-zero predictor coefficients are kept fixed (they are not updated sample by sample) in the lower-complexity approach. In addition, since the lag is determined by the pitch, and the fundamental pitch frequency of human speech is below 4 kHz, only a low-band approximation signal r_L(n) needs to be obtained. More details of this method are provided in section D below.
In the embodiment described in section D below, the fixed filter coefficients of the two-pole, six-zero prediction filter are those obtained from re-encoding the extrapolated waveform up to the end of the last lost frame. In an alternative implementation, the fixed filter coefficients may be those in use at the end of the last received frame before the packet loss. In yet another alternative, one or the other of these coefficient sets may be selected adaptively, depending on the characteristics of the speech signal or other criteria.
ii. Rephasing
During rephasing, the internal states of the sub-band ADPCM decoders 320 and 330 are adjusted to account for the time lag between the extrapolated speech waveform and the decoded speech waveform associated with the first received frame after a packet loss. As previously described, before the first received frame is processed, the internal states of the sub-band ADPCM decoders 320 and 330 are estimated by re-encoding the output speech signal synthesized by the full-band speech signal synthesizer 350 during the preceding lost frames. The internal states of these decoders exhibit some pitch modulation. Therefore, if the pitch period used during the waveform extrapolation for the preceding lost frames happened to follow the pitch track of the decoded speech signal exactly, the re-encoding process could simply be stopped at the boundary between the last lost frame and the first received frame, and the states of the sub-band ADPCM decoders 320 and 330 would be in phase with the original signal. However, as described above, the pitch used for extrapolation generally does not match the pitch track of the decoded speech signal, and the extrapolated speech signal and the decoded speech signal are not aligned at the beginning of the first received frame after the packet loss.
To overcome this problem, rephasing uses the time lag to control where the re-encoding process stops. In the example of FIG. 6, the time lag between the extrapolated speech signal 604 and the decoded speech signal 602 is negative; denote it by lag. It can then be seen that if the extrapolated speech signal is re-encoded for -lag samples beyond the frame boundary, the re-encoding stops at a phase within the extrapolated speech signal 604 that matches the phase of the decoded speech signal 602 at the frame boundary. The resulting state memories of the sub-band ADPCM decoders 320 and 330 are then in phase with the data received in the first good frame, yielding a better decoded signal. In general, the number of samples of the reconstructed sub-band signals to re-encode is:
N = FS - lag (3)
where FS is the frame size and all quantities are expressed at the sub-band sampling rate (8 kHz).
Figs. 10A, 10B, and 10C illustrate three rephasing scenarios. On timeline 1000 of FIG. 10A, the decoded speech signal 1002 leads the extrapolated speech signal 1004, so re-encoding proceeds -lag samples beyond the frame boundary. On timeline 1010 of FIG. 10B, the decoded speech signal 1012 lags the extrapolated speech signal 1014, so re-encoding stops lag samples before the frame boundary. On timeline 1020 of FIG. 10C, the extrapolated speech signal 1024 and the decoded speech signal 1022 are in phase at the frame boundary (even though the pitch tracks differ during the lost frames), and re-encoding stops exactly at the frame boundary. Note that in FIGS. 10A, 10B, and 10C the decoded speech signal is shown during the lost frames for convenience, but it should be clear that the decoder/PLC system 300 cannot decode that portion of the original signal.
If the internal states of the sub-band ADPCM decoders 320 and 330 did not need to be rephased, the re-encoding that updates these internal states could be carried out entirely during the processing of the lost frames. However, since the lag is not known until the first received frame after the packet loss, the re-encoding cannot be completed within the lost frames. A simple way to solve this problem is to store the entire extrapolated waveform used to replace the preceding lost frame and then perform the re-encoding during the first received frame. However, this requires memory for FS + MAXOS samples, and the full re-encoding complexity then also falls within the first received frame.
Fig. 11 is a flowchart 1100 of a method of performing the re-encoding in a manner that shifts most of the computation to the preceding lost frame. Since MAXOS < FS, this is sensible from a computational load-balancing perspective.
As shown in FIG. 11, the method of flowchart 1100 begins at step 1102 by re-encoding within the lost frame up to the frame boundary and storing the internal states of the sub-band ADPCM decoders 320 and 330 at the frame boundary. In addition, the intermediate internal states after re-encoding FS - MAXOS samples are also stored, as shown in step 1104. At step 1106, the waveform-extrapolated samples FS - MAXOS + 1 through FS + MAXOS generated for re-encoding are saved in memory. At step 1108, in the first received frame after the packet loss, the approximate sub-band decoding (used to determine the lag, as described above) is performed using the internal states stored at the frame boundary as the initial states. At decision step 1110, it is determined whether the lag is positive or negative. If the lag is positive, the internal states at sample FS - MAXOS are restored and MAXOS - lag samples are re-encoded, as shown in step 1112. Conversely, if the lag is negative, the internal states at the frame boundary are used and an additional |lag| samples are re-encoded. With this method, at most MAXOS samples are re-encoded in the first received frame.
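The following C sketch illustrates the bookkeeping of flowchart 1100 under stated assumptions: the type g722_state and the routine reencode() are stand-ins for the real decoder state and the simplified sub-band re-encoder, and the FS and MAXOS values are examples only, not normative.

#include <string.h>

enum { FS = 80, MAXOS = 14 };          /* example values at the 8 kHz rate */

typedef struct { unsigned char opaque[512]; } g722_state;
void reencode(g722_state *st, const short *sig, int n);  /* assumed stand-in */

static g722_state state_at_boundary;   /* state after FS samples           */
static g722_state state_at_fs_maxos;   /* state after FS - MAXOS samples   */
static short xtra[2 * MAXOS];          /* samples FS-MAXOS+1 .. FS+MAXOS   */

/* During the lost frame (steps 1102-1106): re-encode up to the frame
 * boundary, saving the intermediate and boundary states, and keep the
 * extrapolated samples that may still be needed.  'extrap' is assumed
 * to hold FS + MAXOS extrapolated samples. */
void lost_frame_reencode(g722_state *st, const short *extrap)
{
    reencode(st, extrap, FS - MAXOS);
    state_at_fs_maxos = *st;                           /* step 1104 */
    reencode(st, extrap + (FS - MAXOS), MAXOS);
    state_at_boundary = *st;                           /* step 1102 */
    memcpy(xtra, extrap + (FS - MAXOS), sizeof(xtra)); /* step 1106 */
}

/* In the first received frame, once the lag is known (steps 1110-1114):
 * at most MAXOS samples are re-encoded. */
void rephase(g722_state *st, int lag)
{
    if (lag > 0) {                 /* stop lag samples before the boundary */
        *st = state_at_fs_maxos;
        reencode(st, xtra, MAXOS - lag);
    } else {                       /* continue -lag samples past boundary  */
        *st = state_at_boundary;
        reencode(st, xtra + MAXOS, -lag);
    }
}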
Those skilled in the art will appreciate that the amount of re-encoding in the first good frame can be reduced by storing more G.722 states along the way during the re-encoding process in the lost frame. In the extreme case, the G.722 state for every sample between FRAMESIZE - MAXOS and FRAMESIZE + MAXOS could be stored, and no re-encoding would be needed in the first received frame.
In an alternative approach that requires more re-encoding in the first received frame than the method of flowchart 1100, re-encoding is performed for FS - MAXOS samples during the lost frame. The internal states of the sub-band ADPCM decoders 320 and 330 and the remaining 2·MAXOS samples are stored in memory for use in the first received frame. In the first received frame, the lag is calculated and re-encoding proceeds from the stored G.722 state for the appropriate number of samples based on the lag. This approach requires storing 2·MAXOS reconstructed samples and one copy of the G.722 state, and re-encoding of at most 2·MAXOS samples in the first received frame. Its disadvantage is that the internal states of the sub-band ADPCM decoders 320 and 330 at the frame boundary are not available for the lower-complexity decoding and lag computation described above.
Ideally, the lag should measure the phase offset exactly at the frame boundary between the extrapolated speech signal and the decoded speech signal associated with the first received frame. According to one embodiment of the invention, a coarse lag estimate is first computed over a relatively long lag search window whose center does not coincide with the frame boundary. For example, the lag search window may be 1.5 times the pitch period, and the lag search range (i.e., the number of samples by which the extrapolated speech signal is offset relative to the decoded speech signal) relatively wide (e.g., ±28 samples). To improve the alignment, a lag refinement search is then performed. As part of the lag refinement search, the search window is moved to start at the first sample of the first received frame. This can be done by offsetting the extrapolated speech signal by the coarse lag estimate. The lag search window in the refinement search may be relatively small, and the lag search range likewise small (e.g., ±4 samples). The search method may be the same as that described above in section 3.b.i.
The concept of rephasing has been presented here in the context of the G.722 backward-adaptive predictive codec described above. The concept extends readily to other backward-adaptive predictive codecs, such as G.726. Nor is the use of rephasing limited to backward-adaptive predictive codecs: most coders with memory exhibit some phase dependency in their state memory and therefore benefit from rephasing.
iii. Time warping
As used herein, the term time warping refers to the process of stretching or shrinking a signal along the time axis. As discussed elsewhere herein, to maintain a continuous signal, embodiments of the present invention combine the extrapolated speech signal used to replace the lost frames with the decoded speech signal associated with the first received frame after the packet loss, so as to avoid a discontinuity. This is done by performing an overlap-add between the two signals. However, if the signals are out of phase with each other, waveform cancellation may occur and produce audible artifacts; the overlap region in FIG. 6 provides an example. Performing the overlap-add in that region would cause significant waveform cancellation between the negative portion of the decoded speech signal 602 and the extrapolated speech signal 604.
According to an embodiment of the present invention, the decoded speech signal associated with the first received frame after the packet loss is time-warped so that it is phase-aligned with the extrapolated speech signal at some point in time in the first received frame. The amount of time warping is controlled by the value of the time lag. Thus, in one embodiment, if the time lag is positive, the decoded speech signal associated with the first received frame is stretched, and the overlap-add region may be placed at the beginning of the first received frame. If the time lag is negative, the decoded speech signal is shrunk, and the overlap-add region is placed |lag| samples into the first received frame.
In the case of G.722, some samples at the beginning of the first received frame after the packet loss are unreliable, because the internal states of the sub-band ADPCM decoders 320 and 330 at the start of the frame are inexact. Therefore, in embodiments of the present invention, the first MIN_UNSTBL samples of the first received frame may be excluded from the overlap-add region, depending on the time warping applied to the decoded speech signal associated with the first received frame. For example, in the embodiment described in section D below, MIN_UNSTBL is set to 16, i.e., the first 1 ms of a 10 ms, 160-sample frame. In this region, the extrapolated speech signal may be used as the output speech signal of the decoder/PLC system 300. Such an embodiment effectively accounts for the re-convergence time of the decoded speech signal in the first received frame.
Figs. 12A, 12B, and 12C illustrate several examples of this concept. In the example of FIG. 12A, timeline 1200 shows the decoded speech signal leading the extrapolated signal in the first received frame. The decoded speech signal is therefore time-warp shrunk by -lag samples (the lag is negative). The result of applying the time warping is shown in timeline 1210: the signals are in phase at or near the center of the overlap-add region, which in this case is located at MIN_UNSTBL - lag + OLA/2, where OLA is the number of samples in the overlap-add region. In the example of FIG. 12B, timeline 1220 shows the decoded speech signal lagging the extrapolated signal in the first received frame. The decoded speech signal is therefore time-warp stretched by lag samples to achieve alignment; the result is shown in timeline 1230. In this case MIN_UNSTBL > lag, and an unstable region remains in the first received frame. In the example of FIG. 12C, timeline 1240 again shows the decoded signal lagging the extrapolated signal, so the decoded speech signal is time-warp stretched to give the result in timeline 1250. However, as shown in timeline 1250, because MIN_UNSTBL ≤ lag, the overlap-add region can begin at the first sample of the first received frame.
The "in-phase point" between the decoded speech signal and the extrapolated signal needs to be in the middle of the overlap region, which is located as close as possible to the beginning of the first received frame. This reduces the time that the synthesized speech signal associated with the previous lost frame must be extrapolated to the first received frame. In one embodiment of the invention, this is done by performing a two-phase skew estimation. In the first stage, the coarse lag estimate is calculated over a relatively long lag search window, the center of which may not coincide with the center of the overlap region. For example, the lag search window may be 1.5 times the pitch period. The lag search range (i.e., the number of samples that offset the extrapolated speech signal relative to the original speech signal) is also relatively wide (i.e., + -28 samples). To improve the alignment, a lag refinement search is then performed. As part of the lag refinement search, the lag search window is concentric with the predicted overlap-add setting derived from the coarse lag estimate. This can be done by biasing the extrapolated speech signal by a coarse lag estimate. The size of the lag search window in the lag refinement search may be small (e.g., the size of the overlap region), and the lag search range may also be small (i.e., ± 4 samples). The search method may be the same as that described above in section 3. b.i.
There are many ways to perform time warping; one technique uses piecewise single-sample shifts with overlap-add. A method of shrinking a signal using this technique is depicted in flowchart 1300 of FIG. 13. According to the method, a sample is dropped periodically, as shown in step 1302. From each point of sample drop, the original signal and the signal shifted left by one sample (due to the drop) are overlap-added, as shown in step 1304. A method of stretching a signal using this technique is depicted in flowchart 1400 of FIG. 14. According to the method, a sample is repeated periodically, as shown in step 1402. From each point of sample repetition, the original signal and the signal shifted right by one sample (due to the repetition) are overlap-added, as shown in step 1404. The length of the overlap-add window for these operations depends on how frequently samples are added or dropped. To avoid smoothing the signal excessively, a maximum overlap-add period (e.g., 8 samples) may be imposed. The period at which samples are added or dropped depends on factors such as the frame size, the total number of samples to add or drop, and whether stretching or shrinking is being performed. A sketch of the shrinking operation appears below.
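The following minimal C sketch shrinks a signal by periodic single-sample drops with a short cross-fade (overlap-add) at each drop point. Each cycle consumes period + ola input samples and produces one sample fewer; stretching is the mirror image (repeat a sample and cross-fade with the signal shifted right by one). Parameter names and the linear cross-fade shape are illustrative assumptions.

static int warp_shrink(const float *in, int n_in, float *out,
                       int period, int ola)
{
    int i = 0, o = 0;
    while (i + period + ola <= n_in) {
        for (int k = 0; k < period - 1; k++)     /* copy period-1 samples */
            out[o++] = in[i + k];
        for (int k = 0; k < ola; k++) {          /* drop one sample:      */
            float w = (float)(k + 1) / (float)(ola + 1);
            out[o++] = (1.0f - w) * in[i + period - 1 + k]  /* original   */
                     + w * in[i + period + k];              /* left-shift */
        }
        i += period + ola;
    }
    while (i < n_in)                             /* copy the tail as-is   */
        out[o++] = in[i++];
    return o;                                    /* output sample count   */
}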
The amount of time warping may be limited. For example, in the G.722 system described in section D below, the amount of time warping is limited to ±1.75 ms for a 10 ms frame (i.e., 28 samples of a 160-sample 10 ms frame). Warping beyond this range might still remove the destructive interference described above, but it typically introduces other audible distortion. Therefore, in such an embodiment, time warping is not performed if the time lag falls outside this range.
The system described in section D below is designed to ensure that no sample delay is introduced after the first received frame following a packet loss. For this reason, the system does not apply time warping to the decoded speech signal beyond the first received frame, which in turn imposes the limit, described in the preceding paragraph, on the amount of time warping that can be applied without audible distortion. However, those skilled in the art will appreciate that in a system that can tolerate some sample delay (after the first received frame following a packet loss), time warping may be applied to the decoded speech signal beyond the first good frame, allowing larger time lags to be accommodated without audible distortion. Of course, in such a system, if a frame following the first received frame is lost, time warping can be applied only to the decoded speech signal associated with the first good frame. Such alternative embodiments are also within the scope and spirit of the present invention.
In an alternative embodiment of the invention, the time-warping may be performed for both the decoded speech signal and the extrapolated speech signal. This approach may provide better performance for a number of reasons.
For example, if the time lag is -20, then according to the method described above the decoded speech signal is shrunk by 20 samples, which means 20 samples of the extrapolated speech signal must be generated for use in the first received frame. This number can be reduced by also shrinking the extrapolated speech signal. For example, the extrapolated speech signal may be shrunk by 4 samples, leaving 16 samples of shrinking for the decoded speech signal. This reduces the number of extrapolated samples that must be used within the first received frame and also reduces the amount of warping applied to the decoded speech signal. As noted above, in the embodiment of section D the time warping is limited to 28 samples; reducing the warping needed to align the signals therefore introduces less distortion and increases the number of cases in which time warping can be applied.
Time-warping both the decoded speech signal and the extrapolated speech signal should also yield a better waveform match in the overlap-add region, for the following reason. If the lag is -20 samples, as in the previous example, the decoded speech signal leads the extrapolated signal by 20 samples. The most likely cause is that the pitch period used for extrapolation was larger than the actual pitch period. By shrinking the extrapolated speech signal as well, the effective pitch period of that signal in the overlap-add region becomes smaller and closer to the actual pitch period. Likewise, because the decoded speech signal is shrunk by fewer samples, its effective pitch period is larger than in the case where it alone is shrunk. The two waveforms in the overlap-add region therefore have better-matched pitch periods, and the waveforms themselves match more closely.
If the lag is positive, the decoded speech signal is stretched. In this case it is less clear that stretching the extrapolated signal brings an improvement, since it increases the number of extrapolated samples used in the first received frame. However, this approach can still improve performance after a long packet loss when the two waveforms are far out of phase. For example, if the lag is 30 samples, the foregoing method performs no warping at all because of the 28-sample limit, while a warp of 30 samples applied to one signal would likely introduce distortion of its own. If the 30 samples are instead distributed between the two signals, for example stretching the extrapolated speech signal by 10 samples and the decoded speech signal by 20 samples, the signals can be aligned without excessive time warping of either one.
D. Details of example implementations in G.722 decoders
This section provides specific details of a particular implementation of the invention in an ITU-T Recommendation G.722 speech decoder. This example implementation operates on an intrinsic 10 millisecond (ms) frame size and can be applied to any packet or frame size that is a multiple of 10 ms. A longer input frame is treated as a super frame, for which the PLC logic is called an appropriate number of times at its intrinsic 10 ms frame size. The implementation introduces no additional delay compared with conventional G.722 decoding using the same frame size. These implementation details and the following description are provided by way of example only and are not intended to limit the present invention.
The embodiment described in this section meets the same complexity requirements as the PLC algorithm described in G.722 Appendix IV, but provides significantly better speech quality. Owing to its high quality, it is suitable for general applications of G.722 that may encounter frame erasures or packet loss, including, for example, Voice over Internet Protocol (VoIP), Wireless Fidelity (WiFi), and Digital Enhanced Cordless Telecommunications (DECT). The embodiment is readily applicable except in applications that leave virtually no complexity headroom beyond a basic G.722 decoder without PLC.
1. Abbreviations and conventions
Some abbreviations used in this section are listed in Table 1.

Abbreviation Description
ADPCM Adaptive differential PCM
ANSI American National Standards Institute
dB Decibel
DECT Digital Enhanced Cordless Telecommunications
DC Direct current
FIR Finite impulse response
Hz Hertz
LPC Linear predictive coding
OLA Overlap-add
PCM Pulse code modulation
PLC Packet loss concealment
PWE Periodic waveform extrapolation
STL2005 Software Tool Library 2005
QMF Quadrature mirror filter
VoIP Voice over Internet Protocol
WB Wideband
WiFi Wireless Fidelity

Table 1: Abbreviations
Certain conventions are also used in this description, some of which are explained here. The PLC algorithm operates on an intrinsic frame size of 10 ms, so the algorithm is described for 10 ms frames only; larger packets (multiples of 10 ms) are decoded in 10 ms segments. The discrete time index of a signal at the 16 kHz sampling rate is generally denoted by "j" or "i", and the discrete time index of a signal at the 8 kHz sampling rate by "n". Low-band signals (0-4 kHz) are identified with a subscript "L" and high-band signals (4-8 kHz) with a subscript "H". Where possible, this description reuses the conventions of the ITU-T G.722 specification.
The most commonly used symbols and their descriptions are listed in table 2 below.
Table 2: common symbols and their description
2. General description of PLC algorithms
As described above with reference to fig. 5, the decoder/PLC system 300 processes six types of frames: type 1, type 2, type 3, type 4, type 5, and type 6. A type 1 frame is any received frame beyond the eighth received frame after a packet loss. A type 2 frame is either of the first two lost frames associated with a packet loss. A type 3 frame is any of the third through sixth lost frames associated with a packet loss. A type 4 frame is any lost frame beyond the sixth lost frame associated with a packet loss. A type 5 frame is the received frame immediately following a packet loss. Finally, a type 6 frame is any of the second through eighth received frames following a packet loss. The PLC algorithm described in this section operates on a fixed frame size of 10 ms duration.
Type 1 frames are decoded according to standard G.722 operation, with some added state memory maintenance and processing to facilitate the PLC and related processing. FIG. 15 is a block diagram 1500 of the logic that performs these operations according to an embodiment of the present invention. Specifically, as shown in fig. 15, when a type 1 frame is processed, the low-band ADPCM index I_L(n) is received from a bit de-multiplexer (not shown in fig. 15) and decoded by the low-band ADPCM decoder 1510 to produce a sub-band speech signal. Similarly, the high-band ADPCM index I_H(n) is received from the bit de-multiplexer and decoded by the high-band ADPCM decoder 1520 to produce a sub-band speech signal. The QMF synthesis filter bank 1530 combines the low-band and high-band speech signals to generate the decoded output signal x_out(j). These operations are consistent with standard G.722 decoding.
In addition to these standard G.722 decoding operations, when type 1 frames are processed, logic 1540 updates the PLC-related low-band ADPCM state memory, logic 1550 updates the PLC-related high-band ADPCM state memory, and logic 1560 updates the WB PCM PLC-related state memory. These state memory updates facilitate the PLC processing associated with the other frame types.
For type 2, type 3, and type 4 frames, wideband (WB) PCM PLC is performed in the 16 kHz output speech domain. A block diagram 1600 of the logic that performs the WB PCM PLC is provided in fig. 16. The previous G.722 decoder output speech x_out(j) is buffered and passed to the WB PCM PLC logic. The WB PCM PLC algorithm is based on periodic waveform extrapolation (PWE), and pitch estimation is an important component of the WB PCM PLC logic. A coarse pitch is first estimated based on a signal down-sampled to 2 kHz in the weighted speech domain; this estimate is then refined at full resolution using the original 16 kHz samples. The output x_PLC(i) of the WB PCM PLC logic is a linear combination of the periodically extrapolated waveform and appropriately shaped noise. For extended frame erasures, the output waveform x_PLC(i) is gradually attenuated: the attenuation starts 20 ms into the frame loss and is complete 60 ms into the loss.
As shown in block diagram 1700 of FIG. 17, for type 2, type 3, and type 4 frames, the output x_PLC(i) of the WB PCM PLC logic is passed through the G.722 QMF analysis filter bank 1702 to obtain the corresponding sub-band signals, which are then fed to a modified low-band ADPCM encoder 1704 and a modified high-band ADPCM encoder 1706, respectively, to update the states and memory of the decoders. Only partial, simplified sub-band ADPCM encoders are used for this update.
The processing performed by the logic shown in figs. 16 and 17 takes place during lost frames. The modified low-band ADPCM encoder 1704 and the modified high-band ADPCM encoder 1706 are both simplified to reduce complexity; they are described in detail elsewhere in this application. One feature present in the encoders 1704 and 1706 that does not appear in conventional G.722 sub-band ADPCM encoders is an adaptive reset of the encoders based on signal properties and the duration of the packet loss.
The most complex processing of the PLC algorithm is associated with type 5 frames, i.e., the first received frame following a packet loss, during which the transition from the extrapolated waveform to the normally decoded waveform takes place. The techniques used in processing type 5 frames include the rephasing and time warping described in more detail herein; FIG. 18 provides a block diagram 1800 of the logic that performs these techniques. In addition, when a type 5 frame is processed, the QMF synthesis filter bank in the decoder is updated in a manner described in more detail herein. Another function associated with processing type 5 frames is the setting of the low-band and high-band logarithmic scale factors at the beginning of the first received frame after a packet loss.
Both type 5 and type 6 frames are decoded with a modified and constrained sub-band ADPCM decoder; FIG. 19 depicts a block diagram 1900 of the logic that processes them. As shown in fig. 19, logic 1970 imposes constraints and control on the sub-band ADPCM decoders 1910 and 1920 when type 5 and/or type 6 frames are processed. The constraints and controls are applied during the first 80 ms after a packet loss; some of them do not extend beyond 40 ms, while others are adaptive in duration or degree. The constraint and control mechanisms are described in more detail in this application. As further shown in fig. 19, logic modules 1940, 1950, and 1960 update the state memory after a type 5 or type 6 frame is processed.
Under error-free channel conditions, the PLC algorithm described in this section is bit-exact with G.722. Under error conditions, the algorithm is identical to G.722 beyond the eighth received frame after a packet loss and, in the absence of bit errors, should converge toward the error-free G.722 output.
The PLC algorithm described in this section supports any frame size that is a multiple of 10 ms. For packet sizes greater than 10ms, the PLC algorithm need only be invoked multiple times at 10ms intervals for each packet. Therefore, in the subsequent part of this section, the PLC algorithm will be described in terms of a fixed frame size of 10 ms.
3. Waveform extrapolation for G.722 output
For the lost frames corresponding to a packet loss (type 2, type 3, and type 4 frames), the WB PCM PLC logic of FIG. 16 extrapolates the G.722 output waveform associated with the previous frames to generate a replacement waveform for the current frame. While type 2, type 3, and type 4 frames are processed, the extrapolated wideband signal waveform x_PLC(i) is used as the output waveform of the G.722 PLC logic. To simplify the description of the various blocks in FIG. 16, after the WB PCM PLC logic calculates the signal x_PLC(i) for a lost frame, the signal x_PLC(i) is written to the buffer that stores x_out(j), where x_out(j) is the final output of the entire G.722 decoder/PLC system. Each of the processing modules of FIG. 16 will now be described in more detail.
a. Eighth-order LPC analysis
Toward the end of the frame processing cycle, after the x_out(j) signal associated with the current frame has been calculated and stored in the buffer, module 1604 performs an eighth-order LPC analysis. This is an autocorrelation LPC analysis applied, with an asymmetric analysis window w(j), to the 10 ms of the x_out(j) signal associated with the current frame.

Let x_out(0), x_out(1), ..., x_out(159) denote the G.722 decoder/PLC system output wideband signal samples associated with the current frame. The windowing operation is performed as follows:

xw(j) = x_out(j)·w(j), j = 0, 1, 2, ..., 159. (5)
Next, the autocorrelation coefficients are calculated as follows:

r(i) = Σ_{j=i}^{159} xw(j)·xw(j-i), i = 0, 1, ..., 8.
Spectral smoothing and white noise correction operations are then applied to the autocorrelation coefficients:

r̂(i) = r(i)·e^(-(1/2)·(2πσi/f_s)²), i = 1, 2, ..., 8,

with a white noise correction applied to r(0) by scaling it with a factor slightly greater than unity to obtain r̂(0), where f_s = 16000 is the sampling rate of the input signal and σ = 40.
Next, the Levinson-Durbin recursion is used to convert the autocorrelation coefficients r̂(i) into the LPC predictor coefficients ã_i, i = 0, 1, ..., 8. If the recursion exits prematurely before completion (for example, because the prediction error energy E(i) becomes negative), the short-term predictor coefficients of the most recent frame are used in the current frame. To handle this exception, the ã_i array is initialized to ã_0 = 1 and ã_i = 0 for i = 1, 2, ..., 8. The Levinson-Durbin recursion is specified as follows:

1. If r̂(0) ≤ 0, use the ã_i array of the most recent frame and exit the Levinson-Durbin recursion.
2. E(0) = r̂(0)
3. k_1 = -r̂(1)/r̂(0)
4. ã_1^(1) = k_1
5. E(1) = (1 - k_1²)·E(0)
6. If E(1) ≤ 0, use the ã_i array of the most recent frame and exit the Levinson-Durbin recursion.
7. For i = 2, 3, ..., 8, perform the following:
a. k_i = -[r̂(i) + Σ_{j=1}^{i-1} ã_j^(i-1)·r̂(i-j)] / E(i-1)
b. ã_i^(i) = k_i
c. ã_j^(i) = ã_j^(i-1) + k_i·ã_{i-j}^(i-1), j = 1, 2, ..., i-1
d. E(i) = (1 - k_i²)·E(i-1)
e. If E(i) ≤ 0, use the ã_i array of the most recent frame and exit the Levinson-Durbin recursion.

If the recursion exits prematurely, the ã_i array of the previously processed frame is used. If the recursion completes successfully (the normal case), the LPC predictor coefficients are taken as

ã_0 = 1 and ã_i = ã_i^(8), i = 1, 2, ..., 8.

Finally, a bandwidth expansion operation is applied to the coefficients obtained above to produce the final set of LPC predictor coefficients a_i, i = 0, 1, ..., 8.
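The following C sketch implements the recursion above, including the early-exit rule: on any failure the caller's coefficient array, which holds the most recent frame's coefficients, is left untouched. The sign convention matches the residual filter of the next subsection; the function name and the use of double precision are assumptions for the example.

#include <string.h>

#define ORDER 8

/* Returns 0 on success, -1 on early exit (caller keeps old coefficients). */
static int levinson_durbin(const double r[ORDER + 1], double a[ORDER + 1])
{
    double tmp[ORDER + 1] = { 1.0 }, prev[ORDER + 1];
    double E = r[0];

    if (E <= 0.0) return -1;                  /* step 1        */
    tmp[1] = -r[1] / r[0];                    /* steps 3 and 4 */
    E *= 1.0 - tmp[1] * tmp[1];               /* step 5        */
    if (E <= 0.0) return -1;                  /* step 6        */

    for (int i = 2; i <= ORDER; i++) {        /* step 7        */
        double k = r[i];
        for (int j = 1; j < i; j++) k += tmp[j] * r[i - j];
        k = -k / E;                           /* 7a            */
        memcpy(prev, tmp, sizeof(prev));
        tmp[i] = k;                           /* 7b            */
        for (int j = 1; j < i; j++)           /* 7c            */
            tmp[j] = prev[j] + k * prev[i - j];
        E *= 1.0 - k * k;                     /* 7d            */
        if (E <= 0.0) return -1;              /* 7e            */
    }
    memcpy(a, tmp, sizeof(tmp));              /* success: commit result */
    return 0;
}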
b. Computation of the short-term prediction residual signal
Block 1602 of fig. 16 (labeled "A(z)") represents a short-term linear prediction error filter with the filter coefficients a_i, i = 0, 1, ..., 8, calculated as above. Module 1602 operates after the eighth-order LPC analysis has been performed, and computes the short-term prediction residual signal d(j) as follows:

d(j) = x_out(j) + Σ_{i=1}^{8} a_i·x_out(j-i), j = 0, 1, ..., 159.
conventionally, the time index n of the current frame continues from the time index of the previously processed frame. In other words, if the time index range 0, 1, 2.,. 159 represents the current frame, then the time index range-160, -159.,. 1 represents the previously processed frame. Thus, in the above equation, if the index (j-i) is negative, then the index points to the signal sample near the end of the previously processed frame.
c. Calculation of the scaling factor
Module 1606 in fig. 16 calculates the average magnitude of the short-term prediction residual signal associated with the current frame. This operation is performed after the short-term prediction residual signal d(j) has been calculated by module 1602 in the manner described above. The average magnitude avm is calculated as follows:

avm = (1/160)·Σ_{j=0}^{159} |d(j)|.
if the next frame to be processed is a lost frame (in other words, a frame corresponding to a packet loss), the average amplitude can be used as a scaling factor to adjust a white Gaussian noise sequence (if the current frame is silent).
d. Computation of weighted speech signals
Block 1608 of fig. 16 (labeled "1/A(z/γ)") represents a weighted short-term synthesis filter. Module 1608 operates after the short-term prediction residual signal d(j) of the current frame has been calculated in the manner described above with reference to module 1602. The coefficients a′_i, i = 0, 1, ..., 8, of this weighted short-term synthesis filter are calculated as follows, with γ₁ = 0.75:
a′_i = γ₁^i·a_i, i = 1, 2, ..., 8. (13)
The short-term prediction residual signal d(j) is passed through this weighted synthesis filter. The corresponding output, the weighted speech signal xw(j), is calculated as follows:

xw(j) = d(j) - Σ_{i=1}^{8} a′_i·xw(j-i), j = 0, 1, 2, ..., 159.
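A brief C sketch combining the residual computation of module 1602 and the weighted synthesis of module 1608 follows. The assumption that each buffer carries eight samples of history at negative indices mirrors the indexing convention described above; the function name is illustrative.

#define LPCORD 8
#define FRSZ   160

static void short_term_filters(const double a[LPCORD + 1],
                               const double *x,   /* x[-LPCORD .. FRSZ-1]  */
                               double *d,         /* d[0 .. FRSZ-1]        */
                               double *xw)        /* xw[-LPCORD .. FRSZ-1] */
{
    double aw[LPCORD + 1];
    double g = 1.0;
    for (int i = 0; i <= LPCORD; i++) {           /* a'_i = 0.75^i * a_i   */
        aw[i] = a[i] * g;
        g *= 0.75;
    }

    for (int j = 0; j < FRSZ; j++) {
        double acc = x[j];
        for (int i = 1; i <= LPCORD; i++)         /* A(z): residual d(j)   */
            acc += a[i] * x[j - i];
        d[j] = acc;
        acc = d[j];
        for (int i = 1; i <= LPCORD; i++)         /* 1/A(z/gamma): xw(j)   */
            acc -= aw[i] * xw[j - i];
        xw[j] = acc;
    }
}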
e. Eight-to-one decimation
Block 1616 of FIG. 16 passes the weighted speech signal output by block 1608 through a 60th-order minimum-phase finite impulse response (FIR) filter and then performs 8:1 decimation to down-sample the resulting 16 kHz low-pass filtered weighted speech signal to the 2 kHz down-sampled weighted speech signal xwd(n). This decimation is performed after the weighted speech signal has been calculated. To reduce complexity, the FIR low-pass filtering is carried out only when a new sample of xwd(n) is needed. The down-sampled weighted speech signal xwd(n) is thus calculated as follows:
xwd(n) = Σ_{i=0}^{59} b_i·xw(8n-i),

where b_i, i = 0, 1, 2, ..., 59, are the coefficients of the 60th-order FIR low-pass filter, as given in Table 3.
i b_i in Q15 format i b_i in Q15 format i b_i in Q15 format
0 1209 20 -618 40 313
1 728 21 -941 41 143
2 1120 22 -1168 42 -6
3 1460 23 -1289 43 -126
4 1845 24 -1298 44 -211
5 2202 25 -1199 45 -259
6 2533 26 -995 46 -273
7 2809 27 -701 47 -254
8 3030 28 -348 48 -210
9 3169 29 20 49 -152
10 3207 30 165 50 -89
11 3124 31 365 51 -30
12 2927 32 607 52 21
13 2631 33 782 53 58
14 2257 34 885 54 81
15 1814 35 916 55 89
16 1317 36 881 56 84
17 789 37 790 57 66
18 267 38 654 58 41
19 -211 39 490 59 17
Table 3: coefficients of 60 th order FIR filter
f. Coarse pitch period extraction
To reduce computational complexity, the WB PCM PLC logic performs pitch extraction in two stages: a coarse pitch period is first determined at the time resolution of the 2 kHz down-sampled signal, and the pitch period is then refined at the time resolution of the 16 kHz undecimated signal. Pitch extraction is performed only after the down-sampled weighted speech signal xwd(n) has been calculated. This subsection describes the first-stage coarse pitch period extraction algorithm, performed by module 1620 of FIG. 16. It is based on maximizing the normalized cross-correlation, with some additional decision logic.
A 15 ms pitch analysis window is used in the coarse pitch period extraction. The end of the pitch analysis window is aligned with the end of the current frame. At a sampling rate of 2 kHz, 15 ms corresponds to 30 samples. Without loss of generality, let the index range n = 0 to n = 29 correspond to the pitch analysis window for xwd(n). The coarse pitch period extraction algorithm starts by calculating

c(k) = Σ_{n=0}^{29} xwd(n)·xwd(n-k)

and

E(k) = Σ_{n=0}^{29} [xwd(n-k)]²

for all integers from k = MINPPD - 1 to k = MAXPPD + 1, where MINPPD = 5 and MAXPPD = 33 are the minimum and maximum pitch periods in the decimated domain, respectively. The algorithm then searches the range k = MINPPD, MINPPD + 1, MINPPD + 2, ..., MAXPPD for all local peaks of the array {c2(k)/E(k)} for which c(k) > 0, where c2(k) = [c(k)]². (A value is defined as a local peak if it is larger than both of its adjacent values.) Let N_p denote the number of such positive local peaks, and let k_p(j), j = 1, 2, ..., N_p, be the indices at which c2(k_p(j))/E(k_p(j)) is a local peak with c(k_p(j)) > 0, where k_p(1) < k_p(2) < ... < k_p(N_p). For convenience, c2(k)/E(k) will be referred to as the "normalized correlation square."

If N_p = 0, i.e., if the function c2(k)/E(k) has no positive local peak, the algorithm searches for the largest negative local peak, the one with the largest magnitude of |c2(k)/E(k)|. If such a peak is found, the corresponding index k is used as the output coarse pitch period cpp, and the processing of block 1620 terminates. If the normalized correlation square function c2(k)/E(k) has neither a positive nor a negative local peak, the output coarse pitch period is set to cpp = MINPPD, and the processing of block 1620 terminates. If N_p = 1, the output coarse pitch period is set to cpp = k_p(1), and the processing of block 1620 terminates.
If there are at least two local peaks (N_p ≥ 2), this module uses Algorithms A, B, C, and D (described below), in that order, to determine the output coarse pitch period cpp. Variables calculated in the earlier of these four algorithms are carried over and used in the later ones.
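Before those algorithms are presented, the following C sketch shows how the positive local peaks of the normalized correlation square can be located with the division-free, cross-multiplied comparisons that Algorithms A through D also rely on. Names and types are illustrative; c and E must be valid for k in [MINPPD-1, MAXPPD+1], matching the computation range given above.

enum { MINPPD = 5, MAXPPD = 33 };

/* Fills kp[] with the positions of the positive local peaks of
 * c2(k)/E(k) = c(k)^2/E(k), in increasing order; returns their count Np. */
static int find_positive_peaks(const float *c, const float *E,
                               int kp[MAXPPD])
{
    int np = 0;
    for (int k = MINPPD; k <= MAXPPD; k++) {
        float c2k  = c[k] * c[k];
        float c2km = c[k - 1] * c[k - 1];
        float c2kp = c[k + 1] * c[k + 1];
        /* c2(k)/E(k) larger than both neighbours, and c(k) > 0:
         * each ratio test is done by cross-multiplication. */
        if (c[k] > 0.0f &&
            c2k * E[k - 1] > c2km * E[k] &&
            c2k * E[k + 1] > c2kp * E[k])
            kp[np++] = k;
    }
    return np;
}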
Algorithm A below is used to identify the largest quadratically interpolated peak around the local peaks of the normalized correlation square c2(k_p)/E(k_p). Quadratic interpolation is performed for c(k_p), while linear interpolation is performed for E(k_p). The interpolation is carried out at the time resolution of the 16 kHz undecimated speech signal. In the algorithm below, D denotes the decimation factor used when decimating xw(n) to xwd(n); thus, D = 8 here.
Algorithm A - Find the largest quadratically interpolated peak around c2(k_p)/E(k_p):

A. Set c2max = -1, Emax = 1, and jmax = 0.
B. For j = 1, 2, ..., N_p, perform the following 12 steps:
1. Set a = 0.5·[c(k_p(j)+1) + c(k_p(j)-1)] - c(k_p(j))
2. Set b = 0.5·[c(k_p(j)+1) - c(k_p(j)-1)]
3. Set ji = 0
4. Set ei = E(k_p(j))
5. Set c2m = c2(k_p(j))
6. Set Em = E(k_p(j))
7. If c2(k_p(j)+1)·E(k_p(j)-1) > c2(k_p(j)-1)·E(k_p(j)+1), perform the remainder of step 7:
a. Δ = [E(k_p(j)+1) - ei]/D
b. For k = 1, 2, ..., D/2, perform the following part of step 7:
i. ci = a·(k/D)² + b·(k/D) + c(k_p(j))
ii. ei ← ei + Δ
iii. If (ci)²·Em > c2m·ei, perform the following three lines:
a. ji = k
b. c2m = (ci)²
c. Em = ei
8. If c2(k_p(j)+1)·E(k_p(j)-1) ≤ c2(k_p(j)-1)·E(k_p(j)+1), perform the remainder of step 8:
a. Δ = [E(k_p(j)-1) - ei]/D
b. For k = -1, -2, ..., -D/2, perform the following part of step 8:
i. ci = a·(k/D)² + b·(k/D) + c(k_p(j))
ii. ei ← ei + Δ
iii. If (ci)²·Em > c2m·ei, perform the following three lines:
a. ji = k
b. c2m = (ci)²
c. Em = ei
9. Set lag(j) = k_p(j) + ji/D
10. Set c2i(j) = c2m
11. Set Ei(j) = Em
12. If c2m·Emax > c2max·Em, perform the following three lines:
a. jmax = j
b. c2max = c2m
c. Emax = Em

The symbol ← denotes that the parameter on the left-hand side is updated with the value on the right-hand side.
To avoid selecting a coarse pitch period that is approximately an integer multiple of the true coarse pitch period, the time lags corresponding to the local peaks of c2(k_p)/E(k_p) are examined to determine whether any lie sufficiently close to the output coarse pitch period of the previously processed frame, denoted cpplast (initialized to 12 for the very first frame). A time lag within 25% of cpplast is considered close enough. For all time lags within 25% of cpplast, the corresponding interpolated normalized correlation squares are compared, and the interpolated lag with the largest normalized correlation square is selected for further processing. Algorithm B below performs this task, using the interpolated arrays c2i(j) and Ei(j) calculated in Algorithm A above.
Algorithm B - Among all lags close to the output coarse pitch period of the last frame, find the lag with the maximum interpolated c2(k_p)/E(k_p):

A. Set index im = -1
B. Set c2m = -1
C. Set Em = 1
D. For j = 1, 2, ..., N_p, perform the following:
1. If |k_p(j) - cpplast| ≤ 0.25·cpplast, perform the following:
a. If c2i(j)·Em > c2m·Ei(j), perform the following three lines:
i. im = j
ii. c2m = c2i(j)
iii. Em = Ei(j)

Note that if no time lag k_p(j) is within 25% of cpplast, the value of the index im remains -1 after Algorithm B has executed. If there are one or more time lags within 25% of cpplast, im is the index of the one with the largest normalized correlation square among them.
Next, Algorithm C determines whether an alternative lag in the first half of the pitch range should be selected as the output coarse pitch period. The algorithm examines all interpolated lags lag(j) that are smaller than 16, and checks whether each has, near every integer multiple of itself up to 32 (including itself), a local peak with a sufficiently large normalized correlation square. If one or more lags satisfy this condition, the smallest qualifying lag is selected as the output coarse pitch period.
The final values of the variables calculated in Algorithms A and B above are also passed on to Algorithm C below. In the algorithm, the parameter MPDTH is 0.06, and the threshold array MPTH(k) is given by MPTH(2) = 0.7, MPTH(3) = 0.55, MPTH(4) = 0.48, MPTH(5) = 0.37, and MPTH(k) = 0.30 for k > 5.
Algorithm C - Check whether an alternative lag in the first half of the coarse pitch period range should be selected as the output coarse pitch period:
A. For j = 1, 2, 3, ..., N_p, in that order, while lag(j) < 16, perform the following:
1. If j ≠ im, set threshold = 0.73; otherwise, set threshold = 0.4.
2. If c2i(j)·Emax ≤ threshold·c2max·Ei(j), disqualify this j, skip step 3 for this j, increment j by 1, and return to step 1.
3. If c2i(j)·Emax > threshold·c2max·Ei(j), perform the following:
a. For k = 2, 3, 4, ..., while k·lag(j) < 32, perform the following:
i. s = k·lag(j)
ii. a = (1 - MPDTH)·s
iii. b = (1 + MPDTH)·s
iv. Check, in the order m = j+1, j+2, j+3, ..., N_p, whether any lag(m) lies between a and b. If none does, disqualify this j, stop step 3, increment j by 1, and return to step 1. If there is at least one m such that a < lag(m) < b and c2i(m)·Emax > MPTH(k)·c2max·Ei(m), then a sufficiently large normalized correlation square peak is considered to have been found near the k-th integer multiple of lag(j); in this case, stop step 3.a.iv, increment k by 1, and return to step 3.a.i.
b. If step 3.a completes without being stopped prematurely, i.e., if there is a sufficiently large interpolated normalized correlation square peak within ±100·MPDTH% of every integer multiple of lag(j) that is less than 32, then stop the algorithm, skip Algorithm D, and take cpp = lag(j) as the final output coarse pitch period.
If Algorithm C above completes without selecting an output coarse pitch period cpp, Algorithm D examines the largest local peak of the normalized correlation square around the coarse pitch period of the last frame (found in Algorithm B above) and makes the final decision on the output coarse pitch period cpp. Again, the variables calculated in Algorithms A and B above are passed on, with their final values, to Algorithm D below. The parameters are SMDTH = 0.095 and LPTH1 = 0.78.
Algorithm D - Final determination of the output coarse pitch period:
A. If im = -1, i.e., if there is no sufficiently large local peak of the normalized correlation square around the coarse pitch period of the last frame, the cpp calculated at the end of Algorithm A is taken as the final output coarse pitch period, and the algorithm exits.
B. If im = jmax, i.e., if the largest local peak of the normalized correlation square around the coarse pitch period of the last frame is also the global maximum of all interpolated peaks of the normalized correlation square in this frame, the cpp calculated at the end of Algorithm A is taken as the final output coarse pitch period, and the algorithm exits.
C. If im < jmax, perform the following:
1. If c2m·Emax > 0.43·c2max·Em, perform the remainder of step C:
a. If lag(im) > MAXPPD/2, set the output cpp = lag(im) and exit the algorithm.
b. Otherwise, for k = 2, 3, 4, 5, perform the following:
i. s = lag(jmax)/k
ii. a = (1 - SMDTH)·s
iii. b = (1 + SMDTH)·s
iv. If lag(im) > a and lag(im) < b, set the output cpp = lag(im) and exit the algorithm.
D. If im > jmax, perform the following:
1. If c2m·Emax > LPTH1·c2max·Em, set the output cpp = lag(im) and exit the algorithm.
E. If execution reaches this point, none of the steps above has selected a final output coarse pitch period. In that case, the cpp calculated at the end of Algorithm A is accepted as the final output coarse pitch period.
g. Pitch period refinement
Module 1622 in fig. 16 performs the second stage of the pitch period extraction algorithm: it searches the neighborhood of the coarse pitch period at the full 16 kHz time resolution using the G.722 decoded output speech signal. The module first converts the coarse pitch period cpp to the undecimated signal domain by multiplying it by the decimation factor D = 8. The pitch refinement analysis window size WSZ is chosen as the smaller of cpp·D samples and 160 samples (corresponding to 10 ms): WSZ = min(cpp·D, 160).
Next, the lower bound of the search range is calculated as lb = max(MINPP, cpp·D - 4), where MINPP = 40 samples is the minimum pitch period. The upper bound of the search range is calculated as ub = min(MAXPP, cpp·D + 4), where MAXPP = 265 samples is the maximum pitch period.
Module 1622 maintains a buffer of the 16 kHz G.722 decoded speech signal x_out(j) with a total of XQOFF = MAXPP + 1 + FRSZ samples, where FRSZ = 160 is the frame size. The last FRSZ samples of this buffer contain the G.722 decoded speech signal of the current frame; the first MAXPP + 1 samples are the G.722 decoder/PLC system output signal of the frames processed before the current frame. The last sample of the analysis window is aligned with the last sample of the current frame. Let the index range j = 0 to j = WSZ - 1 correspond to the analysis window, i.e., the last WSZ samples in the x_out(j) buffer, and let negative indices denote the samples before the analysis window. In the search range [lb, ub], the following correlation function and energy term are calculated in the undecimated signal domain for time lags k:

c̃(k) = Σ_{j=0}^{WSZ-1} x_out(j)·x_out(j-k)

and

Ẽ(k) = Σ_{j=0}^{WSZ-1} [x_out(j-k)]².
then will maximize the ratioTime lag k ∈ [ lb, ub ]]The final refined pitch period, or ppfe, is selected as the frame deletion. Namely, it is
Next, module 1622 calculates two further pitch-related scaling factors. The first, called ptfe (pitch tap for frame erasure), is the scaling factor used in the periodic waveform extrapolation. It is calculated as the ratio of the average magnitude of the x_out(j) signal in the analysis window to the average magnitude of the portion of the x_out(j) signal ppfe samples earlier, with the same sign as the correlation between these two signal portions:

ptfe = sign(c̃(ppfe))·[Σ_{j=0}^{WSZ-1} |x_out(j)|] / [Σ_{j=0}^{WSZ-1} |x_out(j-ppfe)|].
in thatIn the degenerate case of (3), ptfe is set to 0. When this calculation of ptfe is complete, the bound range of values for ptfe is [ -1, 1 [ ]]。
The second pitch-related scaling factor, called ppt (pitch prediction tap), is used in calculating the long-term filter ringing signal (described below) and is calculated as ppt = 0.75·ptfe.
h. Calculation of the mixing ratio
Module 1618 in fig. 16 calculates a figure of merit that determines the mixing ratio between the periodically extrapolated waveform and the filtered noise waveform during lost frames. This calculation is performed only during the first lost frame of each packet loss. The figure of merit is a weighted sum of three signal characteristics: the logarithmic gain, the first normalized autocorrelation, and the pitch prediction gain, each calculated as follows.
Using the same indexing convention for xout(j) as in the previous subsection, the energy of the xout(j) signal in the pitch refinement analysis window is

sige = Σ_{j=0}^{WSZ−1} [xout(j)]²,

and the base-2 logarithmic gain lg is calculated as

lg = log2(sige / WSZ) if sige > 0, and lg = 0 otherwise.

The pitch prediction residual energy is calculated as

rese = Σ_{j=0}^{WSZ−1} [xout(j) − ptfe·xout(j−ppfe)]²,

and, if rese > 0, the pitch prediction gain pg is calculated as

pg = 10·log10(sige / rese).

If rese = 0, pg is set to 0. If sige = 0, pg is also set to 0.

The first normalized autocorrelation ρ1 is calculated as

ρ1 = [Σ_{j=0}^{WSZ−1} xout(j)·xout(j−1)] / sige.

After these three signal characteristics are obtained, the figure of merit is calculated as

merit = lg + pg + 12ρ1. (28)
The merit value calculated above determines two scaling factors, Gp and Gr, which effectively set the mixing ratio between the periodically extrapolated waveform and the filtered noise waveform. There are two thresholds on the merit: a high threshold MHI and a low threshold MLO, set to MHI = 28 and MLO = 20. The scaling factor Gr for the random (filtered-noise) component is calculated as

Gr = (MHI − merit) / (MHI − MLO), clipped to the range [0, 1], (29)

and the scaling factor Gp for the periodic component is calculated as

Gp = 1 − Gr. (30)
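The merit-to-gain mapping can be sketched as below. Only the thresholds, the endpoint behavior, and Gp = 1 − Gr are stated explicitly in the text; the linear interpolation between MLO and MHI is one reading of equation (29) and should be treated as an assumption.

```c
#define MERIT_HI 28.0   /* MHI */
#define MERIT_LO 20.0   /* MLO */

/* Map the figure of merit to the periodic/random mixing gains. */
static void mixing_gains(double merit, double *Gp, double *Gr)
{
    if (merit >= MERIT_HI)
        *Gr = 0.0;                                   /* purely periodic */
    else if (merit <= MERIT_LO)
        *Gr = 1.0;                                   /* purely noise */
    else
        *Gr = (MERIT_HI - merit) / (MERIT_HI - MERIT_LO);
    *Gp = 1.0 - *Gr;                                 /* equation (30) */
}
```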
i. Periodic waveform extrapolation
Block 1624 of FIG. 16 periodically extrapolates the previous output speech waveform during lost frames (if merit > MLO). The manner in which module 1624 performs this function will now be described.
For the first lost frame of each packet loss, the average pitch period increment per frame is calculated. A pitch period history buffer pph(m), m = 1, 2, ..., 5, stores the pitch period ppfe of the previous five frames. The average pitch period increment is obtained according to the following procedure. Starting with the most recent of those frames, the pitch period increment from its preceding frame to that frame is calculated (a negative value indicates a pitch period decrement). If that increment is zero, the algorithm checks the increment at the preceding frame. This process continues until the first frame with a non-zero pitch period increment is found, or until the fourth previous frame has been examined. If all five previous frames have the same pitch period, the average pitch period increment is set to zero. Otherwise, if the first non-zero pitch period increment is found at the m-th previous frame, and if the magnitude of that increment is less than 5% of the pitch period of that frame, the average pitch period increment ppinc is calculated as that increment divided by m, and the resulting value is limited to the range [−1, 2].
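A C sketch of this procedure follows; pph[0] holds the most recent frame's pitch period and pph[4] the oldest. Returning zero when the first non-zero increment fails the 5% test, and the mapping of history index to the divisor m, are this sketch's reading of the text, not verbatim specification behavior.

```c
#include <stdlib.h>

/* Average pitch period increment from the 5-frame pitch history. */
static double avg_pitch_increment(const int pph[5])
{
    for (int m = 1; m <= 4; m++) {
        int inc = pph[m - 1] - pph[m];   /* increment into frame m-1 */
        if (inc != 0) {
            /* accept only if |inc| < 5% of that frame's pitch period */
            if (abs(inc) < 0.05 * pph[m - 1]) {
                double ppinc = (double)inc / (double)m;
                if (ppinc < -1.0) ppinc = -1.0;   /* limit to [-1, 2] */
                if (ppinc >  2.0) ppinc =  2.0;
                return ppinc;
            }
            return 0.0;
        }
    }
    return 0.0;  /* all five frames share the same pitch period */
}
```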
From the second consecutive lost frame of a packet loss onward, the average pitch period increment is added to the pitch period ppfe; the result is rounded to the nearest integer and then limited to the range [MINPP, MAXPP].
If the current frame is the first lost frame of a packet loss, the so-called "ringing signal" used in the overlap-add is calculated to ensure a smooth waveform transition at the beginning of the frame. The overlap-add length between the ringing signal and the periodically extrapolated waveform is 20 samples for the first lost frame. Let the index range j = 0, 1, 2, ..., 19 correspond to the first 20 samples of the current (first lost) frame, i.e., the overlap-add period, and let negative indices correspond to the previous frame. The long-term ringing signal is obtained as a scaled version of the short-term prediction residual signal d(n) one pitch period earlier than the overlap-add period:

ltring(j) = d(j − ppfe), j = 0, 1, 2, ..., 19. (31)

After these 20 samples of ltring(j) are calculated, they are further scaled by the factor ppt calculated by module 1622:

ltring(j) ← ppt·ltring(j), j = 0, 1, 2, ..., 19. (32)

With the filter memory ring(j), j = −8, −7, ..., −1, initialized to the last 8 samples of the xout(j) signal of the last frame, the final ringing signal is obtained by passing ltring(j) through the short-term synthesis filter:

ring(j) = ltring(j) − Σ_{i=1}^{8} ãi·ring(j − i), j = 0, 1, 2, ..., 19, (33)

where ãi are the short-term predictor coefficients.
Let the index range j = 0, 1, 2, ..., 159 correspond to the current (first lost) frame, and let the index range j = 160, 161, 162, ..., 209 correspond to the first 50 samples of the next frame. Further, let wi(j) and wo(j), j = 0, 1, ..., 19, be the triangular fade-in and fade-out windows, respectively, such that wi(j) + wo(j) = 1. The periodic waveform extrapolation is then performed in two steps:
Step 1:

xout(j) = wi(j)·ptfe·xout(j − ppfe) + wo(j)·ring(j), j = 0, 1, 2, ..., 19. (34)

Step 2:

xout(j) = ptfe·xout(j − ppfe), j = 20, 21, 22, ..., 209. (35)
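A compact C sketch of equations (34)-(35); the ramp weights satisfying wi(j) + wo(j) = 1 are assumed triangular as stated, and the exact ramp endpoints are illustrative.

```c
/* Periodic waveform extrapolation: j = 0 is the first sample of the
 * lost frame; negative j indexes previously output speech. */
static void extrapolate_periodic(double *xout, const double *ring,
                                 int ppfe, double ptfe)
{
    /* Step 1: 20-sample overlap-add with the ringing signal (eq. 34) */
    for (int j = 0; j < 20; j++) {
        double wi = (double)(j + 1) / 20.0;  /* triangular fade-in */
        double wo = 1.0 - wi;                /* triangular fade-out */
        xout[j] = wi * ptfe * xout[j - ppfe] + wo * ring[j];
    }
    /* Step 2: plain periodic extrapolation through sample 209 (eq. 35) */
    for (int j = 20; j < 210; j++)
        xout[j] = ptfe * xout[j - ppfe];
}
```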
j. Normalized noise generator
If merit < MHI, block 1610 in FIG. 16 generates a white Gaussian random noise sequence with unity average magnitude. To reduce computational complexity, the white Gaussian noise is pre-computed and stored in a table. To avoid a very long table while also avoiding a repeating noise pattern from a table that is too short, a special indexing scheme is used. In this scheme, the white Gaussian noise table wn(j) has 127 entries, and the scaled output of the noise generator module is

wgn(j) = avm × wn(mod(cfecount × j, 127)), j = 0, 1, 2, ..., 209, (36)

where avm is the average-magnitude scale factor, cfecount is a frame counter such that cfecount = k for the k-th consecutive lost frame in the current packet loss, and mod(·, 127) is the modulo operation.
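Equation (36) in C. Because the table length 127 is prime and the stride cfecount changes every lost frame, successive lost frames walk the table in different orders, so a short table does not produce an audible repeating pattern.

```c
/* Table-driven noise generator (equation (36)); wn[] holds the stored
 * 127-entry white Gaussian table, avm the average-magnitude scale. */
static void generate_noise(double *wgn, const double wn[127],
                           double avm, int cfecount)
{
    for (int j = 0; j < 210; j++)
        wgn[j] = avm * wn[(cfecount * j) % 127];
}
```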
k. Filtering of noise sequences
Block 1614 in FIG. 16 represents a short-term synthesis filter. If merit < MHI, block 1614 filters the scaled white Gaussian noise so that it has a spectral envelope similar to that of the xout(j) signal in the last received frame. The filtered noise fn(j) is obtained as

fn(j) = wgn(j) − Σ_{i=1}^{8} ãi·fn(j − i), j = 0, 1, 2, ..., 209, (37)

where ãi are the short-term predictor coefficients.
l. Mixing of the periodic and random components
If merit > MHI, only the periodically extrapolated waveform xout(j) calculated by module 1624 is used as the output of the WB PCM PLC logic. If merit < MLO, only the filtered noise signal fn(j) generated by block 1614 is used as the output of the WB PCM PLC logic. If MLO ≤ merit ≤ MHI, the two components are mixed as
xout(j) ← Gp·xout(j) + Gr·fn(j), j = 0, 1, 2, ..., 209. (38)
The first 40 extra samples of the extrapolated xout(j) signal, j = 160, 161, 162, ..., 199, become the ringing signal ring(j), j = 0, 1, 2, ..., 39, of the next frame. If the next frame is also a lost frame, only the first 20 samples of the ringing signal are used in the overlap-add. If the next frame is a received frame, all 40 samples of the ringing signal may be used in the overlap-add.
m. Conditional ramp-down
If the packet loss lasts 20 ms or less, the xout(j) signal generated by mixing the periodic and random components is used directly as the WB PCM PLC output signal. If the packet loss lasts longer than 60 ms, the WB PCM PLC output signal is completely muted. If the packet loss lasts longer than 20 ms but less than 60 ms, the xout(j) signal generated by mixing the periodic and random components is linearly ramped down (attenuated toward zero in a linear fashion). This conditional ramp-down is performed during lost frames when cfecount > 2, as specified in the algorithm below. The array gawd(), in Q15 format, is given by {-52, -69, -104, -207}. As before, the index range j = 0, 1, 2, ..., 159 corresponds to the current frame of xout(j).
Conditional ramp-down algorithm:

A. If cfecount ≤ 6, perform the following nine rows:
1. delta = gawd(cfecount − 3)
2. gaw = 1
3. For j = 0, 1, 2, ..., 159, perform the following two rows:
a. xout(j) = gaw·xout(j)
b. gaw = gaw + delta
4. If cfecount < 6, perform the following three rows:
a. For j = 160, 161, 162, ..., 209, perform the following two rows:
i. xout(j) = gaw·xout(j)
ii. gaw = gaw + delta
B. Otherwise (if cfecount > 6), set xout(j) = 0, j = 0, 1, 2, ..., 209.
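The same algorithm in C, with the Q15 gawd() deltas converted to floating point for readability; the fixed-point arithmetic of an actual implementation is omitted.

```c
/* Conditional ramp-down for cfecount = 3..6; full mute beyond that. */
static void conditional_ramp_down(double *xout, int cfecount)
{
    static const double gawd[4] = { -52.0 / 32768.0, -69.0 / 32768.0,
                                    -104.0 / 32768.0, -207.0 / 32768.0 };
    if (cfecount <= 6) {
        double delta = gawd[cfecount - 3];
        double gaw = 1.0;
        for (int j = 0; j < 160; j++) {   /* current frame */
            xout[j] *= gaw;
            gaw += delta;
        }
        if (cfecount < 6) {
            for (int j = 160; j < 210; j++) {  /* look-ahead samples */
                xout[j] *= gaw;
                gaw += delta;
            }
        }
    } else {
        for (int j = 0; j < 210; j++)
            xout[j] = 0.0;
    }
}
```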
n. Overlap-add in the first received frame
For type 5 frames, the output xout(j) from the G.722 decoder is overlap-added with the ringing signal ring(j) from the last lost frame (calculated by module 1624 in the manner described above):

xout(j) = wi(j)·xout(j) + wo(j)·ring(j), j = 0, ..., LOLA − 1, (39)

where wi(j) and wo(j) are triangular fade-in and fade-out windows of length LOLA = 40 samples.
4. Re-encoding of the PLC output
To update the memories and parameters of the G.722 ADPCM decoders during lost frames (type 2, type 3, and type 4 frames), the PLC output is essentially passed through a G.722 encoder. FIG. 17 is a block diagram 1700 of the logic that performs this re-encoding. As shown in FIG. 17, the PLC output xout(j) is split by a QMF analysis filter bank 1702 to produce a low-band sub-band signal xL(n) and a high-band sub-band signal xH(n). The low-band sub-band signal xL(n) is encoded by a low-band ADPCM encoder 1704, and the high-band sub-band signal xH(n) is encoded by a high-band ADPCM encoder 1706. To reduce complexity, ADPCM sub-band encoders 1704 and 1706 are simplified relative to conventional ADPCM sub-band encoders. These operations will now be described in greater detail.
a. Passing the PLC output through the QMF analysis filter bank
The memory of the QMF analysis filter bank 1702 is initialized so as to produce sub-band signals that are continuous with the previously decoded sub-band signals. The first 22 samples of the WB PCM PLC output constitute the filter memory, and the sub-band signals are then calculated according to equations (41) and (42), where xPLC(0) corresponds to the first sample of the 16 kHz WB PCM PLC output of the current frame, and xL(n = 0) and xH(n = 0) correspond to the first samples of the 8 kHz low-band and high-band sub-band signals, respectively, of the current frame. Apart from the offset of 22 extra samples, the filtering is identical to the transmit QMF of the G.722 encoder, except that the WB PCM PLC output (rather than the encoder input) is passed through the filter bank. Furthermore, to generate a complete frame of each sub-band signal (80 samples, 10 ms), the WB PCM PLC output must extend 22 samples beyond the current frame, i.e., 182 samples (11.375 ms) are generated. The sub-band signals xL(n) (n = 0, 1, ..., 79) and xH(n) (n = 0, 1, ..., 79) are generated according to equations (41) and (42), respectively.
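A sketch of a G.722-style analysis QMF as used here; the 24-tap coefficient array h[] (the standard G.722 QMF coefficients) is not reproduced, and the caller must supply at least 23 samples of valid history before x[0], here the 22-sample memory described above plus the adjoining samples.

```c
/* Split a 16 kHz signal into 8 kHz low/high sub-bands, G.722 style:
 * xL = even-tap sum + odd-tap sum, xH = even-tap sum - odd-tap sum. */
static void qmf_analysis(const double h[24], const double *x,
                         int nsub, double *xL, double *xH)
{
    for (int n = 0; n < nsub; n++) {
        double even = 0.0, odd = 0.0;
        for (int i = 0; i < 12; i++) {
            even += h[2 * i]     * x[2 * n - 2 * i];
            odd  += h[2 * i + 1] * x[2 * n - 2 * i - 1];
        }
        xL[n] = even + odd;
        xH[n] = even - odd;
    }
}
```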
b. Re-encoding of low band signals
The low-band signal xL(n) is encoded using a simplified low-band ADPCM encoder. A block diagram of the simplified low-band ADPCM encoder 2000 is shown in FIG. 20. As can be seen in FIG. 20, the inverse quantizer of the standard low-band ADPCM encoder has been removed, and the quantized prediction error has been replaced by the unquantized prediction error. Furthermore, since the update of the adaptive quantizer is based on only a subset of 8 of the 64 members represented by the 6-bit low-band encoder index IL(n), the prediction error is quantized using only an 8-level set. This provides an identical update of the adaptive quantizer while simplifying the quantization. Table 4 lists the decision levels, output codes, and multipliers of the simplified quantizer, based on the absolute value of eL(n).
mL    Low threshold    High threshold    IL    Multiplier WL
1     0.00000          0.14103           3c    -0.02930
2     0.14103          0.45482           38    -0.01465
3     0.45482          0.82335           34     0.02832
4     0.82335          1.26989           30     0.08398
5     1.26989          1.83683           2c     0.16309
6     1.83683          2.61482           28     0.26270
7     2.61482          3.86796           24     0.58496
8     3.86796                            20     1.48535

Table 4: Decision levels, output codes, and multipliers of the 8-level simplified quantizer
The entities of FIG. 20 are computed exactly as their equivalents in the G.722 low-band ADPCM sub-band encoder:

sL(n) = sLp(n) + sLz(n), (45)

eL(n) = xL(n) − sL(n), and (46)

pLt(n) = sLz(n) + eL(n). (47)
The adaptive quantizer is updated exactly as specified for the G.722 encoder. The adaptation of the zero and pole sections takes place as in the G.722 encoder, as described in clauses 3.6.3 and 3.6.4 of the G.722 specification.
The low-band ADPCM decoder 1910 is automatically reset after 60 ms of frame loss, but it may be reset adaptively as early as 30 ms into the frame loss. During re-encoding of the low-band signal, the partial reconstructed signal pLt(n) is monitored to control the adaptive reset of low-band ADPCM decoder 1910. Properties of the pLt(n) signal are tracked throughout the loss, and they are reset to zero at the first lost frame (equation (48)). For lost frames, pLt(n) is monitored for a constant signal on a frame-by-frame basis; accordingly, the constant-signal count cnst[pLt(n)] is set to zero at the beginning of each lost frame and is incremented for every sample at which pLt(n) remains unchanged (equation (49)). The sub-band decoder is then reset at the end of lost frames 3 through 5 if the condition given in equation (50) is met, namely if a per-frame measure of pLt(n), which depends on Nlost (the number of lost frames, i.e., 3, 4, or 5), indicates a reset, or if

cnst[pLt(n)] > 40. (50)
c. Re-encoding of high band signals
The high-band signal xH(n) is encoded using a simplified high-band ADPCM encoder. A block diagram of the simplified high-band ADPCM encoder 2100 is shown in FIG. 21. As can be seen in FIG. 21, the adaptive quantizer of the standard high-band ADPCM encoder has been eliminated: because the algorithm overwrites the log scale factor on the first received frame with a moving average from before the packet loss, no log scale factor is needed for the high-band re-encoding. The quantized prediction error of the high-band ADPCM encoder 2100 is replaced by the unquantized prediction error.
The entities of FIG. 21 are computed exactly as their equivalents in the G.722 high-band ADPCM sub-band encoder:

sH(n) = sHp(n) + sHz(n), (53)

eH(n) = xH(n) − sH(n), and (54)

pH(n) = sHz(n) + eH(n). (55)
The adaptation of the zero and pole sections takes place as in the G.722 encoder, as described in clauses 3.6.3 and 3.6.4 of the G.722 specification.
Similar to the low-band re-encoding, high-band ADPCM decoder 1920 is automatically reset after 60 ms of frame loss, but it may be reset adaptively as early as 30 ms into the frame loss. During re-encoding of the high-band signal, the partial reconstructed signal pH(n) is monitored to control the adaptive reset of high-band ADPCM decoder 1920. Properties of the pH(n) signal are tracked throughout the loss, and they are reset to zero at the first lost frame (equation (56)). For lost frames, pH(n) is monitored for a constant signal on a frame-by-frame basis; accordingly, the constant-signal count cnst[pH(n)] is set to zero at the beginning of each lost frame and is incremented for every sample at which pH(n) remains unchanged (equation (57)). The sub-band decoder is then reset at the end of lost frames 3 through 5 if the condition given in equation (58) is met, namely if a per-frame measure of pH(n) indicates a reset, or if

cnst[pH(n)] > 40. (58)
5. Monitoring of signal characteristics and their use in the PLC
The following describes the functionality of the constraint and control logic 1970 of FIG. 19, which reduces artifacts and distortion at the transition from lost frames to received frames, thereby improving the performance of the decoder/PLC system 300 after a packet loss.
a. Low band logarithmic scale factor
Characteristics of the low-band log scale factor are updated during received frames and are used to adaptively set the state of the adaptive quantizer scale factor on the first received frame after a frame loss. In particular, a measure of the stationarity of the low-band log scale factor is derived and used to determine the proper resetting of that state.
i. Stationarity of low band logarithmic scale factor
The measure of stationarity of the low-band log scale factor is calculated and updated during received frames. It is based on a first-order moving average of the low-band log scale factor with constant leakage (equation (59)); the tracking of this first-order moving average is calculated according to equation (60). A second moving average with adaptive leakage is calculated according to equation (61). The stationarity of the low-band log scale factor is then measured as the degree of variation between the two moving averages, according to equation (62). During lost frames, there is no update; in other words, the moving averages and the stationarity measure are held at their values from the last received frame. (63)
ii. Resetting of log scale factors for low band adaptive quantizers
On the first received frame after a frame loss, the log scale factor of the low-band adaptive quantizer is adaptively reset (overwritten) according to the stationarity measured before the frame loss, as given by equation (64).
b. High band logarithmic scale factor
Characteristics of the high-band log scale factor are updated during received frames and are used to set the state of the adaptive quantizer scale factor on the first received frame after a frame loss. Furthermore, these characteristics adaptively control the convergence of the high-band log scale factor after the frame loss.
i. Moving average and stationarity of high band logarithmic scale factors
The tracking of the high-band log scale factor is calculated according to equation (65). Based on this tracking, a moving average with adaptive leakage is calculated according to equation (66). This moving average is used to reset the high-band log scale factor on the first received frame, as described in a later subsection. The degree of stationarity of the high-band log scale factor is calculated from the moving average according to equation (67); this stationarity measure is used to control the convergence of the high-band log scale factor after a frame loss, as also described in a later subsection. During lost frames, there is no update; in other words, the tracking, the moving average, and the stationarity measure are held at their values from the last received frame. (68)
ii. Resetting of logarithmic scale factors for high band adaptive quantizers
On the first received frame, the high-band log scale factor is reset (overwritten) to the moving average from the received frames before the packet loss, as given by equation (69).
iii. Convergence of the log scale factor of the high-band adaptive quantizer
The convergence of the high-band log scale factor after a frame loss is controlled by the stationarity measured before the frame loss. For stationary cases, an adaptive low-pass filter is applied to the high-band log scale factor after the packet loss. The low-pass filter is applied over a duration of 0 ms, 40 ms, or 80 ms, during which the degree of low-pass filtering is gradually reduced; the duration, in samples, is determined according to equation (70). The low-pass filtering itself is given by equation (71), with coefficients given by equation (72). The degree of low-pass filtering thus decreases sample by sample with the time index n. During this period, the low-pass-filtered log scale factor simply replaces the regular log scale factor.
c. Low-band pole section
A quantity called the stability margin (of the pole section) is updated for the sub-band ADPCM decoders during received frames and is used to constrain the pole section after a frame loss.
i. Stability margin of the low-band pole section
The stability margin of the low-band pole section is defined as
βL(n) = 1 − |aL,1(n)| − aL,2(n), (73)

where aL,1(n) and aL,2(n) are the two pole coefficients. During received frames, the moving average of the stability margin is updated according to

βL,MA(n) = 15/16·βL,MA(n−1) + 1/16·βL(n). (74)

During lost frames, the moving average is not updated:

βL,MA(n) = βL,MA(n−1). (75)
ii. Constraining the low-band pole section
In conventional G.722 low-band (and high-band) ADPCM encoding and decoding, a minimum stability margin of βL,min = 1/16 is maintained. During the first 40 ms after a frame loss, an increased minimum stability margin is maintained for the low-band ADPCM decoder; it is a function of the time since the frame loss and of the moving average of the stability margin.
For the first three 10 ms frames after the loss, the minimum stability margin is

βL,min = min{3/16, βL,MA(n−1)}. (76)

The minimum stability margin is set at the frame boundary and enforced over the entire frame. At the frame boundary into the fourth 10 ms frame, an intermediate minimum stability margin is enforced according to equation (77), while for subsequent frames the regular minimum stability margin of βL,min = 1/16 is enforced.
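Equations (73), (74), and (76) in C; the intermediate margin used on the fourth frame is an assumption of this sketch (the text gives it only by reference to equation (77), and 2/16 is a plausible step between 3/16 and the regular 1/16).

```c
#include <math.h>

/* Stability margin of the low-band pole section (equation (73)). */
static double stab_margin(double a1, double a2)
{
    return 1.0 - fabs(a1) - a2;
}

/* Moving-average update during received frames (equation (74)). */
static double stab_margin_ma(double beta_ma, double beta)
{
    return (15.0 / 16.0) * beta_ma + (1.0 / 16.0) * beta;
}

/* Minimum stability margin enforced after a frame loss; 'frame' counts
 * 10 ms received frames since the loss (1-based). */
static double min_stab_margin(int frame, double beta_ma)
{
    if (frame <= 3)
        return fmin(3.0 / 16.0, beta_ma);   /* equation (76) */
    if (frame == 4)
        return fmin(2.0 / 16.0, beta_ma);   /* assumed form of (77) */
    return 1.0 / 16.0;                      /* regular minimum */
}
```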
d. High-band partial reconstructed signal and high-band reconstructed signal
During all frames, lost as well as received, high-pass filtered versions of the high-band partial reconstructed signal pH(n) and the high-band reconstructed signal rH(n) are maintained:

pH,HP(n) = 0.97·[pH(n) − pH(n−1) + pH,HP(n−1)], and (78)

rH,HP(n) = 0.97·[rH(n) − rH(n−1) + rH,HP(n−1)]. (79)

This corresponds to a 3 dB cut-off of about 40 Hz; the filtering mainly serves to remove DC.
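Equation (78) (and its counterpart for rH(n)) as a one-line C filter; the state variables carry the previous input and output samples.

```c
/* DC-removal high-pass of equation (78): ~40 Hz 3 dB cut-off at 8 kHz. */
static double hp40(double x, double *x_prev, double *y_prev)
{
    double y = 0.97 * (x - *x_prev + *y_prev);
    *x_prev = x;
    *y_prev = y;
    return y;
}
```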
During the first 40 ms after a frame loss, the regular partial reconstructed signal and the regular reconstructed signal are replaced by their respective high-pass filtered versions for, respectively, the high-band pole section adaptation and the high-band reconstructed output.
6. Time lag calculation
The re-phasing and time-warping techniques discussed herein require the time lag, i.e., the number of samples by which the lost-frame concealment waveform xPLC(j) and the signal in the first received frame are misaligned.
a. Low-complexity estimate of the lower sub-band reconstructed signal
The signal rLe(n) used for calculating the time lag in the first received frame is obtained by filtering the lower sub-band truncated difference signal dLt(n), as given by equation (80), using the pole-zero filter coefficients (aLpwe,i(159), bLpwe,i(159)) and the other necessary state information obtained from STATE159.
this function is performed by block 1820 of FIG. 18.
b. Rephasing and time-warping determination
If the last received frame is unvoiced, as indicated by the figure of merit, the time lag TL is set to zero:

if merit ≤ MLO, TL = 0. (81)

Likewise, if the first received frame is unvoiced, as indicated by the normalized first autocorrelation coefficient r(1) computed over the first received frame (equation (82)), the time lag is set to zero:

if r(1) < 0.125, TL = 0. (83)
Otherwise, the time lag is calculated as explained in the following subsection. The calculation of the time lag is performed by block 1850 of FIG. 18.
c. Calculation of time lag
The calculation of the time lag comprises the following steps: (1) generation of an extrapolated signal; (2) a coarse time lag search; and (3) a refined time lag search. These are described in the subsections below.
i. Generation of an extrapolated signal
The time lag represents the misalignment between xPLC(j) and rLe(n). To calculate this misalignment, xPLC(j) is extrapolated into the first received frame, and the lag that maximizes a normalized cross-correlation function is found. This subsection describes how xPLC(j) is extrapolated and specifies the length of signal required. Assume that xPLC(j) of the lost frame has been copied into the xout(j) buffer. Since this is a type 5 frame (the first received frame), the correspondence is:

xout(j − 160) = xPLC(j), j = 0, 1, ..., 159. (84)
The range over which the correlation is searched is given by equation (85), where ΔTLMAX = 28 and ppfe is the pitch period used by the periodic waveform extrapolation that generated xPLC(j). The window size for the lag search (at the 16 kHz sampling rate) is given by equation (86), and it is useful to also specify the lag search window LSW at the 8 kHz sampling rate, as in equation (87). Given the above, the total length of the extrapolated signal required from xPLC(j) is

L = 2·(LSW + ΔTL), (88)

and the starting position of the extrapolated signal relative to the first sample in the received frame is

D = 12 − ΔTL. (89)
The extrapolated signal es(j) is constructed as follows:

if D < 0:
    es(j) = xout(D + j), j = 0, 1, ..., −D − 1
    if (L + D ≤ ppfe):
        es(j) = xout(−ppfe + D + j), j = −D, −D + 1, ..., L − 1
    else:
        es(j) = xout(−ppfe + D + j), j = −D, −D + 1, ..., ppfe − D − 1
        es(j) = es(j − ppfe), j = ppfe − D, ppfe − D + 1, ..., L − 1
else:
    if (ovs ≥ L):
        es(j) = xout(−ovs + j), j = 0, 1, ..., L − 1
    else:
        if (ovs > 0):
            es(j) = xout(−ovs + j), j = 0, 1, ..., ovs − 1
        if (L − ovs ≤ ppfe):
            es(j) = xout(−ovs − ppfe + j), j = ovs, ovs + 1, ..., L − 1
        else:
            es(j) = xout(−ovs − ppfe + j), j = ovs, ovs + 1, ..., ovs + ppfe − 1
            es(j) = es(j − ppfe), j = ovs + ppfe, ovs + ppfe + 1, ..., L − 1
ii. Coarse time lag search
A coarse estimate TLSUB of the time lag is first obtained by searching for the peak of the sub-sampled normalized cross-correlation function RSUB(k), given by equation (90).
To avoid searching beyond the boundaries in the refinement, TLSUB is adjusted as follows:

if (TLSUB > ΔTLMAX − 4), TLSUB = ΔTLMAX − 4, (91)

if (TLSUB < −ΔTLMAX + 4), TLSUB = −ΔTLMAX + 4. (92)
iii. Refined time lag search
The search is then refined by searching for the peak of the normalized cross-correlation function R(k), given by equation (93), in the neighborhood of TLSUB; this yields the time lag TL. Finally, the following conditions are checked: if the correlation at the peak fails the test of equations (94) and (95), or if

(TL > ΔTLMAX − 2) ‖ (TL < −ΔTLMAX + 2), (96)

then TL = 0.
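The two-stage search can be illustrated with the single full-resolution pass below; the sub-sampled coarse pass of equation (90) applies the same correlation to decimated signals. Normalizing by the extrapolated-signal energy alone is a simplification assumed for this sketch, not the normative normalization.

```c
#include <math.h>

/* Find the lag in [-max_lag, max_lag] maximizing the normalized
 * cross-correlation between the extrapolated signal es and the
 * (estimated) decoded signal r over a win-sample window. es must be
 * valid for indices [-max_lag, win - 1 + max_lag]. */
static int lag_search(const double *es, const double *r,
                      int win, int max_lag)
{
    int best_k = 0;
    double best = -1.0e30;
    for (int k = -max_lag; k <= max_lag; k++) {
        double c = 0.0, e = 0.0;
        for (int j = 0; j < win; j++) {
            c += es[j + k] * r[j];
            e += es[j + k] * es[j + k];
        }
        double nr = (e > 0.0) ? c / sqrt(e) : 0.0;
        if (nr > best) { best = nr; best_k = k; }
    }
    return best_k;
}
```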
7. Rephasing
Re-phasing is the process of setting the internal states to the point in time where the lost-frame concealment waveform xPLC(j) is in phase with the last input signal sample immediately before the first received frame. Re-phasing can be divided into the following steps: (1) storing intermediate G.722 states during re-encoding of the lost frame; (2) adjusting the re-encoding according to the time lag; and (3) updating the QMF synthesis filter memory. These steps are described in greater detail in the subsections below. Re-phasing is performed by module 1810 of FIG. 18.
a. Storing intermediate G.722 states during re-encoding
As described elsewhere in this application, the reconstructed signal xPLC(j) is re-encoded during lost frames to update the G.722 decoder state memory. Let STATEj denote the G.722 state and PLC state after re-encoding the j-th sample of xPLC(j). Then, in addition to the G.722 state at the frame boundary, which is maintained as usual (i.e., STATE159), STATE159−ΔTLMAX is also stored. To facilitate the re-phasing, the sub-band signals

xL(n), xH(n), n = 69 − ΔTLMAX/2, ..., 79 + ΔTLMAX/2

are also stored.
b. Adjustment of the re-encoding according to the time lag
Depending on the sign of the time lag, the adjusted re-encoding proceeds as follows:

If ΔTL > 0:
1. Restore the G.722 state and the PLC state to STATE159−ΔTLMAX.
2. Re-encode xL(n), xH(n), n = 80 − ΔTLMAX/2, ..., 79 − ΔTL/2, in the manner described above.

If ΔTL < 0:
1. Restore the G.722 state and the PLC state to STATE159.
2. Re-encode xL(n), xH(n), n = 80, ..., 79 + |ΔTL/2|, in the manner described above.
Note that to facilitate the re-encoding of xL(n) and xH(n) up to n = 79 + |ΔTL/2|, up to ΔTLMAX + 182 samples of xPLC(j) are required.
c. Updating QMF synthesis filter memory
On the first received frame, the QMF synthesis filter memory must be calculated, since the QMF synthesis filter bank is inactive during lost frames (the PLC takes place in the 16 kHz output speech domain). In time, the memory would ordinarily correspond to the last samples of the last lost frame; however, the re-phasing must be taken into account. According to G.722, the QMF synthesis filter memory is given by:
xd(i) = rL(n − i) − rH(n − i), i = 1, 2, ..., 11, and (97)

xs(i) = rL(n − i) + rH(n − i), i = 1, 2, ..., 11. (98)

The first two output samples of the first received frame are calculated from this memory according to equations (99) and (100). To account for the re-phasing, the filter memory (i.e., xd(i) and xs(i), i = 1, 2, ..., 11) is instead calculated from the outputs xL(n) and xH(n) (n = 69 − ΔTL/2, 69 − ΔTL/2 + 1, ..., 79 − ΔTL/2) of the simplified sub-band ADPCM re-encoders, i.e., from the last 11 samples up to the re-phasing point:

xd(i) = xL(80 − ΔTL/2 − i) − xH(80 − ΔTL/2 − i), i = 1, 2, ..., 11, and (101)

xs(i) = xL(80 − ΔTL/2 − i) + xH(80 − ΔTL/2 − i), i = 1, 2, ..., 11, (102)

where xL(n) and xH(n) were stored in the state memory during the lost frame.
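Equations (101)-(102) in C; TL is the (even-valued) time lag expressed in 16 kHz samples, so TL/2 indexes the 8 kHz sub-band signals.

```c
/* Rebuild the QMF synthesis filter memory from the re-encoded sub-band
 * signals up to the re-phasing point (equations (101)-(102)). */
static void update_qmf_memory(double xd[12], double xs[12],
                              const double *xL, const double *xH, int TL)
{
    for (int i = 1; i <= 11; i++) {
        int n = 80 - TL / 2 - i;
        xd[i] = xL[n] - xH[n];   /* equation (101) */
        xs[i] = xL[n] + xH[n];   /* equation (102) */
    }
}
```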
8. Time warping
Time warping is the process of stretching or shrinking a signal along the time axis. The following describes how xout(j) is time-warped to improve its alignment with the periodic waveform extrapolated signal xPLC(j). The algorithm is executed only when TL ≠ 0. Time warping is performed by block 1860 of FIG. 18.
a. Time lag refinement
For time warping, the time lag TL is refined by maximizing the cross-correlation in the overlap-add window. Based on TL, the estimated start position of the overlap-add window in the first received frame is

SPOLA = max(0, MIN_UNSTBL − TL), (103)

where MIN_UNSTBL = 16. The start position of the extrapolated signal relative to SPOLA is

Dref = SPOLA − TL − RSR, (104)

where RSR = 4 is the refinement search range. The required length of the extrapolated signal is

Lref = OLALG + RSR. (105)

The extrapolated signal estw(j) is obtained using the same procedure described in Section D.6.c.i above, except that LSW = OLALG, L = Lref, and D = Dref. The refinement lag Tref is then obtained by searching for the peak of the cross-correlation function given in equation (106). The final time lag used for time warping is

TLwarp = TL + Tref. (107)
b. Calculating the time-warped xout(j) signal
The signal xout(j) is time-warped by TLwarp samples to form the signal xwarp(j), which is subsequently overlap-added with the waveform extrapolated signal esola(j). The timelines 2200, 2220, and 2240 of FIGS. 22A, 22B, and 22C, respectively, illustrate three cases depending on the value of TLwarp. In FIG. 22A, TLwarp < 0, and xout(j) undergoes shrinking, or compression. The first MIN_UNSTBL samples of xout(j) are not used to create xwarp(j) during the warping process, and xstart = MIN_UNSTBL. In FIG. 22B, 0 ≤ TLwarp ≤ MIN_UNSTBL, and xout(j) is stretched by TLwarp samples. Again, the first MIN_UNSTBL samples of xout(j) are unused, and xstart = MIN_UNSTBL. In FIG. 22C, TLwarp ≥ MIN_UNSTBL, and once again xout(j) is stretched by TLwarp samples. However, because TLwarp extra samples are created during the warping process, the first TLwarp samples of xout(j) are not needed in this case; thus, xstart = TLwarp.
In each case, the periodicity of the sample insertion or deletion is given by equation (108). The warping is implemented via piece-wise single-sample shifts and triangular overlap-adds, starting from xout[xstart]. To shrink the signal, a sample is periodically dropped; from each point of sample drop, the original signal and the left-shifted signal (shifted because of the drop) are overlap-added. To stretch the signal, a sample is periodically repeated; from each point of sample repetition, the original signal and the right-shifted signal (shifted because of the repetition) are overlap-added. The length Lolawarp of the overlap-add window (note: this is different from the OLA region shown in FIGS. 22A, 22B, and 22C) depends on the periodicity of the sample insertion/deletion: it is given by equation (109), and is otherwise bounded as

Lolawarp = min(8, Lolawarp).

The length of the warped input signal xwarp is

Lxwarp = min(160, 160 − MIN_UNSTBL + TLwarp). (110)
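The single-sample shrink step can be sketched as below; the stretch step mirrors it with a repeated sample and a right shift. Window shape, bounds checking, and the scheduling of drop points by equation (108) are left to the caller; this is an illustration, not the normative routine.

```c
/* Drop the sample at position p and cross-fade the left-shifted tail
 * over 'ola' samples (piece-wise single-sample shrink). */
static void shrink_by_one(double *x, int len, int p, int ola)
{
    for (int j = 0; j < ola; j++) {
        double wi = (double)(j + 1) / (double)(ola + 1); /* fade-in */
        double wo = 1.0 - wi;                            /* fade-out */
        x[p + j] = wo * x[p + j] + wi * x[p + j + 1];
    }
    for (int j = p + ola; j < len - 1; j++)  /* shift remainder left */
        x[j] = x[j + 1];
}
```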
c. Calculating the waveform extrapolated signal
As shown in FIGS. 22A, 22B, and 22C, the warped signal xwarp(j) is overlap-added with the extrapolated signal esola(j) in the first received frame. The extrapolated signal esola(j) can be generated directly in the xout(j) signal buffer in the following two steps:

Step 1:

esola(j) = xout(j) = ptfe·xout(j − ppfe), j = 0, 1, ..., 160 − Lxwarp + 39. (111)

Step 2:

xout(j) = xout(j)·wi(j) + ring(j)·wo(j), j = 0, 1, ..., 39, (112)

where wi(j) and wo(j) are triangular ramp-up and ramp-down overlap-add windows of length 40, and ring(j) is the ringing signal calculated in the manner described elsewhere in this application.
d. Overlap-add of the time-warped and waveform extrapolated signals
The extrapolated signal and the warped signal xwarp(j) calculated in the preceding subsections are overlap-added as follows:

xout(160 − Lxwarp + j) = xout(160 − Lxwarp + j)·wo(j) + xwarp(j)·wi(j), j = 0, 1, ..., 39. (113)

The remainder of xwarp(j) is then simply copied into the signal buffer:

xout(160 − Lxwarp + j) = xwarp(j), j = 40, 41, ..., Lxwarp − 1. (114)
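Equations (113)-(114) in C; the 40-sample ramps are assumed linear, matching wi(j) + wo(j) = 1.

```c
/* Final overlap-add of the warped signal into the output buffer. */
static void ola_warped(double *xout, const double *xwarp, int Lxwarp)
{
    int s = 160 - Lxwarp;
    for (int j = 0; j < 40; j++) {
        double wi = (double)(j + 1) / 40.0;  /* ramp-up */
        double wo = 1.0 - wi;                /* ramp-down */
        xout[s + j] = xout[s + j] * wo + xwarp[j] * wi;  /* (113) */
    }
    for (int j = 40; j < Lxwarp; j++)
        xout[s + j] = xwarp[j];                          /* (114) */
}
```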
E. packet loss concealment for sub-band predictive coders based on sub-band speech waveform extrapolation
Decoder/PLC system 2300, shown in FIG. 23, is an alternative embodiment of the present invention. Most of the techniques discussed above for decoder/PLC system 300 can also be used in this second exemplary embodiment. The primary difference between decoder/PLC system 2300 and decoder/PLC system 300 is that the speech waveform extrapolation is performed in the sub-band speech signal domain rather than in the full-band speech signal domain.
As shown in FIG. 23, the decoder/PLC system 2300 includes a bit-stream de-multiplexer 2310, a low-band ADPCM decoder 2320, a low-band speech signal synthesizer 2322, a high-band ADPCM decoder 2330, a high-band speech signal synthesizer 2332, switches 2326 and 2336, and a QMF synthesis filter bank 2340. The bit-stream de-multiplexer 2310 is essentially the same as the bit-stream de-multiplexer 210 of FIG. 2, and the QMF synthesis filter bank 2340 is essentially the same as the QMF synthesis filter bank 240 of FIG. 2.
As with the decoder/PLC system 300 of fig. 3, the decoder/PLC system 2300 processes frames in a manner dependent on frame type and uses the same frame type as described above in fig. 5.
In processing a frame of type 1, the decoder/PLC system 2300 performs standard g.722 decoding. In this mode of operation, the modules 2310, 2320, 2330 and 2340 of the decoder/PLC system 2300 perform the same functions as the corresponding modules 210, 220, 230 and 240, respectively, of the conventional g.722 decoder 200. Specifically, the bitstream demultiplexer 2310 separates the input bitstream into a low-band bitstream and a high-band bitstream. The low-band ADPCM decoder 2320 decodes the low-band bitstream into a decoded low-band speech signal. The switch 2326 is connected to the upper position labeled "type 1" to connect the decoded low-band speech signal to the QMF synthesis filter bank 2340. The high-band ADPCM decoder 2330 decodes the high-band bitstream into a decoded high-band speech signal. Switch 2336 is also connected to the upper position labeled "type 1" to connect the decoded high-band speech signal to QMF synthesis filter bank 2340. The QMF synthesis filter bank 2340 then recombines the decoded low-band speech signal and the decoded high-band speech signal into a full-band output speech signal.
Thus, in processing a frame of type 1, the operation of decoder/PLC system 2300 corresponds to that of decoder 200 of FIG. 2, except that the decoded low-band speech signal is stored in low-band speech signal synthesizer 2322 for possible use in a subsequent lost frame, and the decoded high-band speech signal is stored in high-band speech signal synthesizer 2332, likewise for possible use in a subsequent lost frame. Other state updates and processing in support of PLC operations may also be performed.
In processing frames of type 2, type 3, and type 4 (lost frames), the decoded speech signal of each sub-band is extrapolated from the stored sub-band speech signals associated with previous frames to fill the waveform gap associated with the current lost frame. This waveform extrapolation is performed by low-band speech signal synthesizer 2322 and high-band speech signal synthesizer 2332. There are many prior-art techniques that may be used to perform the waveform extrapolation function of blocks 2322 and 2332. For example, the techniques described in U.S. patent application No. 11/234,291, filed September 26, 2005, entitled "packet loss concealment techniques for independent block speech codecs," may be used, or modified versions of those techniques, such as those described above in connection with the decoder/PLC system 300 of FIG. 3.
Switches 2326 and 2336 are located in the lower position labeled "type 2-6" when processing frames of type 2, type 3, or type 4. Thus, they connect the synthesized low-band audio signal and the synthesized high-band audio signal to QMF synthesis filter bank 2340, which in turn recombines them into a synthesized output speech signal for the current lost frame.
Similar to the decoder/PLC system 300, the first few received frames following a bad frame (type 5 and type 6 frames) require special processing to minimize the degradation of speech quality due to mismatched g.722 states and to ensure a smooth transition from the extrapolated speech signal waveform in the last lost frame to the decoded speech signal waveform in the first few good frames following the last bad frame. Thus, while processing the frames, switches 2326 and 2336 remain in the lower position labeled "type 2-6" so that the decoded low-band speech signal from low-band ADPCM decoder 2320 may be modified by low-band speech signal synthesizer 2322 before being provided to QMF synthesis filter bank 2340, and the decoded high-band speech signal from high-band ADPCM decoder 2330 may be modified by high-band speech signal synthesizer 2332 before being provided to QMF synthesis filter bank 2340.
Those skilled in the art will appreciate that most of the techniques described in subsections C and D above for the first few frames after a packet loss can also be readily applied in such exemplary embodiments for special handling of the first few frames after a packet loss. For example, decoding constraint and control logic (not shown in fig. 23) is also included in decoder/PLC system 2300 for constraining and controlling the decoding operations performed by low-band ADPCM decoder 2320 and high-band ADPCM decoder 2330 in processing frames of type 5 and type 6 in a similar manner as described above with reference to decoder/PLC system 300. Likewise, each sub-band speech signal synthesizer 2322 and 2332 is used to perform rephasing and time warping techniques such as those described above with reference to decoder/PLC system 300. Because a complete description of these techniques is provided in the preceding section, there is no need to repeat the description of these techniques as used in the context of decoder/PLC system 2300.
The main advantage of the decoder/PLC system 2300 over the decoder/PLC system 300 is that it has a lower complexity. This is because extrapolating the speech signal in the subband-domain eliminates the need to separate the full-band extrapolated speech signal into subband speech signals using a QMF synthesis filter bank, as is done in the first exemplary embodiment. However, extrapolating the speech signal in the full band domain also has its advantages, as will be explained below.
There are some potential problems when the system 2300 of FIG. 23 extrapolates the high-band speech signal. First, if it does not perform periodic waveform extrapolation on the high-band speech signal, the output speech signal will not preserve the periodicity of the high band that can occur in some highly periodic voiced signals. On the other hand, if it does perform periodic waveform extrapolation on the high-band speech signal, another problem remains, even if it uses the same pitch period as for extrapolating the low-band speech signal (to reduce computation and to ensure both sub-band speech signals are extrapolated with the same pitch period). When the high-band speech signal is periodically extrapolated, the extrapolated high-band speech signal is periodic and has a harmonic structure in its spectrum; that is, the frequencies of the spectral peaks in the high-band spectrum are related by integer multiples. However, once the synthesis filter bank 2340 recombines the high-band speech signal with the low-band speech signal, the spectrum of the high-band speech signal is translated, or shifted, to higher frequencies, and mirroring may also occur, depending on the QMF synthesis filter bank used. Thus, after such mirroring and frequency shifting, there is no guarantee that the spectral peaks in the high-band portion of the full-band output speech signal will still lie at frequencies that are integer multiples of the fundamental frequency of the low-band speech signal. This can potentially degrade the output audio quality for highly periodic voiced signals. In contrast, the system 300 of FIG. 3 does not suffer from this problem: because system 300 performs audio signal extrapolation in the full-band domain, the frequencies of the harmonic peaks in the high band are guaranteed to be integer multiples of the fundamental frequency.
In summary, an advantage of the decoder/PLC system 300 is that, for voiced signals, extrapolating the full-band speech signal preserves the harmonic structure of the spectral peaks over the entire speech band. The decoder/PLC system 2300, on the other hand, has the advantage of lower complexity, but it may not preserve such harmonic structure in the higher sub-band.
F. Hardware and software implementation
For completeness, a description of a general-purpose computer system is provided below. The present invention can be implemented in hardware, or as a combination of software and hardware. Consequently, the invention may be implemented in the environment of a computer system or other processing system. An example of such a computer system 2400 is shown in FIG. 24. All of the decoding and PLC operations described above in Sections C, D, and E can be executed on one or more distinct computer systems 2400 to implement the various methods of the present invention.
Computer system 2400 includes one or more processors, such as processor 2404. Processor 2404 may be a special purpose or general purpose digital signal processor. The processor 2404 is connected to a communication fabric 2402 (e.g., a bus or network). Various software implementations are described in terms of this exemplary computer system. After reading this description, those skilled in the art will readily appreciate how to implement the present invention using other computer systems and/or computer structures.
Computer system 2400 also includes a main memory 2406, preferably Random Access Memory (RAM), and may also include a secondary memory 2420. The secondary memory 2420 may include, for example, a hard disk drive 2422 and/or a removable storage drive 2424, representative of floppy disk drives, magnetic tape drives, optical disk drives, and so forth. The removable storage drive 2424 reads from and/or writes to a removable storage unit 2428 in a well known manner. Removable storage unit 2428 is a floppy disk, magnetic tape, optical disk, etc. which is read by and written to by removable storage drive 2424. It should be appreciated that the removable storage unit 2428 includes a computer usable storage medium having stored therein computer software and/or data.
In alternative implementations, secondary memory 2420 may include other similar means for allowing computer programs or other instructions to be loaded into computer system 2400. Such means may include, for example, a removable storage unit 2430 and an interface 2426. Examples of such means include a program cartridge and cartridge interface (such as that found in video game devices), a removable memory chip (such as an EPROM or PROM) and associated socket, and other removable storage units 2430 and interfaces 2426 which allow software and data to be transferred from the removable storage unit 2430 to computer system 2400.
Computer system 2400 also includes a communication interface 2440. Communication interface 2440 enables software and data to be transferred between computer system 2400 and external devices. Examples of communications interface 2440 include a modem, a network interface (such as an ethernet card), a communications port, a PCMCIA slot and card, and the like. Software and data transferred via communications interface 2440 can be in the form of electrical, electromagnetic, optical, or other signals capable of being received by communications interface 2440. These signals may be provided to communications interface 2440 via a communications path 2442. Communications path 2442 carries signals and may be implemented using wire or cable, fiber optics, a phone line, a cellular phone connection, an RF connection, and other communications channels.
The terms "computer program medium" and "computer usable medium" as used herein are used generally to refer to media such as the removable storage unit 2428, a hard disk installed on the hard disk drive 2422, and signals received by the communication interface 2440. These computer program products are means for providing software to computer system 2400.
Computer programs (also called computer control logic) are stored in main memory 2406 and/or secondary memory 2420. Computer programs may also be received via communications interface 2440. Such computer programs, when executed, enable the computer system 2400 to implement the present invention as discussed herein. In particular, the computer programs, when executed, enable the processor 2404 to implement the processes of the present invention, such as any of the methods discussed herein. Accordingly, such computer programs represent controllers of the computer system 2400. Where the invention is implemented using software, the software may be stored in a computer program product and loaded into computer system 2400 using removable storage drive 2424, interface 2426, or communications interface 2440.
In another embodiment, the features of the present invention are implemented primarily in hardware, for example, using hardware components such as application-specific integrated circuits (ASICs) and gate arrays. Implementation of a hardware state machine to perform the functions described herein will also be apparent to those skilled in the relevant art.
G. Conclusion
While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example only, and not limitation. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention. Accordingly, the scope of the invention should not be unduly limited to any of the embodiments described above but should be defined only in accordance with the claims and their equivalents.

Claims (6)

1. A method in a decoder for decoding a series of frames representing an encoded audio signal, for transitioning between a lost frame in the series of frames and one or more received frames following the lost frame, the method comprising:
synthesizing an output audio signal associated with the lost frame;
generating an extrapolated signal based on the synthesized output audio signal;
calculating a time lag between the extrapolated signal and a decoded audio signal associated with the received frame, wherein the time lag represents a phase difference between the extrapolated signal and the decoded audio signal; and
time-warping the decoded audio signal based on the time lag, wherein time-warping the decoded audio signal comprises stretching or shrinking the decoded audio signal in the time domain;
wherein calculating a time lag between the extrapolated signal and the decoded audio signal comprises:
partially decoding the received frame to produce an approximation of the decoded audio signal; and
calculating a time lag between the extrapolated signal and an approximation of the decoded audio signal;
wherein partially decoding the received frame comprises:
decoding a low-band bitstream associated with the received frame in a low-band adaptive differential pulse code modulation decoder to produce a low-band reconstructed signal; and
using the low band reconstructed signal as an approximation of the decoded audio signal;
wherein decoding a low-band bitstream associated with the received frame in a low-band adaptive differential pulse code modulation decoder comprises fixing the coefficients of an adaptive filter having two poles and six zeros during decoding of the low-band bitstream.
2. The method of claim 1, wherein calculating the time lag between the extrapolated signal and the decoded audio signal comprises maximizing a correlation between the extrapolated signal and the decoded audio signal.
3. The method of claim 1, wherein calculating the time lag between the extrapolated signal and the decoded audio signal comprises:
searching for a first peak of a normalized cross-correlation function between the extrapolated signal and the decoded audio signal using a first lag search range and a first lag search window to determine a coarse time lag, wherein the first lag search range refers to a range that shifts a starting point of the extrapolated signal in the search process, the first lag search window specifying a number of samples for which the normalized cross-correlation function is calculated; and
a second peak of the normalized cross-correlation function between the extrapolated signal and the decoded audio signal is searched using a second lag search range and a second lag search window to determine a refined time lag, wherein the second lag search range is smaller than the first lag search range.
4. A method as claimed in claim 3, wherein searching for a first peak of the normalized cross-correlation function between the extrapolated signal and the decoded audio signal comprises searching for a peak of the normalized cross-correlation function between the extrapolated signal and the down-sampled samples of the decoded audio signal.
5. The method of claim 3, wherein the second lag search window is smaller than the first lag search window.
6. The method of claim 3, wherein searching for a second peak of the normalized cross-correlation function between the extrapolated signal and the decoded audio signal using a second lag search range and a second lag search window comprises aligning the second lag search window with a center of an overlap region of the received frame.
HK09107382.2A 2006-08-15 2007-08-15 The method of time distortion of decoding audio signal after packet lost HK1129487B (en)

Applications Claiming Priority (9)

Application Number Priority Date Filing Date Title
US83762706P 2006-08-15 2006-08-15
US60/837,627 2006-08-15
US84804906P 2006-09-29 2006-09-29
US84805106P 2006-09-29 2006-09-29
US60/848,049 2006-09-29
US60/848,051 2006-09-29
US85346106P 2006-10-23 2006-10-23
US60/853,461 2006-10-23
PCT/US2007/076019 WO2008022207A2 (en) 2006-08-15 2007-08-15 Time-warping of decoded audio signal after packet loss

Publications (2)

Publication Number Publication Date
HK1129487A1 HK1129487A1 (en) 2009-11-27
HK1129487B true HK1129487B (en) 2012-11-09

HK1129154B (en) The hiding of the updating of decoder state after packet missing