
MXPA00012579A - Method and apparatus for performing packet loss or frame erasure concealment - Google Patents

Method and apparatus for performing packet loss or frame erasure concealment

Info

Publication number
MXPA00012579A
Authority
MX
Mexico
Prior art keywords
frame
separator
advance
signal
fec
Prior art date
Application number
MXPA/A/2000/012579A
Other languages
Spanish (es)
Inventor
David A. Kapilow
Original Assignee
AT&T Corp
David A. Kapilow
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AT&T Corp and David A. Kapilow
Publication of MXPA00012579A

Links

Abstract

The invention concerns a method and apparatus for performing packet loss or Frame Erasure Concealment (FEC) for a speech coder that does not have a built-in or standard FEC process. A receiver with a decoder receives encoded frames of compressed speech information transmitted from an encoder. A lost frame detector at the receiver determines whether an encoded frame has been lost, corrupted in transmission, or erased. If the encoded frame is not erased, the encoded frame is decoded by a decoder and a temporary memory is updated with the decoder's output. A predetermined delay period is applied and the audio frame is then output. If the lost frame detector determines that the encoded frame is erased, an FEC module applies a frame concealment process to the signal. The FEC processing produces natural-sounding synthetic speech for the erased frames.

Description

METHOD AND APPARATUS FOR PERFORMING PACKET LOSS OR FRAME ERASURE CONCEALMENT FIELD OF THE INVENTION This invention relates to techniques for performing Packet Loss or Frame Erasure Concealment (FEC).
DESCRIPTION OF THE RELATED ART Frame Erasure Concealment (FEC) algorithms hide transmission losses in a speech communication system in which an input speech signal is encoded and packetized at a transmitter, sent over a network (of any type), and received at a receiver that decodes the packets and plays out the speech. Many of the standard CELP-based speech coders, such as G.723.1, G.728 and G.729, have FEC algorithms built into, or proposed for, their standards. The objective of FEC is to generate a synthetic speech signal to cover missing data in a received bitstream. Ideally, the synthesized signal will have the same timbre and spectral characteristics as the missing signal, and will not create unnatural artifacts. Since speech signals are often locally stationary, it is possible to use the recent signal history to generate a reasonable approximation to the missing segment. If the erasures are not too long, and an erasure does not land in a region where the signal is changing rapidly, the erasures may be inaudible after concealment. Previous systems have used pitch waveform replication techniques to conceal frame erasures, such as, for example, D. J. Goodman et al., Waveform Substitution Techniques for Recovering Missing Speech Segments in Packet Voice Communications, Vol. 34, No. 6 IEEE Trans. on Acoustics, Speech, and Signal Processing 1440-48 (December 1986), and O. J. Wasem et al., The Effect of Waveform Substitution on the Quality of PCM Packet Communications, Vol. 36, No. 3 IEEE Transactions on Acoustics, Speech, and Signal Processing 342-48 (March 1988).
Some decoders of standards-based (and non-standards-based) speech coding systems maintain state information; that is, they include memory that stores values of certain signals for use in future decoding operations. For those coding systems that include their own integrated FEC technique, the state information is adjusted appropriately when the FEC techniques are applied to compensate for lost packets.
BRIEF DESCRIPTION OF THE INVENTION The inventor of the present invention has realized that if the decoder of a speech coding system maintains state information, but an FEC technique not integrated with the decoder is used (for example, an FEC process that follows the decoder), the state information will not be properly adjusted to account for the FEC, and problems will result in the decoded speech signal. The present invention relates to a technique for performing Packet Loss or Frame Erasure Concealment (FEC) for speech coding systems whose decoders maintain state information. The invention synthesizes a speech signal corresponding to the unavailable packets and then encodes the synthesized speech signal. Signals reflecting that encoding process are then provided to the decoder for use in maintaining the decoder's state information.
BRIEF DESCRIPTION OF THE DRAWINGS The invention is described in detail with reference to the following figures, wherein like numbers refer to like elements, and wherein: Figure 1 is an exemplary audio transmission system; Figure 2 is an exemplary audio transmission system with a G.711 coder and an FEC module; Figure 3 illustrates an output audio signal using an FEC technique; Figure 4 illustrates an overlap-add (OLA) operation at the end of an erasure; Figure 5 is a flow diagram of an exemplary process for performing FEC with a G.711 coder; Figure 6 is a graph illustrating the process of updating the history buffer; Figure 7 is a flowchart of an exemplary process for concealing the first erased frame of the signal; Figure 8 illustrates pitch estimation from the autocorrelation; Figure 9 illustrates the fine versus coarse pitch estimates; Figure 10 illustrates the signals in the pitch and last-quarter buffers; Figure 11 illustrates synthetic signal generation using a single-period pitch buffer; Figure 12 is a flowchart of an exemplary process for concealing the second and later erased frames of the signal; Figure 13 illustrates the synthesized signals continued into the second erased block; Figure 14 illustrates synthetic signal generation using a two-period pitch buffer; Figure 15 illustrates an OLA at the start of the second erased frame; Figure 16 is a flow diagram of an exemplary method for processing the first frame after an erasure; Figure 17 illustrates synthetic signal generation using a three-period pitch buffer; and Figure 18 is a block diagram illustrating the use of the FEC techniques with other speech coders.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS Recently there has been much interest in using G.711 over packet networks without guaranteed quality of service to support Plain Old Telephone Service (POTS). When frame erasures (or packet losses) occur on these networks, concealment techniques are necessary, or call quality is seriously degraded. A high-quality, low-complexity frame erasure concealment (FEC) technique has been developed and is described in detail below. An exemplary block diagram of an audio system with FEC is shown in Figure 1. In Figure 1, an encoder 110 receives an input audio frame and outputs an encoded bitstream. The bitstream is received by a lost frame detector 115, which determines whether any frames have been lost. If the lost frame detector 115 determines that frames have been lost, it signals the FEC module 130 to apply an FEC algorithm or process to reconstruct the missing frames. In this way, the FEC process conceals transmission losses in an audio system in which the input signal is encoded and packetized at a transmitter, sent over a network, and received at a lost frame detector 115 that determines that a frame has been lost. It is assumed in Figure 1 that the lost frame detector 115 has a way of determining whether an expected frame does not arrive, or arrives too late to be used. In IP networks this is usually implemented by adding a sequence number or timestamp to the data in each transmitted frame. The lost frame detector 115 compares the sequence numbers of the arriving frames with the sequence numbers that would be expected if no frames were lost. If the lost frame detector 115 detects that a frame has arrived when expected, the frame is decoded by the decoder 120 and the output audio frame is given to the output system.
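As a concrete illustration of the sequence-number comparison described above, a minimal sketch of a lost frame detector follows. The class and method names are illustrative, not from the patent, and wraparound and out-of-order arrival handling are omitted.

```python
class LostFrameDetector:
    """Flags frames as lost when an arriving sequence number skips ahead.

    Assumes sequence numbers increase by 1 per transmitted frame; this is
    a sketch, not the patent's implementation.
    """

    def __init__(self):
        self.expected = None  # next sequence number we expect to see

    def receive(self, seq):
        """Return the number of frames lost before this arrival."""
        if self.expected is None:        # first frame ever seen
            self.expected = seq + 1
            return 0
        lost = seq - self.expected       # gap implies that many erasures
        self.expected = seq + 1
        return max(lost, 0)
```

When a nonzero count is returned, the receiver would invoke the concealment process once per missing frame before decoding the frame that did arrive.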
If a frame is lost, the FEC module 130 applies a process to conceal the missing audio frame by generating a synthetic frame's worth of audio in its place. Many of the ITU-T standard CELP-based speech coders, such as G.723.1, G.728 and G.729, model speech reproduction in their decoders. Those decoders therefore have enough state information to integrate the FEC process directly into the decoder, and these speech coders have FEC algorithms or processes specified as part of their standards. G.711, by comparison, is a sample-by-sample coding scheme that does not model speech reproduction. There is no state information in the coder to aid in the FEC. As a result, the FEC process with G.711 is independent of the coder. An exemplary block diagram of the system as used with the G.711 coder is shown in Figure 2. As in Figure 1, the G.711 encoder 210 encodes and transmits the bitstream data to the lost frame detector 215. Again, the lost frame detector 215 compares the sequence numbers of the arriving frames with the sequence numbers that would be expected if no frames were lost. If a frame arrives when expected, it is sent for decoding by the decoder 220 and then passed to a history buffer 240, which stores the signal. If a frame is lost, the lost frame detector 215 informs the FEC module 230, which applies a process to conceal the missing audio frame by generating a synthetic frame's worth of audio in its place. To conceal the missing frames, the FEC module 230 applies a G.711 FEC process that uses the past history of the decoded output signal, provided by the history buffer 240, to estimate what the signal in the missing frame should be. In addition, to ensure a smooth transition between erased and non-erased frames, a delay module 250 also delays the output of the system by a predetermined period of time, for example 3.75 milliseconds.
This delay allows the synthetic erasure signal to be slowly mixed in with the real output signal at the beginning of an erasure. The arrows between the FEC module 230 and each of the history buffer 240 and delay module 250 blocks signify that the saved history is used by the FEC process to generate the synthetic signal, and that the output of the FEC module 230 is used to update the history buffer 240 during an erasure. It should be noted that, since the FEC process depends only on the decoded output of G.711, the process will work just as well when no speech coder is present. A graphic example of how the input signal is processed by the FEC process in the FEC module 230 is shown in Figure 3. The upper waveform in the figure shows the input to the system when a 20 millisecond erasure occurs in a voiced region of speech from a male speaker. In the waveform below it, the FEC process has concealed the missing segments by generating synthetic speech in the gap. For comparison purposes, the original input signal without an erasure is also shown. In an ideal system, the concealed speech sounds just like the original. As can be seen from the figure, the synthetic waveform closely resembles the original in the missing segments. How the "Concealed" waveform is generated from the "Input" waveform is discussed in detail below. The FEC process used by the FEC module 230 conceals the missing frame by generating synthetic speech that has characteristics similar to the speech stored in the history buffer 240. The basic idea is as follows. If the signal is voiced, it is assumed to be quasi-periodic and locally stationary. The pitch is estimated, and the last pitch period in the history buffer 240 is repeated a few times.
However, if the erasure is long or the pitch period is short (the frequency is high), repeating the same pitch period too many times leads to output that is too harmonic compared with natural speech. To avoid these harmonic artifacts, audible as beeps and ringing, the number of pitch periods used from the history buffer 240 is increased as the erasure progresses. Short erasures use only the last, or last few, pitch periods from the history buffer 240 to generate the synthetic signal. Long erasures also use pitch periods from further back in the history buffer 240. With long erasures, the pitch periods from the history buffer 240 are not replayed in the same order in which they occurred in the original speech. However, testing found that the synthetic speech generated during long erasures still sounds natural. The longer the erasure, the more likely it is that the synthetic signal will diverge from the real signal. To avoid artifacts caused by holding certain types of sounds too long, the synthetic signal is attenuated as the erasure becomes longer. For erasures of 10 milliseconds or less, no attenuation is needed. For erasures longer than 10 milliseconds, the synthetic signal is attenuated at the rate of 20% per additional 10 milliseconds. Beyond 60 milliseconds, the synthetic signal is set to zero (silence). This is because the synthetic signal has become so dissimilar to the original signal that, on average, continuing to attempt concealment beyond 60 milliseconds does more harm than good. Whenever a transition is made between signals from different sources, it is important that the transition not introduce discontinuities, audible as clicks, or unnatural artifacts into the output signal. These transitions occur in several places: 1.
At the beginning of the erasure, at the boundary between the start of the synthetic signal and the tail of the last good frame. 2. At the end of the erasure, at the boundary between the synthetic signal and the start of the signal in the first good frame after the erasure. 3. Whenever the number of pitch periods used from the history buffer 240 is increased to add variation to the signal. 4. At the boundaries between repeated portions of the history buffer 240.
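The attenuation schedule described above (no attenuation for the first 10 milliseconds, a linear ramp of 20% per 10 milliseconds thereafter, and silence beyond 60 milliseconds) can be sketched as a simple gain function. The function name and its millisecond argument are illustrative, not from the patent.

```python
def erasure_gain(t_ms):
    """Gain applied to the synthetic signal t_ms into an erasure.

    0 to 10 ms: unity gain; 10 to 60 ms: linear ramp down at 20%
    per additional 10 ms; beyond 60 ms: zero (silence).
    """
    if t_ms <= 10.0:
        return 1.0
    if t_ms >= 60.0:
        return 0.0
    return 1.0 - 0.2 * (t_ms - 10.0) / 10.0
```

Multiplying each synthetic sample by this gain at its position within the erasure yields the ramp-down behavior, reaching exactly zero at the 60 millisecond point.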
To ensure smooth transitions, Overlap-Adds (OLAs) are performed at all signal boundaries. OLAs are a way of smoothly combining two signals that overlap at a boundary. In the region where the signals overlap, the signals are weighted by windows and then added (mixed) together. The windows are designed so that the sum of the weights at any particular sample is equal to 1.
That is, no gain or attenuation is applied to the sum of the signals. In addition, the windows are designed so that the left-hand signal starts at weight 1 and gradually fades to 0, while the right-hand signal starts at weight 0 and gradually increases to weight 1. In this way, in the region to the left of the overlap window only the left signal is present, while in the region to the right of the overlap window only the right signal is present. In the overlap region, the signal gradually makes a transition from the signal on the left to the signal on the right. In the FEC process, triangular windows are used to keep the computational complexity of variable-length windows low, but other windows, such as Hanning windows, can be used instead. Figure 4 shows the synthetic speech at the end of a 20 millisecond erasure being overlap-added with the real speech that begins after the erasure ends. In this example, the OLA weighting window is a triangular window of 5.75 milliseconds. The upper signal is the synthetic signal generated during the erasure, and the overlapping signal below it is the real speech after the erasure. The OLA weighting windows are shown below the signals. Here, because of a pitch change in the real signal during the erasure, the peaks of the synthetic and real signals do not line up, and the discontinuity introduced if the signals are combined without an OLA is shown in the graph marked "Combined without OLA". The "Combined without OLA" graph was created by copying the synthetic signal up to the beginning of the OLA window, and the real signal for the remainder. The result of the OLA operation shows how the discontinuities at the boundaries are smoothed. The previous discussion covers how an illustrative process works on stationary voiced speech, but if the speech is changing rapidly or is unvoiced, the speech may not have a periodic structure.
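The triangular-window OLA described above can be sketched as follows: the left signal's weight ramps from 1 down to 0, the right signal's weight is its complement, and the two weights sum to exactly 1 at every sample, so no net gain or attenuation is applied. The function name is illustrative.

```python
def triangular_ola(left, right):
    """Overlap-add two equal-length overlapping segments.

    `left` fades out (weight 1 -> 0) while `right` fades in
    (weight 0 -> 1); the two weights sum to 1 at every sample,
    as required of the OLA windows in the text.
    """
    n = len(left)
    assert len(right) == n
    out = []
    for i in range(n):
        w = (i + 1) / (n + 1)      # right-signal weight, strictly between 0 and 1
        out.append((1.0 - w) * left[i] + w * right[i])
    return out
```

Because the weights sum to 1, overlap-adding two identical constant segments reproduces the constant exactly, which is one easy way to sanity-check a window design.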
However, these signals are processed in the same way, as described below. First, the smallest pitch period allowed by the pitch estimator in the illustrative embodiment is 5 milliseconds, corresponding to a frequency of 200 Hz. While it is known that some female and child speakers have fundamental frequencies above 200 Hz, the estimate is limited to 200 Hz so that the windows remain relatively large. In this way, within a 10 millisecond erased frame the selected pitch period is repeated a maximum of two times. With high-pitched speakers this does not really degrade the output, since the pitch estimator returns a multiple of the true pitch period. And by not repeating any speech too often, the process does not create synthetic periodic speech out of non-periodic speech. Second, because the number of pitch periods used to generate the synthetic speech is increased as the erasure lengthens, enough variation is added to the signal that periodicity is not introduced during long erasures. It should be noted that the Waveform Similarity Overlap-Add (WSOLA) process for time-scaling speech also uses large fixed-size OLA windows, so that the same process can be used to time-scale both periodic and non-periodic speech signals. While an overview of the illustrative FEC process was given above, the individual steps are discussed in detail below. For the purposes of this discussion, it is assumed that a frame contains 10 milliseconds of speech and the sampling rate is 8 kHz, for example. In this way, erasures occur in increments of 80 samples (8000 * 0.010 = 80). It should be noted that the FEC process is easily adapted to other frame sizes and sampling rates. To change the sampling rate, the time periods given in milliseconds are simply multiplied by 0.001, and then by the sampling rate, to obtain the appropriate buffer sizes.
For example, the history buffer 240 contains at least 48.75 milliseconds of speech. At 8 kHz this means the buffer is (48.75 * 0.001 * 8000) = 390 samples in length. At 16 kHz sampling, it would be double that, or 780 samples. Several of the buffer sizes are based on the lowest frequency the process expects to observe. For example, the illustrative process assumes that the lowest frequency that will be observed at 8 kHz sampling is 66 2/3 Hz. This leads to a maximum pitch period of 15 milliseconds (1/(66 2/3) = 0.015). The length of the history buffer 240 is 3.25 times the period of the lowest frequency, so the history buffer 240 is 15 * 3.25 = 48.75 milliseconds long. If the input filters allow frequencies as low as 50 Hz (a 20 millisecond period) at 16 kHz sampling, the history buffer 240 may have to be extended to 20 * 3.25 = 65 milliseconds. The frame size can also be changed; 10 milliseconds was chosen as the default value since this is the frame size used by several standard speech coders, such as G.729, and is also used in several wireless systems. Changing the frame size is straightforward. If the desired frame size is a multiple of 10 milliseconds, the process remains unchanged: simply leave the erasure process's frame size at 10 milliseconds and invoke it multiple times per frame. If the desired packet frame size is a divisor of 10 milliseconds, such as 5 milliseconds, the FEC process remains basically unchanged; however, the rate at which the number of periods in the pitch buffer is increased will have to be modified based on the number of frames in 10 milliseconds. Frame sizes that are neither multiples nor divisors of 10 milliseconds, such as 12 milliseconds, can also be accommodated. The FEC process is reasonably forgiving of changes in the rate at which the number of pitch periods used from the pitch buffer is increased.
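The millisecond-to-sample conversion described above can be captured in a small helper; the names below are illustrative.

```python
def ms_to_samples(ms, rate_hz):
    """Convert a duration in milliseconds to a sample count:
    multiply by 0.001 and then by the sampling rate."""
    return int(ms * 0.001 * rate_hz)

# History buffer sizing: 3.25 times the longest expected pitch period.
MAX_PITCH_MS = 15                                       # 66 2/3 Hz lowest frequency
history_len = ms_to_samples(MAX_PITCH_MS * 3.25, 8000)  # 390 samples at 8 kHz
```

The same helper gives 780 samples at 16 kHz for the same 48.75 millisecond history, and 80 samples for a 10 millisecond frame at 8 kHz, matching the figures in the text.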
Increasing the number of periods once every 12 milliseconds instead of once every 10 milliseconds will not make much difference. Figure 5 is a block diagram of the FEC process performed by the illustrative embodiment of Figure 2. The sub-steps needed to implement some of the larger operations are further detailed in Figures 7, 12 and 16, and discussed below. In the following discussion several variables are used to hold values and buffers. These variables are summarized below: TABLE 1. VARIABLES AND THEIR CONTENTS As shown in the flow diagram in Figure 5, the process begins and, in step 505, the next frame is received by the lost frame detector 215. In step 510, the lost frame detector 215 determines whether the frame is erased. If the frame is not erased, in step 512 the frame is decoded by the decoder 220. Then, in step 515, the decoded frame is saved in the history buffer 240 for use by the FEC module 230. In this update of the history buffer, the length of the buffer 240 is 3.25 times the length of the longest expected pitch period. At 8 kHz sampling, the longest pitch period is 15 milliseconds, or 120 samples, so the length of the history buffer 240 is 48.75 milliseconds, or 390 samples. Therefore, after each frame is decoded by the decoder 220, the history buffer 240 is updated so that it contains the most recent speech history. The updating of the history buffer 240 is shown in Figure 6. As shown in this figure, the history buffer 240 contains the most recent speech samples on the right and the oldest speech samples on the left. When the newest frame of decoded speech is received, it is shifted into the buffer 240 from the right, with the samples corresponding to the oldest speech shifted out of the buffer on the left (see Figure 6b).
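The history buffer update of Figure 6, in which each decoded frame shifts in from the right while the oldest samples fall off the left, can be sketched as follows; the function name is illustrative.

```python
def update_history(history, frame):
    """Shift a newly decoded frame into the right end of the history.

    The oldest len(frame) samples are discarded from the left, so the
    buffer length is unchanged, as in Figure 6 of the text.
    """
    return history[len(frame):] + frame

history = [0.0] * 390   # 48.75 ms of history at 8 kHz sampling
frame = [1.0] * 80      # one decoded 10 ms frame
history = update_history(history, frame)
```

After the call, the last 80 samples of the buffer hold the newest frame and the buffer is still 390 samples long.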
In addition, in step 520 the delay module 250 delays the speech output by 1/4 of the longest pitch period. At 8 kHz sampling, this is 120 * 1/4 = 30 samples, or 3.75 milliseconds. This delay allows the FEC module 230 to perform a 1/4 wavelength OLA at the beginning of an erasure, to ensure a smooth transition between the real signal before the erasure and the synthetic signal created by the FEC module 230. The output must be delayed because, after a frame is decoded, it is not yet known whether the next frame is erased. In step 525, the audio is sent to the output and, in step 530, the process determines whether any additional frames exist. If there are no more frames, the process ends. If there are more frames, the process returns to step 505 to obtain the next frame. However, if in step 510 the lost frame detector 215 determines that the received frame is erased, the process goes to step 535, where the FEC module 230 conceals the first erased frame; this process is described in detail below in connection with Figure 7. After the first frame is concealed, in step 540 the lost frame detector 215 obtains the next frame. In step 545, the lost frame detector 215 determines whether the next frame is erased. If the next frame is not erased, then in step 555 the FEC module 230 processes the first frame after the erasure; this process is described in detail below in connection with Figure 16. After that frame is processed, the process returns to step 530, where the lost frame detector 215 determines whether any additional frames exist. If in step 545 the lost frame detector 215 determines that the next or subsequent frames are erased, the FEC module 230 conceals the second and subsequent frames according to a process described in detail below in connection with Figure 12. Figure 7 details the steps taken to conceal the first 10 milliseconds of an erasure. These steps are examined in detail below. As can be seen in Figure 7, in step 705, the first operation at the beginning of an erasure is to estimate the pitch. To do this, a normalized autocorrelation is performed on the history buffer 240 signal, with a 20 millisecond (160 sample) window, at tap delays from 40 to 120 samples. At 8 kHz sampling, these delays correspond to pitch periods of 5 to 15 milliseconds, or fundamental frequencies of 200 to 66 2/3 Hz. The tap at the peak of the autocorrelation is the pitch estimate P. Assuming H contains the history, indexed from -1 (the last sample before the erasure) to -390 (the sample 390 samples before the erasure begins), the normalized autocorrelation at tap j can be expressed mathematically as:

Autocor(j) = [ sum_{k=1..160} H[-k] * H[-k-j] ] / sqrt( sum_{k=1..160} H^2[-k-j] )

The peak of the autocorrelation, or the pitch estimate, can be expressed as:

P = argmax Autocor(j), for 40 <= j <= 120

As mentioned above, the lowest allowed pitch period, 5 milliseconds or 40 samples, is large enough that a single pitch period is repeated a maximum of two times within a 10 millisecond erased frame. This avoids artifacts in unvoiced speech, and also avoids unnatural harmonic artifacts with high-pitched speakers. A graphic example of the normalized autocorrelation computation for the erasure of Figure 3 is shown in Figure 8. The waveform marked "History" is the content of the history buffer 240 just before the erasure. The dashed horizontal line shows the reference part of the signal in the history buffer 240, H[-1]:H[-160], which is the 20 milliseconds of speech just before the erasure. The solid horizontal lines are the delayed 20 millisecond windows at taps from 40 samples (the upper line; a 5 millisecond period, 200 Hz frequency) to 120 samples (the lower line; a 15 millisecond period, 66.66 Hz frequency). The output of the correlation is also plotted, graphically aligned with the locations of the windows.
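The pitch search described above can be sketched directly in code; Python's negative list indices conveniently match the negative indexing used for H. This is the full-complexity search, and the normalization by the delayed window's energy is an assumption consistent with the energy term discussed later in the text.

```python
import math

def autocor(h, j, window=160):
    """Normalized autocorrelation of history h at tap delay j.

    h[-1] is the last sample before the erasure. The sum is
    normalized by the energy of the delayed window (a sketch; the
    patent's exact normalization may differ).
    """
    num = sum(h[-k] * h[-k - j] for k in range(1, window + 1))
    energy = sum(h[-k - j] ** 2 for k in range(1, window + 1))
    return num / math.sqrt(energy) if energy > 0.0 else 0.0

def estimate_pitch(h, lo=40, hi=120):
    """Return the tap P in [lo, hi] maximizing the autocorrelation."""
    return max(range(lo, hi + 1), key=lambda j: autocor(h, j))
```

For a 390-sample history containing one clean sinusoid of period 80 samples (100 Hz at 8 kHz sampling), the search returns P = 80.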
The dashed vertical line in the correlation is the peak of the curve and represents the estimated pitch. This line is one period back from the beginning of the erasure. In this case, P is equal to 56 samples, corresponding to a pitch period of 7 milliseconds and a fundamental frequency of 142.9 Hz. To reduce the complexity of the autocorrelation, two special procedures are used. While these shortcuts do not significantly change the output, they have a large impact on the overall run-time complexity of the process; most of the complexity in the FEC process lies in the autocorrelation. First, instead of computing the correlation at every tap, a coarse estimate of the peak is first determined on a decimated signal, and a fine search is then performed in the vicinity of the coarse peak. For the coarse estimate, the Autocor function above is modified into a new function that works on a 2:1 decimated signal and examines only every other tap:

Autocor_coarse(j) = [ sum_{k=1..80} H[-2k] * H[-2k-2j] ] / sqrt( sum_{k=1..80} H^2[-2k-2j] )

P_coarse = 2 * argmax Autocor_coarse(j), for 20 <= j <= 60

Then, using the coarse estimate, the original search is repeated, but only over the interval P_coarse - 1 <= j <= P_coarse + 1. Care must be taken to ensure that j stays within the original interval of 40 to 120 samples. Note that if the sampling rate is increased, the decimation factor should also be increased, so that the total complexity of the process remains approximately constant. Tests have been performed with decimation factors of 8:1 on speech sampled at 44.1 kHz, and good results were obtained. Figure 9 compares the graph of Autocor_coarse with that of Autocor. As can be seen in the figure, Autocor_coarse is a good approximation of Autocor, and the complexity decreases by almost a factor of 4 at 8 kHz sampling: a factor of 2 because only every other tap is examined, and a factor of 2 because, at a given tap, only every other sample is examined.
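The two-stage coarse-then-fine search described above can be sketched as follows. The helper `autocor_at`, its `step` parameter, and the function names are assumptions for illustration; the coarse pass examines every other tap using every other sample, and the fine pass re-searches one tap on either side of the coarse peak.

```python
import math

def autocor_at(h, j, window=160, step=1):
    """Normalized autocorrelation at tap j, using every step-th sample
    of the 160-sample reference window (step=2 gives the 2:1
    decimated coarse version)."""
    ks = range(1, window + 1, step)
    num = sum(h[-k] * h[-k - j] for k in ks)
    energy = sum(h[-k - j] ** 2 for k in ks)
    return num / math.sqrt(energy) if energy > 0.0 else 0.0

def estimate_pitch_coarse_fine(h, lo=40, hi=120):
    """Coarse search on a decimated grid, then a fine search within
    one tap of the coarse peak, clamped to [lo, hi]."""
    # Coarse: every other tap, every other sample (about 4x cheaper).
    coarse = max(range(lo, hi + 1, 2),
                 key=lambda j: autocor_at(h, j, step=2))
    # Fine: full-resolution search in the neighborhood of the peak.
    cands = range(max(lo, coarse - 1), min(hi, coarse + 1) + 1)
    return max(cands, key=lambda j: autocor_at(h, j))
```

On a clean periodic signal the coarse peak lands on or next to the true lag, and the fine pass recovers the exact tap at roughly a quarter of the full search cost.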
The second procedure reduces the complexity of the energy calculation in Autocor and Autocor_coarse. Instead of computing the full sum at each step, a running sum of the energy is maintained. That is:

Energy(j) = sum_{k=1..160} H^2[-k-j]

Then:

Energy(j+1) = sum_{k=1..160} H^2[-k-j-1] = Energy(j) + H^2[-j-161] - H^2[-j-1]

So only 2 multiplies and 2 additions are needed to update the energy term at each step of the FEC process, after the first energy term has been computed. Now that the pitch estimate, P, is available, the waveform to be played during the erasure can be generated. Returning to the flowchart of Figure 7, in step 710 the most recent 3.25 wavelengths (3.25 * P samples) are copied from the history buffer 240, H, to the pitch buffer, B. The contents of the pitch buffer, with the exception of the most recent 1/4 wavelength, remain constant for the duration of the erasure. The history buffer 240, on the other hand, continues to be updated during the erasure with the synthetic speech. In step 715, the most recent 1/4 wavelength (0.25 * P samples) from the history buffer 240 is saved in the last-quarter buffer, L. This 1/4 wavelength is needed for several of the OLA operations. For convenience, the same negative indexing scheme is used to access buffers B and L as was used for the history buffer 240: B[-1] is the last sample before the erasure arrives, B[-2] is the sample before that, and so on. The synthetic speech is placed in the synthetic buffer, S, which is indexed from 0 upward, so S[0] is the first synthesized sample, S[1] is the second, and so on. The contents of the pitch buffer, B, and the last-quarter buffer, L, for the erasure of Figure 3 are shown in Figure 10.
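The running-sum energy update above can be verified numerically: `energy_full` recomputes the windowed energy from scratch, while `energy_next` applies the two-term recurrence (one sample enters the delayed window, one leaves). The function names are illustrative.

```python
def energy_full(h, j, window=160):
    """Energy of the 160-sample window delayed by j taps:
    sum over k = 1..160 of H^2[-k-j]."""
    return sum(h[-k - j] ** 2 for k in range(1, window + 1))

def energy_next(h, j, prev, window=160):
    """Energy(j+1) from Energy(j): add the sample entering the
    delayed window, H^2[-j-161], and subtract the sample leaving
    it, H^2[-j-1] (2 multiplies and 2 additions)."""
    return prev + h[-j - window - 1] ** 2 - h[-j - 1] ** 2
```

Sweeping the recurrence over the full tap range 40 to 120 reproduces the directly computed energies to within floating-point rounding.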
In the previous section the pitch period, P, was computed to be 56 samples. The pitch buffer is thus 3.25 * 56 = 182 samples long, and the last-quarter buffer is 0.25 * 56 = 14 samples long. In the figure, vertical lines have been placed every P samples back from the beginning of the erasure. During the first 10 milliseconds of an erasure, only the last pitch period of the pitch buffer is used, so that in step 720, U = 1. If the speech signal were truly periodic, and the pitch estimate were not an estimate but the exact true value, the waveform could be copied directly from the pitch buffer, B, to the synthetic buffer, S, and the synthetic signal would be smooth and continuous. That is, S[0] = B[-P], S[1] = B[-P+1], and so on. If the pitch period is shorter than the 10 millisecond frame, that is, P < 80, the single pitch period is repeated more than once within the erased frame. In the present example P = 56, so the copying proceeds up to S[56]. The sequence of sample copies near sample 56 would be: S[54] = B[-2], S[55] = B[-1], S[56] = B[-56], S[57] = B[-55], and so on. In practice, the pitch estimate is not exact and the signal may not be truly periodic. To avoid discontinuities (a) at the boundary between the real and synthetic signals, and (b) at the boundary where the period repeats, OLAs are needed. At both boundaries a smooth transition is desired from the end of the real speech, B[-1], to the speech one period back, B[-P]. Therefore, in step 725, this is achieved by overlap-adding (OLA) the 1/4 wavelength before B[-P] with the last 1/4 wavelength of the history buffer 240, that is, the contents of L. Graphically, this is equivalent to taking the last 1 1/4 wavelengths in the pitch buffer, shifting it to the right by one wavelength, and performing an OLA in the 1/4 wavelength overlap region.
In step 730, the result of the OLA is copied to the last 1/4 wavelength of the history buffer 240. To generate additional periods of the synthetic waveform, the pitch buffer is shifted by additional wavelengths and additional OLAs are performed.
Figure 11 shows the OLA operation for the first 2 iterations. In this figure, the vertical line that crosses all the waveforms marks the beginning of the erasure. The short vertical lines are pitch markers, placed P samples apart from the erasure boundary. It should be noted that the overlap region between the "Pitch Buffer" and "Shifted Right by P" waveforms corresponds to exactly the same samples as the overlap region between "Shifted Right by P" and "Shifted Right by 2P". Therefore, the 1/4 wavelength OLA need only be computed once. In step 735, having computed the first OLA and placed the result in the last 1/4 wavelength of the pitch buffer, the procedure for a truly periodic signal can be used to generate the synthetic waveform. Starting at sample B[-P], samples are simply copied from the pitch buffer to the synthesis buffer, wrapping the pitch buffer pointer back to the beginning of the pitch period whenever the end of the pitch buffer is reached. Using this technique, a synthetic waveform of any duration can be generated. The pitch period to the left of the beginning of the erasure in the "Combined with OLAs" waveform of Figure 11 corresponds to the updated contents of the pitch buffer. The "Combined with OLAs" waveform shows that the single-period pitch buffer generates a periodic signal with period P, without discontinuities. This synthetic speech, generated from a single wavelength in the history buffer 240, is used to conceal the first 10 milliseconds of an erasure. The effect of the OLA can be observed by comparing the 1/4 wavelength just before the erasure begins in the "Pitch Buffer" and "Combined with OLAs" waveforms. In step 730, this 1/4 wavelength of the "Combined with OLAs" waveform also replaces the last 1/4 wavelength of the history buffer 240. The OLA operation with triangular windows can also be expressed mathematically.
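The periodic copy with pointer wraparound described above can be sketched as follows. This is a hedged reconstruction; the helper name and buffer layout are assumptions. B is the pitch buffer with B[-1] as the newest sample, and the offset is the negative read index, starting at -P.

```python
def synthesize_frame(B, P, frame_len, offset):
    """Copy frame_len samples out of pitch buffer B, starting at B[offset].

    When the read pointer runs off the end of the buffer it wraps back by
    one pitch period, so the output repeats with period P. The returned
    offset lets the next frame continue in phase.
    """
    out = []
    for _ in range(frame_len):
        out.append(B[offset])
        offset += 1
        if offset == 0:          # ran off the end of the buffer...
            offset = -P          # ...wrap back one pitch period
    return out, offset
```

With P = 56 and an 80-sample frame, starting at offset -56, the pointer wraps once and ends the frame at -32, matching the worked example in the text.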
The variable P4 is first defined as 1/4 of the pitch period in samples; that is, P4 = P >> 2. In the present example P was 56, so P4 is 14. The OLA operation can be expressed over the interval 1 ≤ i ≤ P4 as:

B[-i] = (i/(P4+1)) * L[-i] + ((P4+1-i)/(P4+1)) * B[-i-P]

The result of the OLA replaces the last 1/4 wavelength in the history buffer 240 and in the pitch buffer. Because the history buffer 240 is replaced, the 1/4 wavelength OLA transition will be output when the history buffer 240 is updated, since the history buffer 240 also delays the output by 3.75 milliseconds. The output waveform during the first 10 milliseconds of the erasure can be observed in the region between the first two dashed lines of the "Hidden" waveform of Figure 3.

In step 740, at the end of the synthetic speech generation for the frame, the current offset into the pitch buffer is stored as the variable O. This offset allows the synthetic waveform to be continued into the following frame, for an OLA with the real or synthetic signal of that frame. O also allows the proper phase of the synthetic signal to be maintained if the erasure extends beyond 10 milliseconds. In the present example, with 80-sample frames and P = 56, at the beginning of the erasure the offset is -56. After 56 samples it wraps back to -56. After 80-56 = 24 additional samples, the offset is -56+24 = -32, so O is -32 at the end of the first frame.

In step 745, after the synthesis buffer has been filled from S[0] to S[79], S is used to update the history buffer 240. In step 750, the history buffer 240 also adds the 3.75 millisecond delay. The handling of the history buffer 240 is the same during erased and non-erased frames. At this point, the concealment operation for the first frame in step 535 of Figure 5 ends, and the process proceeds to step 540 in Figure 5.
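The triangular-window OLA over the last quarter wavelength can be sketched as follows. This is a hedged reconstruction: the linear weights i/(P4+1) are assumed from the description of triangular windows, and the helper name is an assumption. L holds the last quarter wavelength of real speech and B is the pitch buffer, both with the newest sample at index -1.

```python
def quarter_wave_ola(B, L, P, periods=1):
    """Blend L into the last P4 samples of B, reading one or more pitch
    periods back, with complementary triangular (linear ramp) weights.

    Returns a modified copy of B; the original is left untouched.
    """
    B = list(B)
    P4 = P >> 2                               # 1/4 of the pitch period
    for i in range(1, P4 + 1):
        w = i / (P4 + 1)                      # weight ramps toward the real signal
        B[-i] = w * L[-i] + (1.0 - w) * B[-i - P * periods]
    return B
```

Near the erasure boundary (small i) the result leans on the signal one period back, so the subsequent copy starting at B[-P] continues smoothly; further back (i near P4) the original real speech dominates.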
Details of how the FEC module 230 operates to conceal the later frames of an erasure, beyond 10 milliseconds, as shown in step 550 of Figure 5, are shown in Figure 12. The technique used to generate the synthetic signal during the second and later erased frames is very similar to that used for the first erased frame, although some additional work must be done to add variation to the signal. In step 1205, the concealment code determines whether the second or third frame is being erased. During the second and third erased frames, the number of pitch periods used from the pitch buffer is increased. This introduces more variation into the signal and keeps the synthesized output from sounding too harmonic. As with all other transitions, an OLA is needed to smooth the boundary when the number of pitch periods is increased. Beyond the third frame (30 milliseconds of erasure), the pitch buffer is kept constant at a length of 3 wavelengths, and these 3 wavelengths generate all the synthetic speech for the remaining duration of the erasure. Thus, the branch on the left of Figure 12 is taken only in the second and third erased frames. Next, in step 1210, the number of wavelengths used from the pitch buffer is increased; that is, U = U + 1.
At the beginning of the second or third erased frame, in step 1215, the synthetic signal from the previous frame is continued for an additional 1/4 wavelength at the beginning of the current frame. For example, at the beginning of the second frame, the synthesized signal in the present example appears as shown in Figure 13. This 1/4 wavelength will be overlap-added with the new synthetic signal that uses the oldest wavelengths from the pitch buffer. At the beginning of the second erased frame, the number of wavelengths used is increased to 2, U = 2. As with the one-wavelength pitch buffer, an OLA must be performed at the boundary where the two-wavelength pitch buffer repeats. This time the 1/4 wavelength that ends U wavelengths back from the tail of the pitch buffer, B, is overlap-added with the contents of the last-quarter buffer, L, in step 1220. This OLA operation can be expressed over the interval 1 ≤ i ≤ P4 as:

B[-i] = (i/(P4+1)) * L[-i] + ((P4+1-i)/(P4+1)) * B[-i-PU]

The only difference from the previous version of this equation is that the constant P used to index B on the right-hand side has been replaced by PU. The creation of the two-wavelength pitch buffer is shown graphically in Figure 14. As in Figure 11, the region of the "Combined with OLAs" waveform to the left of the beginning of the erasure is the updated content of the two-period pitch buffer. Short vertical lines mark the pitch period. Close examination of the consecutive peaks in the "Combined with OLAs" waveform shows that the peaks alternate between those one and two wavelengths back before the start of the erasure. At the start of the synthetic output in the second frame, the signal coming from the new pitch buffer must be merged with the 1/4 wavelength generated in Figure 13. It is desired that the synthetic signal coming from the new pitch buffer come from the oldest portion of the buffer in use.
But care must be taken that the new part comes from a similar portion of the waveform, or audible artifacts will be created when the signals are mixed. In other words, the correct phase must be maintained, or the waveforms may interfere destructively when mixed. This is achieved in step 1225 (Figure 12) by subtracting P from the offset saved at the end of the previous frame, O, until it points into the oldest wavelength of the used portion of the pitch buffer. For example, in the first erased frame the valid index range for the pitch buffer, B, was -1 to -P, so the O saved from the first erased frame must be in this range. In the second erased frame, the valid range is -1 to -2P, so P is subtracted from O until O is in the range -2P ≤ O < -P. More generally, P is subtracted from O until it is in the range -U*P ≤ O < -(U-1)*P. In the present example, P = 56 and O = -32 at the end of the first erased frame. Subtracting 56 from -32 produces -88. Thus, the first synthesis sample in the second frame comes from B[-88], the next from B[-87], and so on. The OLA mixing of the synthetic signals from the one- and two-period pitch buffers at the beginning of the second erased frame is shown in Figure 15. It should be noted that by subtracting P from O, the phase of the waveform is preserved, and the signal peaks in the "Pitch Buffer 1P" and "Pitch Buffer 2P" waveforms are aligned. The "Combined with OLA" waveform also shows a smooth transition between the different pitch buffers at the beginning of the second erased frame. One more operation is required before the second frame of the "Combined with OLA" waveform of Figure 15 can be output. In step 1230 (Figure 12), the new offset is used to copy the 1/4 wavelength from the pitch buffer into a temporary buffer. In step 1235, 1/4 wavelength is added to the offset.
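The offset re-basing step described above can be sketched as a small helper (a hedged reconstruction; the function name is an assumption). P is subtracted from the saved offset O until it falls in the range -U*P ≤ O < -(U-1)*P, i.e. into the oldest pitch period of the portion of the pitch buffer now in use.

```python
def rebase_offset(O, P, U):
    """Move offset O back by whole pitch periods P until it points into
    the oldest of the U pitch periods in use, preserving phase."""
    while O >= -(U - 1) * P:
        O -= P
    return O
```

Because only whole multiples of P are subtracted, the phase of the synthetic waveform is unchanged; with P = 56 and O = -32 at the end of the first erased frame, the second frame starts reading at B[-88], as in the text.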
Then, in step 1240, the temporary buffer is overlap-added with the start of the output buffer, and the result is placed in the first 1/4 wavelength of the output buffer. In step 1245, the offset is then used to generate the rest of the signal in the output buffer: the pitch buffer is copied to the output buffer for the duration of the 10 millisecond frame. In step 1250, the current offset into the pitch buffer is stored as the variable O.

During the second and later erased frames, the synthetic signal is attenuated in step 1255 with a linear ramp. The synthetic signal is attenuated gradually until, beyond 60 milliseconds, it is set to 0, or silence. As the erasure lengthens, the concealed speech is more likely to diverge from the true signal. Holding a given sound for too long, even a sound that appears natural in isolation for a short period of time, can lead to unnatural audible artifacts at the output of the concealment process. To avoid these artifacts in the synthetic signal, a slow attenuation, or fade-out, is used. A similar operation is performed in the concealment processes found in standard speech coders, such as G.723.1, G.728 and G.729. The FEC process attenuates the signal at 20% per 10 millisecond frame, starting at the second frame. If S, the synthesis buffer, contains the synthetic signal before attenuation, and F is the number of consecutive erased frames (F = 1 for the first erased frame, 2 for the second, and so on), then the attenuation can be expressed as:

S'[i] = (1 - 0.2*(F-2) - 0.2*(i/80)) * S[i]

over the interval 0 ≤ i ≤ 79 and 2 ≤ F ≤ 6. For example, in the samples at the beginning of the second erased frame F = 2, so that F-2 = 0 and 0.2/80 = 0.0025, giving S'[0] = 1.0*S[0], S'[1] = 0.9975*S[1], S'[2] = 0.995*S[2], and S'[79] = 0.8025*S[79]. Beyond the sixth erased frame, the output is simply set to 0.
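The linear attenuation ramp can be sketched as follows (the function name is an assumption). The weights follow from the worked examples: frame F of the erasure starts 20%*(F-2) down and loses a further 20% across its 80 samples; the first erased frame (F = 1) is not attenuated, and frames beyond the sixth are silence.

```python
def attenuate_frame(S, F, frame_len=80):
    """Apply S'[i] = (1 - 0.2*(F-2) - 0.2*i/frame_len) * S[i] for 2 <= F <= 6.

    Beyond the sixth erased frame the output is silence. Callers are
    assumed not to invoke this for F = 1 (no attenuation in frame 1).
    """
    if F > 6:
        return [0.0] * frame_len
    return [(1.0 - 0.2 * (F - 2) - 0.2 * i / frame_len) * S[i]
            for i in range(frame_len)]
```

At the last sample of the sixth frame the gain reaches 1 - 0.2*4 - 0.2*(79/80) ≈ 0, so the fade ends in silence at the 60 millisecond mark, as the text states.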
After the synthetic signal is attenuated in step 1255, it is given to the history buffer 240 in step 1260, and the output is delayed, in step 1265, by 3.75 milliseconds. The offset pointer O is also updated to its location in the pitch buffer at the end of the second frame, so that the synthetic signal can be continued in the following frame. The process then returns to step 540 to obtain the next frame.
If the erasure lasts beyond two frames, the processing in the third frame is exactly like that of the second frame, except that the number of periods in the pitch buffer is increased from 2 to 3 instead of from 1 to 2. While the present exemplary erasure ends after two frames, the three-period pitch buffer that would be used in the third frame and beyond is shown in Figure 17. Beyond the third frame, the number of periods in the pitch buffer remains fixed at three, so only the path on the right side of Figure 12 is taken. In this case, the offset pointer O is simply used to copy the pitch buffer to the synthetic output, and no overlap-add operations are necessary.

The operation of the FEC module 230 in the first good frame after an erasure is detailed in Figure 16. At the end of an erasure, a smooth transition is needed between the synthetic speech generated during the erasure and the real speech. If the erasure was only one frame long, in step 1610, the synthetic speech is continued for 1/4 wavelength and an overlap-add with the real speech is performed. If the FEC module 230 determines in step 1620 that the erasure was longer than 10 milliseconds, mismatches between the synthetic and real signals are more likely, so in step 1630 the generation of synthetic speech is continued and the OLA window is increased by an additional 4 milliseconds per erased frame, up to a maximum of 10 milliseconds. If the pitch estimate was slightly off, or the pitch of the real speech changed during the erasure, the probability of a phase mismatch between the synthetic and real signals increases with the erasure length. Longer OLA windows force the synthetic signal to fade out, and the real speech signal to fade in, more slowly.
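One possible reading of the recovery-window rule above can be sketched as follows. This is an assumption-laden sketch: the base window length in milliseconds and the treatment of the single-frame case are not fully specified in the text (a single-frame erasure uses a 1/4 wavelength window rather than a fixed millisecond count), so the base value here is a hypothetical parameter.

```python
def recovery_ola_ms(erased_frames, base_ms=4.0):
    """Length in ms of the OLA window at the first good frame after an
    erasure: base window for a one-frame erasure, growing by 4 ms per
    additional erased frame, capped at 10 ms."""
    if erased_frames <= 1:
        return base_ms
    return min(base_ms + 4.0 * (erased_frames - 1), 10.0)
```

The cap means that very long erasures all recover with the same 10 millisecond cross-fade, while a short erasure still snaps back to the real signal quickly.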
If the erasure was longer than 10 milliseconds, it is also necessary to attenuate the synthetic speech, in step 1640, before the OLA can be performed, so that it matches the signal level of the previous frame. In step 1650, an OLA is performed on the contents of the output buffer (synthetic speech) with the start of the new input frame, and the start of the input buffer is replaced with the result of the OLA. The OLA at the end of the erasure for the previous example can be observed in Figure 4, and the complete output of the concealment process for the previous example can be observed in the "Hidden" waveform of Figure 3. In step 1660, the history buffer is updated with the contents of the input buffer. In step 1670, the speech output is delayed by 3.75 milliseconds and the process returns to step 530 in Figure 5 to obtain the next frame.

With a small adjustment, the FEC process can be applied to other speech coders that maintain state information between samples or frames and do not provide built-in concealment, such as G.726. The FEC process is used exactly as described in the previous section to generate the synthetic waveform during the erasure. However, care must be taken to ensure that the internal state variables of the coder track the synthetic speech generated by the FEC process. Otherwise, after the erasure is over, artifacts and discontinuities will appear in the output, since the decoder restarts from an erroneous state. While the OLA window at the end of an erasure helps, more must be done. Better results can be obtained, as shown in Figure 18, by converting the decoder 1820 into an encoder 1860 for the duration of the erasure, using the synthesized output of the FEC module 1830 as the input to the encoder 1860. In this way, the state variables of the decoder 1820 will track the concealed speech. It should be noted that, contrary to a typical encoder, the encoder 1860 is run only to maintain state information and its output is not used.
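The state-tracking arrangement described above can be sketched structurally as follows. The class and method names are hypothetical interfaces, not the patent's: a companion encoder consumes the concealment output during an erasure purely for its side effect on the shared coder state, and its encoded output is discarded.

```python
class StateTrackingDecoder:
    """Wraps a decoder, an FEC concealment module, and a companion encoder
    that keeps the coder state consistent across erasures."""

    def __init__(self, decoder, encoder, fec):
        self.decoder, self.encoder, self.fec = decoder, encoder, fec

    def process_frame(self, packet):
        if packet is not None:
            return self.decoder.decode(packet)       # normal decode path
        synthetic = self.fec.conceal()               # generate synthetic speech
        self.encoder.encode(synthetic)               # output discarded; state only
        return synthetic
```

Because the encoder's bitstream is never used, a simplified, low-complexity encoder suffices, as the text notes below.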
In this way, shortcuts can be taken to significantly reduce its run-time complexity. As stated above, the invention provides many advantages and aspects. In particular, as a frame erasure progresses, the number of pitch periods used from the signal history to generate the synthetic signal is increased as a function of time. This significantly reduces harmonic artifacts over extended erasures.
Even though the pitch periods are not reproduced in their original order, the output still sounds natural. With G.726 and other coders that maintain state information between samples or frames, the decoder can be run as an encoder on the synthesized output of the concealment process. In this way, the internal state variables of the decoder will track the output, avoiding, or at least decreasing, the discontinuities caused by erroneous state information in the decoder after the erasure has ended. Since the output of the encoder is never used (its sole purpose is to maintain state information), a simplified, low-complexity version of the encoder can be used.

The minimum pitch period allowed in the exemplary embodiments (40 samples, or 200 Hz) is longer than the expected fundamental period for some female and child speakers. Thus, for high-pitched speakers, more than one pitch period is used to generate the synthetic speech, even at the beginning of the erasure. With speakers of high fundamental frequency, the waveforms repeat frequently, and using multiple pitch periods in the synthetic signal makes harmonic artifacts less likely. This technique also helps maintain the natural sound of the signal during unvoiced segments of speech, as well as in rapid transition regions, such as a stop.

The OLA window at the end of the first good frame after an erasure grows with the length of the erasure. With longer erasures, phase mismatches are more likely when the next good frame arrives. Stretching the OLA window as a function of the erasure length reduces the signal distortion caused by phase shifts over extended erasures, but still allows the signal to recover quickly if the erasure is short.
The FEC process of the invention also uses variable length OLA windows that are a small fraction of the estimated pitch period, 1/4 wavelength, and are not aligned with the pitch peaks.
The FEC process of the invention does not distinguish between voiced and unvoiced speech. Instead, it performs well in reproducing unvoiced speech because of two attributes of the process: (A) the minimum window size is reasonably large, so that even unvoiced regions of speech have reasonable variation; and (B) the length of the pitch buffer is increased as the process progresses, again ensuring that harmonic artifacts are not introduced. It should be noted that the use of windows large enough to avoid treating voiced and unvoiced speech differently is also present in the well-known WSOLA time-scaling technique. While the addition of the delay to allow the OLA at the beginning of an erasure may be considered an undesirable aspect of the process of the invention, it is necessary to ensure a smooth transition between the real and synthetic signals at the beginning of the erasure.

While this invention has been described in conjunction with the specific embodiments described above, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, it is intended that the preferred embodiments of the invention as described above be illustrative and not limiting. Various changes may be made without departing from the spirit and scope of the invention as defined in the following claims.
It is noted that, as of this date, the best method known to the applicant for carrying out the aforementioned invention is that which is clear from the present description of the invention.

Claims (1)

CLAIMS Having described the invention as above, the content of the following claims is claimed as property:
1. A method for maintaining state information in a speech decoder that experiences unavailability of packets containing speech information encoded by a speech encoder, the decoder state information reflecting the received encoded speech information, the method characterized in that it comprises the steps of: decoding the received packets to form a speech signal; determining that one or more packets are unavailable to the decoder; synthesizing a speech signal corresponding to the unavailable packets; encoding the synthesized speech signal; and providing signals reflecting the encoded, synthesized speech to the decoder for use in maintaining the decoder state information.
Applications Claiming Priority (1)

US 60/130,016, priority date 1999-04-19

Publications (1)

MXPA00012579A, published 2001-09-07
