HK1123621B - Sub-band voice codec with multi-stage codebooks and redundant coding - Google Patents
Description
Technical Field
The described tools and techniques relate to audio codecs, and more particularly to subband coding, codebooks, and/or redundant coding.
Background
With the advent of digital wireless telephone networks, streaming audio over the internet, and internet telephony, digital processing and transmission of speech have become commonplace. Engineers use a variety of techniques to process speech efficiently while ensuring quality. Understanding these techniques requires understanding how audio information is represented and processed in a computer.
I. Representation of audio information in a computer
A computer processes audio information as a series of numbers representing the audio. A single number can represent an audio sample, which is the amplitude of the signal at a particular time. Several factors affect the quality of the audio, including sample depth and sampling rate.
The sample depth (or precision) indicates the range of numbers used to represent a sample. The more possible values per sample, the higher the output quality generally is, since more subtle variations in amplitude can be represented. An 8-bit sample has 256 possible values, while a 16-bit sample has 65,536 possible values.
The sampling rate (typically measured as samples per second) also affects quality. The higher the sampling rate, the higher the quality because higher frequency sounds can be represented. Some common sampling rates are 8,000, 11,025, 22,050, 32,000, 44,100, 48,000, and 96,000 samples per second (Hz). Table 1 shows a plurality of audio formats with different quality levels, and their corresponding raw bit rate costs.
Table 1: Bit rates for audio of different quality levels
As table 1 shows, the cost of high quality audio is a high bit rate. High quality audio information consumes large amounts of computer storage and transmission capacity. Many computers and computer networks lack the resources to process raw digital audio. Compression (also called encoding) reduces the cost of storing and transmitting audio information by converting the information into a lower bit rate form. Compression may be lossless (in which quality does not suffer) or lossy (in which quality suffers, but the bit rate reduction from subsequent lossless compression is more dramatic). Decompression (also called decoding) extracts a reconstructed version of the original information from the compressed form. A codec is an encoder/decoder system.
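The raw bit rate costs referenced in table 1 follow directly from sample depth, sampling rate, and channel count. As a rough sketch (the two example formats are illustrative, not taken from the table):

```python
def raw_bit_rate(sample_depth_bits, sampling_rate_hz, channels):
    """Raw (uncompressed) bit rate in bits per second."""
    return sample_depth_bits * sampling_rate_hz * channels

# 8 kHz, 8-bit, mono: telephone-quality speech
print(raw_bit_rate(8, 8000, 1))      # 64,000 bps
# 44.1 kHz, 16-bit, stereo: CD-quality audio
print(raw_bit_rate(16, 44100, 2))    # 1,411,200 bps
```

The two results illustrate the table's point: raw CD-quality audio costs more than twenty times the bit rate of raw telephone-quality speech.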
II. Speech encoders and decoders
One goal of audio compression is to digitally represent an audio signal so as to provide the best signal quality for a given number of bits. Stated differently, the goal is to represent the audio signal with the fewest bits for a given quality level. Other goals, such as resilience to transmission errors and limiting the overall delay due to encoding/transmission/decoding, also apply in some scenarios.
Different types of audio signals have different characteristics. Music is characterized by a wide range of frequencies and amplitudes, and typically contains two or more channels. Speech, on the other hand, is characterized by a small range of frequencies and amplitudes, and is typically represented within one channel. Specific codecs and processing techniques are applicable to music and general audio; other codecs and their processing techniques are suitable for speech.
One type of conventional speech codec uses linear prediction to achieve compression. Such speech encoding comprises several stages. The encoder finds and quantizes coefficients for a linear prediction filter, which predicts each sample value as a linear combination of previous sample values. The residual signal (denoted the "excitation" signal) represents the portion of the original signal not accurately predicted by the filter. In addition, speech codecs use different compression techniques for voiced segments (characterized by vocal chord vibration), unvoiced segments, and silent segments, because different kinds of speech have different characteristics. Voiced segments typically exhibit highly repetitive voicing patterns, even in the residual domain. For voiced segments, the encoder achieves further compression by comparing the current residual signal to previous residual cycles and encoding the current residual in terms of delay or lag information relative to the previous cycles. The encoder handles other differences between the original signal and the predicted, encoded representation using specially designed codebooks.
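The short-term prediction step described above can be illustrated with a toy predictor. This is only a sketch of the idea, not any standardized codec's analysis; the filter coefficients and input samples below are invented for illustration:

```python
def lp_residual(samples, coeffs):
    """Excitation (residual) left after short-term linear prediction.

    Each sample is predicted as a linear combination of previous
    samples; the residual is the part the filter fails to predict.
    """
    order = len(coeffs)
    residual = []
    for n, x in enumerate(samples):
        predicted = sum(coeffs[k] * samples[n - 1 - k]
                        for k in range(order) if n - 1 - k >= 0)
        residual.append(x - predicted)
    return residual

# A linear ramp is perfectly predicted by x[n] = 2*x[n-1] - x[n-2],
# so the residual collapses to (nearly) zero after the first sample.
print(lp_residual([1, 2, 3, 4, 5], [2.0, -1.0]))
```

A real encoder would additionally quantize the filter coefficients and then encode the residual itself with the adaptive and fixed codebooks discussed above.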
Many speech codecs exploit temporal redundancy in the signal in some way. As mentioned above, one common approach uses long-term prediction of pitch parameters to predict the current excitation signal in terms of a delay or lag relative to previous excitation cycles. Exploiting temporal redundancy can greatly improve compression efficiency in terms of quality and bit rate, but at the cost of introducing memory dependence into the codec: the decoder relies on a previously decoded part of the signal to correctly decode another part of the signal. Many efficient speech codecs have significant memory dependence.
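The long-term predictor just described amounts to searching past excitation for the lag whose segment best matches the current one. A minimal brute-force version of that search (the lag range and periodic test signal are made up; real codecs use more refined criteria than raw squared error):

```python
def best_lag(history, segment, min_lag, max_lag):
    """Find the lag into past excitation that best matches `segment`
    (least squared error), as a long-term pitch predictor would."""
    best, best_err = min_lag, float("inf")
    for lag in range(min_lag, max_lag + 1):
        start = len(history) - lag
        past = history[start:start + len(segment)]
        err = sum((a - b) ** 2 for a, b in zip(past, segment))
        if err < best_err:
            best, best_err = lag, err
    return best

# Excitation history that repeats with period 5: the search recovers lag 5.
hist = [0, 1, 2, 1, 0] * 4
print(best_lag(hist, [0, 1, 2, 1], 2, 10))  # 5
```

This also makes the memory dependence concrete: if the decoder has lost the samples of `hist` that the chosen lag points into, the lag value alone is useless.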
Although the speech codecs described above have good overall performance for many applications, they have several drawbacks. In particular, several drawbacks surface when the codecs are used in conjunction with dynamic network resources. In such scenarios, encoded speech may be lost because of a temporary bandwidth shortage or other problems.
A. Narrowband and wideband codecs
Many standard speech codecs were designed for narrowband signals with an 8 kHz sampling rate. While an 8 kHz sampling rate is adequate in many situations, higher sampling rates may be used in other situations, for example to represent higher frequencies.
Speech signals with a sampling rate of at least 16 kHz are typically called wideband speech. While wideband codecs are well suited to representing high-frequency speech patterns, they typically require higher bit rates than narrowband codecs. Such high bit rates may not be feasible in some types of networks or under some network conditions.
B. Inefficient memory dependence under dynamic network conditions
When encoded speech is missing, for example because it is dropped, delayed, corrupted, or otherwise made unusable in transit, the performance of a speech codec can suffer because of memory dependence on the lost information. The loss of information about an excitation signal hampers reconstruction of later excitation signals that depend on it. If a previous excitation cycle is lost, the lag information is useless, since it points to information the decoder does not have. Another example of memory dependence is filter coefficient interpolation (used to smooth the transitions between different synthesis filters, especially for voiced signals). If the filter coefficients for a frame are lost, the filter coefficients for subsequent frames may have incorrect values.
Decoders use various techniques to conceal errors due to packet losses and other information losses, but these concealment techniques rarely conceal the errors fully. For example, the decoder may repeat previous parameters or estimate parameters based on correctly decoded information. Lag information, however, can be very sensitive, and prior techniques are not especially effective at concealing its loss.
In most cases, the decoder eventually recovers from errors due to lost information. As packets are received and decoded, parameters are gradually adjusted toward their correct values. Quality is likely to be degraded, however, until the decoder can recover the correct internal state. In many of the most efficient speech codecs, playback quality is degraded for an extended period of time (e.g., up to a second), causing high distortion and often rendering the speech unintelligible. Recovery is faster when a significant change occurs, such as a silent frame, because such a frame provides a natural reset point for many parameters. Some codecs are more robust to packet losses because they remove inter-frame dependencies. However, such codecs require significantly higher bit rates to achieve the same voice quality as a traditional CELP codec with inter-frame dependencies.
Given the importance of compression and decompression to representing speech signals in computer systems, it is not surprising that speech compression and decompression have attracted substantial research and standardization activity. Whatever the advantages of prior techniques and tools, however, they do not have the advantages of the techniques and tools described herein.
Disclosure of Invention
In general, the detailed description relates to various techniques and tools for audio codecs, and more particularly to tools and techniques related to subband coding, audio codec codebooks, and/or redundant coding. The described embodiments implement one or more of the described techniques and tools, including but not limited to the following:
In one aspect, a bitstream for an audio signal includes primary coded information for a current frame, which references a segment of a previous frame for use in decoding the current frame, and redundant coded information for decoding the current frame. The redundant coded information includes signal history information associated with the referenced segment of the previous frame.
In another aspect, a bitstream for an audio signal includes primary coded information for a current coding unit, which references a segment of a previous coding unit for use in decoding the current coding unit, and redundant coded information for decoding the current coding unit. The redundant coded information includes one or more parameters for one or more additional codebook stages, to be used in decoding the current coding unit only if the previous coding unit is not available.
In another aspect, a bitstream includes a plurality of coded audio units, and each coded unit includes a field. The field indicates whether the coded unit includes primary coded information representing a segment of an audio signal, and whether the coded unit includes redundant coded information for use in decoding primary coded information.
In another aspect, an audio signal is decomposed into a plurality of frequency subbands. Each subband is encoded according to a code-excited linear prediction model. The bitstream may include a plurality of coding units each representing a segment of the audio signal, where the plurality of coding units includes a first coding unit representing a first plurality of frequency subbands and a second coding unit representing a second plurality of frequency subbands that differs from the first plurality of subbands, such as because of dropped subband information for the first coding unit or the second coding unit. A first subband may be encoded according to a first encoding mode, and a second subband may be encoded according to a different, second encoding mode. The first and second encoding modes may use different numbers of codebook stages. Each subband may be encoded separately. In addition, a real-time speech encoder may process the bitstream, including decomposing the audio signal into the plurality of frequency subbands and encoding the plurality of frequency subbands. Processing the bitstream may also include decoding the plurality of frequency subbands and synthesizing the plurality of frequency subbands.
In another aspect, a bitstream for an audio signal includes parameters for a first set of codebook stages representing a first segment of the audio signal, the first set of codebook stages including a first set of plural fixed codebook stages. The first set of plural fixed codebook stages may include plural random fixed codebook stages. The fixed codebook stages may include a pulse codebook stage and a random codebook stage. The first set of codebook stages may further include an adaptive codebook stage. The bitstream may further include parameters for a second set of codebook stages representing a second segment of the audio signal, the second set having a different number of codebook stages than the first set. The number of codebook stages in the first set may be selected based on one or more factors, including one or more characteristics of the first segment of the audio signal. The number of codebook stages in the first set may also be selected based on one or more factors including network transmission conditions between an encoder and a decoder. The bitstream may include a separate codebook index and a separate gain for each of the plural fixed codebook stages. The separate gains facilitate signal matching, and the separate codebook indices can simplify codebook searching.
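The multi-stage structure in this aspect can be pictured as summing per-stage codevector contributions, each scaled by its own gain. A minimal sketch (the vectors and gain values below are invented for illustration and are not drawn from any codebook):

```python
def build_excitation(stages):
    """Sum gain-scaled codevectors from successive codebook stages.

    `stages` is a list of (gain, codevector) pairs: e.g., an adaptive
    codebook stage followed by one or more fixed (pulse/random) stages.
    """
    length = len(stages[0][1])
    excitation = [0.0] * length
    for gain, vector in stages:
        for i in range(length):
            excitation[i] += gain * vector[i]
    return excitation

# Adaptive stage + pulse stage + random stage (made-up values)
exc = build_excitation([
    (0.8, [1.0, 0.5, -0.5, 0.0]),   # adaptive (pitch) contribution
    (0.3, [0.0, 1.0, 0.0, -1.0]),   # pulse fixed codebook stage
    (0.1, [0.2, -0.2, 0.4, 0.1]),   # random fixed codebook stage
])
print(exc)
```

Because each stage carries its own gain and index, stages can be added or dropped per segment, which is what allows the number of codebook stages to differ between the first and second sets described above.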
In another aspect, a bitstream includes, for each of plural units parameterizable with an adaptive codebook, a field indicating whether adaptive codebook parameters are used for that unit. A unit may be a subframe of a frame of the audio signal. An audio processing tool, such as a real-time speech encoder, may process the bitstream, including determining at each unit whether to use adaptive codebook parameters. Determining whether to use adaptive codebook parameters may include determining whether an adaptive codebook gain is above a threshold. It may also include evaluating one or more characteristics of the frame, and/or evaluating one or more characteristics of network transmissions between an encoder and a decoder. The field may be a one-bit flag for each voiced unit. In particular, the field may be a one-bit flag for each subframe of a voiced frame of an audio signal, while other types of frames need not include the field.
Various techniques and tools may be used in combination or independently.
Other features and advantages will become apparent from the following detailed description of various embodiments, which proceeds with reference to the accompanying drawings.
Drawings
FIG. 1 is a block diagram of a suitable computing environment in which one or more of the described embodiments may be implemented.
FIG. 2 is a block diagram of a network environment in connection with which one or more described embodiments may be implemented.
FIG. 3 is a diagram depicting a set of frequency responses for a subband structure that may be used for subband coding.
FIG. 4 is a block diagram of a real-time speech band encoder in conjunction with which one or more of the described embodiments may be implemented.
FIG. 5 is a flow diagram depicting codebook parameter determination in an implementation.
FIG. 6 is a block diagram of a real-time speech band decoder in conjunction with which one or more of the described embodiments may be implemented.
FIG. 7 is a graphical representation of an excitation signal history including re-encoded portions of a current frame and a previous frame.
FIG. 8 is a flow diagram depicting codebook parameter determination for an additional random codebook stage in an implementation.
Fig. 9 is a block diagram of a real-time speech band decoder using an additional random codebook stage.
Fig. 10 is a diagram of bitstream formats for frames, including information for different redundant coding techniques, that may be used with some embodiments.
Fig. 11 is a diagram of bitstream formats for packets, including frames having redundant coding information, that may be used with some embodiments.
Detailed Description
The described embodiments relate to techniques and tools for processing audio information in encoding and decoding. The use of these techniques can improve the quality of speech obtained from speech codecs such as real-time speech codecs. Such improvements may be the result of the use of various techniques and tools, separately or in combination.
These techniques and tools may include encoding and/or decoding of subbands using linear prediction techniques, such as CELP.
The techniques may also include multiple stages with fixed codebooks, including pulse and/or random fixed codebooks. The number of codebook stages can be varied to provide the best quality for a given bit rate. In addition, the adaptive codebook may be turned on or off depending on factors such as the desired bit rate and characteristics of the current frame or subframe.
In addition, a frame may include redundant coding information for some or all of a previous frame on which the current frame depends. The decoder can use this information to decode the current frame if the previous frame is lost, without requiring that the entire previous frame be transmitted multiple times. This information may be encoded at the same bit rate as the current or previous frame, or at a lower bit rate. Moreover, the information may include random codebook information that approximates the desired portion of the excitation signal, rather than a complete re-encoding of that portion.
Although the operations of the various techniques are described in a particular order for purposes of illustration, it should be understood that this method of description encompasses alternative rearrangements in the order of operations, unless a particular order is required. For example, the operations described subsequently may be rearranged or performed concurrently in some cases. Moreover, for the sake of simplicity, the flow diagrams do not illustrate the various ways in which particular techniques may be used in conjunction with other techniques.
I. Computing environment
FIG. 1 illustrates a generalized example of a suitable computing environment (100) in which one or more of the described embodiments may be implemented. The computing environment (100) is not intended to suggest any limitation as to the scope of use or functionality of the invention, as the invention may be implemented in diverse general-purpose or special-purpose computing environments.
Referring to fig. 1, the computing environment (100) includes at least one processing unit (110) and memory (120). In fig. 1, this most basic configuration (130) is shown within the dashed line. The processing unit (110) executes computer-executable instructions and may be a real or a virtual processor. In a multi-processing system, multiple processing units execute computer-executable instructions to increase processing power. The memory (120) may be volatile memory (e.g., registers, cache, RAM), non-volatile memory (e.g., ROM, EEPROM, flash memory, etc.), or some combination of the two. The memory (120) stores software (180) implementing subband coding, multi-stage codebook, and/or redundant coding techniques for a speech encoder or decoder.
The computing environment (100) may have additional features. In FIG. 1, a computing environment (100) includes storage (140), one or more input devices (150), one or more output devices (160), and one or more communication connections (170). An interconnection mechanism (not shown), such as a bus, controller, or network, interconnects the components of the computing environment (100). Typically, operating system software (not shown) provides an operating environment for other software executing in the computing environment (100) and coordinates activities of the components of the computing environment (100).
The storage (140) may be removable or non-removable, and may include magnetic disks, magnetic tapes or cassettes, CD-ROMs, CD-RWs, DVDs, or any other medium which can be used to store information and which can be accessed within the computing environment (100). The storage (140) stores instructions of the software (180).
The input device (150) may be a touch input device such as a keyboard, mouse, pen, or trackball, a voice input device, a scanning device, a network adapter, or another device that provides input to the computing environment (100). For audio, the input device (150) may be a sound card, a microphone or other device that accepts audio input in analog or digital format, or a CD/DVD reader that provides audio samples to the computing environment (100). The output device (160) may be a display, a printer, a speaker, a CD/DVD writer, a network adapter, or another device that provides output from the computing environment (100).
The communication connection (170) may enable communication with another computing entity via a communication medium. The communication medium conveys information such as computer-executable instructions, compressed voice information, or other modulated data signals. A modulated data signal refers to a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired or wireless techniques implemented with an electrical, optical, RF, infrared, acoustic, or other carrier.
The invention may be described in the general context of computer-readable media. Computer readable media are any available media that can be accessed in a computing environment. By way of example, and not limitation, in connection with the computing environment (100), computer-readable media include memory (120), storage (140), communication media, and any combination thereof.
The invention may be described in the general context of computer-executable instructions, such as those included in program modules, being executed in a computing environment on a target real or virtual processor. Generally, program modules include routines, programs, libraries, objects, classes, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The functionality of the program modules may be combined or split between program modules as desired in various embodiments. Computer-executable instructions for program modules may be executed within a local or distributed computing environment.
For the sake of presentation, the detailed description uses terms like "determine," "generate," "adjust," and "apply" to describe computer operations in a computing environment. These terms are high-level abstractions for operations performed by a computer, and should not be confused with acts performed by a human being. The actual computer operations corresponding to these terms vary depending on the implementation.
II. Generalized network environment and real-time speech codec
FIG. 2 is a block diagram of a generalized network environment (200) in conjunction with which one or more of the described embodiments may be implemented. A network (250) separates various encoder-side components from various decoder-side components.
The main functions of the components on the encoder side and the decoder side are speech encoding and decoding, respectively. On the encoder side, an input buffer (210) accepts and stores speech input (202). The speech encoder (230) takes the speech input (202) from the input buffer (210) and encodes it.
More specifically, a frame splitter (212) splits the samples of the speech input (202) into frames. In one implementation, the frames are uniformly 20 ms long: 160 samples for 8 kHz input and 320 samples for 16 kHz input. In other implementations, the frames have different durations, are non-uniform or overlapping, and/or the sampling rate of the input (202) is different. The frames may be organized as superframe/frame, frame/subframe, or other configurations for different stages of encoding and decoding.
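The 20 ms framing described above implies 160 samples per frame at 8 kHz and 320 at 16 kHz. A minimal frame splitter illustrating that arithmetic (non-overlapping fixed frames only; this is a sketch of the idea, not the frame splitter (212) itself):

```python
def split_frames(samples, sampling_rate_hz, frame_ms=20):
    """Split a sample stream into fixed-length, non-overlapping frames."""
    frame_len = sampling_rate_hz * frame_ms // 1000
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

one_second = list(range(16000))          # 1 s of 16 kHz input
frames = split_frames(one_second, 16000)
print(len(frames), len(frames[0]))       # 50 frames of 320 samples each
```

Overlapping or variable-length framing, as mentioned for other implementations, would replace the fixed step with a per-frame hop size.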
A frame classifier (214) classifies the frames according to one or more criteria, such as signal energy, zero crossing rate, long-term prediction gain, gain differential, and/or other criteria evaluated over subframes or whole frames. Based on the criteria, the frame classifier (214) classifies the frames into classes such as silent, unvoiced, voiced, and transition (e.g., unvoiced to voiced). In addition, frames may be classified according to the type of redundant coding, if any, used for the frame. The frame class affects the parameters that will be computed to encode the frame. The frame class may also affect the resolution and loss resiliency with which parameters are encoded, so as to provide more resolution and loss resiliency for more important frame classes and parameters. For example, silent frames are typically encoded at a very low rate, are simple to recover by concealment if lost, and may not need protection against loss. Voiced frames are typically encoded at a somewhat higher rate, are reasonably simple to recover by concealment if lost, and do not need significant protection against loss. Unvoiced and transition frames are typically encoded with more bits, depending on the complexity of the frame and the presence of transitions. Unvoiced and transition frames are also difficult to recover if lost, and so are more significantly protected against loss. Alternatively, the frame classifier (214) uses other and/or additional frame classes.
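A toy classifier using two of the criteria named above, signal energy and zero crossing rate, can make the classification concrete. The thresholds are invented for illustration; a real classifier such as (214) combines more criteria and tuned decision rules:

```python
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings / (len(frame) - 1)

def frame_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)

def classify(frame, energy_floor=1e-4, zcr_voiced=0.25):
    """Very rough class decision: silent / voiced / unvoiced.

    Voiced speech is high-energy and low in zero crossings;
    unvoiced (noise-like) speech crosses zero far more often.
    """
    if frame_energy(frame) < energy_floor:
        return "silent"
    return "voiced" if zero_crossing_rate(frame) < zcr_voiced else "unvoiced"

# 100 Hz sine sampled at 8 kHz: high energy, few zero crossings
voiced_like = [math.sin(2 * math.pi * 100 * n / 8000) for n in range(160)]
print(classify(voiced_like))    # "voiced"
print(classify([0.0] * 160))    # "silent"
```

Transition frames, which this sketch omits, would require comparing criteria across adjacent subframes rather than a single frame-level decision.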
The input speech signal may be split into subband signals before a coding model, such as the CELP coding model, is applied to the subband information for a frame. The split may be implemented using a series of one or more analysis filter banks (e.g., QMF analysis filters) (216). For example, if a 3-band structure is used, the low-frequency band can be split out by passing the signal through a low-pass filter. Likewise, the high-frequency band can be split out by passing the signal through a high-pass filter. The middle band can be split out by passing the signal through a band-pass filter, which can include a low-pass filter and a high-pass filter in series. Alternatively, other types of filter arrangements for subband decomposition and/or timing of filtering (e.g., before frame splitting) may be used. If only a single band is to be decoded for a portion of the signal, that portion bypasses the analysis filter banks (216). For speech signals, CELP encoding typically has higher coding efficiency than ADPCM and MLT.
The number of bands n may be determined by the sampling rate. For example, in one implementation, a single-band structure is used for an 8 kHz sampling rate. For 16 kHz and 22.05 kHz sampling rates, a 3-band structure as shown in fig. 3 may be used. In the 3-band structure of fig. 3, the low-frequency band (310) extends over half of the full bandwidth F (from 0 to 0.5F). The other half of the bandwidth is divided equally between the middle band (320) and the high band (330). Near the crossover points between bands, a band's frequency response gradually decreases from the pass band to the stop band, which is characterized by attenuation of the signal on both sides as the crossover point is approached. Other divisions of the frequency bandwidth may also be used. For example, for a 32 kHz sampling rate, an equally spaced 4-band structure may be used.
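The 3-band split of fig. 3 can be written out as nominal band edges relative to the full bandwidth F = sampling rate / 2. This sketch ignores the gradual crossover shaping described above and treats the edges as hard boundaries:

```python
def three_band_edges(sampling_rate_hz):
    """Nominal (low, mid, high) band edges in Hz for the fig. 3
    structure: the low band spans 0..0.5F; the middle and high
    bands split the remaining half of the bandwidth equally."""
    full = sampling_rate_hz / 2          # full bandwidth F
    return [(0.0, full / 2),             # low band (310): 0 .. 0.5F
            (full / 2, 0.75 * full),     # middle band (320)
            (0.75 * full, full)]         # high band (330)

# For 16 kHz input, F = 8 kHz:
print(three_band_edges(16000))  # [(0, 4000), (4000, 6000), (6000, 8000)]
```

The same function applied to a 22.05 kHz rate gives edges at 5512.5, 8268.75, and 11025 Hz, illustrating that the structure scales with F rather than using fixed frequencies.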
The low-frequency band is typically the most important band for speech signals, because signal energy generally declines toward the higher frequency ranges. Therefore, the low-frequency band is often encoded using more bits than the other bands. Compared with a single-band coding structure, the subband structure is more flexible and allows better control of bit distribution/quantization noise across the frequency bands. Accordingly, it is believed that perceptual voice quality can be improved significantly by using the subband structure.
In fig. 2, each subband is encoded separately, as illustrated by the band encoding components (232, 234). Although the band encoding components (232, 234) are shown separately, the encoding of all the bands may be done by a single encoder, or the bands may be encoded by separate encoders. Such band encoding is described in more detail below with reference to fig. 4. Alternatively, the codec may operate as a single-band codec.
The encoded speech is provided to software for one or more network layers (240) through a multiplexer ("MUX") (236). The network layers (240) process the encoded speech for transmission over the network (250). For example, the network layer software packages the encoded speech information into packets that follow the RTP protocol, which are relayed over a network using UDP, IP, and various physical layer protocols. Alternatively, other and/or additional layers of software or networking protocols are used. The network (250) is a wide-area, packet-switched network such as the internet. Alternatively, the network (250) is a local-area network or another kind of network.
On the decoder side, software for one or more network layers (260) receives and processes the transmitted data. The network, transport, and higher-layer protocols and software in the decoder-side network layers (260) usually correspond to those in the encoder-side network layers (240). The network layers provide the encoded speech information to the speech decoder (270) through a demultiplexer ("DEMUX") (276). The decoder (270) decodes each subband separately, as illustrated by the band decoding modules (272, 274). All the subbands may be decoded by a single decoder, or they may be decoded by separate band decoders.
The decoded subbands are then synthesized in a series of one or more synthesis filter banks (e.g., QMF synthesis filters) (280), which output decoded speech (292). Alternatively, other types of filter arrangements for subband synthesis may be used. If only a single band is present, the decoded band may bypass the filter banks (280).
The decoded speech output (292) may also be passed through one or more post-filters (284) to improve the quality of the resulting post-filtered speech output (294). Also, each band may separately pass through one or more post-filters before entering the filter banks (280).
A generalized real-time speech band decoder is described below with reference to fig. 6, but other speech decoders may be used instead. In addition, some or all of the tools and techniques described may be used in conjunction with other types of audio encoders and decoders, such as music encoders and decoders, or general-purpose audio encoders and decoders.
In addition to these primary encoding and decoding functions, the components may also share information (shown by dashed lines in fig. 2) to control the rate, quality, and/or loss resiliency of the encoded speech. The rate controller (220) considers a variety of factors, such as the complexity of the current input in the input buffer (210), the buffer fullness of output buffers in the encoder (230) or elsewhere, the desired output rate, the current network bandwidth, network congestion/noise conditions, and/or the decoder loss rate. The decoder (270) feeds decoder loss rate information back to the rate controller (220). The network layers (240, 260) collect or estimate information about current network bandwidth and congestion/noise conditions, which is fed back to the rate controller (220). Alternatively, the rate controller (220) considers other and/or additional factors.
The rate controller (220) directs the speech encoder (230) to change the rate, quality, and/or loss resiliency with which speech is encoded. The encoder (230) may change the rate and quality by adjusting quantization factors for parameters or by changing the resolution of the entropy codes representing the parameters. In addition, the encoder may change the loss resiliency by adjusting the rate or type of redundant coding. Thus, the encoder (230) may change the allocation of bits between primary encoding functions and loss resiliency functions depending on network conditions.
The rate controller (220) may determine the coding mode for each sub-band of each frame based on several factors. These factors may include the signal characteristics of each sub-band, the bitstream buffering history, and the target bit rate. For example, as noted above, simpler frames, such as silent and unvoiced frames, generally require fewer bits, while more complex frames, such as transition frames, require more bits. In addition, fewer bits may be needed for some bands, such as the high frequency band. Further, if the average bit rate in the bitstream history buffer is less than the target average bit rate, a higher bit rate may be used for the current frame. If the average bit rate is greater than the target average bit rate, a lower bit rate may be selected for the current frame to lower the average bit rate. In addition, one or more frequency bands may be omitted from one or more frames. For example, the middle and high frequency bands may be omitted for unvoiced frames, or they may be omitted from all frames for a period of time, thereby lowering the bit rate during that period.
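The averaging rule in this passage can be sketched as follows. This is a toy illustration only; the function name, parameters, and fixed step size are hypothetical and not part of the described implementation:

```python
def choose_frame_bit_rate(avg_bit_rate, target_bit_rate, base_rate, step):
    """Toy rate-control rule: spend more bits when the history is under the
    target average, fewer when it is over, to pull the average toward target."""
    if avg_bit_rate < target_bit_rate:
        return base_rate + step   # under budget: allow a higher rate for this frame
    else:
        return base_rate - step   # over budget: pick a lower rate to lower the average
```

A real controller would also weigh signal complexity, band characteristics, and network conditions, as the surrounding text describes.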
FIG. 4 is a block diagram of a generalized speech band encoder (400) implemented in connection with one or more described embodiments. The band encoder (400) generally corresponds to any of the band encoders (232, 234) of fig. 2.
The band encoder (400) accepts the band input (402) from the filter bank (or other filters) in the case where a signal (e.g., the current frame) is split into multiple bands. If the current frame is not split into multiple bands, the band input (402) includes samples representing the entire bandwidth. The band encoder produces an encoded band output (492).
If the signal is split into multiple frequency bands, a downsampling component (420) may perform downsampling on each band. As an example, if the sampling rate is 16 kHz and each frame is 20 ms in duration, then each frame comprises 320 samples. If no downsampling were performed and the frame were split into the three-band structure shown in fig. 3, then three times as many samples would be encoded and decoded (i.e., 320 samples per band, or 960 samples in total). However, each band may be downsampled. For example, the low frequency band (310) may be downsampled from 320 samples to 160 samples, and each of the middle band (320) and the high band (330) may be downsampled from 320 samples to 80 samples, because the bands (310, 320, 330) span one half, one quarter, and one quarter of the frequency range, respectively. This yields a total of 320 samples for the frame to be encoded and decoded. Even fewer bits are typically used for the higher bands in subsequent stages, because signal energy typically tails off toward the higher frequency ranges.
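The sample-count arithmetic in this example can be checked with a short sketch. The band structure and decimation factors follow the three-band example above; the function name is a hypothetical illustration:

```python
FRAME_SAMPLES = 320  # 16 kHz sampling rate * 20 ms frame duration

# (band name, fraction of spectrum, decimation factor) per the 3-band example
BANDS = [("low", 0.5, 2), ("mid", 0.25, 4), ("high", 0.25, 4)]

def downsampled_counts(frame_samples, bands):
    """Samples per band after decimating each band by its factor."""
    return {name: frame_samples // factor for name, _, factor in bands}

counts = downsampled_counts(FRAME_SAMPLES, BANDS)
# low: 160, mid: 80, high: 80 -- total equals the original 320 samples per frame
assert sum(counts.values()) == FRAME_SAMPLES
```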
It is believed that, even with downsampling of each band, the sub-band codec can still produce higher quality speech output than a single-band codec because it is more flexible. For example, the sub-band codec can control quantization noise on a per-band basis, rather than using the same approach for the entire spectrum. Each of the multiple frequency bands can be encoded with different properties (such as different numbers and/or types of codebook stages, as discussed below). Those properties may be determined by rate control based on several factors discussed above, including the signal characteristics of each sub-band, the bitstream buffering history, and the target bit rate. As discussed above, "simple" frames, such as silent and unvoiced frames, generally require fewer bits, while "complex" frames, such as transition frames, require more bits. If the average bit rate in the bitstream history buffer is less than the target average bit rate, a higher bit rate can be used for the current frame. Otherwise, a lower bit rate is selected to lower the average bit rate. In a sub-band codec, each band can be characterized in this manner and coded accordingly, rather than characterizing the entire spectrum the same way. Additionally, rate control can lower the bit rate by omitting one or more higher frequency bands for one or more frames.
The LP analysis component (430) calculates linear prediction coefficients (432). In one implementation, the LP filter uses 10 coefficients for 8 kHz input and 16 coefficients for 16 kHz input, and the LP analysis component (430) computes one set of linear prediction coefficients per frame for each frequency band. Alternatively, the LP analysis component (430) computes two sets of coefficients per frame for each band, one for each of two windows centered at different locations, or computes a different number of coefficients per band and/or per frame.
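Linear prediction coefficients of the kind computed by the LP analysis component (430) are commonly obtained via the autocorrelation method with the Levinson-Durbin recursion. The following minimal sketch shows the generic textbook method, which is not necessarily the one used in the described implementation:

```python
def lpc_coefficients(signal, order):
    """Estimate LP coefficients by the autocorrelation method (Levinson-Durbin).
    Returns a list a such that s[n] is predicted as sum(a[i-1] * s[n-i])."""
    n = len(signal)
    # autocorrelation at lags 0..order
    r = [sum(signal[i] * signal[i + k] for i in range(n - k)) for k in range(order + 1)]
    a = [0.0] * (order + 1)   # a[0] is implicitly 1 and unused
    err = r[0]                # prediction error energy
    for m in range(1, order + 1):
        # reflection coefficient for stage m
        acc = r[m] - sum(a[j] * r[m - j] for j in range(1, m))
        k = acc / err
        new_a = a[:]
        new_a[m] = k
        for j in range(1, m):
            new_a[j] = a[j] - k * a[m - j]
        a = new_a
        err *= (1 - k * k)
    return a[1:]
```

For a decaying exponential s[n] = 0.5^n, the first-order predictor recovers a coefficient very close to 0.5, as expected for an AR(1) signal.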
The LPC processing component (435) receives and processes the linear prediction coefficients (432). Typically, the LPC processing component (435) converts the LPC values to a different representation for more efficient quantization and encoding. For example, the LPC processing component (435) converts LPC values to a line spectral pair ("LSP") representation, and the LSP values are quantized (e.g., by vector quantization) and encoded. The LSP values may be intra-coded or predicted from other LSP values. Various representations, quantization techniques, and encoding techniques are possible for the LPC values. The LPC values in some form are packed and sent (along with any quantization parameters and other information needed for reconstruction) as part of the encoded band output (492). For subsequent use in the encoder (400), the LPC processing component (435) reconstructs the LPC values. The LPC processing component (435) may perform interpolation for the LPC values (such as in an equivalent LSP representation or another representation) to smooth the transitions between different sets of LPC coefficients, or between the LPC coefficients used for different sub-frames of a frame.
A synthesis (or "short-term prediction") filter (440) accepts the reconstructed LPC values (438) and incorporates them into the filter. The synthesis filter (440) accepts an excitation signal and produces an approximation of the original signal. For a given frame, the synthesis filter (440) buffers a number of reconstructed samples from the previous frame (e.g., ten for a ten-tap filter) for use as the start of the prediction.
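The short-term prediction performed by a synthesis filter of this kind can be sketched as an all-pole filter that adds the excitation to a prediction from previously reconstructed samples. This is a simplified illustration; the function and variable names are hypothetical:

```python
def synthesize(excitation, lpc, history):
    """All-pole synthesis: each output sample is the excitation sample plus a
    linear prediction from previously reconstructed samples.
    `history` holds reconstructed samples from the previous frame
    (e.g., ten samples for a ten-tap filter)."""
    buf = list(history)
    out = []
    for e in excitation:
        # prediction from the most recent len(lpc) reconstructed samples
        pred = sum(a * s for a, s in zip(lpc, reversed(buf[-len(lpc):])))
        y = e + pred
        out.append(y)
        buf.append(y)
    return out
```

With a single coefficient of 0.5, one history sample of 1.0, and zero excitation, the output decays geometrically (0.5, 0.25, 0.125, ...), showing how the filter "rings" from its buffered state.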
Perceptual weighting components (450, 455) apply perceptual weighting to the raw data and the modeled output of the synthesis filter (440) to selectively de-emphasize the formant structure of the speech signal, thereby making the auditory system less sensitive to quantization errors. The perceptual weighting components (450, 455) use psychoacoustic phenomena such as masking. In an implementation, the perceptual weighting component (450, 455) applies weights based on the raw LPC values (422) derived from the LP analysis component (430). Optionally, the perceptual weighting components (450, 455) apply other and/or additional weights.
After the perceptual weighting components (450, 455), the encoder (400) computes the difference between the perceptually weighted original signal and the perceptually weighted synthesis filter output to produce a difference signal (434). Alternatively, the encoder (400) utilizes a different technique for calculating the speech parameters.
The excitation parameterization component (460) searches for the best combination of adaptive codebook indices, fixed codebook indices, and gain codebook indices, in terms of minimizing the difference between the perceptually weighted original and synthesized signals (according to weighted mean square error or other criteria). Many parameters are computed per sub-frame, but more generally the parameters may be per super-frame, per frame, or per sub-frame. As described above, the parameters for different bands of a frame or sub-frame may be different. Table 2 shows the available parameter types for different frame classes in one implementation.
Table 2: parameters for different frame classifications
In fig. 4, the excitation parameterization component (460) divides the frame into sub-frames and calculates codebook indices and gains for each sub-frame as appropriate. For example, the number and type of codebook stages to be used may initially be determined by an encoding mode, where the mode may be dictated by the rate control component discussed above. A particular mode may also dictate encoding and decoding parameters other than the number and type of codebook stages, for example, the resolution of the codebook indices. The parameters of each codebook stage are determined by optimizing the parameters to minimize the error between the target signal and the contribution of that codebook stage to the synthesized signal. (As used herein, the term "optimize" means finding a suitable solution under applicable constraints such as distortion reduction, parameter search time, parameter search complexity, parameter bit rate, etc., as opposed to performing a full search over the parameter space. For example, the optimization may be performed using a modified mean square error technique.) The target signal for each stage is the difference between the residual signal and the sum of the contributions of the preceding codebook stages (if any) to the synthesized signal. Alternatively, other optimization techniques may be used.
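The per-stage target computation described above can be sketched as follows, assuming each stage's contribution to the synthesized signal is already available as a sample sequence (the names are hypothetical illustrations):

```python
def stage_targets(residual, stage_contributions):
    """Target for each codebook stage = residual minus the sum of the
    contributions of all preceding stages (computed sample by sample)."""
    targets = []
    running = [0.0] * len(residual)   # sum of contributions so far
    for contrib in stage_contributions:
        targets.append([r - s for r, s in zip(residual, running)])
        running = [s + c for s, c in zip(running, contrib)]
    return targets
```

The first stage's target is the residual itself; each later stage only has to model what the earlier stages left behind, which is what makes the multi-stage search cheap relative to one large combined codebook.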
FIG. 5 illustrates a technique to determine codebook parameters according to an implementation. The excitation parameterization component (460) potentially performs the technique in conjunction with other components, such as a rate controller. Optionally, other components in the encoder perform the technique.
Referring to fig. 5, for each sub-frame within a voiced or transition frame, the excitation parameterization component (460) determines (510) whether an adaptive codebook may be used for the current sub-frame. (For example, rate control may dictate that no adaptive codebook is to be used for a particular frame.) If an adaptive codebook is not to be used, then an adaptive codebook switch will indicate that no adaptive codebook is used (535). This may be done, for example, by setting a one-bit flag at the frame level indicating that no adaptive codebook is used in the frame, or by setting a one-bit flag for each sub-frame indicating that no adaptive codebook is used in that sub-frame.
For example, the rate control component may exclude the adaptive codebook for a frame, thereby removing the most significant memory dependence between frames. A typical excitation signal is characterized by a periodic pattern, especially for voiced frames. The adaptive codebook includes an index that represents a lag, indicating the position of a segment of excitation in a history buffer. That segment of previous excitation is scaled to become the adaptive codebook's contribution to the excitation signal. At the decoder, the adaptive codebook information is usually quite important for reconstructing the excitation signal. If the previous frame is lost and the adaptive codebook lag index points back to a segment of the previous frame, then the adaptive codebook index is generally useless because it points to historical information that does not exist. Even if concealment techniques are performed to recover the lost information, future reconstruction will be based on this imperfectly recovered signal. This causes errors to continue into subsequent frames, since lag information is typically sensitive.
Thus, loss of a packet on which subsequent adaptive codebooks depend results in extended degradation that fades out only after many packets have been decoded, or when a frame without an adaptive codebook is encountered. This problem can be alleviated by regularly inserting so-called "intra frames" into the packet stream, frames that have no memory dependence on previous frames. Errors then propagate only until the next intra frame. There is thus a tradeoff between better voice quality and better packet loss performance, since the coding efficiency of adaptive codebooks is typically higher than that of fixed codebooks. The rate control component can determine when it is advantageous to disable the adaptive codebook for a particular frame. The adaptive codebook switch can be used to prevent the use of an adaptive codebook for a particular frame, thereby eliminating what is usually the most significant dependence on previous frames (LPC interpolation and synthesis filter memory also depend on previous frames to some extent). Thus, the adaptive codebook switch may be used by the rate control component to dynamically create quasi-intra frames based on factors such as packet loss rate (i.e., when the packet loss rate is high, more intra frames may be inserted to allow faster memory reset).
Still referring to FIG. 5, if an adaptive codebook is used, then the component (460) determines the adaptive codebook parameters. Those parameters include an index, or pitch value, that indicates the desired segment of the excitation signal history, as well as a gain to apply to the desired segment. In figs. 4 and 5, the component (460) performs a closed-loop pitch search (520). The search begins with the pitch determined by the optional open-loop pitch search component (425) in fig. 4. The open-loop pitch search component (425) analyzes the weighted signal produced by the weighting component (450) to estimate its pitch. Beginning with that estimated pitch, the closed-loop pitch search (520) optimizes the pitch value to decrease the error between the target signal and the weighted synthesized signal generated from the indicated segment of the excitation signal history. The adaptive codebook gain value is also optimized (525). The adaptive codebook gain value indicates a multiplier to apply to the pitch prediction values (the values from the indicated segment of the excitation signal history), to adjust the scale of the values. The gain multiplied by the pitch prediction values is the adaptive codebook contribution to the excitation signal for the current frame or sub-frame. The gain optimization (525) produces a gain value and an index value that minimize the error between the target signal and the weighted synthesized signal from the adaptive codebook contribution.
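A closed-loop pitch search of the general kind described can be sketched as follows: for each candidate lag, the segment of excitation history is taken, the least-squares-optimal gain for that segment is computed, and the lag/gain pair with the smallest squared error against the target is kept. This is a simplified illustration without perceptual weighting or synthesis filtering; all names are hypothetical:

```python
def adaptive_codebook_search(target, history, min_lag, max_lag):
    """Toy closed-loop search over candidate lags.
    Returns (lag, gain, error) minimizing sum((target - gain*segment)**2)."""
    n = len(target)
    best = None
    for lag in range(min_lag, max_lag + 1):
        seg = history[-lag:][:n]              # predicted pitch segment
        if len(seg) < n:                      # short lag: repeat the segment
            seg = (seg * (n // len(seg) + 1))[:n]
        energy = sum(p * p for p in seg)
        if energy == 0:
            continue
        # least-squares gain for this lag
        gain = sum(t * p for t, p in zip(target, seg)) / energy
        err = sum((t - gain * p) ** 2 for t, p in zip(target, seg))
        if best is None or err < best[2]:
            best = (lag, gain, err)
    return best
```

In practice the search would start near the open-loop pitch estimate rather than scanning all lags, and the error would be measured through the weighted synthesis filter.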
After the pitch and gain values are determined, it is then determined (530) whether the adaptive codebook contribution is significant enough to be worth the number of bits used by the adaptive codebook parameters. If the adaptive codebook gain is smaller than a threshold, the adaptive codebook is turned off to save the bits for the fixed codebook discussed below. In one implementation, a threshold value of 0.3 is used, although other values may be used as the threshold instead. As an example, if the current coding mode uses the adaptive codebook plus a pulse codebook with five pulses, then a seven-pulse codebook may be used when the adaptive codebook is turned off, and the total number of bits will still be the same or fewer. As described above, a one-bit flag for each sub-frame may be used to indicate the adaptive codebook switch for that sub-frame. Thus, if the adaptive codebook is not used, the switch is set to indicate that no adaptive codebook is used in the sub-frame (535). Likewise, if the adaptive codebook is used, the switch is set to indicate that the adaptive codebook is used in the sub-frame, and the adaptive codebook parameters are signaled in the bitstream (540). Although fig. 5 shows signaling after the determinations, signals may instead be batched until the technique finishes for a frame or super-frame.
The excitation parameterization component (460) also determines (550) whether a pulse codebook is used. In one implementation, the use or non-use of the pulse codebook is indicated as part of the overall coding mode for the current frame, although it may be indicated or determined in other ways. A pulse codebook is a type of fixed codebook that specifies one or more pulses to be contributed to the excitation signal. The pulse codebook parameters include pairs of indices and signs (gains can be positive or negative). Each pair indicates a pulse to be included in the excitation signal, with the index indicating the position of the pulse and the sign indicating the polarity of the pulse. The number of pulses included in the pulse codebook and used to contribute to the excitation signal can vary depending on the coding mode. Additionally, the number of pulses may depend on whether or not an adaptive codebook is being used.
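The construction of a pulse codebook contribution from index/sign pairs can be sketched as below. The helper is hypothetical, for illustration only:

```python
def pulse_contribution(length, pulses, gain=1.0):
    """Build a pulse codebook contribution from (index, sign) pairs:
    each pair places a +/-1 pulse at the given position, scaled by the gain."""
    out = [0.0] * length
    for index, sign in pulses:
        out[index] += sign * gain   # sign gives polarity, index gives position
    return out
```

For example, two pulses at positions 1 and 3 with opposite polarities and gain 0.5 produce the sparse sequence [0.0, 0.5, 0.0, -0.5, 0.0].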
If a pulse codebook is used, the pulse codebook parameters (555) are optimized to minimize the error between the contribution of the indicator pulse and the target signal. If no adaptive codebook is used, the target signal is the weighted original signal. If an adaptive codebook is used, the target signal is the difference between the weighted original signal and the contribution of the adaptive codebook to the weighted synthesized signal. At some point (not shown), the pulse codebook parameters are then signaled in the bitstream.
The excitation parameterization component (460) also determines (565) whether any random fixed codebook stages are used. The number of random codebook stages (if any) is indicated as part of the overall coding mode for the current frame, although it may be indicated or determined in other ways. A random codebook is a type of fixed codebook that uses a predefined signal model for the values it encodes. The codebook parameters may include the starting point for an indicated segment of the signal model and a sign that can be positive or negative. The length or range of the indicated segment is typically fixed and therefore not typically signaled, although it may alternatively be signaled. A gain is multiplied by the values in the indicated segment to produce the contribution of the random codebook to the excitation signal.
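The random codebook contribution described above, a gain-scaled, sign-adjusted segment of a predefined signal model, can be sketched as follows (the names are hypothetical):

```python
def random_cb_contribution(model, start, length, sign, gain):
    """Random codebook stage: a fixed-length segment of a predefined signal
    model, starting at `start`, scaled by the sign and the gain."""
    segment = model[start:start + length]        # indicated segment of the model
    return [sign * gain * v for v in segment]
```

Because only the starting point, sign, and gain vary (the segment length is fixed), each stage is cheap to signal, which is what allows several stages to be stacked without a large bit-rate cost.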
If at least one random codebook (random CB) stage is used, codebook stage parameters (570) applicable to that codebook stage are optimized to minimize the error between the contribution of the random codebook stage and the target signal. The target signal is the difference between the weighted original signal and the sum of the contributions of the adaptive codebook (if any), the pulse codebook (if any), and the previously determined random codebook stage (if any) to the weighted composite signal. At some point (not shown), the random codebook parameters are then signaled in the bitstream.
The component (460) then determines (580) whether any more random codebook stages are to be used. If so, the parameters of the next random codebook stage are then optimized (570) and signaled as described above. This continues until all parameters for the random codebook are determined. All random codebook stages may use the same signal model, although they may indicate different segments and have different gain values than the model. Alternatively, different signal models may be used for different random codebook stages.
Each excitation gain may be quantized independently, or two or more gains may be quantized simultaneously, as determined by the rate controller and/or other components.
Although a particular order for optimizing the various codebook parameters has been set forth herein, other orders and optimization techniques may be used. Thus, while FIG. 5 illustrates sequential computation of different codebook parameters, two or more different codebook parameters may alternatively be jointly optimized (e.g., jointly varying the parameters and estimating the results according to some non-linear optimization technique). In addition, other configurations of codebooks or excitation signal parameters may be used.
The excitation signal in this implementation is the sum of the contributions of the adaptive codebook, the pulse codebook, and one or more random codebook stages. Optionally, the component (460) may calculate other and/or additional parameters for the excitation signal.
Referring to fig. 4, codebook parameters for the excitation signal are signaled or otherwise provided to the local decoder (465) (circled in fig. 4 by dashed lines) and the band output (492). Thus, for each frequency band, the encoder output (492) includes the output from the LPC processing component (435) described above, as well as the output from the excitation parameterization component (460).
The bit rate of the output (492) depends in part on the parameters used by the codebooks, and the encoder (400) may control the bit rate and/or quality by switching between different sets of codebook indices, using embedded coding, or using other techniques. Different combinations of codebook types and stages can yield different encoding modes for different frames, bands, and/or sub-frames. For example, an unvoiced frame may use only one random codebook stage. An adaptive codebook and a pulse codebook may be used for a low rate voiced frame. A high rate frame may be encoded using an adaptive codebook, a pulse codebook, and one or more random codebook stages. The combination of all the encoding modes for all the sub-bands in a frame is collectively referred to as a mode set. There may be several predefined mode sets for each sampling rate, with different modes corresponding to different coding bit rates. The rate control module can determine or influence the mode set for each frame.
The range of possible bit rates can be very large for the described implementation and can result in significant improvements in the resulting quality. In a standard encoder, the number of bits used for the pulse codebook may also be varied, but too many bits may only produce excessively dense pulses. Similarly, when only a single codebook is used, adding more bits enables a larger signal model to be used. But this would significantly increase the complexity of the search for the model optimized segment. In contrast, additional types of codebooks as well as additional random codebook stages may be added without significantly increasing the complexity of the respective codebook search (as compared to searching a single combined codebook). In addition, multiple random codebook stages and multiple classes of fixed codebooks allow for multiple gain factors to provide more flexible waveform matching.
Still referring to fig. 4, the output of the excitation parameterization component (460) is received by codebook reconstruction components (470, 472, 474, 476) and gain application components (480, 482, 484, 486) corresponding to the codebooks used by the parameterization component (460). The codebook stages (470, 472, 474, 476) and the corresponding gain application components (480, 482, 484, 486) reconstruct the contributions of the codebooks. Those contributions are summed to produce an excitation signal (490), which is received by the synthesis filter (440), where it is used together with the "predicted" samples from which subsequent linear prediction occurs. Delayed portions of the excitation signal are also used as an excitation history signal by the adaptive codebook reconstruction component (470) to reconstruct subsequent adaptive codebook contributions, and by the parameterization component (460) in computing subsequent adaptive codebook parameters (e.g., pitch index and pitch gain values).
Referring back to fig. 2, the band output for each band is received by the MUX (236), along with other parameters. These other parameters include frame classification information (222) from the frame classifier (214) and information of the frame encoding mode. The MUX (236) constructs application layer packets for delivery to other software, or the MUX (236) puts data into the payload of the packets following a protocol such as RTP. The MUX buffers the parameters to allow selective repetition of the parameters for forward error correction in subsequent packets. In one implementation, the MUX (236) packs the primary encoded speech information for a frame, along with forward error correction information for all or a portion of one or more previous frames, into a single packet.
The MUX (236) provides feedback such as the current buffer fullness for rate control purposes. More generally, various components of the encoder (230), including the frame classifier (214) and the MUX (236), may provide information to a rate controller (220) such as that shown in FIG. 2.
The bitstream DEMUX (276) of fig. 2 accepts encoded speech information as input and parses it to identify and process parameters. The parameters may include frame class, some representation of LPC values, and codebook parameters. The frame class may indicate which other parameters are present for a given frame. More generally, the DEMUX (276) uses the protocols used by the encoder (230) and extracts the parameters the encoder (230) packs into packets. For packets received over a dynamic packet-switched network, the DEMUX (276) includes a jitter buffer to smooth out short-term fluctuations in packet rate over a given period of time. In some cases, the decoder (270) regulates buffer delay and manages when packets are read out from the buffer so as to integrate delay, quality control, concealment of missing frames, etc. into the decoding. In other cases, an application layer component manages the jitter buffer, and the jitter buffer is filled at a variable rate and depleted by the decoder (270) at a constant or relatively constant rate.
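A jitter buffer of the general kind described, filled out of order at a variable rate and drained in sequence order at a steady rate, can be sketched as follows. This is a minimal illustration; the class, its interface, and its behavior on missing packets are assumptions, not the described implementation:

```python
import heapq

class JitterBuffer:
    """Minimal jitter buffer: packets arrive out of order; the decoder drains
    them in sequence order, and a gap is reported as a lost frame (None)."""

    def __init__(self):
        self._heap = []       # min-heap of (sequence number, payload)
        self._next_seq = 0    # next sequence number the decoder expects

    def push(self, seq, payload):
        heapq.heappush(self._heap, (seq, payload))

    def pop(self):
        """Return the next in-order payload, or None (concealment needed)."""
        while self._heap and self._heap[0][0] < self._next_seq:
            heapq.heappop(self._heap)            # drop stale/duplicate packets
        if self._heap and self._heap[0][0] == self._next_seq:
            self._next_seq += 1
            return heapq.heappop(self._heap)[1]
        self._next_seq += 1                       # gap: frame missing or late
        return None
```

A production buffer would also adapt its depth to the observed network jitter, trading added delay against the chance of declaring a late packet lost.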
The DEMUX (276) may receive multiple versions of parameters for a given segment, including a primary encoded version and one or more secondary error correction versions. When error correction fails, the decoder (270) then uses a concealment technique such as parameter repetition or estimation based on the correctly received information.
FIG. 6 is a block diagram of a real-time speech band decoder in conjunction with which one or more of the described embodiments may be implemented. The band decoder (600) generally corresponds to any of the band decoding components (272, 274) of fig. 2.
The band decoder (600) accepts as input encoded speech information for a frequency band (which may be the complete band, or one of multiple sub-bands) and produces a reconstructed output (602) after decoding. The components of the decoder (600) have corresponding components in the encoder (400), but the decoder (600) as a whole is simpler since it lacks components for perceptual weighting, the excitation processing loop, and rate control.
The LPC processing component (635) receives information representing LPC values (as well as any quantization parameters and other information needed for reconstruction) in a format provided by the band encoder (400). The LPC processing component (635) reconstructs the LPC values (638) using the inverse of the transform, quantization, encoding, etc. previously applied to the LPC values. The LPC processing component (635) may also perform interpolation (in the LPC representation or another representation such as an LSP) for the LPC values to smooth the transition between different sets of LPC coefficients.
The codebook stages (670, 672, 674, 676) and the gain application components (680, 682, 684, 686) decode the parameters of any corresponding codebook stages used for the excitation signal and compute the contribution of each codebook stage that is used. More specifically, the configuration and operations of the codebook stages (670, 672, 674, 676) and gain components (680, 682, 684, 686) correspond to the configuration and operations of the codebook stages (470, 472, 474, 476) and gain components (480, 482, 484, 486) in the encoder (400). The contributions of the codebook stages used are summed, and the resulting excitation signal (690) is fed into the synthesis filter (640). A delayed portion of the excitation signal (690) is also used as an excitation history by the adaptive codebook (670) in computing the adaptive codebook's contribution for subsequent portions of the excitation signal.
A synthesis filter (640) receives the reconstructed LPC values (638) and incorporates them into the filter. The synthesis filter (640) stores previously reconstructed samples for processing. The excitation signal (690) is passed through a synthesis filter to form an approximation of the original speech signal. Referring back to fig. 2, as described above, if there are multiple subbands, the subband outputs for each subband are synthesized in the filter bank (280) to form the speech output (292).
The relationships shown in FIGS. 2-6 indicate general information flow; other relationships are not shown for simplicity. Depending on the implementation and the type of compression desired, various components may be added, omitted, split into multiple components, combined with other components, and/or replaced by similar components. For example, in the environment (200) shown in fig. 2, the rate controller (220) may be combined with the speech encoder (230). Possible added components include a multimedia encoder (or playback) application that manages the speech encoder (or decoder) and other encoders (or decoders) and collects network and decoder condition information and performs adaptive error correction functions. In alternative embodiments, different combinations and configurations of components process voice information using the techniques described herein.
Redundant coding techniques
One possible application of voice codecs is telephony over IP networks or other packet-switched networks. These networks have some advantages over existing circuit-switched infrastructure. However, in IP network telephony, packets are often delayed or dropped due to network congestion.
Many standard speech codecs have high inter-frame dependencies. For these codecs, the loss of one frame can cause severe speech quality degradation across many subsequent frames.
In other codecs, each frame can be decoded independently. Such codecs are robust to packet loss. However, coding efficiency in terms of quality and bit rate suffers significantly as a result of disallowing inter-frame dependencies. Therefore, such codecs typically require higher bit rates to achieve speech quality similar to that of conventional CELP codecs.
In some embodiments, the redundant coding techniques discussed below help achieve good packet loss recovery performance without significantly increasing the bit rate. The techniques may be used together in a codec or separately.
In the encoder implementations described above with reference to figs. 2 and 4, the adaptive codebook information is typically the main source of dependence on other frames. As described above, the adaptive codebook index indicates the position of a segment of the excitation signal in the history buffer. The segment of the previous excitation signal is scaled (according to a gain value) to become the adaptive codebook contribution to the excitation signal of the current frame (or sub-frame). If a previous packet containing information used to reconstruct the previous excitation signal is lost, then the current frame (or sub-frame) lag information is not useful because it points to non-existent historical information. Since lag information is sensitive, this typically results in extended degradation of the resulting speech output that does not fade away until many packets have been decoded.
The following techniques are designed to remove, at least to some extent, the dependence of the current excitation signal on reconstructed information from unavailable previous frames that are delayed or lost.
An encoder such as the encoder (230) described above with reference to fig. 2 can switch between the following coding techniques on a frame-by-frame or other basis. A decoder such as the decoder (270) described above with reference to fig. 2 can switch between corresponding parsing/decoding techniques on a frame-by-frame or other basis. Alternatively, another encoder, decoder, or audio processing tool may perform one or more of the following techniques.
A. Primary adaptive codebook history re-encoding/decoding
In primary adaptive codebook history re-encoding/decoding, the excitation history buffer is not used to decode the excitation signal of the current frame, even if the excitation history buffer is available at the decoder (because the packet for the previous frame was received, the previous frame was decoded, etc.). Instead, at the encoder, the pitch information for the current frame is analyzed to determine how much of the excitation history is needed. The needed portion of the excitation history is re-encoded and transmitted along with the coding information (e.g., filter parameters, codebook indices, and gains) for the current frame. The adaptive codebook contribution of the current frame references the re-encoded excitation signal that is transmitted with the current frame. This ensures that a redundant excitation history is available to the decoder for every frame. Such redundant coding is unnecessary when the current frame does not use an adaptive codebook, as for an unvoiced frame.
The re-encoding of the referenced part of the excitation history may be done in conjunction with the encoding of the current frame, and may be done in the same way as the encoding of the excitation signal for the current frame described above.
In some implementations, the encoding of the excitation signal is done on a sub-frame basis, and the re-encoded segment of the excitation signal extends from the beginning of the current frame back past the sub-frame boundary beyond the farthest adaptive codebook dependency for the current frame. The re-encoded excitation signal is thus available for reference by the pitch information of multiple sub-frames within the frame. Alternatively, the encoding of the excitation signal may be done on some other basis, such as frame by frame.
An example depicting the excitation history (710) is shown in FIG. 7. Frame boundaries (720) and sub-frame boundaries (730) are depicted by larger and smaller dashed lines, respectively. The sub-frames of the current frame (740) are encoded using an adaptive codebook. Line (750) depicts the farthest point of dependency for any adaptive codebook lag index of the sub-frames of the current frame. Accordingly, the re-encoded history (760) extends from the beginning of the current frame back past the next sub-frame boundary beyond the farthest dependency point (750). The farthest dependency point may be estimated using the results of the open-loop pitch search (425) described above. Because that search is not precise, however, the adaptive codebook may depend on some portion of the excitation signal beyond the estimated farthest point, unless the subsequent pitch search is constrained. Accordingly, the re-encoded history may include additional samples beyond the estimated farthest dependency point, providing extra room for finding matching pitch information. In one implementation, at least ten additional samples beyond the estimated farthest dependency point are included in the re-encoded history. Of course, more than ten extra samples may be included, increasing the likelihood that the re-encoded history extends far enough to include pitch cycles matching each pitch cycle within the current sub-frames.
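The extent of the re-encoded history described above can be sketched as follows. This is a hypothetical helper under stated assumptions (per-subframe open-loop pitch lags in samples, a ten-sample safety margin); the real encoder's bookkeeping may differ:

```python
def reencode_history_length(subframe_len, pitch_estimates, margin=10):
    """Hypothetical sketch: how many past excitation samples to re-encode.

    `pitch_estimates` are open-loop pitch lags (in samples) for each
    subframe of the current frame.  The lag of subframe i reaches back
    `pitch_estimates[i] - i * subframe_len` samples before the frame
    start; the re-encoded history extends to the next subframe boundary
    past the farthest such point, plus a margin because the open-loop
    search is only approximate.
    """
    farthest = 0
    for i, lag in enumerate(pitch_estimates):
        reach = lag - i * subframe_len   # samples before the frame start
        farthest = max(farthest, reach)
    # Round up to the next subframe boundary, then add the safety margin.
    boundary = -(-farthest // subframe_len) * subframe_len
    return boundary + margin
```

For example, with 40-sample subframes and pitch estimates of 90, 60, 50, and 45 samples, only the first two subframes reach before the frame start (90 and 20 samples back), so the history is rounded up to 120 samples and 10 more are added.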
Alternatively, only the segments of the previous excitation signal that are actually referenced within the sub-frame of the current frame are re-encoded. For example, a segment of the previous excitation signal having the appropriate duration is re-encoded for use in decoding a single current segment within that duration.
Primary adaptive codebook history re-encoding/decoding eliminates the dependency on the excitation history of the previous frame. At the same time, it allows the use of an adaptive codebook without requiring re-encoding of the entire previous frame (or even the entire excitation history of the previous frame). However, re-encoding the adaptive codebook memory consumes a fairly high bit rate compared to the techniques described below, especially since the re-encoded history is used for primary encoding/decoding and thus must match the quality of encoding/decoding with inter-frame dependencies.
As a byproduct of primary adaptive codebook history re-encoding/decoding, the re-encoded excitation signal may be used to recover at least part of the excitation signal for a previous lost frame. For example, the re-encoded excitation signal is reconstructed during decoding of each sub-frame of the current frame and fed into an LPC synthesis filter reconstructed using actual or estimated filter coefficients.
The resulting reconstructed output signal can be used as part of the previous frame's output. This technique also helps estimate the initial state of the synthesis filter memory for the current frame. Using the re-encoded history and the estimated synthesis filter memory, the output of the current frame can be generated in the same manner as in conventional decoding.
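The recovery step just described, feeding a re-encoded excitation through an LPC synthesis filter with estimated memory, can be illustrated with a direct-form all-pole filter. A minimal sketch only: the function name, argument layout, and sign convention (stated in the comment) are assumptions and may differ from a given codec's conventions:

```python
def lpc_synthesis(excitation, coeffs, memory):
    """Sketch: all-pole LPC synthesis filter with explicit memory.

    Uses output[n] = excitation[n] + sum_k coeffs[k] * output[n-1-k];
    the initial past outputs come from `memory` (most recent sample
    last).  Running a re-encoded excitation through this filter with
    estimated coefficients yields an approximation of the lost frame's
    output, and the returned memory seeds the filter for the current
    frame.
    """
    state = list(memory)
    out = []
    for x in excitation:
        y = x + sum(coeffs[k] * state[-1 - k] for k in range(len(coeffs)))
        out.append(y)
        state.append(y)
    return out, state[-len(coeffs):]
```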
B. Secondary adaptive codebook history re-encoding/decoding
In the secondary adaptive codebook history re-encoding/decoding technique, the primary encoding of the current frame is unchanged. Likewise, the primary decoding of the current frame is unchanged; it uses the previous frame's excitation history if the previous frame was received.
Additionally, the excitation history buffer is re-encoded in the same manner as in the primary adaptive codebook history re-encoding/decoding technique described above, for use when the previous excitation history is unavailable. However, fewer bits may be used than in the primary encoding, since the re-encoded signal does not affect speech quality when there is no packet loss. The number of bits used to re-encode the excitation history can be reduced by varying various parameters, such as using fewer fixed codebook stages or using fewer pulses in the pulse codebook.
When a previous frame is lost, the re-encoded excitation history is used to generate an adaptive codebook excitation signal for the current frame in the decoder. The re-encoded excitation history may also be used to recover at least part of the excitation signal associated with a previously lost frame, as in the primary adaptive codebook history re-encoding/decoding technique.
Likewise, the resulting reconstructed output signal may be used as part of the previous frame output. This technique also helps to estimate the original state of the synthesis filter memory with respect to the current frame. Using the re-encoded excitation history and the estimated synthesis filter memory, the output of the current frame can be generated in the same manner as with conventional encoding.
C. Extra codebook stage
In the extra codebook stage technique, the primary encoding of the excitation signal is the same as the conventional encoding described with reference to figs. 2-5, just as in the secondary adaptive codebook history re-encoding/decoding technique. However, parameters for one or more extra codebook stages are also determined.
In this encoding technique, as shown in FIG. 8, it is assumed (810) that the excitation history buffer at the beginning of the current frame is all zeros, so there is no contribution from the previous excitation history. In addition to the primary coding information for the current frame, one or more extra codebook stages are used for each sub-frame or other segment that uses the adaptive codebook. For example, the extra codebook stages use random fixed codebooks such as those described with reference to FIG. 4.
In this technique, the current frame is encoded normally to produce primary coding information (which may include primary codebook parameters for primary codebook stages) for use by a decoder when the previous frame is available. At the encoder, redundancy parameters for one or more extra codebook stages are determined in a closed loop, assuming no excitation information from the previous frame. In a first implementation, this determination is made without using any of the primary codebook parameters. Alternatively, in a second implementation, at least some of the primary codebook parameters determined for the current frame are used. Those primary codebook parameters may be used along with the extra codebook stage parameters to decode the current frame if the previous frame is lost, as described below. In general, the second implementation requires fewer bits for the extra codebook stages to achieve quality similar to that of the first implementation.
As shown in fig. 8, the gain of the extra codebook stage and the gain of the last existing pulse or random codebook stage are jointly optimized in a closed-loop search at the encoder to minimize the coding error. Most of the parameters produced by conventional encoding are kept unchanged and used in the optimization. In the optimization, it is determined (820) whether any random or pulse codebook stages were used in the normal encoding. If so, a correction gain for the last existing random or pulse codebook stage (such as random codebook stage n in FIG. 4) is optimized (830) to minimize the error between the contribution of that codebook stage and a target signal. The target signal for this optimization is the difference between the residual signal and the sum of the contributions of all preceding codebook stages, with the adaptive codebook contribution from segments of the previous frame set to zero.
The index and gain parameters of the extra random codebook stage are similarly optimized (840) to minimize the error between that codebook stage's contribution and a target signal. The target signal for the extra random codebook stage is the difference between the residual signal and the sum of the contributions of the adaptive codebook, the pulse codebook (if any), and any preceding random codebook stages (with the last existing conventional random or pulse codebook stage using the correction gain). The correction gain of the last existing conventional random or pulse codebook stage and the gain of the extra random codebook stage may be optimized separately or jointly.
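The gain optimizations in the two preceding paragraphs are standard least-squares fits. A sketch of both forms, assuming the target and contribution signals are plain sample sequences (the separate fit is a projection; the joint fit solves the 2x2 normal equations); the function names are illustrative, not the patent's:

```python
def optimal_gain(target, contrib):
    """Closed-form gain g minimizing sum((target - g * contrib)**2)."""
    energy = sum(c * c for c in contrib)
    return sum(t * c for t, c in zip(target, contrib)) / energy if energy else 0.0

def jointly_optimal_gains(target, c1, c2):
    """Jointly optimize two gains by solving the 2x2 normal equations.

    c1 might be the last conventional codebook contribution (correction
    gain) and c2 the extra random codebook stage contribution.
    """
    a11 = sum(x * x for x in c1)
    a22 = sum(y * y for y in c2)
    a12 = sum(x * y for x, y in zip(c1, c2))
    b1 = sum(t * x for t, x in zip(target, c1))
    b2 = sum(t * y for t, y in zip(target, c2))
    det = a11 * a22 - a12 * a12
    return (a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det
```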
In the conventional decoding mode, the decoder does not use the extra random codebook stage and decodes the signal as described above (e.g., as shown in fig. 6).
FIG. 9A illustrates a sub-band decoder that can use an extra codebook stage when the adaptive codebook index points into a segment of a previous frame that has been lost. The framework is generally the same as the decoding framework described and illustrated with reference to FIG. 6, and many of the components and signals in the sub-band decoder (900) of FIG. 9A function the same as the corresponding components and signals in FIG. 6. For example, the encoded sub-band information (992) is received, and the LPC processing component (935) uses this information to reconstruct the linear prediction coefficients (938) and provide those coefficients to the synthesis filter (940). However, when the previous frame is missing, a reset component (996) signals a zero history component (994) to set the excitation history for the missing frame to zero and provide that history to the adaptive codebook (970). A gain (980) is applied to the contribution of the adaptive codebook (970). The adaptive codebook (970) then has a zero contribution when its index points into the history buffer for the missing frame, but may have a non-zero contribution when its index points within the current frame. The fixed codebook stages (972, 974, 976) apply the conventional indices that they receive with the sub-band information (992). Likewise, the fixed codebook gain components (982, 984), other than the gain component (986) for the last conventional codebook stage, apply their conventional gains to generate the respective contributions to the excitation signal (990).
If an extra codebook stage is to be used and the previous frame is missing, the reset component (996) signals a switch (998) to pass the contribution of the last conventional codebook stage (976), scaled by the correction gain (987), for summing with the other codebook contributions, rather than passing that contribution scaled by the conventional gain (986). The correction gain was optimized with the excitation history for the previous frame set to zero. Additionally, the extra codebook stage (978) applies its index to indicate a segment of the random codebook model signal in the corresponding codebook, and a random codebook gain component (988) applies the gain for the extra random codebook stage to that segment. The switch (998) passes the extra codebook stage contribution for summing with the contributions of the preceding codebook stages (970, 972, 974, 976) to produce the excitation signal (990). Thus, the redundant information for the extra random codebook stage (e.g., the extra stage index and gain) and the correction gain for the last primary random codebook stage (used in place of the conventional gain for that stage) are used to quickly reset the current frame to a known state. Alternatively, the conventional gain may be used for the last primary random codebook stage, and/or some other parameter may be used to signal the extra random codebook stage.
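The decoder-side switching just described can be summarized in one function. This is an illustrative sketch, not the patent's decoder: the contributions are assumed to be already reconstructed from their indices, with the adaptive codebook contribution already zeroed where it referenced the lost frame:

```python
def decode_excitation(contribs, gains, prev_lost,
                      correction_gain=None, extra_contrib=None, extra_gain=0.0):
    """Hypothetical sketch: sum scaled codebook contributions.

    `contribs`/`gains` hold the conventional stages, adaptive codebook
    first and the last conventional (pulse or random) codebook stage
    last.  When the previous frame is lost, the redundant correction
    gain replaces the conventional gain of that last stage, and the
    extra random codebook stage contribution is added; otherwise the
    extra-stage information is ignored.
    """
    gains = list(gains)
    if prev_lost and correction_gain is not None:
        gains[-1] = correction_gain   # correction gain replaces conventional gain
    n = len(contribs[0])
    excitation = [sum(g * c[i] for g, c in zip(gains, contribs)) for i in range(n)]
    if prev_lost and extra_contrib is not None:
        excitation = [e + extra_gain * x for e, x in zip(excitation, extra_contrib)]
    return excitation
```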
The extra codebook stage technique requires so few bits that the bit rate penalty for using it is generally insignificant. On the other hand, it can significantly reduce the quality degradation caused by frame loss when inter-frame dependencies exist.
FIG. 9B shows a sub-band decoder similar to that of FIG. 9A, but without any conventional random codebook stages. In this implementation, the correction gain (987) is therefore optimized for the pulse codebook (972) with the excitation history for the previous lost frame set to zero. Thus, when a frame is missing, the contributions of the adaptive codebook (970) (with the excitation history for the previous missing frame set to zero), the pulse codebook (972) (with the correction gain), and the extra random codebook stage (978) are summed to produce the excitation signal (990).
An extra codebook stage optimized with the excitation history for the missing frame set to zero may also be used with other codebook implementations and combinations and/or other representations of the residual signal.
D. Trade-offs between redundant coding techniques
Each of the three redundant coding techniques described above has advantages and disadvantages relative to the others. Table 3 summarizes some generalized conclusions regarding the trade-offs among these three redundant coding techniques. Bit rate loss refers to the amount of additional bits the technique requires. For example, a higher bit rate loss generally corresponds to lower quality during standard decoding, assuming the same overall bit rate as in conventional encoding/decoding, because more bits are spent on redundant coding and fewer bits remain for the conventionally encoded information. Efficiency at removing memory dependence refers to how well the technique improves the quality of the resulting speech output when one or more previous frames are lost. Effectiveness for recovering a previous frame refers to the ability to recover one or more previous frames using the redundant coding information when those frames are lost. The conclusions in the table are generalizations and need not apply in any particular implementation.
Table 3: trade-offs between redundant coding techniques
The encoder may select any of the redundant coding schemes for any frame on the fly during encoding. Redundant coding may be turned off entirely, used only for certain frame classes (e.g., for voiced frames but not for unvoiced or silent frames), used on a periodic basis such as every ten frames, or used on some other basis. This can be controlled by a component such as a rate control component, taking into account factors such as the trade-offs described above, the available channel bandwidth, and decoder feedback regarding packet loss status.
E. Redundant coded bit stream format
The redundant coding information may be transmitted in a bitstream in various formats. The following is one implementation of a format for transmitting the redundant coding information described above and signaling its presence to the decoder. In this implementation, each frame in the bitstream begins with a two-bit field called the frame type. The frame type identifies the redundant coding mode for the bits that follow, and may also be used for other purposes in encoding and decoding. Table 4 gives the redundant coding mode indicated by each value of the frame type field.
Table 4: description of frame type bits
Fig. 10 shows four different combinations of these codes in bitstream frame formats, where the codes signal the presence of a regular frame and/or the respective redundant coding types. For a regular frame (1010) that includes primary coding information without any redundant coding bits, the byte boundary (1015) at the beginning of the frame is followed by the frame type code 00. That code is followed by the primary coding information for the regular frame.
For a frame (1020) with primary adaptive codebook history redundant coding information, the byte boundary (1025) at the beginning of the frame is followed by the frame type code 10, which signals the presence of primary adaptive codebook history information for the frame. That code is followed by a coding unit for the frame containing the primary coding information and the adaptive codebook history information.
When secondary history redundant coding information is included for a frame (1030), the byte boundary (1035) at the beginning of the frame is followed by a coding unit that includes the frame type code 00 (the code for a regular frame), followed by the primary coding information for the regular frame. However, following the byte boundary (1045) at the end of the primary coding information, another coding unit includes the frame type code 11, which indicates that optional secondary history information (1040) follows (rather than primary coding information for a frame). Because the secondary history information (1040) is used only when the previous frame is lost, a packetizer or other component may be given the option of omitting this information. This may be done for various reasons, such as when the overall bit rate needs to be reduced, when the packet loss rate is low, or when the previous frame is included in the same packet as the current frame. Alternatively, a demultiplexer or other component may be given the option of stripping out the secondary history information when the regular frame (1030) is successfully received.
Similarly, when extra codebook stage redundant coding information is included for a frame (1050), the byte boundary (1055) at the beginning of the frame is followed by the frame type code 00 (the code for a regular frame), followed by the primary coding information for the regular frame. However, following the byte boundary (1065) at the end of the primary coding information, another coding unit includes the frame type code 01, which indicates that optional extra codebook stage information (1060) follows. As with the secondary history information (1040), the extra codebook stage information (1060) is used only when the previous frame is lost. Thus, as with the secondary history information, a packetizer or other component may be given the option of omitting the extra codebook stage information, or a demultiplexer or other component may be given the option of stripping it out.
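The four frame type codes described above (00, 10, 11, 01) admit a simple dispatch table. A sketch only: real coding units carry a variable-length payload after the two-bit field, which is omitted here, and the labels are descriptive rather than taken from Table 4:

```python
FRAME_TYPES = {                       # two-bit frame type field (per Fig. 10)
    0b00: "regular",                  # primary coding information follows
    0b10: "primary history",          # primary coding + re-encoded history
    0b11: "secondary history",        # optional redundant coding unit
    0b01: "extra codebook stage",     # optional redundant coding unit
}

def classify_units(frame_type_bits):
    """Label a sequence of coding units by their two-bit frame type codes."""
    return [FRAME_TYPES[b] for b in frame_type_bits]
```

For example, a regular frame followed by an optional extra codebook stage unit appears as the code sequence 00, 01.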
An application (e.g., an application handling transport layer packetization) may decide to combine multiple frames into a larger packet to reduce the extra bits needed for packet headers. Within the packet, the application can determine the frame boundaries by scanning the bitstream.
Fig. 11 shows a possible bitstream of a single packet (1100) containing four frames (1110, 1120, 1130, 1140). It may be assumed that all frames within a single packet are received if any of them is received (i.e., no partial data corruption) and that the adaptive codebook lag, or pitch, is typically less than the frame length. In this example, the optional redundant coding information is generally not useful for frame 2 (1120), frame 3 (1130), or frame 4 (1140), because if the current frame is present, its previous frame is also present. Accordingly, the optional redundant coding information for all frames except the first frame (1110) in the packet may be removed. The result is a compressed packet (1150) in which frame 1 (1160) retains its optional extra codebook stage information, but all optional redundant coding information has been removed from the remaining frames (1170, 1180, 1190).
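The compression step described above can be sketched as follows, assuming each frame is represented as a dict with its primary coding bits and an optional redundant unit (this representation is an assumption for illustration, not the bitstream format itself):

```python
def compress_packet(frames):
    """Drop optional redundant coding from all but the first frame.

    Each frame is a dict with 'primary' bits and an 'optional' redundant
    unit (secondary history or extra codebook stage) that may be None.
    Frames after the first keep only their primary coding, because their
    previous frame travels in the same packet and is lost only if the
    whole packet is lost.
    """
    out = []
    for i, frame in enumerate(frames):
        kept = dict(frame)            # copy; the input packet is not mutated
        if i > 0:
            kept["optional"] = None   # redundant unit is never needed here
        out.append(kept)
    return out
```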
If the encoder uses the primary history redundant coding technique, the application cannot drop those bits when packing the frames together into a single packet, because the primary history redundant coding information is used whether or not the previous frame is lost. However, if the application knows that a frame will be in a multi-frame packet and will not be the first frame in that packet, it can force the encoder to encode the frame as a conventional frame.
Although figs. 10 and 11 and the accompanying description show byte-aligned boundaries between frames and information types, the boundaries may alternatively not be byte-aligned. Further, figs. 10 and 11 and the accompanying description illustrate example frame type codes and combinations of frame types. Alternatively, an encoder and decoder may use other and/or additional frame types or combinations of frame types.
Having described and illustrated the principles of the invention with reference to the described embodiments, it will be recognized that the described embodiments can be modified in arrangement and detail without departing from those principles. It should be understood that the programs, processes, or methods described herein are not related or limited to any particular type of computing environment, unless indicated otherwise. Various types of general purpose or special purpose computing environments may be used with, or perform operations in accordance with, the teachings described herein. Elements of the described embodiments shown in software may be implemented in hardware, and vice versa.
Claims (15)
1. An audio decoding method, comprising:
at an audio processing tool, processing a bitstream related to an audio signal, wherein the bitstream comprises:
primary coding information for a current frame, the primary coding information referencing a segment of a previous frame to be used in decoding the current frame; and
redundant encoded information for decoding the current frame, the redundant encoded information including excitation signal history information associated with a referenced segment of the previous frame, wherein the excitation signal history information includes excitation signal history information for the referenced segment but does not include excitation signal history information for one or more non-referenced segments of the previous frame; and
and outputting the result.
2. The method of claim 1, wherein the audio processing tool is a real-time speech decoder and the result is decoded speech.
3. The method of claim 1, wherein the audio processing tool is a speech decoder, and wherein the processing comprises using the redundant coding information in decoding the current frame regardless of whether the previous frame is available to the decoder.
4. The method of claim 1, wherein the audio processing tool is a speech decoder, the processing comprising using the redundant coding information in decoding the current frame only if the previous frame is not available to the decoder.
5. The method of claim 1, wherein the signal history information is encoded at a quality level that is set at least partially dependent on a probability of using the redundant coding information when decoding the current frame.
6. A method of audio decoding, comprising:
at an audio processing tool, processing a bitstream comprising a plurality of frames, wherein each frame of the plurality of frames comprises a field indicating:
whether the frame includes primary coding information representing a segment of an audio signal, wherein the primary coding information references a segment of a preceding frame to be used in decoding the frame; and
whether the frame includes redundant coding information for use in decoding primary coding information, wherein the redundant coding information includes excitation signal history information associated with a referenced segment of the preceding frame or parameters for an additional codebook stage for use in decoding the frame only if the preceding frame is unavailable;
wherein the excitation signal history information includes excitation signal history information for the referenced segment but not excitation signal history information for one or more non-referenced segments of the previous frame.
7. The method of claim 6, wherein the field for each frame indicates whether the frame includes:
both primary and redundant coding information;
primary coded information, but no redundant coded information; or
Redundant coded information, but no primary coded information.
8. The method of claim 6, wherein the processing comprises packetizing at least a portion of the plurality of frames, wherein each packetized frame containing redundant coding information used to decode corresponding primary coding information, but not itself including that primary coding information, is included in a packet with the corresponding primary coding information.
9. The method of claim 6, wherein the processing comprises determining whether redundant coding information within a current frame of the plurality of frames is optional.
10. The method of claim 9, wherein the processing further comprises determining whether to pack redundant coded information within the current frame if the redundant coded information within the current frame is optional.
11. The method of claim 6, wherein if a current frame of the plurality of frames includes redundant coding information, the field for the current frame indicates a classification of the redundant coding information for the current frame.
12. An audio decoding system comprising:
at an audio processing tool, means for processing a bitstream related to an audio signal, wherein the bitstream comprises:
primary coding information for a current frame, the primary coding information referencing a segment of a previous frame to be used in decoding the current frame; and
redundant encoded information for decoding the current frame, the redundant encoded information including excitation signal history information associated with a referenced segment of the previous frame, wherein the excitation signal history information includes excitation signal history information for the referenced segment but does not include excitation signal history information for one or more non-referenced segments of the previous frame; and
means for outputting the result.
13. The audio decoding system of claim 12, wherein the audio decoding system is a real-time speech decoder and the result is decoded speech.
14. The audio decoding system of claim 12, wherein the audio decoding system is a speech decoder, and wherein the means for processing comprises means for using the redundant coding information in decoding the current frame regardless of whether the previous frame is available to the decoder.
15. The audio decoding system of claim 12, wherein the excitation signal history information is encoded at a quality level that is set at least partially dependent on a probability of using the redundant coding information when decoding the current frame.
Applications Claiming Priority (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US11/142,605 | 2005-05-31 | ||
| US11/142,605 US7177804B2 (en) | 2005-05-31 | 2005-05-31 | Sub-band voice codec with multi-stage codebooks and redundant coding |
| PCT/US2006/012686 WO2006130229A1 (en) | 2005-05-31 | 2006-04-05 | Sub-band voice codec with multi-stage codebooks and redundant coding |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1123621A1 HK1123621A1 (en) | 2009-06-19 |
| HK1123621B true HK1123621B (en) | 2013-06-14 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CA2611829C (en) | Sub-band voice codec with multi-stage codebooks and redundant coding | |
| CA2609539C (en) | Audio codec post-filter | |
| JP5072835B2 (en) | Robust decoder | |
| HK1123621B (en) | Sub-band voice codec with multi-stage codebooks and redundant coding |