HK1176455B - Complex-transform channel coding with extended-band frequency coding - Google Patents


Info

Publication number: HK1176455B
Application number: HK13103638.7A
Authority: HK (Hong Kong)
Prior art keywords: channel, audio, frequency, extension, transform
Other languages: Chinese (zh)
Other versions: HK1176455A1 (en)
Inventors: S. Mehrotra, W-G. Chen
Original assignee: 微软技术许可有限责任公司 (Microsoft Technology Licensing, LLC)
Priority claimed from US 11/336,606 (US 7,831,434 B2)
Application filed by 微软技术许可有限责任公司
Publication of HK1176455A1
Publication of HK1176455B

Description

Complex-transform channel coding with extended-band frequency coding
This application is a divisional of the invention patent application entitled "Complex-transform channel coding with extended-band frequency coding", international application number PCT/US2007/000021, international filing date January 3, 2007, which entered the Chinese national stage with application number 200780002567.0.
Background
Engineers use various techniques to efficiently process digital audio while maintaining the quality of the digital audio. To understand these techniques, it is helpful to understand how audio information is represented and processed in a computer.
I. Representation of audio information in a computer
A computer processes audio information as a series of numbers representing the audio information. For example, a single number can represent an audio sample, which is an amplitude value at a particular time. Several factors affect the quality of the audio information, including sample depth, sampling rate, and channel mode.
Sample depth (or precision) indicates the range of numbers used to represent a sample. The more possible values for the sample, the higher the quality, because the number can capture more subtle variations in amplitude. For example, an 8-bit sample has 256 possible values, while a 16-bit sample has 65,536 possible values. The sampling rate (usually measured as the number of samples per second) also affects quality. The higher the sampling rate, the higher the quality, because more frequencies of sound can be represented. Some common sampling rates are 8,000, 11,025, 22,050, 32,000, 44,100, 48,000, and 96,000 samples/second.
Mono and stereo are two common channel modes for audio. In mono mode, audio information is present in one channel. In stereo mode, audio information is present in two channels, usually labeled left and right. Other modes with more channels, such as 5.1-channel, 7.1-channel, or 9.1-channel surround sound (the ".1" indicating a subwoofer or low-frequency effects channel), are also possible. Table 1 shows several audio formats with different quality levels, along with the corresponding raw bit rate costs.
Table 1: bit rate for different quality audio information
Surround sound audio typically has an even higher raw bit rate.
As shown in table 1, the cost of high quality audio information is high bit rate. High quality audio information consumes a large amount of computer storage and transmission capacity. However, companies and consumers are increasingly relying on computers to create, distribute, and playback high quality audio content.
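The raw bit rates behind Table 1 follow directly from the three quality factors described above. A minimal sketch (illustrative Python, not part of the patent's disclosure, with CD-quality stereo as the worked example):

```python
def raw_bitrate_bps(sample_depth_bits, sampling_rate_hz, channels):
    """Raw PCM bit rate: bits per sample x samples per second x channels."""
    return sample_depth_bits * sampling_rate_hz * channels

# CD-quality stereo: 16-bit samples at 44,100 samples/second in 2 channels
cd = raw_bitrate_bps(16, 44100, 2)   # 1,411,200 bits/second (about 1.4 Mbit/s)

# Telephone-quality mono: 8-bit samples at 8,000 samples/second
phone = raw_bitrate_bps(8, 8000, 1)  # 64,000 bits/second
```

An hour of CD-quality stereo therefore occupies roughly 635 MB uncompressed, which is why the compression techniques below matter.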
II. Processing audio information in a computer
Many computers and computer networks lack the resources to process raw digital audio. Compression (also called encoding or coding) reduces the cost of storing and transmitting audio information by converting the information into a lower bit rate form. Decompression (also called decoding) extracts a reconstructed version of the original information from the compressed form. Encoder and decoder systems include Microsoft Windows Media Audio ("WMA") encoders and decoders and certain versions of the WMA Pro encoder and decoder.
Compression may be lossless (in which quality does not suffer) or lossy (in which quality suffers, but the bit rate reduction from subsequent lossless compression is more dramatic). For example, lossy compression is used to approximate the original audio information, and the approximation is then losslessly compressed. Lossless compression techniques include run-length coding, run-level coding, variable-length coding, and arithmetic coding. Corresponding decompression techniques (also called entropy decoding techniques) include run-length decoding, run-level decoding, variable-length decoding, and arithmetic decoding.
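As an illustration of the simplest of these lossless techniques, run-length coding collapses repeated symbols into (symbol, count) pairs; the round trip below is exact, which is what makes the technique lossless. (Illustrative sketch only; the codecs discussed here use more elaborate run-level and variable-length codes.)

```python
def rle_encode(symbols):
    """Run-length encode a sequence as (symbol, count) pairs."""
    runs = []
    for s in symbols:
        if runs and runs[-1][0] == s:
            runs[-1] = (s, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((s, 1))               # start a new run
    return runs

def rle_decode(runs):
    """Expand (symbol, count) pairs back into the original sequence."""
    return [s for s, n in runs for _ in range(n)]

# Quantized spectral data is full of zero runs, which is why this helps
data = [0, 0, 0, 0, 5, 0, 0, -3]
runs = rle_encode(data)          # [(0, 4), (5, 1), (0, 2), (-3, 1)]
assert rle_decode(runs) == data  # lossless round trip
```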
One purpose of audio compression is to digitally represent an audio signal to provide the maximum quality of the perceived signal with the least number of bits possible. With this goal in mind, various contemporary audio coding systems utilize various lossy compression techniques. These lossy compression techniques typically involve perceptual modeling/weighting and quantization after frequency transformation. The corresponding decompression involves inverse quantization, inverse weighting and inverse frequency transformation.
Frequency transform techniques convert data into a form that makes it easier to separate perceptually insignificant information from perceptually significant information. Less important information may then be compressed more lossy, while more important information is retained to provide the best perceived quality for a given bit rate. Frequency transforms typically receive audio samples and convert them from the time domain into data in the frequency domain, sometimes referred to as frequency coefficients or spectral coefficients.
Perceptual modeling involves processing audio data according to a model of the human auditory system to improve the perceptual quality of a reconstructed audio signal for a given bit rate. For example, an auditory model typically considers the range of human hearing and critical bands. Using the results of perceptual modeling, the encoder shapes distortion (e.g., quantization noise) in the audio data with the goal of minimizing the audibility of the distortion for a given bitrate.
Quantization maps a range of input values to a single value, introducing irreversible information loss, but also allowing the encoder to adjust the quality and bit rate of the output. Sometimes, the encoder performs quantization in conjunction with a rate controller that adjusts the quantization to adjust the bit rate and/or quality. There are various types of quantization, including adaptive and non-adaptive, scalar and vector, uniform and non-uniform. Perceptual weighting can be considered a form of non-uniform quantization. Inverse quantization and inverse weighting reconstruct the weighted, quantized frequency coefficient data into an approximation of the original frequency coefficient data. An inverse frequency transform then converts the reconstructed frequency coefficient data into reconstructed time domain audio samples.
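A uniform scalar quantizer of the kind described above can be sketched in a few lines; the step size controls the quality/bit rate trade-off, and the reconstruction error of inverse quantization is bounded by half the step size. (Illustrative sketch only, not the codec's actual quantizer.)

```python
def quantize(x, step):
    """Map a range of input values (of width `step`) to a single integer index."""
    return round(x / step)

def dequantize(q, step):
    """Inverse quantization: reconstruct an approximation of the input."""
    return q * step

step = 0.5
xs = [0.26, -1.9, 3.14]
recon = [dequantize(quantize(x, step), step) for x in xs]  # [0.5, -2.0, 3.0]

# The information loss is irreversible, but bounded by half the step size
assert all(abs(x - r) <= step / 2 for x, r in zip(xs, recon))
```

A larger step size yields fewer distinct indices (a lower bit rate) at the cost of more quantization error, which is exactly the knob a rate controller turns.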
Joint coding of audio channels involves coding information from more than one channel together to reduce the bit rate. For example, mid/side coding (also called M/S coding or sum-difference coding) involves performing a matrix operation on the left and right stereo channels at the encoder and sending the resulting "mid" and "side" channels (normalized sum and difference channels) to the decoder. The decoder reconstructs the actual physical channels from the "mid" and "side" channels. M/S coding is lossless, allowing perfect reconstruction if no other lossy techniques (e.g., quantization) are used in the coding process.
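The M/S matrix operation and its lossless round trip can be sketched as follows (illustrative only; actual codecs apply further, lossy steps to the mid and side channels):

```python
def ms_encode(left, right):
    """Matrix the physical channels into normalized sum ('mid') and difference ('side')."""
    mid  = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    """Reconstruct the physical left/right channels from mid/side."""
    left  = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right

L, R = [100, -40, 7], [90, -44, 7]
mid, side = ms_encode(L, R)
assert ms_decode(mid, side) == (L, R)  # the matrix operation itself is lossless
```

The bit rate saving comes from the side channel: when left and right are correlated, the side channel is near zero and compresses cheaply.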
Intensity stereo coding is one example of a lossy joint coding technique that can be used at low bitrates. Intensity stereo coding involves adding the left and right channels at the encoder and then scaling the information from the sum channel at the decoder during reconstruction of the left and right channels. Typically, intensity stereo coding is performed at higher frequencies, where artifacts introduced by this lossy technique are less noticeable.
Given the importance of compression and decompression to media processing, it is not surprising that compression and decompression are well developed areas. Regardless of the advantages of the prior art techniques and systems, however, they do not have the various advantages of the techniques and systems described herein.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
In general, the detailed description relates to strategies for encoding and decoding multi-channel audio. For example, an audio decoder uses one or more techniques to improve the quality and/or bitrate of multi-channel audio data. This improves the overall listening experience and makes the computer system a more compelling platform for creating, distributing and playing back high quality multi-channel audio. The encoding and decoding strategies described herein include various techniques and tools that may be used in combination or independently.
For example, an audio encoder receives multi-channel audio data that includes a set of multiple source channels. The encoder performs channel extension encoding on the multi-channel audio data. Channel extension coding comprises coding the combined channel for the group and determining a plurality of parameters for representing the individual source channels of the group as a modified form of the coded combined channel. The encoder also performs frequency extension encoding on the multi-channel audio data. Frequency extension encoding may include, for example, dividing a frequency band in multi-channel audio data into a baseband group and an extension band group, and encoding audio coefficients in the extension band group based on the audio coefficients in the baseband group.
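The baseband/extension-band idea can be sketched as follows. This is an illustrative simplification, not the codec's actual bitstream format: each extension band is represented only by a scale factor plus a reference to a baseband band's shape (here always band 0 is used as the reference), so that band energies, rather than exact coefficients, are preserved.

```python
import math

def band_energy(band):
    return sum(c * c for c in band) or 1e-12  # avoid division by zero

def fx_encode(coeffs, band_size, num_base_bands):
    """Split the spectrum into bands; code the baseband bands directly, and code
    each extension band as (reference baseband band, scale factor)."""
    bands = [coeffs[i:i + band_size] for i in range(0, len(coeffs), band_size)]
    base = bands[:num_base_bands]
    params = []
    for ext in bands[num_base_bands:]:
        ref = 0  # simplification: always reuse the shape of baseband band 0
        scale = math.sqrt(band_energy(ext) / band_energy(base[ref]))
        params.append((ref, scale))
    return base, params

def fx_decode(base, params):
    """Rebuild the spectrum: baseband as-is, extension bands as scaled copies."""
    coeffs = [c for band in base for c in band]
    for ref, scale in params:
        coeffs.extend(scale * c for c in base[ref])
    return coeffs

coeffs = [1.0, 2.0, 3.0, 4.0, 0.5, 0.5, 2.0, 1.0]
base, params = fx_encode(coeffs, band_size=2, num_base_bands=2)
decoded = fx_decode(base, params)
assert decoded[:4] == coeffs[:4]  # baseband coefficients survive exactly
assert abs(band_energy(decoded[4:6]) - band_energy(coeffs[4:6])) < 1e-9
```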
As another example, an audio decoder receives encoded multi-channel audio data that includes channel extension encoded data and frequency extension encoded data. The decoder reconstructs a plurality of audio channels using the channel extension encoded data and the frequency extension encoded data. The channel extension encoding data includes a combined channel for a plurality of audio channels and a plurality of parameters for representing individual ones of the plurality of audio channels as modified versions of the combined channel.
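The combined-channel-plus-parameters idea can be sketched as follows. This is an illustrative simplification (the channel extension described in this document uses complex, per-band parameters): each channel is reconstructed as a modified (here simply scaled) version of the combined channel, with scale parameters chosen so that each reconstructed channel's power matches that of the corresponding source channel.

```python
import math

def power(x):
    return sum(v * v for v in x)

def cx_encode(channels):
    """Code one combined channel plus, per source channel, a power-ratio parameter."""
    n = len(channels[0])
    combined = [sum(ch[i] for ch in channels) / len(channels) for i in range(n)]
    pc = power(combined) or 1e-12
    ratios = [power(ch) / pc for ch in channels]   # the transmitted parameters
    return combined, ratios

def cx_decode(combined, ratios):
    """Each channel is a scaled version of the combined channel."""
    return [[math.sqrt(r) * v for v in combined] for r in ratios]

L = [1.0, 2.0, -1.0]
R = [0.5, 1.5, -0.5]
combined, ratios = cx_encode([L, R])
Lr, Rr = cx_decode(combined, ratios)
# The parameters preserve each channel's power, though not its exact waveform
assert abs(power(Lr) - power(L)) < 1e-9
assert abs(power(Rr) - power(R)) < 1e-9
```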
As another example, an audio decoder receives multi-channel audio data and performs an inverse multi-channel transform, an inverse basic time-frequency transform, a frequency extension process, and a channel extension process on the received multi-channel audio data. The decoder may perform decoding corresponding to the encoding performed in the encoder, and/or additional steps such as a forward complex transform of the received data, and may perform these steps in various orders.
For several aspects described herein with respect to an audio encoder, an audio decoder performs corresponding processing and decoding.
The foregoing and other objects, features and advantages will become more apparent from the following detailed description, which proceeds with reference to the accompanying figures.
Drawings
FIG. 1 is a block diagram of a general operating environment in conjunction with which the described embodiments may be implemented.
Fig. 2, 3, 4, and 5 are block diagrams of generalized encoders and/or decoders in conjunction with which the described embodiments may be implemented.
Fig. 6 is a diagram showing an example tile configuration.
FIG. 7 is a flow diagram illustrating a general technique for multi-channel pre-processing.
FIG. 8 is a flow diagram illustrating a general technique for multi-channel post-processing.
Fig. 9 is a flow diagram illustrating a technique for deriving complex scale factors for combined channels in channel extension coding.
Fig. 10 is a flow diagram illustrating a technique for using complex scale factors in channel extension decoding.
Fig. 11 is a diagram illustrating scaling of combined channel coefficients in channel reconstruction.
FIG. 12 is a chart showing a graphical comparison of actual power ratios and power ratios interpolated from power ratios at anchor points.
Figs. 13-33 are equations and related matrix arrangements showing details of channel extension processing in some implementations.
Fig. 34 is a block diagram of aspects of an encoder that performs frequency extension coding.
FIG. 35 is a flow diagram illustrating an example technique for encoding extended band sub-bands.
Fig. 36 is a block diagram of aspects of a decoder that performs frequency extension decoding.
Fig. 37 is a block diagram of aspects of an encoder that performs channel extension coding and frequency extension coding.
Figs. 38, 39, and 40 are block diagrams of aspects of decoders that perform channel extension decoding and frequency extension decoding.
Fig. 41 is a diagram showing a representation of displacement vectors for two audio blocks.
Fig. 42 is a diagram showing the arrangement of audio blocks with anchor points for interpolation of scale parameters.
Detailed Description
Various techniques and tools for representing, encoding, and decoding audio information are described. These techniques and tools facilitate the creation, distribution, and playback of high quality audio content even at very low bit rates.
The various techniques and tools described herein may be used independently. Certain techniques and tools may also be used in combination (e.g., at various stages of a combined encoding and/or decoding process).
Various techniques will be described below with reference to flowcharts of processing acts. The various processing acts illustrated in the flowcharts may be combined into fewer acts or divided into more acts. For simplicity, the relationships between acts illustrated in a particular flowchart and acts described elsewhere are generally not illustrated. In many cases, the actions in the flow diagrams may be rearranged.
Much of the detailed description focuses on representing, encoding, and decoding audio information. Many of the techniques and tools described herein for representing, encoding, and decoding audio information may also be applied to video information, still image information, or other media information sent in a single or multiple channels.
I. Computing environment
FIG. 1 illustrates a general example of a suitable computing environment 100 in which the described embodiments may be implemented. The computing environment 100 is not intended to suggest any limitation as to scope of use or functionality, as the described embodiments may be implemented in diverse general-purpose or special-purpose computing environments.
Referring to FIG. 1, computing environment 100 includes at least one processing unit 110 and memory 120. In fig. 1, this most basic configuration 130 is included within a dashed line. The processing unit 110 executes computer-executable instructions and may be a real or virtual processor. In a multi-processing system, multiple processing units execute computer-executable instructions to increase processing power. Memory 120 may be volatile memory (e.g., registers, cache, RAM), non-volatile memory (e.g., ROM, EEPROM, flash memory), or some combination of the two. Memory 120 stores software 180 that implements one or more audio processing techniques and/or systems in accordance with one or more described embodiments.
The computing environment may have additional features. For example, computing environment 100 includes storage 140, one or more input devices 150, one or more output devices 160, and one or more communication connections 170. An interconnection mechanism (not shown) such as a bus, controller, or network interconnects the components of the computing environment 100. Typically, operating system software (not shown) provides an operating environment for software executing in the computing environment 100 and coordinates activities of the components of the computing environment 100.
Storage 140 may be removable or non-removable, and includes magnetic disks, magnetic tapes or cassettes, CDs, DVDs, or any other medium which can be used to store information and which can be accessed within computing environment 100. Storage 140 stores instructions for software 180.
The input device 150 may be a touch input device such as a keyboard, mouse, pen, touch screen, or trackball, a voice input device, a scanning device, or another device that provides input to the computing environment 100. For audio or video, the input device 150 may be a microphone, sound card, video card, TV tuner card, or similar device that accepts audio or video input in analog or digital form, or a CD or DVD that reads audio or video samples into the computing environment. Output device 160 may be a display, printer, speaker, CD/DVD recorder, network adapter, or another device that provides output from computing environment 100.
Communication connection(s) 170 allow communication over a communication medium to one or more other computing entities. The communication medium conveys information such as computer-executable instructions, audio or video information, or other data in a data signal. A modulated data signal is a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired or wireless techniques implemented with an electrical, optical, RF, infrared, acoustic, or other carrier.
Embodiments may be described in the general context of computer-readable media. Computer readable media are any available media that can be accessed within a computing environment. By way of example, and not limitation, for computing environment 100, computer-readable media include memory 120, storage 140, communication media, and combinations of any of the above.
Embodiments may be described in the general context of computer-executable instructions, such as those included in program modules, being executed in a computing environment on a real or virtual target processor. Generally, program modules include routines, programs, libraries, objects, classes, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The functionality of the program modules may be combined or split between program modules as desired in various embodiments. Computer-executable instructions for program modules may be executed in local or distributed computing environments.
For the purposes of this description, the detailed description uses terms such as "determine," "receive," and "execute" to describe computer operations in a computing environment. These terms are high-level abstractions for operations performed by a computer, and should not be confused with acts performed by a human being. The actual computer operations corresponding to these terms vary depending on the implementation.
II. Example encoders and decoders
FIG. 2 illustrates a first audio encoder 200 in which one or more of the described embodiments may be implemented. The encoder 200 is a transform-based perceptual audio encoder 200. Fig. 3 shows a corresponding audio decoder 300.
Fig. 4 shows a second audio encoder 400 in which one or more of the described embodiments may be implemented. The encoder 400 is also a transform-based perceptual audio encoder, but the encoder 400 includes additional modules for processing multi-channel audio. Fig. 5 shows a corresponding audio decoder 500.
Although the systems shown in fig. 2 to 5 are generic, they each have characteristics that can be found in real systems. In any case, the relationships shown between the modules within the encoder and decoder indicate the flow of information in the encoder and decoder; other relationships are not shown for simplicity. Depending on the desired implementation and type of compression, modules of the encoder or decoder may be added, omitted, split into multiple modules, combined with other modules, and/or replaced with similar modules. In alternative embodiments, encoders/decoders with different modules and/or other configurations process audio data or some other type of data in accordance with one or more of the described embodiments.
A. First audio encoder
The encoder 200 receives a time series of input audio samples 205 at a certain sampling depth and rate. The input audio samples 205 are for multi-channel audio (e.g., stereo) or mono audio. The encoder 200 compresses the audio samples 205 and multiplexes information generated by the various modules of the encoder 200 to output a bitstream 295 in a format such as WMA format, a container format such as advanced stream format ("ASF"), or other compressed or container format.
The frequency transformer 210 receives the audio samples 205 and converts them into data in the frequency (or spectral) domain. For example, the frequency transformer 210 splits the audio samples (205) of a frame into sub-frame blocks, which may be of variable size to allow for variable time resolution. The blocks may overlap to reduce perceptible discontinuities between blocks that would otherwise be introduced by later quantization. Frequency transformer 210 applies a time-varying modulated lapped transform ("MLT"), a modulated DCT ("MDCT"), some other variant of an MLT or DCT, or some other type of modulated or unmodulated, overlapped or non-overlapped frequency transform to a block, or uses sub-band or wavelet coding. The frequency transformer 210 outputs blocks of spectral coefficient data to a multiplexer ("MUX") 280 and outputs side information such as block size.
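A modulated lapped transform of the MDCT family can be sketched directly from its definition. This illustrative (and deliberately slow) implementation is not the codec's actual transform: it uses a sine window and 50%-overlapped blocks, and overlap-add of inverse-transformed blocks reconstructs interior samples exactly because the time-domain aliasing of adjacent blocks cancels.

```python
import math

def mdct(block):
    """MDCT of one block of 2N samples (analysis window applied) -> N coefficients."""
    N = len(block) // 2
    w = [math.sin(math.pi / (2 * N) * (n + 0.5)) for n in range(2 * N)]
    return [sum(block[n] * w[n] *
                math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for n in range(2 * N))
            for k in range(N)]

def imdct(X):
    """Inverse MDCT with synthesis windowing; output blocks must be overlap-added."""
    N = len(X)
    w = [math.sin(math.pi / (2 * N) * (n + 0.5)) for n in range(2 * N)]
    return [(2.0 / N) * w[n] *
            sum(X[k] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for k in range(N))
            for n in range(2 * N)]

# 50%-overlapped blocks reduce perceptible discontinuities at block boundaries;
# interior samples come back exactly (time-domain aliasing cancellation)
N = 4
x = [math.sin(0.7 * i) for i in range(4 * N)]
out = [0.0] * (4 * N)
for off in (0, N, 2 * N):
    y = imdct(mdct(x[off:off + 2 * N]))
    for n in range(2 * N):
        out[off + n] += y[n]
assert all(abs(out[i] - x[i]) < 1e-9 for i in range(N, 3 * N))
```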
For multi-channel audio data, the multi-channel transformer 220 may convert a plurality of original, independently encoded channels into jointly encoded channels. Alternatively, the multi-channel transformer 220 may pass the left and right channels as independently encoded channels. The multi-channel transformer 220 generates side information indicating the channel mode used to the MUX 280. The encoder 200 may apply multi-channel re-matrixing to blocks of audio data after the multi-channel transform.
The perception modeler 230 models characteristics of the human auditory system to improve the perceptual quality of a reconstructed audio signal for a given bit rate. The perception modeler 230 uses any of a variety of auditory models and passes excitation pattern information or other information to the weighter 240. For example, an auditory model typically considers the range of human hearing and critical bands (e.g., Bark bands). In addition to range and critical bands, interactions between audio signals can significantly affect perception. In addition, the auditory model may take into account various other factors related to the physical or neural aspects of human perception of sound.
The perception modeler 230 outputs information that the weighter 240 uses to shape noise in the audio data so as to reduce the audibility of the noise. For example, using any of various techniques, the weighter 240 generates weighting factors for a quantization matrix (sometimes called a mask) based on the received information. The weighting factors for the quantization matrix include a weight for each of multiple quantization bands in the matrix, where a quantization band is a range of frequency coefficients. The weighting factors thus indicate how noise/quantization error is apportioned across the quantization bands, controlling the spectral/temporal distribution of the noise/quantization error, with the goal of minimizing the audibility of the noise by putting more noise in bands where it is less audible, and vice versa.
The weighter 240 then applies weighting factors to the data received from the multichannel transformer 220.
The quantizer 250 quantizes the output of the weighter 240, producing quantized coefficient data to the entropy encoder 260 and side information including quantization step size to the MUX 280. In fig. 2, the quantizer 250 is an adaptive, uniform, scalar quantizer. The quantizer 250 applies the same quantization step size to each spectral coefficient, but the quantization step size itself can change from one iteration of a quantization loop to the next to affect the bit rate of the entropy encoder 260 output. Other kinds of quantization are non-uniform quantization, vector quantization, and/or non-adaptive quantization.
The entropy encoder 260 losslessly compresses the quantized coefficient data received from the quantizer 250, for example, performing run-level coding and vector variable length coding. The entropy encoder 260 may calculate the number of bits spent encoding the audio information and pass this information to the rate/quality controller 270.
The controller 270 works in conjunction with the quantizer 250 to adjust the bit rate and/or quality of the output of the encoder 200. The controller 270 outputs the quantization step size to the quantizer 250 with the goal of satisfying the bit rate and quality constraints.
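One simple form of such a quantization loop can be sketched as follows. This is an illustrative sketch, not the encoder 200's actual rate control: the bit-cost function is a crude stand-in for the entropy coder (hypothetical costs, not WMA's code tables), and the controller simply grows the step size until the coded size fits the bit budget.

```python
def bit_cost(qcoeffs):
    """Stand-in for the entropy coder's bit count: zeros are cheap, nonzeros cost more."""
    return sum(1 if q == 0 else 8 for q in qcoeffs)

def rate_control(coeffs, bit_budget, step=1.0):
    """Grow the quantization step size until the coded size fits the budget."""
    while True:
        qcoeffs = [round(c / step) for c in coeffs]
        if bit_cost(qcoeffs) <= bit_budget:
            return step, qcoeffs
        step *= 1.25  # coarser quantization -> more zeros -> fewer bits

coeffs = [5.0, 3.2, -0.4, 0.1, 7.9, -2.2, 0.05, 0.3]
step, q = rate_control(coeffs, bit_budget=40)
assert bit_cost(q) <= 40
```

The loop always terminates: as the step size grows, every coefficient eventually quantizes to zero, so the cost falls to its minimum.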
In addition, the encoder 200 may apply noise substitution and/or band truncation to the block of audio data.
The MUX 280 multiplexes the side information received from the other modules of the audio encoder 200 along with the entropy-encoded data received from the entropy encoder 260. The MUX 280 may include a virtual buffer that stores the bitstream 295 to be output by the encoder 200.
B. First audio decoder
The decoder 300 receives a bitstream 305 of compressed audio information comprising entropy encoded data and side information, from which the decoder 300 reconstructs audio samples 395.
A demultiplexer ("DEMUX") 310 parses information in the bitstream 305 and sends the information to the various modules of the decoder 300. The DEMUX 310 includes one or more buffers to compensate for short-term variations in bit rate due to fluctuations in audio complexity, network jitter, and/or other factors.
The entropy decoder 320 losslessly decompresses the entropy codes received from the DEMUX 310, thereby generating quantized spectral coefficient data. The entropy decoder 320 typically applies the inverse of the entropy encoding technique used in the encoder.
The inverse quantizer 330 receives a quantization step size from the DEMUX 310 and quantized spectral coefficient data from the entropy decoder 320. The inverse quantizer 330 applies a quantization step size to the quantized frequency coefficient data to partially reconstruct the frequency coefficient data, or otherwise performs inverse quantization.
The noise generator 340 receives information from the DEMUX 310 indicating which bands in the block of data have been noise substituted and any parameters for that form of noise. The noise generator 340 generates a pattern for the indicated frequency band and passes this information to the inverse weighter 350.
The inverse weighter 350 receives weighting factors from the DEMUX 310, any noise patterns from the noise generator 340, and partially reconstructed frequency coefficient data from the inverse quantizer 330. The inverse weighter 350 decompresses the weighting factors, if necessary. The inverse weighter 350 applies the weighting factors to the partially reconstructed frequency coefficient data for frequency bands that have not been noise substituted, and then adds in the noise patterns received from the noise generator 340 for the noise-substituted frequency bands.
The inverse multi-channel transformer 360 receives the reconstructed spectral coefficient data from the inverse weighter 350 and channel mode information from the DEMUX 310. If the multi-channel audio is in independently coded channels, the inverse multi-channel transformer 360 passes the channels through. If the multi-channel data is in jointly coded channels, the inverse multi-channel transformer 360 converts the data into independently coded channels.
The inverse frequency transformer 370 receives the spectral coefficient data output by the inverse multi-channel transformer 360 as well as side information such as block sizes from the DEMUX 310. The inverse frequency transformer 370 applies the inverse of the frequency transform used in the encoder and outputs blocks of reconstructed audio samples 395.
C. Second audio encoder
Referring to fig. 4, the encoder 400 receives a time series of input audio samples 405 at a certain sampling depth and rate. The input audio samples 405 are for multi-channel audio (e.g., stereo, surround) or mono audio. The encoder 400 compresses the audio samples 405 and multiplexes information generated by the various modules of the encoder 400 to output a bitstream 495 in a format such as a WMA Pro format, a container format such as ASF, or other compressed or container format.
The encoder 400 selects between a plurality of encoding modes for the audio samples 405. In fig. 4, the encoder 400 switches between a mixed/pure lossless coding mode and a lossy coding mode. The lossless coding mode includes a hybrid/pure lossless encoder 472 and is typically used for high quality (and high bit rate) compression. The lossy coding mode includes components such as a weighter 442 and a quantizer 460 and is typically used for adjustable quality (and controlled bit rate) compression. The selection decision depends on user input or other criteria.
For lossy encoding of multi-channel audio data, the multi-channel pre-processor 410 may optionally re-matrix the time-domain audio samples 405. For example, the multi-channel pre-processor 410 selectively re-matrixes the audio samples 405 to drop one or more coded channels or increase inter-channel correlation in the encoder 400, while still allowing reconstruction (in some form) in the decoder 500. The multi-channel pre-processor 410 may send side information, such as instructions for multi-channel post-processing, to the MUX 490.
The windowing module 420 divides the frame of audio input samples 405 into sub-frame blocks (windows). The window may have a time-varying size and a window shaping function. When the encoder 400 uses lossy encoding, the variable size window allows for variable temporal resolution. The windowing module 420 outputs the partitioned data blocks to the MUX 490 and outputs side information such as block size.
In fig. 4, the tile configurer 422 partitions frames of multi-channel audio on a per-channel basis. The tile configurer 422 independently partitions each channel in the frame, as quality/bit rate allows. This allows, for example, the tile configurer 422 to isolate transients that appear in a particular channel with smaller windows, while using larger windows for frequency resolution or compression efficiency in other channels. Isolating transients on a per-channel basis can improve compression efficiency, but in many cases it requires additional information specifying the partitions in the individual channels. Windows of the same size that are co-located in time enable further redundancy reduction through multi-channel transforms. Thus, the tile configurer 422 groups same-size windows at the same position in time into tiles.
Fig. 6 shows an example tile configuration 600 for a frame of 5.1 channel audio. Tile configuration 600 includes seven tiles, numbered 0 through 6. Tile 0 includes samples from channels 0, 2, 3, and 4 and covers the first quarter of the frame. Tile 1 includes samples from channel 1 and covers the first half of the frame. Tile 2 includes samples from channel 5 and covers the entire frame. Tile 3 is the same as tile 0, but covers the second quarter of the frame. Tiles 4 and 6 include samples in channels 0, 2 and 3 and cover the third and fourth quarters of the frame, respectively. Finally, tile 5 includes samples from channels 1 and 4 and covers the second half of the frame. As shown, a particular tile may include windows in non-contiguous channels.
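The layout in fig. 6 can be checked mechanically. The sketch below (an illustrative representation, not part of the patent text) records each tile as a set of channels plus start/end positions in quarter-frame units, with tile 3 covering the second quarter, and verifies that every channel is covered exactly once at every point in the frame.

```python
# Tile configuration 600 from fig. 6 for a 5.1 (6-channel) frame.
# Each tile: (set of channels, start, end) in quarter-frame units (0..4).
TILES = {
    0: ({0, 2, 3, 4}, 0, 1),  # first quarter
    1: ({1},          0, 2),  # first half
    2: ({5},          0, 4),  # entire frame
    3: ({0, 2, 3, 4}, 1, 2),  # second quarter (same channels as tile 0)
    4: ({0, 2, 3},    2, 3),  # third quarter
    5: ({1, 4},       2, 4),  # second half
    6: ({0, 2, 3},    3, 4),  # fourth quarter
}

def coverage(tiles, num_channels=6, quarters=4):
    """Count how many tiles cover each (channel, quarter) cell."""
    grid = [[0] * quarters for _ in range(num_channels)]
    for channels, start, end in tiles.values():
        for ch in channels:
            for q in range(start, end):
                grid[ch][q] += 1
    return grid

grid = coverage(TILES)
# Every channel is covered exactly once at every point in time:
assert all(count == 1 for row in grid for count in row)
```

A valid tile configuration partitions the channel/time plane, which is what the assertion verifies.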
The frequency transformer 430 receives the audio samples and converts them into data in the frequency domain, applying the transformation as described above for the frequency transformer 210 of fig. 2. The frequency transformer 430 outputs blocks of spectral coefficient data to the weighter 442 and outputs side information such as block sizes to the MUX 490. The frequency transformer 430 outputs both the frequency coefficients and the side information to the perception modeler 440.
The perception modeler 440 models characteristics of the human auditory system to process audio data according to an auditory model generally as described above with reference to the perception modeler 230 of fig. 2.
The weighter 442 generates weighting factors for the quantization matrices based on information received from the perception modeler 440, generally as described above with reference to the weighter 240 of fig. 2. The weighter 442 applies a weighting factor to the data received from the frequency transformer 430. The weighter 442 outputs side information such as quantization matrices and channel weighting factors to the MUX 490. The quantization matrix may be compressed.
For multi-channel audio data, the multi-channel transformer 450 may apply a multi-channel transform to exploit inter-channel correlation. For example, the multi-channel transformer 450 selectively and flexibly applies the multi-channel transform to some but not all channels and/or quantization bands in a tile. The multi-channel transformer 450 selectively uses a predefined matrix or a custom matrix and applies efficient compression to the custom matrix. The multi-channel transformer 450 generates side information to the MUX 490 indicating, for example, the multi-channel transform used and the multi-channel transformed tile portion.
The quantizer 460 quantizes the output of the multi-channel transformer 450, producing quantized coefficient data to the entropy encoder 470 and side information including quantization steps to the MUX 490. In fig. 4, quantizer 460 is an adaptive, uniform, scalar quantizer that calculates a quantization factor for each tile, but quantizer 460 may also perform some other quantization.
The entropy encoder 470 losslessly compresses the quantized coefficient data received from the quantizer 460, generally as described above with reference to the entropy encoder 260 of fig. 2.
The controller 480 works in conjunction with the quantizer 460 to adjust the bit rate and/or quality of the output of the encoder 400. The controller 480 outputs the quantization factor to the quantizer 460 with the goal of satisfying quality and/or bitrate constraints.
The hybrid/pure lossless encoder 472 and associated entropy encoder 474 compress the audio data for the hybrid/pure lossless coding mode. The encoder 400 uses the hybrid/pure lossless coding mode for an entire sequence, or switches between coding modes on a frame-by-frame, block-by-block, tile-by-tile, or other basis.
The MUX 490 multiplexes the side information received from the other modules of the audio encoder 400 as well as the entropy encoded data received from the entropy encoders 470, 474. The MUX 490 includes one or more buffers for rate control or other purposes.
D. Second audio decoder
Referring to fig. 5, the second audio decoder 500 receives a bitstream 505 of compressed audio information. The bitstream 505 comprises entropy encoded data and side information from which the decoder 500 reconstructs audio samples 595.
The DEMUX 510 parses information in the bitstream 505 and sends the information to other modules of the decoder 500. The DEMUX 510 includes one or more buffers to compensate for short-term variations in bit rate due to audio complexity fluctuations, network jitter, and/or other factors.
The entropy decoder 520 losslessly decompresses the entropy codes received from the DEMUX 510, typically applying the inverse of the entropy encoding technique used in the encoder 400. When decoding data compressed in the lossy coding mode, the entropy decoder 520 generates quantized spectral coefficient data.
The hybrid/pure lossless decoder 522 and the associated entropy decoder 520 losslessly decompress the losslessly encoded audio data for the hybrid/pure lossless encoding mode.
The tile configuration decoder 530 receives information indicating the tile mode for a frame from the DEMUX 510 and decodes it as necessary. The tile mode information may be entropy encoded or otherwise parameterized. The tile configuration decoder 530 then passes the tile mode information to various other modules of the decoder 500.
The multi-channel inverse transformer 540 receives quantized spectral coefficient data from the entropy decoder 520 and tile mode information from the tile configuration decoder 530 and side information from the DEMUX 510 indicating, for example, the multi-channel transform used and the transformed tile portion. Using this information, the multi-channel inverse transformer 540 decompresses the transform matrix if necessary, and selectively and flexibly applies one or more multi-channel inverse transforms to the audio data.
The inverse quantizer/weighter 550 receives information such as tile and channel quantization factors and quantization matrices from the DEMUX 510, and receives quantized spectral coefficient data from the multi-channel inverse transformer 540. The inverse quantizer/weighter 550 decompresses the received weighting factor information as necessary, then performs the inverse quantization and inverse weighting.
The inverse frequency transformer 560 receives the spectral coefficient data output by the inverse quantizer/weighter 550, as well as side information from the DEMUX 510 and tile mode information from the tile configuration decoder 530. The inverse frequency transformer 560 applies the inverse of the frequency transform used in the encoder and outputs the blocks to the overlapper/accumulator 570.
In addition to receiving the tile mode information from the tile configuration decoder 530, the overlapper/accumulator 570 also receives decoded information from the inverse frequency transformer 560 and/or the hybrid/pure lossless decoder 522. The overlapper/accumulator 570 overlaps and adds audio data as necessary, and interleaves frames or other sequences of audio data encoded with other coding modes.
The multi-channel post-processor 580 may optionally re-matrix the time-domain audio samples output by the overlapper/accumulator 570. For post-processing under bitstream control, the post-processing transform matrix varies over time and is signaled or included in the bitstream 505.
III.Overview of multichannel processing
This section is an overview of some multi-channel processing techniques used in some encoders and decoders, including multi-channel pre-processing techniques, flexible multi-channel transform techniques, and multi-channel post-processing techniques.
A. Multi-channel preprocessing
Some encoders perform multi-channel pre-processing on input audio samples in the time domain.
In a conventional encoder, when there are N source audio channels as input, the number of output channels produced by the encoder is also N. The number of encoded channels may correspond one-to-one to the source channels, or the encoded channels may be multi-channel transform coded channels. However, when the encoding complexity of the source makes compression difficult or when the encoding buffer is full, the encoder may alter or discard (i.e., not encode) one or more of the original input audio channels or the multi-channel transform encoded channels. This may reduce the encoding complexity and improve the overall quality of the perceived audio. For quality-driven pre-processing, the encoder may perform multi-channel pre-processing as a reaction to the measured audio quality in order to smoothly control the overall audio quality and/or channel separation.
For example, an encoder may alter a multi-channel audio image to make one or more channels less important, such that these channels are discarded at the encoder and reconstructed at the decoder as "phantom" or unencoded channels. This helps to avoid the need for full channel deletion or severe quantization, which can have a significant impact on quality.
The encoder may indicate to the decoder what action to take when the number of encoded channels is less than the number of channels used for output. A multi-channel post-processing transform can then be used in the decoder to create phantom channels. For example, the encoder (via the bitstream) may instruct the decoder to create a phantom center channel by averaging the decoded left and right channels. A later multi-channel transform may exploit the redundancy between the averaged decoded left and right channels (without post-processing), or the encoder may instruct the decoder to perform some multi-channel post-processing on the decoded left and right channels. Alternatively, the encoder may signal the decoder to perform multi-channel post-processing for another purpose.
Fig. 7 illustrates a general technique 700 for multi-channel pre-processing. The encoder performs (710) multi-channel pre-processing on time-domain multi-channel audio data, resulting in transformed audio data in the time domain. For example, the pre-processing involves a general transform matrix with continuous-valued real elements. The general transform matrix may be selected to artificially increase inter-channel correlation. This reduces complexity for the rest of the encoder, but at the cost of lost channel separation.
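A minimal sketch of such pre-processing, under the assumption that it amounts to multiplying each multi-channel sample vector by a real-valued matrix (the function and matrix names here are hypothetical, not from the patent): blending every channel toward the channel mean increases inter-channel correlation at the cost of channel separation, and the blend amount can be dialed back to the identity transform.

```python
# Illustrative time-domain multi-channel pre-processing: each P-channel
# sample vector is multiplied by a generic real-valued P x P matrix.

def premix(samples, matrix):
    """Apply a P x P transform matrix to a list of P-channel sample vectors."""
    return [
        [sum(matrix[i][j] * s[j] for j in range(len(s)))
         for i in range(len(matrix))]
        for s in samples
    ]

def blend_matrix(p, a):
    """(1 - a) * identity + a * averaging matrix; a = 0 leaves input unchanged."""
    return [
        [(1 - a) * (1.0 if i == j else 0.0) + a / p for j in range(p)]
        for i in range(p)
    ]

stereo = [[1.0, -1.0], [0.5, -0.25], [-2.0, 1.0]]
# a = 1 collapses both channels to their average (maximum correlation):
mono = premix(stereo, blend_matrix(2, 1.0))
assert all(abs(s[0] - s[1]) < 1e-12 for s in mono)
# a = 0 is the identity (pre-processing effectively off):
assert premix(stereo, blend_matrix(2, 0.0)) == stereo
```

The encoder would choose the blend strength based on measured quality, trading channel separation for better overall quality.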
The output is then fed to the rest of the encoder, which encodes (720) the data using the techniques described with reference to fig. 4 or other compression techniques, in addition to any other processing that the encoder may perform, to produce encoded multi-channel audio data.
The syntax used by the encoder and decoder may allow the description of a generic or predefined post-processing multi-channel transform matrix, which may be changed or turned on/off on a frame-to-frame basis. The encoder can use this flexibility to limit stereo/surround image impairments, thereby trading off between channel separation and better overall quality in certain environments by artificially increasing inter-channel correlation. Alternatively, the decoder and encoder may use another syntax for multi-channel pre-and post-processing, e.g., a syntax that allows transform matrix changes on a basis other than frame-to-frame basis.
B. Flexible multi-channel transform
Some encoders may perform flexible multi-channel transforms that efficiently exploit inter-channel correlation. A corresponding decoder may perform a corresponding inverse multi-channel transform.
For example, an encoder may position a multi-channel transform after perceptual weighting (and a decoder may position the inverse multi-channel transform before inverse weighting) so that the cross-channel leaked signal can be controlled, measured, and given the same spectrum as the original signal. The encoder may apply weighting factors (e.g., quantization matrices and per-channel quantization step modifiers) to the multi-channel audio in the frequency domain prior to the multi-channel transform. The encoder may perform one or more multi-channel transforms on the weighted audio data and quantize the multi-channel transformed audio data.
The decoder may collect samples from multiple channels into a vector at a particular frequency index and perform an inverse multi-channel transform to generate an output. The decoder may then inverse quantize and inverse weight the multi-channel audio, thereby coloring the output of the inverse multi-channel transform with a mask. Thus, the leakage occurring across channels (due to quantization) can be spectrally shaped such that the audibility of the leaked signal can be measured and controlled, and the leakage of the other channels in a given reconstructed channel is spectrally shaped as the original uncorrupted signal of the given channel.
The encoder may group the channels for multi-channel transforms to limit which channels are transformed together. For example, the encoder may determine which channels within a tile are correlated and group the correlated channels. The encoder may take into account pairwise correlations between the signals of channels as well as correlations between frequency bands, or other and/or additional factors, when grouping channels for multi-channel transforms. For example, the encoder may calculate pairwise correlations between the signals in the channels and then group the channels accordingly. A channel that is not pairwise correlated with any channel in a group may still be compatible with the group. For channels that are incompatible with a group, the encoder may check for band-level compatibility and adjust one or more groups of channels accordingly. The encoder may identify channels that are compatible with a group in some frequency bands and incompatible in other frequency bands. Turning off the transform at incompatible frequency bands may improve the correlation between the frequency bands where multi-channel transform coding is actually performed and increase coding efficiency. The channels in a channel group need not be contiguous. A single tile may include a plurality of channel groups, and each channel group may have a different associated multi-channel transform. After deciding which channels are compatible, the encoder may put the channel group information into the bitstream. The decoder can then retrieve and process the information from the bitstream.
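The pairwise-correlation grouping described above can be sketched as follows; this is a simplified, hypothetical illustration (greedy grouping with a made-up threshold, not the patent's algorithm), ignoring the band-level compatibility refinement.

```python
# Illustrative grouping of channels for a multi-channel transform by
# normalized pairwise correlation.

def correlation(x, y):
    """Normalized cross-correlation of two equal-length signals."""
    num = sum(a * b for a, b in zip(x, y))
    den = (sum(a * a for a in x) * sum(b * b for b in y)) ** 0.5
    return num / den if den else 0.0

def group_channels(channels, threshold=0.8):
    """Greedily group channels: a channel joins the first group containing
    any channel it correlates with above the threshold."""
    groups = []
    for idx, sig in enumerate(channels):
        for g in groups:
            if any(abs(correlation(sig, channels[j])) >= threshold for j in g):
                g.append(idx)
                break
        else:
            groups.append([idx])
    return groups

left  = [1.0, 2.0, -1.0, 0.5]
right = [1.1, 1.9, -0.9, 0.6]   # strongly correlated with left
lfe   = [0.0, 0.1, 0.0, -0.1]   # weakly correlated with either
groups = group_channels([left, right, lfe])
assert groups == [[0, 1], [2]]  # left/right grouped, lfe kept separate
```

Each resulting group would then get its own multi-channel transform, with group membership signaled in the bitstream.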
The encoder may selectively turn the multi-channel transform on or off at the band level to control which bands are transformed together. In this way, the encoder can selectively exclude bands that are incompatible in a multi-channel transform. When the multi-channel transform is turned off for a particular frequency band, the encoder may use an identity transform for that frequency band, thereby passing the data at that frequency band unaltered. The number of frequency bands is related to the sampling frequency of the audio data and the tile size. In general, the higher the sampling frequency or the larger the tile size, the larger the number of bands. The encoder may selectively turn the multi-channel transform on or off at the band level for the channels of a channel group of a tile. The decoder may retrieve band on/off information for a multi-channel transform of a channel group of a tile from the bitstream according to a specific bitstream syntax.
An encoder may use a layered multi-channel transform to limit computational complexity, especially in a decoder. With a hierarchical transform, the encoder can split the overall transform into multiple stages, thereby reducing the computational complexity of the various stages and, in some cases, the amount of information required to specify a multi-channel transform. Using this cascaded structure, the encoder can emulate a larger overall transform with a smaller transform until a certain accuracy is reached. The decoder may then perform a corresponding hierarchical inverse transform. The encoder may combine the band/switch information of multiple multi-channel transforms. The decoder may retrieve information of a hierarchical structure for a multi-channel transform of a channel group from a bitstream according to a specific bitstream syntax.
The encoder may use a predefined multi-channel transform matrix to reduce the bit rate for a given transform matrix. The encoder may select from a variety of available predefined matrix types and signal the selected matrix in the bitstream. Certain types of matrices may not need to be additionally signaled in the bitstream. Others require additional specifications. The decoder may retrieve information indicating the matrix type and, if necessary, additional information specifying the matrix.
The encoder may calculate and apply a quantization matrix for the channels of a tile, per-channel quantization step modifiers, and an overall tile quantization factor. This allows the encoder to shape the noise according to the auditory model, balance the noise between channels, and control the overall distortion. A corresponding decoder may decode and apply the overall tile quantization factor, the per-channel quantization step modifiers, and the quantization matrices for the channels of the tile, and may combine the inverse quantization and inverse weighting steps.
C. Multi-channel post-processing
Some decoders perform multi-channel post-processing on reconstructed audio samples in the time domain.
For example, the number of decoded channels may be less than the number of channels used for output (e.g., because the decoder did not decode one or more input channels). If so, a multi-channel post-processing transform can be used to create one or more "phantom" channels based on the actual data in the decoded channels. If the number of decoded channels equals the number of output channels, the post-processing transform may be used for any spatial rotation of the presentation, remapping of output channels between speaker positions, or other spatial or special effects. If the number of decoded channels is greater than the number of output channels (e.g., surround sound audio is played on a stereo device), a post-processing transform may be used to "fold down" the channels. The transform matrices for these situations and applications may be provided or signaled by the encoder.
Fig. 8 illustrates a general technique 800 for multi-channel post-processing. The decoder decodes (810) the encoded multi-channel audio data, resulting in reconstructed time-domain multi-channel audio data.
The decoder then performs (820) multi-channel post-processing on the time-domain multi-channel audio data. When the decoder outputs more channels than the encoder produced encoded channels, the post-processing involves a general transform to produce the larger number of output channels from the smaller number of encoded channels. For example, the decoder takes samples located at the same point in time, one from each reconstructed encoded channel, and fills any missing channels (i.e., channels discarded by the encoder) with zeros. The decoder multiplies these samples by a general post-processing transform matrix.
The general post-processing transform matrix may be a matrix with predetermined elements, or it may be a general matrix with elements specified by the encoder. The encoder signals the decoder to use a predetermined matrix (e.g., with one or more flag bits), or sends the elements of a general matrix to the decoder, or the decoder may be configured to always use the same general post-processing transform matrix. For additional flexibility, the multi-channel post-processing may be turned on/off on a frame-by-frame or other basis (in which case the decoder may use the identity matrix to leave the channels unchanged).
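The sample-by-matrix multiplication described above can be sketched as follows; this is an illustrative fragment (the matrix values and names are hypothetical, not from the patent), using a 3 × 3 post-processing matrix that creates a phantom center channel as the average of the decoded left and right channels, with the dropped center input filled with zero.

```python
# Illustrative multi-channel post-processing: one multi-channel sample
# vector (dropped channels filled with zeros) is multiplied by a
# post-processing transform matrix.

def postprocess(sample, matrix):
    """Multiply one multi-channel sample vector by the transform matrix."""
    return [sum(matrix[i][j] * sample[j] for j in range(len(sample)))
            for i in range(len(matrix))]

# Channel order: [left, right, center]; center was dropped by the encoder.
PHANTOM_CENTER = [
    [1.0, 0.0, 0.0],   # left passes through
    [0.0, 1.0, 0.0],   # right passes through
    [0.5, 0.5, 0.0],   # phantom center = average of left and right
]

decoded = [0.8, 0.4, 0.0]   # dropped center channel filled with zero
out = postprocess(decoded, PHANTOM_CENTER)
assert out[0] == 0.8 and out[1] == 0.4
assert abs(out[2] - 0.6) < 1e-12   # phantom center is (0.8 + 0.4) / 2
```

Replacing PHANTOM_CENTER with the identity matrix corresponds to turning the post-processing off.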
For more information on Multi-Channel pre-processing, post-processing and flexible Multi-Channel transforms, see U.S. patent application publication No. 2004-0049379 entitled "Multi-Channel Audio Encoding and Decoding".
IV.Channel extension processing for multi-channel audio
In a typical encoding scheme for encoding a multi-channel source, a time-frequency transform using a transform such as a modulated lapped transform ("MLT") or a discrete cosine transform ("DCT") is performed at the encoder, while a corresponding inverse transform is performed at the decoder. The MLT or DCT coefficients for certain channels are grouped together into a channel group and a linear transform is applied to the channels to obtain the channels to be encoded. If the left and right channels of a stereo source are correlated, they can be encoded using a sum-difference transform (also known as M/S or mid/side coding). This removes the correlation between the two channels so that fewer bits are required to encode them. However, at low bit rates, the difference channel may not be encoded (resulting in loss of the stereo image) or quality may suffer from the fact that both channels are heavily quantized.
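The sum-difference transform mentioned above can be sketched in a few lines; this illustrative fragment (not part of the patent text) shows that correlated left/right input yields a near-zero difference channel, and that the transform is invertible up to floating-point rounding.

```python
# Sum-difference (mid/side) transform for a correlated stereo pair.

def ms_forward(left, right):
    mid  = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side

def ms_inverse(mid, side):
    left  = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right

left  = [1.0, 0.5, -0.25, 2.0]
right = [0.9, 0.6, -0.20, 1.8]
mid, side = ms_forward(left, right)
# Correlated input -> small side channel (cheap to code):
assert max(abs(s) for s in side) <= 0.1
# Reconstruction recovers the original channels (up to rounding):
l2, r2 = ms_inverse(mid, side)
assert all(abs(a - b) < 1e-12 for a, b in zip(l2, left))
assert all(abs(a - b) < 1e-12 for a, b in zip(r2, right))
```

At low bit rates, the side channel here is what may be dropped or heavily quantized, which motivates the parameter-based channel extension approach that follows.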
The described techniques and tools provide a desirable alternative to existing joint coding schemes (e.g., mid/side coding, intensity stereo coding, etc.). Instead of encoding sum and difference channels for a group of channels (e.g., left/right pair, front-left/front-right pair, back-left/back-right pair, or other group), the described techniques and tools encode one or more combined channels (which may be a sum of the channels, a principal component after applying a decorrelation transform, or some other combined channel), together with additional parameters describing the cross-channel correlations and the powers of the corresponding physical channels, and allow reconstruction of the physical channels in which those cross-channel correlations and powers are maintained. In other words, the second-order statistics of the physical channels are maintained. This process may be referred to as channel extension processing.
For example, using a complex transform allows channel reconstruction that maintains the cross-channel correlation and the power of the corresponding channels. For narrowband signals, maintaining the second-order statistics is sufficient to provide a reconstruction that maintains the power and phase of the individual channels, without sending explicit correlation coefficient information or phase information.
The described techniques and tools represent uncoded channels as modified versions of coded channels. The channels to be encoded may be the actual physical channels or transformed versions of the physical channels (e.g., using a linear transform applied to each sample). For example, the described techniques and tools allow reconstruction of multiple physical channels using one encoded channel and multiple parameters. In one implementation, these parameters include the power (also referred to as intensity or energy) ratios, on a per-band basis, between the physical channels and the encoded channel. For example, to encode a signal having left (L) and right (R) stereo channels, the power ratios are L/M and R/M, where M is the power of the encoded channel (the "sum" or "mono" channel), L is the power of the left channel, and R is the power of the right channel. Although channel extension coding can be used for the entire frequency range, this is not required. For example, for lower frequencies, the encoder may encode both channels of a channel transform (e.g., using sum and difference), while for higher frequencies the encoder may encode the sum channel plus the parameters.
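The per-band power-ratio parameters can be illustrated numerically; the sketch below is a simplified assumption (band power taken as the sum of squared spectral coefficients; all names hypothetical), computing L/M and R/M for two bands of a stereo signal and its sum channel.

```python
# Illustrative per-band power-ratio parameters L/M and R/M.

def band_power(coeffs, start, end):
    """Power of the spectral coefficients in the half-open band [start, end)."""
    return sum(c * c for c in coeffs[start:end])

left  = [1.0, 2.0, 0.5, 0.5]
right = [1.0, -2.0, 0.5, 0.5]
summ  = [l + r for l, r in zip(left, right)]   # combined ("sum") channel

bands = [(0, 2), (2, 4)]
params = []
for s, e in bands:
    m = band_power(summ, s, e)
    params.append((band_power(left, s, e) / m, band_power(right, s, e) / m))

# Band 0: L and R are out of phase at one coefficient, so the sum channel
# carries less power there and the ratios exceed 1.
assert params[0] == (1.25, 1.25)
# Band 1: identical channels, so the sum has 4x the power of each.
assert params[1] == (0.25, 0.25)
```

These ratios are the side information sent to the decoder; the combined channel itself is coded with conventional techniques.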
The described embodiments can significantly reduce the bitrate required to encode a multi-channel source. The parameters for modifying the channel occupy a small fraction of the total bit rate, leaving more bit rate for encoding the combined channel. For example, for a two-channel source, if the coding parameters are to occupy 10% of the available bit rate, 90% of the bits may be used to code the combined channel. In many cases, there is a significant savings over encoding two channels even after cross-channel dependencies are taken into account.
The channels may be reconstructed at a reconstructed-channel/encoded-channel ratio other than the 2:1 ratio described above. For example, the decoder may reconstruct the left and right channels and the center channel from a single encoded channel. Other arrangements are also possible. Furthermore, the parameters may be defined in different ways. For example, the parameters may be defined on a basis other than a per-band basis.
A. Complex transformation and scale/shape parameters
In the described embodiments, the encoder forms a combined channel and provides parameters to the decoder for reconstructing the channels that were used to form the combined channel. The decoder uses a forward complex transform to derive complex coefficients (each having a real component and an imaginary component) for the combined channel. Then, to reconstruct the physical channels from the combined channel, the decoder scales the complex coefficients using the parameters provided by the encoder. For example, the decoder derives scale factors from the parameters provided by the encoder and uses them to scale the complex coefficients. The combined channel is typically a sum channel (sometimes referred to as a mono channel), but may be another combination of the physical channels. In cases where the physical channels are out of phase and summing them would cause the channels to cancel each other, the combined channel may be a difference channel (e.g., the difference between the left and right channels).
For example, the encoder sends to the decoder a sum channel for the left and right physical channels, together with a plurality of parameters, which may include one or more complex parameters. (A complex parameter is derived in some way from one or more complex numbers; however, the complex parameter sent by the encoder (e.g., comprising a ratio of an imaginary part to a real part) may not itself be a complex number.) The encoder may also send only real parameters from which the decoder can derive complex scale factors for scaling the spectral coefficients. (The encoder typically does not use a complex transform to encode the combined channel itself. Instead, the encoder may use any of a number of encoding techniques to encode the combined channel.)
Fig. 9 illustrates a simplified channel extension encoding technique 900 performed by an encoder. The encoder forms one or more combined channels (e.g., sum channels) at 910. Then, at 920, the encoder derives one or more parameters to be sent to the decoder along with the combined channel. Fig. 10 illustrates a simplified channel extension decoding technique 1000 performed by a decoder. At 1010, the decoder receives one or more parameters for one or more combined channels. Then, at 1020, the decoder scales the combined channel coefficients using the parameters. For example, the decoder derives complex scale factors from the parameters and scales the coefficients using the scale factors.
After time-frequency transformation at the encoder, the frequency spectrum of each channel is typically divided into subbands. In described embodiments, the encoder may determine different parameters for different frequency subbands, and the decoder may scale the coefficients in the bands of the combined channel for the respective bands in the reconstructed channel using one or more parameters provided by the encoder. In a coding arrangement in which the left and right channels are to be reconstructed from one coded channel, each coefficient in a subband for each of the left and right channels is represented by a scaled version of the subband in the coded channel.
For example, fig. 11 shows the scaling of coefficients in the frequency bands 1110 of the combined channels 1120 during channel reconstruction. The decoder uses the one or more parameters provided by the encoder to derive the scaled coefficients in the corresponding subbands of the left channel 1230 and the right channel 1240 that the decoder reconstructs.
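The band-wise scaling can be sketched for the simplified real-valued case; the fragment below assumes (as an illustration, not the patent's exact method, and ignoring the complex/phase handling described above) that the scale factor for a band is the square root of the transmitted power ratio, so that the reconstructed band's power matches the target.

```python
import math

# Illustrative decoder-side band reconstruction: each sub-band of a
# reconstructed channel is a scaled copy of the same sub-band of the
# combined channel.

def reconstruct_band(combined, power_ratio):
    """Scale a sub-band of the combined channel by sqrt(power ratio)."""
    scale = math.sqrt(power_ratio)
    return [scale * c for c in combined]

def band_power(band):
    return sum(c * c for c in band)

combined_band = [2.0, 0.0, -1.0]   # spectral coefficients of one sub-band
ratio_left = 0.25                  # transmitted power ratio L/M for this band

left_band = reconstruct_band(combined_band, ratio_left)
assert left_band == [1.0, 0.0, -0.5]
# The reconstructed band's power equals ratio * (combined band power):
assert abs(band_power(left_band) - ratio_left * band_power(combined_band)) < 1e-12
```

With only real scale factors the band keeps the phase of the combined channel; the complex scale factors discussed in this section additionally shape the cross-channel phase.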
In one implementation, each subband in each of the left and right channels has a scale parameter and a shape parameter. The shape parameter may be determined by the encoder and sent to the decoder, or the shape parameter may be assumed by taking the spectral coefficients at the same location as those being encoded. The encoder represents all frequencies in one channel using a scaled version of the spectrum from one or more encoded channels. A complex transform (having a real component and an imaginary component) is used so that the cross-channel second-order statistics of the channels can be maintained for each subband. Because the encoded channels are linear transforms of the actual channels, parameters need not be sent for all channels. For example, if P channels are encoded using N channels (where N < P), parameters need not be sent for all P channels. More information about the scale and shape parameters is provided in section V below.
The parameters may change over time as the power ratio between the physical channel and the combined channel changes. Thus, the parameters for the frequency bands in a frame may be determined on a frame-by-frame basis or on some other basis. In the described embodiments, the parameters for the current frequency band in the current frame are differentially encoded based on parameters from other frequency bands and/or other frames.
The decoder performs a forward complex transform to derive complex spectral coefficients for the combined channel. It then scales the spectral coefficients using parameters sent in the bit stream, such as the power ratio and the imaginary-to-real ratio or normalized correlation matrix for cross-correlation. The complex scaled output is sent to a post-processing filter. The outputs of the filters are scaled and added to reconstruct the physical channels.
Channel extension coding need not be performed for all frequency bands or for all time blocks. For example, channel extension coding may be adaptively turned on or off on a per-band, per-block, or some other basis. In this way, the encoder may choose to perform this processing when efficient or beneficial. The remaining bands or blocks may be processed by conventional channel decorrelation, without decorrelation, or using other methods.
The complex scale factors achievable in the described embodiments are limited to values within certain boundaries. For example, the described embodiments encode the parameters in the log domain, and the values are bounded by the amount of possible cross-correlation between the channels.
The channels that can be reconstructed from the combined channels using the complex transform are not limited to left and right channel pairs, nor are the combined channels limited to combinations of left and right channels. For example, the combined channel may represent two, three, or more physical channels. The channels reconstructed from the combined channel may be groups such as left back/right back, left back/left, right back/right, left/center, right/center, and left/center/right. Other groups are also possible. The reconstructed channels may all be reconstructed using a complex transform, or some channels may be reconstructed using a complex transform while others may not.
B. Parameter interpolation
The encoder may use anchor points that determine explicit parameters and interpolate the parameters between the anchor points. The amount of time between anchor points and the number of anchor points may be fixed or variable depending on the content and/or encoder side decisions. When a certain location at time t is selected, the encoder may use that location for all bands in the spectrum. Alternatively, the encoder may select anchor points at different time instants for different frequency bands.
Fig. 12 is a graphical comparison of actual power ratios and power ratios interpolated from the power ratios at anchor points. In the example shown in fig. 12, the interpolation smooths the changes in the power ratios (e.g., between anchor points 1200 and 1202, 1202 and 1204, 1204 and 1206, and 1206 and 1208), which helps to avoid artifacts due to frequently changing power ratios. The encoder may turn interpolation on or off, or not interpolate the parameters at all. For example, the encoder may choose to interpolate the parameters when the power ratios change gradually over time, turn off interpolation when the parameters do not change very much from frame to frame (e.g., between anchor points 1208 and 1210 in fig. 12), or turn off interpolation when the parameters change so rapidly that interpolation would provide an inaccurate representation of the parameters.
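The interpolation between anchor points can be sketched as simple linear interpolation; the fragment below is illustrative (the anchor times and values are made up), showing how a parameter at an intermediate time is derived from the two surrounding anchors.

```python
# Illustrative linear interpolation of a parameter (e.g., a power ratio)
# between explicit anchor points.

def interpolate(anchors, t):
    """Linearly interpolate a parameter at time t from (time, value) anchors."""
    for (t0, v0), (t1, v1) in zip(anchors, anchors[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    raise ValueError("t outside anchor range")

anchors = [(0, 1.0), (4, 0.2), (8, 0.6)]   # (time, power ratio) anchor points
assert interpolate(anchors, 0) == 1.0
assert abs(interpolate(anchors, 2) - 0.6) < 1e-12   # halfway from 1.0 to 0.2
assert abs(interpolate(anchors, 6) - 0.4) < 1e-12   # halfway from 0.2 to 0.6
```

Turning interpolation off corresponds to holding each anchor's value until the next anchor instead of blending between them.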
C. Detailed explanation
If the encoder encodes a subset N of the P channels in Y, this may be expressed as Z = BX, where Z is an N x L matrix and B is an N x P matrix formed by taking the N rows, corresponding to the N channels to be encoded, of the transform matrix used to produce Y. The decoder reconstructs W = CQ(Z), where Q(Z) denotes the quantization of Z and C is a P x N matrix. Substituting for Z gives the equation W = CQ(BX). Assuming that the quantization noise is negligible, W = CBX. C may be chosen appropriately to maintain the cross-channel second-order statistics between the vectors X and W. In equation form, this can be expressed as WW* = CBXX*B*C* = XX*, where XX* is a symmetric P x P matrix.
Since XX* is a symmetric P x P matrix, there are P(P + 1)/2 degrees of freedom in the matrix. If N > (P + 1)/2, it is possible to obtain a P x N matrix C such that the equation is satisfied. If N < (P + 1)/2, more information is needed to solve the equation. In that case, a complex transform may be used to arrive at other solutions that satisfy some portion of the constraint.
For example, if X is a complex vector and C is a complex matrix, an attempt may be made to find C such that Re(CBXX*B*C*) = Re(XX*). According to this equation, for an appropriate complex matrix C, the real part of the symmetric matrix XX* is equal to the real part of the matrix product CBXX*B*C*.
Example 1: For the case where M is 2 and N is 1, BXX*B* is simply a real scalar (1 x 1) matrix, referred to as α. The equation shown in FIG. 13 is solved. If B0 = B1 = β (some constant), the constraint in FIG. 14 holds. Solving yields the values shown in FIG. 15 for |C0|, |C1|, and |C0||C1|cos(φ0 - φ1). The encoder sends |C0| and |C1|. These can then be used to solve the constraints shown in FIG. 16. It should be clear from FIG. 15 that these quantities are essentially the power ratios L/M and R/M. The sign in the constraint shown in FIG. 16 can be used to control the sign of the phase so that it matches the imaginary part of XX*. This allows solving for φ0 - φ1, but not for the actual values of φ0 and φ1. To solve for the exact values, another assumption is made: that the angle of the mono channel is maintained for each coefficient, as expressed in FIG. 17. To maintain this angle, it is sufficient that |C0| sin φ0 + |C1| sin φ1 = 0, which gives the results for φ0 and φ1 shown in FIG. 18.
Using the constraints shown in fig. 16, the real and imaginary parts of the two scale factors can be found. For example, the real parts of the two scale factors can be found by solving for |C0| cos φ0 and |C1| cos φ1, respectively, as shown in fig. 19. The imaginary parts of the two scale factors can be found by solving for |C0| sin φ0 and |C1| sin φ1, respectively, as shown in fig. 20.
Thus, when the encoder sends the magnitudes of the complex scale factors, the decoder is able to reconstruct two individual channels that maintain the cross-channel second-order statistics of the original physical channels, and the two reconstructed channels maintain the correct phase of the encoded channel.
Example 2: in example 1, although the imaginary part of the cross-channel second order statistics is solved (as shown in fig. 20), only the real part is maintained at the decoder, which is reconstructed from only a single mono source. However, the imaginary part of the cross-channel second order statistics can also be maintained if (in addition to complex scaling) the output from the previous stage as described in example 1 is post-processed to achieve an additional spectral effect. The output is filtered through a linear filter, scaled, and added back to the output from the previous stage.
Suppose that in addition to the current signals from the previous analysis (W0 and W1 for the two channels, respectively), the decoder has effect signals available, namely processed versions of the two channels (W0F and W1F, respectively), as shown in fig. 21. The overall transformation can be represented as in FIG. 23, assuming W0F = C0Z0F and W1F = C1Z0F. It can be shown that by following the reconstruction procedure shown in fig. 22, the decoder can maintain the second-order statistics of the original signal. The decoder takes a linear combination of the original and filtered versions of W to create a signal S that maintains the second-order statistics of X.
As in example 1, by sending two parameters (e.g., the left/mono (L/M) and right/mono (R/M) power ratios), the complex constants C0 and C1 can be chosen to match the real part of the cross-channel second-order statistics. If the encoder sends another parameter, the entire cross-channel second-order statistics of the multi-channel source may be maintained.
For example, the encoder may send a complex parameter representing the imaginary-to-real ratio of the cross-correlation between the two channels to maintain the entire cross-channel second-order statistics of the two-channel source. Suppose the correlation matrix is given by RXX, as defined in FIG. 24. Note that this factorization must exist for any symmetric matrix, and for any achievable power correlation matrix the eigenvalues must also be real. Since U*UΛU*U = Λ, i.e., a diagonal matrix, the power in Z is α. Therefore, if a transform such as the following is chosen
and if W0F and W1F are assumed to have the same power as, and be uncorrelated with, W0 and W1 respectively, then the reconstruction procedure in fig. 23 or fig. 22 produces the desired correlation matrix for the final output. In practice, the encoder sends the power ratios |C0| and |C1|, together with the imaginary-to-real ratio. The decoder may reconstruct a normalized version of the cross-correlation matrix (as shown in fig. 25). The decoder then calculates θ and finds the eigenvalues and eigenvectors, arriving at the desired transform.
Because of the relationship between |C0| and |C1|, they cannot have independent values. Hence, the encoder quantizes them jointly or conditionally. This applies to both example 1 and example 2.
Other parameterizations are also possible, such as the encoder sending a normalized version of the power matrix directly to the decoder, where the matrix is normalized by the geometric mean of the powers, as shown in fig. 26. Now the encoder can send only the first row of the matrix, which is sufficient because the product of the diagonal entries is 1. However, the decoder now scales the eigenvalues as shown in fig. 27.
Another parameterization can represent U and Λ directly. It can be shown that U can be factorized into a series of Givens rotations. Each Givens rotation can be represented by an angle. The encoder sends the Givens rotation angles and the eigenvalues.
Also, both parameterizations can incorporate any additional arbitrary pre-rotation V and still produce the same correlation matrix, since VV* = I, where I is the identity matrix. That is, the relationship shown in fig. 28 works for any arbitrary rotation V. For example, the decoder may choose a pre-rotation such that the amount of filtered signal going into each channel is the same, as shown in fig. 29. The decoder may select ω such that the relationship in fig. 30 holds.
Once the matrix shown in FIG. 31 is known, the decoder can do the reconstruction as before to obtain the channels W0 and W1. The decoder then obtains W0F and W1F (the effect signals) by applying a linear filter to W0 and W1. For example, the decoder uses an all-pass filter, and the output at any of the taps of the filter can be taken to obtain the effect signals. (For more information on uses of all-pass filters, see M.R. Schroeder and B.F. Logan, "'Colorless' Artificial Reverberation," 12th Ann. Meeting of the Audio Eng'g Soc., p. 18 (1960).) The strength of the signal that is added as a post-process is given in the matrix shown in fig. 31.
The all-pass filter may be represented as a cascade of other all-pass filters. Depending on the amount of reverberation required to accurately model the source, the output of any all-pass filter can be taken. The parameters may also be sent on any frequency band, subframe, or source basis. For example, the output of the first, second or third stage in the cascade of all-pass filters may be taken.
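The effect-signal generation via a cascade of all-pass filters might look like the following sketch. The delays, gains, stage count, and function names are illustrative assumptions in the style of Schroeder all-pass sections, not values from the patent:

```python
import numpy as np

def schroeder_allpass(x, delay, gain):
    """One Schroeder all-pass section:
       y[n] = -g*x[n] + x[n-D] + g*y[n-D]."""
    y = np.zeros(len(x), dtype=float)
    for n in range(len(x)):
        xd = x[n - delay] if n >= delay else 0.0
        yd = y[n - delay] if n >= delay else 0.0
        y[n] = -gain * x[n] + xd + gain * yd
    return y

def effect_signal(x, stages=((137, 0.6), (251, 0.6), (379, 0.6)), take_stage=2):
    """Run a cascade of all-pass sections and take the output at one
    stage; which stage is taken controls how much reverberation the
    effect signal carries."""
    y = np.asarray(x, dtype=float)
    outputs = []
    for delay, gain in stages:
        y = schroeder_allpass(y, delay, gain)
        outputs.append(y)
    return outputs[take_stage]
```

Taking `take_stage=0`, `1`, or `2` corresponds to tapping the output of the first, second, or third stage of the cascade.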
By taking the output of the filter, scaling it, and adding it back to the original reconstruction, the decoder is able to maintain the cross-channel second-order statistics. Although this analysis makes certain assumptions about the power and correlation structure of the effect signal, these assumptions are not always satisfied in practice. These assumptions can be refined using further processing and better approximation. For example, if the filtered signal has more energy than is needed, the filtered signal can be scaled as shown in fig. 32 so that it has the correct power. This ensures that the power is correctly maintained even when the filtered signal's power is too large. The calculation used to determine whether the power exceeds the threshold is shown in fig. 33.
It is sometimes possible that the signals in the two physical channels being combined are out of phase, in which case the matrix will be singular if sum coding is used. In these cases, the maximum scaling of the matrix may be limited. This parameter (a threshold) limiting the maximum scaling of the matrix may also be sent in the bitstream on a frequency band, sub-frame, or source basis.
As in example 1, the analysis in this example assumes B0 = B1 = β. However, the same algebra can be used for any transform to obtain similar results.
V. Channel extension coding using other coding transforms
The channel extension coding techniques and tools described in section IV above can be used in combination with other techniques and tools. For example, an encoder can use a base coding transform, a frequency extension coding transform (e.g., an extended-band perceptual similarity coding transform), and a channel extension coding transform. (Frequency extension coding is described in section V.A, below.) In the encoder, these transforms can be performed in a base coding module, a frequency extension coding module separate from the base coding module, and a channel extension coding module separate from the base coding module and the frequency extension coding module. Alternatively, different transforms can be performed in various combinations within the same module.
A. Overview of frequency extension coding
This section is an overview of frequency extension coding techniques and tools used in some encoders and decoders to encode higher-frequency spectral data as a function of baseband data in the spectrum (sometimes referred to as extended-band perceptual similarity frequency coding, or generalized perceptual similarity coding).
Encoding spectral coefficients for transmission to a decoder in an output bit stream may consume a relatively large portion of the available bit rate. Thus, at low bit rates, the encoder may choose to encode a reduced number of coefficients by encoding the baseband within the bandwidth of the spectral coefficients and representing the coefficients outside the baseband as scaled and shaped versions of the baseband coefficients.
Fig. 34 shows a generalized module 3400 that can be used in an encoder. The illustrated module 3400 receives a set of spectral coefficients 3415. The baseband within the bandwidth of the spectral coefficients 3415 is typically at the low end of the spectrum; the spectral coefficients outside the baseband are referred to as "extended band" spectral coefficients. The division of the baseband and the extension band is performed in the baseband/extension band division section 3420. Sub-band partitioning (e.g., partitioning of the extension band into sub-bands) may also be performed in this section.
To avoid distortion in the reconstructed audio (e.g., muffled or low-pass sound), the extended-band spectral coefficients are represented as shaped noise, shaped forms of other frequency components, or a combination of both. The extended band spectral coefficients may be partitioned into multiple sub-bands (e.g., having 64 or 128 coefficients), which may be disjoint or overlapping. This extension band coding provides a perceptual effect similar to the original even though the actual spectrum may be slightly different.
The baseband/extension band dividing section 3420 outputs baseband spectral coefficients 3425, extension band spectral coefficients, and side information (which may be compressed) describing, for example, the baseband width and the individual size and number of extension band sub-bands.
In the example shown in fig. 34, the encoder encodes the coefficients and side information in an encoding module 3430 (3435). The encoder may comprise separate entropy encoders for the baseband and extended band spectral coefficients and/or use different entropy encoding techniques to encode the different classes of coefficients. The corresponding decoder typically uses complementary decoding techniques. (to illustrate another possible implementation, FIG. 36 shows separate decoding modules for the baseband and extended band coefficients.)
An extension band encoder may encode a sub-band using two parameters. One parameter, called the scaling parameter, is used to represent the total energy within the band. Another parameter, called the shape parameter, is used to represent the shape of the spectrum within the frequency band.
Fig. 35 illustrates an example technique 3500 for encoding each sub-band of an extension band in an extension band encoder. The extended band encoder calculates the scale parameter at 3510 and the shape parameter at 3520. Each sub-band encoded by the extension band encoder may be represented as a product of a scale parameter and a shape parameter.
For example, the scaling parameter may be the root mean square value of the coefficients within the current subband. This is found by taking the square root of the mean square of all coefficients. The mean square value is found by taking the sum of the square values of all coefficients in the subband and dividing by the number of coefficients.
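That root-mean-square computation can be written directly. This is a minimal sketch with a hypothetical function name:

```python
import math

def scale_parameter(subband_coeffs):
    """Scale parameter as the RMS of the sub-band coefficients:
    the square root of (sum of squared coefficients / coefficient count)."""
    mean_square = sum(c * c for c in subband_coeffs) / len(subband_coeffs)
    return math.sqrt(mean_square)
```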
The shape parameter may be a displacement vector specifying a normalized form of a portion of the spectrum that has been encoded (e.g., a portion of the baseband spectral coefficients encoded with a baseband encoder), a normalized random noise vector, or a vector for the spectral shape from a fixed codebook. A displacement vector specifying another part of the spectrum is useful in audio because there are usually harmonic components in the tonal signal that repeat throughout the spectrum. The use of noise or some other fixed codebook may facilitate low bit-rate coding of components that cannot be well represented in the baseband encoded portion of the spectrum.
Some encoders allow for modification of the vector to better represent the spectral data. Some possible modifications include linear or non-linear transformation of the vector, or representing the vector as a combination of two or more other original or modified vectors. In the case of vector combining, the modification may involve taking one or more parts of one vector and combining it with one or more parts of the other vector. When vector modification is used, bits are sent to inform the decoder how to form a new vector. The modification consumes fewer bits to represent the spectral data than the actual waveform encoding, despite the additional bits.
The extension band encoder need not encode a separate scale factor for each sub-band of the extension band. Instead, the extended band encoder may represent the scaling parameters for the sub-bands as a function of frequency, such as by encoding a set of coefficients that produce a polynomial function of the scaling parameters of the extended sub-bands as a function of their frequency. Furthermore, the extension band encoder may encode further values characterizing the shape of the extension sub-bands. For example, the extension band encoder may encode a value specifying a displacement or stretch of the portion of the baseband indicated by the motion vector. In this case, the shape parameter is encoded as a set of values (e.g., specifying a position, displacement, and/or stretch) to better represent the shape of the extended subband relative to a vector from the encoded baseband, a fixed codebook, or a random noise vector.
The scale and shape parameters that encode each sub-band of the extension band can both be vectors. For example, the extended sub-band can be represented as a vector product Scale(f)·Shape(f), corresponding in the time domain to a filter with frequency response Scale(f) applied to an excitation with frequency response Shape(f). The coding can be in the form of a linear predictive coding (LPC) filter and an excitation. The LPC filter is a low-order representation of the scale and shape of the extended sub-band, while the excitation represents the pitch and/or noise characteristics of the extended sub-band. The excitation can come from analyzing the baseband-coded portion of the spectrum and identifying a portion of the baseband-coded spectrum, a fixed codebook spectrum, or random noise that matches the excitation being coded. This represents the extended sub-band as a portion of the baseband-coded spectrum, but the matching is done in the time domain.
Referring again to fig. 35, at 3530 the extension band encoder searches the baseband spectral coefficients for a band whose shape is similar to that of the current sub-band of the extension band (e.g., using a least-mean-square comparison of normalized forms against each portion of the baseband). At 3532, the extension band encoder checks whether this similar band of baseband spectral coefficients is close enough in shape to the current extension sub-band (e.g., whether the least-mean-square value is below a preselected threshold). If so, at 3534 the extension band encoder determines a vector pointing to this similar band of baseband spectral coefficients. The vector may be the position of the starting coefficient in the baseband. Other methods, such as checking tonality versus non-tonality, can also be used to see whether a band of baseband spectral coefficients is sufficiently close in shape to the current extension band.
If a sufficiently similar portion of the baseband is not found, the extended band encoder then looks up a fixed codebook of spectral shapes (3540) to represent the current sub-band. If found (3542), the extended band encoder uses its index in the codebook as a shape parameter at 3544. Otherwise, at 3550, the extended band encoder represents the shape of the current sub-band as a normalized random noise vector.
Alternatively, the extension band encoder may decide how the spectral coefficients may be represented with some other decision process.
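The decision process of fig. 35 (baseband search, then fixed codebook, then noise) might be sketched as follows. The normalized mean-square comparison, the threshold value, and all names here are illustrative assumptions:

```python
import numpy as np

def normalize(v):
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def choose_shape(extended_band, baseband, codebook, threshold=0.1, rng=None):
    """Pick a shape representation for one extension sub-band:
    1) search the baseband for a similarly shaped segment,
    2) otherwise search a fixed codebook of spectral shapes,
    3) otherwise fall back to a normalized random noise vector."""
    target = normalize(np.asarray(extended_band, dtype=float))
    size = len(target)

    # 1) least-mean-square search over normalized baseband segments
    best_pos, best_err = None, np.inf
    for pos in range(len(baseband) - size + 1):
        cand = normalize(np.asarray(baseband[pos:pos + size], dtype=float))
        err = np.mean((target - cand) ** 2)
        if err < best_err:
            best_pos, best_err = pos, err
    if best_err < threshold:
        return ('baseband_vector', best_pos)   # starting coefficient position

    # 2) fixed codebook of spectral shapes
    for index, shape in enumerate(codebook):
        err = np.mean((target - normalize(np.asarray(shape, dtype=float))) ** 2)
        if err < threshold:
            return ('codebook', index)

    # 3) normalized random noise vector
    if rng is None:
        rng = np.random.default_rng(0)
    return ('noise', normalize(rng.standard_normal(size)))
```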
The extension band encoder can compress the scale and shape parameters (e.g., using predictive coding, quantization, and/or entropy coding). For example, the scale parameter can be predictively coded based on the preceding extended sub-band. For multi-channel audio, the scale parameters for a sub-band can be predicted from the preceding sub-band in the channel. Scale parameters can also be predicted across channels, from more than one other sub-band, from the baseband spectrum, or from previous audio input blocks, among other variations. The prediction choice can be made by looking at which preceding band (e.g., within the same extension band, channel, or tile (input block)) provides the higher correlation. The extension band encoder can quantize the scale parameters using uniform or non-uniform quantization, and the resulting quantized values can be entropy coded. The extension band encoder can likewise use predictive coding (e.g., prediction from the preceding sub-band), quantization, and entropy coding for the shape parameters.
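One way the scale-parameter prediction and quantization could be realized is sketched below. The differential predictor, the step size, and the function names are illustrative assumptions, not the patent's actual scheme:

```python
def encode_scale_parameters(scales, step=0.5):
    """Predict each scale parameter from the previous sub-band's
    reconstructed value, then uniformly quantize the residual;
    entropy coding of the indices would follow."""
    indices, prev = [], 0.0
    for s in scales:
        residual = s - prev                # predict from preceding sub-band
        q = int(round(residual / step))    # uniform quantization
        indices.append(q)
        prev = prev + q * step             # track decoder-side reconstruction
    return indices

def decode_scale_parameters(indices, step=0.5):
    """Invert the prediction loop at the decoder."""
    out, prev = [], 0.0
    for q in indices:
        prev = prev + q * step
        out.append(prev)
    return out
```

Because the encoder predicts from the *reconstructed* previous value, quantization error does not accumulate across sub-bands.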
If the sub-band size is variable for a given implementation, this provides an opportunity to size the sub-bands to improve coding efficiency. Often, sub-bands with similar characteristics can be merged with little effect on quality. Sub-bands with highly variable data can be better represented if they are split. However, smaller sub-bands require more sub-bands (and typically more bits) than larger sub-bands to represent the same spectral data. To balance these interests, the encoder can make sub-band decisions based on quality measurements and bit-rate information.
The decoder demultiplexes the bitstream with baseband/extension band partitioning and decodes the frequency bands using corresponding decoding techniques (e.g., in a baseband decoder and an extension band decoder). The decoder may also perform additional functions.
Fig. 36 shows aspects of an audio decoder 3600 for decoding a bitstream produced by an encoder that uses frequency extension coding and separate coding modules for baseband data and extension band data. In fig. 36, baseband data and extension band data in the encoded bitstream 3605 are decoded in a baseband decoder 3640 and an extension band decoder 3650, respectively. The baseband decoder 3640 decodes the baseband spectral coefficients using conventional decoding of the baseband codec. The extension band decoder 3650 decodes the extension band data, including by copying over the portions of the baseband spectral coefficients pointed to by the shape parameters' motion vectors and scaling them by the scaling factors of the scale parameters. The baseband and extended-band spectral coefficients are combined into a single spectrum, which is transformed by the inverse transform 3680 to reconstruct the audio signal.
Section IV describes techniques for representing all frequencies in an uncoded channel using a scaled version of the spectrum from one or more coded channels. Frequency extension coding differs in that the extension band coefficients are represented using scaled versions of the baseband coefficients. However, these techniques may be used together, such as by performing frequency extension coding on the combined channel and other ways described below.
B. Examples of channel extension coding using other coding transforms
FIG. 37 is a diagram showing aspects of an example encoder 3700 that processes multi-channel source audio 3705 using a time-to-frequency (T/F) base transform 3710, a T/F frequency extension transform 3720, and a T/F channel extension transform 3730. (Other encoders may use different combinations of transforms or other transforms than the ones shown.)
The T/F transform may be different for each of the three transforms.
For the base transform, after a multi-channel transform 3712, the encoding 3715 comprises encoding of the spectral coefficients. If channel extension coding is also used, spectral coefficients need not be encoded for at least some frequency ranges of at least some of the multi-channel-transform-coded channels. If frequency extension coding is also used, spectral coefficients need not be encoded for at least some frequency ranges. For the frequency extension transform, the encoding 3715 comprises encoding of the scale and shape parameters for frequency bands within sub-frames. If channel extension coding is also used, these parameters need not be sent for some frequency ranges of some channels. For the channel extension transform, the encoding 3715 comprises encoding of parameters (e.g., power ratios and a complex parameter) to accurately maintain cross-channel correlation for frequency bands within sub-frames. For simplicity, the encoding is shown as being performed in a single encoding module 3715. However, different encoding tasks can be performed in different encoding modules.
Figs. 38, 39, and 40 are diagrams illustrating aspects of decoders 3800, 3900, and 4000 that decode a bitstream such as bitstream 3795 produced by the example encoder 3700. In the decoders 3800, 3900, and 4000, some modules that are present in some decoders (e.g., entropy decoding, inverse quantization/weighting, additional post-processing) are not shown for simplicity.
In decoder 3800, the basic spectral coefficients are processed with basic inverse multichannel transform 3810, basic inverse T/F transform 3820, forward T/F frequency extension transform 3830, frequency extension process 3840, frequency extension inverse T/F transform 3850, forward T/F channel extension transform 3860, channel extension process 3870, and inverse channel extension T/F transform 3880 to produce reconstructed audio 3895.
However, for practical purposes, this decoder may be undesirably complicated. Also, the channel extension transform is a complex transform, while the other two are not. Therefore, other decoders can be adjusted in the following way: the T/F transform used for frequency extension coding can be limited to (1) the base T/F transform, or (2) the real portion of the channel extension T/F transform.
This allows for configurations such as those shown in fig. 39 and 40.
In fig. 39, a decoder 3900 processes the base spectral coefficients with frequency extension processing 3910, inverse multi-channel transform 3920, inverse base T/F transform 3930, forward channel extension transform 3940, channel extension processing 3950, and inverse channel extension T/F transform 3960 to produce reconstructed audio 3995.
In fig. 40, a decoder 4000 processes the basic spectral coefficients with an inverse multi-channel transform 4010, an inverse basic T/F transform 4020, the real portion of a forward channel extension transform 4030, frequency extension processing 4040, a derivation of the imaginary portion of the forward channel extension transform 4050, channel extension processing 4060, and an inverse channel extension T/F transform 4070 to produce reconstructed audio 4095.
Any of these configurations can be used, and a decoder can even dynamically change which configuration is used. In one implementation, the transform used for base and frequency extension coding is the MLT (which is the real portion of the MCLT (modulated complex lapped transform)), and the transform used for the channel extension transform is the MCLT. However, the two have different sub-frame sizes.
Each MCLT coefficient in a sub-frame has a basis function that spans the sub-frame. Since each subframe overlaps only two adjacent subframes, only the MLT coefficients from the current, previous, and next subframes are needed to find the exact MCLT coefficient for a given subframe.
The transforms may use transform blocks of the same size, or the transform blocks may be of different sizes for different kinds of transforms. Different sized transform blocks in the base and frequency extension transforms may be desirable, such as when the frequency extension transforms can improve quality by working on blocks of smaller time windows. However, changing the transform size at the base encoding, frequency extension encoding and channel encoding introduces significant complexity in the encoder and decoder. Thus, it may be desirable to share transform sizes among at least some transform types.
As one example, if the base and frequency extension coded transforms share the same transform block size, the channel extension coded transform may have a transform block size that is independent of the base/frequency extension coded transform block size. In this example, the decoder may include frequency reconstruction followed by an inverse base coding transform. The decoder then performs a forward complex transform to derive spectral coefficients for scaling the encoded combined channel. The complex channel coding transform uses its own transform block size, independent of the other two transforms. The decoder reconstructs the physical channel in the frequency domain from the encoded combined channel (e.g., sum channel) using the derived spectral coefficients and performs an inverse complex transform to obtain time-domain samples from the reconstructed physical channel.
As another example, if the base and frequency extension transforms have different transform block sizes, the channel coding transform may have the same transform block size as the frequency extension transform block size. In this example, the decoder may include an inverse base coding transform followed by frequency reconstruction. The decoder performs the inverse channel transform using the same transform block size as used for frequency reconstruction. The decoder then performs a forward transform of the complex components to derive the spectral coefficients.
In the forward transform, the decoder may calculate the imaginary part of the MCLT coefficient of the channel expansion transform coefficient from the real part. For example, the decoder may calculate the imaginary part in the current block by looking at the real parts of some frequency bands (e.g., three frequency bands or more) from the previous block, some frequency bands (e.g., two frequency bands) from the current block, and some frequency bands (e.g., three frequency bands or more) from the next block.
The mapping of the real portion to the imaginary portion involves taking the dot product between the modulated inverse DCT basis and the forward modulated discrete sine transform (DST) basis vectors. Computing the imaginary portion of a given sub-frame involves finding all the DST coefficients within the sub-frame. The dot product is non-zero only for the DCT basis vectors from the previous, current, and next sub-frames. Furthermore, only the DCT basis vectors of approximately similar frequency to the DST coefficient being sought have significant energy. If the sub-frame sizes of the previous, current, and next sub-frames are all the same, the energy drops off significantly for frequencies different from the frequency of the DST coefficient being sought. Thus, a low-complexity solution can be found for computing the DST coefficients of a given sub-frame from the DCT coefficients.
Specifically, Xs can be calculated as Xs = A·Xc(-1) + B·Xc(0) + C·Xc(1), where Xc(-1), Xc(0), and Xc(1) represent the DCT coefficients from the previous, current, and next blocks, and Xs represents the DST coefficients of the current block. The solution is to:
1) Precompute the A, B, and C matrices for the different window shapes/sizes.
2) Threshold the A, B, and C matrices so that values much smaller than the peak values are reduced to 0, reducing them to sparse matrices.
3) Compute the matrix multiplications using only the non-zero matrix elements.
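The three steps above can be sketched as follows, assuming the A, B, and C mapping matrices have already been precomputed elsewhere for the window shapes in use. The threshold fraction and the names are illustrative assumptions:

```python
import numpy as np

def sparsify(m, frac=0.05):
    """Step 2: zero out entries much smaller than the peak magnitude,
    reducing the near-diagonal mapping matrix to a sparse one."""
    m = np.asarray(m, dtype=float)
    return np.where(np.abs(m) >= frac * np.abs(m).max(), m, 0.0)

def sparse_matvec(m, x):
    """Step 3: multiply using only the non-zero matrix elements."""
    y = np.zeros(m.shape[0])
    rows, cols = np.nonzero(m)
    for r, c in zip(rows, cols):
        y[r] += m[r, c] * x[c]
    return y

def dst_from_dct(a, b, c, xc_prev, xc_cur, xc_next):
    """Xs = A*Xc(-1) + B*Xc(0) + C*Xc(1), with thresholded matrices."""
    return (sparse_matvec(sparsify(a), xc_prev)
            + sparse_matvec(sparsify(b), xc_cur)
            + sparse_matvec(sparsify(c), xc_next))
```

In a real implementation the sparsified matrices would be stored once per window shape/size rather than thresholded on every call.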
In any application where a complex filter bank is required, this is a fast way of deriving the imaginary portion from the real portion (or the real portion from the imaginary portion) without computing it directly by brute force.
The decoder reconstructs the physical channel in the frequency domain from the encoded combined channel (e.g., sum channel) using the derived scale factors and performs an inverse complex transform to obtain time-domain samples from the reconstructed physical channel.
This approach results in a significant reduction in complexity compared to brute force approaches involving inverse DCT and forward DST.
C. Reduction of computational complexity in frequency/channel coding
Frequency/channel coding can be done with base coding transforms, frequency coding transforms, and channel coding transforms. Switching transforms from one to another on a block or frame basis can improve perceptual quality, but it is computationally expensive. In certain circumstances (e.g., low-processing-power devices), such high complexity may not be acceptable. One solution to reduce the complexity is to force the encoder to always select the base coding transform for both frequency and channel coding. However, this approach puts a limitation on quality even for playback devices that have no power constraints. Another solution is to let the encoder perform without transform constraints and have the decoder map the frequency/channel coding parameters to the base coding transform domain if low complexity is required. If the mapping is done properly, the second solution can achieve good quality for high-power devices and, with reasonable complexity, good quality for low-power devices. The mapping of the parameters from the other domains to the base transform domain can be done with no extra information from the bitstream, or with additional information put into the bitstream by the encoder to improve the mapping performance.
D. Improved energy tracking of frequency coding at transitions of different window sizes
As indicated in section V.B., the frequency encoder may use a base coding transform, a frequency extension coding transform (e.g., an extended-band perceptual similarity coding transform), and a channel extension coding transform. However, the starting point of frequency coding may require additional attention when the frequency coding switches between two different transforms. This is because the signal in one of the transforms, such as the base transform, is usually band-pass, with a clear pass-band defined by the last encoded coefficient. However, this clear boundary may become blurred when mapped to a different transform. In one implementation, the frequency encoder ensures that no signal energy is lost by carefully defining the starting point. Specifically:
1) For each frequency band, the frequency encoder calculates the energy (E1) of the previously compressed signal (compressed by base coding, etc.).
2) For each frequency band, the frequency encoder calculates the energy (E2) of the original signal.
3) If (E2-E1) > T, where T is a predefined threshold, the frequency encoder marks this band as the starting point.
4) The frequency encoder begins operation at this starting point, and
5) The frequency encoder sends the starting point to the decoder.
In this way, when switching between different transforms, the frequency encoder detects the energy difference and sends the starting point accordingly.
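A minimal Python sketch of the five steps above follows. The array layout, band-edge representation, and names are illustrative assumptions; the patent does not specify them.

```python
import numpy as np

def find_start_points(original, compressed, band_edges, threshold):
    """Flag each band whose original energy (E2) exceeds its
    previously compressed energy (E1) by more than the threshold T.
    Returns the indices of bands marked as frequency-coding start points."""
    starts = []
    for band, (lo, hi) in enumerate(band_edges):
        e1 = float(np.sum(np.square(compressed[lo:hi])))  # energy after base coding
        e2 = float(np.sum(np.square(original[lo:hi])))    # energy of original signal
        if e2 - e1 > threshold:
            starts.append(band)
    return starts
```

In this sketch the returned band indices are what the encoder would signal to the decoder as starting points.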
VI. Shape and scale parameters for frequency extension coding
A. Displacement vector for an encoder using modulated DCT coding
As mentioned in section V above, extended band perceptual similarity frequency coding involves determining shape parameters and scale parameters for frequency bands within a time window. The shape parameter specifies a part of the baseband (typically the lower frequency band) that will be used as a basis for encoding the coefficients in the extension band (typically the higher frequency band than the baseband). For example, coefficients in a specified portion of the baseband may be scaled and then applied to the extension band.
The signal of a channel at time t can be modulated using a displacement vector d, as shown in FIG. 41. FIG. 41 shows a representation of the displacement vectors for two audio blocks 4100 and 4110 at times t0 and t1, respectively. Although the example shown in FIG. 41 relates to frequency extension coding, the principles can be applied to other modulation schemes that do not involve frequency extension coding.
In the example shown in FIG. 41, the audio blocks 4100 and 4110 each include N subbands in the range 0 to N-1, where the subbands in each block are divided into a lower-frequency baseband and a higher-frequency extension band. For audio block 4100, the displacement vector d0 is shown extending from subband m0 to subband n0. Similarly, for audio block 4110, the displacement vector d1 is shown extending from subband m1 to subband n1.
Since the displacement vector is intended to accurately describe the shape of the extension band coefficients, it might be assumed that maximum flexibility in the displacement vector would be desirable. However, limiting the value of the displacement vector in some cases results in improved perceptual quality. For example, the encoder may select subbands m and n such that they are always both even or both odd, so that the number of subbands covered by the displacement vector d is always even. In an encoder using a modulated discrete cosine transform (DCT), better reconstruction is obtained when the number of subbands covered by the displacement vector d is even.
When performing extended band perceptual similarity frequency coding using a modulated DCT, a cosine wave from a baseband is modulated to generate a modulated cosine wave for an extended band. If the number of subbands covered by the displacement vector d is even, the modulation leads to an accurate reconstruction. However, if the number of subbands covered by the displacement vector d is odd, the modulation causes distortion in the reconstructed audio. Thus, by limiting the displacement vector to cover only an even number of subbands (and sacrificing some flexibility in d), better overall sound quality can be achieved by avoiding distortion in the modulated signal. Thus, in the example shown in fig. 41, the displacement vectors in the audio blocks 4100 and 4110 each cover an even number of subbands.
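The parity constraint can be sketched as a small encoder-side helper. This is a hypothetical illustration, not code from the patent; the encoder's actual search may instead restrict m and n to share parity from the start.

```python
def even_span_displacement(m, n):
    """Return a target subband n' such that the displacement vector from
    m to n' covers an even number of subbands, as preferred for clean
    modulated-DCT reconstruction.  When the candidate span is odd,
    the vector is shrunk by one subband."""
    if (n - m) % 2 != 0:
        n -= 1
    return n
```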
B. Anchor points for scale parameters
When frequency coding uses a smaller window than the basic encoder, the bit rate tends to increase. This is because, despite the smaller window, it is important to keep the frequency resolution at a fairly high level to avoid undesirable artifacts.
Fig. 42 shows a simplified arrangement of audio blocks of different sizes. The time window 4210 has a longer duration than the time windows 4212-4222, but each time window has the same number of frequency bands.
The tick marks in FIG. 42 indicate anchor points for each frequency band. As shown in FIG. 42, the number of anchor points may vary between frequency bands, and the time distance between anchor points may also vary. (Not all windows, bands, or anchor points are shown in FIG. 42, for simplicity.) At these anchor points, the scale parameters are determined. The scale parameters for the same frequency band in other time windows may then be interpolated from the parameters at the anchor points.
Alternatively, the anchor point may be determined in other ways.
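As a sketch of the interpolation described above, the following helper linearly interpolates a band's scale parameter between anchor points. The names and the hold-at-edges behavior are assumptions; the codec's exact interpolation rule is not given in this text.

```python
def interpolate_scales(anchors, times):
    """Linearly interpolate scale parameters for one frequency band.

    `anchors` is a list of (time, scale) pairs at the anchor points, in
    increasing time order; `times` are the window positions needing a
    scale value.  Outside the anchor range the nearest anchor value is held."""
    out = []
    for t in times:
        if t <= anchors[0][0]:
            out.append(anchors[0][1])       # before first anchor: hold
            continue
        if t >= anchors[-1][0]:
            out.append(anchors[-1][1])      # after last anchor: hold
            continue
        for (t0, s0), (t1, s1) in zip(anchors, anchors[1:]):
            if t0 <= t <= t1:
                frac = (t - t0) / (t1 - t0)
                out.append(s0 + frac * (s1 - s0))
                break
    return out
```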
Having described and illustrated the principles of the invention with reference to described embodiments, it will be recognized that the described embodiments can be modified in arrangement and detail without departing from such principles. It should be understood that, unless otherwise indicated, the programs, processes, or methods described herein are not related or limited to any particular type of computing environment. Various types of general purpose or special purpose computing environments may be used or operations may be performed in accordance with the teachings described herein. Elements shown in software in the described embodiments may be implemented in hardware and vice versa.
In view of the many possible embodiments to which the principles of our invention may be applied, we claim as our invention all such embodiments as fall within the scope and spirit of the appended claims and their equivalents.

Claims (15)

1. A computer-implemented method in an audio decoder for decoding encoded multi-channel audio data, the method comprising:
receiving encoded audio data comprising channel extension encoded data and frequency extension encoded data;
wherein the channel extension encoding data comprises:
a combined audio channel representing a plurality of audio channels;
at least one power ratio representing the power of a single audio channel relative to either the combined audio channel or another single audio channel; and
cross-correlation information; and
reconstructing the plurality of audio channels using the channel extension encoded data and the frequency extension encoded data.
2. The computer-implemented method of claim 1, wherein frequency extension encoded data represents a baseband of a band as baseband spectral coefficients and represents an extension band of the band as a scaled and shaped form of the baseband coefficients, the frequency extension encoded data comprising:
a scale parameter representing the total energy within said frequency band; and
a shape parameter representing a shape of the spectrum within the frequency band.
3. The computer-implemented method of claim 2, wherein reconstructing the plurality of audio channels using the channel extension encoded data and the frequency extension encoded data comprises:
deriving complex spectral coefficients of the combined audio channel within the channel extension encoded data using a complex forward channel extension transform;
scaling the derived complex spectral coefficients using the at least one power ratio and the cross-correlation information;
decoding the baseband spectral coefficients within the frequency spread encoded data; and
deriving the extension band using the decoded baseband spectral coefficients and the scale and shape parameters, including by copying portions of the baseband spectral coefficients pointed to by the motion vector of the shape parameter and scaling by a scale factor of the scale parameter.
4. A computer-implemented method in an audio decoder for decoding encoded multi-channel audio data, the method comprising:
receiving encoded audio data comprising channel extension encoded data and frequency extension encoded data, the channel extension encoded data comprising at least one combined audio channel representing a plurality of audio channels; and
reconstructing the plurality of audio channels using the channel extension encoded data and the frequency extension encoded data, wherein the reconstructing comprises:
performing an inverse multi-channel transform on the received encoded audio data when the channel extension encoded data includes more than one combined audio channel;
performing an inverse basic time-frequency transform on the encoded audio data;
performing a real part of a complex forward channel extension transform on the encoded audio data followed by a frequency extension process, wherein the frequency extension encoded data is used to reconstruct a total frequency spectrum of the at least one combined channel;
deriving an imaginary part of the complex forward channel extension transform after the frequency extension processing;
performing a channel expansion process on the total frequency spectrum of the at least one combined channel, wherein the plurality of audio channels represented by the at least one combined audio channel are reconstructed using the channel expansion encoded data; and
performing an inverse channel extension time-frequency transform on the reconstructed plurality of audio channels.
5. The computer-implemented method of claim 4, wherein the channel extension encoded data further comprises at least one power ratio representing a power of a single audio channel relative to either the combined audio channel or another single audio channel; and cross-correlation information.
6. The computer-implemented method of claim 1 or 5, wherein the channel extension encoding data includes a plurality of power ratios representing power of the plurality of audio channels relative to the combined audio channel.
7. The computer-implemented method of claim 1 or 5, wherein the cross-correlation information comprises complex parameters representing an imaginary-to-real ratio of cross-correlation between at least two of the plurality of audio channels.
8. The computer-implemented method of claim 4, wherein the encoded audio data comprises a plurality of combined audio channels, and wherein the reconstructing is performed for each of the combined audio channels.
9. The computer-implemented method of claim 4, wherein the frequency extension encoded data represents a baseband of the total spectrum of a band as baseband spectral coefficients and represents an extended band of the total spectrum of the band as a scaled and shaped version of the baseband coefficients, and wherein the frequency extension encoded data includes a scale parameter representing total energy within the band and a shape parameter representing a shape of the spectrum within the band.
10. The computer-implemented method of claim 2 or 9, wherein the scale parameter and the shape parameter are ignored for one or more frequency ranges in one or more of the plurality of audio channels.
11. The computer-implemented method of claim 1 or 4, wherein the combined channel is a sum channel.
12. The computer-implemented method of claim 1 or 4, wherein the combined channel is a difference channel.
13. The computer-implemented method of claim 3 or 4, wherein the forward channel extension transform is a modulated complex lapped transform.
14. The computer-implemented method of claim 1 or 4, wherein the reconstructing comprises using a frequency extension transform that is a non-complex transform.
15. A method performed by an audio decoder for decoding encoded multi-channel audio data, the method comprising:
receiving encoded audio data comprising channel extension encoded data and frequency extension encoded data;
wherein the channel extension encoding data comprises:
a combined audio channel representing a plurality of audio channels;
at least one power ratio representing the power of a single audio channel relative to either the combined audio channel or another single audio channel; and
cross-correlation information; and
reconstructing the plurality of audio channels using the channel extension encoded data and the frequency extension encoded data.