
HK1066966B - Method for audio channel translation - Google Patents

Method for audio channel translation

Info

Publication number
HK1066966B
HK1066966B
Authority
HK
Hong Kong
Prior art keywords
channels
channel
signal
output
input
Prior art date
Application number
HK04109904.2A
Other languages
Chinese (zh)
Other versions
HK1066966A1 (en)
Inventor
Mark Franklin Davis
Original Assignee
Dolby Laboratories Licensing Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dolby Laboratories Licensing Corporation
Priority claimed from PCT/US2002/003619 (WO2002063925A2)
Publication of HK1066966A1
Publication of HK1066966B

Description

Method for audio channel translation
Technical Field
The present invention relates to audio signal processing. In particular, the invention relates to translating M input channels representing a soundfield into N output channels representing the same soundfield, wherein each channel is a single audio stream representing audio arriving from one direction, M and N are positive integers, and M is at least 2.
Background
Although humans have only two ears, we can hear real three-dimensional sound by relying on multiple localization cues, such as head-related transfer functions (HRTFs) and head movement. Fully realistic sound reproduction requires preserving and reproducing the full three-dimensional sound field, or at least its perceptible cues. Unfortunately, current sound recording techniques are suited to capturing neither three-dimensional sound fields, nor two-dimensional planes of sound, nor even one-dimensional lines of sound; they are suited only to capturing, storing, and presenting zero-dimensional discrete channels.
Since Edison's invention of sound recording, efforts to improve fidelity have largely focused on overcoming the shortcomings of his original analog, groove-modulated cylinder/disc media: limited and uneven frequency response, noise, distortion, speed variation, wear, dirt, and copying losses. Despite sporadic individual improvements, including electronic amplification, tape recording, noise reduction, and players costing more than some automobiles, these traditional single-channel quality problems were not definitively solved until the development of digital recording in general, and the introduction of the Compact Disc (CD) in particular. Since then, aside from efforts to extend digital quality further to 24-bit/96 kHz sampling, the major thrusts of sound reproduction research have been reducing the amount of data needed to maintain per-channel quality (mostly using perceptual coders) and improving spatial fidelity. The latter problem is the subject of this document.
Efforts to improve spatial fidelity have proceeded along two lines: attempts to convey the perceptual cues of the entire sound field, and attempts to convey an approximation of the actual original sound field. Examples of the former approach include binaural recordings and two-speaker virtual surround systems. These suffer from a number of unfortunate drawbacks, notably unreliable localization of sound in certain directions and the need for headphones or a single fixed listening position.
Whether in a living room or a commercial venue such as a movie theater, the only feasible way to reproduce a sound field for multiple listeners is to approximate the actual original sound field. Given the discrete-channel nature of sound recording, this is not surprising: most efforts to date have consisted of conservatively increasing the number of reproduced channels. Representative systems include the panned-mono three-loudspeaker motion picture film soundtracks of the early 1950s, conventional stereo, four-channel stereo in the 1960s, five-channel discrete magnetic soundtracks on 70 mm motion picture film, matrixed Dolby Surround in the 1970s, AC-3 5.1-channel surround in the 1990s, and, recently, Surround-EX 6.1-channel surround. "Dolby", "Pro Logic" and "Surround EX" are trademarks of Dolby Laboratories Licensing Corporation. To varying degrees, these systems provide improved spatial reproduction relative to mono. However, mixing a large number of channels imposes additional time and expense on the content producer, and the resulting experience is typically one of several discrete channels rather than a continuous sound field. Dolby Pro Logic decoding is described in U.S. Patent 4,799,260, which is incorporated herein by reference in its entirety. AC-3 is described in detail in Document A/52, "Digital Audio Compression Standard (AC-3)", published by the Advanced Television Systems Committee (ATSC) on December 20, 1995 (available on the World Wide Web at www.atsc.org/Standards/A52/a-52.doc). See also the errata sheet of July 22, 1999 (available at www.dolby.com/tech/ATSC err.pdf).
Brief description of the invention
The basis for reconstructing an arbitrary sound distribution within a source-free volume of a wave medium is provided by Gauss's theorem, which states that the wave field within a region is completely determined by the pressure distribution along the region's boundary. This means that, in principle, the sound field of a concert hall could be reconstructed within a living-room-sized region as follows: wall off a living-room-sized region within the concert hall, then make the wall acoustically transparent by covering its outside with an infinite number of infinitesimal microphones, each of whose signals is suitably amplified and connected to a corresponding loudspeaker at the matching position on the inside of the living room wall. Inserting a suitable recording medium between the microphones and the loudspeakers yields an arbitrarily accurate, if impractical, three-dimensional sound reproduction system. The remaining engineering effort is to make such a system practical.
A first step toward practicality comes from noting that the signals of interest are band-limited, with an upper limit of about 20 kHz, and applying the spatial sampling theorem, a variant of the more familiar time-domain sampling theorem. The latter states that no information is lost if a continuous band-limited time-domain waveform is discretely sampled at a rate at least twice its highest frequency. The spatial sampling theorem applies the same consideration to space: spatial samples must be spaced no farther apart than half the shortest wavelength to avoid loss of information. Since a 20 kHz wavelength corresponds to about 3/8 inch in air, this means that an accurate three-dimensional sound system could be realized with arrays of microphones and speakers spaced no more than 3/16 inch apart. Covering all the surfaces of a typical 9-foot by 12-foot room yields about 2.5 million channels: a marked improvement over infinity, but still impractical. It does, however, establish the basic method of using an array of discrete channels as spatial samples, from which a sound field can be reproduced with appropriate interpolation.
Once the sound field is so characterized, it is in principle possible for a decoder to produce the optimal signal to feed to each output speaker. The channels fed to such a decoder are referred to variously in this document as "base", "transmitted", and "input" channels, and any output channel whose position does not correspond to that of a base channel is referred to as an "intermediate" channel. An output channel may also coincide in position with a base input channel.
It is desirable to reduce the number of discrete spatial-sample, or base, channels. This can be done by exploiting the fact that above about 1500 Hz the ear no longer follows individual waveform cycles, only the critical-band envelope. This permits a channel spacing corresponding to 1500 Hz, approximately 3 inches, reducing the total for a 9-foot by 12-foot room to roughly 6000 channels: about 2.49 million fewer than the previous configuration.
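The arithmetic behind these two channel-count estimates can be sketched as follows. The 8-foot ceiling height and the choice to cover all six room surfaces are assumptions not stated in the text, so the results land near, rather than exactly on, the quoted figures:

```python
def channels_needed(spacing_inches, room_ft=(9.0, 12.0, 8.0)):
    """Spatial samples covering every surface of a box-shaped room,
    one sample per spacing_inches x spacing_inches square.
    The 8 ft ceiling height is an assumed value."""
    length, width, height = room_ft
    area_ft2 = 2 * (length * width + length * height + width * height)
    samples_per_ft2 = (12.0 / spacing_inches) ** 2
    return area_ft2 * samples_per_ft2

# Half of the ~3/8 in wavelength at 20 kHz -> on the order of millions
full_bandwidth = channels_needed(3.0 / 16.0)
# ~3 in spacing allowed by the 1500 Hz critical-band limit -> thousands
envelope_only = channels_needed(3.0)
```

With these assumptions the full-bandwidth count comes out near 2.3 million and the envelope-only count near 9000, the same orders of magnitude as the figures quoted above.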
The number of spatially sampled channels can in any case be further reduced by appealing to psychoacoustic localization limits. For frontal sounds, the horizontal resolution limit is about 1 degree of arc, and the corresponding vertical limit is about 5 degrees. Spreading this density appropriately over a sphere still yields hundreds to thousands of channels.
Disclosure of Invention
According to the invention, a process translates M input channels representing a sound field into N output channels representing the same sound field, wherein each channel is a single audio stream representing sound arriving from one direction, M and N are positive integers, and M is at least 2. One or more sets of output channels are generated, each set having one or more output channels. Each set is associated with two or more spatially adjacent input channels, and each output channel of a set is generated by a process that includes determining a measure of the cross-correlation of the two or more input channels and a measure of their relative levels.
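As a minimal sketch of the two measures named in this process, the following computes a normalized cross-correlation and a relative-level figure for one pair of adjacent input channels over a block of samples. The function name and the [0, 1] level normalization are illustrative choices, not the patent's:

```python
import math

def pair_measures(ch_a, ch_b):
    """Return (cross_correlation, relative_level) for one block of samples
    from two spatially adjacent input channels."""
    energy_a = sum(x * x for x in ch_a)
    energy_b = sum(x * x for x in ch_b)
    cross = sum(x * y for x, y in zip(ch_a, ch_b))
    # Normalized cross-correlation: +1 = identical up to gain, 0 = unrelated.
    corr = cross / math.sqrt(energy_a * energy_b) if energy_a and energy_b else 0.0
    # Relative level: 0 = all energy in ch_a, 1 = all energy in ch_b.
    amp_a, amp_b = math.sqrt(energy_a), math.sqrt(energy_b)
    level = amp_b / (amp_a + amp_b) if (amp_a + amp_b) else 0.5
    return corr, level
```

A scaled copy of a signal yields a correlation of 1 with an off-center level value, matching the panned-source assumption discussed later.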
In one aspect of the invention, a plurality of sets of output channels are associated with more than two input channels, and the process determines the correlation of the input channels associated with each set of output channels in a hierarchical order, such that the sets are ranked by the number of input channels associated with each set, the largest number of input channels corresponding to the highest order, and the process operates on each set in turn according to its hierarchical order. Further in accordance with this aspect of the invention, the processing of lower-order sets takes into account the results of processing higher-order sets.
The playback or decoding aspect of the invention assumes that each of the M input channels representing sound arriving from one direction is produced by passive-matrix nearest-neighbor amplitude-panned encoding of each source direction (i.e., each source direction is assumed to map principally to the nearest-neighbor base channel or channels), without requiring additional side-chain information (the use of a side chain or side information is optional), so that it is compatible with existing mixing techniques, consoles, and formats. Although such source signals can be generated by explicitly using a passive encoding matrix, most commonly used recording techniques inherently produce them (and therefore constitute an "effective encoding matrix"). The playback or decoding aspects of the invention are also largely compatible with naturally recorded source signals, such as those captured with five actual directional microphones, since, allowing for some possible time delay, sounds arriving from an intermediate direction tend to map principally to the nearest microphone (in a horizontal array, to at most the nearest pair of microphones).
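A nearest-neighbor amplitude-panning encoder of the kind assumed here can be sketched for a horizontal ring of base channels. The five example azimuths and the constant-power sin/cos pan law are illustrative assumptions, not taken from the patent:

```python
import math

def pan_to_ring(azimuth_deg, ring_deg):
    """Pan a unit mono source onto a horizontal ring of base channels.
    Only the two channels flanking the source direction receive signal
    (nearest-neighbor panning); gains satisfy cos^2 + sin^2 = 1.
    ring_deg must be sorted ascending within [0, 360)."""
    n = len(ring_deg)
    gains = [0.0] * n
    azimuth = azimuth_deg % 360.0
    for i in range(n):
        lo, hi = ring_deg[i], ring_deg[(i + 1) % n]
        span = (hi - lo) % 360.0 or 360.0
        offset = (azimuth - lo) % 360.0
        if offset <= span:
            fraction = offset / span
            gains[i] = math.cos(fraction * math.pi / 2.0)
            gains[(i + 1) % n] = math.sin(fraction * math.pi / 2.0)
            break
    return gains

# Example: a source midway between the first two of five base channels.
gains = pan_to_ring(36.0, [0.0, 72.0, 144.0, 216.0, 288.0])
```

A source exactly at a channel azimuth maps entirely to that channel; a source midway between two channels splits equally between them, with all other channels silent.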
A decoder or decoding process according to the invention can be implemented as a lattice of interconnected processing modules or module functions (hereinafter "decoding modules"), each of which generates one or more output channels (or control signals usable to generate one or more output channels) from the two or more spatially nearest-neighbor base channels associated with that decoding module. The output channels represent the relative proportions of the audio signals in the nearest-neighbor base channels associated with a particular decoding module. As explained in more detail below, the decoding modules are loosely coupled to one another in that modules share nodes and there is a hierarchy of decoding modules: modules are ranked by the number of base channels associated with them (the module or modules with the most associated base channels having the highest order). A supervisor function manages the modules so that common node signals are shared equitably and higher-order decoding modules can affect the outputs of lower-order modules.
Each decoding module may effectively comprise a matrix, such that it directly produces output signals; or each module may produce control signals which, together with the control signals produced by other modules, are used to vary the coefficients of a variable matrix, or to vary scale factors applied to the inputs or outputs of a fixed matrix, in order to produce all of the output signals.
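The second arrangement, control signals scaling the outputs of a fixed matrix, can be sketched as follows. The particular 2-input/3-output matrix (left, derived center, right) and its coefficients are a made-up illustration, not the patent's:

```python
# Fixed passive matrix: rows are output channels, columns are base channels.
FIXED_MATRIX = [
    [1.0, 0.0],        # left output   <- left base channel
    [0.7071, 0.7071],  # center output <- equal-power mix of both
    [0.0, 1.0],        # right output  <- right base channel
]

def apply_scaled_matrix(base_samples, output_scales):
    """base_samples: one sample per base channel;
    output_scales: one control-signal scale factor per output channel."""
    return [scale * sum(coef * x for coef, x in zip(row, base_samples))
            for row, scale in zip(FIXED_MATRIX, output_scales)]
```

With all scale factors at 1 this reduces to the passive matrix; lowering one output's scale factor suppresses just that output, which is how module-derived control signals can steer the decode without touching the audio path nonlinearly.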
The decoding modules mimic the workings of the human ear in an effort to provide a perceptually transparent rendition. Each module may be implemented as a wideband or a multiband structure or function; in the latter case, either with a continuous filter bank or with a block structure, e.g. using a transform-based processor that performs essentially the same processing in each band.
Although the underlying invention relates generally to the spatial translation of M input channels to N output channels, where M and N are positive integers and M is at least 2, a further object of the invention is to reduce the number of loudspeakers receiving the N output channels to a practical value by judiciously relying on virtual imaging, i.e., forming perceived sound images at spatial positions where no loudspeaker is located. The most common application of virtual imaging is in stereo reproduction, panning a phantom image along the trajectory between two loudspeakers by amplitude-shifting a mono signal between the channels. Virtual imaging has not been considered a viable technique for presentation to a group of listeners with a small number of channels, since it requires the listener to be equidistant, or nearly so, from the two speakers. In a movie theater, for example, the front left and right speakers are too far apart to provide most listeners with a usable illusion of a center image, which is why a physical center speaker is used to carry the center channel that is so important for many sources.
However, as the density of speakers increases, a point is reached at which, for most listeners, a virtual image can appear between any pair of adjacent speakers, at least over some range of smooth movement; with enough loudspeakers, the gaps between them are no longer perceptible. Such an array has the potential to be virtually indistinguishable from the multimillion-channel array contemplated earlier.
To test the effect of the invention, we assembled a horizontal array of 5 loudspeakers per wall which, counting the shared corner loudspeakers, totals 16; plus a ring of 6 loudspeakers placed at a vertical angle of about 45 degrees above the listener; plus a single loudspeaker directly overhead, for 23; plus a subwoofer (LFE channel), for a total of 24 channels, all fed from one PC (personal computer) for 24-channel sound reproduction. Although in current parlance this might be called a 23.1-channel system, for simplicity it is referred to herein as a 24-channel system.
Fig. 1 is a top view schematically illustrating an idealized decoding arrangement consistent with the test setup just described. The 5 horizontal wide-range base channels are shown as squares 1', 3', 5', 9' and 13' on the outer circle. A vertical channel is shown as dashed square 23' in the center; this channel may be derived from the 5 wide-range base channels by means of decorrelated or generated reverberation, or provided separately. The 23 wide-range output channels are shown as filled circles numbered 1-23. The outer circle carries 16 output channels in the horizontal plane, and the inner circle carries 6 output channels 45 degrees above the horizontal plane. Output channel 23 is directly above the listener(s). Five two-input decoding modules are shown by arrows 24-28 on the outer circle, each connecting a pair of horizontal base channels. Five additional two-input vertical decoding modules, shown by arrows 29-33, connect the vertical channel to each of the horizontal base channels. The raised center rear channel, output channel 21, is derived by a three-input decoding module, illustrated by the arrows between output channel 21 and base channels 9', 13' and 23'. Each module is associated with a respective pair or trio of spatially nearest-neighbor base channels. Although the decoding modules shown in Fig. 1 have 3, 4, or 5 output channels, a decoding module may have any reasonable number of output channels. An output channel may be located between two or more base channels or at the same location as a base channel; thus, in the example of Fig. 1, there is also an output channel at each base channel position. Each base channel is shared by two or three decoding modules.
As will be discussed, a design goal of the invention is that the decoder should in principle be able to operate with any number and arrangement of speakers; the 24-channel array is used as an illustrative example, not as the only density and arrangement capable of achieving a convincing, continuous perceived sound field in accordance with the invention.
The requirement to support a large, user-selectable number of playback channels raises the question of how many discrete channels, and/or what other information, must be conveyed to the playback processor so that it can derive, as at least one option, the 24 channels described above. One possible method is simply to transmit 24 discrete channels, but this is not preferred: mixing that many individual channels is burdensome for the content producer, carrying them is burdensome for the transmission medium, and the 24-channel arrangement is only one of many possibilities, so it must be possible to generate more or fewer playback channels from one common array of transmitted signals.
One way to reproduce the output channels would be formal spatial interpolation, producing a fixed weighted sum of the transmitted channels for each output, provided the density of transmitted channels were great enough to permit it. However, that would require thousands to millions of transmitted channels, just as time-domain interpolation of a single signal may require a FIR filter with hundreds of taps. Reducing the transmitted channels to a practical number requires applying psychoacoustic principles and performing more aggressive, dynamic interpolation from a suitably small number of channels. This still leaves the question unanswered: how many channels are required to produce the sensation of a continuous sound field?
That question was answered by an experiment performed by the present inventor some years ago and recently repeated by others. The basis of at least the earlier experiment was the observation that conventional two-channel binaural recordings reproduce faithful left/right image distributions but give unstable front/back localization, partly because of imperfections in the HRTFs used and the absence of head-motion cues. To circumvent this shortcoming, a dual-binaural (4-channel) recording was made using two pairs of directional microphones spaced by about the size of a human head, one pair facing forward and the other backward. The resulting recordings were played over 4 speakers placed close to the head to mitigate acoustic cross-coupling. This arrangement preserves true left/right timing and amplitude localization cues from each pair of loudspeakers, and the distinct positions of the microphones and loudspeakers supply clear front/back information. The result was a very convincing surround reproduction, albeit lacking a proper representation of height information. More recent experiments by others added a center front channel and two height channels, giving the same sense of realism, possibly improved by the added height information.
Thus, from both psychoacoustic considerations and experimental evidence, it appears that the relevant perceptual information can be conveyed by roughly 4 to 5 "binaural-like" horizontal channels, plus one or more vertical channels. However, the cross-feed character of binaural channel pairs makes them unsuitable for direct playback over an array of speakers, since there is very little isolation in the middle and low frequency ranges. Rather than introducing cross-feed at the encoder (as a binaural pair does) only to have to cancel it at the decoder, it is simpler and more direct to keep the channels isolated from one another and to derive each output channel signal from the nearest transmitted channels. This also permits direct playback through an equal number of loudspeakers without a decoder, as well as optional passive-matrix downmixing to fewer channels if required, and it corresponds substantially to the existing standard 5.1-channel arrangement, at least in the horizontal plane. It is also broadly compatible with natural recordings, such as those made with five actual directional microphones, since, allowing for some possible time delay, sounds arriving from an intermediate direction tend to map principally to the nearest microphone (in a horizontal array, to at most the nearest pair of microphones).
From a perceptual standpoint, therefore, it should be possible for a channel translation decoder to accept a standard 5.1-channel program and produce convincing sound reproduction through any number of horizontally arrayed loudspeakers, including the 16 horizontal loudspeakers of the 24-channel array described above. With the addition of a vertical channel, as has sometimes been proposed for digital cinema systems, the entire 24-channel array can be fed with separately derived, perceptually valid signals that together produce a sound field perceived as continuous at most listening positions. Of course, if finely divided source channels are available at encoding time, additional information about them could be used to vary the encoding matrix scale factors to pre-compensate for decoder limitations, or could simply be included as additional side-chain (auxiliary) information, perhaps similar to the coupling coordinates used in AC-3 (Dolby Digital) multichannel coding; but such added information should not be perceptually necessary, and indeed no such information is required. The channel translation decoder is not limited to operation with 5.1-channel sources and may use fewer or more channels, but there is at least reason to believe that reliable performance is obtained from 5.1-channel sources.
A problem that remains unaddressed is how to extract the intermediate output channels from the sparse array of transmitted channels. The solution proposed by one aspect of the invention is to reuse the notion of virtual imaging, with a slight twist. It was noted earlier that virtual imaging is unsuitable for group presentation over a sparse speaker array because it requires the listener to be approximately equidistant from the speakers. It is, however, capable of giving a properly seated listener the impression of a phantom center channel for signals amplitude-panned between the nearest actual output channels. One aspect of the invention therefore proposes that the channel translation decoder comprise an array of modular interpolating signal processors, each effectively mimicking an optimally seated listener and each operating in a manner imitating the human auditory system, to extract from the amplitude-panned signals those components that would form a virtual image and feed them to an actual loudspeaker; the loudspeakers are preferably spaced closely enough that natural virtual imaging fills the remaining gaps between them.
Typically, each decoding module derives its inputs from the nearest transmitted base channels, of which there may be 3 or more, for example for an overhead loudspeaker array. One way to generate output channels related to more than two base channels would be a series of pairwise operations, with the outputs of some pairwise decoding modules feeding the inputs of others. This has two drawbacks. First, cascaded decoding modules introduce multiple cascaded time constants, causing some output channels to react faster than others and producing sound-position artifacts. Second, pairwise correlation can only interpolate intermediate or derived output channels along the straight line between a pair of channels; using three or more base channels goes beyond that limit. Therefore, a generalized extension of pairwise correlation to three or more signals has been developed, a technique described below.
Horizontal localization in human hearing is based chiefly on two cues: interaural amplitude difference and interaural time difference. The latter is valid only for signal pairs roughly aligned in time, within about 600 microseconds. The net effect is that a phantom intermediate image will appear at a position corresponding to a particular left/right amplitude difference only if the signal components common to the two real channels are correlated or nearly so. (Note: two signals may have a cross-correlation value between +1 and -1. Fully correlated signals (correlation value 1) have the same waveform and time alignment but may differ in amplitude, corresponding to off-center image positions.) As the correlation of a signal pair falls below 1, the perceived image spreads, until for two uncorrelated signals there is no intermediate image, only separate and distinct left and right images. Negative correlation is usually processed by the ear much like an uncorrelated signal pair, though the two images may spread over a still wider range. Correlation is evaluated on a critical-band basis, and above about 1500 Hz the critical-band signal envelope is used instead of the signal itself, presumably to save the human equivalent of computational effort (MIPS).
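The mapping from cross-correlation to perceived image described above can be sketched as a classifier. The numeric thresholds are illustrative assumptions, since the text describes a continuum rather than hard boundaries:

```python
import math

def cross_correlation(ch_a, ch_b):
    """Normalized cross-correlation in [-1, +1]. A gain difference alone
    does not reduce it, matching the definition in the text."""
    num = sum(x * y for x, y in zip(ch_a, ch_b))
    den = math.sqrt(sum(x * x for x in ch_a) * sum(y * y for y in ch_b))
    return num / den if den else 0.0

def perceived_image(corr):
    """Qualitative perceived result for a pair of real channels."""
    if corr > 0.99:
        return "fused phantom image at the amplitude-panned position"
    if corr > 0.0:
        return "image spreads as correlation falls"
    if corr == 0.0:
        return "no intermediate image: separate left and right images"
    return "treated like uncorrelated, images may spread even wider"
```

Note that a scaled copy of a signal (same waveform, different amplitude) still correlates at +1, corresponding to a fully fused but off-center image.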
Vertical localization is somewhat more complicated, relying on HRTF cues and on dynamic modulation of the horizontal cues with head motion, but the net effect is analogous to horizontal localization with respect to amplitude panning, cross-correlation, and the corresponding perceived image position and fusion. The vertical spatial resolution of hearing is, however, poorer than the horizontal, so a less dense array of base channels suffices for proper interpolation performance.
The advantage of using directional processors that mimic the workings of the human ear is that any imperfections or limitations of the signal processing should be perceptually masked by the corresponding imperfections and limitations of the ear, raising the possibility that the system is perceived as barely distinguishable from an original, fully continuous presentation.
While the invention is designed to remain effective when more or fewer output channels are available (including playback without decoding through as many loudspeakers as there are input channels, and passive downmixing to fewer channels, including mono, stereo, and Lt/Rt-compatible surround), the goal is to support an arbitrarily large, yet practical, number of playback channels/loudspeakers from a similar or smaller number of encoded channels, including existing 5.1-channel surround sources and possibly next-generation 11- or 12-channel digital cinema sources.
Implementing the invention entails realizing four principles: error containment, dominance preservation, constant power, and synchronized smoothing.
The concept of error containment is that, given the likelihood of decoding errors, the decoded position of each source should remain reasonably close to its true intended direction. This argues for a degree of conservatism in the decoding strategy. More aggressive decoding options exist, but they carry the possibility of greater spatial error when they fail, and it is generally preferable to accept somewhat less precise decoding in exchange for guaranteed spatial containment of errors. Even where more precise decoding could be applied with confidence, it may be inadvisable if dynamic signal conditions would force the decoder to jump between aggressive and conservative behavior, creating audible position artifacts.
Dominance preservation is a more constrained variant of error containment, requiring that a single, well-defined dominant signal be steered by the decoder only to the output channels nearest to it. This condition is necessary to preserve the image fusion of the dominant signal, and it benefits the perceived discreteness of the matrix decoder. When a signal is dominant, it is suppressed in the other output channels, either by subtracting it from the associated base signals or by directly using, for the other output channels, matrix coefficients complementary to those used to generate the dominant signal ("anti-dominant" coefficients/signals).
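For a two-input module, the anti-dominant combination mentioned above can be sketched directly: if a dominant source was panned into the two base channels with gains (cos θ, sin θ), then the complementary coefficients (sin θ, −cos θ) null it exactly. The cos/sin pan law and the variable names are illustrative assumptions:

```python
import math

def anti_dominant_signal(base_a, base_b, theta):
    """Linear combination of two base channels that cancels a signal
    panned with gains (cos(theta), sin(theta))."""
    return [math.sin(theta) * a - math.cos(theta) * b
            for a, b in zip(base_a, base_b)]

# A dominant source panned at angle theta into both base channels...
theta = 0.6
source = [1.0, -0.5, 2.0]
base_a = [math.cos(theta) * s for s in source]
base_b = [math.sin(theta) * s for s in source]
residual = anti_dominant_signal(base_a, base_b, theta)  # dominant nulled
```

Whatever non-dominant content remains in the base channels survives this combination, which is what allows the other output channels to carry it without the dominant signal leaking in.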
Constant-power decoding requires not only that the total decoded output power equal the input power, but that this equality hold for each channel and for each directional signal encoded into the transmitted base array. This minimizes artifacts caused by gain variation.
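A minimal sketch of constant-power decoding as a normalization step: whatever raw gains a module derives, rescale them so the summed output power equals the input power. Treating power as the sum of squared gains is the usual convention; the function is illustrative, not the patent's exact procedure:

```python
import math

def normalize_power(raw_gains, input_power=1.0):
    """Rescale output gains so total output power equals input_power."""
    power = sum(g * g for g in raw_gains)
    if power == 0.0:
        return list(raw_gains)
    k = math.sqrt(input_power / power)
    return [g * k for g in raw_gains]

balanced = normalize_power([1.0, 1.0, 0.5])
```

After normalization the relative proportions of the gains are unchanged, so the decoded direction is unaffected; only the loudness artifact is removed.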
Synchronized smoothing means applying signal-dependent smoothing time constants to the system, and requires that if any smoothing network in a decoding module switches to its fast time-constant mode, all other smoothing networks in that module switch as well. This prevents a newly dominant directional signal from appearing to fade slowly away from the previous dominant direction.
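Synchronized smoothing can be sketched with a bank of one-pole smoothers that share a single slow/fast decision. The coefficients and the jump-detection threshold are assumed values for illustration:

```python
SLOW, FAST = 0.02, 0.5  # per-step smoothing coefficients (assumed values)

class ModuleSmoothers:
    """All smoothed control values in one decoding module switch to the
    fast time constant together, as the text requires."""

    def __init__(self, n_controls, jump_threshold=0.3):
        self.state = [0.0] * n_controls
        self.jump_threshold = jump_threshold

    def step(self, targets):
        # If ANY control must move fast, ALL controls move fast.
        go_fast = any(abs(t - s) > self.jump_threshold
                      for t, s in zip(targets, self.state))
        coeff = FAST if go_fast else SLOW
        self.state = [s + coeff * (t - s) for s, t in zip(self.state, targets)]
        return list(self.state)
```

When one control detects a sudden dominance change, every control in the module converges at the fast rate, so the old direction collapses as quickly as the new one rises.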
Drawings
Fig. 1 is a schematic diagram showing a top view of an idealized decoder arrangement.
Detailed Description
Decoding module
Since the encoding of any source direction is assumed to map principally onto the nearest-neighbor channels, channel translation decoding is based on an array of semi-autonomous decoding modules that reproduce the output channels in the usual sense, in particular the intermediate output channels, each output channel usually being derived from a subset of all the transmitted channels in a manner similar to that of the human ear.
In a manner similar to the human ear, each decoding module operates on a combination of the amplitude ratio, used to determine the nominal current dominant direction, and the cross-correlation, used to determine the relative width of the image.
The processor generates the output channel signals using control signals derived from the amplitude ratios and cross-correlations. Since this is preferably done via linear relationships to avoid distortion, the decoder forms a weighted sum of the base channels containing the signal of interest. (It may also be desirable to include non-adjacent base channels in the weighted sum, as explained below.) This limited but dynamic interpolation approach is more commonly referred to as matrixing. If, in the source, the desired signal was amplitude-panned onto the nearest neighbors among the M base channels, this is an M:N matrix decoding problem: the output channels represent relative proportions of the input channels.
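This weighted-sum view of matrixing can be sketched as below. The 2-in/3-out coefficient values are hypothetical, chosen only to illustrate the mechanism; the invention does not prescribe them.

```python
import numpy as np

def matrix_decode(base, coeffs):
    """Form each output channel as a weighted sum of base channels.

    base   : (M, T) array of M transmitted (base) channel signals.
    coeffs : (N, M) matrix of decode coefficients (one row per output).
    Returns an (N, T) array of N output channel signals.
    """
    return coeffs @ base

# Hypothetical 2 -> 3 example: L, R bases; the centre output is an equal mix.
base = np.array([[1.0, 0.5],          # L channel over two samples
                 [0.0, 0.5]])         # R channel over two samples
coeffs = np.array([[1.0, 0.0],        # left output   <- L only
                   [0.707, 0.707],    # centre output <- equal-power mix
                   [0.0, 1.0]])       # right output  <- R only
out = matrix_decode(base, coeffs)
```

In an active decoder the rows of `coeffs` would be varied dynamically by the control signals; the matrixing operation itself stays this simple.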
The two-input decoding module in particular closely resembles an active 2:N matrix decoder, such as a current-model Dolby Pro Logic matrix decoder, with the pair of decoding-module inputs corresponding to the Lt/Rt encoded signals.
Note: the outputs of a 2:N matrix decoder are sometimes referred to as the base channels. Here, however, it is the input channels of the channel-translation decoder that are referred to as "base".
However, there is at least one meaningful difference between the operation of a prior-art autonomous 2:N decoder and the decoding module of the present invention. The former, in addition to indicating left/right position by the left/right amplitude ratio (which is also the assumption of the channel-translation decoder), indicates front/back position by inter-channel phase, specifically the sum/difference ratio of the Lt/Rt-encoded base channels.
This autonomous 2:N decoder architecture has two problems. One is that, for example, a perfectly correlated (front) but off-center signal yields a finite sum/difference ratio, incorrectly indicating a position that is not fully front (and similarly for a perfectly anti-correlated off-center back signal). The result is a somewhat distorted decoding space. The second shortcoming is that the position mapping is many-to-one, introducing inherent decoding errors. For example, in a 4:2:4 matrix system, a pair of uncorrelated left-input and right-input signals with no front or back content maps to the same uncorrelated Lt/Rt pair as an uncorrelated front-input/back-input pair with no left/right content, or as four uncorrelated inputs. Faced with an uncorrelated Lt/Rt pair, the decoder has no choice but to "relax the matrix", i.e. distribute the sound to all output channels using a passive matrix. It cannot decode simultaneously to a signal array with only left/right outputs and to one with only front/back outputs.
The underlying problem is that an N:2:N matrix system uses inter-channel phase to encode front/back position, unlike the human ear, which does not use phase to discriminate front from back. The invention preferably works with at least three non-collinear base channels, so that front/back position is indicated by the ensemble of base channel directions rather than by relative phase or polarity. A pair of uncorrelated or anti-correlated channel-translation base signals is then unambiguously decoded into separate base output channel signals, with no intermediate signals and no "back" direction implied. (This also avoids the unfortunate "center-gathering" effect of autonomous 2:N decoders, in which uncorrelated left and right input signals are reproduced with reduced separation because the decoder feeds their sum and difference to the center and surround channels.) It is of course possible in principle to spatially extend an Lt/Rt signal by cascading a 2:N decoder (N = 4 or 5) with an N:M channel-translation system; in that case, however, any limitations of the 2:N decoder, such as center-gathering, are carried through to the multiplied channel outputs. Alternatively, the functions can be combined into a channel-translation decoder designed to accept a 2-channel Lt/Rt signal, in which case its behavior is modified to interpret anti-correlated signals as having a rear orientation, with the rest of the processing unchanged. Even then, however, decoding ambiguity remains because only two channels are transmitted.
Thus each decoding module, and in particular a decoding module with two input channels, is similar to an existing active 2:N decoder with front/back detection disabled or altered, and with an arbitrary number of output channels. It is, of course, mathematically impossible to uniquely regenerate a larger number of channels from a smaller number with a matrix, since this amounts to solving M linear equations in N unknowns with N greater than M. The decoding module may therefore sometimes exhibit poor channel separation in the presence of multiple autonomous source-direction signals. The human auditory system, however, is similarly constrained by its two ears and suffers the same limitations, so with all channels active such conditions are likely to be perceptually benign. The quality of separation with the other channels muted remains a consideration, in order to serve a listener sitting close to one speaker.
The human ear certainly operates in a frequency-dependent manner, but most images are correlated across frequency, and based on successful empirical experience with the Pro Logic decoder as a wideband system, a wideband channel-translation system can be expected to perform satisfactorily in some applications. A multiband channel-translation decoder is also possible, applying similar processing band by band; the number and bandwidths of the frequency bands are left as free parameters for the decoder implementer. Although multiband processing requires more MIPS than wideband processing, the computational requirements need not be high if the input signal is divided into blocks and the processing is implemented block by block.
Before describing algorithms usable by the decoding module of the present invention, the question of shared nodes is considered first.
Shared node
If the sets of base channels used by the decoding modules were all disjoint, the decoding modules could be independent, autonomous entities. This is not usually the case, however: a given transmitted channel will typically be shared by two or more neighboring output signals, along with separate signal content. If independent decoding modules were used to decode the array, each would be affected by the output signals of neighboring modules, producing possibly severe errors. Functionally, the output signals of two adjacent decoding modules "pull" toward each other, because the common base node carries both signals, raising its level. If, as is commonly the case, the signals are dynamic, the interaction varies enough to cause unpleasant signal-dependent positioning errors. This problem does not arise in Pro Logic and other active 2:N decoders, because they have only a single, isolated channel pair as decoder input.
It is therefore necessary to compensate for this "shared node" effect. One conceivable approach is to subtract a regenerated signal from a common node before attempting to regenerate the output signals of the neighboring decoding modules that share the node. Since this is not generally possible, the following approach is used instead: each decoding module estimates the common output-signal energy present on its input channels, and a supervisor informs each module of its neighbors' common-energy estimates.
Pairwise computation of common energies
For example, assume that the base channel pair A/B contains a common signal X and separate, uncorrelated signals Y and Z:

A = 0.707X + Y

B = 0.707X + Z

where the scale factor 0.707 ≈ 1/√2 provides power-preserving panning onto the nearest base channels.

Since X and Y are uncorrelated, the average cross product ⟨XY⟩ = 0, so

⟨A²⟩ = 0.5⟨X²⟩ + ⟨Y²⟩

that is, because X and Y are uncorrelated, the total energy in base channel A is the sum of the energies of its X and Y components. Similarly:

⟨B²⟩ = 0.5⟨X²⟩ + ⟨Z²⟩

Since X, Y, and Z are mutually uncorrelated, the average cross product of A and B is:

⟨AB⟩ = 0.5⟨X²⟩

Thus, when an output signal is shared by two adjacent base channels that may also contain independent, uncorrelated signals, the average cross product of the base signals equals the energy of the common signal component in each channel. If the common signal is not shared equally, i.e. it is panned toward one of the base channels, the average cross product is the geometric mean of the common-component energies in A and B, and the per-channel common-energy estimates can be recovered by normalizing with the square root of the channel amplitude ratio. The running average is computed with a leaky integrator whose decay time constant is chosen to track ongoing activity. The time-constant smoothing can be elaborated with nonlinear attack/decay options and, in a multiband system, can be scaled with frequency.
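The relation between the average cross product and the common-signal energy can be checked numerically with synthetic uncorrelated noise. This is a sketch: a real decoder would use a running leaky-integrator average rather than a full-signal mean.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200_000
X, Y, Z = rng.standard_normal((3, T))   # uncorrelated, unit-variance signals

A = 0.707 * X + Y                       # common signal X panned into both
B = 0.707 * X + Z                       # base channels at ~-3 dB each

# The XY, XZ and YZ cross terms average to ~0, so mean(A*B) ~= 0.5*mean(X^2),
# which is the energy of the common component in each channel.
common = float(np.mean(A * B))
```

With unit-variance X the estimate converges to 0.5, matching 0.707² times the common-signal energy.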
Higher order common energy calculation
To find the common energy of a decoding module with three or more inputs, an average cross product of all the input signals must be formed. Simply processing the inputs pairwise cannot distinguish separate output signals shared between individual pairs of inputs from a signal common to all inputs.
For example, consider three base channels, a, B and C, which are composed of uncorrelated signals W, Y, Z and a common signal X, respectively:
A=X+W
B=X+Y
C=X+Z
if the average cross product is calculated, as in the second-order case, all terms containing W, Y, or Z average out, leaving the average of X³:

⟨ABC⟩ = ⟨X³⟩

Unfortunately, if X is a zero-mean time signal, the average of its cube is also zero. Unlike X², which is positive for any non-zero value of X, X³ has the same sign as X, so the positive and negative contributions cancel. The same is clearly true of any odd power of X, corresponding to a module with an odd number of inputs, but even exponents greater than 2 can also give erroneous results; for example, 4 inputs with components (X, −X) and (X, X) produce the same product average.
This problem can be solved with a modified average-product technique. Before averaging, the sign of each product is removed by taking its absolute value. The signs of the factors of each product are then checked: if they are all the same, the absolute value of the product is passed to the average; if any sign differs from the others, the negated absolute value is averaged instead. Because the number of possible same-sign combinations does not equal the number of mixed-sign combinations, the negated absolute products are weighted by a compensating factor equal to the ratio of the number of same-sign combinations to the number of mixed-sign combinations. For example, a three-input module has 2 same-sign cases out of 8 possibilities and 6 mixed-sign cases, so the scale factor is 2/6 = 1/3. With this compensation, the integrated or summed product grows if and only if a common signal component is present at all inputs of a decoding module.
However, for the averages of modules of different orders to be comparable, they must all have the same dimensions. A conventional second-order correlation averages two-way products and is therefore scaled as energy, or power. The terms averaged in higher-order correlations must likewise be given power dimensions: for a K-th-order correlation, the absolute value of each product must be raised to the power 2/K before averaging.
Of course, regardless of order, the energy at each input node of a module can, if desired, be computed as the average of the square of the corresponding node signal; it need not first be raised to the K-th power and then reduced back to a second-order quantity.
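The sign-corrected, power-dimensioned average product described above can be sketched as follows. The weighting 2/(2^K − 2) generalizes the 2/6 example to any K; the variable names and test signals are illustrative.

```python
import numpy as np

def common_energy(signals):
    """Sign-corrected average product for a K-input decoding module.

    For each sample, the product of all K inputs is taken and its absolute
    value is raised to the power 2/K, giving the average power dimensions.
    Same-sign samples contribute positively; mixed-sign samples contribute
    negatively, weighted by (same-sign combinations)/(mixed combinations)
    = 2 / (2**K - 2), e.g. 2/6 = 1/3 for K = 3.
    """
    s = np.asarray(signals)              # shape (K, T)
    K = s.shape[0]
    prod = np.prod(s, axis=0)
    mag = np.abs(prod) ** (2.0 / K)      # power-dimensioned magnitude
    same = np.all(s > 0, axis=0) | np.all(s < 0, axis=0)
    weight = 2.0 / (2**K - 2)
    contrib = np.where(same, mag, -weight * mag)
    return float(contrib.mean())

rng = np.random.default_rng(1)
X, W, Y, Z = rng.standard_normal((4, 100_000))
with_common = common_energy([X + W, X + Y, X + Z])   # shared component X
without = common_energy([W, Y, Z])                   # fully independent
```

For fully independent inputs the positive and (weighted) negative contributions cancel on average, while a component common to all inputs drives the average positive.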
The shared node: adjacent level
The common output-channel signal energies can thus be estimated from the averaged, dimension-corrected cross products of the base channel signals. The examples above involve a single interpolation processor, but if one or more of the A/B(/C) nodes is shared with another module carrying its own common signal component, uncorrelated with any other signal, the average cross products calculated above are unaffected, so the calculation is inherently free of such pulling effects. (Note: if two output signals are partially correlated, they will tend to pull somewhat toward each other in the decoder, but a similar effect occurs in human hearing, so the system again remains faithful to human audition.)
Once each decoding module has estimated the common output-channel signal energy on each of its base channels, the supervisor function informs neighboring modules of one another's common energies, after which generation of the output channel signals proceeds as described below. The common energy a module attributes to a node must take account of the layered structure in which modules of different orders may overlap: the common energy of a higher-order module must be subtracted from the common-energy estimate of any lower-order module sharing the same node.
For example, assume two adjacent base channels A and B representing two horizontal directions and a base channel C representing a vertical direction, and further assume an intermediate or derived output channel with signal energy X² in an interior direction (i.e., within the bounds of A, B, and C). The common energy of the three-input module (A, B, C) will be X², but the common energies of the two-input modules (A, B), (B, C), and (A, C) will also each be X². Simply summing the common energies of the modules (A, B, C), (A, B), and (A, C) to which A is connected would yield 3X² instead of X². To compute the common node energy correctly, the common energy of each higher-order module is first subtracted from the estimate of each overlapping lower-order module, so the common energy X² of the higher-order module (A, B, C) is subtracted from each two-input module's estimate, leaving 0 in each case, and the net common-energy estimate at node A is X² + 0 + 0 = X².
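The subtraction order can be sketched as below. The module list and the unit value of X² are illustrative, not prescribed by the text.

```python
def net_common_energy(node, modules):
    """Net common energy at a shared node.

    modules: list of (channel_set, common_energy) pairs, one per decoding
    module.  The common energy of every strictly larger (higher-order)
    module covering the node is first subtracted from each lower-order
    module's estimate; the corrected estimates of all modules touching
    the node are then summed.
    """
    total = 0.0
    for chans, e in modules:
        if node not in chans:
            continue
        for other, eo in modules:
            if other > chans and node in other:   # proper superset module
                e -= eo
        total += max(e, 0.0)
    return total

X2 = 1.0
mods = [(frozenset("ABC"), X2),    # three-input module sees X^2
        (frozenset("AB"), X2),     # overlapping two-input modules
        (frozenset("AC"), X2)]     # each also see X^2
```

For node A this reproduces the worked example: X² + 0 + 0 = X², rather than 3X².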
Output channel signal generation
As described above, regenerating the full set of output channels from the transmitted channels by a linear method is basically a matrix method: a weighted sum of the base channels is formed to obtain each output channel signal. The optimal choice of matrix scale factors is in general signal dependent. Indeed, if the number of currently active output channels equals the number of transmitted channels (but representing different directions), the system is exactly constrained, and it is mathematically possible to invert the effective encoding matrix and recover replicas of the separate source signals. Even when the number of active output channels exceeds the number of base channels, it may still be possible to compute a pseudoinverse matrix.
Unfortunately, this approach is problematic in its computational requirements, especially with multiband processing and high-precision floating-point implementation, and that is a major practical consideration. Even if intermediate signals are assumed to lie between nearest-neighbor base channels, the mathematical inverse or pseudoinverse of the effective encoding matrix generally has contributions from every base channel to every output channel, because of the node-sharing effect. Given any imperfection in the decoding, which is practically unavoidable, a base channel signal may then be reproduced by an output channel spatially distant from it, which is highly undesirable. Furthermore, pseudoinverse calculations tend toward the minimum-RMS-energy solution, which spreads the sound as widely as possible, giving minimal separation; this is quite contrary to the aims of the present invention.
Thus, to achieve a practical, fault-tolerant decoder whose inherent decoding errors are spatially confined, the same modular structure used for signal detection is used for signal generation.
The generation of the regenerated output signals by a decoding module is described in detail below. Note that the effective position of each output channel connected to a module is assumed to be specified by the ratio of amplitudes required to pan a signal to its physical position, i.e. by the ratio of the corresponding effective matrix encoding coefficients. To avoid division-by-zero problems, the ratio is typically computed as the quotient of one channel's matrix coefficient divided by the RMS sum of all the matrix coefficients of that channel (usually 1). For example, the ratio used in a two-input module with L and R inputs is the L amplitude divided by the RMS sum of the L and R amplitudes ("L-ratio"), with a range of values from 0 to 1. If the two-input decoding module has 5 output channels with effective encoding matrix coefficient pairs (1.0, 0), (0.89, 0.45), (0.71, 0.71), (0.45, 0.89), and (0, 1.0), the corresponding L-ratios are 1.0, 0.89, 0.71, 0.45, and 0, since each coefficient pair has an RMS sum of 1.0.
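The L-ratio computation for five coefficient pairs of this kind can be sketched as:

```python
import math

def l_ratio(a, b):
    """L-ratio of an (L, R) coefficient pair: L over the pair's RMS sum."""
    return a / math.hypot(a, b)

# Example coefficient pairs for a two-input, five-output module.
pairs = [(1.0, 0.0), (0.89, 0.45), (0.71, 0.71), (0.45, 0.89), (0.0, 1.0)]
ratios = [l_ratio(a, b) for a, b in pairs]
```

Because each pair's RMS sum is (very nearly) 1.0, the L-ratio reduces to the first coefficient of the pair: 1.0, 0.89, 0.71, 0.45, 0.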
Any common-node signal energy claimed by an adjacent decoding module is subtracted from the signal energy at each input node (base channel) of the decoding module, giving the normalized input signal power levels used in the remainder of the calculation.
The dominant direction indication is calculated as the vector sum of the base channel directions weighted by their relative energies. For a two-input module, this reduces to the L-ratio of the normalized input signal power levels.
The pair of output channels bracketing the dominant direction is found by comparing the L-ratio of the dominant direction from the previous step with the L-ratios of the output channels. For example, if the dominant direction of the five-output decoding module above has an L-ratio of 0.75, the second and third output channels bracket it, since 0.89 > 0.75 > 0.71.
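The bracket search can be sketched as a scan over the descending list of output-channel L-ratios:

```python
def bracketing_channels(ratios, dominant):
    """Indices of the two output channels whose L-ratios bracket the
    dominant direction.  `ratios` must be sorted in descending order."""
    for i in range(len(ratios) - 1):
        if ratios[i] >= dominant >= ratios[i + 1]:
            return i, i + 1
    raise ValueError("dominant ratio outside channel range")

ratios = [1.0, 0.89, 0.71, 0.45, 0.0]   # the five-output example above
```

With a dominant L-ratio of 0.75 this returns the second and third channels (indices 1 and 2), matching the worked example.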
The panning scale factors that map the dominant signal onto the nearest bracketing output channels are calculated from the ratio of the anti-dominant signal levels of those channels. The anti-dominant signal associated with a particular output channel is the result of applying that channel's anti-dominant matrix scale factors to the decoding module input signals. The anti-dominant matrix scale factors of an output channel are the scale factors, with RMS sum equal to 1, that produce zero output when a lone dominant signal is positioned exactly at that output channel. If the encoding matrix scale factors for an output channel are (A, B), its anti-dominant scale factors are (B, −A).
Proof
If a lone dominant signal is positioned at an output channel with encoding scale factors (A, B), the signal must have amplitudes (KA, KB), where K is its overall amplitude, so the anti-dominant signal for that channel is KA·B − KB·A = 0.
Thus, if the dominant signal formed from the two-input module signals (x(t), y(t)) has normalized input amplitudes (X, Y) with RMS sum 1, the dominant signal is Dom(t) = X·x(t) + Y·y(t). If its position lies between output channels with matrix scale factors (A, B) and (C, D) respectively, the dominant-signal scale factor applied to Dom(t) for the channel with matrix scale factors (A, B) is:

SF(A,B) = sqrt((DX − CY) / ((DX − CY) + (BX − AY)))

and for the channel with matrix scale factors (C, D), the corresponding dominant-signal scale factor is:

SF(C,D) = sqrt((BX − AY) / ((DX − CY) + (BX − AY)))
when the dominant direction is removed from one output channel to the other, the two scaling factors are removed in opposite directions between 0 and 1 with constant power sums.
The anti-dominant signal is calculated and distributed, with appropriate scaling, to all the non-dominant channels. The anti-dominant signal is a matrixed signal containing none of the dominant signal. If the decoding module inputs are (x(t), y(t)) with normalized dominant amplitudes (X, Y), the dominant signal is X·x(t) + Y·y(t) and the anti-dominant signal is Y·x(t) − X·y(t), regardless of the positions of the non-dominant output channels.
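A quick check that the anti-dominant combination rejects the dominant signal (the amplitudes and the sine-wave source are illustrative):

```python
import numpy as np

def dominant_antidominant(x, y, X, Y):
    """Split two-input module signals x(t), y(t) into dominant and
    anti-dominant components, given normalized dominant amplitudes (X, Y)."""
    dom = X * x + Y * y
    anti = Y * x - X * y
    return dom, anti

t = np.linspace(0.0, 1.0, 1000)
sig = np.sin(2 * np.pi * 5 * t)      # a single source signal
X, Y = 0.6, 0.8                      # normalized: X**2 + Y**2 == 1
x, y = X * sig, Y * sig              # the source panned at (X, Y)
dom, anti = dominant_antidominant(x, y, X, Y)
```

For a lone source panned at (X, Y), the dominant output recovers the source exactly and the anti-dominant output is zero.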
In addition to the dominant/anti-dominant signal distribution, a second signal distribution is computed using a "passive" matrix based on the output channel matrix scale factors already discussed, scaled to preserve power.
The cross-correlation of the decoding module input signals is calculated as the average cross product of the input signals divided by the product of the normalized input levels (i.e., by the square root of the product of the input signal energies).
Returning to the generation process: the final output is computed as a weighted cross-fade between the dominant and passive signal distributions, where the cross-fade factor is derived from the decoding module's input cross-correlation. At a correlation of 1, only the dominant/anti-dominant distribution is used. As the correlation falls, the output signal array is widened by cross-fading toward the passive distribution, which is fully reached at a low positive correlation value, typically 0.2 to 0.4, depending on the number of output channels connected to the decoding module. As the correlation decreases further toward zero, the passive amplitude output profile gradually bows outward, reducing the output signal level to mimic the human ear's response to such signals.
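The cross-fade control can be sketched as below. The linear ramp and the 0.3 knee are illustrative choices within the quoted 0.2–0.4 range, and the further outward bowing below the knee is omitted.

```python
import numpy as np

def output_profile(dominant_profile, passive_profile, correlation, knee=0.3):
    """Cross-fade between the dominant/anti-dominant distribution (used
    fully at correlation 1) and the passive matrix distribution (reached
    fully at or below the low positive correlation 'knee')."""
    f = np.clip((correlation - knee) / (1.0 - knee), 0.0, 1.0)
    return (f * np.asarray(dominant_profile)
            + (1.0 - f) * np.asarray(passive_profile))
```

For example, with a two-channel dominant profile [1, 0] and passive profile [0.5, 0.5], a correlation of 1 yields the dominant profile and a correlation at or below the knee yields the passive one.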
Vertical processing
Most of the processing described so far for generating output channel signals from adjacent base channels is independent of the directions of the output and base channels. However, because of the horizontal placement of the ears, human auditory localization is less sensitive to inter-channel correlation in the vertical direction than in the horizontal. Remaining faithful to human hearing may therefore require relaxing the correlation constraint in decoding modules with vertically oriented inputs, for example by warping the correlation value before using it. It is also possible, however, that using the same processing as for the horizontal channels causes no audible degradation, which would simplify the overall decoder structure.
Strictly speaking, vertical signals include sounds from above and below, and the decoder structure described should work equally well for both; in practice, however, there is usually little natural sound arriving from below, so that processing and those channels can be omitted without compromising the perceived spatial fidelity of the system.
This becomes practically significant when applying channel translation to existing 5.1-channel surround material, which of course has no vertical channels. Such material may nevertheless contain vertical information, such as a fly-over, recorded coherently across several or all horizontal channels. It should therefore be possible to extract a virtual vertical channel from such source material by examining the correlation between non-adjacent channels or channel groups. Where such correlation exists, it will generally indicate vertical information from above, rather than below, the listener. In some cases virtual vertical information might also be derived by a reverberation generator, perhaps keyed to a model of the listening environment. Once virtual vertical channels are extracted or derived from the 5.1-channel source, extension to a larger number of channels, such as the 24-channel arrangement described above, can proceed as though a real vertical channel were supplied.
Memory orientation
Regarding the control of signal generation in the decoding module, which, as described above, operates similarly to a 2:N autonomous decoder such as a Pro Logic decoder, one aspect of the present invention is that the only "memory" in the processing lies in the smoothing networks that generate the basic control signals. At any instant there is a single dominant direction and a single input correlation value, and signal generation is based directly on these.
However, especially in complex acoustic environments (the classic cocktail party), the human ear exhibits a degree of positional memory, or inertia: a brief, well-localized dominant sound from a given direction causes subsequent, poorly localized sounds from no particular direction to be perceived as coming from the same source.
This effect can be modeled in the decoding module (and indeed in Pro Logic decoding) by adding an explicit mechanism that tracks the most recent dominant direction and, during directionally ambiguous signal conditions, weights the output signal distribution toward it. This can improve the perceived focus and stability of reproduction for complex signal arrays.
Modified correlation and selective channel mixing
As previously noted, each decoding module's output profile is determined from the simultaneous cross-correlation of its input signals, which in some cases may underestimate the appropriate output signal spread. This occurs, for example, with naturally recorded signals in which non-central directions have slightly different arrival times and unequal amplitudes, reducing the correlation value. The effect can be more severe with widely spaced microphones and the correspondingly larger inter-channel delays. To compensate, the correlation computation can be extended to cover a range of inter-channel delays, at the cost of somewhat higher processing MIPS. Since auditory nerve cells have an effective time constant of about 1 millisecond, a more realistic correlation value can also be obtained by first smoothing the detected signals with a smoother having a 1-millisecond time constant.
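A 1-millisecond one-pole smoother of the kind suggested can be sketched as follows; the sample rate and coefficient form are illustrative.

```python
import numpy as np

def leaky_integrate(x, fs, tau=0.001):
    """One-pole leaky integrator with time constant `tau` seconds,
    applied before correlation to mimic ~1 ms auditory-nerve smoothing."""
    a = np.exp(-1.0 / (fs * tau))        # per-sample decay coefficient
    y = np.empty_like(x)
    acc = 0.0
    for i, v in enumerate(x):
        acc = a * acc + (1.0 - a) * v    # leaky accumulation
        y[i] = acc
    return y

smoothed = leaky_integrate(np.ones(500), 48000.0)
```

For a constant input, the output rises monotonically and settles at the input value within a few time constants.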
Furthermore, if a content producer has an existing 5.1-channel program with strongly uncorrelated channels, the uniformity of the distribution produced by the channel-translation decoder can be improved by lightly mixing adjacent channels together, raising their correlation so that the decoding modules provide a more uniform spread across their intermediate output channels. This mixing can also be made selective; for example, the center front channel can be left unmixed to preserve the compactness of dialogue tracks.
Volume compression/expansion
When the encoding process mixes a larger number of channels down to a smaller number, the encoded signal may clip unless some form of gain compensation is provided. The problem exists for conventional matrix encoding as well, but is more likely with channel translation because more channels are mixed into a given output channel. To avoid clipping, an overall gain scale factor is applied by the encoder and transmitted to the decoder in the encoded bitstream. Typically this value is 0 dB, but the encoder can set it to a non-zero attenuation to avoid clipping, and the decoder applies an equivalent compensating gain.
If the decoder is used to process existing multichannel material that lacks this scale factor (e.g., an existing 5.1-channel soundtrack), it should either assume a fixed default value (nominally 0 dB), apply an expansion function based on signal level and/or dynamic range, or exploit whatever metadata is available, such as a dialog normalization value, to adjust the decoder gain.
The present invention and its various aspects may be implemented in analog circuitry, or more likely in a digital signal processor, a programmed general-purpose digital computer, and/or a special-purpose digital computer, as software functions. The interface between the analog and digital signal streams may be implemented in suitable hardware and/or as functions in software and/or firmware.

Claims (2)

1. A method for converting M input channels representing a soundfield into N output channels representing the same soundfield, wherein each channel is a single audio stream representing sound arriving from one direction, M and N are positive integers, and M is a positive integer equal to or greater than 2, the method comprising:
a plurality of decoding module operations, wherein a plurality of module operations share a plurality of input channels of the M input channels, each module operation being either
Comprising a matrix generating one or more output channels constituting a subset of said N channels and controlling the matrix according to two or more of the spatially adjacent closest base channels operatively associated with the decoding module, or
Generating control signals from two or more of the spatially adjacent closest base channels associated with the decoding module operation, the control signals being used, together with control signals generated by other decoding module operations, to alter the coefficients of a variable matrix to produce all of the output channels, or
Control signals are generated from two or more of the spatially adjacent closest base channels associated with that decoding module operation, which control signals, together with control signals generated by other decoding module operations, are used to vary the scale factors of the inputs/outputs to/from a fixed matrix to generate all of the output channels.
2. The method of claim 1, wherein the module operations are hierarchically ordered in terms of their number of input channels, and further comprising a hypervisor operation in communication with the module operations to control sharing of input signals in accordance with their hierarchical ordering.
HK04109904.2A 2001-02-07 2002-02-07 Method for audio channel translation HK1066966B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US26728401P 2001-02-07 2001-02-07
US60/267,284 2001-02-07
PCT/US2002/003619 WO2002063925A2 (en) 2001-02-07 2002-02-07 Audio channel translation

Publications (2)

Publication Number Publication Date
HK1066966A1 HK1066966A1 (en) 2005-04-01
HK1066966B true HK1066966B (en) 2007-04-13
