
HK1214062B - Audio channel spatial translation - Google Patents


Info

Publication number
HK1214062B
Authority
HK
Hong Kong
Prior art keywords
channels
input
output
channel
audio
Prior art date
Application number
HK16100846.8A
Other languages
Chinese (zh)
Other versions
HK1214062A1 (en)
Inventor
M. F. Davis
Original Assignee
Dolby Laboratories Licensing Corporation
Application filed by Dolby Laboratories Licensing Corporation
Publication of HK1214062A1
Publication of HK1214062B


Description

Audio channel spatial conversion
This application is a divisional of the invention patent application with application number 200980151223.5, filed December 16, 2009, entitled "Audio Channel Spatial Translation".
Cross Reference to Related Applications
This application claims priority to U.S. Provisional Patent Application No. 61/138,823, filed December 18, 2008, which is incorporated herein by reference in its entirety.
Technical Field
The present invention relates to audio signal processing. More particularly, the present invention relates to converting multiple audio input channels representing a soundfield into one or more audio output channels representing the same soundfield, wherein each channel is a single audio stream representing audio arriving from a certain direction.
Background
Although a person has only two ears, we listen as three-dimensional entities, relying on a number of localization cues such as Head Related Transfer Functions (HRTFs) and head movements. Full-fidelity sound reproduction therefore requires preserving and reproducing a full 3D sound field, or at least its perceptual cues. Unfortunately, sound recording technology does not address the capture of 3D sound fields, of a 2D plane of sound, or even of a 1D line of sound. Current sound recording technology addresses strictly zero-dimensional, discrete-channel capture, preservation, and reproduction of audio.
Most of the effort to improve fidelity since Edison's original invention of sound recording has focused on remedying the shortcomings of his original analog modulated-groove cylinder/disc medium. These include limited, non-uniform frequency response, noise, distortion, wow, flutter, speed accuracy, wear, dust, and copy generation loss. Although there were numerous piecemeal attempts at improvement (including electronic amplification, tape recording, noise reduction, and record players costing more than some automobiles), the traditional problems of individual channel quality were not finally laid to rest until the development of overall excellent digital recording, particularly the introduction of the audio Compact Disc. Since then, aside from some efforts to further extend digital recording quality to 24-bit/96 kHz sampling, the main efforts of audio reproduction research have focused on reducing the amount of data required while maintaining individual channel quality, primarily through the use of perceptual coders, and on increasing spatial fidelity. The latter problem is the subject of this document.
Efforts to improve spatial fidelity have progressed along two fronts: attempts were made to transmit perceptual cues of the full sound field, as well as attempts to transmit approximations to the actual original sound field. Examples of systems using the former approach include binaural recordings and two-speaker based virtual surround sound systems. These systems present a number of unfortunate disadvantages, particularly in terms of reliably localizing sound in some directions and the need to use headphones or fixed individual listener positions.
To present spatial sound to multiple listeners, whether in the living room or in commercial venues such as movie theaters, the only viable alternative has been to try to approximate the actual original sound field. Given the discrete-channel nature of sound recording, it is not surprising that most efforts to date have involved what may be considered a conservative increase in the number of channels. Representative systems include mono, the three-loudspeaker film soundtracks of the early 1950s, conventional stereo, the quadraphonic systems of the 1960s, five-channel discrete magnetic soundtracks on 70 mm film, the matrix-based Dolby Surround of the 1970s, AC-3 5.1-channel sound of the 1990s, and most recently Surround EX 6.1-channel sound. "Dolby", "Pro Logic" and "Surround EX" are trademarks of Dolby Laboratories Licensing Corporation. To a greater or lesser degree, these systems provide enhanced spatial reproduction compared to mono. However, mixing a larger number of channels imposes a time and cost penalty on the content producer, and the resulting percept is typically one of several diffuse, discrete channels rather than a continuous sound field. Aspects of Dolby Pro Logic decoding are described in U.S. Patent 4,799,260, which is incorporated herein by reference in its entirety. Details of AC-3 are set forth in "Digital Audio Compression Standard (AC-3, E-AC-3), Revision B", Advanced Television Systems Committee, 14 June 2005.
When the sound field is characterized, in principle the decoder is able to derive the best signal feed for any output speaker. Here, the channels provided to such a decoder will be referred to variously as "primary", "transmitted" and "input" channels, and any output channel having a position that does not correspond to the position of one of the input channels will be referred to as an "intermediate" channel. The output channels may also have a position that coincides with the position of the input channels.
Disclosure of Invention
Coding or down-mixing
In accordance with an encoding or down-mixing aspect of the present invention, in a method for converting M audio input channels to N audio output channels, each of the M audio input channels is associated with a spatial direction, each of the N audio output channels is associated with a spatial direction, M and N are positive integers, M is 3 or greater, and N is 3 or greater. The method comprises deriving the N audio output channels from the M audio input channels, wherein one or more of the M audio input channels are associated with a spatial direction different from the spatial direction associated with any of the N audio output channels, and at least one of those one or more of the M audio input channels is mapped to a corresponding set of at least three of the N output channels. The set of at least three output channels may be associated with contiguous spatial directions. N may be 5 or greater, and the deriving may map the at least one of the one or more M audio input channels to a corresponding set of three, four, or five of the N output channels. A set of three, four, or five of the N output channels may be associated with contiguous spatial directions.
In a particular embodiment, M may be at least 6, N may be at least 5, and the M audio input channels may be associated respectively with five spatial directions corresponding to the five spatial directions associated with the N audio output channels, and with at least one spatial direction not associated with the N audio output channels.
Each of the N audio output channels may be associated with a spatial direction in a common plane. At least one of the associated spatial directions of the M audio input channels may lie above or below a plane associated with the N audio output channels. At least some of the associated spatial directions of the M audio input channels may vary in distance relative to a reference spatial direction.
In particular embodiments, the spatial directions associated with the N audio output channels may include left, center, right, left surround, and right surround. The spatial directions associated with the M audio input channels may include left, center, right, left surround, right surround, left front elevated, center front elevated, right front elevated, left surround elevated, center surround elevated, and right surround elevated. The spatial directions associated with the M audio input channels may further comprise an elevated top.
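As a rough illustration of mapping one input channel to a set of its nearest output channels, the sketch below distributes an input channel's signal over its three nearest output directions with power-preserving gains. The inverse-distance gain law used here is a hypothetical choice for illustration only; the text does not specify a particular panning law.

```python
import math

def downmix_channel(samples, input_angle, output_angles, spread=3):
    """Distribute one input channel over its `spread` nearest output
    channels by power-preserving amplitude panning. The inverse-distance
    gain law is an illustrative assumption, not a specified method."""
    def ang_dist(a, b):
        # Smallest angular distance on a circle, in degrees
        return abs((a - b + 180.0) % 360.0 - 180.0)

    dists = [(ang_dist(a, input_angle), i) for i, a in enumerate(output_angles)]
    nearest = sorted(dists)[:spread]
    weights = [1.0 / (d + 1.0) for d, _ in nearest]
    norm = math.sqrt(sum(w * w for w in weights))  # so sum of gains^2 == 1
    gains = [w / norm for w in weights]
    outputs = [[0.0] * len(samples) for _ in output_angles]
    for (_, i), g in zip(nearest, gains):
        outputs[i] = [g * s for s in samples]
    return outputs
```

Because the squared gains sum to one, total output energy equals input energy, and the output channel nearest the input direction receives the largest share.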
Decoding or upmixing
In accordance with a decoding or upmixing aspect of the present invention, in a method for converting N audio input channels to M audio output channels, each of the N audio input channels is associated with a spatial direction, each of the M audio output channels is associated with a spatial direction, M and N are positive integers, N is 3 or greater, and M is 1 or greater. The method comprises deriving the M audio output channels from the N audio input channels, wherein one or more of the M audio output channels are associated with a spatial direction different from the spatial direction associated with any of the N audio input channels, and at least one of those one or more of the M audio output channels is derived from a corresponding set of at least three of the N input channels. At least one of the one or more M audio output channels may be derived from a corresponding set of at least three of the N input channels at least in part by approximating cross-correlations of the at least three of the N input channels. Approximating the cross-correlation may include calculating a common energy for each pair of the at least three of the N input channels. The common energy of any such pair may have a minimum value. The derived magnitudes of the M audio output channels may be based on the lowest estimated magnitude of the common energy of any pair of the at least three of the N input channels. The amplitude of the derived M audio output channels may be taken to be zero when the common energy of any pair of the at least three of the N input channels is zero.
A plurality of derived M audio output channels may be derived from respective sets of N input channels sharing a common pair of the N input channels, wherein calculating the common energy may include compensating for the common energy of the shared common pair of the N input channels.
The approximating may comprise processing the plurality of derived M audio channels in a hierarchical order, such that each derived audio channel is ranked according to the number of input channels from which it is derived, the largest number of input channels having the highest ranking, and the approximating processes the plurality of derived M audio channels in turn according to that hierarchical order.
Calculating the common energy may further comprise compensating for the common energy of the shared common pair of N input channels related to the derived audio channel having the higher hierarchical level.
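For illustration, the hierarchical ordering and compensation described above can be sketched as follows. The module dictionary layout, and the rule of deducting each module's attributed energy from the common energy of its shared input pairs, are assumptions made for this sketch, not structures defined in the text; the "lowest common energy among a module's pairs" rule follows the decoding aspect stated earlier.

```python
def process_hierarchy(modules, common_energy):
    """Process decoding modules in hierarchical order: modules deriving
    from more input channels rank higher and are handled first. Each
    module's output energy is taken as the lowest common energy among
    its input pairs, and that energy is then deducted from the shared
    pairs so lower-ranked modules sharing a pair are compensated.
    (Illustrative sketch; data layout and deduction rule are assumed.)"""
    remaining = dict(common_energy)  # pair -> common energy not yet attributed
    ordered = sorted(modules, key=lambda m: len(m["inputs"]), reverse=True)
    out = {}
    for m in ordered:
        pairs = [tuple(sorted(p)) for p in m["pairs"]]
        # 'lowest estimated magnitude of the common energy of any pair'
        e = min(remaining.get(p, 0.0) for p in pairs)
        out[m["name"]] = e
        for p in pairs:
            # compensate: deduct the energy this module has claimed
            remaining[p] = max(remaining.get(p, 0.0) - e, 0.0)
    return out
```

A three-input module sharing the L/R pair with a two-input module would thus claim its share of the L/R common energy first, leaving only the uncluaimed remainder to the lower-ranked module.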
A set of at least three of the N input channels may be associated with contiguous spatial directions.
N may be 5 or greater, and the deriving may derive the at least one of the one or more M audio output channels from a corresponding set of three, four, or five of the N input channels. A set of three, four, or five of the N input channels may be associated with contiguous spatial directions.
In particular embodiments, M may be at least 6, N may be 5, and the M audio output channels may be associated respectively with five spatial directions corresponding to the five spatial directions associated with the N audio input channels, and with at least one spatial direction not associated with the N audio input channels.
Each of the N audio input channels may be associated with a spatial direction in a common plane. At least one of the associated spatial directions of the M audio output channels may lie above or below the plane associated with the N audio input channels. At least some of the associated spatial directions of the M audio output channels may vary in distance relative to a reference spatial direction.
In particular embodiments, the spatial directions associated with the N audio input channels may include left, center, right, left surround, and right surround. The spatial directions associated with the M audio output channels may include left, center, right, left surround, right surround, left front elevated, center front elevated, right front elevated, left surround elevated, center surround elevated, and right surround elevated. The spatial directions associated with the M audio output channels may further comprise an elevated top.
According to a first one of other aspects of the invention, a method for converting M audio input signals to N audio output signals, each of the M audio input signals being associated with a direction and each of the N audio output signals being associated with a direction, where N is greater than M, M is 2 or more, and N is a positive integer equal to 3 or more, comprises: providing an M:N variable matrix; applying the M audio input signals to the variable matrix; deriving the N audio output signals from the variable matrix; and controlling the variable matrix in response to the input signals such that the sound field produced by the output signals has, when the input signals are highly correlated, a compact sound image in the nominal principal direction of the input signals, which image spreads from compact to broad as the correlation decreases, and gradually splits into a plurality of compact sound images as the correlation continues to decrease toward highly uncorrelated, each of the plurality of compact sound images being located in a direction associated with an input signal.
According to this first of the other aspects of the invention, the variable matrix may be controlled in response to (1) the relative levels of the input signals and (2) a measure of the cross-correlation of the input signals. In this case, for a measure of cross-correlation of the input signals having values in a first range bounded by a maximum value and a reference value, the sound field may have a compact sound image when the measure of cross-correlation is the maximum value and may have a widely diffused image when the measure of cross-correlation is the reference value, and for a measure of cross-correlation of the input signals having values in a second range bounded by the reference value and a minimum value, the sound field may have a widely diffused image when the measure of cross-correlation is the reference value and may have a plurality of compact sound images each located in a direction associated with an input image when the measure of cross-correlation is the minimum value.
According to a further one of other aspects of the invention, a method for converting M audio input signals to N audio output signals, each of the M audio input signals being associated with a direction and each of the N audio output signals being associated with a direction, where N is greater than M and M is 3 or greater, comprises: providing a plurality of m:n variable matrices, where each m is a subset of M and each n is a subset of N; applying to each of the plurality of variable matrices a corresponding subset of the M audio input signals; deriving from each of the plurality of variable matrices a corresponding subset of the N audio output signals; controlling each of the plurality of variable matrices in response to the subset of input signals applied to that variable matrix such that, when the subset of input signals applied to that variable matrix is highly correlated, the sound field produced by the corresponding subset of output signals derived from that variable matrix has a compact sound image in the nominal principal direction of that subset of input signals, which image spreads from compact to broad as the correlation decreases, and gradually splits into a plurality of compact sound images as the correlation continues to decrease toward highly uncorrelated, each of the plurality of compact sound images being in a direction associated with an input signal applied to that variable matrix; and deriving the N audio output signals from the subsets of n audio output signals.
According to this further aspect of the further aspects of the invention, the variable matrix may also be controlled in response to information compensating for the effect of one or more other variable matrices receiving the same input signal. Furthermore, deriving the N audio output signals from a subset of the N audio output channels may also include compensating for multiple variable matrices that produce the same output signal. According to such further aspects of other aspects of the invention, each of the plurality of variable matrices may be controlled in response to (a) the relative levels of the input signals applied to the variable matrix and (b) a measure of the cross-correlation of the input signals.
According to yet a further one of other aspects of the invention, a method for converting M audio input signals to N audio output signals, each of the M audio input signals being associated with a direction and each of the N audio output signals being associated with a direction, where N is greater than M and M is 3 or greater, comprises: providing an M:N variable matrix responsive to matrix coefficients or scale factors that control the matrix outputs; applying the M audio input signals to the variable matrix; providing a plurality of m:n variable-matrix scale-factor generators, where each m is a subset of M and each n is a subset of N; applying a corresponding subset of the M audio input signals to each of the plurality of variable-matrix scale-factor generators; deriving from each of the plurality of variable-matrix scale-factor generators variable-matrix scale factors for a corresponding subset of the N audio output signals; controlling each of the plurality of variable-matrix scale-factor generators, in response to the subset of input signals applied to it, such that, when the scale factors it generates are applied to the M:N variable matrix, the sound field produced by the corresponding subset of generated output signals has, when the subset of input signals producing the applied scale factors is highly correlated, a compact sound image in the nominal principal direction of that subset of input signals, which image spreads from compact to broad as the correlation decreases, and gradually splits into a plurality of compact sound images as the correlation continues to decrease toward highly uncorrelated, each of the plurality of compact sound images being in a direction associated with an input signal producing the applied scale factors; and deriving the N audio output signals from the variable matrix.
According to this yet further one of the other aspects of the invention, the variable-matrix scale-factor generators may also be controlled in response to information compensating for the effect of one or more other variable-matrix scale-factor generators receiving the same input signals. Furthermore, deriving the N audio output signals from the variable matrix may include compensating for multiple variable-matrix scale-factor generators that produce scale factors for the same output signal. According to such aspects, each of the plurality of variable-matrix scale-factor generators may be controlled in response to (a) the relative levels of the input signals applied to that generator and (b) a measure of the cross-correlation of those input signals.
As used herein, a "channel" is a single audio stream representing or associated with audio arriving from a direction (azimuth, elevation and, optionally, distance, so as to allow for nearer or farther virtual or projected channels).
According to the invention, M audio input channels representing a soundfield are converted into N audio output channels representing the same soundfield, wherein each channel is a single audio stream representing audio arriving from a direction, M and N are both positive integers, M is at least 2, N is at least 3, and N is greater than M. One or more sets of input channels are formed, each set being associated with one or more output channels. Each set is typically associated with two or more spatially adjacent input channels, and each output channel of a set is generated by determining a measure of the correlation of those two or more input channels and a measure of the interrelationship of their levels. The measure of cross-correlation is preferably a measure of zero-time-offset cross-correlation: the ratio of the common energy level to the geometric mean of the input signal energy levels. The common energy level is preferably a smoothed or averaged common energy level, and the input signal energy levels are smoothed or averaged input signal energy levels.
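The zero-time-offset cross-correlation measure described above can be sketched in a few lines. The one-pole smoother and the value of the smoothing constant `alpha` are illustrative choices, not values specified by the text.

```python
import math

def zero_lag_xcor(a, b, alpha=0.99):
    """Zero-time-offset cross-correlation: the ratio of the smoothed
    common energy level to the geometric mean of the smoothed input
    energy levels. One-pole smoothing and alpha are assumptions."""
    e_a = e_b = e_ab = 0.0
    for sa, sb in zip(a, b):
        e_a = alpha * e_a + (1.0 - alpha) * sa * sa    # smoothed energy of input a
        e_b = alpha * e_b + (1.0 - alpha) * sb * sb    # smoothed energy of input b
        e_ab = alpha * e_ab + (1.0 - alpha) * sa * sb  # smoothed common energy
    denom = math.sqrt(e_a * e_b)
    return e_ab / denom if denom > 0.0 else 0.0
```

Identical input waveforms yield a correlation of 1.0 regardless of relative amplitude, and phase-inverted waveforms yield -1.0, consistent with the measure's role as an indicator of the net common signal.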
In one aspect of the invention, multiple sets of output channels may be associated with more than two input channels, and a process may determine the correlation of the input channels associated with each set of output channels according to a hierarchical order such that each set or sets is ranked according to the number of input channels associated with its output channel or channels, the largest number of input channels having the highest rank, and the process sequentially processes the multiple sets according to their hierarchical order. Further, according to an aspect of the invention, the processing takes into account results of processing higher order sets.
Certain playback or decoding aspects of the present invention assume that each of the M audio input channels, representing audio arriving from a direction, is generated by passive-matrix nearest-neighbor amplitude-panned encoding of each source direction (i.e., each source direction is assumed to map primarily to the nearest input channel or channels), without additional side-chain information (the use of side chains or side information is optional), so that the invention is compatible with existing mixing techniques, consoles, and formats. While such source signals can be generated by explicitly employing a passive encoding matrix, most conventional recording techniques inherently generate such source signals (thus constituting an "effective encoding matrix"). Certain playback or decoding aspects of the present invention are also largely compatible with naturally recorded source signals, such as may be obtained with five real directional microphones, since sound arriving from an intermediate direction tends to map primarily to the nearest microphones (in a horizontal array, specifically to the nearest microphone pair), allowing for some possible delay.
Decoders or decoding processes according to aspects of the present invention may be implemented as a grid of coupled processing modules or modular functions (hereinafter "modules" or "decoding modules"), each decoding module being used to generate one or more output channels (or, alternatively, control signals that may be used to generate one or more output channels), typically from two or more of the nearest spatially adjacent input channels associated with the decoding module. The output channels typically represent the relative proportions of the audio signals in the nearest spatially adjacent input channels associated with a particular decoding module. As explained in more detail below, the decoding modules are loosely coupled to each other in the sense that modules share inputs, and there is a hierarchy of decoding modules. The modules are hierarchically ordered according to the number of input channels associated with them (the module or modules having the largest number of associated input channels ranking highest). A monitoring or supervisory function manages the modules so that common input signals are shared equally between or among modules, and higher-level decoder modules can affect lower-level ones.
Each decoder module may actually comprise a matrix so that it can generate the output signals directly, or each decoder module may generate control signals which, together with control signals generated by other decoder modules, are used to change the coefficients of the variable matrix or the scaling factors of the inputs or outputs of the fixed matrix in order to generate all the output signals.
The decoder modules model the workings of the human ear in an attempt to provide perceptually faithful reproduction. The signal conversion according to the invention, of which the decoder modules and modular functions are an aspect, may be applied to a wideband signal or to the frequency bands of a multiband processor, and may be performed once per sample or once per block of samples, depending on the implementation. Multiband embodiments may employ a filter bank, such as a discrete critical-band filter bank or a filter bank having a band structure compatible with an associated decoder, or a transform configuration, such as an FFT (Fast Fourier Transform) or MDCT (Modified Discrete Cosine Transform) linear filter bank.
Another aspect of the invention is that the number of loudspeakers receiving the N output channels can be reduced to a practical number by judicious reliance on virtual imaging, the production of perceived sound images at locations in space other than the locations of the loudspeakers. Although the most common use of virtual imaging is the stereo reproduction of images between two loudspeakers, virtual imaging as contemplated in this aspect of the present invention may include the presentation of phantom projected images beyond or within the walls of a room, providing auditory cues by panning a monophonic signal between channels. Virtual imaging has not been considered a viable technique for presentation to groups with a sparse number of channels, since it requires the listener to be equidistant, or nearly so, from the two loudspeakers. In a cinema, for example, the front left and front right loudspeakers are spaced too far apart for the majority of viewers to obtain a useful phantom center image; given the importance of the center channel as the source of most dialogue, a physical center loudspeaker is used instead.
As the density of loudspeakers increases, a point will be reached at which virtual imaging between any pair of loudspeakers becomes feasible for most viewers, at least to the extent of smoothing pans; with sufficient loudspeakers, the gaps between them are no longer perceptible.
Signal distribution
As described above, the measure of cross-correlation determines the ratio of dominant (common signal component) to non-dominant (non-common signal component) energy in the module, and the degree of dispersion of the non-dominant signal component among the output channels of the module. This can be better understood by considering the signal distribution to the output channels of the module under different conditions for the case of a two-input module. Unless otherwise indicated, the principles set forth herein extend directly to higher level modules.
A problem encountered in signal distribution is that there is often too little information, much less than the signals themselves, to recover the original signal amplitude distribution. The basic information available is the signal level at each module input and the averaged cross-product of the input signals, the common energy level. The zero-time-offset cross-correlation is the ratio of the common energy level to the geometric mean of the input signal energy levels.
The importance of the cross-correlation is that it serves as a measure of the net amplitude of the signal common to all inputs. If a single signal (an "interior" or "intermediate" signal) is panned anywhere between the inputs of a module, all inputs will have the same waveform (albeit possibly with different amplitudes), and under these conditions the correlation will be 1.0. At the other extreme, if all input signals are independent, meaning there is no common signal component, the correlation will be 0. A correlation value intermediate between 0 and 1.0 can be considered to correspond to some intermediate balance between the single common signal component and the independent signal components at the inputs. Thus, any input signal condition can be separated into a common, "dominant" signal component and the input signal components remaining after subtracting the common signal's contribution, comprising "everything else" ("non-dominant" or residual signal energy). As noted above, the common or "dominant" signal amplitude is not necessarily louder than the residual, non-dominant signal level.
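The dominant/residual separation just described can be sketched for a two-input module. Taking the common ("dominant") energy as the magnitude of the common energy level, clipped so it never exceeds either input energy, is a simplifying assumption made for this illustration; the text does not prescribe this exact allocation.

```python
def split_dominant_residual(e_a, e_b, e_ab):
    """Separate a two-input signal condition into a common ('dominant')
    energy and per-input residual ('non-dominant') energies.
    Assumption for illustration: common energy = |e_ab|, clipped to
    the smaller input energy so residuals remain non-negative."""
    common = min(abs(e_ab), e_a, e_b)
    return common, e_a - common, e_b - common
```

A fully panned single signal (common energy equal to both input energies) yields zero residuals; fully independent inputs (zero common energy) yield all-residual outputs.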
For example, consider the case of five channels in an arc (L (left), MidL (mid left), C (center), MidR (mid right), R (right)) mapped to a single Lt/Rt (left total/right total) pair; it is desirable to recover the original five channels from the single Lt/Rt pair. If all five channels carry independent signals of equal amplitude, the magnitudes of Lt and Rt will be equal, with an intermediate value of common energy corresponding to an intermediate value of cross-correlation between 0 and 1 (since Lt and Rt are dependent signals). The same levels could be achieved without any signal from MidL and MidR by appropriately choosing the levels of L, C and R. A two-input, five-output module might therefore feed only the output channel corresponding to the dominant direction (in this case, C) and the output channels (L, R) corresponding to the input signal residue after removing C energy from the Lt and Rt inputs, supplying no signal to the MidL and MidR output channels. This result is undesirable; shutting off a channel unnecessarily is almost always a bad choice, since small perturbations in signal conditions will cause the "off" channel to toggle between on and off, resulting in annoying chattering ("chattering" is the rapid turning on and off of a channel), especially when that channel is auditioned in isolation.
Thus, when there are multiple possible output signal distributions for a given set of module input signal values, the conservative approach from the standpoint of individual channel quality is to spread the non-dominant signal components as uniformly as possible among the module's output channels, consistent with the signal conditions. One aspect of the present invention spreads the available signal energy evenly, subject to signal conditions, according to a three-way split rather than a two-way "dominant" versus "non-dominant" split. Preferably, the three-way split comprises a dominant (common) signal component, a fill (uniformly spread) signal component, and an input signal component residue. Unfortunately, there is only enough information for a two-way split (the dominant signal component and all other signal components). A suitable method for implementing the three-way split is described herein, wherein for correlation values above a certain value, the two-way split is between dominant and diffuse non-dominant signal components, and for correlation values below this value, the two-way split is between diffuse signal components and residues. The common signal is thus split between "dominant" and "uniformly diffuse", and the "uniformly diffuse" component includes both common and residual signal components. Hence "diffuse" contains a mixture of common (correlated) and residual (uncorrelated) signal components.
Before processing, for a given input/output channel configuration of a given module, the correlation value corresponding to all output channels receiving the same signal magnitude is calculated. This correlation value may be referred to as the "random_xcor" value. For a single, centrally derived intermediate output channel and two input channels, the random_xcor value may be calculated as 0.333. For three equally spaced intermediate channels and two input channels, the random_xcor value may be calculated as 0.483. While such values have been found to provide satisfactory results, they are not critical; for example, values of about 0.3 and 0.5, respectively, are usable. In other words, for a module with M inputs and N outputs, there is a degree of correlation of the particular M inputs that can be considered to represent equal energy in all N outputs. This can be derived by considering the M inputs as if they had been produced by a passive N-to-M matrix receiving N independent signals of equal energy, although the actual inputs may of course have been derived by other means. This threshold correlation value is "random_xcor", and it may be regarded as the dividing line between two regions of operation.
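This derivation of random_xcor can be checked numerically. The sketch below is an illustration rather than the patent's implementation: it assumes sine/cosine panning coefficients spread over a 90-degree arc (as in the matrix example later in this description) and computes the input cross-correlation produced when every output channel carries an independent, unit-energy signal:

```python
import math

def random_xcor(coeffs):
    """Correlation seen at the two module inputs when every output channel
    carries an independent, equal-energy signal.  `coeffs` is a list of
    (left, right) panning coefficient pairs, one pair per output channel,
    each pair having a sum of squares of 1.0."""
    cross = sum(l * r for l, r in coeffs)   # E[Lt*Rt] for independent unit-energy signals
    e_l = sum(l * l for l, _ in coeffs)     # E[Lt^2]
    e_r = sum(r * r for _, r in coeffs)     # E[Rt^2]
    return cross / math.sqrt(e_l * e_r)

def arc_coeffs(n_outputs):
    """Sine/cosine panning coefficients for n_outputs spread over a 90-degree arc."""
    step = math.pi / 2 / (n_outputs - 1)
    return [(math.cos(i * step), math.sin(i * step)) for i in range(n_outputs)]

# Two inputs, single central intermediate channel (L, C, R):
print(round(random_xcor(arc_coeffs(3)), 3))   # 0.333
# Two inputs, three intermediate channels (L, MidL, C, MidR, R):
print(round(random_xcor(arc_coeffs(5)), 3))   # 0.483
```

The computed values reproduce the 0.333 and 0.483 figures quoted above.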
Then, during processing, if the module's cross-correlation value is greater than or equal to the random_xcor value, it is rescaled to the range 0 to 1.0:

scaled_xcor = (correlation value - random_xcor) / (1 - random_xcor)
The "scaled_xcor" value represents the amount of dominant signal above the uniform diffusion level. All of the remainder may be distributed evenly to the other output channels of the module.
However, there is an additional factor to take into account: as the nominal principal direction of the input signal moves increasingly off-center, either the amount of diffuse energy should be gradually reduced if an equal distribution across all output channels is maintained, or the amount of diffuse energy should be maintained but the energy distributed to each output channel should taper with the "eccentricity" of the dominant energy; in other words, the energy tapers along the output channels. In the latter case, additional processing complexity is required to keep the output power equal to the input power. It should be noted that some references herein to "power" refer, strictly speaking, to "energy"; "power" is the term commonly used in the literature.
On the other hand, if the current correlation value is less than the random_xcor value, the dominant energy is taken to be 0, the uniformly diffused energy is gradually reduced, and the residual signal (all that remains) is allowed to accumulate at the inputs. When the correlation value is 0, there is no internal signal; the independent input signals map directly to the output channels.
The operation of this aspect of the invention may be further explained as follows:
a) When the actual correlation value is greater than random_xcor, there is enough common energy to consider that there is a dominant signal, which will be panned between two adjacent outputs (or, of course, fed to a single output if its direction happens to coincide with that output); the energy attributed to the dominant signal is subtracted from the inputs to obtain a residue that is distributed (preferably uniformly) among all outputs.
b) When the actual correlation value happens to be random _ xcor, the input energy (which can be considered as all residuals) is evenly distributed among all outputs (this is the definition of random _ xcor).
c) When the actual correlation value is less than random_xcor, there is not enough common energy for a dominant signal, so the energy of the inputs is distributed among the outputs in proportions that depend on the degree of correlation. It is as if the correlated portion were treated as a residual to be distributed evenly among all outputs, and the uncorrelated portion as a plurality of dominant signals to be sent to the outputs corresponding to the input directions. In the extreme case of a correlation value of 0, each input feeds only one output position (typically one of the outputs, though it may be a pan position between two outputs).
Thus, there is a continuum from full correlation, in which case a single signal is panned between two outputs according to the relative energies of the inputs; through random_xcor, at which the input energy is distributed evenly among all the outputs; to zero correlation, in which case the M inputs feed M output positions independently.
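The continuum in (a) through (c) might be sketched as follows. The linear taper of diffuse energy below random_xcor is an illustrative assumption, since the text specifies only the endpoints (all diffuse at random_xcor, all residual at zero correlation):

```python
def split_energies(xcor, random_xcor):
    """Split unit input energy into (dominant, diffuse, residual) fractions.

    Above random_xcor: a dominant component emerges and the remainder is
    diffused (case a).  At random_xcor: all energy is diffuse (case b).
    Below random_xcor: no dominant component; the diffuse share tapers
    toward zero while the residual accumulates at the inputs (case c).
    """
    if xcor >= random_xcor:
        dominant = (xcor - random_xcor) / (1.0 - random_xcor)  # scaled_xcor
        return dominant, 1.0 - dominant, 0.0
    diffuse = xcor / random_xcor  # assumed linear taper below random_xcor
    return 0.0, diffuse, 1.0 - diffuse

print(split_energies(1.0, 0.483))    # (1.0, 0.0, 0.0)  fully dominant
print(split_energies(0.483, 0.483))  # (0.0, 1.0, 0.0)  fully diffuse
print(split_energies(0.0, 0.483))    # (0.0, 0.0, 1.0)  fully residual
```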
Compensation of interaction
As described above, a channel converter according to an aspect of the present invention may be considered to involve a grid of "modules". Since multiple modules may share a given input channel, there can be interactions between modules, and unless some compensation is applied, these interactions may degrade performance. Although it is generally not possible to separate the signals at an input according to the module for which they are intended, estimating the amount of signal used by each connected module can improve the resulting correlation and direction estimates, and hence overall performance.
As mentioned above, there are two types of module interaction: interactions involving modules at the same or a lower hierarchical level (i.e., modules with the same number of inputs or fewer), referred to as "neighbors", and interactions involving modules at a higher hierarchical level (having more inputs) that share one or more inputs with a given module, referred to as "high-level neighbors".
Consider first neighbor compensation at a common hierarchical level. To understand the problems caused by neighbor interactions, consider an isolated two-input module with identical signals A at its L/R (left and right) inputs. This corresponds to a single dominant (common) signal midway between the inputs. The common energy is A² and the correlation value is 1.0. Assume a second two-input module having a common signal B at its L/R inputs, common energy B², and likewise a correlation value of 1.0. If the two modules are connected at a common input, the signal at that input will be A + B. Assuming that the signals A and B are independent, the time-averaged product AB will be 0, so that the common energy of the first module will be A(A + B) = A² + AB = A², and the common energy of the second module will be B(A + B) = B² + AB = B². Thus, the common energy is unaffected by neighboring modules as long as they process independent signals, which is usually the correct assumption. If the signals are not independent but are identical, or at least substantially share a common signal component, the system will react in a manner consistent with the response of the human ear: the common input will be louder, pulling the resulting audio image toward the common input. In this case, since the common input has more signal amplitude (A + B) than either outer input, the L/R input amplitude ratio of each module is shifted, which biases the direction estimate toward the common input. The correlation values of the two modules are now also slightly less than 1.0, since the waveforms at each pair of inputs differ. Because the correlation value determines the degree of spreading of the non-dominant signal components and the ratio of dominant (common) to non-dominant (non-common) energy, the uncompensated common input signal spreads the modules' non-common signal distribution.
To compensate, a measure of the "common input level" attributable to each input of each module is estimated, and each module is then informed of the total amount of such common input level energy for all adjacent levels of the same hierarchy level at each module input. Two ways of calculating a measure of the common input level attributable to the inputs of the module are described herein: one way is based on the common energy of the inputs of the module (described in the next paragraph), and the other way is more accurate but requires more computational resources, which is based on the total energy of the internal outputs of the module (described below in connection with the arrangement of fig. 6A).
According to the first way of calculating a measure of the common input level attributable to each input of a module, analysis of the module's input signals does not directly yield the common input level at each input, only the total common energy, which is the geometric mean of the per-input common energy levels. Since the common input energy level at each input cannot exceed the total energy level at that input, which is measured and known, the total common energy is apportioned to give estimated common input levels proportional to the observed input levels, subject to limiting. Once the common input levels have been computed for all modules in the grid (whether the measure is based on the first or the second calculation), each module is informed of the total common input level of all neighboring modules at each of its inputs, an amount referred to as the module's "neighbor level" at that input. The module then subtracts the neighbor levels from the levels at its inputs to obtain compensated input levels, which are used to calculate the correlation and the direction (the nominal principal direction of the input signal).
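A minimal sketch of this first way of apportioning, for a two-input module; the proportional-apportionment rule and the limiting step are an interpretation of the text, not a quoted implementation:

```python
import math

def estimate_common_input_levels(common_energy, input_levels):
    """Apportion a two-input module's measured common energy (the geometric
    mean of the per-input common levels) into per-input estimates.

    The estimates are made proportional to the observed input energies
    (which preserves their geometric mean), then limited so that no
    estimate exceeds the total energy actually present at its input.
    """
    e1, e2 = input_levels
    geo = math.sqrt(e1 * e2)
    estimates = [common_energy * e / geo for e in (e1, e2)]
    return [min(est, e) for est, e in zip(estimates, (e1, e2))]

# Identical signal at both inputs: all of the input energy is common.
print(estimate_common_input_levels(4.0, (4.0, 4.0)))   # [4.0, 4.0]
# Unequal input levels: the estimates follow the observed levels.
print(estimate_common_input_levels(2.0, (4.0, 1.0)))   # [4.0, 1.0]
```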
For the above example, the neighbor levels are initially 0, so the first module claims a common input level of more than A² at the common input (since that input carries more signal than either end input), and the second module likewise claims more than B² at the same input. Since both claims exceed the energy available at that input, they are limited to A² and B², respectively. Since no other modules are connected to this common input, each claimed common input level becomes the neighbor level of the other module. Thus, the compensated input power level seen by the first module is

(A² + B²) - B² = A²

and the compensated input power level seen by the second module is

(A² + B²) - A² = B².
However, these are exactly the levels that would be seen if each module were isolated. Thus, the resulting correlation value will be 1.0 and, at the appropriate magnitudes, the dominant direction will be at the center, as desired. The recovered signals themselves will not be completely isolated, however: the output of the first module will contain some B signal components, and vice versa. This is a limitation of matrix systems, and if the processing is performed on a multiband basis, the mixed signal components will be at similar frequencies, rendering the distinction between them somewhat moot. In more complex cases, the compensation will generally be less exact, but experience with the system indicates that in practice the compensation mitigates most of the effects of adjacent module interaction.
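The worked example can be reproduced numerically. This sketch assumes illustrative energies A² = 4 and B² = 1 for the independent signals A and B:

```python
# Module 1 carries signal A across inputs (L, M); module 2 carries an
# independent signal B across inputs (M, R).  Energies at the shared input:
A2, B2 = 4.0, 1.0
level_M = A2 + B2                       # the shared input carries A + B

# Each module's claimed common input level at M, limited to the energy
# actually present there; each claim is the other module's neighbor level.
neighbor_for_mod2 = min(A2, level_M)    # module 1's claim: 4.0
neighbor_for_mod1 = min(B2, level_M)    # module 2's claim: 1.0

# Subtracting neighbor levels recovers the levels an isolated module would see.
print(level_M - neighbor_for_mod1)      # (A2 + B2) - B2 = A2 = 4.0
print(level_M - neighbor_for_mod2)      # (A2 + B2) - A2 = B2 = 1.0
```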
Given the principles and signals already established for neighbor level compensation, the extension to high-level neighbor compensation is fairly straightforward. It applies where two or more modules at different hierarchical levels share more than one common input channel. For example, a three-input module may share two inputs with a two-input module. A signal component common to all three inputs will also be common to the two inputs of the two-input module, and without compensation would be rendered by the modules at different locations. More generally, there may be one signal component common to all three inputs and a second component common only to the inputs of the two-input module, and their effects must be separated as far as possible for correct rendering of the output sound field. Thus, the three-input common signal, as embodied in the common input levels described above, should be subtracted from the inputs before the two-input calculation can be performed properly. In fact, before the lower-level calculations are made, the high-level common signal component should be subtracted not only from the input levels of the lower-level module but also from its measure of observed common energy. This differs from the effect of common input levels of modules at the same hierarchical level, which do not affect a neighboring module's measure of common energy. Therefore, high-level neighbor levels should be tracked and applied separately from same-level neighbor levels. And while high-level neighbor levels are passed down to hierarchically lower modules, the remaining common levels of the lower-level modules should also be passed up the hierarchy, because, as described above, a lower-level module acts like an ordinary neighbor to a high-level module. Some of these quantities are interdependent and difficult to determine simultaneously.
To avoid resource-intensive simultaneous solution, previously calculated values may be passed to the relevant modules. The potential interdependence of module common input levels at different hierarchical levels may thus be resolved using previous values, or the calculations may be performed in repeated passes (i.e., iteratively) from the highest hierarchical level to the lowest. Alternatively, simultaneous solution of the equations is possible, although it involves non-trivial computational overhead.
Although the interaction compensation techniques described provide only approximately correct values for complex signal distributions, they are believed to be an improvement over grid arrangements that fail to account for module interactions.
Drawings
Fig. 1A is a top plan view schematically illustrating an idealized encoding and/or decoding arrangement in the manner of a test arrangement using a 16-channel horizontal array around the walls of a room, a 6-channel array disposed in a circle above the horizontal array, and a single ceiling (top) channel.
Fig. 1B is a top plan view schematically illustrating an idealized alternative encoding and/or decoding arrangement using a 16-channel horizontal array around the wall of a room, a 6-channel array arranged in a circle above the horizontal array, and a single ceiling (top) channel.
Fig. 2 is a functional block diagram providing an overview of a multiband transform embodiment, with multiple modules operating under a central monitor, implementing the decoding example of fig. 1A.

Fig. 2' is a functional block diagram providing an overview of a multiband transform embodiment, with multiple modules operating under a central monitor, implementing the decoding example of fig. 1B.
Fig. 3 is a functional block diagram useful in understanding the manner in which a monitor, such as the monitor 201 of figs. 2 and 2', may determine an endpoint scaling factor.
Fig. 4A-4C illustrate functional block diagrams of modules according to an aspect of the present invention.
Fig. 5 is a schematic diagram showing a hypothetical arrangement of a three-input module fed by input channels in a triangular relationship, three internal output channels, and a dominant direction. This figure is useful in understanding the distribution of the dominant signal components.
Figs. 6A and 6B are functional block diagrams showing suitable arrangements for producing, respectively, (1) the total estimated energy at a module's inputs in response to the total energy at the inputs and (2) excess endpoint energy scaling factor components for each of the endpoints in response to a measure of the cross-correlation of the input signals.
FIG. 7 is a functional block diagram illustrating preferred functionality of the "sum and/or get larger" block 367 of FIG. 4C.
Fig. 8 is an idealized representation of the manner in which one aspect of the present invention produces the scale factor components in response to a measure of cross-correlation.
Fig. 9A and 9B through fig. 16A and 16B are a series of views showing idealized representations of the output scale factor of a module derived from various examples of input signal conditions.
Detailed Description
To test aspects of the invention, an arrangement was deployed having 5 speakers on each wall of a four-walled room (one at each corner, with three evenly spaced between the corners), for a horizontal array of 16 speakers in total counting the shared corner speakers, plus a ring of 6 speakers at a vertical angle of about 45 degrees above a centrally located listener, plus a single speaker directly overhead (23 speakers in total), plus a subwoofer/LFE (low frequency effects) channel (24 speakers in total), all fed by a personal computer set up for 24-channel playback. Although such a system might by current convention be referred to as a 23.1-channel system, for simplicity it will be referred to herein as a 24-channel system.
Fig. 1A is a top plan view of an idealized decoding arrangement, schematically in the manner of the trial arrangement described above. The figure also represents an idealized encoding arrangement in which the 23.1 source channels are downmixed to 6.1 channels, the 6.1 channels consisting of the standard 5.1 channels (left, center, right, left surround, right surround, and LFE) plus one additional channel (a top channel) beyond commonly used systems.
Returning to the description of fig. 1A as a decoding or upmixing arrangement, five wide-range horizontal input channels are shown as squares 1', 3', 5', 9', and 13' on the outer circle. The vertical or top channel, which may be derived from the five wide-range inputs by correlation or resulting reverberation, or fed separately as a sixth channel (as here and in fig. 2), is shown as dashed square 23' in the center. Twenty-three wide-range output channels are shown as numbered circles 1-23. The outer circle of sixteen output channels is in the horizontal plane, and the inner circle of six output channels is forty-five degrees above the horizontal plane. Output channel 23 is directly above the listener or listeners. Five two-input decoding modules are indicated by brackets 24-28 around the outer circle, each connected between a pair of horizontal input channels. Five additional two-input vertical decoding modules are indicated by brackets 29-33, each connecting the vertical channel with one of the horizontal inputs. Output channel 21, an elevated center rear channel, is derived from the three-input decoding module 34, as indicated by the arrows between output channel 21 and input channels 9, 13, and 23. Thus, the three-input module 34 is one level higher in the hierarchy than its two-input lower-level neighboring modules 27, 32, and 33. In this example, each module is associated with a respective pair or trio of the nearest spatially adjacent input channels. Each module in this example has at least three same-level neighbors; for example, modules 25, 28, and 29 are neighbors of module 24.
Although the decoding modules shown in fig. 1A variously have three, four, or five output channels, a decoding module may have any reasonable number of output channels. An output channel may be located intermediate to two or more input channels or at the same location as an input channel. Thus, in the example of fig. 1A, each of the input channel locations is also an output channel location. Each input channel is shared by two or three decoding modules.
Although the arrangement of fig. 1A uses five two-input modules (24-28) and five inputs (1', 3', 5', 9', and 13') to derive the sixteen horizontal outputs (1-16) representing positions around the four walls of a room, similar results can be obtained with as few as three inputs and three two-input modules, each module sharing one input with each of the others.
By using multiple modules, each having multiple output channels arranged along an arc or line (as in the examples of figs. 1A, 1B, 2, and 2'), the decoding ambiguity encountered in prior art decoders, in which a correlation of less than zero is decoded as indicating a rearward direction, can be avoided.
An alternative to the encoding/decoding arrangement of fig. 1A is described below in conjunction with the description of fig. 1B.
Although the input and output channels may be characterized by their physical locations, or at least their directions, it is useful to characterize them in a matrix, since this provides well-defined signal relationships. Each matrix element (row i, column j) is a transfer function relating input channel i to output channel j. The matrix elements are typically signed multiplicative coefficients, but may also contain phase or delay terms (in principle, any filter) and may be a function of frequency (in discrete-frequency terms, a different matrix at each frequency). The case of dynamic scaling factors applied to the outputs of a fixed matrix is straightforward, but the formulation also accommodates a variable matrix, either by having a separate scaling factor for each matrix element or by using matrix elements more complex than a simple scaling factor, in which the matrix elements themselves are variables (e.g., variable delays).
There is some flexibility in mapping physical locations to matrix elements; in principle, embodiments of aspects of the invention can handle the mapping of input channels to any number of output channels and vice versa, but the most common case is to assume that signals are mapped to the closest output channels by simple scaling factors whose squares sum to 1.0 in order to conserve power. This mapping is often performed with a sine/cosine panning function.
For example, for two input channels and three internal output channels on a line between them, plus two endpoint output channels coincident with the input positions (i.e., an M:N module where M is 2 and N is 5), the line may be taken to represent a 90-degree arc (the range over which a sine or cosine goes from 0 to 1 or from 1 to 0), so that the channels are separated by 90/4 = 22.5-degree intervals, giving channel matrix coefficients of (cos(angle), sin(angle)):
Lout coeffs=cos(0),sin(0)=(1,0)
MidLout coeffs=cos(22.5),sin(22.5)=(.92,.38)
Cout coeffs=cos(45),sin(45)=(.71,.71)
MidRout coeffs=cos(67.5),sin(67.5)=(.38,.92)
Rout coeffs=cos(90),sin(90)=(0,1)
thus, for the case of a matrix with fixed coefficients and variable gain controlled by the scaling factor at each matrix output, the signal output at each of the five output channels is (where "SF" is the scaling factor for the particular output identified by the subscript):
Lout=Lt(SFL)

MidLout=((.92)Lt+(.38)Rt)(SFMidL)

Cout=((.71)Lt+(.71)Rt)(SFC)

MidRout=((.38)Lt+(.92)Rt)(SFMidR)

Rout=Rt(SFR)
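The five output equations can be expressed compactly as code. This sketch is illustrative: it uses exact sine/cosine coefficients rather than the rounded values shown above, and the function and scale-factor names are hypothetical:

```python
import math

ANGLES = {"L": 0.0, "MidL": 22.5, "C": 45.0, "MidR": 67.5, "R": 90.0}

def upmix_2_to_5(lt, rt, sf):
    """Fixed-coefficient 2-in/5-out matrix with a variable gain (scale
    factor) at each output.  `lt` and `rt` are signal values (one sample
    or one band value); `sf` maps output names to scale factors."""
    out = {}
    for name, deg in ANGLES.items():
        th = math.radians(deg)
        out[name] = (math.cos(th) * lt + math.sin(th) * rt) * sf[name]
    return out

unity = {name: 1.0 for name in ANGLES}
outs = upmix_2_to_5(1.0, 1.0, unity)
print(round(outs["C"], 3))   # 0.707 + 0.707 = 1.414
```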
In general, given an array of input channels, the closest inputs can be conceptually connected by straight lines representing potential decoder modules. (They are "potential" in that a module is not needed if no output channel must be derived from it.) For a typical arrangement, any output channel on a line between two input channels can be derived from a two-input module (if the source and input channels are in a common plane, any one source is present in at most two input channels, in which case there is no benefit to using more than two inputs). An output channel at the same location as an input channel is an endpoint channel, possibly of more than one module. An output channel that is not on a line between inputs or co-located with an input (e.g., inside or outside a triangle formed by three input channels) requires a module with more than two inputs.
Decoding modules with more than two inputs are useful when a common signal occupies more than two input channels. This occurs, for example, when the source and input channels are not coplanar: a source channel may map to more than two input channels. It occurs in the example of fig. 1A, which maps 24 channels (16 horizontal ring channels, 6 elevated ring channels, 1 vertical channel, plus LFE) to 6.1 channels (including a composite vertical or top channel). In this case, the center back channel of the elevated ring does not lie on a direct line between two source channels; it lies in the middle of the triangle formed by the Ls (13), Rs (9), and top (23) channels, and a three-input module is therefore required to extract it. One way to map the elevated channels to a horizontal array is to map each of them to more than two input channels. This allows the 24 channels of the example of fig. 1A to be mapped to a conventional 5.1-channel array. In this alternative, several three-input modules may extract the elevated channels, and the remaining signal components may be processed by two-input modules to provide the main horizontal ring of channels. This alternative is described further below in conjunction with figs. 1B and 2'.
In general, it is not necessary to examine every possible combination of input channels for signal commonality. For planar channel arrays (e.g., channels representing a horizontal arrangement), it is generally sufficient to check spatially adjacent channels pairwise. For channels arranged on a dome or sphere, the commonality check can be extended to three or more channels. The use and detection of signal commonality may also serve to convey positional signal information; for example, a vertical or top signal component may be conveyed by mapping it to all five full-range channels of a horizontal five-channel array. This alternative is described further below in conjunction with figs. 1B and 2'.
The decision as to which input channel combinations to analyze for commonality, along with the default input/output mapping matrix, need only be made once for each input/output channel converter or converter function arrangement, when the converter or converter function is configured. This "initial mapping" (performed prior to processing) yields a passive "master" matrix relating the input/output channel configuration to the spatial directions of the channels. The processor or processing portion of the invention may then generate time-varying scaling factors, one for each output channel, that modify the signal levels of what would otherwise be a simple, passive matrix, or that modify the matrix coefficients themselves. The scaling factors are derived from (a) the dominant, (b) the uniformly diffused (fill), and (c) the remaining (endpoint) signal components, as described below.
The master matrix is useful for configuring an arrangement of modules, such as shown in the examples of figs. 1A and 1B and described further below in connection with figs. 2 and 2'. By examining the master matrix it can be determined, for example, how many decoder modules are needed, how they are interconnected, how many inputs and outputs each has, and the matrix coefficients relating the inputs and outputs of each module. These coefficients may be taken from the master matrix; only non-zero values are required, unless an input channel is also an output channel (i.e., an endpoint).
Each module preferably has a "local" matrix, which is the portion of the master matrix applicable to that particular module. In a multi-module arrangement such as the examples of figs. 1A and 2, the modules may use their local matrices either to generate the scaling factors (or matrix coefficients) that control the master matrix, as described below in connection with figs. 2, 2', and 4A-4C, or to generate subsets of the output signals that are combined by a central processor, such as the monitor described in connection with figs. 2 and 2'. In the latter case, such a monitor reconciles multiple versions of the same output signal produced by modules sharing a common output, in a manner similar to the way the monitor 201 of figs. 2 and 2' determines final scaling factors to replace the preliminary scaling factors produced by modules generating preliminary scaling factors for the same output channel.
In the case of multiple modules that generate scaling factors rather than output signals, the modules may obtain matrix information about themselves from the master matrix via the monitor on an ongoing basis, rather than holding local matrices. However, less computational overhead is required if each module has its own local matrix. In the case of a single, isolated module, the module has a local matrix, which is the only matrix required (in effect, the local matrix is the master matrix), and the local matrix is used to generate the output signals.
Embodiments of the present invention having a plurality of modules are described with reference to alternatives for the modules generating the scaling factors, unless otherwise indicated.
Any decoding module output channel having only one non-zero coefficient in the module's local matrix (a coefficient of 1.0, since the squares of the coefficients sum to 1.0) is an endpoint channel. Output channels having more than one non-zero coefficient are internal output channels. Consider a simple example. If output channels O1 and O2 are both derived from input channels I1 and I2 (but with different coefficient values), then a two-input module connected between I1 and I2 is specifically needed to generate outputs O1 and O2. In a more complex case, if there are 5 inputs and 16 outputs, and one decoder module has inputs I1 and I2 and feeds outputs O1 and O2, then:
O1=A I1+B I2+0 I3+0 I4+0 I5

(note that input channels I3, I4, and I5 do not contribute), and

O2=C I1+D I2+0 I3+0 I4+0 I5

(note that input channels I3, I4, and I5 do not contribute).
The decoder module may then have two inputs (I1 and I2), two outputs, and the coefficients relating them, as:
O1=A I1+B I2, and

O2=C I1+D I2.
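Extracting a module's local matrix from the master matrix and classifying its outputs can be sketched as follows; the coefficient values and helper names are illustrative:

```python
def local_matrix(master, in_idx, out_idx):
    """Slice a module's local matrix out of the master matrix, keeping only
    the rows (outputs) and columns (inputs) that the module touches."""
    return [[master[o][i] for i in in_idx] for o in out_idx]

def is_endpoint(row, tol=1e-9):
    """An output row with exactly one non-zero coefficient is an endpoint channel."""
    return sum(1 for c in row if abs(c) > tol) == 1

# Master matrix: rows are outputs O1..O3, columns are inputs I1..I5.
master = [
    [0.92, 0.38, 0, 0, 0],   # O1 = A*I1 + B*I2  -> internal channel
    [0.38, 0.92, 0, 0, 0],   # O2 = C*I1 + D*I2  -> internal channel
    [1.0,  0,    0, 0, 0],   # O3 = I1           -> endpoint channel
]
local = local_matrix(master, in_idx=[0, 1], out_idx=[0, 1, 2])
print([is_endpoint(row) for row in local])   # [False, False, True]
```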
In the case of a single, isolated module, the master matrix or local matrix may have matrix elements providing more than a simple multiplication. For example, as noted above, the matrix elements may include a filter function (e.g., phase or delay terms) and/or filtering that is a function of frequency. One example of applicable filtering is a matrix of pure delays, which can render phantom projected images. In practice, such a master or local matrix may be split into two functions: one applying the derived coefficients to produce the output channels, and a second applying, for example, a filter function.
Fig. 2 is a functional block diagram providing an overview of a multiband transform embodiment implementing the example of fig. 1A. Fig. 2' is a functional block diagram providing an overview of a multiband transform embodiment implementing the example of fig. 1B. It differs from fig. 2 in that several of the modules of fig. 2, namely modules 29-34, receive different sets of inputs (such modules are indicated by the numerals 29''-34''; fig. 2' also has an additional module, module 35''). Apart from the differences in some module inputs, figs. 2 and 2' are identical, and corresponding elements use the same reference numerals. In both figs. 2 and 2', a PCM audio input having a plurality of interleaved audio signal channels is applied to a monitor or supervisory function 201 (hereinafter "monitor 201") that includes a deinterleaver that recovers a separate stream for each of the six audio signal channels (1', 3', 5', 9', 13', and 23') carried by the interleaved input and applies each to a time-domain-to-frequency-domain transform or transform function (hereinafter "forward transform"). Alternatively, the audio channels may be received as separate streams, in which case no deinterleaver is required.
As mentioned above, signal conversion according to the invention can be applied to a wideband signal or to each frequency band of a multiband processor, which may employ either a filter bank (e.g., a discrete critical-band filter bank or a filter bank with a band structure compatible with an associated decoder) or a transform configuration (such as an FFT (fast Fourier transform) or MDCT (modified discrete cosine transform)). Figs. 2, 2', 4A-4C, and other figures are described in the context of a multiband transform configuration.
Not shown in fig. 1A, 1B, 2, 2', and other figures, for simplicity, are the optional LFE input channel (a potential seventh input channel in fig. 1A and 2, and a potential sixth input channel in fig. 1B and 2') and output channel (a potential 24th output channel in fig. 1A and 2). The LFE channel can be processed in generally the same way as the other input and output channels, but with its own scaling factor fixed at "1" and its own matrix coefficient also fixed at "1". In the case where the source channels do not include an LFE but the output channels do (e.g., a 2.5:1 upmix), the LFE channel may be derived by applying a low-pass filter (e.g., a fifth-order Butterworth filter with a corner frequency of 120 Hz) to the sum of the channels, or, to avoid cancellation when the channels are summed, a phase-dependent sum of the channels may be employed. In the case where the input has an LFE channel but the output does not, the LFE channel may be added to one or more of the output channels.
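The LFE derivation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the filter order (5) and corner frequency (120 Hz) come from the text, while the plain channel sum and the function name are assumptions.

```python
import numpy as np
from scipy.signal import butter, lfilter

def derive_lfe(channels, fs=48000, corner_hz=120.0, order=5):
    """Derive an LFE channel for a source that lacks one by low-pass
    filtering the sum of the input channels with a fifth-order
    Butterworth filter (corner frequency 120 Hz, per the text).

    channels: list of equal-length 1-D numpy arrays, one per channel.
    """
    mix = np.sum(channels, axis=0)          # plain sum (an assumption;
                                            # the text also mentions a
                                            # phase-dependent sum)
    b, a = butter(order, corner_hz, btype="low", fs=fs)
    return lfilter(b, a, mix)
```

A quick spectral check confirms the behavior: low-frequency content passes while content well above the corner is strongly attenuated.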
Continuing with the description of fig. 2 and 2', modules 24-34 (24-28 and 29'-35' in fig. 2') receive the appropriate ones of the six inputs 1', 3', 5', 9', 13', and 23' in the manner shown in fig. 1A and 1B. Each module generates a preliminary scaling factor ("PSF") output for each of its associated audio output channels, as shown in fig. 1A and 1B. Thus, for example, module 24 receives inputs 1' and 3' and produces preliminary scale factor outputs PSF1, PSF2, and PSF3. Alternatively, as described above, each module may generate a preliminary set of audio outputs for each of its associated audio output channels. Each module may also communicate with the monitor 201, as described further below. The information sent from the monitor 201 to the various modules may include neighbor level information as well as high-level neighbor level information, if any. The information sent from the modules to the monitor may include the total estimated energy attributable to the internal outputs for each of the module's inputs. The modules may be considered part of the control signal generation portion of the overall system of fig. 2 and 2'.
A monitor such as monitor 201 of fig. 2 and 2' may perform a number of different functions. The monitor may, for example, determine whether more than one module is in use and, if not, the monitor need not perform any functions related to the neighbor tier. During initialization, the monitor may inform the or each module of the number of inputs and outputs it has, the matrix coefficients relating them, and the sampling rate of the signal. As already mentioned, blocks of interleaved PCM samples may be read and de-interleaved into separate channels. The monitor may apply de-limiting in the time domain, for example in response to additional information indicating that the magnitude of the source signal was limited and the degree of limiting. If the system is operating in a multi-band mode, it may apply windowing and filter banks (e.g., FFT, MDCT, etc.) to each channel (so that multiple modules do not perform redundant transforms that would significantly increase processing overhead) and pass streams of transformed values to each module for processing. Each module passes back to the monitor either a two-dimensional array of scaling factors, one scaling factor for all transform segments in each sub-band of each output channel (when in a multi-band transform configuration; otherwise, one scaling factor per output channel), or, alternatively, a two-dimensional array of output signals, the ensemble of complex transform segments for each sub-band of each output channel (when in a multi-band transform configuration; otherwise, one output signal per output channel). The monitor may smooth the scaling factors and apply them to the signal path matrixing (matrix 203, described below) to obtain (in a multi-band transform configuration) the output channel complex spectra.
Alternatively, when the modules produce output signals, the monitor can derive the output channels (in a multi-band transform configuration, the output channel complex spectra) directly, the local matrices having already produced the output signals. The inverse transform plus windowing and overlap-add (in the case of the MDCT) may then be performed for each output channel, and the output samples interleaved to form a composite multi-channel output stream (or alternatively, the interleaving may be omitted in order to provide multiple output streams) and sent to an output file, sound card, or other final destination.
While various functions may be performed by one monitor or by multiple monitors, as described herein, one of ordinary skill in the art will recognize that different ones or all of these functions may be performed in the modules themselves rather than by a monitor common to all or some of the modules. For example, if only a single, isolated module is present, no distinction is required between the module function and the monitor function. Although in the case of multiple modules, the common monitor may reduce the total processing power required by eliminating or reducing redundant processing tasks, the elimination of the common monitor or simplification thereof allows the modules to be easily added to each other, for example to upgrade to more input channels.
Returning to the description of fig. 2 and 2', the six inputs 1', 3', 5', 9', 13', and 23' are also applied to a variable matrix or variable matrixing function 203 (hereinafter "matrix 203"). The matrix 203 may be considered to be part of the signal path of the system of fig. 2 and 2'. Matrix 203 also receives as input from monitor 201 a set of final scaling factors SF1 through SF23, one for each of the 23 output channels of the fig. 1A and 1B example. The final scaling factors may be considered the output of the control signal section of the system of fig. 2 and 2'. As described further below, monitor 201 preferably passes the preliminary scaling factor for each "internal" output channel as the final scaling factor to the matrix, but the monitor determines the final scaling factor for each endpoint output channel in response to information it receives from the modules. The "internal" output channels are those intermediate two or more "endpoint" output channels of each module. Alternatively, if the modules produce output signals instead of scaling factors, matrix 203 is not required; the monitor itself generates the output signals.
In the example of fig. 1A and 1B, it is assumed that the endpoint output channels coincide with the input channel locations, but as described elsewhere, they need not coincide. Thus, output channels 2, 4, 6-8, 10-12, 14-16, 17, 18, 19, 20, 21, and 22 are internal output channels. Internal output channel 21 is intermediate to (or bracketed by) three input channels (input channels 9', 13', and 23'), while the other internal channels are intermediate to (or bracketed by) two input channels. Since there are multiple preliminary scaling factors for the endpoint output channels (i.e., output channels 1, 3, 5, 9, 13, and 23), which are shared between or among the modules, the monitor 201 determines the final endpoint scaling factors (SF1, SF3, etc.). The final internal output scaling factors (SF2, SF4, SF6, etc.) are the same as the preliminary scaling factors.
A drawback of the arrangement of fig. 1A and 2 is that the multiple input source channels are mapped to 6.1 channels (5.1 channels plus a top elevated channel), yielding a downmix that is incompatible with existing 5.1 channel horizontal planar array systems, such as those used in Dolby Digital film soundtracks or on DVDs ("Dolby" and "Dolby Digital" are trademarks of Dolby Laboratories Licensing Corporation).
As described above, one way to map elevated channels to a horizontal planar array is to map each of them to more than two input channels. For example, the 24 original source channels at the positions of fig. 1B may be mapped to a conventional 5.1 channel array (see Table A below, where reference numerals 1 to 23 refer to directions in fig. 1B). In such a variation, multiple modules with two or more inputs (not shown in fig. 1B) may extract channels that are in-plane (outside or inside the listening area established by the standard 5.1 channel array) or out-of-plane (above the plane, "raised", or below it, "lowered"), and the remaining signal components may be processed by two-input modules to extract the horizontal channels. The "distance-varying" channels may be fed to actual speakers placed inside the room to provide a variable-distance presentation, or may be projected into the interior or exterior of the listening space as virtual interior or exterior channels. The vertical or top signal component may be presented by, for example, mapping to all five channels of a horizontal five-channel array. Thus, a 5.1 channel downmix may be played by a conventional 5.1 channel decoder, while a decoder according to the example of fig. 1B and 2' may restore an approximation of the original 24 channels or some other desired output channel configuration.
Thus, according to an alternative to the examples of fig. 1B and 2' and as shown in table a, each standard horizontal source channel is mapped to one or two downmix channels of a 5.1 channel downmix, while the other source channels are each mapped to more than two channels of a 5.1 channel downmix. Thus, for the 23.1 channel source arrangement of the fig. 1A and 1B example, the individual channels can be mapped as follows:
TABLE A
In Table A, Lf is the left front, Cf is the center front, Rf is the right front, Ls is the left surround, Rs is the right surround, Lf-E is the elevated left front, Cf-E is the elevated center front, Rf-E is the elevated right front, Rs-E is the elevated right surround, Cs-E is the elevated center surround, and Top-E is the elevated top. The weighting factors (matrix coefficients) may be equal within each group, or they may be selected individually. For example, each source channel mapped to three output channels may be mapped to the middle listed channel at twice the power of the two outer listed channels: the elevated Lf may be mapped to Lf and Ls with a matrix coefficient of 0.5 (power 0.25) and to Cf with a coefficient of 0.7071 (power 0.5). The mapping to four or five output channels may be performed with equal matrix coefficients. Following common matrixing practice, the set of matrix coefficients for each source channel may be chosen so that the sum of their squares is 1.0.
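The power-preserving normalization described above can be checked numerically. This is a small illustrative sketch; the function name is an assumption, and the coefficient values are the elevated-Lf example from the text.

```python
import math

def normalize_downmix_coeffs(raw):
    """Scale a set of matrix coefficients so that the sum of their
    squares is 1.0, following the common matrixing practice above."""
    norm = math.sqrt(sum(c * c for c in raw))
    return [c / norm for c in raw]

# Elevated-Lf example from the text: 0.5 to each outer channel
# (power 0.25) and 0.7071 to the middle channel (power 0.5)
elevated_lf = [0.5, 0.7071, 0.5]
power_sum = sum(c * c for c in elevated_lf)   # close to 1.0
```

For equal mappings to four or five output channels, normalizing a vector of equal coefficients yields the required per-channel values automatically.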
Alternatively, a more elaborate down-mix arrangement including a dynamic power conserving down-mix based on source channel cross-correlation may be provided and is within the scope of the present invention.
It should be noted that in the example of fig. 1A, the downmix of 23.1 to 6.1 channels involves mapping all but one of the source channels to only two downmix channels. In that arrangement, only the elevated Cs (Cs-E) channel is mapped to three downmix channels (Ls + Rs + Top).
In order to extract channels that have been mapped to multiple downmix channels, it is necessary to identify the amount of common signal elements in two or more downmix channels. A common technique for this operation (even in applications other than upmixing) is cross-correlation. As mentioned above, the measure of cross-correlation is preferably a measure of zero time offset cross-correlation, which is the ratio of the geometric mean of the common power level and the input signal power level. The common power level is preferably a smoothed or averaged common power level and the input signal level is a smoothed or averaged input signal power level. In this context, the cross-correlation of the two signals S1 and S2 may be expressed as:
Xcor=|S1*S2|/Sqrt(|S1*S1|*|S2*S2|),
where the vertical lines indicate averaged or smoothed values. The correlation of three or more signals is more complex, but a technique for calculating the cross-correlation of three or more signals is described below under the heading "High order computation of common energy". For a downmix to 5.1 channels, Table A shows that a source channel may map to up to 5 downmix channels, so that cross-correlation values would need to be derived from a like number of channels, i.e., up to 5th-order cross-correlation.
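The Xcor equation above can be sketched as follows. A plain arithmetic mean stands in for the running smoothing that the vertical bars denote in the text; the function name and the use of a block mean are assumptions.

```python
import numpy as np

def xcor(s1, s2):
    """Zero-time-offset cross-correlation per the Xcor equation:
    the averaged common power divided by the geometric mean of the
    averaged input powers."""
    common = np.mean(s1 * s2)
    return common / np.sqrt(np.mean(s1 * s1) * np.mean(s2 * s2))
```

For identical signals the measure is 1; for independent signals it approaches 0 as the averaging window grows.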
Rather than attempting a potentially computationally intensive exact solution, the approximate cross-correlation technique according to one aspect of the present invention uses only second-order cross-correlations, as described by the Xcor equation above.
The approximate cross-correlation technique involves calculating the common power (defined as the numerator of the Xcor equation above) for each pair of nodes involved. For a 3rd-order correlation of signals S1, S2, and S3, these are |S1*S2|, |S2*S3|, and |S1*S3|. For a 4th-order correlation, the common power terms are |S1*S2|, |S1*S3|, |S1*S4|, |S2*S3|, |S2*S4|, and |S3*S4|. Similarly, the 5th-order case requires a total of ten such terms. Decoding the horizontal channels already requires many of these cross-power calculations (in practice, 5 for upmixing from 5.1), so a total of ten smoothed cross products suffices for correlations up to 5th order: 5 of them have already been calculated, and the other 5 are needed for the 5th-order calculation. The same 10 pairwise calculations also cover all 4th-order correlations.
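The enumeration of pairwise terms above follows directly from combinatorics: an Nth-order correlation needs N·(N-1)/2 pairwise cross-power terms. A small sketch (the function name is an assumption):

```python
from itertools import combinations

def pairwise_common_power_terms(names):
    """List the pairwise smoothed cross-power terms |Si*Sj| needed to
    approximate an Nth-order correlation using only 2nd-order ones."""
    return [f"|{a}*{b}|" for a, b in combinations(names, 2)]
```

Three inputs need 3 terms, four need 6, and five need the full set of 10 mentioned in the text.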
If any pairwise cross-power value is 0, there is no common signal between the two nodes in question, hence no signal common to all N (N = 3, 4, or 5) nodes, and the output of the output channel in question is therefore zero. Otherwise, if none of the pairwise cross-power values is 0, the amount of common signal indicated by the cross-power value of two nodes Node(i) and Node(j) may be calculated by assuming that the observed cross power arises from a signal common to all of the nodes under consideration. If the source channel amplitude is A, the amplitudes at nodes Node(i) and Node(j) are given by the corresponding downmix matrix coefficients Mi and Mj as A*Mi and A*Mj. Thus, the common power between these nodes is X = |Si*Sj| = A^2*Mi*Mj, and the estimate of the desired output amplitude from the cross power of a pair of nodes i and j is:
A(estimated) = Sqrt(X/(Mi*Mj))
Considering the estimates of A for all pairs of nodes associated with a given output channel, the actual value of A can be no greater than the minimum estimate. If the node pair corresponding to the minimum estimate is not shared with other outputs, the minimum estimate is taken as the value of A.
If there are other output channels mapped to the two nodes in question, there is not enough information (in this technique) to distinguish them, so an equal signal distribution is assumed among the output channel in question and all other output channels mapped to those two nodes.
To address this problem, a matrix that may be referred to as a "transfer matrix", a square matrix derived from the original encoding (downmix) matrix and relating input node i to input node j, may be computed during program initialization, where the value of the transfer matrix at row i, column j is equal to the sum of the cross products of the coefficients of all encoded source channels mapped in common to those nodes. For example, assume that encoded source channel 1 is mapped to downmix channels 1 and 2 with matrix values (.7071, .7071), and that source channel 17 is mapped to downmix channels 1, 2, and 3 with matrix values of .577 each (note that .577 x .577 = .3333, so that the sum of the squares of the matrix values is 1.0, as desired). The transfer matrix at element (1, 2) is then (.7071 x .7071) + (.577 x .577) = .5 + .33 = 0.83. Thus, each element of the transfer matrix is a measure of the total output power derived from the node pair. If a common power A^2 is found between nodes 1 and 2 when deriving the output level of channel 17, then the amount of A^2 that can be allocated to output channel 17 is:
Output power = A^2*(.577*.577)/0.83 = 0.4*A^2
From the ratio of the estimated output amplitude to the amplitude at the input node, the final scaling factor for the output channel in question can be derived.
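The transfer-matrix construction and the power allocation above can be sketched numerically. The matrix layout (source channels as rows, nodes as columns) and the 0-based node indices are assumptions made for the illustration; the coefficient values are the worked example from the text.

```python
import numpy as np

def transfer_matrix(enc):
    """Transfer matrix relating input nodes: element (i, j) is the sum,
    over all encoded source channels, of the products of the
    coefficients mapping each source channel to nodes i and j.
    enc has shape (num_source_channels, num_nodes)."""
    return enc.T @ enc

# Worked example from the text (0-based node indices):
# source channel 1 -> nodes 0 and 1 with coefficients .7071 each;
# source channel 17 -> nodes 0, 1, and 2 with coefficients .577 each.
enc = np.array([
    [0.7071, 0.7071, 0.0],
    [0.577,  0.577,  0.577],
])
T = transfer_matrix(enc)

def allocated_output_power(common_power, mi, mj, t_ij):
    """Portion of the common power A^2 observed at a node pair that is
    attributable to an output channel with coefficients mi, mj there."""
    return common_power * (mi * mj) / t_ij
```

With these numbers, T[0, 1] is approximately 0.83, and channel 17's share of a unit common power at that node pair comes out near 0.4, matching the text.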
As explained elsewhere in this document, the derivation of the output levels may be performed in hierarchical order, starting with the output channels derived from the largest number of channels (5 in the example of fig. 1B), then the output channels derived from 4 channels, and so on.
After the output level of a given node is calculated, the power contribution of each code channel to the output is subtracted from the power level associated with the given node before proceeding with the next node output calculation.
One drawback of the cross-correlation approximation technique is that more signal may be fed to the output channels than was originally present. However, assuming that the local array of output channels has the correct total power, the audible consequences of feeding extra signal into an output channel derived from three or more encoded inputs are minor: the contributing channels are in close proximity to the derived output channel, and the human ear will have difficulty distinguishing the additional signal. If the encoded 5.1 channel program is played without decoding, the channels that were mapped to three or more of the 5.1 channels will be reproduced from the corresponding 5.1 channel speaker array and heard by the listener as a slightly widened source, which should not be objectionable.
Blind upmixing
The decoding process just described may optionally be fed from any existing 5.1 channel source, even one not specifically encoded as just described. Such decoding may be referred to as "blind upmixing". Such an arrangement is expected to produce interesting, perceptually pleasing results when it makes reasonable use of the derived output channels. Unfortunately, commercial 5.1 channel movie soundtracks commonly have few common signal elements between pairs of channels, and fewer still among combinations of three or more channels. In such cases, the upmixer just described produces very little output on any of the derived output channels, which is undesirable. For these cases, a blind upmix mode may be provided in which the input channel signals are modified or augmented such that, when at least one of the input channels from which an output channel is derived has signal input, at least some signal output is provided in the derived output channel.
According to aspects of the invention, unmodified decoding requires:
(a) correlation among all of the input channels from which an output channel is derived, and
(b) a significant signal level at each of the input channels from which the output channel is derived.
If there is a low pair-wise correlation among the input channels involved, or a low signal level at any of the input channels involved, then the derived channel gets little or no signal. Each contributing input channel essentially has a veto over whether the derived channel gets a signal.
In order to perform blind upmixing of channels that have not been encoded in the manner described herein, the channels may be derived in such a way that some signal is present under signal conditions in which the derived signal would otherwise be zero. This may be achieved, for example, by relaxing both of the above conditions. With respect to condition (a), a lower limit may be placed on the correlation value; for example, the limit may be a minimum based on the "randomly equally distributed" correlation value described elsewhere herein. Then, to satisfy condition (b), a weighted average of the signal powers of the input channels from which the output channel is derived may simply be used, where the weights may be the matrix coefficients of the input channels. The use of this particular weighting technique is not critical; other ways of ensuring that a derived channel has some signal whenever any of the input channels from which it is derived has some signal may be employed.
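The relaxation of the two conditions can be sketched as follows. This is only an illustrative model of the idea, not the patent's formula: the floor value 0.2, the function name, and the way the floored correlation multiplies the weighted power are all assumptions.

```python
import numpy as np

def blind_upmix_level(pair_corr, input_powers, coeffs, corr_floor=0.2):
    """Blind-upmix level estimate for a derived channel: the pairwise
    correlation is floored (so low correlation no longer vetoes the
    output, relaxing condition (a)), and condition (b) is replaced by
    a coefficient-weighted average of the input powers."""
    c = max(pair_corr, corr_floor)
    w = np.asarray(coeffs, dtype=float)
    p = np.asarray(input_powers, dtype=float)
    weighted_power = np.dot(w, p) / np.sum(w)
    return c * weighted_power
```

Unlike the unmodified decoder, this yields a nonzero derived level whenever the contributing inputs carry power, even at zero measured correlation.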
Fig. 3 is a functional block diagram that may be used to understand the manner in which a monitor, such as monitor 201 of fig. 2 and 2', may determine an endpoint scaling factor. The monitor does not sum all outputs of the modules sharing an input to obtain the endpoint scaling factor. Instead, it additively combines, e.g., in combiner 301, the total estimated internal energy attributed to an input (such as input 9', which is shared by modules 26 and 27 of fig. 2 and 2') by each module sharing that input. This sum represents the total energy level at the input claimed by the internal outputs of all connected modules. This sum is then subtracted, e.g., in combiner 303, from the smoothed input energy level at the input (e.g., the output of smoother 325 or 327 of fig. 4B, described below) of any of the modules sharing the input (in this example, module 26 or module 27). It is sufficient to select the smoothed input of any of the modules at a common input, even though the levels may differ slightly between modules because the modules adjust their time constants independently of one another. The difference at the output of combiner 303 is the desired output signal energy level at the input, which is not allowed to fall below zero. The final scaling factor for the output (in this example, SF9) is obtained by dividing the desired output signal level by the smoothed input level in divider 305 and performing a square-root calculation in block 307. It should be noted that the monitor derives a single final scaling factor for each such shared input, regardless of how many modules share the input. An arrangement for determining the total estimated energy attributable to the internal outputs for each of a module's inputs is described below in connection with fig. 6A.
Since the levels are energy levels (second-order quantities) as opposed to magnitudes (first-order quantities), a square-root operation is applied after the division in order to obtain the final scaling factor (scaling factors relate to first-order quantities). The addition of internal levels and the subtraction from the total input level are both done in a pure energy sense, since the internal outputs of the different modules are assumed to be independent (uncorrelated). If this assumption is incorrect in exceptional circumstances, the calculation may yield more residual signal at the input than the input should have, which may cause minor spatial distortion in the reproduced sound field (e.g., a slight pulling of other nearby internal images toward the input); but the human ear may well behave similarly in the same situation. Internal output channel scaling factors (such as PSF6 through PSF8 of module 26) are passed by the monitor as final scaling factors, unmodified. For simplicity, fig. 3 shows the generation of only one of the endpoint final scaling factors. Other endpoint final scaling factors may be derived in a similar manner.
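The fig. 3 derivation of an endpoint scaling factor reduces to a few arithmetic steps. A minimal sketch, assuming the energies are already smoothed as described (the function name and the zero-input guard are assumptions):

```python
import math

def endpoint_scaling_factor(input_energy, internal_energies):
    """Endpoint final scaling factor per the fig. 3 description: sum
    the internal energies claimed against this input by all modules
    sharing it, subtract from the smoothed input energy (not allowed
    to fall below zero), divide by the input energy, and take the
    square root (energy is second order; scaling factors are first
    order)."""
    if input_energy <= 0.0:
        return 0.0
    residual = max(0.0, input_energy - sum(internal_energies))
    return math.sqrt(residual / input_energy)
```

If the internal outputs claim all of the input energy (or more), the endpoint factor clamps to zero rather than going negative.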
Returning to the description of fig. 2 and 2', as described above, in variable matrix 203, the variability may be complex (all coefficients are variable) or simple (coefficients are changed in groups, such as applied to the inputs or outputs of a fixed matrix). Although either method may be employed to produce substantially the same result, a simpler method, namely a fixed matrix followed by a variable gain for each output (the gain of each output being controlled by a scaling factor), has been found to produce satisfactory results and is employed in the embodiments described herein. Although a variable matrix in which each matrix coefficient is variable is available, it has the following disadvantages: have more variables and require more computational power.
Monitor 201 also performs optional temporal smoothing of the final scaling factors before they are applied to variable matrix 203. In a variable matrix system, the output channels are never "closed"; the coefficients are arranged to emphasize some signals and cancel others. A fixed-matrix, variable-gain system, however, opens and closes channels, as described in embodiments of the present invention, and is more susceptible to undesirable "chattering" artifacts. This may occur despite the two-stage smoothing described below (e.g., smoothers 319/325, etc.). For example, when a scaling factor is close to zero, transitions to and from 0 may result in audible judder, since only a small change is required from "small" to "none" and vice versa.
The optional smoothing performed by monitor 201 preferably smoothes the output scaling factors with a variable time constant that depends on the magnitude of the absolute difference ("abs-diff") between the newly derived instantaneous scaling factor value and the running value of the smoothed scaling factor. For example, if abs-diff is greater than 0.4 (and, of course, <= 1.0), little or no smoothing is applied; for abs-diff between 0.2 and 0.4, a small amount of smoothing is applied; and for values below 0.2, the time constant is a continuous inverse function of abs-diff. Although these values are not critical, they have been found to reduce audible judder artifacts. Optionally, in a multi-band version of a module, the scale-factor smoother time constants may also scale with frequency as well as time, in the manner of the frequency smoothers 413, 415, and 417 of fig. 4A, described below.
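One update step of such a variable-time-constant smoother can be sketched as follows. The 0.4 and 0.2 breakpoints follow the text; the specific smoothing coefficients below each breakpoint are illustrative assumptions, not the patent's values.

```python
def smooth_scale_factor(prev, target):
    """One step of the variable-time-constant scale-factor smoother:
    the smoothing coefficient depends on abs-diff between the new
    instantaneous value and the running smoothed value."""
    abs_diff = abs(target - prev)
    if abs_diff > 0.4:
        alpha = 1.0                      # little or no smoothing
    elif abs_diff > 0.2:
        alpha = 0.5                      # a small amount of smoothing
    else:
        # time constant as a continuous inverse function of abs-diff:
        # heavier smoothing as the change gets smaller
        alpha = max(0.02, abs_diff / 0.4)
    return prev + alpha * (target - prev)
```

Large jumps pass through almost immediately (avoiding sluggish response), while small changes near zero are heavily smoothed (avoiding judder).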
As described above, the variable matrix 203 is preferably a fixed coding matrix with variable scaling factors (gains) at the matrix outputs. Each matrix output channel may use as its (fixed) matrix coefficients the encoding downmix coefficients for that channel, as if there were an encoder with discrete inputs (the source channels may instead be mixed directly by the downmix matrix, avoiding the need for a discrete encoder). The squares of the coefficients for each output channel preferably sum to 1.0. Once it is known where the output channels are (as discussed above with respect to the "master" matrix), the matrix coefficients are fixed, while the scaling factor that controls the output gain of each channel is dynamic.
As explained below, after the initial energies and the common energies are calculated at the segment level, the inputs comprising the frequency-domain transform segments applied to modules 24-34 of fig. 2 (24-28 and 29'-35' of fig. 2') may be grouped by each module into frequency-domain sub-bands. Thus, for each frequency sub-band there is one preliminary scaling factor (PSF in fig. 2 and 2') and one final scaling factor (SF in fig. 2 and 2'). The frequency-domain output channels 1-23 produced by matrix 203 each comprise a set of transform segments (the transform segments within a sub-band are all processed by the same scaling factor). The sets of frequency-domain transform segments are converted into a set of PCM output channels 1-23, respectively, by a frequency-domain-to-time-domain transform or transform function 205 (hereinafter "inverse transform"), which may be a function of the monitor 201 but is shown separately for clarity. The monitor 201 may interleave the resulting PCM channels 1-23 to provide a single interleaved PCM output stream, or retain the PCM output channels as separate streams.
Fig. 4A-4C illustrate functional block diagrams of a module according to an aspect of the present invention. The module receives two or more input signal streams from a monitor, such as monitor 201 of fig. 2 and 2'. Each input comprises an ensemble of complex-valued frequency-domain transform segments. Each of the inputs 1 to m is applied to a function or device (e.g., function or device 401 for input 1, and function or device 403 for input m) that computes the energy of the segments, which is the sum of the squares of the real and imaginary values of the transform segments (only the paths of the two inputs 1 and m are shown, to simplify the figure). The inputs are also applied to a function or device 405 that calculates the common energy across segments of the module's input channels. In the case of an FFT embodiment, this may be computed by taking the cross product of the input samples (e.g., in the case of two inputs L and R, the real part of the complex product of the complex L segment value and the complex conjugate of the complex R segment value). Embodiments using only real values require multiplication of the real input values. For more than two inputs, a special cross-multiplication technique described below may be employed: if all signs are the same, the product is given a positive sign; otherwise it is given a negative sign and scaled by the ratio of the number of possible same-sign results (always two: all positive or all negative) to the number of possible different-sign results.
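The per-segment energy and two-input common-energy computations just described can be sketched directly. The function names are assumptions; the math follows the FFT case in the text.

```python
import numpy as np

def segment_energy(bins):
    """Energy of complex transform segments: sum of squares of the
    real and imaginary parts."""
    return bins.real ** 2 + bins.imag ** 2

def pair_common_energy(l_bins, r_bins):
    """Common energy across two inputs (FFT case): real part of the
    complex product of the L segment values and the complex conjugate
    of the R segment values."""
    return (l_bins * np.conj(r_bins)).real
```

For identical inputs the common energy equals the segment energy; for bins in phase quadrature it is zero.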
Pairwise computation of common energies
For example, assume that the input channel pair A/B contains a common signal X along with individual, uncorrelated signals Y and Z:
A=0.707X+Y
B=0.707X+Z
wherein the scaling factor 0.707 provides a power-preserving mapping to the nearest input channels.
Since X and Y are uncorrelated with each other,
|A*A| = 0.5|X*X| + |Y*Y|
That is, the total energy in input channel A is the sum of the energies of signals X and Y, since X and Y are uncorrelated.
Similarly,
|B*B| = 0.5|X*X| + |Z*Z|
Since X, Y, and Z are uncorrelated, the averaged cross product of A and B is:
|A*B| = 0.5|X*X|
Thus, in the case where the output signal is shared equally by two adjacent input channels, which may also contain individual, uncorrelated signals, the averaged cross product of the signals is equal to the energy of the common signal component in each channel. If the common signal is not shared equally, i.e., it is biased toward one input, the averaged cross product will be the geometric mean of the energies of the common components in A and B, from which the individual channel common-energy estimates can be derived by normalization with the square root of the ratio of the channel amplitudes. The actual time average is calculated at the subsequent smoothing stage, as described below.
High order computation of common energy
The foregoing provides a technique for approximating the common energy of decoding modules having three or more inputs. Another technique for deriving the common energy of decoding modules having three or more inputs is provided here: forming an averaged cross product of all of the input signals. Processing the inputs pairwise, by contrast, makes it difficult to distinguish output signals common to separate pairs of inputs from a signal common to all of the inputs.
For example, consider three input channels A, B and C consisting of uncorrelated signal W, Y, Z and common signal X:
A=X+W
B=X+Y
C=X+Z
if the averaged cross-product is computed, then all terms comprising the combination of W, Y and Z cancel out, as in the second order computation, leaving X behind3Average value of (d):
unfortunately, if X is a zero mean time function, then its cube has a mean of 0, as desired. Different from for X2(which is positive for any non-zero value of X) is averaged, X3Is the same sign as X, so that the positive and negative contributions will tend to cancel. It is clear that this is true for any odd power of X corresponding to an odd number of module inputs, but even exponents greater than 2 will also lead to erroneous results; for example, four inputs (X, X, -X, -X) with components will have the same product/mean as (X, X, X, X).
This problem can be solved by employing a variant of the averaged-product technique. Before averaging, the sign of each product is discarded by taking its absolute value, and the signs of the terms of the product are examined. If they are all the same, the absolute value of the product is applied to the averager. If any sign differs from the others, the negative of the absolute value of the product is averaged. Since the number of possible same-sign combinations may differ from the number of possible different-sign combinations, a weighting factor consisting of the ratio of same-sign combinations to different-sign combinations is applied to the negative absolute-value products to compensate. For example, a three-input module has two ways of having all signs the same out of eight possibilities, leaving six ways of having the signs differ, resulting in a scaling factor of 2/6 = 1/3. This compensation causes the integrated or summed product to grow in a positive direction if and only if there is a signal component common to all inputs of the decoding module.
However, for the averages of modules of different orders to be comparable, they must have the same dimensionality. Conventional second-order correlation involves a two-input multiplication and thus averages a quantity with the dimensions of energy or power. The terms averaged in higher-order correlations must therefore be modified to also have the dimensions of power: for a correlation of order k, the absolute value of each product is raised to the power 2/k before being averaged.
Of course, regardless of order, the individual input energies of the module may, if desired, be calculated directly as the average of the squares of the respective input signals, without first being raised to the k-th power and then reduced back to a second-order quantity.
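As a sketch, the sign-checked averaged-product technique described above (absolute products, a same-sign/different-sign weighting of 2/(2^k - 2), and the 2/k power that restores the dimensions of energy) might be implemented as follows; the function name and the block-of-samples interface are illustrative assumptions, not part of the original:

```python
def common_energy(blocks):
    """Estimate the energy common to k input channels (k >= 3).

    'blocks' is a list of k equal-length sample sequences.  At each
    sample time the k-way product is formed; if all factors share the
    same sign, the absolute product is accumulated positively, otherwise
    negatively, scaled by the same-sign to different-sign combination
    ratio 2 / (2**k - 2).  Each term is first raised to the 2/k power
    so the average has the dimensionality of energy.
    """
    k = len(blocks)
    weight = 2.0 / (2 ** k - 2)        # e.g. 2/6 = 1/3 for k = 3
    total = 0.0
    n = len(blocks[0])
    for samples in zip(*blocks):
        prod = 1.0
        for s in samples:
            prod *= s
        term = abs(prod) ** (2.0 / k)  # restore energy dimensionality
        if all(s >= 0 for s in samples) or all(s <= 0 for s in samples):
            total += term
        else:
            total -= weight * term
    return total / n
```

With three identical inputs the estimate reduces to the average signal energy; with mostly uncorrelated inputs the same-sign and different-sign contributions tend to cancel.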
Returning to the description of fig. 4A, the transformed segment outputs for each block may be grouped into subbands by functions or devices 407, 409, and 411. For example, the subbands may approximate the critical bands of the human ear. The remainder of the embodiments of the modules of figs. 4A-4C operate separately and independently on each such frequency band. To simplify the drawing, operation on only one subband is shown.
The subbands from blocks 407, 409 and 411 are applied to frequency smoothers or frequency smoothing functions 413, 415 and 417, respectively (hereinafter "frequency smoothers"). The purpose of the frequency smoothers is explained below. The frequency-smoothed subbands from the frequency smoothers are applied, respectively, to optional "fast" smoothers or smoothing functions 419, 421, and 423 (hereinafter "fast smoothers"), which provide temporal smoothing. Although preferred, the fast smoothers may be omitted when their time constant is close to the block length time of the forward transform (e.g., the forward transform associated with monitor 201 in figs. 2 and 2') that generated the input segments. The fast smoothers are "fast" with respect to the "slow" variable-time-constant smoothers or smoothing functions 425, 427, and 429 (hereinafter "slow smoothers") that receive the outputs of the fast smoothers. Examples of fast and slow smoother time constants are given below.
Thus, regardless of whether the fast smoothing is provided by the inherent operation of the forward transform or by the fast smoothers, a two-stage smoothing operation is preferred, with the second, slower stage being variable. However, a single stage of smoothing may provide acceptable results.
The time constants of the slow smoothers are preferably synchronized with each other within the module. This may be accomplished, for example, by applying the same control information to each slow smoother and by configuring each slow smoother to respond to the applied control information in the same manner. The derivation of the information used to control the slow smoother is given below.
Preferably, the smoothers are connected in series in pairs 419/425, 421/427 and 423/429 as shown in figs. 4A and 4B, with the fast smoother feeding the slow smoother. The series arrangement has the advantage that the second stage is resistant to short, fast signal spikes at the input of the pair. However, similar results can be obtained by configuring each pair of smoothers in parallel; in a parallel arrangement, the resistance to short, fast signal spikes provided by the second stage of the series arrangement may instead be handled in the logic of the time constant controller.
Each stage of the two-stage smoother may be implemented as a single-pole low-pass filter ("leaky integrator"), such as an RC low-pass filter (in analog embodiments) or an equivalent first-order low-pass filter (in digital embodiments). For example, in a digital embodiment, the first-order filters may each be implemented as a "biquad" (general second-order) filter with some coefficients set to 0 so that it acts as a first-order filter. Alternatively, the two smoothers may be combined into a single second-order biquad filter, but if the second (variable) stage is kept separate from the first (fixed) stage, it is simpler to calculate the coefficient values of the second (variable) stage.
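A minimal sketch of the two-stage smoother built from first-order leaky integrators, using a per-sample interface; the class names and the exponential mapping from time constant to filter coefficient are illustrative assumptions:

```python
import math

class LeakyIntegrator:
    """First-order low-pass filter ('leaky integrator') smoother."""
    def __init__(self, time_constant_s, sample_rate_hz):
        # Standard one-pole coefficient for the given time constant.
        self.rate = sample_rate_hz
        self.alpha = 1.0 - math.exp(-1.0 / (time_constant_s * sample_rate_hz))
        self.state = 0.0
    def step(self, x):
        self.state += self.alpha * (x - self.state)
        return self.state

class TwoStageSmoother:
    """Fixed fast stage feeding a variable slow stage, as in figs. 4A/4B."""
    def __init__(self, sample_rate_hz, fast_tc=0.001, slow_tc=0.150):
        self.fast = LeakyIntegrator(fast_tc, sample_rate_hz)
        self.slow = LeakyIntegrator(slow_tc, sample_rate_hz)
    def set_slow_tc(self, tc):
        # e.g. 0.010, 0.030 or 0.150 s, per the text below.
        self.slow.alpha = 1.0 - math.exp(-1.0 / (tc * self.slow.rate))
    def step(self, energy):
        # Energy (squared) levels are applied, per the note below.
        return self.slow.step(self.fast.step(energy))
```

Switching only the second stage's coefficient leaves the fixed fast stage to absorb short spikes, as in the series arrangement described above.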
It should be noted that in the embodiments of figs. 4A, 4B and 4C, all signal levels are expressed as energy (squared) levels, except where amplitude is required, in which case a square root is taken. Because the smoothing is applied to the energy of the applied signal, the smoothing is RMS-sensing rather than average-sensing (an average-sensing smoother is fed with linear amplitude). Since the signal applied to the smoother is a squared level, the smoother reacts more quickly than an average-sensing smoother to sudden increases in signal level, because the increase is magnified by the squaring.
The two-stage smoothers thus provide a time average of the energy of each input channel in each subband (that of the first channel provided by slow smoother 425 and that of the m-th channel by slow smoother 427), and an average of the common energy of the input channels in each subband (provided by slow smoother 429).
The averaged energies at the outputs of the slow smoothers (425, 427, 429) are applied to combiners 431, 433 and 435, respectively, where (1) the neighbor energy levels, if any (e.g., from monitor 201 of figs. 2 and 2'), are subtracted from the smoothed energy levels of the respective input channels, and (2) the higher-level neighbor energy levels, if any (e.g., from monitor 201 of figs. 2 and 2'), are subtracted from the averaged common energy output of the slow smoother. For example, each module receiving input channel 3' (figs. 1A, 2, and 2') has two adjacent modules and receives neighbor energy level information that compensates for the influence of those two modules. However, none of those modules is a "higher-level" module (i.e., all modules sharing input channel 3' are two-input modules). In contrast, module 28 (figs. 1A, 2, and 2') is an example of a module whose inputs are shared by higher-level modules. Thus, for example, in module 28 the averaged energy output from the slow smoother for input 13' receives higher-level neighbor compensation.
The resulting "neighbor compensated" energy levels of the sub-bands of the inputs of the module are applied to a function or device 437 which calculates the nominal heading of these energy levels. The direction indication may be calculated as a vector sum of the energy weighted inputs. For a two input module, this reduces to the L/R ratio of the smoothed and neighbor compensated input signal energy levels.
For example, consider a two-input case in which the channel positions are given as 2-tuples representing x, y coordinates in a planar surround array. The listener is assumed to be at the center, (0, 0). In normalized spatial coordinates, the front left channel is located at (1, 1) and the front right channel at (-1, 1). If the left input amplitude (Lt) is 4 and the right input amplitude (Rt) is 3, then, using these amplitudes as weighting factors, the nominal primary direction is:
(4*(1,1)+3*(-1,1))/(4+3)=(0.143,1),
or slightly to the left of the center on the horizontal line connecting the left and right.
Alternatively, once the primary matrix is defined, spatial directions may be expressed in matrix coordinates rather than physical coordinates. In that case the input magnitude values, normalized so that the sum of their squares is 1, are the effective matrix coordinates of the direction. In the above example, the left and right levels of 4 and 3 normalize to 0.8 and 0.6, so the "direction" is (0.8, 0.6). In other words, the nominal primary direction is the set of square roots of the neighbor-compensated, smoothed input energy levels, normalized to a sum of squares of 1. Block 437 produces the same number of outputs (in this example, 2) as the module has inputs, indicating the spatial direction.
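The worked example above ((0.143, 1) in physical coordinates, (0.8, 0.6) in matrix coordinates) can be reproduced with a short sketch; the function names are hypothetical:

```python
def nominal_direction(amplitudes, positions):
    """Amplitude-weighted vector sum giving the nominal primary
    direction in physical coordinates, as in the worked example."""
    total = sum(amplitudes)
    dims = len(positions[0])
    return tuple(sum(a * p[d] for a, p in zip(amplitudes, positions)) / total
                 for d in range(dims))

def matrix_direction(amplitudes):
    """Direction in matrix coordinates: the amplitudes normalized so
    that the sum of their squares is 1."""
    norm = sum(a * a for a in amplitudes) ** 0.5
    return tuple(a / norm for a in amplitudes)
```

For Lt = 4 at (1, 1) and Rt = 3 at (-1, 1), `nominal_direction` gives (1/7, 1) and `matrix_direction` gives (0.8, 0.6), matching the text.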
The neighbor-compensated, smoothed energy levels of the subbands of the module's inputs applied to the direction-determining function or device 437 are also applied to a function or device 439, which calculates a neighbor-compensated cross-correlation ("neighbor-compensated_xcor"). Block 439 also receives as input, from slow variable smoother 429, the averaged common energy of the module's inputs for each subband, compensated, if applicable, by the higher-level neighbor energy levels in combiner 435. The neighbor-compensated cross-correlation is calculated in block 439 as the higher-level-compensated, smoothed common energy divided by the M-th root of the product of the neighbor-compensated, smoothed energy levels of the module's respective input channels, where M is the number of inputs, yielding an actual mathematical correlation value in the range 1.0 to -1.0. Preferably, values from 0 to -1.0 are treated as 0. The neighbor-compensated_xcor provides an estimate of the cross-correlation that would exist in the absence of other modules.
The neighbor-compensated_xcor from block 439 is then applied to a weighting device or function 441, which weights it with the neighbor-compensated direction information to produce a direction-weighted cross-correlation ("direction-weighted_xcor"). The weight increases as the nominal primary direction deviates from the centered condition; in other words, unequal input amplitudes (and therefore energies) cause direction-weighted_xcor to increase proportionally. The direction-weighted_xcor provides an estimate of the compactness of the image. Thus, in the case of a two-input module with, for example, left L and right R inputs, the weight increases as the direction moves off-center to the left or right (the weight is the same for either direction off-center by the same angle). For such a two-input module, the neighbor-compensated_xcor is weighted by the L/R or R/L ratio, so that an uneven signal distribution causes direction-weighted_xcor to approach 1.0. For such a two-input module:
when R >= L,
direction-weighted_xcor = (1 - ((1 - neighbor-compensated_xcor) * (L/R))),
and
when R < L,
direction-weighted_xcor = (1 - ((1 - neighbor-compensated_xcor) * (R/L)))
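A sketch of the two-input direction weighting above, applied to the neighbor-compensated cross-correlation and the smoothed L and R levels (the function name and argument order are illustrative):

```python
def direction_weighted_xcor(ncx, l_level, r_level):
    """Two-input direction weighting of the neighbor-compensated
    cross-correlation: the farther off-center the signal, the closer
    the result is pulled toward 1.0."""
    if r_level >= l_level:
        ratio = l_level / r_level if r_level else 0.0
    else:
        ratio = r_level / l_level
    return 1.0 - (1.0 - ncx) * ratio
```

A centered, uncorrelated pair (ratio 1) leaves the value at the raw correlation, while a one-sided signal (ratio near 0) drives the value toward 1.0 regardless of correlation.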
Alternatively, the weighted cross-correlation (WgtXcor) may be obtained in other ways. For example, let
A = (|L*L| - |R*R|) / (|L*L| + |R*R|) (the normalized input power difference), and
B = 2*|L*R| / (|L*L| + |R*R|) (the normalized input cross power),
where "| ... |" indicates time averaging.
Then, can use
WgtXcor=A+B,
Alternatively, the sum of squares is used:
WgtXcor=Sqrt(A*A+B*B).
In either case, as L or R approaches 0, WgtXcor approaches 1 regardless of the value of |L*R|.
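The alternative weighting can be sketched directly from the time-averaged quantities; the argument names for the averaged powers are assumptions:

```python
def wgt_xcor(mean_ll, mean_rr, mean_lr):
    """Alternative weighted cross-correlation (sum-of-squares form):
    A is the normalized input power difference, B the normalized
    input cross power; both share the denominator |L*L| + |R*R|."""
    denom = mean_ll + mean_rr
    a = (mean_ll - mean_rr) / denom
    b = 2.0 * abs(mean_lr) / denom
    return (a * a + b * b) ** 0.5
```

As claimed above, when either input power goes to 0, A goes to 1 and B to 0, so the result is 1 regardless of the cross power; equal uncorrelated inputs give 0.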
For modules with more than two inputs, calculating the direction-weighted _ xcor from the neighbor-compensated _ xcor requires replacing the above-mentioned ratio L/R or R/L, for example, with a "uniformity" metric that varies between 1.0 and 0. For example, to compute a uniformity metric for any number of inputs, the input signal level is normalized by the total input power, resulting in a normalized input level that sums to 1.0 in an energy (squared) sense. Each normalized input level is divided by the similarly normalized input level of the signal centered in the array. Thus, for example, for a three input module where one of the inputs has a 0 level, the uniformity metric is 0 and the direction-weighted _ xcor is equal to 1. (in this case, the signal is on the boundary of the three-input module, on the line between two of its inputs, and the two-input module (lower level) determines where on the line the nominal principal direction is, and how wide the output signal should extend along the line).
Returning to the description of FIG. 4B, the direction-weighted _ xcor is further weighted by applying it to a function or device 443, which applies "random _ xcor" to produce "effective _ xcor". effective _ xcor provides an estimate of the shape of the distribution of the input signal.
random_xcor is the averaged cross product of the input magnitudes divided by the average input energy. The value of random_xcor may be calculated by assuming that the output channels carry independent but equal-level signals and calculating the value of xcor that results when those channels are passively downmixed to the module's input channels. By this method, random_xcor is calculated to be 0.333 for a three-output module with two inputs, and 0.483 for a five-output module with two inputs (three internal outputs). The random_xcor value need only be calculated once for each module. While these random_xcor values have been found to provide satisfactory results, they are not critical, and the system designer may employ other values. As described below, a change in the value of random_xcor shifts the dividing line between the two operating conditions (regions) of the signal distribution system; the exact location of that dividing line is not critical.
The random_xcor weighting performed by function or device 443 may be considered a renormalization of the direction-weighted_xcor value, yielding effective_xcor:
effective_xcor = (direction-weighted_xcor - random_xcor) / (1 - random_xcor), if direction-weighted_xcor >= random_xcor;
effective_xcor = 0 otherwise
The random_xcor weighting accelerates the decrease of direction-weighted_xcor as it falls below 1.0, so that effective_xcor is 0 when direction-weighted_xcor equals random_xcor. Since the outputs of the module represent directions along a circular arc or line, values of effective_xcor less than 0 are treated as equal to 0.
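The renormalization above amounts to a clamped linear map; a sketch:

```python
def effective_xcor(dwx, random_xcor):
    """Renormalize the direction-weighted cross-correlation so it
    reaches 0 exactly when dwx falls to the module's random_xcor."""
    if dwx >= random_xcor:
        return (dwx - random_xcor) / (1.0 - random_xcor)
    return 0.0
```

With random_xcor = 1/3 (the two-input, three-output case above), a fully correlated pair still maps to 1.0, while anything at or below the passive-downmix correlation maps to 0.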
Information for controlling the slow smoothers 425, 427, and 429 is derived from the neighbor-uncompensated fast- and slow-smoothed input channel energies and the fast- and slow-smoothed common energy of the input channels. In particular, a function or device 445 computes a fast neighbor-uncompensated cross-correlation in response to the fast-smoothed input channel energies and the fast-smoothed common energy. A function or device 447 computes a fast neighbor-uncompensated direction (a ratio or vector, as discussed above in connection with block 437) in response to the fast-smoothed input channel energies. A function or device 449 computes a slow neighbor-uncompensated cross-correlation in response to the slow-smoothed input channel energies and the slow-smoothed common energy. A function or device 451 computes a slow neighbor-uncompensated direction (ratio or vector, as discussed above) in response to the slow-smoothed input channel energies. The fast neighbor-uncompensated cross-correlation, fast neighbor-uncompensated direction, slow neighbor-uncompensated cross-correlation, and slow neighbor-uncompensated direction are applied, along with the direction-weighted_xcor from block 441, to a device or function 453 (hereinafter "adjust time constant"), which provides information for controlling the variable slow smoothers 425, 427, and 429 so as to adjust their time constants. Preferably, the same control information is provided to each variable slow smoother. Unlike the other quantities fed into the time constant selection, which compare fast metrics to slow metrics, the direction-weighted_xcor is preferably used without reference to any fast value, so that if its absolute value is greater than a threshold it may cause "adjust time constant" 453 to select a faster time constant.
The operating rules for "adjust time constant" 453 are set forth below.
In general, in a dynamic audio system it is desirable for control signals to hold static values, using as slow a time constant as possible, to minimize audible disruption of the reproduced sound field, until a "new event" occurs in the audio signal, at which point the control signal should change rapidly to its new static value and then hold that value until another "new event" occurs. Typically, audio processing systems equate a change in amplitude with a "new event". However, when cross products or cross-correlations are involved, new events and amplitude changes do not always correspond: a new event may cause the cross-correlation to decrease. By sensing changes in the parameters relevant to the module's operation, namely the measures of cross-correlation and direction, the module's time constants can be accelerated and the desired new control state quickly assumed.
The consequences of inappropriate dynamic behavior include image drift, chatter (channels rapidly turning on and off), pumping (unnatural changes in level), and, in multiband embodiments, band-by-band artifacts (chatter and pumping on a band-by-band basis). Some of these effects are particularly detrimental to the quality of the isolated channels.
Embodiments such as those of figs. 1A and 2 and figs. 1B and 2' employ a trellis of decoding modules. This configuration gives rise to two types of dynamics problems: inter-module dynamics and intra-module dynamics. In addition, the several possible ways of implementing the audio processing (e.g., wideband; multiband using an FFT or MDCT linear filter bank, a discrete filter bank, critical bands, or otherwise) each require their own dynamic behavior optimization.
The basic decoding process in each module depends on a measure of the energy ratios of the input signals and a measure of the cross-correlation of the input signals (in particular the direction-weighted_xcor described above, the output of block 441 in fig. 4B), which together control the signal distribution among the module's outputs. Deriving these basic quantities requires smoothing, which in the time domain means computing a time-weighted average of their instantaneous values. The range of time constants required is very large: very short (e.g., 1 ms) for fast transient changes in signal conditions, to very long (e.g., 150 ms) for low correlation values, where the instantaneous values may be much larger than the actual average.
In analog terms, a common method of achieving variable-time-constant behavior is to use "speedup" diodes: when the instantaneous level exceeds the average level by a threshold amount, the diode conducts, yielding a shorter effective time constant. A drawback of this technique is that a transient peak in an otherwise steady-state input can produce a large change in the smoothed level, which then decays very slowly, giving unnatural emphasis to an isolated peak that would otherwise have little audible consequence.
The correlation calculations described in connection with the embodiments of figs. 4A-4C make the use of speedup diodes (or their DSP equivalents) questionable. For example, all of the smoothers in a given module preferably have synchronized time constants so that their smoothed levels remain comparable; a global (ganged) time-constant switching structure is therefore preferred. In addition, rapid changes in signal conditions are not necessarily associated with an increase in the common energy level, and applying speedup diodes to that level could produce biased, inaccurate correlation estimates. Accordingly, embodiments of aspects of the present invention preferably use two-stage smoothing rather than diode-equivalent speedup. Estimates of correlation and direction derived from the outputs of the first and second smoother stages may be used to set the time constant of the second stage.
For each pair of smoothers (e.g., 419/425), the time constant of the first, fixed fast stage may be set to a fixed value, e.g., 1 millisecond. The time constant of the second, variable slow stage may be selected from, for example, 10 milliseconds (fast), 30 milliseconds (medium), and 150 milliseconds (slow). While such time constants have been found to provide satisfactory results, their values are not critical, and other values may be employed at the discretion of the system designer. The second-stage time constant may also be continuously variable rather than discrete. The selection of the time constant may be based not only on the signal conditions described above but also on a hysteresis mechanism using a "fast flag", which ensures that once a truly fast transition is encountered the system remains in fast mode, avoiding use of the medium time constant until signal conditions re-enable the slow time constant. This helps ensure quick adaptation to new signal conditions.
For the two-input case, the choice among the three possible second-stage time constants may be made by "adjust time constant" 453 according to the following rules:
If the absolute value of direction-weighted_xcor is less than a first reference value (e.g., 0.5), the absolute difference between the fast and slow neighbor-uncompensated_xcor is less than the same first reference value, and the absolute difference between the fast and slow direction ratios (both ranging from +1 to -1) is less than the same first reference value, then the slow second-stage time constant is used and the fast flag is set to True, so that the medium time constant may subsequently be selected.
Otherwise, if the fast flag is True, the absolute difference between the fast and slow neighbor-uncompensated_xcor is greater than the first reference value and less than a second reference value (e.g., 0.75), the absolute difference between the fast and slow direction ratios is greater than the first reference value and less than the second reference value, and the absolute value of direction-weighted_xcor is greater than the first reference value and less than the second reference value, then the medium second-stage time constant is selected.
Otherwise, the fast second-stage time constant is used and the fast flag is set to False, disabling subsequent use of the medium time constant until the slow time constant is again selected.
In other words, the slow time constant is selected when all three quantities are less than the first reference value, the medium time constant when all three lie between the first and second reference values and the fast flag is set, and the fast time constant otherwise (e.g., when any quantity exceeds the second reference value).
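The three-way selection with the fast-flag hysteresis might be sketched as follows, returning the chosen second-stage time constant in seconds together with the updated flag; the function signature and the packaging of the direction metric as a single ratio are assumptions:

```python
def select_time_constant(dwx, fast_xcor, slow_xcor, fast_dir, slow_dir,
                         fast_flag, ref1=0.5, ref2=0.75):
    """Choose the slow-stage time constant per the rules above.
    Returns (time_constant_seconds, fast_flag)."""
    d_xcor = abs(fast_xcor - slow_xcor)
    d_dir = abs(fast_dir - slow_dir)
    # Rule 1: everything quiet -> slow constant, re-arm the flag.
    if abs(dwx) < ref1 and d_xcor < ref1 and d_dir < ref1:
        return 0.150, True
    # Rule 2: moderate change and flag armed -> medium constant.
    if (fast_flag and ref1 < d_xcor < ref2 and ref1 < d_dir < ref2
            and ref1 < abs(dwx) < ref2):
        return 0.030, fast_flag
    # Rule 3: otherwise fast; disable medium until slow is re-selected.
    return 0.010, False
```

The flag is the hysteresis state: once a truly fast transition forces rule 3, the medium constant stays unavailable until conditions calm enough for rule 1 to fire again.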
While the rules and reference values just described have been found to provide satisfactory results, they are not critical, and system designers may employ variations of these rules or other rules that take account of the fast and slow cross-correlations and the fast and slow directions. For example, it is simpler, but nearly as effective, to use diode-style speedup processing with a ganged arrangement such that if any smoother in the module is in fast mode, all other smoothers are switched to fast mode as well. It is also possible to use separate smoothers for time-constant determination and for signal distribution, with the smoothers used for time-constant determination held at fixed time constants while only the signal-distribution time constants change.
Since even signal levels smoothed in fast mode still require several milliseconds to adapt, a delay may be embedded in the system to allow the control signals to adapt before they are applied to the signal path. In a wideband embodiment, this may be implemented as a discrete delay in the signal path (e.g., 5 milliseconds). In multiband (transform) versions, the delay is a natural consequence of the block processing, and no explicit delay is needed if the analysis of a block is performed before the signal-path matrixing of that block.
Multiband embodiments of aspects of the present invention may use the same time constants and rules as the wideband version, except that the sampling rate of the smoothers is the signal sampling rate divided by the block size (i.e., the block rate), so the coefficients used in the smoothers must be adjusted accordingly.
For frequencies below 400 Hz, the time constants are preferably scaled inversely with frequency in multiband embodiments. In a wideband version this is not possible, since there are no separate smoothers at different frequencies; as partial compensation, a band-pass/pre-emphasis filter may be applied to the input signal of the control path to emphasize the mid and upper-mid frequencies. This filter may have, for example, a two-pole high-pass characteristic with a corner frequency at 200 Hz, plus a two-pole low-pass characteristic with a corner frequency at 8000 Hz, plus a pre-emphasis network applying a 6 dB boost from 400 Hz to 800 Hz and another 6 dB boost from 1600 Hz to 3200 Hz. Although such a filter has been found suitable, the filter characteristics are not critical, and other parameters may be employed at the discretion of the system designer.
In addition to time-domain smoothing, multiband versions of aspects of the present invention preferably also employ the frequency smoothing (frequency smoothers 413, 415, and 417) described above in connection with fig. 4A. For each block, the neighbor-uncompensated energy levels may be averaged over a sliding frequency window, adjusted to approximate a 1/3-octave (critical-band) bandwidth, before being applied to the subsequent time-domain processing described above. Since transform-based filter banks have an essentially linear frequency resolution, the width of this window (in transform coefficients) increases with frequency, and is typically only one transform coefficient wide at low frequencies (below about 400 Hz). Thus the overall smoothing applied in multiband processing relies more on time-domain smoothing at low frequencies and more on frequency-domain smoothing at higher frequencies, where a fast time response may at times be more necessary.
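A sketch of how the sliding-window width might be chosen per transform bin, assuming a linearly spaced transform and the 400 Hz floor mentioned above; the exact window shape, centering, and rounding are illustrative choices, not specified by the original:

```python
def smoothing_window_bins(bin_index, num_bins, sample_rate_hz,
                          fraction=1.0 / 3.0, min_freq_hz=400.0):
    """Width (in transform bins, always odd) of a sliding frequency
    window approximating a 1/3-octave bandwidth around 'bin_index'.
    Below min_freq_hz the width collapses to a single bin."""
    bin_hz = sample_rate_hz / 2.0 / num_bins
    center = (bin_index + 0.5) * bin_hz
    if center < min_freq_hz:
        return 1
    # Fractional-octave bandwidth about the bin center frequency.
    bandwidth = center * (2 ** (fraction / 2) - 2 ** (-fraction / 2))
    half = int(bandwidth / 2.0 / bin_hz)
    return 2 * half + 1
```

This reproduces the behavior described above: one bin wide at low frequencies, growing linearly (in bins) with center frequency for a linear-resolution filter bank.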
Turning to the description of fig. 4C, the preliminary scale factors (shown as PSF in figs. 2 and 2') that ultimately affect the dominant/fill/endpoint signal distribution may be generated by a combination of devices or functions 455, 457, and 459, which respectively calculate the "dominant", "fill", and "excess endpoint energy" scale factor components, the respective normalizers or normalization functions 461, 463, and 465, and a device or function 467 that obtains the maximum of the dominant and fill scale factor components and/or the additive combination of the fill and excess-endpoint-energy scale factor components. If the module is one of a plurality of modules, the preliminary scale factors may be sent to a monitor, such as monitor 201 of figs. 2 and 2'. The preliminary scale factors may each range from 0 to 1.
Dominant scale factor component
In addition to effective_xcor, a device or function 455 ("calculate dominant scale factor component") receives the neighbor-compensated direction information from block 437 and information about the local matrix coefficients from the local matrix 469, so that the N nearest output channels (where N equals the number of module inputs) whose weighted sum can yield the nominal primary direction coordinates can be determined, along with the "dominant" scale factor components. The output of block 455 is a single scale factor component (per subband) if the nominal primary direction happens to coincide with an output direction; otherwise it is a plurality of scale factor components (one per input, per subband) that bracket the nominal primary direction and are applied, in a power-preserving sense, at the appropriate proportions to pan or map the dominant signal to the correct virtual position (e.g., for N = 2, the sum of the squares of the two assigned dominant-channel scale factor components should equal effective_xcor).
For a two-input module, all output channels lie along a line or circular arc, so there is a natural ordering (from "left" to "right") and it is clear which channels are adjacent to one another. For the hypothetical case described above, with two input channels and five output channels having the sin/cos coefficients shown, the nominal primary direction (0.8, 0.6) may be assumed to lie between the mid-left ML channel (.92, .38) and the center C channel (.71, .71). This pair can be found by locating the two consecutive channels such that one has an L-coefficient larger than the nominal primary direction's L-coordinate while the channel to its right has an L-coefficient smaller than that dominant L-coordinate.
The dominant scale factor components are assigned to the two nearest channels in a power-preserving sense. To this end, a system of two equations in two unknowns is solved, the unknowns being the dominant scale factor component (SFL) of the channel to the left of the nominal primary direction and the corresponding scale factor component (SFR) of the channel to its right:
first_dominant_coord = SFL * left_channel_matrix_value_1 + SFR * right_channel_matrix_value_1
second_dominant_coord = SFL * left_channel_matrix_value_2 + SFR * right_channel_matrix_value_2
It should be noted that "left" and "right" here refer to the channels bracketing the nominal primary direction, not to the L and R input channels of the module.
The solution is equivalent to calculating an "anti-dominant" level for each channel, normalizing the pair so that the sum of their squares is 1.0, and using each as the dominant distribution scale factor component (SFL, SFR) of the other channel. In other words, the anti-dominant value for an output channel with coefficients [A, B], for a signal with coordinates [C, D], is the absolute value of A*D - B*C. For the numerical example under consideration:
Antidom(ML channel)=abs(.92*.6-.38*.8)=.248
Antidom(C channel)=abs(.71*.6-.71*.8)=.142
(where "abs" denotes the absolute value)
Normalizing the latter two numbers to a sum of squares of 1.0 yields .8678 and .4969, respectively. Swapping these values to the opposite channels, the dominant scale factor components are (note that each dominant scale factor component is also multiplied by the square root of effective_xcor):
ML dom sf=.4969*sqrt(effective_xcor)
C dom sf=.8678*sqrt(effective_xcor)
(The dominant signal is closer to the C output than to the ML output.)
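The anti-dominant calculation of the numeric example above can be reproduced in a few lines; the function name and argument order are assumptions:

```python
def dominant_scale_factors(left_coeffs, right_coeffs, direction):
    """Power-preserving dominant scale factor components via the
    'anti-dominant' trick: each channel's normalized anti-dominant
    level becomes the OTHER channel's scale factor component.
    (Multiply each result by sqrt(effective_xcor) to finish.)"""
    c, d = direction
    anti_left = abs(left_coeffs[0] * d - left_coeffs[1] * c)
    anti_right = abs(right_coeffs[0] * d - right_coeffs[1] * c)
    norm = (anti_left ** 2 + anti_right ** 2) ** 0.5
    # Swap: the left channel gets the normalized right anti-dominant value.
    return anti_right / norm, anti_left / norm
```

For the ML channel (.92, .38), the C channel (.71, .71) and direction (0.8, 0.6), this yields approximately (.4969, .8678), matching the worked example; a direction pointing exactly at one channel yields (1.0, 0.0), matching the endpoint argument below.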
The use of each channel's normalized anti-dominant value as the other channel's dominant scale factor component can be better understood by considering what happens if the nominal primary direction points exactly at one of the two selected channels. Suppose the coefficients of one channel are [A, B], the coefficients of the other channel are [C, D], and the coordinates of the nominal primary direction are [A, B] (pointing at the first channel). Then:
Antidom(first chan) = abs(A*B - B*A)
Antidom(second chan) = abs(C*B - D*A)
Note that the first anti-dominant value is 0. When the two anti-dominant values are normalized so that the sum of their squares is 1.0, the second becomes 1. After swapping, the first channel receives a dominant scale factor component of 1.0 (times the square root of effective_xcor) while the second receives 0.0, as desired.
When this method is extended to modules with more than two inputs, the natural ordering that exists when the channels lie along a line or circular arc no longer applies. Again, block 437 of fig. 4B calculates the nominal primary direction coordinates, e.g., by taking the neighbor-compensated input magnitudes and normalizing them to a sum of squares of 1. Block 455 then identifies the N nearest channels (where N equals the number of inputs) whose weighted sum can produce the dominant coordinates. (Distance or proximity can be calculated as the sum of the squares of the coordinate differences, as if they were (x, y, z) spatial coordinates.) The N nearest channels cannot always simply be picked, since they must be capable of being weighted and summed to produce the nominal primary direction.
For example, assume a three-input module fed by channels Ls, Rs, and Top in the triangular relationship shown in FIG. 5. Assume there are three internal output channels, with module local matrix coefficients [.71, .69, .01], [.70, .70, .01], and [.69, .71, .01], clustered near the base of the triangle. Assume the nominal principal direction is slightly below the center of the triangle, with coordinates [.6, .6, .53]. (Note that the coordinates of the center of the triangle are [.5, .5, .707].) The three channels nearest the nominal principal direction are the three internal channels at the bottom, but no weighted sum of them using scale factors between 0 and 1 can produce the dominant coordinates. Instead, two of the bottom channels and the Top endpoint channel are selected to distribute the dominant signal, the three equations for the three weighting factors are solved to complete the dominant calculation, and processing proceeds to the fill and endpoint calculations.
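The direction normalization and nearest-channel identification described above can be sketched as follows (a minimal sketch; function names are assumptions, and the numbers in the usage below are taken from the FIG. 5 example):

```python
import math

def nominal_direction(neighbor_compensated_levels):
    # normalize the input magnitudes so that their squares sum to 1
    norm = math.sqrt(sum(x * x for x in neighbor_compensated_levels))
    return [x / norm for x in neighbor_compensated_levels]

def nearest_channels(direction, channel_coords, n):
    # proximity = sum of squared coordinate differences, as with (x, y, z) points
    def dist2(coords):
        return sum((c - d) ** 2 for c, d in zip(coords, direction))
    order = sorted(range(len(channel_coords)), key=lambda i: dist2(channel_coords[i]))
    return order[:n]
```

Using the FIG. 5 values, the three channels nearest [.6, .6, .53] are indeed the three internal bottom channels, which is why the module must instead fall back to two bottom channels plus the Top endpoint channel.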
In the example of fig. 1A and 2, there is only one three-input module and it is used to derive only one internal channel, which simplifies the calculation.
Fill scale factor component
In addition to effective_xcor, device or function 356 ("compute fill scale factor component") also receives random_xcor and direction-weighted_xcor from block 341, "EQUIAMPL" ("EQUIAMPL" is defined and explained below), and information about the local matrix coefficients from the local matrix (for the case in which the same fill scale factor component is not applied to all outputs, as described below in connection with FIG. 14B). The output of block 356 is a fill scale factor component (per subband) for each module output.
As described above, effective_xcor is zero when direction-weighted_xcor is less than or equal to random_xcor. When direction-weighted_xcor >= random_xcor, the fill scale factor component for all output channels is:
fill scale factor component = sqrt(1 - effective_xcor) * EQUIAMPL
Thus, when direction-weighted_xcor = random_xcor, effective_xcor is zero, so (1 - effective_xcor) is 1.0 and the fill amplitude scale factor component is equal to EQUIAMPL (under this condition, output power is assured of equaling input power). This is the maximum value reached by the fill scale factor component.
When direction-weighted_xcor is less than random_xcor, the dominant scale factor component is 0, and the fill scale factor component falls toward 0 as direction-weighted_xcor approaches 0:
fill scale factor component = sqrt(direction-weighted_xcor / random_xcor) * EQUIAMPL
Thus, at the boundary direction-weighted_xcor = random_xcor, the fill scale factor component is again equal to EQUIAMPL, ensuring continuity with the result of the above equation for the case in which direction-weighted_xcor is greater than random_xcor.
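The two fill equations can be sketched as a single piecewise function (a minimal sketch; the parameter layout is an assumption, and effective_xcor is taken as given rather than derived here):

```python
import math

def fill_sf_component(dw_xcor, random_xcor, effective_xcor, equiampl):
    """Fill scale factor component per the two regions described above."""
    if dw_xcor >= random_xcor:
        # region 1: effective_xcor is nonzero here, zero exactly at the boundary
        return math.sqrt(1.0 - effective_xcor) * equiampl
    # region 2: tapers toward 0 as direction-weighted_xcor approaches 0
    return math.sqrt(dw_xcor / random_xcor) * equiampl
```

At the boundary dw_xcor = random_xcor (where effective_xcor = 0) both branches yield EQUIAMPL, matching the continuity property stated above.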
Associated with each decoder module is not only a value of random_xcor but also a value of "EQUIAMPL", the scale factor value that all scale factors should have if the signal is equally distributed so that power is conserved, i.e.:
EQUIAMPL = sqrt(number of decoder module input channels / number of decoder module output channels)
For example, for a two input module with three outputs:
EQUIAMPL=sqrt(2/3)=.8165
where "sqrt()" means "square_root_of()".
For a two input module with four outputs:
EQUIAMPL=sqrt(2/4)=.7071
for a two input module with five outputs:
EQUIAMPL=sqrt(2/5)=.6325
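The EQUIAMPL values above follow directly from the definition; a one-line sketch:

```python
import math

def equiampl(num_inputs, num_outputs):
    # power-conserving scale factor when the signal is spread equally over all outputs
    return math.sqrt(num_inputs / num_outputs)
```

This reproduces the three worked values for two-input modules with three, four, and five outputs.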
While this EQUIAMPL value has been found to provide satisfactory results, the value is not critical and other values may be employed at the discretion of the system designer. Changing the value of EQUIAMPL affects the level of the output channels for the "fill" condition (intermediate correlation of the input signals) relative to their levels for the "dominant" condition (maximum correlation of the input signals) and the "all endpoints" condition (minimum correlation of the input signals).
Endpoint scale factor component
In addition to neighbor-compensated_xcor (from block 439, FIG. 4B), a device or function 359 ("compute excess endpoint energy scale factor component") receives the smoothed, non-neighbor-compensated energy for each of the first through mth inputs (from blocks 325 and 327), and, optionally, information about the local matrix coefficients from the local matrix (for the case in which one or both of the endpoint outputs do not coincide with the inputs and the module applies the excess endpoint energy to the two outputs whose directions are closest to the input direction, as described further below). As explained below, the output of block 359 is a scale factor component for each endpoint output whose direction coincides with an input direction, and otherwise two scale factor components, one for each of the two outputs closest to the endpoint.
However, the excess endpoint energy scale factor component generated by block 359 is not the only "endpoint" scale factor component. There are three other sources of endpoint scale factor components (two in the case of a single, isolated module):
first, in the preliminary scale factor calculation for a particular module, the endpoints are possible candidates for the dominant signal scale factor component from block 355 (and normalizer 361).
Second, in the "fill" calculation of block 357 (and normalizer 363) of FIG. 4C, the endpoints, along with all internal channels, are considered possible fill candidates. Any non-zero fill scale factor component may be applied to all outputs, including the endpoints and the selected dominant output.
Third, if there is a grid of multiple modules, then a monitor (such as monitor 201 of the example of FIGS. 2 and 2 ') performs a final, fourth assignment of "endpoint" lanes as described above in connection with FIGS. 2, 2', and 3.
For block 359 to calculate the "excess endpoint energy" scale factor component, the total energy at all internal outputs is reflected back to the inputs of the module, an estimate is made, based on neighbor-compensated_xcor, of how much of the energy of the internal outputs is contributed by each input (the "internal energy at input n"), and this energy is used to calculate the excess endpoint energy scale factor component at the module outputs (i.e., endpoints) that coincide with the inputs.
To provide the information needed by a monitor, such as monitor 201 of fig. 2 and 2', to calculate the neighbor levels and the high-level neighbor levels, it is also necessary to reflect the internal energy back to the input. Fig. 6A and 6B illustrate one way to calculate the internal energy contribution at each input of the module and determine the excess endpoint scaling factor component for each endpoint output.
FIGS. 6A and 6B show one suitable arrangement in a module, such as any of modules 24-34 of FIG. 2 and any of modules 24-28 and 29'-35' of FIG. 2', respectively, for (1) generating a total estimated internal energy for each of inputs 1 through m of the module in response to the total energy at each of inputs 1 through m, and (2) generating an excess endpoint energy scale factor component for each endpoint of the module in response to neighbor-compensated_xcor (see FIG. 4B, output of block 439). The total estimated internal energy for each input of the module (FIG. 6A) is needed by the monitor in the case of a multiple-module arrangement, and in any case by the module itself to generate the excess endpoint energy scale factor components.
Using the scale factor components derived at blocks 455 and 457 of FIG. 4C along with other information, the arrangement of FIG. 6A calculates the total estimated energy at each internal output (but not at the endpoint outputs). Using the calculated internal output levels, each output level is multiplied by the matrix coefficients relating that output to each input ["m" inputs, "m" multipliers], which provide the energy contribution of that output to each input. For each input, the energy contributions of all internal output channels are summed to obtain a total internal energy contribution for that input. The total internal energy contribution of each input is reported to the monitor and is used by the module to calculate an excess endpoint energy scale factor component for each endpoint output.
Referring to FIG. 6A in detail, the smoothed total energy level (preferably without neighbor compensation) of each module input is applied to a set of multipliers, one for each internal output of the module. For simplicity of presentation, FIG. 6A shows two inputs "1" and "m" and two internal outputs "X" and "Z". The smoothed total energy level of each module input is multiplied by the matrix coefficient (of the module's local matrix) that relates the particular input to one of the module's internal outputs (note that the matrix coefficients are the inverse of themselves, since the sum of the squares of the matrix coefficients equals 1). This operation is performed for each combination of input and internal output. Thus, as shown in FIG. 6A, the smoothed total energy level at input 1 (which may be obtained, for example, at the output of the slow smoother 425 of FIG. 4B) is applied to a multiplier 601, which multiplies the energy level by the matrix coefficient relating internal output X to input 1, providing a scaled energy level component X1 at output X. Similarly, multipliers 603, 605 and 607 provide scaled energy level components Xm, Z1 and Zm.
In combiners 611 and 613, the energy level components of each internal output (e.g., X1 and Xm; Z1 and Zm) are summed in an amplitude/power fashion according to neighbor-compensated_xcor. If the inputs to a combiner are in phase, as indicated by neighbor-compensated_xcor being 1.0, their linear amplitudes add. If they are uncorrelated, as indicated by neighbor-compensated_xcor being 0, their energy levels add. If the cross-correlation is between 0 and 1, the sum is partly an amplitude sum and partly a power sum. To sum the combiner inputs appropriately, both the amplitude sum and the power sum are calculated and weighted by neighbor-compensated_xcor and (1 - neighbor-compensated_xcor), respectively. Before taking the weighted sum, either the square root of the power sum is taken to obtain the equivalent amplitude, or the linear amplitude sum is squared to obtain its power level. For example, using the latter method (weighted sum of powers), if the amplitude levels are 3 and 4 and neighbor-compensated_xcor is 0.7, the amplitude sum is 3 + 4 = 7, or a power level of 49, and the power (energy) sum is 9 + 16 = 25. Thus, the weighted sum is 0.7 * 49 + (1 - 0.7) * 25 = 41.8 (power level), or, taking the square root, an amplitude of 6.47.
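The correlation-weighted amplitude/power summation above can be sketched as follows (a minimal sketch working in the power domain; the function name is an assumption):

```python
import math

def correlated_sum(energy_levels, xcor):
    """Blend amplitude summation (xcor = 1) and power summation (xcor = 0)."""
    amplitudes = [math.sqrt(e) for e in energy_levels]
    amp_sum_as_power = sum(amplitudes) ** 2   # linear amplitude sum, squared
    power_sum = sum(energy_levels)
    return xcor * amp_sum_as_power + (1.0 - xcor) * power_sum
```

This reproduces the worked example: amplitudes 3 and 4 (energies 9 and 16) with a cross-correlation of 0.7 give a power level of 41.8, i.e., an amplitude of about 6.47.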
The summed results (X1 + Xm; Z1 + Zm) are multiplied in multipliers 613 and 615 by the scale factor components of outputs X and Z, respectively, to produce the total energy levels at the respective internal outputs, which may be identified as X' and Z'. The scale factor component for each internal output is obtained from block 467 (FIG. 4C). It should be noted that the "excess endpoint energy scale factor component" from block 459 (FIG. 4C) does not affect the internal outputs and is not included in the calculations performed by the arrangement of FIG. 6A.
Each of the total energy levels X' and Z' at the respective internal outputs is reflected back to each of the module inputs by multiplication by the matrix coefficient (of the module's local matrix) relating that particular output to the respective module input. This is done for each combination of internal output and input. Thus, as shown in FIG. 6A, the total energy level X' at internal output X is applied to a multiplier 617, which multiplies the energy level by the matrix coefficient relating internal output X to input 1 (which, as described above, is the same as its inverse), providing a scaled energy level component X1' at input 1.
It should be noted that a second-order weighting is required when a second-order value, such as the total energy level X', is weighted by a first-order value, such as a matrix coefficient. This is equivalent to taking the square root of the energy to obtain the amplitude, multiplying the amplitude by the matrix coefficient, and squaring the result to recover an energy value.
Similarly, multipliers 619, 621 and 623 provide scaled energy level components Xm', Z1' and Zm'. The energy components associated with the respective inputs (e.g., X1' and Z1'; Xm' and Zm') are summed in combiners 625 and 627 in the amplitude/power fashion described above in connection with combiners 611 and 613, according to neighbor-compensated_xcor. The outputs of combiners 625 and 627 represent the total estimated internal energy at inputs 1 and m, respectively. In the case of a multiple-module grid, this information is sent to a monitor, such as monitor 201 of FIGS. 2 and 2', so that the monitor can calculate the neighbor levels. The monitor collects from all modules connected to a given input their total internal energy contributions for that input, and then informs each module, for each of its inputs, of the sum of the total internal energy contributions from all other modules connected to that input. The result is the neighbor level of that input for that module. The generation of neighbor level information is described further below.
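The monitor's bookkeeping described above can be sketched as follows (a minimal sketch; keying the contributions by (module, input) pairs is an assumed data layout):

```python
def neighbor_levels(contributions):
    """For each (module, input) pair, return the sum of the total internal
    energy contributions from all OTHER modules connected to that input."""
    totals = {}
    for (module, inp), energy in contributions.items():
        totals[inp] = totals.get(inp, 0.0) + energy
    # each module's neighbor level excludes its own contribution
    return {(module, inp): totals[inp] - energy
            for (module, inp), energy in contributions.items()}
```

For example, if modules m1 and m2 both feed input L, m1's neighbor level for L is m2's contribution, and vice versa.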
The total estimated internal energy at each of inputs 1 through m is also needed by the module to calculate the excess endpoint energy scale factor component for each endpoint output. FIG. 6B shows how such scale factor component information is calculated. For simplicity of presentation, the calculation is shown for only one endpoint; it will be understood that similar calculations are performed for each endpoint output. In this example, in combiner or combining function 629, the total estimated internal energy at an input, input 1, is subtracted from the smoothed total input energy of the same input (e.g., the smoothed total energy level at input 1, obtained at the output of the slow smoother 425 of FIG. 4B and applied to multiplier 601). The result of the subtraction is divided by the smoothed total energy level of the same input 1 in divider or division function 631. The square root of the quotient is taken in square-root device or function 633. It should be noted that the operation of divider or division function 631 (as well as that of the other dividers described herein) should include detection of a zero denominator, in which case the quotient may be set to 0.
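The FIG. 6B per-endpoint computation reduces to a small formula with the zero-denominator guard noted above (a sketch; the clamp to non-negative excess is an added safeguard, not from the text):

```python
import math

def excess_endpoint_sf(smoothed_input_energy, internal_energy_at_input):
    if smoothed_input_energy == 0.0:
        return 0.0          # zero-denominator case: quotient set to 0
    excess = max(0.0, smoothed_input_energy - internal_energy_at_input)
    return math.sqrt(excess / smoothed_input_energy)
```

When the internal outputs account for all of an input's energy, the endpoint component is 0; when they account for none of it, the component is 1.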
If there is only a single, isolated module, then once the dominant, fill, and excess endpoint energy scale factor components have been determined, the endpoint preliminary scale factors are determined.
Thus, all output channels, including the endpoints, are assigned scale factors, which can then be used to perform the signal path matrixing. However, if there is a grid of multiple modules, each module has assigned an endpoint scale factor to each input feeding that module, so inputs connected to more than one module have multiple scale factor assignments, one from each connected module. In this case, a monitor (such as monitor 201 of the examples of FIGS. 2 and 2') performs a final, fourth assignment for the "endpoint" channels which, as described above in connection with FIGS. 2, 2' and 3, determines final endpoint scale factors that override all of the scale factor assignments made by the individual modules.
In a practical arrangement, an actual output channel direction corresponding to an endpoint position is not necessarily present, although this is often the case. If there is no physical endpoint channel, but there is at least one physical channel beyond the endpoint, then the endpoint energy is panned to the physical channels closest to the endpoint as if it were a dominant signal component. In a horizontal array there are two channels closest to the endpoint position, and a constant-energy distribution (the sum of the squares of the two scale factors is 1.0) is preferably used. In other words, when a sound direction does not correspond to the position of an actual channel, even if it is an endpoint signal, it is preferable to pan it to the nearest available pair of actual channels, because otherwise, if the sound moves slowly, it would jump abruptly from one output channel to another. Thus, when there is no physical endpoint channel, it is inappropriate to pan the endpoint signal to the single channel closest to the endpoint position unless there is no physical channel beyond the endpoint (in which case there is no choice other than to pan to the single channel closest to the endpoint position).
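A constant-energy pan between the two nearest channels might be sketched as follows (the sine/cosine law is one common choice satisfying the sum-of-squares-equals-1.0 requirement; it is an assumption, not specified by the text):

```python
import math

def constant_energy_pan(position):
    """position in [0, 1]: 0 = entirely in the first channel, 1 = entirely
    in the second. Returns (sf1, sf2) with sf1**2 + sf2**2 == 1."""
    theta = position * math.pi / 2.0
    return math.cos(theta), math.sin(theta)
```

Because the squared scale factors always sum to 1.0, a slowly moving endpoint signal glides between the pair rather than jumping from one channel to the other.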
Another way to implement such panning is for a monitor, such as monitor 201 of FIGS. 2 and 2', to generate "final" scale factors based on the assumption that each input also has a corresponding output channel (i.e., each corresponding input and output coincide, representing the same position). An output matrix, such as the variable matrix 203 of FIG. 2 or 2', can then map an output channel to one or more suitable actual output channels when no actual output channel directly corresponds to an input channel.
As described above, the output of each of the "compute scale factor component" devices or functions 455, 457 and 459 is applied to a respective normalizing device or function 461, 463 and 465. Such normalizers are desirable because the scale factor components computed by blocks 455, 457 and 459 are based on neighbor-compensated levels, while the final signal path matrixing (in the main matrix in the case of multiple modules, or in the local matrix in the case of an isolated module) operates on non-neighbor-compensated levels (the input signals applied to the matrix are not neighbor compensated). Typically, the normalizers reduce the values of the scale factor components.
One suitable way to implement the normalizers is as follows. Each normalizer receives the neighbor-compensated smoothed input energy (from combiners 331 and 333) for each input of the module, the non-neighbor-compensated smoothed input energy (from blocks 325 and 327) for each input of the module, the local matrix coefficient information from the local matrix, and the respective output of block 355, 357 or 359. Each normalizer calculates a desired output level for each output channel, and the actual output level for each output channel assuming a scale factor of 1. The calculated desired output for each output channel is then divided by the calculated actual output level for that channel, and the square root of the quotient provides a potential preliminary scale factor for application to "sum and/or larger of" 367. Consider the following example.
Assume that the smoothed, non-neighbor-compensated input energy levels of a two-input module are 6 and 8, and that the corresponding neighbor-compensated energy levels are 3 and 4. Assume also that the center internal output channel has matrix coefficients (.71, .71), or, squared, (.5, .5). If the module selects an initial scale factor (based on the neighbor-compensated levels) of 0.5 for this channel, or 0.25 squared, then the desired output level for this channel (assuming pure energy summation and, for simplicity, using the neighbor-compensated levels) is:
.25*(3*.5+4*.5)=0.875.
Since the actual input levels are 6 and 8, if the above squared scale factor of 0.25 were used for the final signal path matrixing, the output level would be:
.25*(6*.5+8*.5)=1.75
rather than the desired output level of 0.875. The normalizer adjusts the scale factor so as to obtain the desired output level when the non-neighbor-compensated levels are used.
Let SF = 1; the actual output is then 7 (6*.5 + 8*.5).
(desired output level) / (actual output assuming SF = 1) = 0.875/7 = 0.125 = squared final scale factor.
The final scale factor for this output channel is thus sqrt(0.125) = 0.354, instead of the originally calculated value of 0.5.
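The normalization example above can be reproduced with a short sketch (the function name and argument layout are assumptions):

```python
import math

def normalize_sf(initial_sf, coeffs, neighbor_levels, raw_levels):
    """Rescale a scale factor derived from neighbor-compensated levels so the
    desired output level results when matrixing non-compensated levels."""
    sq = [c * c for c in coeffs]
    desired = (initial_sf ** 2) * sum(c2 * lv for c2, lv in zip(sq, neighbor_levels))
    actual_at_unity = sum(c2 * lv for c2, lv in zip(sq, raw_levels))  # SF = 1
    if actual_at_unity == 0.0:
        return 0.0          # guard the zero denominator
    return math.sqrt(desired / actual_at_unity)
```

With the worked numbers (initial scale factor 0.5, coefficients (.71, .71), neighbor-compensated levels 3 and 4, raw levels 6 and 8), this yields the final scale factor of 0.354.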
"Sum and/or larger of" 367 preferably sums the corresponding fill and endpoint scale factor components for each output channel, per subband, and selects the larger of the dominant and fill scale factor components for each output channel, per subband. The function of the "sum and/or larger of" block 367 in its preferred form may be characterized as shown in FIG. 7. That is, the dominant and fill scale factor components are applied to a device or function 701 that selects the larger of the two scale factor components for each output ("larger of" 701) and applies the result to an additive combiner or combining function 703, which sums the scale factor component from larger-of 701 with the excess endpoint energy scale factor component of each output. Alternatively, acceptable results may be obtained when "sum and/or larger of" 367: (1) sums in both region 1 and region 2, (2) takes the larger in both region 1 and region 2, or (3) selects the maximum in region 1 and sums in region 2.
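In its preferred form, the block reduces to one line per output channel (a sketch):

```python
def preliminary_sf(dominant, fill, excess_endpoint):
    # larger of the dominant and fill components, plus the excess endpoint component
    return max(dominant, fill) + excess_endpoint
```

Alternatives (1) through (3) above would replace the max/sum combination accordingly.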
FIG. 8 is an idealized representation of the manner in which an aspect of the present invention generates scale factor components in response to a measure of cross-correlation. This figure is particularly useful for reference in connection with the examples of FIGS. 9A and 9B through 16A and 16B. As described above, the generation of the scale factor components may be considered to have two regions of operation: a first region, region 1, bounded by "all dominant" and "uniform fill," in which the available scale factor components are a mixture of the dominant and fill scale factors, and a second region, region 2, bounded by "uniform fill" and "all endpoints," in which the available scale factor components are a combination of the fill and excess endpoint energy scale factor components. The "all dominant" boundary condition occurs when direction-weighted_xcor is 1. Region 1 (dominant + fill) extends from this boundary to the point where direction-weighted_xcor equals random_xcor, the "uniform fill" condition. The "all endpoints" boundary condition occurs when direction-weighted_xcor is 0. Region 2 (fill + endpoint) extends from the "uniform fill" boundary condition to the "all endpoints" boundary condition. The "uniform fill" boundary point may be considered to lie in either region 1 or region 2; the exact boundary point is not critical, as described below.
As shown in FIG. 8, as the value of the dominant scale factor component decreases, the value of the fill scale factor component increases, reaching a maximum when the dominant scale factor component reaches zero, at which point the value of the excess endpoint energy scale factor component increases as the value of the fill scale factor component decreases. When applied to an appropriate matrix receiving the module's input signals, the result is an output signal distribution that provides a compact sound image when the input signals are highly correlated, an image that diffuses (widens) from compact to broad as the correlation decreases, and that gradually splits or bows outward from broad into multiple compact sound images, one at each endpoint, as the correlation continues to decrease toward highly uncorrelated.
Although it is desirable to have a single spatially compact sound image (in the nominal principal direction of the input signals) for the case of perfect correlation and multiple spatially compact sound images (one at each endpoint) for the case of complete uncorrelation, the spatially diffuse sound images between these extremes may be realized in ways other than that shown in FIG. 8. It is not critical, for example, that the fill scale factor component reach its maximum value at direction-weighted_xcor = random_xcor, nor that the values of the three scale factor components change linearly as shown. The present invention also contemplates modifications of the relationships of FIG. 8 (and the equations expressed in the figure) and other relationships between suitable measures of cross-correlation and scale factor values that produce a compact-dominant to broad-spread to compact-endpoint signal distribution for measures of cross-correlation ranging from highly correlated to highly uncorrelated. For example, rather than obtaining such a distribution by employing the dual-region approach described above, the result may be obtained by mathematical methods (e.g., an equation solution using a pseudo-inverse).
Output scaling factor examples
A series of idealized representations (fig. 9A and 9B to fig. 16A and 16B) show the output scaling factor of the module for various examples of input signal conditions. For simplicity, a single, isolated block is assumed, so that the scaling factor it produces for the variable matrix is the final scaling factor. The module and associated variable matrix have two input channels (such as left L and right R) that coincide with two endpoint output channels (which may also be designated as L and R). In this series of examples, there are three internal output channels (such as a left middle portion Lm, a center C, and a right middle portion Rm).
The meanings of "all dominant", "mixed dominant and fill", "uniform fill", "mixed fill and endpoints", and "all endpoints" are further explained in connection with the examples of FIGS. 9A and 9B through 16A and 16B. In each pair of figures (e.g., FIGS. 9A and 9B), "A" shows the energy levels of the two inputs (left L and right R) and "B" shows the scale factors for the five outputs (left L, mid-left Lm, center C, mid-right Rm, and right R). The figures are not drawn to scale.
In FIG. 9A, the input energy levels, shown as two vertical arrows, are the same. In addition, direction-weighted_xcor (and effective_xcor) is 1.0 (fully correlated). In this example there is only one non-zero scale factor, shown in FIG. 9B as a single vertical arrow at C, which is applied to the center internal channel C output, resulting in a spatially compact dominant signal. In this example the output is centered (L/R = 1) and thus coincides with the center internal output channel C. If there were no coincident output channel, the dominant signal would be applied to the nearest output channels in appropriate proportions so as to pan the dominant signal to the correct virtual position between them. If, for example, there were no center output channel C, the mid-left Lm and mid-right Rm output channels would have non-zero scale factors such that the dominant signal is applied equally to the Lm and Rm outputs. In this fully correlated (all dominant) case, there are no fill signal components and no endpoint signal components. Thus, the preliminary scale factor produced by block 467 (FIG. 4C) is the same as the normalized dominant scale factor component produced by block 361.
In FIG. 10A, the input energy levels are equal, but direction-weighted_xcor is less than 1.0 and greater than random_xcor. Thus, the scale factor components are those of region 1, mixed dominant and fill. The larger of the normalized dominant scale factor component (from block 361) and the normalized fill scale factor component (from block 363) is applied to each output channel (through block 367) so that, as shown in FIG. 10B, the dominant scale factor remains at the same center output channel C, but is smaller, while fill scale factors appear at each of the other output channels (L, Lm, Rm and R, including endpoints L and R).
In FIG. 11A, the input energy levels remain equal, but direction-weighted_xcor = random_xcor. Thus, as shown in FIG. 11B, the scale factors are those of the boundary condition between regions 1 and 2, the uniform fill condition, in which there are no dominant or endpoint scale factors, only a fill scale factor having the same value at each output (hence "uniform fill"), as indicated by the identical arrows at each output. The fill scale factor level reaches its highest value in this example. As discussed below, the fill scale factors may be applied non-uniformly, such as in a tapered manner, depending on the input signal conditions.
In FIG. 12A, the input energy levels remain equal, but direction-weighted_xcor is less than random_xcor and greater than 0 (region 2). Thus, as shown in FIG. 12B, there are fill and endpoint scale factors, but no dominant scale factor.
In FIG. 13A, the input energy levels remain equal, but direction-weighted_xcor is 0. Thus, as shown in FIG. 13B, the scale factors are those of the "all endpoints" boundary condition. There are no internal output scale factors, only endpoint scale factors.
In the examples of FIGS. 9A/B through 13A/B, direction-weighted_xcor (such as produced by block 441 of FIG. 4B) is the same as neighbor-compensated_xcor (such as produced by block 439 of FIG. 4B) because the energy levels of the two inputs are equal. In FIG. 14A, however, the input energy levels are not equal (L is greater than R). Although neighbor-compensated_xcor is equal to random_xcor in this example, the resulting scale factors shown in FIG. 14B are not fill scale factors applied uniformly to all channels as in the example of FIGS. 11A and 11B. Instead, the unequal input energy levels cause direction-weighted_xcor to be scaled upward (in proportion to the extent to which the nominal principal direction is away from the center position) so that it becomes greater than neighbor-compensated_xcor, thereby weighting the scale factors more toward all dominant (as shown in FIG. 8). This is a desirable result, because strongly L- or R-weighted signals should not have a wide width; they should have a compact width near the L or R endpoint channel. The resulting output shown in FIG. 14B is a non-zero dominant scale factor located closer to the L output than to the R output (in this case, the neighbor-compensated directional information places the dominant component exactly at the mid-left Lm position), reduced fill scale factor magnitudes, and no endpoint scale factors (the directional weighting pushes the operation into region 1 of FIG. 8, mixed dominant and fill).
For the five outputs corresponding to the scale factors of FIG. 14B, the outputs can be expressed as:
Lout = Lt(SFL)
MidLout = ((.92)Lt + (.38)Rt)(SFMidL)
Cout = ((.71)Lt + (.71)Rt)(SFC)
MidRout = ((.38)Lt + (.92)Rt)(SFMidR)
Rout = Rt(SFR).
Thus, in the example of FIG. 14B, even though the scale factors (SF) for the four outputs other than MidLout are equal (the fill value), the corresponding signal outputs are not equal (more signal is output toward the left), because Lt is greater than Rt, and the dominant output at MidLeft is greater than the scale factor alone would indicate. Since the nominal principal direction coincides with the MidLeft output channel, the ratio of Lt to Rt is the same as the ratio of the matrix coefficients of the MidLeft output channel, i.e., 0.92 to 0.38. Assume these are the actual amplitudes of Lt and Rt. To calculate the output levels, these amplitudes are multiplied by the respective matrix coefficients, summed, and scaled by the corresponding scale factors:
output_amplitude(output_channel_i) = sf(i) * (Lt_Coeff(i)*Lt + Rt_Coeff(i)*Rt)
Although it is preferable to take into account the mixing of amplitude and energy addition (as in the calculations related to FIG. 6A), in this example the cross-correlation is very high (large dominant scale factor), so ordinary summation can be used:
Lout=0.1*(1*0.92+0*0.38)=0.092
MidLout=0.9*(0.92*0.92+0.38*0.38)=0.900
Cout=0.1*(0.71*0.92+0.71*0.38)=0.092
MidRout=0.1*(0.38*0.92+0.92*0.38)=0.070
Rout=0.1*(0*0.92+1*0.38)=0.038
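As an illustration only (not part of the disclosed embodiments), the per-channel calculation above can be sketched in a few lines of code; the channel names, coefficients, and scaling factors are the rounded values quoted in the FIG. 14B example. Note that with the rounded coefficients 0.92/0.38 the MidLout value comes to about 0.892 rather than the quoted 0.900, which corresponds to exact (unrounded) coefficients.

```python
# Hypothetical sketch of the output-level calculation for the FIG. 14B example.
# Scaling factors and (Lt_Coeff, Rt_Coeff) pairs are the rounded values quoted
# in the text; names are illustrative, not taken from the disclosure.
SF = {"Lout": 0.1, "MidLout": 0.9, "Cout": 0.1, "MidRout": 0.1, "Rout": 0.1}
COEFFS = {
    "Lout": (1.0, 0.0),
    "MidLout": (0.92, 0.38),
    "Cout": (0.71, 0.71),
    "MidRout": (0.38, 0.92),
    "Rout": (0.0, 1.0),
}

def output_levels(Lt, Rt):
    """output_amplitude(i) = sf(i) * (Lt_Coeff(i)*Lt + Rt_Coeff(i)*Rt)."""
    return {ch: SF[ch] * (lc * Lt + rc * Rt) for ch, (lc, rc) in COEFFS.items()}

# Lt and Rt magnitudes match the MidL matrix coefficients, as assumed above.
levels = output_levels(Lt=0.92, Rt=0.38)
```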
Thus, this example illustrates that because Lt is greater than Rt, even though the scaling factors for Lout, Cout, MidRout, and Rout are equal, the signal outputs at those channels are unequal.
As shown in the examples of FIGS. 10B, 11B, 12B, and 14B, the fill scaling factors may be assigned equally to the output channels. Alternatively, the fill scaling factor component need not be uniform, but may vary with position as a function of the dominant (correlated) and/or endpoint (uncorrelated) input signal components (or, equivalently, as a function of the direction-weighted_xcor value). For suitably high values of direction-weighted_xcor, the fill scaling factor magnitudes may be convexly warped so that output channels near the nominal principal direction receive more signal level than channels farther from that direction. For direction-weighted_xcor = random_xcor, the fill scaling factor magnitudes may be flattened to a uniform distribution, and for direction-weighted_xcor < random_xcor, the magnitudes may be concavely warped, favoring channels near the endpoint directions.
Examples of such warped fill scaling factor magnitudes are set forth in FIGS. 15B and 16B. FIG. 15B shows the resulting output for the input of FIG. 15A, which is the same input as in FIG. 10A described above. FIG. 16B shows the resulting output for the input of FIG. 16A, which corresponds to the case of FIG. 12B described above.
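A hypothetical sketch of such a warping rule follows; the exponential weighting and the gain k are my illustrative choices, not taken from the disclosure. The shape of the fill distribution is controlled by the difference between direction-weighted_xcor and random_xcor.

```python
import math

def fill_weights(positions, dominant_pos, dw_xcor, random_xcor, k=4.0):
    """Distribute the fill scale-factor component over output channels.

    gamma > 0 (dw_xcor > random_xcor): convex, peaked near the dominant direction.
    gamma = 0 (dw_xcor == random_xcor): flat, uniform distribution.
    gamma < 0 (dw_xcor < random_xcor): concave, favoring the endpoint directions.
    """
    gamma = k * (dw_xcor - random_xcor)
    w = [math.exp(-gamma * abs(p - dominant_pos)) for p in positions]
    total = sum(w)
    return [x / total for x in w]
```

With five output positions and the nominal principal direction at the center, the same function yields a flat, a peaked, or an endpoint-favoring distribution depending only on the xcor difference.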
Communication between a module and the monitor
Neighbor levels and high-level neighbor levels
Each module in a multiple-module arrangement (such as the examples of FIGS. 1A and 2 and of FIGS. 1B and 2') requires two mechanisms to support communication between it and a monitor (such as monitor 201 of FIGS. 2 and 2'):
(a) Reporting to the monitor the information it needs to calculate the neighbor levels and the high-level neighbor levels (if any). The information required by the monitor is the total estimated internal energy attributable to each of the module's inputs, as produced, for example, by the arrangement of FIG. 6A.
(b) Receiving and applying the neighbor levels (if any) and the high-level neighbor levels (if any) from the monitor. In the example of FIG. 4B, the neighbor level is subtracted from the smoothed energy level of each input in combiners 431 and 433, and the high-level neighbor level (if any) is subtracted from the smoothed energy level of each input and from the common energy across the channels in combiners 431, 433, and 435.
Once the monitor knows the total estimated internal energy contribution of each input of each module:
(1) It determines whether the total estimated internal energy contribution at each input (summed over all modules connected to that input) exceeds the total available signal level at that input. If it does, the monitor scales back the internal energies reported by the modules connected to that input so that they sum to the total input level.
(2) It informs each module of the neighbor level at each of its inputs, namely the sum of all other internal energy contributions (if any) to that input.
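Steps (1) and (2) can be sketched as follows for a single input node; this is a simplified illustration, and the function and variable names are mine, not from the disclosure.

```python
def monitor_neighbor_levels(reported_internal, total_input_level):
    """Scale back over-reported internal energies at one input node, then
    return (scaled_reports, neighbor_levels), where each module's neighbor
    level is the sum of all the other modules' contributions to that input."""
    s = sum(reported_internal)
    if s > total_input_level and s > 0.0:
        # Step (1): the reports overshoot the available level; rescale so
        # that they sum exactly to the total input level.
        scale = total_input_level / s
        reported_internal = [e * scale for e in reported_internal]
    total = sum(reported_internal)
    # Step (2): each module's neighbor level excludes its own contribution.
    return reported_internal, [total - e for e in reported_internal]
```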
The high-level (HO) neighbor levels are the neighbor levels of one or more higher-level modules that share an input with a lower-level module. The neighbor-level calculation above involves only modules of the same hierarchy level at a particular input (all three-input modules, if any, then all two-input modules, etc.). The HO neighbor level of a module is the sum of the neighbor levels of all higher-level modules at that input (i.e., the HO neighbor level at an input of a two-input module is the sum of the contributions of all three-input, four-input, and higher-level modules, if any, sharing that node with the two-input module). Once a module knows the HO neighbor levels at each of its inputs, it subtracts them, together with the same-tier neighbor levels, from the total input energy level to obtain the neighbor-compensated level at that input node. This is shown in FIG. 4B: the neighbor levels for input 1 and input m are subtracted from the outputs of variable slow smoothers 425 and 427 in combiners 431 and 433, respectively, and the high-level neighbor levels for input 1, input m, and the common energy are subtracted from the outputs of variable slow smoothers 425, 427, and 429 in combiners 431, 433, and 435, respectively.
One difference between the use of the neighbor levels and the HO neighbor levels for compensation is that the HO neighbor levels are also used to compensate the common energy across the input channels (e.g., by subtraction of an HO-derived level in combiner 435). The rationale for this distinction is that the common level of a module is not affected by neighboring modules at the same hierarchy level, but it can be affected by a higher-level module sharing all of the module's inputs.
For example, assume input channels Ls (left surround), Rs (right surround), and Top, with an interior output channel in the middle of the triangle between them (elevated rear), plus an interior output channel on the line between Ls and Rs (main horizontal rear). The former output channel requires a three-input module to recover a signal common to all three inputs. The latter output channel lies on a line between two inputs (Ls and Rs) and so requires a two-input module. However, the total common signal level observed by the two-input module includes the common element of the three-input module, which does not belong to the latter output channel; therefore the square root of the pairwise product of the HO neighbor levels is subtracted from the common energy of the two-input module to determine how much common energy is contributed by its own interior channel alone. Thus, in FIG. 4B, the derived HO common level is subtracted from the smoothed common energy level (from block 429) to obtain the neighbor-compensated common energy level (from combiner 435), which the module uses to calculate (in block 439) neighbor-compensated_xcor.
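The two compensation rules above (per-input subtraction of same-tier and HO neighbor levels, and the square-root-of-pairwise-product correction of a two-input module's common energy) can be sketched as follows; the clamping at zero is my assumption for numerical safety, and the names are illustrative.

```python
import math

def neighbor_compensated_level(smoothed_input_level, same_tier_nb, ho_nb):
    # Per-input compensation: subtract the same-tier and higher-level neighbor
    # levels from the smoothed total input energy (cf. combiners 431/433).
    return max(smoothed_input_level - same_tier_nb - ho_nb, 0.0)

def neighbor_compensated_common(smoothed_common, ho_nb_input1, ho_nb_input2):
    # Common-energy compensation for a two-input module: remove the common
    # element contributed by a higher-level module sharing both inputs, via
    # the square root of the pairwise product of its HO neighbor levels at
    # the two inputs (cf. combiner 435).
    return max(smoothed_common - math.sqrt(ho_nb_input1 * ho_nb_input2), 0.0)
```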
The present invention and its various aspects may be implemented as analog circuitry or, more likely, as software functions performed in a digital signal processor, a programmed general-purpose digital computer, and/or a special-purpose digital computer. Interfaces between analog and digital signal streams may be implemented in suitable hardware and/or as functions in software and/or firmware. Although the invention and its aspects may involve analog or digital signals, in practical applications most or all of the processing functions are likely to be performed in the digital domain on digital signal streams in which the audio signals are represented by samples.
It should be understood that other variations and modifications of the invention and its various aspects will be apparent to those of ordinary skill in the art, and that the invention is not limited by the specific embodiments described herein. It is therefore contemplated that the present invention covers any and all modifications, variations, or equivalents that fall within the true spirit and scope of the basic underlying principles disclosed and claimed herein.

Claims (9)

1. A method for converting N audio input channels to M audio output channels, each of the N audio input channels being associated with a spatial direction, each of the M audio output channels being associated with a spatial direction, wherein M and N are both positive integers, N is 3 or greater, and M is 1 or greater, the method comprising:
deriving the M audio output channels from the N audio input channels, wherein one or more of the M audio output channels are associated with a spatial direction that is different from a spatial direction associated with any of the N audio input channels, at least one of the one or more of the M audio output channels being derived from a corresponding set of at least three of the N audio input channels, wherein at least one of the one or more of the M audio output channels is derived from the corresponding set of the at least three of the N audio input channels at least in part by approximating cross-correlation of the at least three of the N audio input channels, the cross-correlation having a value with a set lower limit; and
providing a blind upmix mode in which the audio input channel signals are augmented such that at least some signal outputs are provided in the derived audio output channels when at least one of the audio input channels from which the derived audio output channels are derived has a signal input, wherein the blind upmix of channels is performed by setting a lower limit on the values of the cross-correlation and obtaining a weighted average of the signal powers of the audio input channels from which the derived audio output channels are derived.
2. The method of claim 1, wherein approximating the cross-correlation comprises calculating a common energy for each pair of the at least three of the N audio input channels, and wherein the common energy for any of the pairs has a minimum value.
3. The method of claim 2, wherein the minimum value is based on a randomly equally distributed correlation value.
4. The method of claim 1, wherein the weight of each of the audio input channels from which the derived audio output channel is derived is a matrix coefficient of that audio input channel.
5. An apparatus for converting N audio input channels to M audio output channels, each of the N audio input channels associated with a spatial direction, each of the M audio output channels associated with a spatial direction, wherein M and N are both positive integers, N is 3 or greater, and M is 1 or greater, the apparatus comprising:
means for deriving the M audio output channels from the N audio input channels, wherein one or more of the M audio output channels are associated with a spatial direction that is different from a spatial direction associated with any of the N audio input channels, at least one of the one or more of the M audio output channels being derived from a corresponding set of at least three of the N audio input channels, wherein at least one of the one or more of the M audio output channels is derived from the corresponding set of the at least three of the N audio input channels at least in part by approximating cross-correlation of the at least three of the N audio input channels, the cross-correlation having a value with a set lower limit; and
means for providing a blind upmix mode in which the audio input channel signals are augmented such that at least some signal output is provided in a derived audio output channel when at least one of the audio input channels from which the derived audio output channel is derived has a signal input, wherein the blind upmix of the channels is performed by setting a lower limit on the values of the cross-correlation and obtaining a weighted average of the signal powers of the audio input channels from which the derived audio output channel is derived.
6. The apparatus of claim 5, wherein approximating the cross-correlation comprises calculating a common energy for each pair of the at least three of the N audio input channels, and wherein the common energy for any of the pairs has a minimum value.
7. The apparatus of claim 6, wherein the minimum value is based on a randomly equally distributed correlation value.
8. The apparatus of claim 5, wherein the weight of each of the audio input channels from which the derived audio output channel is derived is a matrix coefficient of that audio input channel.
9. An apparatus for converting N audio input channels to M audio output channels, each of the N audio input channels associated with a spatial direction, each of the M audio output channels associated with a spatial direction, wherein M and N are both positive integers, N is 3 or greater, and M is 1 or greater, the apparatus comprising:
at least one processor; and
at least one tangible storage device having computer instructions stored thereon that, when executed, cause the at least one processor to be configured for:
deriving the M audio output channels from the N audio input channels, wherein one or more of the M audio output channels are associated with a spatial direction that is different from a spatial direction associated with any of the N audio input channels, at least one of the one or more of the M audio output channels being derived from a corresponding set of at least three of the N audio input channels, wherein at least one of the one or more of the M audio output channels is derived from the corresponding set of the at least three of the N audio input channels at least in part by approximating cross-correlation of the at least three of the N audio input channels, the cross-correlation having a value with a set lower limit; and
providing a blind upmix mode in which the audio input channel signals are augmented such that at least some signal outputs are provided in the derived audio output channels when at least one of the audio input channels from which the derived audio output channels are derived has a signal input, wherein the blind upmix of channels is performed by setting a lower limit on the values of the cross-correlation and obtaining a weighted average of the signal powers of the audio input channels from which the derived audio output channels are derived.
HK16100846.8A 2008-12-18 2012-05-16 Audio channel spatial trnslation HK1214062B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US13882308P 2008-12-18 2008-12-18
US61/138,823 2008-12-18

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
HK12104833.9A Addition HK1164603B (en) 2008-12-18 2009-12-16 Audio channel spatial trnslation

Related Child Applications (1)

Application Number Title Priority Date Filing Date
HK12104833.9A Division HK1164603B (en) 2008-12-18 2009-12-16 Audio channel spatial trnslation

Publications (2)

Publication Number Publication Date
HK1214062A1 HK1214062A1 (en) 2016-07-15
HK1214062B true HK1214062B (en) 2018-05-11


Similar Documents

Publication Publication Date Title
US11805379B2 (en) Audio channel spatial translation
US7660424B2 (en) Audio channel spatial translation
EP1527655B1 (en) Audio channel spatial translation
WO2004019656A2 (en) Audio channel spatial translation
KR101341523B1 (en) How to Generate Multi-Channel Audio Signals from Stereo Signals
EP2997742B1 (en) An audio processing apparatus and method therefor
RU2752600C2 (en) Method and device for rendering an acoustic signal and a machine-readable recording media
JP2016501472A (en) Segment-by-segment adjustments to different playback speaker settings for spatial audio signals
HK1214062B (en) Audio channel spatial trnslation
HK1164603B (en) Audio channel spatial trnslation
HK1073963B (en) Audio channel spatial translation
HK1256578A1 (en) Bass management system and method for object-based audio