
HK1158805B - Audio spatial environment engine - Google Patents

Audio spatial environment engine

Info

Publication number
HK1158805B
Authority
HK
Hong Kong
Prior art keywords
audio data
channel
channels
hilbert
scaled
Prior art date
Application number
HK11113095.4A
Other languages
Chinese (zh)
Other versions
HK1158805A1 (en)
Inventor
Robert W. Reams
Jeffrey K. Thompson
Aaron Warner
Original Assignee
DTS (British Virgin Islands) Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 10/975,841 (US7929708B2)
Application filed by DTS (British Virgin Islands) Limited
Publication of HK1158805A1
Publication of HK1158805B

Description

Audio spatial environment engine
This application is a divisional of Chinese patent application No. 200580040670.5, entitled "Audio spatial environment engine", filed on May 28, 2007. The parent application has an international filing date of October 28, 2005 and international application number PCT/US2005/038961.
RELATED APPLICATIONS
The present application claims priority to U.S. provisional application 60/622,922, entitled "2-to-N Rendering", filed on October 28, 2004; U.S. patent application 10/975,841, entitled "Audio Spatial Environment Engine", filed on October 28, 2004; U.S. patent application 11/261,100, entitled "Audio Spatial Environment Down-Mixer", filed concurrently herewith; and U.S. patent application 11/262,029, entitled "Audio Spatial Environment Up-Mixer" (attorney docket 13646.0012), filed concurrently herewith, each of which is commonly owned and hereby incorporated by reference for all purposes.
Technical Field
The present invention relates to the field of audio data processing, and more particularly to a system and method for converting between different formats of audio data.
Background
Systems and methods for processing audio data are known in the art. Most such systems and methods are used to process audio data for a known audio environment, such as a two-channel stereo environment, a four-channel stereo environment, a five-channel surround sound environment (also referred to as a 5.1-channel environment), or other suitable format or environment.
One problem caused by the growing number of formats and environments is that audio data processed for optimal quality in a first environment generally cannot be used easily in a second audio environment. One example of this problem is the transmission or storage of surround sound data over an infrastructure or network designed for stereo data. Because an infrastructure built for two-channel stereo transmission or storage may not support the additional channels of a surround sound format, it is difficult or impossible to transmit or utilize surround sound data over existing infrastructure.
Disclosure of Invention
In accordance with the present invention, a system and method for an audio spatial environment engine is provided that overcomes known problems by converting between spatial audio environments.
Specifically, a system and method for an audio spatial environment engine is provided that allows conversion from N-channel data to M-channel data, and conversion from M-channel data back to N′-channel data, where N, M, and N′ are integers and N is not necessarily equal to N′.
According to an exemplary embodiment of the present invention, an audio spatial environment engine is provided for converting from an N-channel audio system to an M-channel audio system and back to an N′-channel audio system, where N, M, and N′ are integers and N does not have to be equal to N′. The audio spatial environment engine includes a dynamic down mixer that receives the N channels of audio data and converts the N channels of audio data into M channels of audio data. The audio spatial environment engine also includes an up-mixer that receives the M channels of audio data and converts the M channels of audio data into N′ channels of audio data, where N is not necessarily equal to N′. One exemplary application of this system is the transmission or storage of surround sound data over an infrastructure or network designed for stereo data. The dynamic downmix unit converts the surround sound data into stereo data for transmission or storage, and the upmix unit restores the stereo data to surround sound data for playback, processing, or some other suitable use.
The present invention provides a number of important technical advantages. One important technical advantage is a system that provides improved and flexible conversion between different spatial environments through an advanced dynamic down-mixing unit and a high-resolution frequency band up-mixing unit. The dynamic down-mixing unit includes an intelligent analysis and correction loop for correcting spectral, temporal, and spatial inaccuracies common to many down-mixing methods. The up-mixing unit extracts and analyzes important inter-channel spatial cues across the entire high-resolution frequency band to derive the spatial arrangement of the different frequency elements. The down-mixing and up-mixing units provide improved sound quality and spatial discrimination when used alone or together as a system.
Those skilled in the art will further appreciate the advantages and superior features of the invention together with other important aspects thereof upon reading the detailed description that follows in conjunction with the drawings.
Drawings
FIG. 1 is a diagram of a system for dynamic down-mixing using an analysis and correction loop in accordance with an exemplary embodiment of the present invention;
FIG. 2 is a diagram of a system for down-mixing data from N channels to M channels in accordance with an exemplary embodiment of the present invention;
FIG. 3 is a diagram of a system for down-mixing data from 5 channels to 2 channels according to an exemplary embodiment of the present invention;
FIG. 4 is a diagram of a sub-band vector calculation system according to an exemplary embodiment of the present invention;
FIG. 5 is a diagram of a sub-band correction system according to an exemplary embodiment of the present invention;
FIG. 6 is a diagram of a system for upmixing data from M channels to N channels according to an exemplary embodiment of the present invention;
FIG. 7 is a diagram of a system for upmixing data from 2 channels to 5 channels according to an exemplary embodiment of the present invention;
FIG. 8 is a diagram of a system for upmixing data from 2 channels to 7 channels according to an exemplary embodiment of the present invention;
FIG. 9 is a diagram of a method for extracting inter-channel spatial cues and generating spatial channel filtering for frequency domain application according to an exemplary embodiment of the invention;
FIG. 10A is a diagram of an exemplary left front channel filter map according to an exemplary embodiment of the present invention;
FIG. 10B is a diagram of an exemplary right front channel filter map;
FIG. 10C is a diagram of an exemplary center channel filter map;
FIG. 10D is a diagram of an exemplary left surround channel filter map; and
FIG. 10E is a diagram of an exemplary right surround channel filter map.
Detailed Description
In the description that follows, like parts are marked throughout the specification and drawings with the same reference numerals. The drawings may not be to scale and certain elements may be shown in generalized or schematic form and identified by trade names for clarity and conciseness.
Fig. 1 is a diagram of a system 100 for dynamic downmixing from an N-channel audio format to an M-channel audio format using an analysis and correction loop, according to an exemplary embodiment of the invention. The system 100 uses 5.1 channel sound (i.e., N = 5) and converts the 5.1 channel sound to stereo (i.e., M = 2), but other suitable numbers of input and output channels can also or instead be used.
The dynamic downmix process of the system 100 is implemented using a reference downmix 102, a reference upmix 104, subband vector computing systems 106 and 108, and a subband correction system 110. The analysis and correction loop is implemented by a reference upmix 104, subband vector computing systems 106 and 108, and a subband correction system 110, where the reference upmix 104 simulates the upmix process, the subband vector computing systems 106 and 108 compute energy and position vectors for each band of the simulated upmix and the original signal, and the subband correction system 110 compares the energy and position vectors of the simulated upmix and the original signal and adjusts the inter-channel spatial cues of the downmix signal to correct for any inconsistencies.
The system 100 includes a static reference downmix 102 that converts received N-channel audio to M-channel audio. The static reference downmix 102 receives the 5.1 sound channels left L(T), right R(T), center C(T), left surround LS(T), and right surround RS(T) and converts the 5.1 channel signal into the stereo channel signals left watermark LW′(T) and right watermark RW′(T).
The stereo channel signals left watermark LW′(T) and right watermark RW′(T) are then provided to a reference up-mix 104, which converts the stereo channels into 5.1 sound channels. The reference up-mix 104 outputs the 5.1 sound channels left L′(T), right R′(T), center C′(T), left surround LS′(T), and right surround RS′(T).
The upmixed 5.1 channel sound signal output from the reference upmix 104 is then provided to a subband vector computing system 106. The outputs from the subband vector computing system 106 are the upmix energy and image position data for multiple bands of the upmixed 5.1 channel signals L′(T), R′(T), C′(T), LS′(T), and RS′(T). Likewise, the original 5.1 channel sound signal is provided to the subband vector computing system 108. The outputs from the subband vector computing system 108 are the source energy and image position data for multiple frequency bands of the original 5.1 channel signals L(T), R(T), C(T), LS(T), and RS(T). The energy and position vectors calculated by the subband vector computing systems 106 and 108 comprise, for each frequency band, a total energy measurement and a 2-dimensional vector indicating the perceived intensity and source position of that frequency element for a listener under ideal listening conditions. For example, the audio signal can be converted from the time domain to the frequency domain using a suitable filter bank, such as a Finite Impulse Response (FIR) filter bank, a Quadrature Mirror Filter (QMF) bank, a Discrete Fourier Transform (DFT), a Time Domain Aliasing Cancellation (TDAC) filter bank, or other suitable filter bank. The filter bank output is further processed to determine the total energy per frequency band and a normalized image position vector per frequency band.
The energy and position vector values output from the subband vector computing systems 106 and 108 are provided to a subband correction system 110, which analyzes the source energy and position of the original 5.1 channel sound against the upmix energy and position of the 5.1 channel sound as regenerated from the left watermark LW′(T) and right watermark RW′(T) stereo channel signals. Differences between the source and upmix energy and position vectors are then identified and corrected, on a per-subband basis, in the left watermark LW′(T) and right watermark RW′(T) signals to generate LW(T) and RW(T), so as to provide a more accurate downmix of the stereo channel signal and a more accurate 5.1 reproduction when the stereo channel signal is subsequently upmixed. The corrected left watermark LW(T) and right watermark RW(T) signals are output for transmission, reception by a stereo receiver, reception by a receiver with upmixing functionality, or other suitable uses.
In operation, the system 100 dynamically downmixes 5.1 channel sound to stereo sound through an intelligent analysis and correction loop that includes simulation, analysis, and correction of the entire downmix/upmix system. This is done by generating the static downmix stereo signals LW′(T) and RW′(T); simulating the subsequently upmixed signals L′(T), R′(T), C′(T), LS′(T), and RS′(T); and comparing those signals with the original 5.1 channel signal to identify and correct, on a subband basis, any energy or position vector differences that could affect the quality of the left watermark LW′(T) and right watermark RW′(T) stereo signals or of the subsequently upmixed surround channel signals. The subband correction process that produces the left watermark LW(T) and right watermark RW(T) stereo signals is performed such that, when LW(T) and RW(T) are upmixed, the resulting 5.1 channel sound matches the original input 5.1 channel sound with improved accuracy. Likewise, additional processing can be performed to allow any suitable number of input channels to be converted into a suitable number of watermarked output channels, such as 7.1 channel sound to watermarked stereo, 7.1 channel sound to watermarked 5.1 channel sound, custom sound channels (such as for car audio systems or movie theaters) to stereo, or other suitable conversions.
Fig. 2 is a diagram of a static reference downmix 200 according to an exemplary embodiment of the present invention. The static reference down-mix 200 can be used as the reference down-mix 102 of fig. 1 or in other suitable manners.
The reference downmix 200 converts N-channel audio into M-channel audio, where N and M are integers and N is greater than M. The reference downmix 200 receives input signals X1(T), X2(T), ... XN(T). For each input channel i, the input signal Xi(T) is provided to one of Hilbert transform units 202 to 206, which introduces a 90° phase-shifted signal. Other processes that implement a 90° phase shift, such as a Hilbert filter or an all-pass filter network, can also or instead be used in place of the Hilbert transform unit. For each input channel i, the Hilbert-transformed signal and the original input signal are then multiplied by first-stage multipliers 208 to 218 with predetermined proportionality constants Ci11 and Ci12, respectively, where the first subscript denotes the input channel number i, the second subscript denotes the first multiplier stage, and the third subscript denotes the multiplier number within the stage. The outputs of multipliers 208 to 218 are then summed by adders 220 to 224 to generate the fractional Hilbert signal X′i(T). Relative to the corresponding input signal Xi(T), the fractional Hilbert signal X′i(T) output from adders 220 to 224 has a variable amount of phase shift. The amount of phase shift depends on the proportionality constants Ci11 and Ci12, where a 0° phase shift corresponds to Ci11 = 0 and Ci12 = 1, and a ±90° phase shift corresponds to Ci11 = ±1 and Ci12 = 0. Any intermediate amount of phase shift is possible using appropriate values of Ci11 and Ci12.
Each fractional Hilbert signal X′i(T) for each input channel i is then multiplied by second-stage multipliers 226 to 242 with predetermined proportionality constants Ci2j, where the first subscript denotes the input channel number i, the second subscript denotes the second multiplier stage, and the third subscript denotes the output channel number j. The outputs of multipliers 226 to 242 are then appropriately summed by adders 244 to 248 to generate a respective output signal Yj(T) for each output channel j. The proportionality constant Ci2j for each input channel i and output channel j is determined by the spatial positions of input channel i and output channel j. For example, the proportionality constant Ci2j for a left input channel i and a right output channel j can be set to approximately zero to maintain spatial discrimination. Likewise, the proportionality constant Ci2j for a front input channel i and a front output channel j can be set to approximately one to maintain the spatial arrangement.
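The two-branch first stage described above (a 90° Hilbert branch scaled by Ci11, plus an unshifted branch scaled by Ci12, summed) can be sketched as follows. This is a minimal illustration using an FFT-based block Hilbert transformer; the function names and the block-FFT approach are assumptions for illustration, not the patented implementation:

```python
import numpy as np

def hilbert_90(x):
    """FFT-based Hilbert transformer: returns the 90-degree phase-shifted
    version of the real signal x (the imaginary part of its analytic signal)."""
    n = len(x)
    X = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(X * h).imag

def fractional_hilbert(x, c1, c2):
    """First-stage fractional Hilbert signal: the 90-degree branch scaled
    by c1 plus the unshifted branch scaled by c2, summed. (c1, c2) = (0, 1)
    gives a 0-degree shift; (c1, c2) = (+/-1, 0) gives a +/-90-degree shift;
    intermediate values give intermediate phase shifts."""
    return c1 * hilbert_90(x) + c2 * x
```

For a pure cosine at a bin frequency, `fractional_hilbert(x, 1, 0)` returns the corresponding sine, confirming the 90° shift.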
In operation, the reference downmix 200 combines N channels into M channels in a manner that allows the spatial relationships in the input signals to be maintained and extracted when a receiver receives the output signals. Further, the combination of the N channel sounds as shown generates M channel sound that is of acceptable quality to a listener in an M-channel audio environment. Thus, the reference downmix 200 can be used to convert N-channel sound into M-channel sound, which can be used by an M-channel receiver, an N-channel receiver with a suitable upmixer, or other suitable receiver.
Fig. 3 is a diagram of a static reference down-mix 300 according to an exemplary embodiment of the invention. As shown in fig. 3, the static reference downmix 300 is an implementation of the static reference downmix 200 of fig. 2, which converts 5.1-channel time domain data into stereo-channel time domain data. The static reference down-mix 300 can be used as the reference down-mix 102 of fig. 1 or in other suitable manners.
The reference downmix 300 comprises a Hilbert transform 302, which receives the left channel signal L(T) of the source 5.1 channel sound and performs a Hilbert transform on the time signal. The Hilbert transform introduces a 90° phase shift, and the resulting signal is then multiplied by multiplier 310 with a predetermined proportionality constant CL1. Other processes that implement a 90° phase shift, such as a Hilbert filter or an all-pass filter network, can also or instead be used in place of the Hilbert transform unit. The original left channel signal L(T) is multiplied by multiplier 312 with a predetermined proportionality constant CL2. The outputs of multipliers 310 and 312 are summed by adder 320 to generate the fractional Hilbert signal L′(T). Similarly, the right channel signal R(T) from the source 5.1 channel sound is processed by Hilbert transform 304 and multiplied by multiplier 314 with a predetermined proportionality constant CR1. The original right channel signal R(T) is multiplied by multiplier 316 with a predetermined proportionality constant CR2. The outputs of multipliers 314 and 316 are summed by adder 322 to generate the fractional Hilbert signal R′(T). The fractional Hilbert signals L′(T) and R′(T) output from adders 320 and 322 have variable amounts of phase shift relative to the corresponding input signals L(T) and R(T). The amount of phase shift depends on the proportionality constants CL1, CL2, CR1, and CR2, where a 0° phase shift corresponds to CL1 = 0, CL2 = 1, CR1 = 0, and CR2 = 1, and a ±90° phase shift corresponds to CL1 = ±1, CL2 = 0, CR1 = ±1, and CR2 = 0. Any intermediate amount of phase shift is possible using appropriate values of CL1, CL2, CR1, and CR2. The center channel input from the source 5.1 channel sound is provided to multiplier 318 as the fractional Hilbert signal C′(T), meaning that the center channel input signal is not phase shifted.
Multiplier 318 multiplies C′(T) by a predetermined proportionality constant C3, such as a 3 dB attenuation. The outputs of adders 320 and 322 and multiplier 318 are appropriately summed into the left watermark channel LW′(T) and the right watermark channel RW′(T).
The left surround channel LS(T) from the source 5.1 channel sound is provided to a Hilbert transform 306, while the right surround channel RS(T) from the source 5.1 channel sound is provided to a Hilbert transform 308. The outputs of the Hilbert transforms 306 and 308 are the fractional Hilbert signals LS′(T) and RS′(T), meaning that there is a full 90° phase shift between the LS(T) and LS′(T) signal pair and the RS(T) and RS′(T) signal pair. LS′(T) is then multiplied by multipliers 324 and 326 with predetermined proportionality constants CLS1 and CLS2, respectively. Similarly, RS′(T) is multiplied by multipliers 328 and 330 with predetermined proportionality constants CRS1 and CRS2, respectively. The outputs of multipliers 324 to 330 are provided as appropriate to the left watermark channel LW′(T) and the right watermark channel RW′(T).
The adder 332 receives the left channel signal output from adder 320, the center channel signal output from multiplier 318, the left surround channel signal output from multiplier 324, and the right surround channel signal output from multiplier 328, and adds these signals to form the left watermark channel LW′(T). Likewise, the adder 334 receives the center channel signal output from multiplier 318, the right channel signal output from adder 322, the left surround channel signal output from multiplier 326, and the right surround channel signal output from multiplier 330, and adds these signals to form the right watermark channel RW′(T).
In operation, when the receiver receives left and right watermark channel stereo signals, the reference downmix 300 combines the source 5.1 channels in a manner that allows the spatial relationship in the 5.1 input channels to be maintained and extracted. Furthermore, the combination of 5.1 channel sounds as shown generates stereo sound of acceptable quality for a listener using a stereo receiver without surround sound up-mixing. Thus, the reference down-mix 300 can be used to convert 5.1 channel sound into stereo sound, which can be used by a stereo receiver, a 5.1 channel receiver with an appropriate up-mixer, a 7.1 channel receiver with an appropriate up-mixer, or other appropriate receiver.
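The 5.1-to-stereo topology of FIG. 3 can be sketched as a weighted sum of the channels, with the surrounds 90°-phase-shifted before mixing and the left/right channels taken with a 0° shift (CL1 = 0, CL2 = 1). All coefficient values below are illustrative assumptions, not the patented constants:

```python
import numpy as np

def _h90(x):
    # 90-degree phase shift via the analytic signal (even-length x assumed)
    n = len(x)
    X = np.fft.fft(x)
    w = np.zeros(n)
    w[0] = w[n // 2] = 1.0
    w[1:n // 2] = 2.0
    return np.fft.ifft(X * w).imag

def downmix_5to2(L, R, C, LS, RS,
                 c3=0.7071, cls1=0.7071, cls2=0.5, crs1=0.5, crs2=0.7071):
    """Static 5.1 -> stereo reference downmix sketch (FIG. 3 topology):
        LW'(T) = L + c3*C + cls1*LS' + crs1*RS'
        RW'(T) = R + c3*C + cls2*LS' + crs2*RS'
    where LS' and RS' are the 90-degree phase-shifted surrounds and c3
    scales the center (~3 dB here). Coefficients are illustrative only."""
    LSh, RSh = _h90(LS), _h90(RS)
    Cp = c3 * C
    LW = L + Cp + cls1 * LSh + crs1 * RSh
    RW = R + Cp + cls2 * LSh + crs2 * RSh
    return LW, RW
```

A center-only input appears equally (attenuated) in both watermark channels, while a left-only input stays in LW′(T), matching the spatial-discrimination goal stated above.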
Fig. 4 is a diagram of a subband vector computing system 400 according to an exemplary embodiment of the present invention. The sub-band vector calculation system 400 provides energy and position vector data for a plurality of frequency bands and can be used as the sub-band vector calculation systems 106 and 108 of fig. 1. Although 5.1 channel sound is shown, other suitable channel configurations can be used.
The subband vector calculating system 400 includes time-frequency analysis units 402 to 410. The 5.1 time-domain channels L(T), R(T), C(T), LS(T), and RS(T) are provided to time-frequency analysis units 402 to 410, respectively, which convert the time-domain signals into frequency-domain signals. These time-frequency analysis units can be suitable filter banks, such as a Finite Impulse Response (FIR) filter bank, a Quadrature Mirror Filter (QMF) bank, a Discrete Fourier Transform (DFT), a Time Domain Aliasing Cancellation (TDAC) filter bank, or other suitable filter bank. The magnitude or energy values of each frequency band of L(F), R(F), C(F), LS(F), and RS(F) are output from the time-frequency analysis units 402 to 410. These magnitude/energy values comprise a magnitude/energy measurement for each frequency band component of each respective channel. The magnitude/energy measurements are summed by an adder 412, which outputs T(F), the total energy of the input signal for each frequency band. Each channel's magnitude/energy value is then divided by T(F) in division units 414 to 422 to generate the corresponding normalized inter-channel level difference (ICLD) signals ML(F), MR(F), MC(F), MLS(F), and MRS(F), where these ICLD signals can be considered normalized subband energy estimates for each channel.
The 5.1 channel sound is mapped to normalized position vectors, as shown with an exemplary localization on a 2-dimensional plane defined by horizontal and vertical axes. As shown, (XLS, YLS) is assigned to the origin, (XRS, YRS) is assigned to (1, 0), and (XL, YL) is assigned to (0, 1 − C), where C is a value between 0 and 1 representing the setback distance of the left and right speakers from the back of the room. Likewise, (XR, YR) has the value (1, 1 − C). Finally, the value of (XC, YC) is (0.5, 1). These coordinates are exemplary and can be changed to reflect the actual normalized positioning or configuration of the speakers relative to each other, as speaker coordinates differ based on the size of the room, the shape of the room, or other factors. For example, when 7.1 sound or another suitable channel configuration is used, additional coordinate values can be provided that reflect the positioning of the loudspeakers around the room. Likewise, the speaker positioning can be customized based on the actual distribution of speakers in a car, room, auditorium, theater, or other suitable place.
The estimated image position vector P(F) can be calculated for each subband as set forth in the following vector equation:

P(F) = ML(F)·(XL, YL) + MR(F)·(XR, YR) + MC(F)·(XC, YC) + MLS(F)·(XLS, YLS) + MRS(F)·(XRS, YRS)
Thus, for each frequency band, the total energy T(F) and the position vector P(F) are output, which define the perceived intensity and position of the apparent source for that frequency band. In this manner, the spatial image of each frequency component can be localized, such as for the sub-band correction system 110 or for other suitable purposes.
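The per-band total energy and image position computation described above can be sketched directly from the vector equation. The speaker coordinates and setback value below are illustrative assumptions patterned on the exemplary layout, and the function name is mine:

```python
import numpy as np

# Illustrative speaker coordinates on the unit square (assumptions: LS at
# the origin, C at (0.5, 1), and a front-speaker setback of 0.2).
C_SETBACK = 0.2
COORDS = {
    "L":  (0.0, 1.0 - C_SETBACK),
    "R":  (1.0, 1.0 - C_SETBACK),
    "C":  (0.5, 1.0),
    "LS": (0.0, 0.0),
    "RS": (1.0, 0.0),
}

def subband_vectors(mags):
    """mags: dict channel -> array of per-band magnitudes |X(F)|.
    Returns (T, P): total energy T(F) per band, and the 2-D image position
    P(F) = sum_i M_i(F) * (X_i, Y_i), where M_i(F) = |X_i(F)| / T(F)."""
    T = sum(mags.values())
    T_safe = np.where(T > 0, T, 1e-12)   # guard silent bands
    P = np.zeros((len(T_safe), 2))
    for ch, (x, y) in COORDS.items():
        M = mags[ch] / T_safe            # normalized ICLD signal M_ch(F)
        P[:, 0] += M * x
        P[:, 1] += M * y
    return T, P
```

A band carried entirely by the center channel maps to the center coordinate (0.5, 1), as expected.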
Fig. 5 is a diagram of a subband correction system according to an exemplary embodiment of the present invention. The sub-band correction system can be used as the sub-band correction system 110 of FIG. 1 or for other suitable purposes. The sub-band correction system receives the left watermark LW′(T) and right watermark RW′(T) stereo channel signals and applies energy and image correction to the watermark signals to compensate, for each frequency band, for signal inaccuracies that may arise from reference downmixing or other suitable methods. The sub-band correction system receives and uses, for each frequency band, the total energy signal of the source TSOURCE(F) and the total energy signal of the subsequent upmixed signal TUMIX(F), as well as the position vector of the source PSOURCE(F) and the position vector of the subsequent upmixed signal PUMIX(F), such as those generated by the subband vector computing systems 106 and 108 of FIG. 1. These total energy signals and position vectors are used to determine the appropriate corrections and compensations.
The sub-band correction system includes a position correction system 500 and a spectral energy correction system 502. The position correction system 500 receives the time-domain signals for the left watermark stereo channel LW′(T) and the right watermark stereo channel RW′(T), which are converted from the time domain to the frequency domain by time-frequency analysis units 504 and 506, respectively. These time-frequency analysis units can be suitable filter banks, such as Finite Impulse Response (FIR) filter banks, Quadrature Mirror Filter (QMF) banks, Discrete Fourier Transforms (DFT), Time Domain Aliasing Cancellation (TDAC) filter banks, or other suitable filter banks.
The outputs of the time-frequency analysis units 504 and 506 are the frequency-domain subband signals LW′(F) and RW′(F). The relevant spatial cues of inter-channel level difference (ICLD) and inter-channel coherence (ICC) are adjusted for each subband of the signals LW′(F) and RW′(F). For example, these cues can be adjusted by manipulating the magnitude or energy of LW′(F) and RW′(F) (shown as the absolute values |LW′(F)| and |RW′(F)|) and the phase angles of LW′(F) and RW′(F). Correction of the ICLD is performed by multiplying the magnitude/energy value of LW′(F) in multiplier 508 by the value generated by the following equation:

[XMAX − PX,SOURCE(F)] / [XMAX − PX,UMIX(F)]

where

XMAX = maximum X coordinate boundary
PX,SOURCE(F) = estimated subband X position coordinate of the source vector
PX,UMIX(F) = estimated subband X position coordinate of the subsequent upmix vector
Likewise, the magnitude/energy of RW′(F) is multiplied in multiplier 510 by the value generated by the following equation:

[PX,SOURCE(F) − XMIN] / [PX,UMIX(F) − XMIN]

where

XMIN = minimum X coordinate boundary
Correction of the ICC is made by adding to the phase angle of LW′(F), in adder 512, the value generated by the following equation:

±π · [PY,SOURCE(F) − PY,UMIX(F)] / [YMAX − YMIN]

where

PY,SOURCE(F) = estimated subband Y position coordinate of the source vector
PY,UMIX(F) = estimated subband Y position coordinate of the subsequent upmix vector
YMAX = maximum Y coordinate boundary
YMIN = minimum Y coordinate boundary
Likewise, the phase angle of RW′(F) is added in adder 514 to the value generated by the equation:

∓π · [PY,SOURCE(F) − PY,UMIX(F)] / [YMAX − YMIN]
Note that the phase angle offsets added to LW′(F) and RW′(F) have equal values but opposite polarities, with the resulting polarity determined by the leading phase angle between LW′(F) and RW′(F).
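Putting the ICLD gain equations and the ICC phase offsets together, one per-band correction step might look like the sketch below. The way the leading channel is detected (sign of the cross-spectrum angle) is an illustrative assumption:

```python
import numpy as np

def position_correction(LW, RW, p_src, p_umx,
                        xmin=0.0, xmax=1.0, ymin=0.0, ymax=1.0):
    """Per-band ICLD/ICC correction sketch. LW, RW: complex subband values;
    p_src, p_umx: arrays of (x, y) position vectors per band.
    Left gain:  [Xmax - Px_src] / [Xmax - Px_umx]
    Right gain: [Px_src - Xmin] / [Px_umx - Xmin]
    Phase angles receive equal and opposite offsets proportional to the
    Y-position error; the leading-channel sign test is an assumption."""
    gl = (xmax - p_src[:, 0]) / (xmax - p_umx[:, 0])
    gr = (p_src[:, 0] - xmin) / (p_umx[:, 0] - xmin)
    dphi = np.pi * (p_src[:, 1] - p_umx[:, 1]) / (ymax - ymin)
    lead = np.where(np.angle(LW * np.conj(RW)) >= 0, 1.0, -1.0)
    LWc = gl * np.abs(LW) * np.exp(1j * (np.angle(LW) + lead * dphi))
    RWc = gr * np.abs(RW) * np.exp(1j * (np.angle(RW) - lead * dphi))
    return LWc, RWc
```

When the source and upmix vectors already agree, the gains are 1 and the phase offsets are 0, so the signals pass through unchanged.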
The corrected LW′(F) magnitude/energy and phase angle are recombined by adder 516 to form a complex value LW(F) for each subband, which is then converted to the left watermark time-domain signal LW(T) by frequency-time synthesis unit 520. Likewise, the corrected RW′(F) magnitude/energy and phase angle are recombined by adder 518 to form a complex value RW(F) for each subband, which is then converted to the right watermark time-domain signal RW(T) by frequency-time synthesis unit 522. The frequency-time synthesis units 520 and 522 can be suitable synthesis filter banks that convert the frequency-domain signals back to time-domain signals.
As shown in this exemplary embodiment, the inter-channel spatial cues for each spectral component of the watermarked left and right channel signals can be corrected by using the position correction system 500, which appropriately adjusts the ICLD and ICC spatial cues.
The spectral energy correction system 502 can be used to ensure that the overall spectral balance of the downmix signal matches the overall spectral balance of the original 5.1 signal, thus compensating, for example, for spectral shifts caused by comb filtering. The left watermark time-domain signal LW′(T) and the right watermark time-domain signal RW′(T) are converted from the time domain to the frequency domain using time-frequency analysis units 524 and 526, respectively. These time-frequency analysis units can be suitable filter banks, such as Finite Impulse Response (FIR) filter banks, Quadrature Mirror Filter (QMF) banks, Discrete Fourier Transforms (DFT), Time Domain Aliasing Cancellation (TDAC) filter banks, or other suitable filter banks. The outputs of the time-frequency analysis units 524 and 526 are the frequency subband signals LW′(F) and RW′(F), which are multiplied by multipliers 528 and 530 with TSOURCE(F)/TUMIX(F), where

TSOURCE(F) = |L(F)| + |R(F)| + |C(F)| + |LS(F)| + |RS(F)|
TUMIX(F) = |LUMIX(F)| + |RUMIX(F)| + |CUMIX(F)| + |LSUMIX(F)| + |RSUMIX(F)|
The outputs from multipliers 528 and 530 are then converted from the frequency domain back to the time domain by frequency-time synthesis units 532 and 534 to generate LW(T) and RW(T). The frequency-time synthesis units may be suitable synthesis filter banks capable of converting frequency-domain signals back to time-domain signals. In this way, position and energy corrections can be applied to the downmix stereo channel signals LW′(T) and RW′(T) to produce left and right watermark channel signals LW(T) and RW(T) that are faithful to the original 5.1 signal. LW(T) and RW(T) can be played back in stereo or upmixed back to 5.1 channels or another suitable number of channels without significantly altering the spectral component positions or energies of any content elements present in the original 5.1 channel sound.
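The spectral-balance step described above reduces to one real gain per band. A minimal sketch, assuming per-band magnitude arrays are already available for the five source channels and the five simulated upmix channels:

```python
import numpy as np

def spectral_energy_correction(LW, RW, src_mags, umx_mags, eps=1e-12):
    """Scale each subband of the watermarked pair by T_SOURCE(F)/T_UMIX(F),
    each total being the sum of the five channel magnitudes for that band
    (eps guards silent bands). src_mags/umx_mags: lists of 5 arrays."""
    t_src = sum(src_mags)    # |L(F)| + |R(F)| + |C(F)| + |LS(F)| + |RS(F)|
    t_umx = sum(umx_mags)
    g = t_src / np.maximum(t_umx, eps)
    return LW * g, RW * g
```

If the simulated upmix carries half the source energy in every band, both watermark channels are boosted by a factor of two, restoring the overall spectral balance.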
Fig. 6 is a diagram of a system 600 for upmixing data from M channels to N channels according to an exemplary embodiment of the invention. The system 600 converts stereo time domain data to N channel time domain data.
The system 600 includes time-frequency analysis units 602 and 604, a filter generation unit 606, a smoothing unit 608, and frequency-time synthesis units 634 to 638. System 600 provides improved spatial discrimination and stability during upmixing by a scalable frequency domain structure that allows high resolution band processing, and by a filter generation method that extracts and analyzes important inter-channel spatial cues per band to derive a spatial arrangement of frequency elements in an upmixed N-channel signal.
The system 600 receives a left channel stereo signal l (t) and a right channel stereo signal r (t) at time-frequency analysis units 602 and 604, which convert a time domain signal into a frequency domain signal. These time-frequency analysis units can be suitable filter banks, such as Finite Impulse Response (FIR) filter banks, Quadrature Mirror Filter (QMF) banks, Discrete Fourier Transforms (DFT), Time Domain Aliasing Cancellation (TDAC) filter banks, or other suitable filter banks. Output from the time-frequency analysis units 602 and 604 is a set of frequency domain values covering a sufficient frequency range of the human auditory system, such as a frequency range of 0 to 20kHz, where the analysis filter bank subband bandwidth can be processed to approximate the psychoacoustic critical band, the equivalent rectangular bandwidth, or some other perceptual characteristic. Likewise, other suitable numbers of frequency bands and ranges can be used.
The outputs from the time-frequency analysis units 602 and 604 are provided to a filter generation unit 606. In one exemplary embodiment, the filter generation unit 606 is capable of receiving an external selection regarding the number of channels that should be output for a given environment. For example, a 4.1 channel system with two front and two rear speakers can be selected, a 5.1 sound system with two front, two rear, and one front center speaker can be selected, a 7.1 sound system with two front, two side, two rear, and one front center speaker can be selected, or another suitable sound system can be selected. The filter generation unit 606 extracts and analyzes inter-channel spatial cues, such as inter-channel level differences (ICLDs) and inter-channel coherence (ICC), on a frequency band basis. Those extracted spatial cues are then used as parameters to generate adaptive channel filters that control the spatial arrangement of the frequency band elements in the upmix sound field. The channel filters are smoothed over time and frequency by the smoothing unit 608 to limit filtering variability that, if allowed to vary too rapidly, can cause objectionable ripple effects. In the exemplary embodiment shown in fig. 6, the left and right channel L(F) and R(F) frequency domain signals are provided to the filter generation unit 606, which produces N channel filter signals H_1(F), H_2(F), ..., H_N(F), which are provided to the smoothing unit 608.
The smoothing unit 608 averages the frequency domain components of each of the N channel filters across the time and frequency dimensions. Smoothing across time and frequency helps control rapid fluctuations in the channel filter signals, thereby reducing jitter artifacts and instability that can be objectionable to a listener. In one exemplary embodiment, temporal smoothing can be achieved by applying a first-order low-pass filter to each frequency band from the current frame and the corresponding frequency band from the previous frame. This has the effect of reducing the variability of each band from frame to frame. In another exemplary embodiment, spectral smoothing can be performed across groups of frequency bins (bins) modeled to approximate the critical band spacing of the human auditory system. For example, if an analysis filter bank with uniformly spaced frequency bins is used, different numbers of frequency bins can be grouped and averaged for different partitions of the spectrum: from 0 to 5 kHz, 5 frequency bins can be averaged; from 5 kHz to 10 kHz, 7 frequency bins can be averaged; and from 10 kHz to 20 kHz, 9 frequency bins can be averaged; or another suitable number of frequency bins and bandwidth ranges can be selected. The outputs from the smoothing unit 608 are the smoothed values H_1(F), H_2(F), ..., H_N(F).
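A minimal sketch of this two-stage smoothing for a single channel filter follows. The smoothing coefficient alpha and the (start, stop, group size) bin partitions are illustrative choices standing in for the 5/7/9-bin example above; they are not values from the original disclosure:

```python
import numpy as np

def smooth_channel_filter(H_curr, H_prev, alpha=0.8,
                          groups=((0, 64, 5), (64, 128, 7), (128, 256, 9))):
    """Smooth one channel filter over time and frequency (illustrative)."""
    # Temporal smoothing: first-order low-pass between the previous frame's
    # smoothed filter and the current frame's raw filter.
    H = alpha * H_prev + (1.0 - alpha) * H_curr
    # Spectral smoothing: average fixed-size groups of bins, with wider
    # groups at higher frequencies to approximate critical-band spacing.
    out = H.copy()
    for start, stop, size in groups:
        for i in range(start, stop, size):
            j = min(i + size, stop)
            out[i:j] = H[i:j].mean()
    return out
```

The low-pass stage limits frame-to-frame variability of each band; the grouped averaging stage limits bin-to-bin variability within each spectral partition.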
The source signal X_1(F), X_2(F), ..., X_N(F) for each of the N output channels is generated as an adaptive combination of the M input channels. In the exemplary embodiment shown in Fig. 6, for a given output channel i, the channel source signal X_i(F) output from adders 614, 620, and 626 is generated as the sum of L(F) multiplied by an adaptive scaling signal G_i(F) and R(F) multiplied by the adaptive scaling signal 1 - G_i(F). The adaptive scaling signals G_i(F) used by multipliers 610, 612, 616, 618, 622, and 624 are determined by the desired spatial position of the output channel i and the dynamic inter-channel coherence estimates of L(F) and R(F) for each frequency band. Likewise, the polarity of the signals provided to adders 614, 620, and 626 is determined by the desired spatial position of output channel i. For example, the adaptive scaling signals G_i(F) and polarities at adders 614, 620, and 626 can be designed to provide an L(F) + R(F) combination for the front center channel, L(F) for the left channel, R(F) for the right channel, and an L(F) - R(F) combination for the rear channels, as is common in conventional matrix upmixing methods. The adaptive scaling signals G_i(F) can further provide a method to dynamically adjust the correlation between output channel pairs, whether they are lateral or front-back channel pairs.
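A sketch of this adaptive combination per band follows. The helper name and the explicit polarity argument are assumptions for illustration; in the patent the polarity is expressed through the adder wiring rather than a parameter:

```python
import numpy as np

def channel_source(L, R, G, polarity=1):
    """X_i(F) = G_i(F) * L(F) + polarity * (1 - G_i(F)) * R(F), per band.

    polarity = +1 yields front-style L+R combinations;
    polarity = -1 yields rear-style L-R combinations."""
    return G * L + polarity * (1.0 - G) * R
```

With G_i(F) = 0.5 for all bands, polarity +1 reproduces the conventional matrix center feed 0.5*(L+R), and polarity -1 the rear feed 0.5*(L-R); G_i(F) = 1 passes the left channel through unchanged.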
The channel source signals X_1(F), X_2(F), ..., X_N(F) are multiplied by the smoothed channel filters H_1(F), H_2(F), ..., H_N(F) using multipliers 628 to 632, respectively.
The outputs from the multipliers 628 to 632 are then converted from the frequency domain to the time domain by frequency-time synthesis units 634 to 638 to generate the output channels Y_1(T), Y_2(T), ..., Y_N(T). In this manner, the left and right stereo signals are upmixed to N channel signals, where inter-channel spatial cues, either naturally occurring or intentionally encoded into the left and right stereo signals as by the downmix watermarking process of Fig. 1 or another suitable process, can be used to control the spatial arrangement of frequency elements within the N channel sound field produced by the system 600. Likewise, other suitable combinations of inputs and outputs can be used, such as stereo to 7.1 sound, 5.1 to 7.1 sound, or other suitable combinations.
Fig. 7 is a diagram of a system 700 for upmixing data from M channels to N channels according to an exemplary embodiment of the invention. The system 700 converts stereo time domain data into 5.1 channel time domain data.
The system 700 includes time-frequency analysis units 702 and 704, a filter generation unit 706, a smoothing unit 708, and frequency-time synthesis units 738 to 746. System 700 provides improved spatial discrimination and stability during upmixing by a scalable frequency domain structure that allows high resolution frequency band processing, and by a filter generation method that extracts and analyzes important inter-channel spatial cues for each frequency band to derive the spatial arrangement of frequency elements in the upmixed 5.1 channel signal.
The system 700 receives a left channel stereo signal l (t) and a right channel stereo signal r (t) at time-frequency analysis units 702 and 704, which convert a time domain signal into a frequency domain signal. These time-frequency analysis units can be suitable filter banks, such as Finite Impulse Response (FIR) filter banks, Quadrature Mirror Filter (QMF) banks, Discrete Fourier Transforms (DFT), Time Domain Aliasing Cancellation (TDAC) filter banks, or other suitable filter banks. Output from the time-frequency analysis units 702 and 704 is a set of frequency domain values covering a sufficient frequency range of the human auditory system, such as a frequency range of 0 to 20kHz, where the analysis filter bank subband bandwidth can be processed to approximate the psycho-acoustic critical band, the equivalent rectangular bandwidth, or some other perceptual characteristic. Likewise, other suitable numbers of frequency bands and ranges can be used.
The outputs from the time-frequency analysis units 702 and 704 are provided to a filter generation unit 706. In an exemplary embodiment, the filter generation unit 706 can receive an external selection as to the number of channels that should be output for a given environment, such as a 4.1 channel system with two front and two rear speakers, a 5.1 sound system with two front, two rear, and one front center speaker, a 3.1 sound system with two front speakers and one front center speaker, or another suitable sound system. The filter generation unit 706 extracts and analyzes inter-channel spatial cues such as inter-channel level differences (ICLDs) and inter-channel coherence (ICC) on a frequency band basis. Those extracted spatial cues are then used as parameters to generate adaptive channel filters that control the spatial arrangement of the frequency band elements in the upmix sound field. The channel filters are smoothed over time and frequency by the smoothing unit 708 to limit filtering variability that, if allowed to vary too rapidly, can cause objectionable ripple effects. In the exemplary embodiment shown in Fig. 7, the left and right channel L(F) and R(F) frequency domain signals are provided to the filter generation unit 706, which produces 5.1 channel filter signals H_L(F), H_R(F), H_C(F), H_LS(F), and H_RS(F), which are provided to the smoothing unit 708.
The smoothing unit 708 averages the frequency domain components of each of the 5.1 channel filters across the time and frequency dimensions. Smoothing across time and frequency helps control rapid fluctuations in the channel filter signals, thus reducing jitter artifacts and instabilities that can be annoying to listeners. In one exemplary embodiment, temporal smoothing can be achieved by applying a first-order low-pass filter to each frequency band from the current frame and the corresponding frequency band from the previous frame. This has the effect of reducing the variability of each band from frame to frame. In another exemplary embodiment, spectral smoothing can be performed across groups of frequency bins modeled to approximate the critical band spacing of the human auditory system. For example, if an analysis filter bank with uniformly spaced frequency bins is used, different numbers of frequency bins can be grouped and averaged for different partitions of the spectrum. In this exemplary embodiment, 5 frequency bins can be averaged from 0 to 5 kHz, 7 frequency bins can be averaged from 5 kHz to 10 kHz, and 9 frequency bins can be averaged from 10 kHz to 20 kHz, or another suitable number of frequency bins and bandwidth ranges can be selected. The outputs from the smoothing unit 708 are the smoothed values H_L(F), H_R(F), H_C(F), H_LS(F), and H_RS(F).
The source signals X_L(F), X_R(F), X_C(F), X_LS(F), and X_RS(F) for each of the 5.1 output channels are generated as adaptive combinations of the stereo input channels. In the exemplary embodiment shown in Fig. 7, X_L(F) is simply provided as L(F), meaning that G_L(F) = 1 for all bands. Likewise, X_R(F) is simply provided as R(F), meaning that G_R(F) = 0 for all bands. X_C(F), output from adder 714, is calculated as the sum of the signal L(F) multiplied by the adaptive scaling signal G_C(F) and R(F) multiplied by the adaptive scaling signal 1 - G_C(F). X_LS(F), output from adder 720, is calculated as the sum of the signal L(F) multiplied by the adaptive scaling signal G_LS(F) and R(F) multiplied by the adaptive scaling signal 1 - G_LS(F). Similarly, X_RS(F), output from adder 726, is calculated as the sum of the signal L(F) multiplied by the adaptive scaling signal G_RS(F) and R(F) multiplied by the adaptive scaling signal 1 - G_RS(F). Note that if G_C(F) = 0.5, G_LS(F) = 0.5, and G_RS(F) = 0.5 for all bands, then the front center channel is derived from a scaled L(F) + R(F) combination and the surround channels are derived from scaled L(F) - R(F) combinations, as is common in conventional matrix upmixing methods. The adaptive scaling signals G_C(F), G_LS(F), and G_RS(F) can further provide a method to dynamically adjust the correlation between adjacent pairs of output channels, whether they are lateral or front-back channel pairs. The channel source signals X_L(F), X_R(F), X_C(F), X_LS(F), and X_RS(F) are multiplied by the smoothed channel filters H_L(F), H_R(F), H_C(F), H_LS(F), and H_RS(F) using multipliers 728 to 736, respectively.
The outputs from the multipliers 728 to 736 are then converted from the frequency domain to the time domain by frequency-time synthesis units 738 to 746 to generate the output channels Y_L(T), Y_R(T), Y_C(T), Y_LS(T), and Y_RS(T). In this way, the left and right stereo signals are upmixed to a 5.1 channel signal, wherein inter-channel spatial cues that are naturally occurring or intentionally encoded into the left and right stereo signals, as by the downmix watermarking process of Fig. 1 or another suitable process, can be used to control the spatial arrangement of frequency elements within the 5.1 channel sound field produced by the system 700. Likewise, other suitable combinations of inputs and outputs can be used, such as stereo to 4.1 sound, 4.1 to 5.1 sound, or other suitable combinations.
Fig. 8 is a diagram of a system 800 for upmixing data from M channels to N channels according to an exemplary embodiment of the invention. The system 800 converts stereo time domain data to 7.1 channel time domain data.
The system 800 includes time-frequency analyzing units 802 and 804, a filter generating unit 806, a smoothing unit 808, and frequency-time synthesizing units 854 to 866. System 800 provides improved spatial discrimination and stability during upmixing by a scalable frequency domain structure that allows high resolution frequency band processing, and by a filter generation method that extracts and analyzes important inter-channel spatial cues for each frequency band to derive a spatial arrangement of frequency elements in the upmixed 7.1 channel signal.
The system 800 receives a left channel stereo signal l (t) and a right channel stereo signal r (t) at time-frequency analysis units 802 and 804, which convert time domain signals into frequency domain signals. These time-frequency analysis units can be suitable filter banks, such as Finite Impulse Response (FIR) filter banks, Quadrature Mirror Filter (QMF) banks, Discrete Fourier Transforms (DFT), Time Domain Aliasing Cancellation (TDAC) filter banks, or other suitable filter banks. Output from the time-frequency analysis units 802 and 804 is a set of frequency domain values that cover a sufficient frequency range of the human auditory system, such as a frequency range of 0 to 20kHz, where the analysis filter bank subband bandwidth can be processed to approximate the psychoacoustic critical band, equivalent rectangular bandwidth, or some other perceptual characteristic. Likewise, other suitable numbers of frequency bands and ranges can be used.
The outputs from the time-frequency analysis units 802 and 804 are provided to a filter generation unit 806. In one exemplary embodiment, the filter generation unit 806 can receive an external selection as to the number of channels that should be output for a given environment. For example, a 4.1 channel system with two front and two rear speakers can be selected, a 5.1 sound system with two front, two rear, and one front center speaker can be selected, a 7.1 sound system with two front, two side, two rear, and one front center speaker can be selected, or another suitable sound system can be selected. The filter generation unit 806 extracts and analyzes inter-channel spatial cues such as inter-channel level differences (ICLDs) and inter-channel coherence (ICC) on a frequency band basis. Those extracted spatial cues are then used as parameters to generate adaptive channel filters that control the spatial arrangement of the frequency band elements in the upmix sound field. The channel filters are smoothed over time and frequency by the smoothing unit 808 to limit filtering variability that, if allowed to vary too rapidly, can cause objectionable ripple effects. In the exemplary embodiment shown in Fig. 8, the left and right channel L(F) and R(F) frequency domain signals are provided to the filter generation unit 806, which produces 7.1 channel filter signals H_L(F), H_R(F), H_C(F), H_LS(F), H_RS(F), H_LB(F), and H_RB(F), which are provided to the smoothing unit 808.
The smoothing unit 808 averages the frequency domain components of each of the 7.1 channel filters across the time and frequency dimensions. Smoothing across time and frequency helps control rapid fluctuations in the channel filter signals, thus reducing jitter artifacts and instabilities that can be annoying to listeners. In one exemplary embodiment, temporal smoothing can be achieved by applying a first-order low-pass filter to each frequency band from the current frame and the corresponding frequency band from the previous frame. This has the effect of reducing the variability of each band from frame to frame. In another exemplary embodiment, spectral smoothing can be performed across groups of frequency bins modeled to approximate the critical band spacing of the human auditory system. For example, if an analysis filter bank with uniformly spaced frequency bins is used, different numbers of frequency bins can be grouped and averaged for different partitions of the spectrum. In this exemplary embodiment, 5 frequency bins can be averaged from 0 to 5 kHz, 7 frequency bins can be averaged from 5 kHz to 10 kHz, and 9 frequency bins can be averaged from 10 kHz to 20 kHz, or another suitable number of frequency bins and bandwidth ranges can be selected. The outputs from the smoothing unit 808 are the smoothed values H_L(F), H_R(F), H_C(F), H_LS(F), H_RS(F), H_LB(F), and H_RB(F).
The source signals X_L(F), X_R(F), X_C(F), X_LS(F), X_RS(F), X_LB(F), and X_RB(F) for each of the 7.1 output channels are generated as adaptive combinations of the stereo input channels. In the exemplary embodiment shown in Fig. 8, X_L(F) is simply provided as L(F), meaning that G_L(F) = 1 for all bands. Likewise, X_R(F) is simply provided as R(F), meaning that G_R(F) = 0 for all bands. X_C(F), output from adder 814, is calculated as the sum of the signal L(F) multiplied by the adaptive scaling signal G_C(F) and R(F) multiplied by the adaptive scaling signal 1 - G_C(F). X_LS(F), output from adder 820, is calculated as the sum of the signal L(F) multiplied by the adaptive scaling signal G_LS(F) and R(F) multiplied by the adaptive scaling signal 1 - G_LS(F). Similarly, X_RS(F), output from adder 826, is calculated as the sum of the signal L(F) multiplied by the adaptive scaling signal G_RS(F) and R(F) multiplied by the adaptive scaling signal 1 - G_RS(F). Similarly, X_LB(F), output from adder 832, is calculated as the sum of the signal L(F) multiplied by the adaptive scaling signal G_LB(F) and R(F) multiplied by the adaptive scaling signal 1 - G_LB(F). Similarly, X_RB(F), output from adder 838, is calculated as the sum of the signal L(F) multiplied by the adaptive scaling signal G_RB(F) and R(F) multiplied by the adaptive scaling signal 1 - G_RB(F). Note that if G_C(F) = 0.5, G_LS(F) = 0.5, G_RS(F) = 0.5, G_LB(F) = 0.5, and G_RB(F) = 0.5 for all bands, then the front center channel is derived from a scaled L(F) + R(F) combination and the side and back channels are derived from scaled L(F) - R(F) combinations, as is common in conventional matrix upmixing methods. The adaptive scaling signals G_C(F), G_LS(F), G_RS(F), G_LB(F), and G_RB(F) can further provide a method to dynamically adjust the correlation between adjacent pairs of output channels, whether they are lateral or front-back channel pairs.
The channel source signals X_L(F), X_R(F), X_C(F), X_LS(F), X_RS(F), X_LB(F), and X_RB(F) are multiplied by the smoothed channel filters H_L(F), H_R(F), H_C(F), H_LS(F), H_RS(F), H_LB(F), and H_RB(F) using multipliers 840 to 852, respectively.
The outputs from the multipliers 840 to 852 are then converted from the frequency domain to the time domain by the frequency-time synthesis units 854 to 866 to generate the output channels Y_L(T), Y_R(T), Y_C(T), Y_LS(T), Y_RS(T), Y_LB(T), and Y_RB(T). In this manner, the left and right stereo signals are upmixed to a 7.1 channel signal, where inter-channel spatial cues, either naturally occurring or intentionally encoded into the left and right stereo signals as by the downmix watermarking process of Fig. 1 or another suitable process, can be used to control the spatial arrangement of frequency elements within the 7.1 channel sound field produced by the system 800. Likewise, other suitable combinations of inputs and outputs can be used, such as stereo to 5.1 sound, 5.1 to 7.1 sound, or other suitable combinations.
Fig. 9 is a diagram of a system 900 for generating filters for frequency domain application in accordance with an exemplary embodiment of the present invention. The filter generation process uses frequency domain analysis and processing of the M channel input signal. For each frequency band of the M channel input signal, the associated inter-channel spatial cues are extracted and a spatial position vector is generated. For a listener in ideal listening conditions, this spatial position vector is interpreted as the perceived source location for that band. Each channel filter is then generated such that the resulting spatial position of the frequency elements in the upmixed N channel output signal is consistent with the extracted inter-channel cues. Estimates of the inter-channel level difference (ICLD) and the inter-channel coherence (ICC) are used as the inter-channel cues to generate the spatial position vectors.
In the exemplary embodiment shown in system 900, subband magnitude or energy components are used to estimate inter-channel level differences, and subband phase angle components are used to estimate inter-channel coherence. The left and right frequency domain inputs L(F) and R(F) are converted to magnitude or energy components and phase angle components, where the magnitude/energy components are provided to an adder 902 that calculates the total energy signal T(F), which is then used to normalize the left channel M_L(F) and right channel M_R(F) magnitude/energy values for each frequency band by dividers 904 and 906, respectively. A normalized lateral coordinate signal LAT(F) is then calculated from M_L(F) and M_R(F), where the normalized lateral coordinate for each frequency band is calculated as:
LAT(F) = M_L(F) * X_MIN + M_R(F) * X_MAX
Likewise, the normalized depth coordinate is calculated from the input phase angle components as:
DEP(F) = Y_MAX - 0.5 * (Y_MAX - Y_MIN) * sqrt([cos(∠L(F)) - cos(∠R(F))]^2 + [sin(∠L(F)) - sin(∠R(F))]^2)
the normalized depth coordinate being substantially based on the phase angle component/L (F) and/scaled and shifted distance measurements between r (f) are calculated. Current phase angle/L (F) and/r (F) are close to each other on the unit circle, the value of DEP (F) is close to 1, and when the phase angle is/L (F) and/r (F) is close to the opposite side of the unit circle, DEP (F) is close to 0. For each frequency band, the normalized lateral and depth coordinates form a 2-dimensional vector (LAT (F), DEP (F)) that is input into the 2-dimensional channel map, such as toAs shown in fig. 10A to 10E below, to generate a filtered value H for each channel ii(F) In that respect These channel filters H for each channel i are output from filter generation units such as the filter generation unit 606 of fig. 6, the filter generation unit 706 of fig. 7, and the filter generation unit 806 of fig. 8i(F)。
Fig. 10A is a diagram of a filter map for a front left signal according to an exemplary embodiment of the present invention. In Fig. 10A, the filter map 1000 accepts a normalized lateral coordinate ranging from 0 to 1 and a normalized depth coordinate ranging from 0 to 1, and outputs a normalized filter value ranging from 0 to 1. Grey shading is used to indicate the change in amplitude from a maximum of 1 to a minimum of 0, as shown by the scale on the right-hand side of the filter map 1000. For this exemplary front left filter map 1000, normalized lateral and depth coordinates near (0, 1) would output the highest filter value, near 1.0, while coordinates ranging from approximately (0.6, Y) to (1.0, Y), where Y is a number between 0 and 1, would output filter values of substantially 0.

Fig. 10B is a diagram of an exemplary front right filter map 1002. The filter map 1002 accepts the same normalized lateral and depth coordinates as the filter map 1000, but the output filter values are biased toward the front right of the normalized layout.

Fig. 10C is a diagram of an exemplary center filter map 1004. In this exemplary embodiment, the maximum filter value for the center filter map 1004 occurs at the front center of the normalized layout, with a magnitude that decreases significantly as the coordinates move away from the front center of the layout toward the back of the layout.

Fig. 10D is a diagram of an exemplary left surround filter map 1006. In this exemplary embodiment, the maximum filter value for the left surround filter map 1006 occurs near the rear left coordinates of the normalized layout and decreases in magnitude as the coordinates move toward the front right of the layout.

Fig. 10E is a diagram of an exemplary right surround filter map 1008. In this exemplary embodiment, the maximum filter value for the right surround filter map 1008 occurs near the rear right coordinates of the normalized layout and decreases in magnitude as the coordinates move toward the front left of the layout.
Likewise, if other speaker layouts or configurations are used, the existing filter maps can be adjusted and new filter maps corresponding to the new speaker locations can be generated to reflect the changed listening environment. In one exemplary embodiment, a 7.1 system would move the left and right surround maps up in the depth coordinate dimension and include two additional filter maps for the left and right back locations, similar to filter maps 1006 and 1008, respectively. The rate at which the filter factors decrease can be varied to accommodate different numbers of speakers.
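As an illustration only, a front left map with the qualitative shape described for Fig. 10A might be sketched as follows. The linear falloff and the 0.6 lateral cutoff are assumptions drawn from the example coordinates above, not the actual map used in the disclosure:

```python
def front_left_filter(lat, dep):
    """Hypothetical front left filter map: peaks near (lat, dep) = (0, 1),
    falls to 0 by lat = 0.6, and fades toward the back (dep = 0)."""
    lateral = max(0.0, 1.0 - lat / 0.6)  # 1 at the left edge, 0 from lat = 0.6 on
    return lateral * dep                 # scale by depth toward the front
```

Each output channel would use its own such map, and supporting a different speaker layout amounts to swapping in maps peaked at the new speaker positions.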
Although exemplary embodiments of the systems and methods of the present invention have been described herein in detail, those skilled in the art will also recognize that various substitutions and modifications can be made to the systems and methods without departing from the scope and spirit of the appended claims.

Claims (18)

1. A method for converting from an N-channel audio system to an M-channel audio system, where N and M are integers and N is greater than M, comprising:
converting the audio data of the N channels into audio data of the M channels;
converting the audio data of the M channels into audio data of N' channels; and
correcting the audio data of the M channels based on a difference between the audio data of the N channels and the audio data of the N' channels,
wherein converting the audio data of the N channels into the audio data of the M channels further comprises:
processing one or more of the N channels of audio data with a fractional hilbert function to apply a predetermined phase shift to the audio data of the associated channel; and
after processing with the fractional Hilbert function, combining one or more of the N channels of audio data to produce the M channels of audio data such that the combination of one or more of the N channels of audio data in each of the M channels of audio data has a predetermined phase relationship.
2. The method of claim 1, wherein converting the M channels of audio data into the N' channels of audio data comprises:
converting the audio data of the M channels from a time domain to a plurality of sub-bands of a frequency domain;
filtering the plurality of subbands of the M channels to generate a plurality of subbands of N' channels;
smoothing a plurality of subbands of the N' channels by averaging each subband with one or more adjacent bands;
multiplying each of a plurality of subbands of the N' channels with one or more corresponding subbands of the M channels; and
converting a plurality of subbands of the N 'channels from the frequency domain to the time domain to obtain audio data for the N' channels.
3. The method of claim 1, wherein correcting the M channels of audio data based on the difference between the N channels of audio data and the N' channels of audio data comprises:
determining an energy and position vector for each of a plurality of subbands of the N channels of audio data;
determining an energy and position vector for each of a plurality of subbands of the N' channels of audio data; and
correcting one or more subbands of the M channels of audio data if a difference between the energy and the position vector for the respective subbands of the N channels of audio data and the N' channels of audio data is greater than a predetermined threshold.
4. The method of claim 3, wherein correcting one or more subbands of the M channels of audio data comprises: adjusting energy and position vectors for the subbands of the M channels of audio data such that the adjusted subbands of the M channels of audio data are converted to adjusted N 'channels of audio data having one or more subband energy and position vectors that are closer to the energy and position vectors for subbands of the N channels of audio data than unadjusted energy and position vectors for each of a plurality of subbands of the N' channels of audio data.
5. An audio spatial environment engine for conversion from an N-channel audio system to an M-channel audio system, where N and M are integers and N is greater than M, comprising:
one or more Hilbert transform stages, each of which receives one of the N channels of audio data and applies a predetermined phase shift to the audio data of the associated channel;
one or more constant multiplier stages, each receiving one of the Hilbert transformed channel's audio data and each generating scaled Hilbert transformed channel's audio data;
one or more first summing stages, each receiving the one of the N channels of audio data and the scaled Hilbert transformed channel of audio data, and each generating fractional Hilbert channel of audio data; and
m second summing stages, each receiving one or more of the fractional hilbert channel audio data and one or more of the N channels of audio data and combining the one or more of the fractional hilbert channel audio data and each of the one or more of the N channels of audio data to generate one of M channels of audio data having a predetermined phase relationship between the one or more of each of the fractional hilbert channel audio data and the one or more of the N channels of audio data.
6. The audio spatial environment engine of claim 5, comprising a Hilbert transform stage for receiving audio data of a left channel, wherein the Hilbert transformed left channel audio data is multiplied by a constant and added to the left channel audio data to generate left channel audio data having a predetermined phase shift, the phase shifted left channel audio data being multiplied by a constant and provided to one or more of the M second summing stages.
7. The audio spatial environment engine of claim 5 including a Hilbert transform stage for receiving right channel audio data, wherein the Hilbert transformed right channel audio data is multiplied by a constant and subtracted from the right channel audio data to generate right channel audio data having a predetermined phase shift, the phase shifted right channel audio data being multiplied by a constant and provided to one or more of the M second summing stages.
8. The audio spatial environment engine of claim 5 including a Hilbert transform stage that receives audio data for a left surround channel and a Hilbert transform stage that receives audio data for a right surround channel, wherein the Hilbert transformed audio data for the left surround channel is multiplied by a constant and added to the Hilbert transformed audio data for the right surround channel to generate left-right surround channel audio data, the left-right surround channel audio data being provided to one or more of the M second summing stages.
9. The audio spatial environment engine of claim 5 including a Hilbert transform stage that receives audio data for a right surround channel and a Hilbert transform stage that receives audio data for a left surround channel, wherein the Hilbert transformed audio data for the right surround channel is multiplied by a constant and added to the Hilbert transformed audio data for the left surround channel to generate right-left surround channel audio data, the right-left surround channel audio data being provided to one or more of the M second summing stages.
10. The audio spatial environment engine of claim 5, comprising:
a Hilbert transform stage that receives audio data of a left channel, wherein the Hilbert transformed left channel audio data is multiplied by a constant and added to the left channel audio data to generate left channel audio data having a predetermined phase shift, the phase-shifted left channel audio data being multiplied by a constant to generate scaled left channel audio data;
a Hilbert transform stage that receives audio data of a right channel, wherein the Hilbert transformed right channel audio data is multiplied by a constant and subtracted from the right channel audio data to generate right channel audio data having a predetermined phase shift, the phase-shifted right channel audio data being multiplied by a constant to generate scaled right channel audio data; and
a Hilbert transform stage that receives audio data of a left surround channel and a Hilbert transform stage that receives audio data of a right surround channel, wherein the Hilbert transformed left surround channel audio data is multiplied by a constant and added to the Hilbert transformed right surround channel audio data to generate left-right surround channel audio data, and the Hilbert transformed right surround channel audio data is multiplied by a constant and added to the Hilbert transformed left surround channel audio data to generate right-left surround channel audio data.
11. The audio spatial environment engine of claim 10, comprising:
a first one of the M second summing stages that receives the scaled left channel audio data, the right-left surround channel audio data, and scaled center channel audio data, and sums the scaled left channel audio data, the right-left surround channel audio data, and the scaled center channel audio data to form left watermark channel audio data; and
a second one of the M second summing stages that receives the scaled right channel audio data, the left-right surround channel audio data, and the scaled center channel audio data, adds the scaled right channel audio data and the scaled center channel audio data, and subtracts the left-right surround channel audio data from the sum to form right watermark channel audio data.
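Claims 10 and 11 together describe a 5-to-2 watermark downmix: phase-shifted left and right channels, a scaled center, and two cross-combined Hilbert-transformed surrounds summed into left and right watermark channels. A sketch under assumed gains follows — the phase angle, the 0.7071 center gain, and the 0.5 surround scaling are illustrative assumptions, not values fixed by the claims.

```python
import numpy as np

def _hilbert(x):
    # FFT-based Hilbert transform (imaginary part of the analytic signal).
    N = len(x)
    u = np.zeros(N)
    u[0] = 1.0
    if N % 2 == 0:
        u[N // 2] = 1.0
        u[1:N // 2] = 2.0
    else:
        u[1:(N + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * u).imag

def downmix_5_to_2(L, R, C, Ls, Rs, phi=np.pi / 4, g_c=0.7071):
    """Watermark (Lt/Rt-style) downmix sketched from claims 10-11.
    The gains and the phase angle are assumed for illustration."""
    # Claims 6/10: left gets +scaled Hilbert; claims 7/10: right gets -scaled Hilbert.
    Lp = np.cos(phi) * L + np.sin(phi) * _hilbert(L)
    Rp = np.cos(phi) * R - np.sin(phi) * _hilbert(R)
    # Claims 8/9: scaled Hilbert of one surround added to the Hilbert of the other
    # (0.5 is an assumed scaling constant).
    LsRs = 0.5 * _hilbert(Ls) + _hilbert(Rs)  # left-right surround combination
    RsLs = 0.5 * _hilbert(Rs) + _hilbert(Ls)  # right-left surround combination
    # Claim 11: the two second summing stages.
    lt = Lp + g_c * C + RsLs
    rt = Rp + g_c * C - LsRs
    return lt, rt
```

With silent left, right, and surround channels the two watermark outputs collapse to the same scaled center signal, as expected from the symmetric center contribution in claim 11.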
12. A method for converting from an N-channel audio system to an M-channel audio system, where N and M are integers and N is greater than M, comprising:
processing one or more of the N channels of audio data with a fractional Hilbert function to apply a predetermined phase shift to the audio data of the associated channel; and
combining one or more of the N channels of audio data, after processing with the fractional Hilbert function, to produce M channels of audio data, such that the combination of the one or more of the N channels of audio data in each of the M channels of audio data has a predetermined phase relationship.
13. The method of claim 12, wherein processing one or more of the N channels of audio data with a fractional Hilbert function comprises:
performing a Hilbert transform on audio data of a left channel;
multiplying the audio data of the Hilbert transformed left channel by a constant to obtain scaled, Hilbert transformed left channel audio data;
adding the scaled, Hilbert transformed left channel audio data to the left channel audio data to generate left channel audio data having a predetermined phase shift; and
multiplying the phase-shifted left channel audio data by a constant.
14. The method of claim 12, wherein processing one or more of the N channels of audio data with a fractional Hilbert function comprises:
performing a Hilbert transform on the audio data of the right channel;
multiplying the Hilbert transformed right channel audio data by a constant to obtain scaled, Hilbert transformed right channel audio data;
subtracting the scaled, Hilbert transformed right channel audio data from the right channel audio data to generate right channel audio data having a predetermined phase shift; and
multiplying the phase-shifted right channel audio data by a constant.
15. The method of claim 12, wherein processing one or more of the N channels of audio data with a fractional Hilbert function comprises:
performing a Hilbert transform on audio data of a left surround channel;
performing a Hilbert transform on audio data of a right surround channel;
multiplying the audio data of the Hilbert transformed left surround channel by a constant to obtain scaled, Hilbert transformed left surround channel audio data; and
adding the scaled, Hilbert transformed left surround channel audio data to the Hilbert transformed right surround channel audio data to generate left-right surround channel audio data with a predetermined phase shift.
16. The method of claim 12, wherein processing one or more of the N channels of audio data with a fractional Hilbert function comprises:
performing a Hilbert transform on audio data of a left surround channel;
performing a Hilbert transform on audio data of a right surround channel;
multiplying the audio data of the Hilbert transformed right surround channel by a constant to obtain scaled, Hilbert transformed right surround channel audio data; and
adding the scaled, Hilbert transformed right surround channel audio data to the Hilbert transformed left surround channel audio data to generate right-left surround channel audio data with a predetermined phase shift.
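Claims 13 through 16 name four fractional-Hilbert branches; their combined effect is a fixed phase relationship between channels. With identical inputs, the left branch (which adds the scaled Hilbert output) and the right branch (which subtracts it) come out 2φ apart in phase. A small demonstration, using an FFT-based Hilbert transform and an assumed phase constant φ:

```python
import numpy as np

def _hilbert(x):
    # FFT-based Hilbert transform (imaginary part of the analytic signal).
    N = len(x)
    u = np.zeros(N)
    u[0] = 1.0
    if N % 2 == 0:
        u[N // 2] = 1.0
        u[1:N // 2] = 2.0
    else:
        u[1:(N + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * u).imag

phi = np.pi / 6                      # assumed phase constant
n = np.arange(256)
w = 2 * np.pi * 8 / 256
tone = np.cos(w * n)

# Claim 13 branch: add the scaled Hilbert output -> phase lag of phi.
left_branch = np.cos(phi) * tone + np.sin(phi) * _hilbert(tone)
# Claim 14 branch: subtract the scaled Hilbert output -> phase lead of phi.
right_branch = np.cos(phi) * tone - np.sin(phi) * _hilbert(tone)
```

Here left_branch equals cos(ωn − φ) and right_branch equals cos(ωn + φ): a constant 2φ offset, which is the "predetermined phase relationship" the claims recite.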
17. The method of claim 12, comprising:
performing a Hilbert transform on audio data of a left channel;
multiplying the audio data of the Hilbert transformed left channel by a constant to obtain scaled, Hilbert transformed left channel audio data;
adding the scaled, Hilbert transformed left channel audio data to the left channel audio data to generate left channel audio data having a predetermined phase shift;
multiplying the phase-shifted left channel audio data by a constant;
performing a Hilbert transform on the audio data of the right channel;
multiplying the Hilbert transformed right channel audio data by a constant to obtain scaled, Hilbert transformed right channel audio data;
subtracting the scaled, Hilbert transformed right channel audio data from the right channel audio data to generate right channel audio data having a predetermined phase shift;
multiplying the phase-shifted right channel audio data by a constant;
performing a Hilbert transform on audio data of a left surround channel;
performing a Hilbert transform on audio data of a right surround channel;
multiplying the audio data of the Hilbert transformed left surround channel by a constant to obtain scaled, Hilbert transformed left surround channel audio data;
adding the scaled, Hilbert transformed left surround channel audio data to the Hilbert transformed right surround channel audio data to generate left-right surround channel audio data with a predetermined phase shift;
multiplying the audio data of the Hilbert transformed right surround channel by a constant to obtain scaled, Hilbert transformed right surround channel audio data; and
adding the scaled, Hilbert transformed right surround channel audio data to the Hilbert transformed left surround channel audio data to generate right-left surround channel audio data with a predetermined phase shift.
18. The method of claim 17, comprising:
summing the scaled left channel audio data, the right-left surround channel audio data, and the scaled center channel audio data to form left watermark channel audio data; and
summing the scaled right channel audio data and the scaled center channel audio data and subtracting the left-right surround channel audio data from the sum to form right watermark channel audio data.
HK11113095.4A 2004-10-28 2011-12-02 Audio spatial environment engine HK1158805B (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US62292204P 2004-10-28 2004-10-28
US10/975,841 2004-10-28
US60/622,922 2004-10-28
US10/975,841 US7929708B2 (en) 2004-01-12 2004-10-28 Audio spatial environment engine

Publications (2)

Publication Number Publication Date
HK1158805A1 HK1158805A1 (en) 2012-07-20
HK1158805B true HK1158805B (en) 2013-09-06


Similar Documents

Publication Publication Date Title
CN102117617B (en) Audio spatial environment engine
US7853022B2 (en) Audio spatial environment engine
US20060106620A1 (en) Audio spatial environment down-mixer
US8346565B2 (en) Apparatus and method for generating an ambient signal from an audio signal, apparatus and method for deriving a multi-channel audio signal from an audio signal and computer program
JP5149968B2 (en) Apparatus and method for generating a multi-channel signal including speech signal processing
JP6198800B2 (en) Apparatus and method for generating an output signal having at least two output channels
CN106796792B (en) Apparatus and method for enhancing audio signal, sound enhancement system
US20120039477A1 (en) Audio signal synthesizing
US20070223740A1 (en) Audio spatial environment engine using a single fine structure
EP2730103B1 (en) Method and apparatus for decomposing a stereo recording using frequency-domain processing employing a spectral subtractor
US20060093164A1 (en) Audio spatial environment engine
HK1158805B (en) Audio spatial environment engine
HK1197782B (en) Method and apparatus for decomposing a stereo recording using frequency-domain processing employing a spectral subtractor
HK1197782A (en) Method and apparatus for decomposing a stereo recording using frequency-domain processing employing a spectral subtractor
HK1146975A1 (en) Binaural multi-channel decoder in the context of non-energy-conserving upmix rules
HK1146975B (en) Binaural multi-channel decoder in the context of non-energy-conserving upmix rules
HK1197959B (en) Method and apparatus for decomposing a stereo recording using frequency-domain processing employing a spectral weights generator