
HK1186036B - Estimation of synthetic audio prototypes - Google Patents

Estimation of synthetic audio prototypes

Info

Publication number
HK1186036B
HK1186036B (application HK13113102.3A)
Authority
HK
Hong Kong
Prior art keywords
signal
prototype
input
signals
estimate
Prior art date
Application number
HK13113102.3A
Other languages
Chinese (zh)
Other versions
HK1186036A (en)
Inventor
P.B.赫尔茨
T.Z.巴克斯代尔
M.S.达布林
L.C.沃尔特斯
Original Assignee
Bose Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bose Corporation
Publication of HK1186036A
Publication of HK1186036B

Abstract

An approach to forming output signals both permits flexible and temporally and/or frequency local processing of input signals while limiting or mitigating artifacts in such output signals. Generally, the approach involves first synthesizing prototype signals for the output signals, or equivalently characterizing such prototypes, for example, according to their statistical characteristics, and then forming the output signals as estimates of the prototype signals, for example, as weighted combinations of the input signals.

Description

Estimation of synthetic audio prototypes
Cross Reference to Related Applications
The present application is a continuation-in-part (CIP) of the following application, which is incorporated herein by reference:
U.S. application serial No. 12/909,569, filed on October 21, 2010.
This application is related to, but does not claim the benefit of the filing date of, the following documents, which are incorporated herein by reference:
U.S. patent No. 7,630,500, entitled "Spatial Disassembly Process," granted on December 8, 2009;
U.S. patent publication No. 2009/0262969, entitled "Hearing Assistance Apparatus," published on October 22, 2009; and
U.S. patent publication No. 2008/0317260, entitled "Sound Discrimination Method and Apparatus," published on December 25, 2008.
Technical Field
The invention relates to estimation of synthetic audio prototypes.
Background
In the field of audio signal processing, the term "upmixing" generally refers to the process of undoing "downmixing", which is the process of combining many source signals into fewer audio channels. The downmix may be a natural acoustic process or a mix created in a recording studio. As an example, upmixing may involve generating several spatially distinct audio channels from a multi-channel source.
The simplest upmixer receives a pair of stereo audio signals and generates a single output representing information common to both channels, which is commonly referred to as the center channel. A slightly more complex upmixer may generate three channels representing the center channel and the "non-center" components of the left and right inputs. More complex upmixers attempt to separate one or more center channels, two "side-only" channels of panned content, and one or more "surround" channels of uncorrelated or out-of-phase content.
One method of upmixing is performed in the time domain by creating weighted (sometimes negative) combinations of stereo input channels. This approach can present a single source at a desired location, but it may not allow for isolation of multiple simultaneous sources. For example, a time domain upmixer operating on stereo content dominated by common (center) content would mix panned and less correlated content into the center output channel even though this weaker content could be attributed to the other channels.
Several stereo upmix algorithms are commercially available, including Dolby Pro Logic II (and other versions), Lexicon's Logic 7, DTS Neo:6, and Bose's Videostage, AudioStage, Centerpoint, and Centerpoint II.
There is a need to perform upmixing in a way that accurately renders spatially distinct audio channels from a multi-channel source while reducing audible artifacts and keeping processing delay low.
Disclosure of Invention
One or more embodiments address the technical problem of forming output signals in a way that permits flexible, temporally and/or frequency local processing of the input signals while limiting or mitigating artifacts in the output signals. In general, this technical problem may be solved by first synthesizing a prototype signal for each output signal (or, equivalently, data characterizing such a prototype, e.g., according to its statistical characteristics), and then forming the output signal as an estimate of the prototype signal (e.g., as a weighted combination of the input signals). In some examples, the prototype is a non-linear function of the inputs and the estimate is formed according to a least squares error metric.
This technical problem may arise in various audio processing applications. For example, the process of upmixing from a set of input audio channels may be addressed by first forming a prototype for the upmix signal, and then estimating the output signal using a combination of the input signals to match the prototype very closely. Other applications include signal enhancement with multiple microphone inputs, for example to provide directivity and/or ambient noise mitigation in headsets, hand-held microphones, vehicle microphones, etc. with multiple microphone elements.
In one aspect, in general, a method for forming an output signal from a plurality of input signals includes determining a synthesized characterization of one or more prototype signals from a plurality of ones of the input signals. Forming one or more output signals includes forming each output signal as an estimate of a corresponding one of the one or more prototype signals that includes a combination of one or more of the input signals.
Aspects can include one or more of the following features.
Determining a synthetic characterization of the prototype signal includes determining the prototype signal, or includes determining statistical features of the prototype signal.
Determining a synthetic characterization of the prototype signal includes forming said data based on a time-local analysis of the input signal. In some examples, determining the synthetic characterization of the prototype signal further includes forming the data based on a frequency local analysis of the input signal. In some examples, forming the estimate of the prototype is based on a more comprehensive analysis of the input signal and the prototype signal than a local analysis when forming the prototype signal.
The synthesis of the prototype signal includes a non-linear function of the input signals and/or a gating of one or more of the input signals.
Forming the output signal as an estimate of the prototype includes forming a minimum error estimate of the prototype. In some examples, forming the minimum error estimate includes forming a least squares error estimate.
Forming the output signal as an estimate of a corresponding one of the one or more prototype signals (the estimate being a combination of one or more of the input signals) includes calculating estimates of statistics relating the prototype signal and the one or more input signals, and determining weighting coefficients to be applied to each of said input signals.
The statistics include the cross-power statistics between the prototype signal and one or more input signals, the self-power statistics of one or more input signals, and the cross-power statistics between all input signals (if there is more than one input signal).
Calculating the estimate of the statistics includes averaging the locally calculated statistics over time and/or frequency.
The method also includes decomposing each input signal into a plurality of components.
Determining data characterizing the synthesis of the prototype signal includes forming data characterizing a component decomposition of each prototype signal into a plurality of prototype components.
Forming each output signal as an estimate of a corresponding one of the prototype signals includes forming a plurality of output component estimates as a transformation of corresponding components of the one or more input signals.
Forming the output signal includes combining the formed output component estimates to form the output signal.
Forming the component decomposition includes forming a frequency-based decomposition.
Forming the component decomposition includes forming a substantially orthogonal decomposition.
Forming the component decomposition includes applying at least one of a wavelet transform, a uniform bandwidth filter bank, a non-uniform bandwidth filter bank, an orthogonal mirror filter bank, and a statistical decomposition.
Forming the plurality of output component estimates as a combination of the corresponding components of the one or more input signals includes scaling the components of the input signals to form the components of the output signals.
The input signal comprises a plurality of input audio channels of an audio recording, and wherein the output signal comprises additional upmix channels. In some examples, the plurality of input audio channels includes at least a left audio channel and a right audio channel, and wherein the additional upmix channel includes at least one of a center channel and a surround channel.
A plurality of input signals from a microphone array are accepted. In some examples, one or more prototype signals are synthesized from differences between the input signals. In some examples, forming the prototype signal from differences between the input signals includes determining a gating value from the gain and/or phase difference, and applying the gating value to one or more of the input signals to determine the prototype signal.
In another aspect, in general, a method for forming one or more output signals from a plurality of input signals includes decomposing the input signals into input signal components representing different frequency components (e.g., frequency bands) at each of a series of times. A characterization of one or more prototype signals is determined from the plurality of input signals. The characterization of the one or more prototype signals includes a plurality of prototype components representing different frequency components at each of a series of times. One or more output signals are then formed by forming each output signal as an estimate of a corresponding one of the one or more prototype signals, the estimate including a combination of the one or more input signals.
In some examples, forming the output signal as an estimate of the prototype signal includes: for each prototype component of the plurality of prototype components, the estimate is formed as a combination of a plurality of input signal components, e.g. comprising at least some input signal components at different times or at different frequencies than the prototype component being estimated.
In some examples, forming the output signal as an estimate of the prototype signal includes: one or more constraints are applied in determining the combination of one or more input signals.
In another aspect, in general, a system for processing a plurality of input signals to form an output as an estimate of a synthetic prototype signal is configured to perform all of the steps of any of the methods specified above.
In another aspect, in general, software, which may be embodied on a machine-readable medium, includes instructions for processing a plurality of input signals to form an output as an estimate of a synthetic prototype signal, configured to perform all the steps of any of the methods specified above.
In another aspect, in general, a system for processing a plurality of input signals includes a prototype generator configured to accept a plurality of input signals and provide a characterization of the prototype signal. The estimator is configured to accept a characterization of the prototype signal and form the output signal as an estimate of the prototype signal as a combination of one or more of the input signals.
Aspects can include one or more of the following features.
The prototype signal comprises a non-linear function of the input signal.
The estimate of the prototype signal comprises a least squares error estimate of the prototype signal.
The system includes a component analysis module for forming a plurality of component decompositions for each input signal, and a reconstruction module for reconstructing an output signal from the component decompositions of the output signal.
The prototype generator and estimator are each configured to operate on the components on a component basis.
The prototype generator is configured to perform a time-local processing of the input signal for each component to determine a characterization of the component of the prototype signal.
The prototype generator is configured for accepting a plurality of input audio channels, and wherein the estimator is configured for providing an output signal comprising additional upmix channels.
The prototype generator is configured to accept a plurality of input audio channels from the microphone array, and wherein the prototype generator is configured to synthesize one or more prototype signals according to differences between the input signals.
The upmixing process may include converting the input signal into a component representation (e.g., by using a DFT filter bank). The component representation of each signal may be created periodically in time, thereby adding a time dimension to the component representation (e.g., a time-frequency representation).
Some embodiments may use heuristics to non-linearly estimate the desired output signal as the prototype signal. For example, the heuristics may determine how many given components from each input signal to include in the output signal.
When an appropriate filter bank is employed, the results that can be achieved by generating the coefficients non-linearly and independently at each time and frequency (i.e., a non-linear prototype) may be satisfactory.
An approximation technique (e.g., least squares approximation) may be used to project a nonlinear prototype onto the input signal space, thereby determining the upmix coefficients. Upmixing coefficients may be used to mix the input signals into the desired output signal.
Smoothing may be used to reduce artifacts and reduce resolution requirements, but may slow the response time of existing upmix systems. Existing time-frequency upmixers require a difficult compromise between artifacts and responsiveness. Creating linear estimates of the synthetic prototype makes these tradeoffs less difficult.
Various embodiments may have one or more of the following advantages.
The non-linear processing techniques used in the present application provide the possibility to perform a wide range of transformations that might otherwise not be possible to perform by using linear processing techniques alone. For example, upmixing, modification of room acoustics, and signal selection (e.g., for telephone headsets and hearing aids) may be achieved using non-linear processing techniques without introducing objectionable artifacts.
Linear estimation of a non-linear prototype of the target signal allows the system to respond quickly to changes in the input signal, yet introduces a minimum number of artifacts.
Other features and advantages of the invention will be apparent from the following description, and from the claims.
Drawings
Fig. 1 is a block diagram of a system configured for linear estimation of a synthetic prototype.
Fig. 2 is a block diagram of the decomposition of a signal into components and the estimation of a synthetic prototype for a representative component.
Fig. 3A shows a time-component representation for a prototype.
Fig. 3B is a detailed view of a single tile of the time-component representation.
FIG. 4A is a block diagram illustrating an exemplary synthetic prototype d_i(t) for a center channel.
FIG. 4B is a block diagram illustrating two exemplary "side-only" synthetic prototypes d_i(t).
FIG. 4C is a block diagram illustrating an exemplary synthetic prototype d_i(t) for a surround channel.
FIG. 5 is a block diagram of an alternative configuration of a composition processing module.
Fig. 6 is a block diagram of a system configured for determining an upmix coefficient h.
Fig. 7 is a block diagram showing how 6 upmix channels can be determined by using two partial prototypes.
FIG. 8 is a block diagram of a system including a prototype generator utilizing multiple past inputs and outputs.
Fig. 9 is a two-microphone array receiving a source signal.
Fig. 10 is a two-microphone array receiving a source signal and a noise signal.
FIG. 11 is a graph of measured average signal-to-noise ratio gain and retained signal ratio for an MVDR design versus a time-frequency masking scheme.
Fig. 12 is a graph of average target and noise signal power.
Fig. 13 is a graph of signal-to-noise ratio gain and retained signal ratio.
Fig. 14 is a graph of signal-to-noise ratio gain and retained signal ratio.
Fig. 15 is a graph of signal-to-noise ratio gain and retained signal ratio.
Detailed Description
1 Overview of the System
Referring to FIG. 1, an example of a system using an estimate of a synthetic prototype is an upmix system 100, which includes an upmix module 104 that receives input signals 112, s_1(t), ..., s_N(t), and outputs an upmixed signal d̂(t). As an example, the input time signals s_1(t) and s_2(t) represent a left input signal and a right input signal, and d̂(t) represents a derived center channel. The upmix module 104 forms the upmixed signal d̂(t) as a combination of the input signals s_1(t), ..., s_N(t) 112, for example, as a (time-varying) linear combination of the input signals. Generally, the upmixed signal d̂(t) is formed by the estimator 110 as a linear estimate of a prototype signal d(t) 109, which is formed by the prototype generator 108 from the input signals, generally by a non-linear technique. In some examples, the estimate is formed as a linear (e.g., frequency weighted) combination of the input signals that best approximates the prototype signal in a least mean square error sense. The linear estimate is typically based on a generative model 102 for the set of input signals 112, the generative model 102 combining the target signal with noise components 114 each associated with one of the input signals 112.
In the system 100 shown in FIG. 1, the synthetic prototype generation module 108 forms the prototype d(t) 109 as a non-linear transformation of the set of input signals 112. It should be appreciated that linear techniques may also be used to form prototypes, for example, prototypes formed from different subsets of the input signals than the subset used to estimate the output signal from the prototypes. For certain types of prototype generation, if presented directly to the listener rather than through the linear estimator 110, the prototype may include degradation and/or artifacts that would produce a low quality audio output. As introduced above, in some examples, the prototype d(t) corresponds to a desired upmix of the input signals. In other examples, prototypes are formed for other purposes, e.g., based on identification of a desired signal in the presence of interference.
In some embodiments, the process of forming the prototype signal is more local in time and/or frequency than the estimation process, and the estimation process may therefore introduce a degree of smoothing that can compensate for objectionable characteristics in the prototype signal resulting from the local processing. On the other hand, the local nature of the prototype generation provides a degree of flexibility and control that enables processing (e.g., upmixing) that is otherwise difficult to achieve.
2 Component decomposition
In some embodiments, the upmixing module 104 of the upmixing system 100 shown in fig. 1 is implemented by splitting each input signal 112 into components (e.g., frequency bands) and processing each component individually. For example, in the case of orthogonal components, the linear estimator 110 may be implemented by forming an estimate of each orthogonal component independently, and then synthesizing the output signal from the estimated components. It should be appreciated that although the following description focuses on components that are formed as frequency bands of the input signal, other decompositions of the components that are orthogonal or substantially independent may be used as well. These alternative decompositions may include wavelet transforms of the input signal, non-uniform (e.g., psycho-acoustic critical bands; octaves) filter banks, perceptual component decompositions, quadrature mirror filter banks, statistical (e.g., principal component) based decompositions, and the like.
Referring to FIG. 2, one embodiment of the upmix module 104 is configured to process the decomposition of the input signals (two input signals in this example) in a manner similar to that described in U.S. patent 7,630,500, entitled "Spatial Disassembly Process," which is incorporated herein by reference. Each input signal 112 is transformed into a component representation having individual components 212. For example, the input signal s_1(t) is decomposed into a set of components s_1,i(t) indexed by i. In some examples, and as described in the above-referenced patent, the component analyzer 220 is a Discrete Fourier Transform (DFT) analysis filter bank that converts the input signal into frequency components. In some examples, the frequency components are the outputs of zero-phase filters, each having an equal bandwidth (e.g., 125 Hz).
A reconstruction module 230 is used to reconstruct the output signal from the components. The component analyzer 220 and the reconstruction module 230 are designed such that, if the components pass through without modification, the initially analyzed signal is substantially (i.e., not necessarily completely) reproduced at the output of the reconstruction module 230.
In some implementations, the component analyzer 220 windows the input signal 112 into equally sized time blocks that may be indexed by n. The blocks may overlap (i.e., a portion of the data of one block may also be contained in another block) such that each window is shifted in time by a "hop size" τ. As an example, a window function (e.g., a square-root Hann window) may be applied to each block for the purpose of improving the resulting component representation 222. After applying the window function to the blocks, the component analyzer 220 may zero-pad each block of the input signal 112 and then decompose each zero-padded block into its respective component representation. In some embodiments, the components 212 form baseband signals, each modulated by the respective center frequency of its filter band (i.e., by a complex exponential). Furthermore, each component 212 may be downsampled and processed at a lower sampling rate sufficient for the bandwidth of the filter band. For example, the output of a DFT filter bank bandpass filter with a 125 Hz bandwidth may be sampled at 250 Hz without violating the Nyquist criterion.
In some examples, the input signal is sampled at 44.1 kHz and segmented into frames 23.2 ms (1024 samples) long, with 512 new samples per frame selected at a frame hop period of τ = 11.6 ms. Each frame is multiplied by a window function sin(π t / τ), where t = 0 indicates the start of the frame. The windowed frame forms the input to a 1024-point FFT, and each frequency component is formed from one output of the FFT. (Other windows may be selected that are longer or shorter than the input length of the FFT; if the input window is shorter than the FFT, the data may be extended by zero-padding to fit the FFT; if the input window is longer than the FFT, the data may be time-aliased.)
In FIG. 2, the windowing of the input signal and the subsequent overlap-add of the output signal are not shown. Thus, the figure should be understood to illustrate the processing of a single analysis window. More precisely, for the nth analysis window, given a continuous input signal s_k(t), a windowed signal s_k,[n](t) = s_k(t) w(t - nτ) is formed, where the window may be defined as w(t) = sin(π t / τ). In FIG. 2 these windowed signals are shown without the subscript [n]. The components of the signals are then defined by decomposing each windowed signal into components s_k,i,[n](t). The resulting output signals for each analysis period, d̂_[n](t), are then combined by overlap-add into the output d̂(t).
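By way of illustration only, the windowed analysis and overlap-add reconstruction described above can be sketched in Python as follows, assuming a 1024-point FFT, a sine window, and a 512-sample hop at 44.1 kHz; the function names and the normalization step are illustrative choices.

```python
import numpy as np

FS = 44100          # sampling rate (Hz), as in the example above
N_FFT = 1024        # analysis frame length / FFT size (about 23.2 ms)
HOP = 512           # frame hop (about 11.6 ms)
WIN = np.sin(np.pi * (np.arange(N_FFT) + 0.5) / N_FFT)  # sine (square-root Hann) window

def analyze(x):
    """Decompose x into a grid of complex components X[n, i] (window n, band i)."""
    n_frames = 1 + (len(x) - N_FFT) // HOP
    X = np.empty((n_frames, N_FFT // 2 + 1), dtype=complex)
    for n in range(n_frames):
        frame = x[n * HOP : n * HOP + N_FFT] * WIN
        X[n] = np.fft.rfft(frame)
    return X

def reconstruct(X):
    """Overlap-add the (possibly modified) components back into a time signal."""
    n_frames = X.shape[0]
    y = np.zeros((n_frames - 1) * HOP + N_FFT)
    norm = np.zeros_like(y)
    for n in range(n_frames):
        frame = np.fft.irfft(X[n], N_FFT) * WIN   # synthesis window matches analysis
        y[n * HOP : n * HOP + N_FFT] += frame
        norm[n * HOP : n * HOP + N_FFT] += WIN ** 2
    return y / np.maximum(norm, 1e-12)

# Passing the components through unmodified substantially reproduces the analyzed input:
x = np.random.randn(FS)              # one second of test signal
x_hat = reconstruct(analyze(x))
```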
3 Prototype synthesis
As described above, one approach for the synthesis of prototype signals is on a component-by-component basis, and particularly in a component-local basis, such that each component of each window period is processed separately to form one or more prototypes of that local component.
In FIG. 2, the component upmixer 206 processes a single pair of input components s_1,i(t) and s_2,i(t) to form an output component d̂_i(t). The component upmixer 206 includes a component-based local prototype generator 208 that determines the prototype signal component d_i(t) (typically at the downsampled rate) from the input components s_1,i(t) and s_2,i(t). In general, the prototype signal component is a non-linear combination of the input components. A component-based linear estimator 210 then estimates the output component d̂_i(t), as discussed further below.
The local prototype generator 208 may utilize synthesis techniques that provide the possibility to perform a wide range of transformations that may not otherwise be possible by using linear processing techniques alone. For example, upmixing, modification of room acoustics, and signal selection (e.g., for phones and hearing aids) may all be accomplished using such synthesis processing techniques.
In some embodiments, the local prototype signal is derived based on knowledge or assumptions about the characteristics of the desired and undesired signals as observed in the input signal space. For example, the local prototype generator selects inputs showing the characteristics of the desired signal, while suppressing inputs not showing the characteristics of the desired signal. In this context, selection means passing with some predefined maximum gain (for example, unity gain), while in the limiting case, suppression means passing with zero gain. The selection function may have a binary characteristic (passing regions with unity gain, rejecting regions with zero gain) or a gentle transition between passing signals with desired characteristics and rejecting signals with undesired characteristics. The selection function may comprise linearly modified inputs, one or more non-linear gates of the inputs, multiplicative combinations of the inputs (of arbitrary order), and linear combinations of other non-linear functions of the inputs.
In some implementations, the synthesis prototype generator 208 generates virtually instantaneous (i.e., temporally local) "guesses" of the desired signal at the output, without having to consider whether such a sequence of guesses would directly synthesize an artifact-free signal.
In some embodiments, the method described in U.S. patent 7,630,500, incorporated herein by reference, for calculating components of an output signal is used in the present method to calculate components of a prototype signal, which are then subjected to further processing. Note that in these examples, the method may differ in characteristics (such as the time and/or frequency ranges of the components) from the method described in the referenced patent. For example, in the present approach, the window hop rate may be higher, which results in a more time-local synthesis of the prototype; in some synthesis approaches, such a higher hop rate could result in more artifacts if the method described in the cited patent were used directly.
Referring to FIG. 4A, an exemplary multiple-input prototype generator 408 for the center channel (an example of the non-linear prototype generator 208 shown in FIG. 2) is illustrated in the complex plane for a single time value. The following formula (applied independently to each component) defines this local prototype:

d(t) = (1/2) min(|s_1(t)|, |s_2(t)|) ( s_1(t)/|s_1(t)| + s_2(t)/|s_2(t)| )

where the component index i is omitted for clarity. Note that this example is a special case of the example shown in U.S. patent 7,630,500 (at equation (16)).

Note that the input signals 412 (s_1(t) and s_2(t)) are drawn as vectors, since their baseband representations are complex signals. The above formula indicates that the center local prototype d(t) is the average of equal-length portions of the two complex input signals 412. In other words, the one of the two inputs 412 with the larger magnitude is scaled by a real coefficient to match the smaller magnitude, and then the two are averaged. The local prototype signal has a selection characteristic such that its output is largest in amplitude when the two inputs 412 are in phase and equal in magnitude, and its output decreases as the level and phase differences between the signals increase. It is zero for hard-panned and phase-inverted left and right signals. Its phase is the average of the phases of the two input signals. The vector gating function can thus generate a signal having a phase different from either of the original signals, even though the components of the vector gating factor are real-valued.
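By way of illustration only, the local prototype just described can be sketched in Python as follows; the function name and the small epsilon guard are illustrative choices.

```python
import numpy as np

def center_prototype(s1, s2, eps=1e-12):
    """Local center prototype d(t) for complex baseband components s1, s2.

    The input with the larger magnitude is scaled (by a real factor) down to the
    smaller magnitude, and the two equal-length vectors are then averaged.  The
    result is largest when s1 and s2 are in phase and equal in magnitude, and is
    zero for hard-panned or phase-inverted content.
    """
    s1 = np.asarray(s1, dtype=complex)
    s2 = np.asarray(s2, dtype=complex)
    m = np.minimum(np.abs(s1), np.abs(s2))          # common (smaller) magnitude
    u1 = s1 / np.maximum(np.abs(s1), eps)           # unit vectors (phase only)
    u2 = s2 / np.maximum(np.abs(s2), eps)
    return 0.5 * m * (u1 + u2)                      # average of the equal-length parts

# In-phase, equal-magnitude inputs pass at full level; hard-panned content is zero:
print(center_prototype(1 + 1j, 1 + 1j))   # ~ (1+1j)
print(center_prototype(1 + 1j, 0))        # ~ 0
```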
Referring to FIG. 5, another example of a prototype generation module 508 (another example of the prototype generator 208 shown in FIG. 2) includes a gating function 524 and a scaler 526. The gating function module 524 accepts the input signals 512 and uses them to determine a gating factor g_i, which is held constant during an analysis interval corresponding to one window of the input signal. The gating function module 524 may switch between 0 and 1 based on the input signals 512. Alternatively, the gating function module 524 may implement a smooth slope, in which the gating factor is adjusted between 0 and 1 based on the input signals 512 and/or their history over many analysis windows. One of the input signals 512 (e.g., s_1(t)) and the gating factor g are applied to the scaler 526 to generate the local prototype d(t). This operation dynamically adjusts how much of the input signal 512 is included in the output of the system. Since g is a non-linear function of s_1 and s_2, d(t) is not a linear function of s_1; the local prototype is thus a non-linear modification of s_1 that depends on s_2. Since the gating factor is real-valued, the local prototype d and s_1 have the same phase; only the amplitude is modified. Note that the gating factor is determined on a component-by-component basis, and the gating factor for each band is adjusted on an analysis-window-by-analysis-window basis.
One exemplary use of the gating function is for processing input from a telephone headset. The headset may include two microphones configured to be spaced apart from each other and substantially collinear with the primary direction of acoustic propagation of the speaker's voice. The microphones provide the input signals 512 to the prototype generation module 508. The gating function module 524 analyzes the input signals 512, for example, by observing the phase difference between the two microphones. Based on the observed differences, the gating function 524 generates a gating factor g_i for each frequency component i. For example, when the phases at the two microphones are equal, the gating factor g_i may be 0, indicating that the recorded sound is not the speaker's voice but an extraneous sound from the environment. Alternatively, the gating factor may be 1 when the phase difference between the input signals 512 corresponds to the acoustic propagation delay between the microphones.
In general, various prototype synthesis methods can be formulated as a gating of the input signals, where the gating is based on coefficients ranging from 0 to 1, which can be expressed in vector-matrix form as:

d(t) = [g_1  g_2] [s_1(t), s_2(t)]^T

where 0 ≤ g_1, g_2 ≤ 1.
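By way of illustration only, such a gated prototype for the two-microphone headset example can be sketched in Python as follows; the smooth gate shape, the assumed microphone spacing, and the parameter names are illustrative assumptions rather than features of any particular embodiment.

```python
import numpy as np

def gated_prototype(s1, s2, freqs, mic_spacing=0.02, c=343.0, width=0.5):
    """Gate s1 per frequency component, based on the inter-microphone phase difference.

    s1, s2 : complex components (one value per frequency bin) for the two microphones
    freqs  : center frequency of each bin (Hz)

    The expected phase difference for sound arriving along the microphone axis is
    2*pi*f*(mic_spacing/c).  Components whose observed phase difference is close to
    that value are passed (g ~ 1); others are attenuated (g ~ 0).
    """
    s1 = np.asarray(s1, dtype=complex)
    s2 = np.asarray(s2, dtype=complex)
    expected = 2.0 * np.pi * np.asarray(freqs) * (mic_spacing / c)
    observed = np.angle(s1 * np.conj(s2))
    # wrap the deviation to [-pi, pi] and map it smoothly to a gain in [0, 1]
    dev = np.angle(np.exp(1j * (observed - expected)))
    g = np.clip(1.0 - np.abs(dev) / (np.pi * width), 0.0, 1.0)
    return g * s1, g          # local prototype d(t) = g * s1(t), and the gate itself
```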
In another example, the gating function is configured for use in a hearing assistance device in a manner similar to that described in U.S. patent publication 2009/0262969, entitled "Hearing Assistance Apparatus," which is incorporated herein by reference. In this configuration, the gating function is configured to provide more enhancement to sound sources that the user faces than to sound sources that the user does not face.
In another example, the gating function is configured for use in a voice recognition application, where the prototype is determined in a manner similar to the method of determining the output components in U.S. patent publication 2008/0317260, entitled "Sound Discrimination Method and Apparatus," which is incorporated herein by reference. For example, the output of the multiplier (42), which in the cited disclosure is the product of the input and the gain (40) (i.e., the gating term), is applied as a prototype in the present method.
4 Output estimation
Referring again to FIG. 1, the estimator 110 is configured to determine the output d̂(t) that best matches the prototype d(t). In some implementations, the estimator 110 is a linear estimator that matches d(t) in a least squares sense. Referring again to FIG. 2, for at least some forms of the estimator 110, this estimation may be performed on a component basis, since typically the error in each component is uncorrelated due to the orthogonality of the components, and thus each component may be estimated separately. The component estimator 210 forms the estimate d̂_i(t) as a weighted combination of the input components. The weight w_i is selected for each analysis window by the least squares weight estimator 216, which forms a minimum error estimate based on the self-power spectra and cross-power spectra of the input signals s_1(t) and s_2(t).
The calculation implemented in some examples of the estimation module can be understood by considering a desired (complex) signal d(t) and a (complex) input signal x(t) together, and finding the real coefficient h that minimizes |d(t) - h x(t)|^2. The coefficient that minimizes this error can be expressed as

h = Re{ E{ d(t) x*(t) } } / E{ |x(t)|^2 }

where * denotes the complex conjugate and E{ } denotes an average or expectation over time. Note that numerically, if E{|x(t)|^2} is small, the calculation of h may be unstable, so a small value σ is added to the denominator to regularize the estimate:

h = Re{ E{ d(t) x*(t) } } / ( E{ |x(t)|^2 } + σ )
The autocorrelation S_XX and the cross-correlation S_DX are estimated over a time interval.
When applied to the windowed analysis shown in FIG. 2 (using the notation [n] to refer to the nth window), given a windowed input signal x_[n](t) (i.e., the nth window of an input signal x(t), one of the s_k(t)) and the corresponding prototype d_[n](t), local estimates of the auto- and cross-correlations within the window are formed as

S_XX[n] = E{ |x_[n](t)|^2 }

and

S_DX[n] = E{ d_[n](t) x*_[n](t) }
note that in the case where the components can be downsampled to a single sample per window, these expectations can each be as simple as a single complex multiplication.
In order to obtain robust estimates of the autocorrelation and cross-correlation coefficients, temporal averaging or filtering over multiple time windows may be used. For example, one form of filter is a decaying time average computed over earlier windows:

S̄[n] = a S̄[n-1] + (1 - a) S[n]

For example, when a = 0.9 and the window hop time is 11.6 ms, this corresponds to an averaging time constant of about 100 ms. Other causal or non-causal, finite impulse response or infinite impulse response, fixed or adaptive filters may be used. The regularization by the factor σ is then applied after the filtering.
For the case of estimating the weight h used to form an estimate of a prototype from a single component, one embodiment 700 of the least squares weight estimation module 216 is shown in FIG. 6. The input component is identified as X in the figure (e.g., a component s_i(t) downsampled to a single sample per window), and the prototype component is identified as D. FIG. 6 shows a discrete-time filtering method that updates once per window period. Specifically, S_DX is calculated along the top path by taking the complex conjugate of X 750, multiplying the complex conjugate of X by D 752, and then low-pass filtering the product along the time dimension 754; the real part of S_DX is then extracted. S_XX is calculated along the bottom path by squaring 760 the magnitude of X and then low-pass filtering 762 the result along the time dimension. A small value σ is then added 764 to S_XX to prevent division by zero. Finally, h is calculated by dividing Re{S_DX} by (S_XX + σ).
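By way of illustration only, the per-component weight update of FIG. 6 can be sketched in Python as follows, assuming one complex sample per analysis window, a one-pole smoother for the statistics, and a small regularization value sigma; the class and parameter names are illustrative.

```python
import numpy as np

class ComponentWeightEstimator:
    """Per-component least squares weight h, updated once per analysis window.

    h[n] = Re{S_DX[n]} / (S_XX[n] + sigma), where S_DX and S_XX are low-pass
    filtered (exponentially smoothed) cross- and auto-power statistics.
    """

    def __init__(self, a=0.9, sigma=1e-9):
        self.a = a              # smoothing constant (a = 0.9 ~ 100 ms at an 11.6 ms hop)
        self.sigma = sigma      # regularization added to the denominator
        self.S_dx = 0.0 + 0.0j  # smoothed cross-power  E{ D X* }
        self.S_xx = 0.0         # smoothed auto-power   E{ |X|^2 }

    def update(self, X, D):
        """X: complex input component for this window; D: complex prototype component."""
        self.S_dx = self.a * self.S_dx + (1.0 - self.a) * (D * np.conj(X))
        self.S_xx = self.a * self.S_xx + (1.0 - self.a) * (np.abs(X) ** 2)
        h = np.real(self.S_dx) / (self.S_xx + self.sigma)
        return h, h * X         # weight and the resulting output component estimate
```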
The calculations performed by the estimation module can be further understood by considering a combination of two inputs x(t) and y(t), with real coefficients h and g found to minimize |d(t) - h x(t) - g y(t)|^2. Note that the use of real coefficients is not necessary; in alternative embodiments where complex coefficients are used, the formula for the coefficient values is different (e.g., for complex coefficients, the Re{ } operation is omitted from all terms). In the case where real coefficients are used, the coefficients that minimize the error can be expressed as

[h, g]^T = ( Re{ [ S_XX, S_XY ; S_YX, S_YY ] } )^{-1} Re{ [ S_DX ; S_DY ] }

where S_XY = E{ x(t) y*(t) }, S_DX = E{ d(t) x*(t) }, and so on. As described above, each of the autocorrelation and cross-correlation terms is filtered over a series of windows and regularized prior to the calculation.
The matrix equations shown above for two channels are easily modified for any number of input channels. For example, for a vector of m prototypes d(t) = [d_1(t), ..., d_m(t)]^T and a vector of n input signals s(t) = [s_1(t), ..., s_n(t)]^T, an m × n matrix of weighting coefficients H can be calculated for forming the estimate using the following vector-matrix equation:

d̂(t) = H s(t)

by calculating the real matrix H as

H = ( Re{ E{ s(t) d(t)^H } } )^T ( Re{ E{ s(t) s(t)^H } } + σ I )^{-1}

where E{ s(t) d(t)^H } is an n × m matrix, E{ s(t) s(t)^H } is an n × n matrix, and ^H indicates the conjugate transpose. The covariance terms are calculated on a component-by-component basis as described above, and are filtered and regularized.
FIG. 3A is a graphical representation 300 of the time-component representation 322 for all input channels s_k(t) and for one or more prototypes d(t). Each tile 332 in the representation 300 is associated with a window index n and a component index i. FIG. 3B is a detailed view of a single tile 332. In particular, FIG. 3B shows that a tile 332 is constructed by first applying a time window 380 to each input signal 312. The time-windowed portion of each input signal 312 is then processed by the component decomposition module 220. For each tile 332, estimates of the auto-correlations 384 of the input channels 312 and of the cross-correlations 382 between each input and each output are computed, and are then filtered 386 in time and regularized to preserve numerical stability. Each weighting factor is then calculated according to a matrix formulation of the form shown above.
Note that in the above description, smoothing of the correlation coefficient is performed in time. In some examples, smoothing is also across components (e.g., frequency bands). Furthermore, the characteristics of smoothing across components may not be the same, e.g., having a larger frequency range at higher frequencies than at lower frequencies.
5 Other examples
In the following examples, the dependency on the time variable t is omitted for simplicity of notation. Note that for some choices of analysis period τ, only a single value is required to represent the components, and thus omitting the dependence on t may be considered to correspond to a single (complex) value representing the analysis component. Also, in general, the weighting values are typically complex numbers rather than real numbers as is the case in the particular example presented above.
5.1 Multi-dimensional input
As a first example, generalizing the method presented above, a scalar prototype d may be estimated from n inputs x (an n × 1 column vector) by estimating a vector w of n weights (an n × 1 column vector) to satisfy

d̂ = w^T x

by calculating

w = E{ x* x^T }^{-1} E{ x* d }

where (for n = 2)

w = [w_1, w_2]^T

x = [x_1, x_2]^T

and

E{ x* x^T } = [ E{|x_1|^2}, E{x_1* x_2} ; E{x_2* x_1}, E{|x_2|^2} ]

Thus d is the local time-frequency estimate of the desired signal (i.e., the desired prototype), and the goal is to find the vector w such that the linear combination of the inputs (i.e., w^T x) best fits d in the least squares sense.
The resulting least squares estimate d̂ of d has a smoothing effect on d, which listeners may find acoustically preferable. The estimate of the desired prototype, d̂ = d - e (where the e term is the remaining least squares estimation error), preserves the desirable characteristics of d but may be more audibly pleasing than d alone. In addition, d̂ may preserve the desired behavior of d better than a simple smoothed version of d.
5.2 Multiple input offset
In the previous example, a short-time implementation of the least squares solution is optionally achieved by applying a low-pass filter (i.e., a short-time expectation operator and/or cross-frequency smoothing of the statistics) to the cross- and auto-statistics of the closed-form solution for w. While the previous example used a short-time implementation of a least squares solution for smoothing a single prototype signal, it should be noted that the short-time implementation can be extended by adding constraints and applied to various other problems (e.g., dynamic filter coefficients). In particular, it can be seen as a short-time implementation of a time-varying, closed-form least squares solution, which may be applied to various other situations.
In general, in the above method, it is assumed that the prototype estimate for frequency component i at time frame n depends on the input signal at the same component and frame index, and possibly indirectly on other components and time frames through smoothing of the statistics used in the estimation. More generally, a prototype d_n at time frame n (or more precisely, a prototype d_{n,i} for frequency component i at time frame n, but the dependence on i is omitted for simplicity of notation) may depend on the inputs x_n, ..., x_{n-k+1} over a range of k time frames n-k+1, ..., n, and each input x_i may be a vector comprising values of frequency components other than that of the prototype being estimated.
Referring to FIG. 8, in a second example, a system 800 receives an input signal x_n, where n indexes, for example, the nth frame of the input signal. In this example, the prototype generator 802 utilizes the input component x_n together with multiple past inputs and/or past prototype estimates y_{n-1}, ..., y_{n-l} to determine the prototype signal component d_n at time n. An example of the prototype generator 802 assumes that d_n is a weighted linear combination of the current and past inputs and the past outputs, plus some estimation error, so that the prototype estimate takes the form of an IIR filter, as follows:
d_n = b_0 x_n + b_1 x_{n-1} + ... + b_k x_{n-k} + a_1 y_{n-1} + a_2 y_{n-2} + ... + a_l y_{n-l} + e_n
It can also be expressed as:

d_n = c^T z + e_n

where

c = [b_0, b_1, ..., b_k, a_1, a_2, ..., a_l]^T

and

z = [x_n, x_{n-1}, ..., x_{n-k}, y_{n-1}, ..., y_{n-l}]^T
The prototype signal component d_n is passed to a component-based linear estimator 804 (e.g., a least squares estimator), which determines a vector w that minimizes, in a least squares sense, the difference between the prototype signal component d_n and w^T z:

w = R_z^{-1} E{ z d_n* }

where

R_z = E{ z z^H }
Note that since z is a (k + l + 1)-element column vector of input and output values, R_z is (k + l + 1) × (k + l + 1), so R_z can be expensive to compute and invert when many inputs are used.
The output w of the component-based linear estimator 804 is passed to a linear combination module 806 (e.g., an IIR filter), which forms its estimate in the same manner as the prototype generator 802, as a combination of x_n, past input values, and past output values. However, the linear combination module 806 replaces the values b_0, b_1, ..., b_k and a_1, a_2, ..., a_l with the values in the w vector (i.e., using w_0 in place of b_0, w_1 in place of b_1, and so on). The output y_n of the linear combination module 806 is an estimate of d_n.
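By way of illustration only, the following Python sketch shows one possible realization of this IIR-form estimation for a single component; the regressor length, the one-pole smoothing of the statistics, and the w^H z output convention are assumptions of the sketch.

```python
import numpy as np

class IIRPrototypeEstimator:
    """Estimate d_n with y_n formed from [x_n ... x_{n-k}, y_{n-1} ... y_{n-l}]."""

    def __init__(self, k=2, l=2, a=0.9, sigma=1e-6):
        self.k, self.l = k, l
        self.a, self.sigma = a, sigma
        dim = k + 1 + l
        self.x_hist = np.zeros(k + 1, dtype=complex)   # x_n, x_{n-1}, ..., x_{n-k}
        self.y_hist = np.zeros(l, dtype=complex)       # y_{n-1}, ..., y_{n-l}
        self.R = np.zeros((dim, dim), dtype=complex)   # smoothed E{ z z^H }
        self.r = np.zeros(dim, dtype=complex)          # smoothed E{ z d* }

    def update(self, x_n, d_n):
        # build the regressor z_n from the current/past inputs and past outputs
        self.x_hist = np.roll(self.x_hist, 1)
        self.x_hist[0] = x_n
        z = np.concatenate([self.x_hist, self.y_hist])
        # smooth the second-order statistics over analysis windows
        self.R = self.a * self.R + (1 - self.a) * np.outer(z, z.conj())
        self.r = self.a * self.r + (1 - self.a) * z * np.conj(d_n)
        # regularized least squares weights and the resulting output estimate
        w = np.linalg.solve(self.R + self.sigma * np.eye(len(z)), self.r)
        y_n = np.vdot(w, z)                            # apply the solved weights to z
        self.y_hist = np.roll(self.y_hist, 1)
        self.y_hist[0] = y_n
        return y_n
```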
5.3 Constrained prototype estimation
In some examples, it is desirable to estimate multiple prototype signals from multiple input signals, such that the weight values for each prototype are constrained to be the same for each prototype, but applied to different input signals, for example. As one possible example, if each prototype is a different time frame (i.e., delay) of a particular signal component, it may be desirable that the filtering of the input component be time invariant at different lags. Another example is presented in section 5.7 below.
In general, let d be the N × 1 vector of desired signals, d = [d_0, d_1, ..., d_{N-1}]^T, and let w = [w_0, w_1, ..., w_{P-1}]^T be a P × 1 vector of coefficients used to linearly combine N separate P × 1 input vectors. The input signals combined using w may differ for each desired prototype signal. Specifically, there is a separate P × 1 input vector x_i for each desired signal (i = 0, 1, ..., N-1), where
d_0 = w^T x_0 + e_0
d_1 = w^T x_1 + e_1
...
d_{N-1} = w^T x_{N-1} + e_{N-1}

The N × P input matrix Z may then be formed as

Z = [x_0, x_1, ..., x_{N-1}]^T

Then (noting that d_i = w^T x_i + e_i = x_i^T w + e_i), the system of equations can be rewritten as

d = Zw + e
Where w is a vector of weighting coefficients:
w=[w0,w1,…,wP-1]T
a closed-form solution that simultaneously minimizes the difference between each prototype signal component d and Zw in the least squares sense is as follows:
w=E{ZHZ}-1E{ZHd}
5.4 Weighted least squares
In the above example, by minimizing the errors e_i equally, each input value is effectively treated as having the same importance in determining the prototype estimate. However, in some examples, it may be useful to allow certain inputs to be more or less important than others. This may be achieved using a weighted least squares solution.
The weighted least squares solution defines G as an N × N diagonal matrix containing a weight g_i for each input x_i:

G = diag(g_1, g_2, ..., g_N)
Including this matrix in the least squares solution described above makes errors on more heavily weighted input constraints more costly than errors on less heavily weighted input constraints. This biases the least squares solution toward satisfying the constraints with larger weights. In some examples, the constraint weights change over time and/or frequency and may be driven by other information within the system. In other examples, within a given frequency band one constraint may take precedence over another, or vice versa.
The least squares solution including the weight matrix G can be expressed as:

w = E{ Z^H G Z }^{-1} E{ Z^H G d }
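By way of illustration only, the weighted, constrained least squares solve can be sketched in Python as follows; with G equal to the identity it reduces to the unweighted solution of the previous section, and the one-pole smoothing of the statistics is an assumption of the sketch.

```python
import numpy as np

def weighted_ls_weights(EZhGZ, EZhGd):
    """Solve w = E{Z^H G Z}^{-1} E{Z^H G d} for the combining weights w."""
    return np.linalg.solve(EZhGZ, EZhGd)

class ConstraintSmoother:
    """Accumulate smoothed E{Z^H G Z} and E{Z^H G d} over analysis windows."""

    def __init__(self, n_inputs, a=0.9):
        self.a = a
        self.A = np.zeros((n_inputs, n_inputs), dtype=complex)
        self.b = np.zeros(n_inputs, dtype=complex)

    def update(self, Z, d, g):
        """Z: N x P input matrix, d: N desired values, g: N constraint weights."""
        G = np.diag(g)
        self.A = self.a * self.A + (1 - self.a) * (Z.conj().T @ G @ Z)
        self.b = self.a * self.b + (1 - self.a) * (Z.conj().T @ G @ d)
        return weighted_ls_weights(self.A, self.b)
```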
5.5example 1: multi-channel input with single locally desired prototype
In this example, the goal is to find the combination of the two input channel signals x_{1,n} and x_{2,n} at time index n that best estimates the desired signal d_n at time n. Thus,

d = d_n

and

Z = [x_{1,n}, x_{2,n}]

The result corresponds to the example presented in section 5.1.
5.6Example 2: single channel, adaptive FIR solution with single local desired prototype
This example differs from example 1 in that, instead of using two different channels as inputs, two different time samples of a single channel are used as inputs. The goal is to find the combination of the current (time n) and previous (time n-1) input signals x_n and x_{n-1} that best estimates the desired signal d_n at the current time n. Thus,

d = d_n

Z = [x_n, x_{n-1}]

and

w = E{ Z^H Z }^{-1} E{ Z^H d }

Thus, examples 1 and 2 show that the local desired signal d_n can be matched by taking inputs across channels and/or time; however, as the dimension P grows beyond 2, inverting the P × P matrix Z^H Z can become expensive. Note that additional desired signals (which correspond to additional input constraints, i.e., to the dimension N) may be used without increasing the size of the P × P matrix inversion.
5.7Example 3: multi-channel input with constrained prototype estimation
In some examples, least squares smoothing is applied to the microphone array. The raw signals from the microphones in the array are used to estimate the desired source signal component at a specific point in time and frequency. The goal is to determine a linear combination of the microphone signals that best approximates the instantaneous desired signal at that particular point in time and frequency. This application may be considered an extension of the application described in example 1 above.
As described more fully below, the least squares solution may not only provide the desired smooth behavior to the desired signal, but may also produce coefficients that provide cancellation when the coefficients being solved for are complex-valued.
Referring to FIG. 9, a source 1002 at an ideal or known source location produces a source signal (e.g., an audio signal) that propagates through the air to each microphone 1004 of a microphone array 1006 (which in this example includes two microphones, M1 and M2). As the source signal propagates from the source 1002 to the p-th microphone 1004 in the microphone array 1006, it is assumed to pass through a linear transfer function H_dp. In the following discussion, the transfer function for a particular signal component (e.g., frequency band) is referred to as h_dp.
If the geometry of the desired source 1002 relative to the location of the microphone array 1006 is known, the set of transfer functions between the ideal source location 1002 and the two microphones in the microphone array 1006 can be expressed as the vector

h_d = [h_d1, h_d2]^T
One example of this is an ear-hook microphone array, where the position of the mouth relative to the microphones is known (at least approximately), and thus the transfer functions may be predetermined or estimated in use.
One method for processing such microphone signals (where the transfer functions H_dp are known) is to first estimate the source signal s and then apply this signal to a prototype estimation process as described above.
Another preferred method is to form the prototype estimate from the individual input signals in such a way that the weighted combination of the input signals approximately (but not necessarily exactly) matches the known transfer function from the ideal source location. Thus, signals arriving from the ideal source location are generally passed unmodified.
One way to achieve this is to extend the prototype d_n with a unity prototype, d = [d_n, 1]^T. The unity prototype stems from the distortionless response constraint used in obtaining the better-known minimum variance distortionless response (MVDR) solution, as follows:

w^T h_d = 1

To determine the weighting vector such that the weighted input signals approximately match the known transfer function from the source, d is replaced by s in the estimation equation, as follows:

s_n ≈ w^T x_n = w^T h_d s_n

resulting in the following unity prototype:

1 ≈ w^T h_d

In the context of the general least squares solution, the prototype and input matrix may then be expressed as:

d = [d_n, 1]^T

Z = [x_n, h_d]^T
note that the above solution combines time-invariant constraints with a time-variant solution. Thus, based on the individual estimates dnAdditional constraints can be used to help prevent the transient solution for w from substantially corrupting any source signal originating from an ideal source position. Note however that this is not an absolute constraint (which strictly prohibits any distortion in the target source direction) as is the case for MVDR solutions.
As noted above, in some examples it is desirable for some prototypes in the vector d of prototypes to have more or less effect on the estimated signal than other prototypes. This can be achieved by including a weighting matrix G in the solution for w. Thus, the weighted solution for the example shown in FIG. 9 is as follows:

w = E{ Z^H G Z }^{-1} E{ Z^H G d }

and only a 2 × 2 matrix inversion is required.
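By way of illustration only, the following Python sketch combines the instantaneous prototype constraint with the unity (distortionless-style) constraint for a two-microphone array; the steering vector, constraint weights, and regularization used are illustrative assumptions.

```python
import numpy as np

def constrained_array_weights(x_n, d_n, h_d, g=(1.0, 1.0), sigma=1e-6):
    """Two-microphone, two-constraint weighted least squares solve.

    Constraint 1: the weights combined with the current inputs x_n should match
                  the instantaneous prototype d_n.
    Constraint 2: the weights combined with the source steering vector h_d should
                  be 1, so signals from the known source location pass unmodified.
    Only a 2 x 2 (regularized) matrix inversion is required.
    """
    Z = np.vstack([x_n, h_d])                 # rows: [x_n^T; h_d^T]
    d = np.array([d_n, 1.0 + 0.0j])
    G = np.diag(g)
    A = Z.conj().T @ G @ Z + sigma * np.eye(2)
    b = Z.conj().T @ G @ d
    return np.linalg.solve(A, b)              # complex weights w (one per microphone)

# Illustrative single time-frequency point:
h_d = np.array([1.0, np.exp(-1j * 0.3)])      # assumed steering vector to the two mics
x_n = 0.8 * h_d + 0.1 * np.array([1.0, -1.0]) # source plus a little noise
w = constrained_array_weights(x_n, d_n=0.8, h_d=h_d)
y = x_n @ w                                   # array output for this component
```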
Referring to FIG. 10, the above example can be extended to include an additional constraint such that the instantaneous coefficients w produce a null in a particular direction relative to the microphone array 1106. For example, that direction may be expressed through the transfer functions H_np between a noise (or other undesired) source 1108 at an ideal or known noise location N and the P microphones 1104 of the microphone array 1106 (where p indexes the p-th microphone). For the following discussion, the transfer function for a signal component (e.g., frequency band) is referred to as h_np. For the example of FIG. 10, the desired prototype vector and input matrix (for the case of two microphone elements) may be expressed as follows:

d = [d_n, 1, 0]^T

and

Z = [x_n, h_d, h_n]^T, where h_n = [h_n1, h_n2]^T

The weighted solution for this example tends to produce a null (i.e., attenuation) in the direction of the noise source while preserving the source signal.
Although the two examples above each involve two microphones, the number of microphones may be any number P greater than 2. In this general case, the microphone inputs due to the source may be expressed as:

x_n = h_d s_n

where

h_d = [h_d0, h_d1, ..., h_d(P-1)]^T
Furthermore, although the above examples describe prototypes applied to nulling and beamforming, it should be noted that any other arbitrary prototype may be used.
5.8Example 4 a: multiple desired prototypes with prototype input
In another example, a two-microphone array produces raw input signals x_1 and x_2. By observing differences between the raw input signals, an instantaneous estimate of the desired signal component at each microphone (d_1 and d_2) can be obtained. These local estimates of the desired signal may be used to obtain a local estimate of the noise signal at each microphone, as follows:

n_1 = x_1 - d_1

n_2 = x_2 - d_2
In one of the above examples, least squares smoothing was applied to the microphone array to clean up the estimate of the desired signal; the goal there was to determine the linear combination of microphone inputs that most closely resembles the desired signal estimate. In this example, another goal is to determine a linear combination of the input signals that best cancels the local estimate of the noise signal at a given time-frequency point, while still attempting to preserve the target signal. Using the general least squares solution, the problem can be expressed as:

d = [1, a]^T,  Z = [h_d, n]^T,  n = [n_1, n_2]^T

Here again, the top row in Z is the transfer function from the desired source to the array, and the desired array response in that direction is 1, while the desired response to the instantaneous noise estimate is some small value a.
5.9Example 4 b: added back to the original desired prototype
In another example, example 4a is extended to include the original input constraints. Thus, the input matrix and the desired vector are expressed as:
Considering that a solution for w is calculated for each frequency component, the constraint weights may vary as a function of time and frequency (G = G(t, f)). In some examples, it may be advantageous at certain times to give greater weight to certain constraints within a particular frequency range.
It is noted that, as the number of included constraints increases, the overall concept of a weighted, constrained least squares smoothing structure can be viewed as an implementation strategy for combining multiple desired behaviors with fine temporal and frequency resolution. Furthermore, in some instances it may not be possible to obtain all desired behaviors simultaneously because of limited degrees of freedom or conflicting requirements. However, this framework allows the desired behaviors to be emphasized dynamically (switching or blending smoothly between constraints) while smoothing the individual constraints in the desired manner.
5.10Example 4 c: fixed expected prototype with dynamic weights
In another example, undistorted response and noise cancellation are desired. The input matrix and the desired prototype vector are expressed as:
where a is 0 or some small value. In this example, the emphasis placed on each constraint depends on time- and/or frequency-varying values. For example, the weight matrix may be defined as:
wherein S ist,fMay function to emphasize the undistorted response constraint when the estimated target signal is present (or significant) and not to focus on the undistorted response constraint when the estimated target signal is not present (or significant). St,fOne example of (c) is an instantaneous estimate of the target signal energy | dn|2. When the energy of the target signal is high, | dn|2Putting into the weight matrix has the effect of emphasizing the distortion free response (DR) constraint. Thus, when the target signal is not present, the solution focuses more on satisfying the noise cancellation constraint. Vt,fIs an arbitrary weight function that can change with time or frequency over the noise cancellation constraints. It should be noted that the dynamic weighting of the constraints shown above is only one example, and in general any arbitrary function (e.g., inter-microphone coherence) may be used for the dynamic weighting.
5.11 Example 5a: Fast minimum output mixer
In one example, two input signals U and S (which, as in all previous examples, may be multi-channel time-domain or frequency-domain signals) are available. In this example, U and S both include the same desired signal but different noise signals (i.e., each is the sum of the common desired signal and a distinct noise component, N_U or N_S). Since both the desired signal and the noise signals may be time-varying and non-stationary, it may be useful to find a local time-frequency combination of U and S (i.e., w_U U + w_S S) that includes the smallest possible noise contribution while preserving the desired signal component present in both.
In this example, the desired prototypes, inputs, and weights may be expressed as:
and the least squares solution can be expressed as:
The first constraint is directed to minimizing the combination of U and S (or, if enforced strongly, making the combination of the two equal to 0). The second constraint attempts to enforce that the weights sum to one (i.e., w_U + w_S = 1); since the target signal is the same in U and S, it is preserved under this constraint. G is again a diagonal weight matrix that can place a larger or smaller weight on either constraint. In some examples, the values in matrix G need to be set carefully due to competition between the individual constraints.
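A single-bin sketch of these two constraints is given below. It assumes the constraint rows take exactly the stated form (drive w_U*U + w_S*S toward 0, drive w_U + w_S toward 1), with instantaneous bin values standing in for the smoothed short-time statistics that would be used in practice.

```python
import numpy as np

def min_output_mixer_weights(u, s, g=(1.0, 1.0), reg=1e-12):
    """Weights (w_U, w_S) for one time-frequency bin of the minimum output mixer.

    u, s : complex bin values of the inputs U and S (short-time smoothed
           statistics would normally be used in their place).
    g    : weights on the (output-minimization, weights-sum-to-one) constraints.
    """
    Z = np.array([[u, s],                # drive w_U*U + w_S*S toward 0
                  [1.0, 1.0]],           # drive w_U + w_S toward 1
                 dtype=complex)
    d = np.array([0.0, 1.0], dtype=complex)
    G = np.diag(np.asarray(g, dtype=complex))
    A = Z.conj().T @ G @ Z + reg * np.eye(2)
    return np.linalg.solve(A, Z.conj().T @ G @ d)   # (w_U, w_S)
```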
5.12 Example 5b
In another example, the weights described in example 5a are strictly enforced to have a mixer relationship, in which the output signal Y = α_k U + (1 - α_k) S is generated by the system. The mixing factor α_k can be dynamically determined as follows:
However, as in the above examples, a low-pass filter is used for the short-time expectation operation (i.e., E{·}), as in least squares smoothing, to obtain a fast local estimate of α_k.
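Since the closed-form expression for α_k is not reproduced above, the sketch below assumes one plausible form, obtained by minimizing the low-pass-smoothed output power E{|α U + (1 - α) S|^2}. The one-pole smoother, the clamp to [0, 1], and the function names are assumptions of the sketch.

```python
import numpy as np

def smooth(x, alpha=0.1):
    """One-pole low-pass filter along the time axis (short-time E{.})."""
    y = np.empty_like(x)
    acc = 0.0
    for i, v in enumerate(x):
        acc = (1.0 - alpha) * acc + alpha * v
        y[i] = acc
    return y

def mixing_factor(u, s, alpha=0.1, eps=1e-12):
    """Short-time estimate of the mixing factor for Y = a*U + (1 - a)*S.

    Derived here by minimizing the smoothed output power; this is an assumed
    closed form.  u, s : 1-D complex arrays, one frequency bin over time.
    """
    num = smooth(np.real(s * np.conj(s - u)), alpha)   # E{ Re[ S (S - U)* ] }
    den = smooth(np.abs(s - u) ** 2, alpha) + eps      # E{ |S - U|^2 }
    return np.clip(num / den, 0.0, 1.0)                # optional clamp to [0, 1]
```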
5.13 Experimental results: Microphone array processing under low SNR conditions
Under certain conditions, time-frequency masking or gating schemes have the potential to outperform better-known LTI methods (such as MVDR solutions). However, under very low SNR conditions in which the target signal is hardly ever the dominant source, time-frequency masking schemes tend to suppress too much of the desired signal and may not improve the signal-to-noise ratio the way a static spatial filter (i.e., MVDR) does. For a given noise environment, the ideal LTI solution yields a consistent improvement in signal-to-noise ratio independent of the ambient signal-to-interference ratio. Fig. 11 compares the measured average SNR gain and preserved signal ratio (PSR) of an MVDR design with the current time-frequency masking scheme using complex least squares smoothing. The negative PSR in the lower half of fig. 11 represents how much of the target signal is lost (in dB), on average, due to the array processing. This particular scenario includes a target speech signal mixed into reverberant crosstalk at an overall root-mean-square SNR of -6 dB. The average target and noise signal power spectra used for this experiment are shown in fig. 12. Note that above 1.5 kHz, where the local SNR is approximately 0 dB, the time-frequency masking scheme has minimal target signal loss yet still provides several dB of SNR gain compared to the static MVDR design. In the 400-600 Hz range, where the target on average has significant energy but the SNR is poor (about -6 dB), the time-frequency masking scheme provides up to 8 dB of SNR gain, but at the cost of more target signal loss. Below 150 Hz, where the local SNR is very poor, the MVDR solution performs better than the time-frequency masker at removing noise.
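The two metrics compared in fig. 11 could be computed along the lines sketched below, given separate access to the target and noise components of a reference input and of the processed output. The exact definitions used for the figures are not stated here, so these are assumed, plausible definitions (SNR gain as the difference of output and input SNR in dB, PSR as the ratio of preserved to original target power in dB).

```python
import numpy as np

def snr_db(target, noise, eps=1e-12):
    """Signal-to-noise ratio in dB from separated target and noise components."""
    return 10.0 * np.log10((np.mean(np.abs(target) ** 2) + eps) /
                           (np.mean(np.abs(noise) ** 2) + eps))

def snr_gain_db(t_in, n_in, t_out, n_out):
    """SNR improvement of the processed output over the reference input."""
    return snr_db(t_out, n_out) - snr_db(t_in, n_in)

def preserved_signal_ratio_db(t_in, t_out, eps=1e-12):
    """PSR: how much target power survives the processing (negative = loss)."""
    return 10.0 * np.log10((np.mean(np.abs(t_out) ** 2) + eps) /
                           (np.mean(np.abs(t_in) ** 2) + eps))
```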
As in example 4b, by applying additional constraints to the weighted least squares solution it is possible to trade off the different operating characteristics, each in the frequency range where it is most relevant. Furthermore, the audio quality benefits of the original least squares smoothing method can be largely preserved while adding this flexibility. In the following example, a constrained least squares approach is used to obtain a single solution that combines some of the advantages of the MVDR approach and the time-frequency masking approach. The desired vector and input matrix used are as follows:
where a is some small value or signal. The first constraint pulls the solution toward an undistorted response in the h_d direction. The second constraint drives the solution toward suppression and cancellation of the inputs. The final constraint is the original constraint, which drives a linear combination of the inputs toward the desired signal estimate obtained via time-frequency masking. In this example, the weight functions are applied such that the undistorted response and input cancellation constraints dominate at low frequencies, while the time-frequency masking desired-signal constraint dominates at higher frequencies. The SNR gain and PSR from this experiment are given in fig. 13.
Note that the SNR gain benefit of the time-frequency masker is largely preserved, while the SNR gain below 200 Hz is improved to equal that of the MVDR solution. The PSR of the constrained least squares method is only slightly improved in this case, but is at least no worse than using the time-frequency masker alone. Fig. 14 shows the results of using a different set of weight functions, in which the undistorted response constraint is given even more emphasis at some frequencies. The SNR gain is mostly as good as or better than the MVDR solution, while the PSR improves over the previous example.
Fig. 15 illustrates the behavior when only the first two constraints (i.e., unit response and cancellation) are used, with the unit response constraint configured to dominate via the weighting matrix; the behavior is significantly closer to the static MVDR solution. Thus, including these additional weighted constraints in the least squares smoothing solution can provide multiple benefits. It continues to provide the desired smooth behavior of the original least squares method. Furthermore, for microphone array applications using time-frequency masking, it allows the array processor to trade off different desired behaviors (via the weight functions) to produce a more nearly optimal solution. Moreover, since adding multiple constraints does not increase the size of the matrix inversion in the least squares solution, the additional processing requirements may not be a significant consideration.
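A per-bin sketch of the three-constraint solve used in this experiment is given below. The text states only that the undistorted response and input cancellation constraints dominate at low frequencies and that the time-frequency-masked desired-signal constraint dominates at higher frequencies; the specific crossover frequency and the simple weight functions in the sketch are therefore illustrative assumptions.

```python
import numpy as np

def combined_constraint_weights(h_d, x, d_est, f_hz, crossover_hz=300.0,
                                a=0.0, reg=1e-12):
    """Per-bin weights combining (1) an undistorted response toward h_d,
    (2) cancellation of the inputs, and (3) matching the time-frequency-masked
    desired estimate d_est, with frequency-dependent constraint emphasis.

    h_d : (M,) assumed steering vector; x : (M,) input bin values;
    d_est : complex masked desired estimate; f_hz : bin center frequency.
    """
    x = np.asarray(x, dtype=complex)
    m = x.shape[0]
    Z = np.vstack([np.asarray(h_d, dtype=complex),   # -> 1  (distortionless)
                   x,                                # -> a  (cancellation)
                   x])                               # -> d_est (masked prototype)
    d = np.array([1.0, a, d_est], dtype=complex)
    lo = 1.0 / (1.0 + (f_hz / crossover_hz) ** 2)    # emphasis at low frequencies
    hi = 1.0 - lo                                    # emphasis at high frequencies
    G = np.diag(np.array([lo, lo, hi], dtype=complex))
    A = Z.conj().T @ G @ Z + reg * np.eye(m)
    return np.linalg.solve(A, Z.conj().T @ G @ d)
```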
6 Component reconstruction
Because the component decomposition modules 220 (e.g., DFT filter banks) have linear phase, the single channel up-mix outputs have the same phase and can be recombined without phase interaction to produce various degrees of signal separation.
The component reconstruction is performed in the component reconstruction module 230. The component reconstruction module 230 performs the inverse operation of the component decomposition module 220 to construct a spatially separated time signal from the several components 222.
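As a rough sketch of the kind of linear-phase analysis/synthesis pair that the component decomposition module 220 and component reconstruction module 230 could use, a windowed DFT filter bank with overlap-add reconstruction is shown below. The FFT size, hop size, and window are arbitrary choices for the sketch, not parameters taken from this description.

```python
import numpy as np

def analyze(x, n_fft=1024, hop=512):
    """Toy DFT-filter-bank analysis (module 220 style): windowed frames -> bins."""
    win = np.hanning(n_fft)
    frames = [win * x[i:i + n_fft]
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames])   # (n_frames, n_bins)

def synthesize(spec, n_fft=1024, hop=512):
    """Inverse operation (module 230 style): weighted overlap-add reconstruction."""
    win = np.hanning(n_fft)
    out = np.zeros(hop * (len(spec) - 1) + n_fft)
    norm = np.zeros_like(out)
    for k, frame in enumerate(spec):
        start = k * hop
        out[start:start + n_fft] += win * np.fft.irfft(frame, n_fft)
        norm[start:start + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-12)
```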
7 Examples of the invention
In section 3, input signals s1(t) and s2(t) correspond to the left signal l(t) and the right signal r(t), respectively, and the prototype d(t) serves as the center channel c(t). In one example, a similar approach may be applied to determine prototype signals for a "left-only" signal l_o(t) and a "right-only" signal r_o(t). Referring to fig. 4B, an exemplary local prototype for a "side-only" channel is shown. Note that in some examples the local prototypes may originate from a single channel, while in other examples they may originate from two or more channels.
The following formula defines one form of this exemplary prototype:
and,
wherein, for the sake of clarity, the component index i is omitted from the above formula. A portion of each input signal 412 is combined to construct the center prototype. The local "side-only" prototype is the remainder of each input signal 412 after its contribution to the center channel. For example, with respect to l_o(t), if l(t) is smaller than r(t) (in length), then the prototype equals 0. When l(t) is larger than r(t), the prototype has a length equal to the difference of the lengths of the input signals 412, and the same direction as the input l(t).
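Because the defining formulas are not reproduced above, the following sketch reconstructs the "side-only" prototypes from the verbal rule just given (zero when the channel is the smaller of the two, otherwise the length difference in the direction of that channel). It is an assumed reading of that rule, applied to one complex component.

```python
import numpy as np

def side_only_prototypes(l, r, eps=1e-12):
    """Local 'left-only' / 'right-only' prototypes for one component.

    Zero when the channel is the smaller of the two; otherwise the level
    difference, with the direction (phase) of that channel.
    l, r : complex component values (or arrays of them) for left and right.
    """
    mag_l, mag_r = np.abs(l), np.abs(r)
    l_only = l * np.maximum(mag_l - mag_r, 0.0) / (mag_l + eps)
    r_only = r * np.maximum(mag_r - mag_l, 0.0) / (mag_r + eps)
    return l_only, r_only
```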
Referring to FIG. 4C, an exemplary local prototype for a "surround" channel is shown. The "surround" prototype can be used for upmixing based on difference (out-of-phase) information. The following formula defines a local prototype of the "surround" channel:
where the component index i is omitted from the above formula for clarity. This local prototype is symmetrical to the center channel local prototype: it is maximal when the input signals 412 are equal in level and out of phase, and it decreases as the level difference increases or the phase difference decreases.
As described above, given the prototype signals, methods for estimating those prototype signals may differ in terms of the inputs that are combined to form the estimate. For example, as shown in FIG. 7, the prototype d(t) (referred to here as the center channel prototype c(t)) can produce two estimates, one from each input channel, wherein each estimate is formed as a weighting of an individual input:
and
These represent the portions of the center prototype contained in the left and right input channels, respectively. Using the definitions of the covariance and cross-covariance estimates above, these coefficients can be determined as follows:
and
For the definition of the surround channel s(t), the two estimates can be similarly formed as
And
wherein the negative sign relates to the phase asymmetry of the surround prototype, and the coefficients are determined as
And
in this example, there are four upmix channels as defined above:
and
The two additional channels are computed as the remaining left signal and the remaining right signal after removing the single-channel center and surround components:
and
for a total of six output channels originating from the original two input channels.
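The coefficient formulas are likewise not reproduced above; the sketch below assumes the natural single-input least-squares reading of the description, in which each coefficient is a smoothed cross-power divided by a smoothed auto-power, and assembles the six output channels for one component. The sign convention noted for the surround estimates is assumed to be folded into the surround prototype supplied to the function.

```python
import numpy as np

def project(prototype, x, eps=1e-12):
    """Least-squares weight and estimate of `prototype` using the single
    input `x`: w = E{p x*} / E{|x|^2}, estimate = w * x.
    E{.} is approximated by a mean over the smoothing window provided."""
    w = np.mean(prototype * np.conj(x)) / (np.mean(np.abs(x) ** 2) + eps)
    return w, w * x

def six_channel_upmix(l, r, c_proto, s_proto):
    """Six output channels for one component (frequency band), each prototype
    estimated from each input individually; residuals give left/right-only."""
    _, c_from_l = project(c_proto, l)
    _, c_from_r = project(c_proto, r)
    _, s_from_l = project(s_proto, l)
    _, s_from_r = project(s_proto, r)
    l_only = l - c_from_l - s_from_l      # residual left
    r_only = r - c_from_r - s_from_r      # residual right
    return c_from_l, c_from_r, s_from_l, s_from_r, l_only, r_only
```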
In another example, each upmixer output is generated by mixing both the left and right inputs into it. In this case, two coefficients are solved for each upmixer output using least squares: a left input coefficient and a right input coefficient. The output is generated by scaling each input by the corresponding coefficient and summing.
In this example, if the center channel and the surround channel are respectively approximated as:
and
the coefficients can be calculated as
Wherein
And
as described above, only the left signal and only the right signal are then calculated by removing components of the center signal and the surround signal from the input signal. Note that in other examples, only the left and only the right channels may be extracted directly rather than calculating them as residuals after subtracting the other extracted signals.
8 Alternative solutions
Several examples of local prototype synthesis, e.g., for a central channel, are provided above. However, various heuristics, physical gating schemes, and signal selection algorithms may be employed to construct the local prototype.
It should be understood that the prototype signal d(t) does not have to be calculated explicitly (e.g., as shown in figs. 1 and 2). In some examples, formulas for calculating the auto-power and cross-power spectra, or other characterizations of the prototype signals, are determined and then used in the estimator 210 to determine the weights w_k 217 without actually forming the signal d(t) 209, while still producing results that are identical or substantially identical to those that would have been obtained via the prototype definition. Similarly, other forms of estimator need not use weighted input signals to form the estimated signal. Some estimators do not have to utilize explicitly formed prototype signals, but rather use signals or data that describe characteristics of the prototype of the target signal (e.g., values representing statistical characteristics such as autocorrelation estimates or cross-correlation statistics, moments, etc. of the prototype), in such a way that the output of the estimator is an estimate according to a particular metric (e.g., a least-squared-error metric) used by the estimator.
It should also be understood that, in some examples, the estimation method may be understood as a subspace projection, where the subspace is defined by the set of input signals used as a basis for the output. In some examples, the prototype itself is a linear function of the input signals, but may be limited to a different subspace, defined by a different subset of the input signals than the subset used in the estimation stage.
In some examples, the prototype signal is determined using a different representation than that used in the estimation. For example, the prototype may be determined using a different component decomposition, or no component decomposition, than that used in the estimation stage.
It should also be understood that a "local" prototype may not necessarily be strictly limited to being calculated from input signals in a single component (e.g., frequency band) and a single time period (e.g., a single window of input analysis). For example, there may be limited use of nearby components (e.g., components that are perceptually close in time and/or frequency) while still providing more locality of the prototype synthesis than that of the estimation process.
The smoothing introduced by windowing of the temporal data can be further extended with mask-based time-frequency smoothing or other non-LTI smoothing.
The coefficient estimation rule may be modified to enforce a constant power constraint. For example, multiple prototypes may be estimated simultaneously without computing the remaining "side-only" signals, while a total power constraint is maintained such that the total left and right signal power is preserved across the sum of the output channels.
Given a pair of stereo input signals (L and R), the input space can be rotated. Such rotation may result in cleaner left-only spatial decomposition and right-only spatial decomposition. For example, left plus right and left minus right may be used as input signals (input space rotated by 45 degrees). More generally, the input signal may be subjected to a transformation, such as a linear transformation, prior to prototype synthesis and/or output estimation.
9 Applications
The method described in this application can be applied in various applications where the input signal needs to be spatially separated in a low delay and low artifact manner.
The method may be applied to stereo systems such as home cinema surround sound systems or car surround sound systems. For example, a two-channel stereo signal from a compact disc player can be spatially separated into several channels in a car.
The method may also be used in telecommunication applications such as telephone headsets. For example, the method may be used to nullify unwanted ambient sound from a microphone input of a wireless headset.
10 Implementation
Examples of the above-described methods may be implemented in software, in hardware, or in a combination of hardware and software. The software may include a computer readable medium (e.g., a disk or solid state memory) holding instructions for causing a computer processor (e.g., a general purpose processor, digital signal processor, etc.) to perform the above-described steps. In some examples, the method is embodied in a sound processor apparatus adapted (e.g., configurable) for integration into one or more types of systems (e.g., home audio, headphones, etc.).
It is to be understood that the foregoing description is intended to illustrate and not to limit the scope of the invention, which is defined by the scope of the appended claims. Other embodiments are within the scope of the following claims.

Claims (46)

1. A method for forming one or more output signals from a plurality of input signals, comprising:
decomposing the input signal into input signal components representing different frequency components at each of a series of times;
determining, from the input signal, a characterization of one or more prototype signals, the characterization of the one or more prototype signals comprising a plurality of prototype signal components representing different frequency components at each of the series of times, wherein each of the plurality of prototype signal components is a non-linear combination of the input signal components; and
processing a prototype signal of the one or more prototype signals to form the output signal as an estimate of the prototype signal, the estimate being based on and varying according to the input signal used to determine a characterization of the prototype signal, the output signal corresponding to a combination of the input signals used to determine a characterization of the prototype signal;
wherein forming the output signal as an estimate of the prototype signal comprises calculating an estimate of statistics relating to the prototype signal and corresponding input signals, and determining weighting coefficients to be applied to each of the corresponding input signals.
2. The method of claim 1, further comprising: the steps of determining the characterization of the one or more prototype signals and forming the output signal are repeated for each time in a series of times.
3. The method of claim 2, wherein the combination of one or more input signals comprises one or more input signals at a time corresponding to each time in the series of times.
4. The method of claim 2, wherein the combination of one or more input signals comprises one or more input signals at a plurality of times prior to each time in the series of times that the output signal is formed.
5. The method of claim 2, wherein the combination of one or more input signals comprises a plurality of input signals representing different frequency components at times corresponding to each time in the series of times.
6. The method of claim 1, wherein the input signal comprises a plurality of input audio channels of an audio recording, and wherein the output signal comprises additional upmix channels.
7. The method of claim 6, wherein the plurality of input audio channels includes at least a left audio channel and a right audio channel, and wherein the additional upmix channels include at least one of a center channel and a surround channel.
8. The method of claim 1, further comprising accepting the plurality of input signals from a microphone array.
9. The method of claim 8, further comprising synthesizing the one or more prototype signals from differences between the input signals, and wherein forming the prototype signal from differences between the input signals comprises determining gating values from gain and/or phase differences and applying the gating values to one or more of the input signals to determine the prototype signal.
10. The method of claim 8, wherein forming one or more output signals comprises forming the estimate of the one or more prototype signals from at least one of a characterization of a response to a desired one of the signals from the microphone array and a characterization of an undesired one of the signals from the microphone array.
11. The method of claim 10, wherein the characterization of the response to the desired signal or the characterization of the response to the undesired signal comprises a transfer function characteristic for the signal.
12. The method of claim 1, wherein determining the characterization of the prototype signal comprises determining the prototype signal.
13. The method of claim 1, wherein determining the characterization of the prototype signal comprises determining statistical characteristics of the prototype signal.
14. The method of claim 1, wherein determining the characterization of a prototype signal comprises forming data based on a time-local analysis of the input signal.
15. The method of claim 14, wherein determining the characterization of a prototype signal further comprises forming the data based on a frequency local analysis of the input signal.
16. The method of claim 14, wherein the estimating of the prototype signal is based on a more comprehensive analysis of the input signal and the prototype signal than a local analysis when the prototype signal is formed.
17. The method of claim 1, wherein the synthesis of the prototype signal comprises a non-linear function of the input signal.
18. The method of claim 17, wherein forming the output signal as an estimate of the prototype signal comprises forming a least squares error estimate of the prototype signal.
19. The method of claim 1, wherein the synthesis of the prototype signal comprises gating of one or more of the input signals.
20. The method of claim 1, wherein forming the output signal as an estimate of the prototype signal comprises forming a minimum error estimate of the prototype signal.
21. The method of claim 20, wherein forming the minimum error estimate comprises forming a least squares error estimate.
22. The method of claim 1, wherein the statistics include a mutual power statistic between the prototype signal and the corresponding input signal, and a self power statistic of the corresponding input signal.
23. The method of claim 1, wherein computing the estimate of the statistics comprises averaging locally computed statistics over time and/or frequency.
24. The method of claim 1, further comprising decomposing each input signal into a plurality of components, and wherein
Determining data characterizing the synthesis of the prototype signal comprises forming data characterizing a component decomposition of each prototype signal into a plurality of prototype signal components;
forming each output signal as an estimate of a corresponding one of the prototype signals comprises forming a plurality of output component estimates as a transform of corresponding components of one or more input signals; and
forming the output signal includes combining the formed output component estimates to form the output signal.
25. The method of claim 24, wherein forming the component decomposition comprises forming a frequency-based decomposition.
26. The method of claim 24, wherein forming the component decomposition comprises forming a substantially orthogonal decomposition.
27. The method of claim 24, wherein forming the component decomposition comprises applying at least one of a wavelet transform, a uniform bandwidth filter bank, a non-uniform bandwidth filter bank, a quadrature mirror filter bank, and a statistical decomposition.
28. The method of claim 24, wherein forming a plurality of output component estimates as a transform of corresponding components of one or more input signals comprises scaling the components of the input signals to form the components of the output signals.
29. The method of claim 1, wherein forming the one or more output signals comprises forming a plurality of output signals each as a combination of a plurality of corresponding input signals according to a set of weights that are common to each of the plurality of output signals.
30. A system for processing a plurality of input signals, comprising:
an input processor configured to decompose the input signal into input signal components representing different frequency components at each of a series of times;
a prototype generator configured to accept the input signal and provide a characterization of a prototype signal from the input signal, the characterization of the prototype signal comprising a plurality of prototype signal components representing different frequency components at each time in the series of times, wherein each prototype signal component in the plurality of prototype signal components is a non-linear combination of the input signal components; and
an estimator configured to accept the characterization of the prototype signal and to form an output signal as an estimate of the prototype signal, the estimate being based on the input signal used to determine the characterization of the prototype signal and varying in accordance with the input signal used to determine the characterization of the prototype signal, the output signal corresponding to a combination of the input signals;
wherein forming the output signal as an estimate of the prototype signal comprises calculating an estimate of statistics relating to the prototype signal and corresponding input signals, and determining weighting coefficients to be applied to each of the corresponding input signals.
31. The system of claim 30, wherein the prototype signal comprises a non-linear function of the input signal.
32. The system of claim 31, wherein the estimate of the prototype signal comprises a least squares error estimate of the prototype signal.
33. The system of claim 30, further comprising a component analysis module for forming a plurality of component decompositions for each of the input signals, and a reconstruction module for reconstructing the output signal from the component decompositions for the output signal.
34. The system of claim 33, wherein the prototype generator and the estimator are each configured to operate on a component-by-component basis.
35. The system according to claim 33, wherein the prototype generator is configured to perform, for each component, a time-local processing of the input signal to determine a characterization of a component of the prototype signal.
36. The system of claim 30, wherein the prototype generator is configured to accept a plurality of input audio channels, and wherein the estimator is configured to provide an output signal comprising additional upmix channels.
37. The system of claim 30, wherein the prototype generator is configured to accept a plurality of input audio channels from a microphone array, and wherein the prototype generator is configured to synthesize one or more prototype signals from differences between the input signals.
38. The system of claim 30, wherein the output signal formed is a real combination of more than one of the input signals.
39. The system of claim 38, wherein the output signal formed is a real combination of a single input signal.
40. The system of claim 30, wherein the output signal formed is a complex combination of one or more of the input signals.
41. A method for forming one or more output signals from a plurality of input signals, comprising:
decomposing the input signal into input signal components representing different frequency components at each of a series of times;
determining, from the input signal, a characterization of one or more prototype signals, the characterization of the one or more prototype signals comprising a plurality of prototype signal components representing different frequency components at each of the series of times, wherein each of the plurality of prototype signal components is a non-linear combination of the input signal components; and
processing a prototype signal of the one or more prototype signals to form the output signal as an estimate of the prototype signal, the estimate being based on and varying according to the input signal used to determine a characterization of the prototype signal, the output signal corresponding to a combination of the input signals used to determine a characterization of the prototype signal;
wherein forming the output signal as an estimate of the prototype signal comprises determining a minimum error estimate of the prototype signal.
42. The method of claim 41, wherein forming the output signal as an estimate of a prototype signal comprises forming an estimate as a combination of a plurality of the input signal components, including at least some of the input signal components at different times or different frequencies than estimating the prototype signal components, for each of the plurality of prototype signal components.
43. The method of claim 41, wherein forming the output signal as an estimate of a prototype signal comprises applying one or more constraints in determining the combination of the one or more of the input signals.
44. An audio acquisition system comprising:
an input for receiving a plurality of input signals from a corresponding plurality of microphones;
an input processor configured to decompose the input signal into input signal components representing different frequency components at each of a series of times;
a prototype generator configured to accept the input signal and provide a characterization of a prototype signal comprising a plurality of prototype signal components representing different frequency components at each of the series of times, wherein each of the plurality of prototype signal components is a non-linear combination of the input signal components; and
an estimator configured to accept the characterization of the prototype signal and perform processing to form an output signal as an estimate of the prototype signal, the estimate of the prototype signal corresponding to a combination of the input signals used to determine the characterization of the prototype signal, the estimate being based on the input signals used to determine the characterization of the prototype signal and varying in accordance with the input signals used to determine the characterization of the prototype signal, wherein forming the output signal is performed in accordance with a pattern of responses of the microphones to signals from a desired location;
wherein forming the output signal as an estimate of the prototype signal comprises calculating an estimate of statistics relating to the prototype signal and corresponding input signals, and determining weighting coefficients to be applied to each of the corresponding input signals.
45. A system for processing a plurality of input signals, comprising:
an input processor configured to decompose the input signal into input signal components representing different frequency components at each of a series of times;
a prototype generator configured to accept the input signal and provide a characterization of a prototype signal from the input signal, the characterization of the prototype signal comprising a plurality of prototype signal components representing different frequency components at each time in the series of times, wherein each prototype signal component in the plurality of prototype signal components is a non-linear combination of the input signal components; and
an estimator configured to accept the characterization of the prototype signal and to form an output signal as an estimate of the prototype signal, the estimate being based on the input signal used to determine the characterization of the prototype signal and varying in accordance with the input signal used to determine the characterization of the prototype signal, the output signal corresponding to a combination of the input signals;
wherein forming the output signal as an estimate of the prototype signal comprises determining a minimum error estimate of the prototype signal.
46. An audio acquisition system comprising:
an input for receiving a plurality of input signals from a corresponding plurality of microphones;
an input processor configured to decompose the input signal into input signal components representing different frequency components at each of a series of times;
a prototype generator configured to accept the input signal and provide a characterization of a prototype signal comprising a plurality of prototype signal components representing different frequency components at each of the series of times, wherein each of the plurality of prototype signal components is a non-linear combination of the input signal components; and
an estimator configured to accept the characterization of the prototype signal and perform processing to form an output signal as an estimate of the prototype signal, the estimate of the prototype signal corresponding to a combination of the input signals used to determine the characterization of the prototype signal, the estimate being based on the input signals used to determine the characterization of the prototype signal and varying in accordance with the input signals used to determine the characterization of the prototype signal, wherein forming the output signal is performed in accordance with a pattern of responses of the microphones to signals from a desired location;
wherein forming the output signal as an estimate of the prototype signal comprises determining a minimum error estimate of the prototype signal.