HK1172475B - An apparatus for determining a spatial output multi-channel audio signal - Google Patents
An apparatus for determining a spatial output multi-channel audio signal
- Publication number
- HK1172475B HK12113191.6A
- Authority
- HK
- Hong Kong
- Prior art keywords
- signal
- mono
- decomposed
- directional audio
- rendering
- Prior art date
Abstract
An apparatus (100) for determining a spatial output multi-channel audio signal based on an input audio signal and an input parameter. The apparatus (100) comprises a decomposer (110) for decomposing the input audio signal based on the input parameter to obtain a first decomposed signal and a second decomposed signal different from each other. Furthermore, the apparatus (100) comprises a renderer (120) for rendering the first decomposed signal to obtain a first rendered signal having a first semantic property and for rendering the second decomposed signal to obtain a second rendered signal having a second semantic property being different from the first semantic property. The apparatus (100) comprises a processor (130) for processing the first rendered signal and the second rendered signal to obtain the spatial output multi-channel audio signal.
Description
The present application is a divisional application of the application by the applicant Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V., filed on February 11, 2011, with application number 200980131419.8 and entitled "An apparatus for determining a spatial output multi-channel audio signal".
Technical Field
The present invention is in the field of audio processing, and in particular, to the processing of spatial audio properties.
Background
Audio processing and/or coding has advanced in many ways. For spatial audio applications, more and more demand is being generated. In many applications, audio signal processing is utilized to decorrelate or render signals. Such applications include, for example, mono-to-stereo upmixing, mono/stereo-to-multi-channel upmixing, artificial reverberation, stereo widening, and user-interactive mixing/rendering.
For certain classes of signals, such as noise-like signals, e.g. applause-like signals, conventional methods and systems suffer either from unsatisfactory perceptual quality or, if an object-oriented approach is used, from high computational complexity due to the large number of auditory events to be modeled or processed. Other examples of problematic audio material are generally ambience material, such as the noise emitted by a flock of birds, a seashore, galloping horses, marching soldiers, etc.
The conventional approach is to use, for example, parametric stereo or MPEG Surround coding (MPEG = Moving Picture Experts Group). Fig. 6 shows a typical application of a decorrelator in a mono-to-stereo upmixer. Fig. 6 shows a mono input signal provided to a decorrelator 610, which provides a decorrelated version of the input signal at its output. The original input signal is provided together with the decorrelated signal to an upmix matrix 620. A stereo output signal is rendered according to upmix control parameters 630. The signal decorrelator 610 generates the decorrelated signal D, which is fed to the matrixing stage 620 together with the dry mono signal M. In the mixing matrix 620, the stereo channels L (L = left stereo channel) and R (R = right stereo channel) are formed according to a mixing matrix H. The coefficients in the matrix H may be fixed, signal-dependent, or controlled by a user.
Alternatively, the matrix may be controlled by side information, which is transmitted along with the downmix, containing a parametric description of how the downmix signals are upmixed to form the desired multi-channel output. This spatial side information is typically generated by a signal encoder prior to the upmix process.
This is typically done in parametric spatial audio coding, e.g. in Parametric Stereo, see J. Breebaart, S. van de Par, A. Kohlrausch, E. Schuijers, "High-Quality Parametric Spatial Audio Coding at Low Bitrates," in AES 116th Convention, Berlin, Preprint 6072, May 2004, and in MPEG Surround, see J. Herre, K. Kjörling, J. Breebaart, et al., "MPEG Surround – the ISO/MPEG Standard for Efficient and Compatible Multi-Channel Audio Coding," in Proceedings of the 122nd AES Convention, Vienna, Austria, May 2007. A typical structure of a parametric stereo decoder is shown in fig. 7. In this example, the decorrelation process is performed in a transform domain, represented by the analysis filter bank 710, which transforms the input mono signal into the transform domain, e.g. the frequency domain in terms of a number of frequency bands.
In the frequency domain, the decorrelator 720 generates a corresponding decorrelated signal, which is to be upmixed in the upmix matrix 730. The upmix matrix 730 takes into account upmix parameters, which are provided by a parameter modification block 740, the parameter modification block 740 being provided with spatial input parameters and being connected to a parameter control stage 750. In the example shown in fig. 7, the spatial parameters may be modified by the user or by additional tools, e.g. post-processing for binaural rendering/presentation. In this case, the upmix parameters may be combined with the input parameters from the binaural filter to form the input parameters for the upmix matrix 730. The determination of the parameters may be performed by a parameter modification block 740. The output of the upmix matrix 730 is then provided to a synthesis filter bank 760, the synthesis filter bank 760 determining the stereo output signal.
As mentioned above, the output L/R of the mixing matrix H can be calculated from the mono input signal M and the decorrelated signal D, for example according to:

$$\begin{bmatrix} L \\ R \end{bmatrix} = H \begin{bmatrix} M \\ D \end{bmatrix}$$
In the mixing matrix, the amount of decorrelated sound provided to the output can be controlled on the basis of transmitted parameters, such as ICC (ICC = inter-channel correlation), and/or of mixed or user-defined settings.
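As a minimal illustration of this prior-art upmix, the following Python/NumPy sketch (not part of the patent; the matrix coefficients are chosen ad hoc for illustration) forms L and R from a dry mono signal and its decorrelated version via a 2x2 mixing matrix H whose decorrelated-signal weights control how much decorrelated sound reaches each output.

```python
import numpy as np

def mono_to_stereo_upmix(m, d, direct_gain=0.8, decorr_gain=0.6):
    """Prior-art style mono-to-stereo upmix: mix the dry mono signal m
    with its decorrelated version d through a fixed 2x2 matrix H.
    The gains are illustrative placeholders, not values from the patent."""
    H = np.array([[direct_gain,  decorr_gain],
                  [direct_gain, -decorr_gain]])   # antisymmetric decorrelated part widens the image
    L, R = H @ np.vstack([m, d])                  # [L; R] = H [M; D]
    return L, R

# usage: m is a mono signal, d a decorrelated copy (e.g. an all-pass filtered version)
m = np.random.randn(48000)
d = np.roll(m, 441) * 0.9                         # crude stand-in for a decorrelator
left, right = mono_to_stereo_upmix(m, d)
```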
Another conventional method is based on temporal permutation. For example, a dedicated proposal for the decorrelation of applause-like signals can be found in Gerard Hotho, Steven van de Par, Jeroen Breebaart, "Multichannel Coding of Applause Signals," in EURASIP Journal on Advances in Signal Processing, Vol. 1, Art. 10, 2008. Here, the mono audio signal is divided into overlapping time segments, which are pseudo-randomly permuted in time within a "super" block, thereby forming the decorrelated output channels. The permutations are mutually independent for the n output channels.
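The following sketch is an assumption of how such a temporal permutation could look (it is not code from the cited paper): a mono signal is split into overlapping segments, the segments are grouped into super-blocks, and the segment order inside each super-block is pseudo-randomly permuted independently per output channel.

```python
import numpy as np

def permutation_decorrelate(x, n_channels=2, seg_len=1024, segs_per_block=8, seed=0):
    """Toy temporal-permutation decorrelator: per output channel, the segments
    inside each 'super'-block are re-ordered pseudo-randomly. Overlap-add with
    a Hann window hides the segment boundaries."""
    rng = np.random.default_rng(seed)
    hop = seg_len // 2
    win = np.hanning(seg_len)
    n_segs = (len(x) - seg_len) // hop
    outputs = np.zeros((n_channels, len(x)))
    for ch in range(n_channels):
        for block_start in range(0, n_segs, segs_per_block):
            idx = np.arange(block_start, min(block_start + segs_per_block, n_segs))
            perm = rng.permutation(idx)            # independent permutation per channel
            for out_i, src_i in zip(idx, perm):
                seg = x[src_i * hop: src_i * hop + seg_len] * win
                outputs[ch, out_i * hop: out_i * hop + seg_len] += seg
    return outputs
```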
Another method is an alternating channel swap of the original and delayed copies in order to obtain a decorrelated signal, see german patent application 102007018032.4-55.
In some conventional object-oriented system concepts, for example in Wagner, Andreas; Walther, Andreas; Melchior, Frank; Strauß, Michael; "Generation of Highly Immersive Atmospheres for Wave Field Synthesis Reproduction," at 116th International AES Convention, Berlin, 2004, it is described how an immersive scene can be created from many objects, e.g. individual claps, by applying wave field synthesis.
Yet another approach is the so-called "Directional Audio Coding" (DirAC), which is a method for spatial sound representation applicable to different sound reproduction systems, see Pulkki, Ville, "Spatial Sound Reproduction with Directional Audio Coding," in J. Audio Eng. Soc., Vol. 55, No. 6, 2007. In the analysis part, the diffuseness and the direction of arrival of the sound are estimated in a single location, depending on time and frequency. In the synthesis part, the loudspeaker signals are first divided into a non-diffuse part and a diffuse part, which are then reproduced using different strategies.
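To make the DirAC synthesis idea concrete, here is a strongly simplified, hedged sketch (the cosine panning law, the speaker layout and the omission of per-speaker decorrelation of the diffuse part are simplifications chosen for brevity, not the actual DirAC implementation): the non-diffuse portion is weighted by sqrt(1 − ψ) and panned towards the estimated direction, the diffuse portion by sqrt(ψ/N) and distributed to all N loudspeakers.

```python
import numpy as np

def dirac_synthesis_band(w, theta, psi, speaker_angles_deg):
    """Simplified DirAC synthesis for one frequency band: the mono signal w is split
    into a non-diffuse part, panned towards direction theta, and a diffuse part,
    distributed to all loudspeakers (decorrelation of the diffuse part omitted here)."""
    angles = np.deg2rad(np.asarray(speaker_angles_deg, dtype=float))
    # simple cosine panning gains towards theta, clipped and normalized (stand-in for VBAP)
    g = np.maximum(np.cos(angles - np.deg2rad(theta)), 0.0)
    g = g / (np.linalg.norm(g) + 1e-12)
    n = len(angles)
    direct = np.sqrt(1.0 - psi) * np.outer(g, w)             # point-like part
    diffuse = np.sqrt(psi / n) * np.tile(w, (n, 1))          # spread part, ideally decorrelated per speaker
    return direct + diffuse                                   # shape: (n_speakers, n_samples)

# usage: a band-limited signal reproduced over a 4-speaker setup
w = np.random.randn(4800)
out = dirac_synthesis_band(w, theta=30.0, psi=0.2, speaker_angles_deg=[45, 135, -135, -45])
```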
The conventional methods have a number of disadvantages. For example, guided or unguided upmixing of audio signals with content such as applause may require strong decorrelation. On the one hand, strong decorrelation is needed to restore the feeling of presence, as in a concert hall. On the other hand, suitable decorrelation filters, such as all-pass filters, degrade the reproduction quality of transient events by introducing temporal smearing effects such as pre- and post-echoes and filter ringing. Moreover, the spatial panning of a single clap event has to be done on a rather fine time grid, whereas the decorrelation of the ambience should be quasi-stationary over time.
Existing systems, such as those described in J. Breebaart, S. van de Par, A. Kohlrausch, E. Schuijers, "High-Quality Parametric Spatial Audio Coding at Low Bitrates," in AES 116th Convention, Berlin, Preprint 6072, May 2004, and J. Herre, K. Kjörling, J. Breebaart, et al., "MPEG Surround – the ISO/MPEG Standard for Efficient and Compatible Multi-Channel Audio Coding," in Proceedings of the 122nd AES Convention, Vienna, Austria, May 2007, therefore involve a compromise between temporal resolution and ambience stability, and between transient quality degradation and ambience decorrelation.
For example, a system using the temporal permutation approach exhibits a perceptible degradation of the output sound due to a certain repetitive quality in the output audio signal. This is because one and the same segment of the input signal appears unchanged in every output channel, albeit at different points in time. Furthermore, in order to avoid an increased applause density, some of the original channels have to be dropped in the upmix, and therefore some important auditory events may be missing in the resulting upmix.
In object-oriented systems, such sound events are typically spatialized into a large group of point-like sources, which leads to computationally complex implementations.
Disclosure of Invention
It is an object of the invention to provide an improved concept for spatial audio processing.
The above object is achieved by a device according to claim 1 and a method according to claim 3.
One finding of the present invention is that an audio signal can be decomposed into several components, to each of which a spatial rendering, for example by decorrelation or by amplitude panning, can be adapted. In other words, the present invention is based on the finding that, for example in a scenario with multiple audio sources, foreground and background sources can be distinguished and rendered or decorrelated differently. In general, different spatial depths and/or extents of audio objects can be distinguished.
One key point of the present invention is the decomposition of a signal (e.g. the sound of a clapping audience, a flock of birds, a seashore, galloping horses, marching soldiers, etc.) into a foreground part and a background part, whereby the foreground part contains single auditory events originating, for example, from nearby sources, and the background part contains the ambience of perceptually fused far-off events. Prior to the final mixing, these two signal parts are processed separately, e.g. in order to synthesize correlation, render a scene, etc.
Embodiments are not limited to only distinguishing foreground and background portions of a signal, they may distinguish between multiple different audio portions, which may each be rendered or decorrelated differently.
In general, an audio signal may be decomposed into n different semantic parts by embodiments, which are processed separately. The decomposition/separate processing of different semantic components may be achieved in the time and/or frequency domain by embodiments.
Embodiments can provide excellent perceptual quality of the rendered signals at moderate computational cost. Thus, embodiments provide novel decorrelation/rendering methods that can offer high perceptual quality at moderate cost, especially for critical applause-like audio material or other similar ambience material, e.g. the noise emitted by a flock of birds, a seashore, galloping horses, marching soldiers, etc.
Drawings
Embodiments of the invention will be described in detail below with reference to the attached drawing figures, wherein:
fig. 1a shows an embodiment of an apparatus for determining a spatial output multi-channel audio signal;
FIG. 1b shows a block diagram of another embodiment;
FIG. 2 shows an embodiment illustrating the diversity of the decomposed signals;
FIG. 3 illustrates an embodiment with foreground and background semantic decomposition;
FIG. 4 shows an example of a transient separation method for obtaining a background signal component;
FIG. 5 illustrates the synthesis of a sound source with a spatial extent;
fig. 6 shows a prior-art application of a time-domain decorrelator in a mono-to-stereo upmixer;
fig. 7 shows a prior-art application of a frequency-domain decorrelator in a mono-to-stereo upmixer scheme.
Detailed Description
Fig. 1a shows an embodiment of an apparatus 100 for determining a spatial output multi-channel audio signal based on an input audio signal. In some embodiments, the apparatus may be adapted to further base the spatial output multi-channel audio signal on an input parameter. The input parameter may be generated locally or provided with the input audio signal, e.g. as side information.
In the embodiment depicted in fig. 1a, the apparatus 100 comprises a decomposer 110, the decomposer 110 being configured to decompose the input audio signal to obtain a first decomposed signal having a first semantic attribute and a second decomposed signal having a second semantic attribute, the second semantic attribute being different from the first semantic attribute.
The apparatus 100 further comprises a renderer 120, the renderer 120 being configured to render the first decomposed signal with the first rendering characteristic to obtain a first rendered signal having the first semantic attribute, and to render the second decomposed signal with the second rendering characteristic to obtain a second rendered signal having the second semantic attribute.
A semantic attribute may correspond to a spatial attribute, such as near or far, or focused or wide, and/or to a dynamic attribute, e.g. whether the signal is tonal, stationary or transient, and/or to a dominance attribute, e.g. whether the signal is foreground or background, together with a respective measure thereof.
Furthermore, in the present embodiment, the apparatus 100 comprises a processor 130, the processor 130 being configured to process the first rendered signal and the second rendered signal to obtain a spatial output multi-channel audio signal.
In other words, in some embodiments, the decomposer 110 is adapted to decompose the input audio signal based on the input parameters. The decomposition of the input audio signal is adapted to semantic, e.g. spatial, properties of different parts of the input audio signal. Furthermore, the rendering carried out by the renderer 120 according to the first and second rendering characteristics can also be adapted to the spatial properties, which allows, for example, applying different renderers or decorrelators in a scenario where the first decomposed signal corresponds to a background audio signal and the second decomposed signal corresponds to a foreground audio signal, or vice versa. The term "foreground" is hereinafter understood to refer to audio objects that are dominant in the audio environment, such that a potential listener would focus on the foreground audio objects. Foreground audio objects or sources may be distinguished or differentiated from background audio objects or sources. Background audio objects or sources may not be noticeable to a potential listener, as they are less dominant than foreground audio objects or sources. In some embodiments, foreground audio objects may be point-like audio sources, whereas background audio objects or sources may correspond to spatially wider objects or sources, without being limited thereto.
In other words, in some embodiments, the first rendering characteristic may be based on or matched to the first semantic attribute, and the second rendering characteristic may be based on or matched to the second semantic attribute. In one embodiment, the first semantic attribute and the first rendering characteristic correspond to a foreground audio source or object, and the renderer 120 may be adapted to apply amplitude panning to the first decomposed signal. The renderer 120 may then be further adapted to provide two amplitude-panned versions of the first decomposed signal as the first rendered signal. In this embodiment, the second semantic attribute and the second rendering characteristic correspond to a background audio source or object, or to a plurality of background audio sources or objects, respectively, and the renderer 120 may be adapted to apply a decorrelation to the second decomposed signal and to provide the second decomposed signal and its decorrelated version as the second rendered signal.
In some embodiments, the renderer 120 may be further adapted to render the first decomposed signal such that the first rendering characteristic has no delay introducing characteristic. In other words, there may be no decorrelation of the first decomposed signal. In another embodiment, the first rendering characteristic may have a first delay introducing characteristic with a first delay amount, and the second rendering characteristic may have a second delay amount, the second delay amount being greater than the first delay amount. In other words, in this embodiment, the first and second decomposed signals may be decorrelated, but the level of decorrelation may be proportional to the amount of delay introduced to the respective decorrelated versions of the decomposed signals. Thus, the decorrelation for the second decomposed signal may be stronger than the decorrelation for the first decomposed signal.
In some embodiments, the first and second decomposed signals may overlap and/or may be time-synchronized. In other words, the signal processing may be carried out block-wise, where one block of input audio signal samples may be subdivided by the decomposer 110 into a number of blocks of decomposed signals. In some embodiments, the decomposed signals of a number of blocks may at least partly overlap in the time domain, i.e. they may represent overlapping time-domain samples. In other words, the decomposed signals may correspond to overlapping parts of the input audio signal, i.e. they may represent at least partly synchronous audio signals. In some embodiments, the first and second decomposed signals may represent filtered or transformed versions of the original input signal. For example, they may represent signal parts extracted from a composite spatial signal and corresponding, for example, to a close-by sound source or to a more distant sound source. In other embodiments, they may correspond to transient and stationary signal components, etc.
In some embodiments, the renderer 120 may be subdivided into a first renderer and a second renderer, wherein the first renderer may be adapted to render the first decomposed signal and the second renderer may be adapted to render the second decomposed signal. In some embodiments, the renderer 120 may be implemented as software, e.g., a program stored in a memory to run on a processor or digital signal processor, which is adapted to render the decomposed signals sequentially.
The renderer 120 may be adapted for decorrelating the first decomposed signal to obtain a first decorrelated signal and/or for decorrelating the second decomposed signal to obtain a second decorrelated signal. In other words, the renderer 120 may be adapted to decorrelate both decomposed signals, however using different decorrelation or rendering characteristics. In some embodiments, the renderer 120 may be adapted to apply amplitude panning to either the first or the second decomposed signal instead of, or in addition to, the decorrelation.
The renderer 120 may be adapted to render a first rendered signal and a second rendered signal each having as many components as there are channels in the spatial output multi-channel audio signal, and the processor 130 may be adapted to combine the components of the first rendered signal and the second rendered signal to obtain the spatial output multi-channel audio signal. In other embodiments, the renderer 120 may be adapted to render the first rendered signal and the second rendered signal each having fewer components than the spatial output multi-channel audio signal, and wherein the processor 130 may be adapted to upmix the components of the first rendered signal and the second rendered signal to obtain the spatial output multi-channel audio signal.
Fig. 1b illustrates another embodiment of the apparatus 100, comprising similar components as introduced in connection with fig. 1a. However, fig. 1b shows an embodiment with more details. Fig. 1b shows the decomposer 110 receiving the input audio signal and, optionally, the input parameter. As can be seen from fig. 1b, the decomposer is adapted to provide the first decomposed signal and the second decomposed signal to the renderer 120, which is indicated by the dashed lines. In the embodiment illustrated in fig. 1b, it is assumed that the first decomposed signal corresponds to a point-like audio source as the first semantic attribute, and the renderer 120 is adapted to apply amplitude panning as the first rendering characteristic to the first decomposed signal. In some embodiments, the first and second decomposed signals are interchangeable, i.e. in other embodiments amplitude panning may be applied to the second decomposed signal.
In the embodiment depicted in fig. 1b, the renderer 120 shows two variable-gain amplifiers 121 and 122 in the signal path of the first decomposed signal, which are adapted to amplify two copies of the first decomposed signal differently. In some embodiments, the different amplification factors used may be determined from the input parameter; in other embodiments, they may be determined from the input audio signal, be preset, or be generated locally, possibly also referring to a user input. The outputs of the two variable-gain amplifiers 121 and 122 are provided to the processor 130, a detailed description of which is given below.
As can be seen from fig. 1b, the decomposer 110 provides the second decomposed signal to the renderer 120, and the renderer 120 performs different renderings in a processing path of the second decomposed signal. In other embodiments, the first decomposed signal may also be processed in the presently described path, or the first decomposed signal may be processed in the presently described path in place of the second decomposed signal. In some embodiments, the first and second decomposed signals are interchangeable.
In the embodiment depicted in fig. 1b, the processing path of the second decomposed signal contains a decorrelator 123, which is followed by a rotator or parametric stereo or upmix module 124 as the second rendering characteristic. The decorrelator 123 may be adapted to decorrelate the second decomposed signal X[k] and to provide a decorrelated version Q[k] of the second decomposed signal to the parametric stereo or upmix module 124. In fig. 1b, the mono signal X[k] is fed into the decorrelator unit "D" 123 as well as into the upmix module 124. The decorrelator unit 123 may generate the decorrelated version Q[k] of the input signal, having the same frequency characteristics and the same long-term energy. The upmix module 124 may calculate an upmix matrix based on the spatial parameters and synthesize the output channels Y_1[k] and Y_2[k]. The upmix module 124 may operate, for example, according to

$$\begin{bmatrix} Y_1[k] \\ Y_2[k] \end{bmatrix} = \begin{bmatrix} c_l & 0 \\ 0 & c_r \end{bmatrix} \begin{bmatrix} \cos(\alpha+\beta) & \sin(\alpha+\beta) \\ \cos(-\alpha+\beta) & \sin(-\alpha+\beta) \end{bmatrix} \begin{bmatrix} X[k] \\ Q[k] \end{bmatrix},$$
where the parameters c_l, c_r, α and β are constants, or time- and frequency-variant values estimated adaptively from the input signal X[k], or transmitted as side information together with the input signal X[k] in the form of, for example, ILD (ILD = inter-channel level difference) parameters and ICC (ICC = inter-channel correlation) parameters. The signal X[k] is the received mono signal, the signal Q[k] is the decorrelated signal, a decorrelated version of X[k]. The output signals are denoted by Y_1[k] and Y_2[k].
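A compact numerical sketch of such a parametric upmix is shown below (a non-authoritative illustration: the frame/band handling and the parameter values are assumptions; only the 2x2 gain/rotation structure follows the formula above).

```python
import numpy as np

def parametric_stereo_upmix(X, Q, c_l, c_r, alpha, beta):
    """Upmix one frequency band: X is the mono signal, Q its decorrelated
    version (same shape), c_l/c_r channel gains, alpha/beta mixing angles.
    Returns the two output channels Y1, Y2."""
    rot = np.array([[np.cos(alpha + beta),  np.sin(alpha + beta)],
                    [np.cos(-alpha + beta), np.sin(-alpha + beta)]])
    gains = np.diag([c_l, c_r])
    Y = gains @ rot @ np.vstack([X, Q])
    return Y[0], Y[1]

# usage with made-up parameter values for a single band
X = np.random.randn(1024)
Q = np.random.randn(1024)          # stand-in for a decorrelated version of X
Y1, Y2 = parametric_stereo_upmix(X, Q, c_l=1.0, c_r=1.0, alpha=np.pi / 8, beta=0.0)
```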
The decorrelator 123 may be implemented as an IIR filter (IIR = infinite impulse response), an arbitrary FIR filter (FIR = finite impulse response), or a special FIR filter using a single tap that simply delays the signal.
The parameters c_l, c_r, α and β can be determined in different ways. In some embodiments, they are simply determined by input parameters, which can be provided along with the input audio signal, e.g. with the downmix data as side information. In other embodiments, they may be generated locally or derived from properties of the input audio signal.
In the embodiment shown in fig. 1b, the renderer 120 is adapted to provide the second rendered signal in terms of the two output signals Y_1[k] and Y_2[k] of the upmix module 124, which are provided to the processor 130.
Following the processing path of the first decomposed signal, the two amplitude-panned versions of the first decomposed signal, available at the outputs of the two variable-gain amplifiers 121 and 122, are also provided to the processor 130. In other embodiments, the variable-gain amplifiers 121 and 122 may be located in the processor 130, in which case only the first decomposed signal and the panning factors may be provided by the renderer 120.
As can be seen from fig. 1b, the processor 130 can be adapted to process or combine the first rendered signal and the second rendered signal, in this embodiment simply by combining the output signals so as to provide a stereo signal having a left channel L and a right channel R corresponding to the spatial output multi-channel audio signal of fig. 1a.
In the embodiment of fig. 1b, a left channel and a right channel of the stereo signal are determined in each of the two signal paths. In the path of the first decomposed signal, amplitude panning is carried out by the two variable-gain amplifiers 121 and 122, so the two components result in two in-phase audio signals that are scaled differently. This corresponds to the impression of a point-like audio source as the semantic attribute or rendering characteristic.
In the signal processing path of the second decomposed signal, the output signals Y_1[k] and Y_2[k], corresponding to the left and right channels as determined by the upmix module 124, are provided to the processor 130. The parameters c_l, c_r, α and β determine the spatial width of the corresponding audio source. In other words, the parameters c_l, c_r, α and β can be chosen in such a way or within such ranges that, for the L and R channels, any correlation between maximum correlation and minimum correlation can be obtained in the second signal processing path as the second rendering characteristic. Moreover, this may be carried out independently for different frequency bands. In other words, the parameters c_l, c_r, α and β may be chosen in such a way or within such ranges that the L and R channels are in phase, modeling a point-like audio source as the semantic attribute.
The parameters c_l, c_r, α and β may also be chosen in such a way or within such ranges that the L and R channels in the second signal processing path are decorrelated, modeling a spatially rather distributed audio source as the semantic attribute, e.g. a background or a spatially wider sound source.
Fig. 2 shows another embodiment in a more general form. Fig. 2 shows a semantic decomposition block 210, which corresponds to the decomposer 110. The output of the semantic decomposition 210 is the input of a rendering stage 220, the rendering stage 220 corresponding to the renderer 120. The rendering stage 220 consists of a number of individual renderers 221 to 22n, i.e. the semantic decomposition stage 210 is adapted to decompose a mono/stereo input signal into n decomposed signals having n semantic attributes. The decomposition can be carried out based on decomposition control parameters, which may be provided along with the mono/stereo input signal, be preset, be generated locally, or be input by a user, etc.
In other words, the decomposer 110 may be adapted to decompose the input audio signal semantically based on an optional input parameter and/or to determine the input parameter from the input audio signal.
The output of the decorrelation or rendering stage 220 is then provided to an upmix block 230, the upmix block 230 determining a multi-channel output from the decorrelation or rendering signal and optionally from upmix control parameters.
In general, embodiments may separate the sound material into n different semantic components and decorrelate each component separately with a matched decorrelator, labeled D_1 to D_n in fig. 2. In other words, in some embodiments the rendering characteristics can be matched to the semantic attributes of the decomposed signals. Each of the decorrelators or renderers can be adapted to the semantic attributes of the correspondingly decomposed signal component. Subsequently, the processed components can be mixed to obtain the output multi-channel signal. The different components could, for example, correspond to foreground and background modeling objects.
In other words, the renderer 120 may be adapted to combine the first decomposed signal and the first decorrelated signal to obtain a stereo or multi-channel upmix signal as the first rendered signal, and/or to combine the second decomposed signal and the second decorrelated signal to obtain a stereo upmix signal as the second rendered signal.
Furthermore, the renderer 120 may be adapted to render the first decomposed signal in dependence of the background audio characteristics and/or the second decomposed signal in dependence of the foreground audio characteristics, or vice versa.
Since, for example, an applause-like signal can be seen as composed of single, distinct, nearby claps plus a noise-like ambience originating from very dense far-off claps, a suitable decomposition of such a signal may be obtained by distinguishing isolated foreground clap events as one component and the noise-like background as the other component. In other words, in one embodiment n = 2. In such an embodiment, the renderer 120 may, for example, be adapted to render the first decomposed signal by amplitude panning of the first decomposed signal. In other words, in some embodiments the correlation or rendering of the foreground clap component may be achieved in D_1 by amplitude panning of each single event to its estimated original location.
In some embodiments, the renderer 120 may be adapted to render the first and/or second decomposed signal, e.g. by all-pass filtering the first or second decomposed signal, to obtain the first or second decorrelated signal.
In other words, in some embodiments, the decorrelation or rendering of the background can be achieved by the use of m mutually independent all-pass filters D_2^(1...m). In some embodiments, only the quasi-stationary background may be processed by the all-pass filters, so the temporal smearing effects of state-of-the-art decorrelation techniques can be avoided. Since amplitude panning may be applied to the events of the foreground object, the original foreground applause density can be approximately restored, unlike state-of-the-art systems such as those described in J. Breebaart, S. van de Par, A. Kohlrausch, E. Schuijers, "High-Quality Parametric Spatial Audio Coding at Low Bitrates," in AES 116th Convention, Berlin, Preprint 6072, May 2004, and J. Herre, K. Kjörling, J. Breebaart, et al., "MPEG Surround – the ISO/MPEG Standard for Efficient and Compatible Multi-Channel Audio Coding," in Proceedings of the 122nd AES Convention, Vienna, Austria, May 2007.
In other words, in some embodiments, the decomposer 110 may be adapted to decompose the input audio signal semantically based on an input parameter, where the input parameter may be provided along with the input audio signal, e.g. as side information. In other embodiments, the decomposer 110 may be adapted to determine the input parameter from the input audio signal. In yet other embodiments, the decomposer 110 may be adapted to determine the input parameter as a control parameter independent of the input audio signal, which may be generated locally, be preset, or also be input by a user.
In some embodiments, the renderer 120 may be adapted to obtain a spatial distribution of the first or second rendered signal by applying broadband amplitude panning. In other words, according to the description of fig. 1b above, instead of creating a point-like source, the panning location of the source can be varied over time in order to create an audio source with a certain spatial distribution. In some embodiments, the renderer 120 may be adapted to apply locally generated low-pass noise for the amplitude panning, i.e. the scaling factors for the amplitude panning, e.g. of the variable-gain amplifiers 121 and 122 in fig. 1b, correspond to locally generated noise values, i.e. time-variant values with a certain bandwidth.
Embodiments may be adapted to operate in a guided or an unguided mode. For example, in a guided scenario, referring to the dashed lines in fig. 2, decorrelation can be achieved by applying standard decorrelation filters, controlled on a coarse time grid, to the background or ambience part only, and obtaining the correlation by redistributing each single event in the foreground part via time-variant spatial positioning using broadband amplitude panning on a much finer time grid. In other words, in some embodiments the renderer 120 may be adapted to operate the decorrelators for the different decomposed signals on different time grids, e.g. based on different time scales, which may be given by different sample rates or different delays of the respective decorrelators. In one embodiment carrying out foreground and background separation, the foreground part may use amplitude panning, where the amplitude is changed on a much finer time grid than the operation of the decorrelator for the background part.
Furthermore, it should be emphasized that for the decorrelation of, e.g., applause-like signals, i.e. signals with a quasi-stationary random quality, the exact spatial position of each single foreground clap may not be as crucial as the recovery of the overall distribution of the large number of clap events. Embodiments may take advantage of this fact and may operate in an unguided mode. In such a mode, the aforementioned amplitude panning factors could be controlled by low-pass noise. Fig. 3 illustrates a mono-to-stereo system implementing such a scenario. Fig. 3 shows a semantic decomposition block 310, corresponding to the decomposer 110, for decomposing the mono input signal into a foreground decomposed signal part and a background decomposed signal part.
As can be seen from fig. 3, the background decomposed part of the signal is rendered by an all-pass filter D_1 320. The decorrelated signal is then provided, together with the un-rendered background decomposed part, to the upmix 330, corresponding to the processor 130. The foreground decomposed signal part is provided to an amplitude panning stage D_2 340, corresponding to the renderer 120. Locally generated low-pass noise 350 is also provided to the amplitude panning stage 340, which can then provide the foreground decomposed signal in an amplitude-panned configuration to the upmix 330. The amplitude panning stage D_2 340 may determine its output by providing a scaling factor k for an amplitude selection between the two channels of a stereo set of audio channels. The scaling factor k may be based on the low-pass noise.
As can be seen from fig. 3, there is only one arrow between the amplitude panning stage 340 and the upmix 330. This one arrow may as well represent the amplitude-panned signals, i.e. in the case of stereo upmixing, the left and the right channel. As can be seen from fig. 3, the upmix 330 corresponding to the processor 130 is adapted to process or combine the background and foreground decomposed signals in order to obtain the stereo output.
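The following Python sketch illustrates one possible reading of the fig. 3 processing chain (a minimal, hedged interpretation; the all-pass coefficient, the noise bandwidth and the mixing gains are assumed values, not taken from the patent): the background part is decorrelated by an all-pass filter, the foreground part is amplitude-panned with gains driven by low-pass noise, and both are summed into a stereo output.

```python
import numpy as np
from scipy.signal import lfilter

def allpass_decorrelate(x, g=0.5, delay=223):
    """First-order all-pass decorrelator with a long delay line (assumed design)."""
    b = np.zeros(delay + 1); b[0], b[delay] = -g, 1.0
    a = np.zeros(delay + 1); a[0], a[delay] = 1.0, -g
    return lfilter(b, a, x)

def lowpass_noise(n, smooth=2048, seed=0):
    """Locally generated low-pass noise in [0, 1] used as a slowly varying panning factor k."""
    rng = np.random.default_rng(seed)
    k = np.convolve(rng.standard_normal(n), np.ones(smooth) / smooth, mode="same")
    return (k - k.min()) / (k.max() - k.min() + 1e-12)

def mono_to_stereo(foreground, background):
    """Sketch of fig. 3: noise-driven amplitude panning for the foreground,
    all-pass decorrelation for the background, then a simple stereo upmix."""
    k = lowpass_noise(len(foreground))
    fg_l, fg_r = foreground * k, foreground * (1.0 - k)          # amplitude panning D2
    bg_d = allpass_decorrelate(background)                        # all-pass D1
    left  = fg_l + 0.7 * background + 0.7 * bg_d
    right = fg_r + 0.7 * background - 0.7 * bg_d
    return left, right
```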
Other embodiments may use local processing in order to derive the background and foreground decomposed signals, or the input parameters for the decomposition. The decomposer 110 may be adapted to determine the first decomposed signal and/or the second decomposed signal based on a transient separation method. In other words, the decomposer 110 may be adapted to determine the first or the second decomposed signal based on a separation method, e.g. a transient separation method, with the other decomposed signal being determined from the difference between the first-determined decomposed signal and the input audio signal.
The decomposer 110 and/or the renderer 120 and/or the processor 130 may comprise a DirAC mono-synthesis stage and/or a DirAC merging stage. In some embodiments, the decomposer 110 may be adapted to decompose the input audio signal, the renderer 120 may be adapted to render the first decomposed signal and/or the second decomposed signal, and/or the processor 130 may be adapted to process the first rendered signal and/or the second rendered signal according to different frequency bands.
Embodiments may use the following approximation for applause-like signals. While the foreground components may be obtained by transient detection or separation methods (cf. Pulkki, Ville, "Spatial Sound Reproduction with Directional Audio Coding," in J. Audio Eng. Soc., Vol. 55, No. 6, 2007), the background component may be given by the residual signal. Fig. 4 depicts an example of a suitable method to obtain the background component x'(n) of, e.g., an applause-like signal x(n) to implement the semantic decomposition 310 of fig. 3, i.e. an embodiment of the decomposer 110. Fig. 4 shows a time-discrete input signal x(n), which is input to a DFT 410 (DFT = discrete Fourier transform). The output of the DFT block 410 is provided to a block 420 for smoothing the spectrum and to a spectral whitening block 430, the spectral whitening block 430 whitening the spectrum based on the output of the DFT 410 and the output of the spectrum smoothing stage 420.
The output of the spectral whitening stage 430 is then provided to a spectral peak-picking stage 440, which separates the spectrum and provides two outputs, namely a noise-and-transient residual signal and a tonal signal. The noise-and-transient residual signal is provided to an LPC filter 450 (LPC = linear prediction coding), whose residual noise signal is provided to the mixing stage 460 together with the tonal signal output by the spectral peak-picking stage 440. The output of the mixing stage 460 is then provided to a spectral shaping stage 470, which shapes the spectrum according to the smoothed spectrum provided by the smoothing stage 420. The output of the spectral shaping stage 470 is then provided to the synthesis filter 480, i.e. an inverse discrete Fourier transform, in order to obtain x'(n), representing the background component. The foreground component is then derived as the difference between the input signal and the output signal, i.e. x(n) − x'(n).
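A greatly simplified sketch of such a residual-based foreground/background split is given below (an assumption-laden stand-in for the fig. 4 chain: it only smooths the magnitude spectrum per frame and resynthesizes that smoothed part as the background estimate, omitting the whitening, peak-picking and LPC stages).

```python
import numpy as np

def split_foreground_background(x, frame=1024, hop=512, smooth_bins=9):
    """Return (foreground, background): the background x' is resynthesized from a
    spectrally smoothed version of each frame, the foreground is x - x'."""
    win = np.hanning(frame)
    kernel = np.ones(smooth_bins) / smooth_bins
    background = np.zeros(len(x))
    norm = np.zeros(len(x))
    for start in range(0, len(x) - frame, hop):
        seg = x[start:start + frame] * win
        spec = np.fft.rfft(seg)
        mag, phase = np.abs(spec), np.angle(spec)
        smoothed = np.convolve(mag, kernel, mode="same")       # suppress sharp transient/tonal peaks
        bg_mag = np.minimum(mag, smoothed)                     # keep only the noise-floor-like part
        bg = np.fft.irfft(bg_mag * np.exp(1j * phase), n=frame)
        background[start:start + frame] += bg * win
        norm[start:start + frame] += win ** 2
    background /= np.maximum(norm, 1e-12)
    return x - background, background
```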
Embodiments of the invention can operate in virtual reality applications, e.g. 3D gaming. In such applications, the synthesis of sound sources with a large spatial extent may be complicated and complex when based on conventional concepts. Such sources might, for example, be a seashore, a flock of birds, galloping horses, marching soldiers, or a clapping audience. Typically, such sound events are spatialized as a large group of point-like sources, which leads to computationally complex implementations, cf. Wagner, Andreas; Walther, Andreas; Melchior, Frank; Strauß, Michael; "Generation of Highly Immersive Atmospheres for Wave Field Synthesis Reproduction," at 116th International AES Convention, Berlin, 2004.
Embodiments may provide a method to carry out the synthesis of the extent of sound sources plausibly but, at the same time, with lower structural and computational complexity. Embodiments may be based on DirAC (DirAC = Directional Audio Coding), cf. Pulkki, Ville, "Spatial Sound Reproduction with Directional Audio Coding," in J. Audio Eng. Soc., Vol. 55, No. 6, 2007. In other words, in some embodiments the decomposer 110 and/or the renderer 120 and/or the processor 130 may be adapted to process DirAC signals. In other words, the decomposer 110 may comprise DirAC mono synthesis stages, the renderer 120 may comprise a DirAC synthesis stage, and/or the processor 130 may comprise a DirAC merging stage.
Embodiments may be based on DirAC processing, e.g. using only two synthesis structures, for example one for foreground sound sources and one for background sound sources. The foreground sound may be applied to a single DirAC stream with controlled directional data, resulting in the perception of nearby point-like sources. The background sound may also be reproduced by using a single directional stream with differently controlled directional data, which leads to the perception of spatially spread sound objects. The two DirAC streams may then be merged and decoded, for example for an arbitrary loudspeaker setup or for headphones.
Fig. 5 illustrates the synthesis of sound sources having a spatially large extent. Fig. 5 shows an upper mono synthesis block 610, which creates a mono DirAC stream leading to the perception of nearby point-like sound sources, such as the nearest clappers of an audience. The lower mono synthesis block 620 is used to create a mono DirAC stream leading to the perception of spatially spread sound, e.g. to generate background sound like the clapping of the audience. The outputs of the two DirAC mono synthesis blocks 610 and 620 are then merged in the DirAC merging stage 630. Fig. 5 shows that only two DirAC synthesis blocks 610, 620 are used in this embodiment. One of them is used to create sound events in the foreground, such as the nearest or nearby birds, or the nearest or nearby persons in a clapping audience, and the other to create background sound, such as the continuous sound of the flock of birds, etc.
The DirAC mono synthesis block 610 is used to synthesize the foreground sound into a mono DirAC stream in a way such that the direction data remains constant with frequency, but changes randomly in time or is controlled by an external process. The diffuseness parameter ψ is set to 0, i.e. representing a point-like source. The audio input to the block 610 is assumed to consist of temporally non-overlapping sounds, such as distinct bird calls or hand claps, which create the perception of a nearby sound source, such as a bird or a clapping person. The spatial extent of the foreground sound events is controlled by adjusting θ and θ_range-foreground, which means that individual sound events will be perceived in the directions θ ± θ_range-foreground; a single event, however, will still be perceived as point-like. In other words, point-like sound sources are created whose possible positions are limited to the range θ ± θ_range-foreground.
The background block 620 takes as its input audio stream a signal that contains all other sound events not present in the foreground audio stream, and is intended to include a large number of temporally overlapping sound events, for example hundreds of birds or a great number of far-away clappers. The attached direction values are then set randomly, both in time and in frequency, within the given constraint direction values θ ± θ_range-background. The spatial extent of the background sounds is thus synthesized with low computational complexity. The diffuseness ψ may also be controlled. If the diffuseness ψ is increased, the DirAC decoder applies the sound to all directions, which can be used when the sound source surrounds the listener completely. If it does not surround the listener, the diffuseness in embodiments may be kept low, close to zero, or zero.
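The sketch below illustrates, under stated assumptions, how the metadata of the two mono DirAC streams could be generated (the frame/band grid, the range values and the data layout are invented for illustration; DirAC only requires, per time-frequency tile, a direction and a diffuseness value alongside the mono audio): the foreground stream keeps its direction constant over frequency and re-draws it per frame with ψ = 0, while the background stream draws an independent direction per frame and per band within θ ± θ_range-background.

```python
import numpy as np

def foreground_dirac_metadata(n_frames, n_bands, theta, theta_range_fg, seed=0):
    """One direction per frame (constant over frequency), diffuseness psi = 0."""
    rng = np.random.default_rng(seed)
    per_frame = theta + rng.uniform(-theta_range_fg, theta_range_fg, size=n_frames)
    directions = np.repeat(per_frame[:, None], n_bands, axis=1)   # same value in every band
    diffuseness = np.zeros((n_frames, n_bands))                   # point-like source
    return directions, diffuseness

def background_dirac_metadata(n_frames, n_bands, theta, theta_range_bg, psi=0.1, seed=1):
    """Independent random direction per frame and per band, small controllable diffuseness."""
    rng = np.random.default_rng(seed)
    directions = theta + rng.uniform(-theta_range_bg, theta_range_bg, size=(n_frames, n_bands))
    diffuseness = np.full((n_frames, n_bands), psi)
    return directions, diffuseness

# usage: 100 frames, 20 bands, sources centered at 30 degrees
fg_dir, fg_psi = foreground_dirac_metadata(100, 20, theta=30.0, theta_range_fg=10.0)
bg_dir, bg_psi = background_dirac_metadata(100, 20, theta=30.0, theta_range_bg=40.0)
```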
Embodiments of the invention may provide the advantage that a good perceptual quality of the rendered sound is achieved at moderate computational cost. Embodiments may enable a modular implementation of spatial sound rendering, as shown in fig. 5.
Depending on the particular implementation requirements of the inventive methods, the inventive methods may be implemented in hardware or in software. The implementation can be performed using a digital storage medium, in particular a flash memory, a disc, a DVD or a CD having electronically readable control signals stored thereon, which cooperate with a programmable computer system such that the inventive methods are performed. Generally, the present invention is therefore a computer program product with a program code stored on a machine-readable carrier, the program code being operative for performing the inventive methods when the computer program product runs on a computer. In other words, the inventive methods are therefore a computer program having a program code for performing at least one of the inventive methods when the computer program runs on a computer.
Claims (4)
1. An apparatus (100) for determining a spatial output multi-channel audio signal based on an input audio signal, comprising:
a semantic decomposer (110) configured for decomposing the input audio signal to obtain a first decomposed signal having first semantic properties and a second decomposed signal having second semantic properties, the second semantic properties being different from the first semantic properties, the first decomposed signal being a foreground signal portion and the second decomposed signal being a background signal portion;
a renderer (120) for rendering the first decomposed signal with a first rendering characteristic to obtain a first rendered signal having the first semantic attribute, and for rendering the second decomposed signal with a second rendering characteristic to obtain a second rendered signal having the second semantic attribute, wherein the first rendering characteristic and the second rendering characteristic are different from each other,
wherein the renderer (120) comprises a first directional audio coding mono synthesis module (610) for rendering the foreground signal portion and a second directional audio coding mono synthesis module (620) for rendering the background signal portion, the first directional audio coding mono synthesis module (610) being configured for generating a first mono directional audio coding stream resulting in the perception of nearby point-like sources, the second directional audio coding mono synthesis module (620) being configured for generating a second mono directional audio coding stream resulting in the perception of spatially spread sound; and
a processor (130) for processing the first rendered signal and the second rendered signal to obtain the spatial output multi-channel audio signal, wherein the processor (130) comprises a directional audio coding combination module (630) for combining the first mono directional audio encoded stream and the second mono directional audio encoded stream.
2. The apparatus (100) of claim 1, wherein the first directional audio coding mono synthesis module (610) is configured such that the direction data remains constant with frequency and changes randomly in time within a controlled direction range, or is controlled by an external process, and the diffuseness parameter is set to 0, and
wherein the second directional audio coding mono synthesis module (620) is configured such that the direction data is set randomly, both in time and in frequency, within given constraint direction values.
3. A method for determining a spatial output multi-channel audio signal based on an input audio signal, comprising the steps of:
semantically decomposing the input audio signal to obtain a first decomposed signal having first semantic attributes and a second decomposed signal having second semantic attributes, the second semantic attributes being different from the first semantic attributes, the first decomposed signal being a foreground signal portion and the second decomposed signal being a background signal portion;
rendering the first decomposed signal with a first rendering characteristic by processing the first decomposed signal in a first directional audio coding mono synthesis stage (610) to obtain a first rendered signal having the first semantic attribute, the first directional audio coding mono synthesis stage (610) being configured for generating a first mono directional audio coding stream resulting in the perception of nearby point-like sources;
rendering the second decomposed signal with a second rendering characteristic by processing the second decomposed signal in a second directional audio coding mono synthesis stage (620) to obtain a second rendered signal having the second semantic attribute, the second directional audio coding mono synthesis stage (620) being configured for generating a second mono directional audio coding stream resulting in the perception of spatially spread sound; and
processing the first rendered signal and the second rendered signal by employing a directional audio coding combination stage (630) for combining the first mono directional audio encoded stream and the second mono directional audio encoded stream to obtain the spatial output multi-channel audio signal.
4. The method of claim 3, wherein in the first directional audio coding mono synthesis stage (610), the direction data is kept constant with frequency and is changed randomly in time within a controlled direction range, or is controlled by an external process, and the diffuseness parameter is set to 0, and
wherein in the second directional audio coding mono synthesis stage (620), the direction data is set randomly, both in time and in frequency, within given constraint direction values.
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US8850508P | 2008-08-13 | 2008-08-13 | |
| US61/088,505 | 2008-08-13 | ||
| EP08018793A EP2154911A1 (en) | 2008-08-13 | 2008-10-28 | An apparatus for determining a spatial output multi-channel audio signal |
| EP08018793.3 | 2008-10-28 |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1172475A1 HK1172475A1 (en) | 2013-04-19 |
| HK1172475B true HK1172475B (en) | 2015-06-19 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US8855320B2 (en) | Apparatus for determining a spatial output multi-channel audio signal | |
| AU2011247872B2 (en) | An apparatus for determining a spatial output multi-channel audio signal | |
| HK1172475B (en) | An apparatus for determining a spatial output multi-channel audio signal | |
| HK1164010B (en) | An apparatus for determining a spatial output multi-channel audio signal | |
| HK1168708B (en) | An apparatus for determining a spatial output multi-channel audio signal | |
| HK1154145B (en) | An apparatus for determining a spatial output multi-channel audio signal | |
| AU2011247873A1 (en) | An apparatus for determining a spatial output multi-channel audio signal |