
HK1159392B - Parametric joint-coding of audio sources - Google Patents

Parametric joint-coding of audio sources

Info

Publication number
HK1159392B
HK1159392B
Authority
HK
Hong Kong
Prior art keywords
audio
signal
source
decoder
parameters
Prior art date
Application number
HK11113485.2A
Other languages
Chinese (zh)
Other versions
HK1159392A1 (en)
Inventor
Christof Faller
Original Assignee
Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V.
Priority date
Filing date
Publication date
Priority claimed from EP05101055A external-priority patent/EP1691348A1/en
Application filed by Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V.
Publication of HK1159392A1 publication Critical patent/HK1159392A1/en
Publication of HK1159392B publication Critical patent/HK1159392B/en

Description

Parametric joint coding of audio sources
The present application is a divisional application of the invention patent application with application number 200680004728.5, filed on February 13, 2006 and entitled "Parametric joint-coding of audio sources".
Technical Field
The present invention relates generally to signal processing and, in particular, to coding and decoding of audio signals.
Background
1. Introduction
In the general coding problem, we have a number of (mono) source signals si(n) (1 ≤ i ≤ M) and a scene description vector S(n), where n is the time index. The scene description vector contains parameters such as (virtual) source positions, source widths and acoustic parameters, e.g. (virtual) room parameters. The scene description may be time-invariant or may change over time. The source signals and the scene description are coded and transmitted to a decoder. The coded source signals ŝi(n) are successively mixed as a function of the scene description Ŝ(n) to generate wave field synthesis, multi-channel or stereo signals. The decoder output signals are denoted x̂i(n) (0 ≤ i ≤ N). Note that the scene description vector S(n) may not be transmitted but may be determined at the decoder. In this document, the term "stereo signal" always refers to a two-channel stereo signal.
ISO/IEC MPEG-4 addresses the described coding scenario. It defines the scene description and uses, for each ("natural") source signal, a separate mono audio coder, e.g. an AAC audio coder. However, when a complex scene with many sources is to be mixed, the bit rate becomes high, i.e. the bit rate scales up with the number of sources. Coding one source signal with high quality requires about 60-90 kb/s.
Previously, we addressed a special case of the described coding problem [1][2] with a scheme denoted Binaural Cue Coding (BCC) for flexible rendering. By transmitting only the sum of the given source signals plus low bit rate side information, a low bit rate is achieved. However, the source signals cannot be recovered at the decoder, and the scheme is limited to stereo and multi-channel surround signal generation. Also, only simple mixing based on amplitude and delay panning is used. Thus, the directions of the sound sources can be controlled, but no other auditory spatial image attributes. A further limitation of this scheme is its restricted audio quality. In particular, the audio quality degrades as the number of source signals increases.
Document [1] (Binaural Cue Coding, parametric stereo, MPEG Surround) addresses the case where N audio channels are encoded and N audio channels with cues similar to the cues of the original audio channels are decoded. The transmitted side information includes inter-channel cue parameters related to the differences between the input channels.
The channels of stereo and multi-channel audio signals contain mixes of source signals and are thus different in nature from pure source signals. Stereo and multi-channel audio signals are mixed such that, when played back over a suitable playback system, the listener perceives an auditory spatial image ("sound stage") as captured by the recording setup or designed by the recording engineer during mixing. A number of schemes for joint coding of the channels of stereo or multi-channel audio signals have been proposed previously.
Disclosure of Invention
It is an object of the invention to provide a method for transmitting a plurality of source signals while using a minimum of bandwidth. In most known methods, the playback format (e.g. stereo, 5.1) is predetermined and has a direct influence on the coding scenario. The audio stream at the decoder side then supports only this predetermined playback format, binding the user to a predetermined playback scenario (e.g. stereo).
The invention encodes M source signals, which are typically not channels of a stereo or multi-channel signal but independent signals, e.g. different speech or instrument signals. The transmitted side information includes statistical information about the source signals.
The invention decodes the M audio channels using cues different from the cues of the original audio source signals. These different cues are either synthesized implicitly by applying a mixer to the received sum signal, the mixer being controlled as a function of the received statistical source information and the received (or locally determined) audio format parameters and mixing parameters; or these different cues are computed explicitly as a function of the received statistical source information and the received (or locally determined) audio format parameters and mixing parameters, and the computed cues are used to control a prior-art decoder (Binaural Cue Coding, parametric stereo, MPEG Surround) that synthesizes the output channels given the received sum signal.
The proposed scheme for joint coding of audio source signals is described first. This scheme jointly codes the source signals. Source signals are usually mono audio signals which are not suitable for playback over a stereo or multi-channel audio system. For brevity, audio source signals are often denoted source signals in the following.
Before playback, the source signals first need to be mixed into stereo, multi-channel or wave field synthesis audio signals. An audio source signal can be a single instrument or talker, or the sum of a number of instruments and talkers. A further example of a source signal is a mono audio signal captured with a spot microphone during a concert. Source signals are often stored on multi-track recorders or in hard-disk recording systems.
The claimed scheme for joint coding of audio source signals is based on transmitting only the sum of the audio source signals,

s(n) = Σ_{i=1..M} si(n)    (1)

or a weighted sum of the source signals. Optionally, the weighted sum can be carried out with different weights in different subbands, and the weights may be adapted in time. Equalization of the sum signal, as e.g. in Chapter 3.3.2 of [1], may also be applied. In the following, when we refer to the sum or the sum signal, we always mean a signal generated by (1) or generated as just described. In addition to the sum signal, side information is transmitted. The sum signal and the side information represent the output audio stream. Optionally, the sum signal is coded with a conventional mono audio coder. This stream can be stored in a file (CD, DVD, hard disk) or broadcast to the receiver. The side information represents the statistical properties of the source signals, which are the most important factors determining the perceptual spatial cues of the mixer output signals. It will be shown that these properties are the temporally evolving spectral envelope and the autocorrelation function. About 3 kb/s of side information is transmitted per source signal. At the receiver, the source signals ŝi(n) (1 ≤ i ≤ M) are recovered such that the aforementioned statistical properties approximate the corresponding properties of the original source signals and of the sum signal.
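For illustration only, the following sketch forms a sum signal of type (1), optionally with per-source weights; the function and variable names are assumptions introduced here and are not part of the original disclosure.

```python
import numpy as np

def make_sum_signal(sources, weights=None):
    """Form the transmitted sum signal s(n) = sum_i w_i * s_i(n), cf. (1).

    sources : (M, N) array of M mono source signals of length N samples.
    weights : optional per-source weights; None reproduces the plain sum (1).
    """
    sources = np.asarray(sources, dtype=float)
    if weights is None:
        weights = np.ones(sources.shape[0])
    return np.asarray(weights, dtype=float) @ sources  # shape (N,)
```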
Drawings
The invention will be better understood with the aid of the accompanying drawings, in which:
figure 1 shows a scheme for transmitting each source signal separately for further processing,
figure 2 shows a number of audio sources transmitted as sum signals plus side information,
figure 3 is a block diagram of a Binaural Cue Coding (BCC) scheme,
figure 4 shows a mixer for generating a stereo signal based on several source signals,
FIG. 5 shows the dependency between ICTD, ICLD and ICC and the sub-band power of the sound source signals,
figure 6 shows a procedure for the generation of side information,
figure 7 shows the process of estimating the LPC parameters of each source signal,
figure 8 shows the process of reconstructing a source signal from a sum signal,
figure 9 shows an alternative scheme for generating the pseudo-source signals from the sum signal,
figure 10 shows a mixer for generating a stereo signal based on a sum signal,
figure 11 shows an amplitude panning algorithm that prevents the source loudness from depending on the mixing parameters,
figure 12 shows a loudspeaker array of a wave field synthesis replay system,
figure 13 shows how the estimate of the audio source signal is recovered at the receiver by processing the down-mix of the transmission channel,
fig. 14 shows how the estimate of the source signal is recovered at the receiver by processing the transmission channels.
Detailed Description
Definitions, notation and variables
The following notations and variables are employed herein:
n: time index;
i: audio channel or source index;
d: delay index;
M: number of encoder input source signals;
N: number of decoder output channels;
xi(n): mixed original source signals;
x̂i(n): mixed decoder output signals;
si(n): encoder input source signals;
ŝi(n): transmitted source signals, also called pseudo-source signals;
s(n): transmitted sum signal;
yi(n): L-channel audio signal (audio signal to be re-mixed);
s̃i(n): a subband signal of si(n) (similarly defined for other signals);
E{s̃i²(n)}: short-time estimate of s̃i²(n) (similarly defined for other signals);
ICLD: inter-channel level difference;
ICTD: inter-channel time difference;
ICC: inter-channel coherence;
ΔL(n): estimated subband ICLD;
τ(n): estimated subband ICTD;
c(n): estimated subband ICC;
Δp̃i(n): relative source subband power;
ai, bi: mixer scale factors;
ci, di: mixer delays;
ΔLi, Ti: mixer level and time differences;
Gi: mixer source gain.
Joint coding of audio source signals
First, Binaural Cue Coding (BCC), a parametric multi-channel audio coding technique, is described. Then it is shown that, with the same insights on which BCC is based, an algorithm for jointly coding the source signals of the described coding scenario can be devised.
A. Binaural Cue Coding (BCC)
The BCC scheme [1][2] for multi-channel audio coding is shown in Fig. 3. The input multi-channel audio signal is down-mixed to a single channel. Instead of coding and transmitting information about all channel waveforms, only the down-mixed signal is coded (with a conventional mono audio coder) and transmitted. In addition, perceptually motivated "audio channel differences" are estimated between the original channels and also transmitted to the decoder. The decoder generates its output channels such that their audio channel differences approximate the corresponding audio channel differences of the original audio signal.
It is generally assumed that the perceptually relevant audio channel differences for a loudspeaker signal channel pair are the inter-channel time difference (ICTD) and the inter-channel level difference (ICLD). ICTD and ICLD can be related to the perceived direction of auditory events. Other auditory spatial image attributes, such as apparent source width and listener envelopment, can be related to interaural coherence (IC). For loudspeaker pairs in front of or behind a listener, the interaural coherence is often directly related to the inter-channel coherence (ICC), which is thus considered the third audio channel difference measure of BCC. ICTD, ICLD and ICC are estimated in subbands as a function of time. Both the spectral and temporal resolution used are perceptually motivated.
B. Parametric joint coding of audio sources
A BCC decoder is able to generate a multi-channel audio signal with any auditory spatial image by taking a mono signal and synthesizing, at regular time intervals, one specific ICTD, ICLD and ICC cue per subband and channel pair. The good performance of BCC schemes for a wide range of audio material [see 1] implies that the perceived auditory spatial image is largely determined by ICTD, ICLD and ICC. Therefore, as opposed to requiring the "clean" source signals si(n) as mixer input in Fig. 1, we merely require pseudo-source signals ŝi(n) with the property that the ICTD, ICLD and ICC at the mixer output are similar to the corresponding cues that would result if the real source signals were supplied to the mixer. There are three goals for the generation of ŝi(n):
● If the ŝi(n) are supplied to a mixer, the mixer output channels have approximately the same spatial cues (ICTD, ICLD, ICC) as if the si(n) were supplied to the mixer.
● The ŝi(n) are generated with as little information about the original source signals si(n) as possible (since the goal is low bit rate side information).
● The ŝi(n) are generated from the transmitted sum signal s(n) such that a minimum amount of signal distortion is introduced.
To derive the proposed scheme, we consider a stereo mixer (N = 2). A further simplification over the general case is that only amplitude and delay panning are applied for mixing. If the discrete source signals were available at the decoder, the stereo signals would be mixed as in Fig. 4, i.e.

x1(n) = Σ_{i=1..M} ai si(n − ci),  x2(n) = Σ_{i=1..M} bi si(n − di)    (2)
In this case, the scene description vector S(n) contains just the source directions, which determine the mixing parameters,
M(n)=(a1,a2,...,aM,b1,b2,...,bM,c1,c2,...,cM,d1,d2,...,dM)T (3)
where T is the transpose of the vector. Note that for the mixing parameters, we ignore the time index for ease of notation.
More convenient parameters for controlling the mixer than ai, bi, ci and di are the source gains Gi and the relative time and level differences Ti and ΔLi; ai, bi, ci and di are related to these by:
bi = 10^((Gi + ΔLi)/20) ai,  ci = max{−Ti, 0},  di = max{Ti, 0}    (4)
where Gi is a source gain factor in dB.
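As a minimal sketch of the amplitude/delay panning mixer of Fig. 4, the code below applies per-source gains and the delay relations ci = max{−Ti, 0}, di = max{Ti, 0} from (4); the use of integer-sample delays and the function name are simplifying assumptions, not part of the original text.

```python
import numpy as np

def pan_mix_stereo(sources, a, b, T):
    """Mix M sources into a stereo pair with per-source gains and delays, cf. (2), (4).

    sources : (M, N) array of source signals s_i(n).
    a, b    : length-M gain factors for channel 1 and channel 2.
    T       : length-M time differences in samples; c_i = max(-T_i, 0),
              d_i = max(T_i, 0) as in (4).
    """
    sources = np.asarray(sources, dtype=float)
    M, N = sources.shape
    x1 = np.zeros(N)
    x2 = np.zeros(N)
    for i in range(M):
        c = max(-int(T[i]), 0)          # delay applied in channel 1
        d = max(int(T[i]), 0)           # delay applied in channel 2
        x1[c:] += a[i] * sources[i, :N - c]
        x2[d:] += b[i] * sources[i, :N - d]
    return x1, x2
```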
Next, the ICTD, ICLD and ICC at the stereo mixer output are computed as a function of the input source signals si(n). The expressions obtained will indicate which source signal properties determine the ICTD, ICLD and ICC (together with the mixing parameters). The ŝi(n) are then generated such that the identified source signal properties approximate the corresponding properties of the original source signals.
B.1 ICTD, ICLD and ICC of the mixer output
The cues are estimated in subbands and as a function of time. In the following it is assumed that the source signals si(n) are zero mean and mutually independent. A pair of subband signals of the mixer output (2) is denoted x̃1(n) and x̃2(n). Note that, for simplicity of notation, the same time index n is used for time-domain and subband-domain signals. Also, no subband index is used, and the described analysis/processing is applied to each subband independently. The subband powers of the two mixer output signals are

E{x̃1²(n)} = Σ_{i=1..M} ai² E{s̃i²(n)},  E{x̃2²(n)} = Σ_{i=1..M} bi² E{s̃i²(n)}    (5)

where s̃i(n) is a subband signal of source si(n) and E{·} denotes short-time averaging, e.g.

E{s̃i²(n)} = (1/K) Σ_{n'=n−K+1..n} s̃i²(n')    (6)

where K determines the length of the moving average. Note that the subband power values E{s̃i²(n)} represent, for each source signal, the spectral envelope as a function of time. The ICLD, ΔL(n), is

ΔL(n) = 10 log10( Σ_i bi² E{s̃i²(n)} / Σ_i ai² E{s̃i²(n)} )    (7)
for the estimation of ICTD and ICC, normalized cross-correlation functions are estimated,
ICC, c (n), is calculated as follows:
to calculate ICTD, i.e., t (n), the position of the highest peak on the delay axis is calculated:
now the problem is how to be able to calculate the normalized cross-correlation function as a function of the mixing parameters. (8) Written together with (2) as:
which is equal to
Here, the autocorrelation function is normalizedThe method comprises the following steps:
and Ti=di-ci. Note that for the calculation (12) of the given (11), it is assumed that the signal is broadly stationary over the delay range considered, i.e. the signal is stationary in a broad sense
Fig. 5 shows an example of the numerical values of the two excitation signals, showing the dependency relationship between ICTD, ICLD, and ICC and the excitation subband powers. Δ L for different mixing parameters (4)1、ΔL2、T1And T2The top, middle and bottom of FIG. 5 represent Δ L (n), T (n) and c (n), respectively, as a function of the ratio of the subband powers of the two source signals,note that when only one source has power in a subband (a-0 or a-1), then the calculated Δ L (n) and t (n) are equal to the mixing parameter (Δ L)1、ΔL2、T1And T2)。
B.2 Necessary side information
The ICLD (7) depends on the mixing parameters (ai, bi, ci, di) and on the short-time subband powers of the sources, E{s̃i²(n)} (6). The normalized subband cross-correlation function Φ(n, d) (12), needed for the computation of ICTD (10) and ICC (9), depends on E{s̃i²(n)} and additionally on the normalized subband autocorrelation functions Φi(n, e) (13). Φ(n, d) is needed in the range mini{Ti} ≤ d ≤ maxi{Ti}. For source i with mixing parameter Ti = di − ci, the corresponding range in which the source signal subband property Φi(n, e) is needed is

mini{Ti} − Ti ≤ e ≤ maxi{Ti} − Ti    (14)
since the ICTD, ICLD and ICC cues depend on the sub-band characteristics of the source signals in the range (14)And phii(n, e), it is therefore these source signal subband characteristics that need to be transmitted as side information. We assume that any kind of mixer (e.g. efficient mixer, wave field synthesis mixer/convolver, etc.) has similar properties, whereby the side information is used in other mixers than the one in questionMay also be useful. In order to reduce the amount of side information, a set of predetermined autocorrelation functions is stored in the decoder and only the index for selecting the closest match to the source signal characteristics is transmitted. The first version of the algorithm assumes that phi is within range (14)i(n, e) is equal to 1, thereby calculating (12) using only the subband power value (6) as side information. Suppose phiiThe data shown in fig. 5 were calculated with (n, e) ═ 1.
In order to reduce the amount of side information, the relative dynamic range of the source information is limited. At each time instant, for each subband, the power of the strongest source is selected. We found it sufficient to lower-bound the corresponding subband powers of all other sources at a value 24 dB below the strongest subband power. Thereby, the dynamic range of the quantizer is limited to 24 dB.
Assuming that the source signals are independent, the decoder can compute the sum of all source subband powers as E{s̃²(n)} = Σi E{s̃i²(n)}. Thus, in principle it is enough to transmit to the decoder only the subband power values of M − 1 sources, while the subband power of the remaining source is computed locally. Given this idea, the side information rate can be slightly reduced by transmitting the subband powers of the sources with indices 2 ≤ i ≤ M relative to the power of the first source,

Δp̃i(n) = 10 log10( E{s̃i²(n)} / E{s̃1²(n)} )    (15)
note that the dynamic range limitation as described above is done before (15). As an alternative, the subband power values are normalized with respect to the sum signal subband power, as opposed to normalization (15) with respect to an audio source subband power. For a sampling frequency of 44.1kHz, we use 20 subbands and transmit each subband approximately every 12 millisecondsThe 20 subbands correspond to half the spectral resolution of the auditory system (one subband is as wide as two "critical bandwidths"). Informal experiments show that only a slight improvement is obtained by using more than 20 subbands, e.g. 40 subbands. The number of subbands and the subband bandwidth are selected according to the time and frequency resolution of the auditory system. Implementing this scheme with low quality requires at least three subbands (low frequency, intermediate frequency, high frequency).
According to a particular embodiment, the sub-bands have different bandwidths, the low frequency sub-bands having a smaller bandwidth than the high frequency sub-bands.
The relative power values are quantized with a scheme similar to the ICLD quantizer described in [2], resulting in a bit rate of about 3(M − 1) kb/s. Fig. 6 illustrates the side information generation process (corresponding to the "side information generation" block in Fig. 2).
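A sketch of the side information computation described above: per-subband short-time powers are lower-bounded 24 dB below the strongest source and then expressed relative to the first source as in (15). The frame-based processing and function name are assumptions for illustration.

```python
import numpy as np

def relative_subband_powers(subband_frames, floor_db=24.0, eps=1e-12):
    """Compute relative subband powers for one subband and time frame, cf. (15).

    subband_frames : (M, K) array, one frame of K samples per source subband.
    Returns a length-(M-1) array of 10*log10(E{s_i^2}/E{s_1^2}) for i = 2..M,
    after limiting the dynamic range to floor_db below the strongest source.
    """
    p = np.mean(np.asarray(subband_frames, dtype=float) ** 2, axis=1) + eps
    floor = p.max() * 10.0 ** (-floor_db / 10.0)
    p = np.maximum(p, floor)               # 24 dB dynamic range limitation
    return 10.0 * np.log10(p[1:] / p[0])   # powers relative to the first source
```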
The side information rate can be additionally reduced by analyzing the activity of each source signal and transmitting side information for a source only when it is active. As opposed to transmitting the subband power values E{s̃i²(n)} as statistical information, other information representing the spectral envelopes of the source signals can be transmitted. For example, linear predictive coding (LPC) parameters can be transmitted, or corresponding other parameters such as lattice filter parameters or line spectral pair (LSP) parameters. The process of estimating the LPC parameters of each source signal is illustrated in Fig. 7.
B.3 Computing the pseudo-source signals ŝi(n)
Fig. 8 shows the process used to re-create the source signals given the sum signal (1). This process is part of the "synthesis" block in Fig. 2. Each source signal is recovered by scaling each subband of the sum signal with gi(n) and applying a decorrelation filter with impulse response hi(n),

ŝi(n) = hi(n) ⋆ ( gi(n) s̃(n) )    (16)

(all quantities in the subband domain), where ⋆ is the linear convolution operator and gi(n) is computed from the side information as

gi(n) = sqrt( E{s̃i²(n)} / E{s̃²(n)} )    (17)
As decorrelation filters hi(n), complementary comb filters, all-pass filters, delays or filters with random impulse responses may be used. The goal of the decorrelation process is to reduce the correlation between the signals while not modifying how the individual waveforms are perceived. Different decorrelation techniques result in different artifacts. Complementary comb filters cause coloration. All the described techniques spread the energy of transients in time, causing artifacts such as "pre-echoes". Given their potential for artifacts, decorrelation techniques should be applied as little as possible. The next section describes techniques and strategies which require less decorrelation processing than the generation of independent signals ŝi(n).
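A per-subband sketch of (16)-(17): the source estimate is obtained by scaling the sum-signal subband with gi(n) and optionally convolving with a decorrelation filter hi(n). The simple interface and the handling of the filter tail are assumptions for illustration.

```python
import numpy as np

def recover_source_subband(s_sum, p_src, p_sum, h=None, eps=1e-12):
    """Recover one pseudo-source subband signal, cf. (16)-(17).

    s_sum : subband signal of the transmitted sum signal s(n) (one frame).
    p_src : short-time subband power E{s_i^2(n)} of source i (from side info).
    p_sum : short-time subband power E{s^2(n)} of the sum signal.
    h     : optional decorrelation filter impulse response h_i(n).
    """
    g = np.sqrt(p_src / (p_sum + eps))                 # scale factor g_i(n), cf. (17)
    s_hat = g * np.asarray(s_sum, dtype=float)
    if h is not None:
        s_hat = np.convolve(s_hat, h)[: len(s_sum)]    # decorrelation, cf. (16)
    return s_hat
```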
An alternative scheme for the generation of the signals ŝi(n) is shown in Fig. 9. First, the spectrum of s(n) is flattened by computing the linear prediction error e(n). Then, given the LPC filters fi estimated at the encoder, the corresponding all-pole filters are computed as the inverse z-transform of 1 / (1 − Fi(z)), where Fi(z) is the z-transform of the prediction filter fi. The resulting all-pole filters represent the spectral envelopes of the source signals. If side information other than LPC parameters is transmitted, the LPC parameters first need to be computed as a function of that side information. As in the other scheme, decorrelation filters hi are used to make the source signals independent.
IV. Implementations considering practical constraints
In the first part of this section, an implementation is described which uses a BCC synthesis scheme as the stereo or multi-channel mixer. This is particularly interesting because such a BCC-type synthesis scheme is part of the upcoming ISO/IEC MPEG standard known as "spatial audio coding". In this case, the source signals ŝi(n) are not computed explicitly, resulting in reduced computational complexity. Moreover, this scheme offers the potential for better audio quality, since effectively less decorrelation is needed than when the source signals ŝi(n) are computed explicitly.
The second part of this section discusses the problem when the proposed scheme uses any mixer and does not employ decorrelation processing at all. Such a scheme is less complex than a scheme employing decorrelation processing, but has other drawbacks (discussed later).
Ideally, decorrelation processing would be applied such that the generated ŝi(n) can be considered independent. However, since decorrelation processing is problematic in terms of artifacts, one would like to apply as little decorrelation processing as possible. The third part of this section discusses how the amount of problematic decorrelation processing can be reduced while effectively obtaining the benefit that would result if the generated ŝi(n) were independent.
A. Implementation without explicit computation of ŝi(n)
Mixing is applied directly to the transmitted sum signal (1), without explicit computation of ŝi(n). A BCC synthesis scheme is used for this purpose. In the following, we consider the stereo case, but all the described principles also apply to the generation of multi-channel audio signals.
A stereo BCC synthesis scheme (or a "parametric stereo" scheme), applied to processing the sum signal (1), is shown in Fig. 10. The BCC synthesis scheme should produce a signal that is perceived like the output signal of the mixer in Fig. 4. This is the case when the ICTD, ICLD and ICC between the BCC synthesis output channels are similar to the corresponding cues between the output signal channels of the mixer in Fig. 4.
The same side information as in the more general scheme described earlier is used, allowing the decoder to compute the short-time subband power values E{s̃i²(n)} of the sources. Given E{s̃i²(n)}, the gain factors g1 and g2 in Fig. 10 are computed as

g1(n) = sqrt( Σ_i ai² E{s̃i²(n)} / E{s̃²(n)} ),  g2(n) = sqrt( Σ_i bi² E{s̃i²(n)} / E{s̃²(n)} )    (18)

such that the output subband powers and the ICLD (7) are the same as for the mixer in Fig. 4. The ICTD, T(n), is computed according to (10), determining the delays D1 and D2 in Fig. 10:

D1(n) = max{−T(n), 0},  D2(n) = max{T(n), 0}    (19)
The ICC, c(n), is computed according to (9), determining the decorrelation processing in Fig. 10. Decorrelation processing (ICC synthesis) is described in [1]. The advantages of applying decorrelation processing to the mixer output channels, as opposed to using it to generate independent ŝi(n), are:
● Usually the number of source signals M is larger than the number of audio output channels N. Thus, the number of independent audio channels that need to be generated is smaller when decorrelating the N output channels than when decorrelating the M source signals.
● Often the N audio output channels are correlated (ICC > 0), and less decorrelation processing can be applied than would be needed to generate independent M or N channels.
Better sound quality is desired because less decorrelation processing is employed.
The best audio quality is expected when the mixer parameters are constrained such that ai² + bi² = 1, i.e. Gi = 0 dB. In this case, the power of each source in the transmitted sum signal (1) is the same as the power of the same source in the mixed decoder output signal. The decoder output signal (Fig. 10) is then the same as if the mixer output signal (Fig. 4) were encoded and decoded by a BCC encoder/decoder; thus, also similar quality can be expected.
The decoder can not only determine the direction at which each source appears, but can also vary the gain of each source. The gain is increased by choosing ai² + bi² > 1 (Gi > 0 dB) and decreased by choosing ai² + bi² < 1 (Gi < 0 dB).
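For illustration, the decoder-side mixing parameters of Fig. 10 might be derived from the side information as in the sketch below, combining the gains (18) and delays (19); the function interface and names are assumptions introduced here.

```python
import numpy as np

def bcc_mixing_parameters(p_src, a, b, T, p_sum, eps=1e-12):
    """Per-subband gains and delays for BCC-style synthesis, cf. (18)-(19).

    p_src : length-M array of source subband powers E{s_i^2(n)} (side info).
    a, b  : length-M amplitude panning gains of the desired mix.
    T     : desired ICTD for this subband, e.g. tau(n) from (10), in samples.
    p_sum : subband power of the transmitted sum signal E{s^2(n)}.
    """
    p_src = np.asarray(p_src, dtype=float)
    g1 = np.sqrt(np.sum(np.asarray(a, dtype=float) ** 2 * p_src) / (p_sum + eps))
    g2 = np.sqrt(np.sum(np.asarray(b, dtype=float) ** 2 * p_src) / (p_sum + eps))
    D1 = max(-T, 0)                       # delays, cf. (19)
    D2 = max(T, 0)
    return g1, g2, D1, D2
```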
B. Without decorrelation processing
A limitation of the technique described above is that mixing is carried out with a BCC synthesis scheme. One could imagine implementing not only ICTD, ICLD and ICC synthesis but also additional effects processing within the BCC synthesis.
However, it may be desired that existing mixers and effects processors can be used. This also includes wave field synthesis mixers (often denoted "convolvers"). For using existing mixers and effects processors, the ŝi(n) are computed explicitly and used as if they were the original source signals.
Good audio quality can also be obtained when no decorrelation processing is applied (hi(n) = δ(n) in (16)). This is a trade-off between artifacts introduced by the decorrelation processing and artifacts due to the fact that the source signals ŝi(n) are correlated. When no decorrelation processing is applied, the resulting auditory spatial image may suffer from instability [1]. But the mixer itself may introduce some decorrelation when reverberators or other effects are used, reducing the need for decorrelation processing.
If the ŝi(n) are generated without decorrelation processing, the loudness of a source depends on the direction to which it is mixed relative to the other sources. By replacing the amplitude panning algorithm in existing mixers with an algorithm compensating for this level dependence, the negative effect of the loudness dependence on the mixing parameters can be circumvented. A level-compensating amplitude panning algorithm is shown in Fig. 11; it compensates the dependence of the source loudness on the mixing parameters. Given the gain factors ai and bi of a conventional amplitude panning algorithm (e.g. Fig. 4), the weights in Fig. 11 are computed such that the output subband power is the same as it would be if the ŝi(n) were independent in each subband.
C. Reducing the amount of decorrelation processing
As discussed previously, the generation of independent ŝi(n) is problematic. Strategies are described here for applying less decorrelation processing while effectively obtaining a similar benefit as if the ŝi(n) were independent.
Consider, for example, a wave field synthesis system as shown in Fig. 12. The desired virtual positions of the sources s1, s2, ..., s6 (M = 6) are indicated. A strategy for computing the ŝi(n) without generating M fully independent signals is:
1. Generate groups of source indices corresponding to sources that are close to each other. For example, in Fig. 12 these could be: {1}, {2, 5}, {3} and {4, 6}.
2. At each time instant, in each subband, select the index of the strongest source, imax = arg maxi E{s̃i²(n)}. Apply no decorrelation processing to the sources of the group containing the strongest source, i.e. himax(n) = δ(n).
3. For each other group, select the same hi(n) within the group.
This algorithm at least ensures that the strongest signal component is not modified by decorrelation processing. In addition, the number of different hi(n) that are used is reduced. This is an advantage because decorrelation is easier when fewer independent channels need to be generated. The technique is also applicable when stereo or multi-channel audio signals are mixed.
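The three-step strategy can be sketched as follows: given spatial groups of sources and per-subband powers, the group containing the momentarily strongest source receives the identity filter δ(n), and each other group shares one decorrelation filter. The grouping input and the bank of candidate filters are assumptions for illustration.

```python
import numpy as np

def assign_decorrelation_filters(groups, p_src, candidate_filters):
    """Assign a decorrelation filter h_i(n) to every source, one filter per group.

    groups            : list of lists of source indices (sources close to each other).
    p_src             : length-M array of source subband powers E{s_i^2(n)}.
    candidate_filters : list of impulse responses to draw shared filters from.
    Returns a dict mapping source index -> impulse response.
    """
    i_max = int(np.argmax(p_src))                 # strongest source in this subband
    identity = np.array([1.0])                    # h(n) = delta(n): no decorrelation
    filters = {}
    k = 0
    for group in groups:
        if i_max in group:
            h = identity                          # step 2: leave the strongest group untouched
        else:
            h = candidate_filters[k % len(candidate_filters)]
            k += 1                                # step 3: one shared filter per group
        for i in group:
            filters[i] = h
    return filters
```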
V. Scalability in terms of quality and bit rate
The proposed scheme transmits only the sum of all source signals, which can be coded with a conventional mono audio coder. When mono backward compatibility is not needed and capacity is available for transmission/storage of more than one audio waveform, the proposed scheme can be scaled to use more than one transmission channel. This is implemented by generating several sum signals from different subsets of the given source signals, i.e. the proposed coding scheme is applied individually to each subset of source signals. Audio quality is expected to improve as the number of transmitted audio channels increases, because fewer independent channels have to be generated by decorrelation from each transmitted channel (compared to the case of a single transmitted channel).
VI. Backward compatibility to existing stereo and surround audio formats
Consider the following audio delivery scenario. A consumer obtains a maximum-quality stereo or multi-channel surround signal (e.g. via an audio CD, a DVD or an online music store). The goal is to optionally deliver to the consumer the flexibility to generate custom mixes of the obtained audio content, without compromising quality for standard stereo/surround playback.
This is implemented by also delivering to the consumer (e.g. as an optional purchase in an online music store) a side information bit stream that allows the computation of the ŝi(n) as a function of the given stereo or multi-channel audio signal. The consumer's mixing algorithm is then applied to the ŝi(n). In the following, two possibilities for computing the ŝi(n) from a given stereo or multi-channel audio signal are described.
A. Estimating the sum of the source signals at the receiver
The most straightforward way of using the proposed coding scheme with a stereo or multi-channel audio transmission is illustrated in Fig. 13, where yi(n) (1 ≤ i ≤ L) are the L channels of the given stereo or multi-channel audio signal. The sum signal of the sources is estimated by down-mixing the transmission channels into a single audio channel. Down-mixing can be carried out by summing the channels yi(n) (1 ≤ i ≤ L), or more sophisticated techniques can be applied.
For best performance, it is recommended that the level of the source signals be adapted prior to the estimation of E{s̃i²(n)} (6), such that the power ratios between the source signals approximate the power ratios with which the sources are contained in the given stereo or multi-channel signal. In this case, the down-mix of the transmission channels is a relatively good estimate of the sum of the sources (1) (or a scaled version thereof).
An automated process may be used to adjust the level of the encoder source signal inputs si(n) prior to the computation of the side information. This process estimates, adaptively in time, the level at which each source signal is contained in the given stereo or multi-channel signal. Prior to the side information computation, the level of each source signal is then adjusted, adaptively in time, such that it is equal to the level at which the source is contained in the stereo or multi-channel audio signal.
B. Using the transmission channels individually
Fig. 14 shows a different implementation of the proposed scheme with stereo or multi-channel surround signal transmission. Here, the transmission channels are not down-mixed but used individually for the generation of the ŝi(n). Most generally, the subband signals of the ŝi(n) are computed as

ŝi(n) = Σ_{l=1..L} wl(n) ỹl(n)

where the wl(n) are weights determining specific linear combinations of the subbands of the transmission channels. The linear combinations are chosen such that the ŝi(n) are already decorrelated as much as possible. Thus, no or only little decorrelation processing needs to be applied, which is favorable as discussed earlier.
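The computation above is just a per-subband linear combination of the transmission-channel subbands; how the weights wl(n) are chosen (e.g. so as to maximize decorrelation between the resulting pseudo-sources) is not specified here, and the interface below is purely illustrative.

```python
import numpy as np

def combine_transmission_channels(y_subbands, w):
    """Form one pseudo-source subband as a linear combination of the
    transmission-channel subbands.

    y_subbands : (L, K) array, one frame of K samples per transmission channel.
    w          : length-L weights w_l(n) for this source and subband.
    """
    return np.asarray(w, dtype=float) @ np.asarray(y_subbands, dtype=float)
```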
VII. Applications
In the preceding sections, we already mentioned a number of applications of the proposed coding scheme. Here, we summarize these and mention a few more.
A. Audio coding for mixing
Whenever source signals need to be stored or transmitted before being mixed into stereo, multi-channel or wave field synthesis audio signals, the proposed scheme can be applied. With prior art, a mono audio coder would be applied to each source signal independently, resulting in a bit rate that scales with the number of sources. The proposed coding scheme can encode a high number of audio source signals with a single mono audio coder plus relatively low bit rate side information. As described in Section V, the audio quality can be improved by using more than one transmitted channel, if the memory/capacity to do so is available.
B. Remixing with metadata
As discussed in Section VI, existing stereo and multi-channel audio signals can be complemented with additional side information (i.e. "metadata") to allow re-mixing. Rather than selling only optimized stereo and multi-channel mixed audio content, metadata can be sold that allows users to re-mix their stereo and multi-channel music. This can, for example, be used to attenuate the vocals in a song for karaoke, or to attenuate a specific instrument for playing along with the music.
Even if storage were not an issue, the described scheme would be very attractive for enabling custom mixing of music. That is because the music industry will most likely never be willing to give away multi-track recordings: there is too much risk of abuse. The proposed scheme enables re-mixing capability without giving away the multi-track recordings.
Furthermore, a certain degradation of the stereo or multi-channel signal quality occurs as soon as it is re-mixed, making illegal distribution of re-mixes less attractive.
C. Stereo/multi-channel to wave field synthesis conversion
Another application of the scheme described in Section VI is outlined below. The stereo and multi-channel (e.g. 5.1 surround) audio accompanying moving pictures can be extended for wave field synthesis rendering by adding side information. For example, Dolby AC-3 (audio on DVD) can be extended to carry audio for wave field synthesis systems in a 5.1 backward-compatible way: legacy players reproduce the 5.1 surround sound from the DVD, while a new generation of players supporting the side information processing reproduces wave field synthesis sound.
VIII. Subjective evaluations
We implemented a real-time decoder of the algorithms proposed in Sections IV-A and IV-B. An FFT-based STFT filterbank is used, with a 1024-point FFT and an STFT window size of 768 samples (with zero padding). The spectral coefficients are grouped together such that each group represents a signal with a bandwidth of about two equivalent rectangular bandwidths (ERB). Informal listening revealed that the audio quality did not notably improve when choosing a higher frequency resolution. A lower frequency resolution is favorable since it results in fewer parameters to be transmitted.
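Such an ERB-spaced grouping of FFT bins might be approximated as sketched below, using the Glasberg and Moore ERB formula; that formula, the hop of one group per iteration, and the function name are common choices assumed here, not values stated in the original text.

```python
import numpy as np

def erb_partitions(fs=44100, n_fft=1024, erbs_per_group=2.0):
    """Partition FFT bins into groups roughly two ERB wide (Glasberg & Moore ERB).

    Returns a list of (start_bin, stop_bin) index pairs covering bins 0..n_fft//2.
    """
    freqs = np.arange(n_fft // 2 + 1) * fs / n_fft
    erb_width = 24.7 * (4.37 * freqs / 1000.0 + 1.0)   # ERB in Hz at each bin frequency
    groups, start = [], 0
    while start <= n_fft // 2:
        width = erbs_per_group * erb_width[start]
        stop = int(np.searchsorted(freqs, freqs[start] + width, side="right"))
        stop = max(stop, start + 1)                     # always advance at least one bin
        groups.append((start, min(stop, n_fft // 2 + 1)))
        start = stop
    return groups
```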
For each source, the amplitude/delay panning and gain can be adjusted individually. The algorithm was used for coding several multi-track audio recordings with 12 to 14 tracks.
The decoder allows 5.1 surround mixing using a vector base amplitude panning (VBAP) mixer. The direction and gain of each source signal can be adjusted. The software allows on-the-fly switching between mixing the coded source signals and mixing the original discrete source signals.
Casual listening usually reveals no or little difference between mixing the coded source signals and mixing the original source signals if a gain Gi of 0 dB is used for each source. The more the source gains are changed, the more artifacts occur. Slight amplification or attenuation of the sources (e.g. up to ±6 dB) still sounds good. A critical scenario arises when all the sources are mixed to one side and only a single source is mixed to the other, opposite side. In this case the audio quality may be degraded, depending on the specific mix and the source signals.
IX. Conclusions
A coding scheme for the joint coding of audio source signals, e.g. the channels of a multi-track recording, has been proposed. The goal is not to code the source signal waveforms with high quality, in which case joint coding would give minimal coding gain since the sources are usually independent. The goal is that, when the coded source signals are mixed, a high-quality audio signal is obtained. By considering the statistical properties of the source signals, the properties of mixing schemes and spatial hearing, it was shown that a significant coding gain improvement is achieved by jointly coding the source signals.
The improvement in coding gain is a result of only one audio waveform being transmitted.
Side information representing the statistical properties of the source signals, which are relevant factors in determining the spatial perception of the final mix signal, is also transmitted.
The side information rate is about 3 kb/s per source signal. Any mixer can be applied to the coded source signals, e.g. stereo, multi-channel or wave field synthesis mixers.
Briefly, the proposed scheme can be scaled to higher bit rates and higher quality by transmitting more than one audio channel. Furthermore, a variant was proposed that allows re-mixing of a given stereo or multi-channel audio signal (and even changing the audio format, e.g. from stereo to multi-channel or to wave field synthesis).
The applications of the proposed scheme are manifold. For example, MPEG-4 could be extended with the proposed scheme to reduce the bit rate when more than one "natural audio object" (source signal) needs to be transmitted. Also, the proposed scheme offers a compact representation of content for wave field synthesis systems. As mentioned, existing stereo or multi-channel signals can be complemented with side information to allow the user to re-mix the signals to his liking.
Reference to the literature
[1] C. Faller, Parametric Coding of Spatial Audio, Ph.D. thesis, Swiss Federal Institute of Technology Lausanne (EPFL), 2004, Ph.D. Thesis No. 3062.
[2] C. Faller and F. Baumgarte, "Binaural Cue Coding - Part II: Schemes and applications," IEEE Trans. on Speech and Audio Proc., vol. 11, no. 6, Nov. 2003.

Claims (21)

1. A decoder for synthesizing a plurality of audio channels, the decoder configured to:
retrieving at least one sum signal representing the sum of the M source signals from the audio stream,
retrieving statistical information about one or more source signals from the audio stream, wherein the statistical information represents a spectral envelope of the source signals,
receiving or locally determining from the audio stream parameters describing an output audio format and mixing parameters,
calculating M pseudo-source signals from the at least one sum signal and the received statistical information, the number M of pseudo-source signals being equal to the number M of source signals, an
Synthesizing the plurality of audio channels from the pseudo-source signal by controlling a mixer in accordance with the received or locally determined audio format parameters and mixing parameters.
2. A decoder for synthesizing a plurality of audio channels, the decoder configured to:
retrieving at least one sum signal representing a sum of source signals from the audio stream,
retrieving statistical information about one or more source signals from an audio stream, wherein the statistical information represents a spectral envelope of the source signals,
parameters describing the output audio format and mixing parameters are received from the audio stream or determined locally,
calculating an output signal cue from the received statistical information, the audio format parameters and the mixing parameters, an
Synthesizing the plurality of audio channels from the sum signal based on the calculated cues.
3. Decoder according to claim 1 or 2, wherein the statistical information represents the relative power in terms of frequency and time of the source signal.
4. Decoder in accordance with claim 1 in which the pseudo source signals are computed in the subband domain of a filter bank.
5. Decoder in accordance with claim 2, in which the audio channels are synthesized in the subband domain of a filter bank.
6. Decoder according to claim 4 or 5, wherein the number and bandwidth of the sub-bands is determined according to the spectral and temporal resolution of the human auditory system.
7. Decoder according to claim 4, wherein the number of subbands is between 3 and 40.
8. The decoder of claim 4 wherein the subbands have different bandwidths, wherein the bandwidth of the lower frequency subbands is less than the bandwidth of the higher frequency subbands.
9. Decoder in accordance with claim 4, in which a Short Time Fourier Transform (STFT) -based filter bank is used and spectral coefficients are combined such that each group of spectral coefficients forms a subband.
10. Decoder according to claim 1 or 2, wherein the statistical information further comprises an autocorrelation function.
11. Decoder according to claim 1 or 2, wherein the spectral envelope is represented as linear predictive coded LPC parameters.
12. The decoder of claim 1, wherein the sum signal is divided into a plurality of subbands, and the statistical information is used to determine the power of each subband for each pseudo-source signal.
13. Decoder according to claim 1, wherein a linear prediction error of the sum signal is calculated, followed by all-pole filtering, in order to apply a spectral envelope determined by the statistical information for each pseudo source signal.
14. Decoder according to claim 12 or 13, wherein the output pseudo-source signal is made independent by means of a decorrelation technique, such as all-pass filtering.
15. Decoder according to claim 2, wherein the calculated cues are level differences, time differences or coherence according to different frequencies and time instants.
16. Decoder in accordance with claim 1, in which the mixer uses an amplitude panning algorithm compensating for the dependence of the audio source level on the mixing parameters.
17. Decoder according to claim 1, wherein the mixer is a wave field synthesis mixer.
18. Decoder according to claim 1, wherein the mixer is a binaural mixer.
19. Decoder according to claim 1, wherein the mixer is a 3D audio mixer.
20. A method for encoding a plurality of source signals (s1(n), s2(n), …, sM(n)), the method comprising:
for the plurality of source signals (s1(n), s2(n), …, sM(n)), calculating statistical information representing a spectral envelope of one or more of the source signals (s1(n), s2(n), …, sM(n)), wherein the statistical information comprises statistics (Φ(n, e)) of a normalized subband autocorrelation function (Φi(n, e)), or lattice filter parameters, or linear predictive coding (LPC) parameters, or line spectral pair parameters, and
transmitting the calculated statistical information as metadata of an audio signal obtained from the plurality of source signals (s1(n), s2(n), …, sM(n)).
21. An encoder for encoding a plurality of source signals (s1(n), s2(n), …, sM(n)), wherein the encoder is configured to:
for the plurality of source signals (s1(n), s2(n), …, sM(n)), calculate statistical information representing a spectral envelope of one or more of the source signals (s1(n), s2(n), …, sM(n)), wherein the statistical information comprises statistics (Φ(n, e)) of a normalized subband autocorrelation function (Φi(n, e)), or lattice filter parameters, or linear predictive coding (LPC) parameters, or line spectral pair parameters, and
transmit the calculated statistical information as metadata of an audio signal obtained from the plurality of source signals (s1(n), s2(n), …, sM(n)).
HK11113485.2A 2005-02-14 2011-12-14 Parametric joint-coding of audio sources HK1159392B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP05101055.1 2005-02-14
EP05101055A EP1691348A1 (en) 2005-02-14 2005-02-14 Parametric joint-coding of audio sources

Publications (2)

Publication Number Publication Date
HK1159392A1 HK1159392A1 (en) 2012-07-27
HK1159392B true HK1159392B (en) 2013-11-08

