WO2007034375A2

WO2007034375A2 - Determination of a distortion measure for audio encoding

Info

Publication number: WO2007034375A2
Application number: PCT/IB2006/053261
Authority: WO
Inventors: Steven L. J. D. E. Van De Par
Original assignee: Koninklijke Philips Electronics NV
Current assignee: Koninklijke Philips NV
Priority date: 2005-09-23
Filing date: 2006-09-13
Publication date: 2007-03-29
Anticipated expiration: 2008-03-23
Also published as: WO2007034375A3

Abstract

A distortion processor (207) is arranged to determine a distortion measure for an audio encoding of an audio signal. The distortion processor (207) comprises a temporal distortion processor (211) which generates a frequency domain sensitivity characteristic for the audio signal in response to a predetermined error estimate and a time domain analysis of the audio signal using a time domain model having a time scale smaller than a duration of an analysis segment of the audio encoding. The frequency domain sensitivity characteristic is fed to a frequency distortion processor (213) and may be in the form of a frequency domain masking curve. The frequency distortion processor (213) determines the distortion measure in response to a frequency domain representation of the audio signal, a frequency domain error signal and the frequency domain sensitivity characteristic. The frequency domain sensitivity characteristic is independent of the applied quantization and different quantization options can be evaluated using only frequency domain processing.

Description

Determination of a distortion measure for audio encoding

The invention relates to determination of one or more distortion measures for audio encoding and in particular, but not exclusively, to determination of distortion measures associated with different quantization options.

Digital encoding of various source signals has become increasingly important over the last decades as digital signal representation and communication increasingly has replaced analogue representation and communication. For example, mobile telephone systems, such as the Global System for Mobile communication, are based on digital speech encoding. Also distribution of media content, such as video and music, is increasingly based on digital content encoding.

In low bit-rate audio coding, psycho-acoustic masking models are often used to determine whether the quantization noise that is introduced by an audio encoder leads to audible artefacts. The general approach for a masking model is to provide a masking curve corresponding to a short segment of the original audio which for every frequency defines how much quantization noise can be allowed before audible artefacts are introduced. When the quantization noise is below the masking curve, it will be masked by the original signal. Nearly all existing models used in audio coding today consider only the segmental power spectrum (or an equivalent spectrum type) to determine the masking properties of the original signal. The power spectrum does not represent all information about the temporal structure of the audio signal that is relevant. It is known from various basic psycho-acoustical experiments that the temporal structure of an audio signal can greatly influence the masking properties.

Recent developments in the field of psycho-acoustics have led to more elaborate models of auditory masking that take into account the temporal structure of audio signals in order to more accurately predict the masking properties of the audio signal. These models have been developed in the field of psycho-acoustics with the primary purpose of improving the understanding of basic auditory processing. An example of such a model is presented in T. Dau and D. Püschel and A. Kohlrausch, (1996), 'A quantitative model of the 'effective' signal processing in the auditory system. I. Model structure', J. Acoust. Soc. Am., Vol. 99, pp. 3615-3622. This model provides for a very accurate prediction of simultaneous masking and the sustain of masking shortly after a masker has been switched off (forward masking). Another example of such a psycho-acoustic model is given in T. Dau and B. Kollmeier and A. Kohlrausch, (1997), 'Modeling auditory processing of amplitude modulation. I. Detection and masking with narrow-band carriers', J. Acoust. Soc. Am., Vol. 102, pp. 2892-2905. This model provides for a prediction of the effect of inherent modulations in a signal on the masking power of that signal. Specifically, a noise signal which has strong modulations tends to result in increased masking relative to a tonal signal with very little modulation.

The merit of these models is that they can predict the effects of the temporal properties of the original signal on the masking strength that is created by the signal. In particular, the models may allow an accurate determination of a distortion measure associated with a given quantization or encoding option which is used by the audio encoder.

However, the models are not well suited for practical applications as they are very complex to evaluate and require exceedingly high computational resource. Specifically, the structure of the models is such that in order to determine how the quantization noise floor needs to be shaped, the model calculates a so called internal representation of the original signal and of the quantized signal. These internal representations are then compared and the difference can be interpreted as an indication of the audibility of the quantization noise. Typically, this process needs to be repeated many times for each encoding segment before a suitable noise floor shape is found that does not lead to audible artefacts but results in an efficient encoding. For this reason, these advanced models are currently not applied in audio coding thereby resulting in suboptimal encoding and a degraded signal quality to data rate ratio.

Hence, an improved system for determining distortion measures and audio encoding would be advantageous and in particular a system allowing increased flexibility, reduced complexity, reduced computational demand, improved distortion measurement determination and/or improved audio encoding would be advantageous.

Accordingly, the invention seeks to preferably mitigate, alleviate or eliminate one or more of the above mentioned disadvantages singly or in any combination.

According to an aspect of the invention there is provided an apparatus for determining a distortion measure for an audio encoding of an audio signal, the apparatus comprising: first means for generating a frequency domain sensitivity characteristic for the audio signal in response to a predetermined error estimate and a time domain analysis of the audio signal using a time domain model having a time scale smaller than a duration of an analysis segment of the audio encoding; and second means for determining the distortion measure in response to a frequency domain representation of the audio signal, a frequency domain error signal and the frequency domain sensitivity characteristic.

The invention may allow separate determination of a frequency domain sensitivity characteristic taking temporal characteristics into account followed by a frequency domain determination of a distortion measure. The invention may provide reduced computational resource demand. In particular, the invention may enable or facilitate the use of advanced temporal psycho-acoustical models in practical encoders. The invention may allow an improved audio encoding. In particular, the invention may improve performance and/or reduce resource requirements in audio encoders that evaluate a plurality of encoding options before selecting the encoding option providing the lowest perceivable distortion.

The distortion measure may be a perceptual distortion measure.

The audio encoding may be segmented and the time scale of the time domain model may be smaller than the segmentation interval. The frequency domain representations may be obtained by block based time domain to frequency domain transforms corresponding to the segments. The time domain model may specifically be a psycho-acoustical perceptual model taking into account temporal masking variations within the individual audio encoding segments of the audio encoding. According to an optional feature of the invention, the first means comprises means for dividing the audio signal into psycho-acoustic bands and is arranged to perform the time domain analysis on the individual psycho-acoustic bands.

The time domain model may be applied to each individual psycho-acoustic band and may in particular be evaluated individually for each psycho-acoustic band without consideration of any other band.

The feature may allow efficient and/or accurate distortion measurement determination and may result in improved audio encoding.

According to an optional feature of the invention, the first means comprises means for determining an error sensitivity weighting for each psycho-acoustic band and means for determining an error sensitivity value for each psycho-acoustic band in response to a predetermined subband error estimate and the error sensitivity weighting for each psycho-acoustic band.

The sensitivity value for a psycho-acoustic band may be determined by an individual weighting of a predetermined band error estimate without consideration of other bands. The feature may facilitate implementation, reduce complexity, reduce computational resource requirements and/or improve performance.

According to an optional feature of the invention, the apparatus further comprises conversion means for generating an error sensitivity value for each frequency subband of the audio encoding in response to error sensitivity values of each of the psycho-acoustic bands.

This may facilitate and/or improve performance. In particular, it may provide a practical approach for determining error sensitivity values that can directly be used for evaluating encoding performance. According to an optional feature of the invention, the conversion means is arranged to determine the error sensitivity value for each frequency subband of the audio encoding in response to a characteristic of an audio encoding filter of the audio encoding.

This may allow accurate determination of the sensitivity values as they apply to the encoding subbands. According to an optional feature of the invention, the time domain model comprises first weighting means for determining a first weight value for the predetermined error estimate in response to a forward masking model.

This may allow improved performance and may in particular allow a more accurate distortion measure which more closely reflects the perceptual characteristics of a user. The first weight value may be determined individually for each psycho-acoustic band and may be applied individually to each psycho-acoustic band. Thus, psycho-acoustic bands may be processed individually and separately.

According to an optional feature of the invention, the first weighting means is arranged to reduce the first weight value for increasing signal levels of the audio signal. The weight value may be reduced for increasing signal levels to indicate a reduced sensitivity to errors. In embodiments where an increasing weight value indicates a reduced sensitivity to errors, the weight value may be increased for increasing signal levels.

The signal level of the audio signal may be a signal level in a given time interval and/or following low pass filtering with a given dynamic performance and/or may be in the individual psycho-acoustic bands.

The feature may allow a practical determination of an accurate sensitivity indication taking into account temporal characteristics. According to an optional feature of the invention, the first weighting means is arranged to limit the first weight value.

This may improve dynamic performance.

According to an optional feature of the invention, the first weighting means is arranged to normalize the first weight value in response to a signal level of the audio signal. This may improve performance and may provide a sensitivity indication more accurately reflecting the human audio perceptual characteristics.

According to an optional feature of the invention, the time domain model comprises second weighting means for determining a second weight value for the predetermined error estimate in response to a temporal modulation model.

This may allow improved performance and may in particular allow a more accurate distortion measure which more closely reflects the human audio perceptual characteristics. The second weight value may be determined individually for each psycho-acoustic band and may be applied individually to each psycho-acoustic band. Thus, psycho-acoustic bands may be processed individually and separately.

According to an optional feature of the invention, the second weighting means is arranged to determine the second weight value in response to a temporal signal variance of the audio signal.

The second weight value may for example be modified to indicate reducing sensitivity for an increasing variance. The temporal signal variance may for example be determined as a standard deviation relative to an average value. The temporal signal variance may be a temporal variance within a time interval and/or of a low pass filtered version of the audio signal and/or may be individually determined for each psycho-acoustic band.

According to an optional feature of the invention, the second means is arranged to adjust the distortion measure in response to a correlation characteristic between the frequency domain error signal and the audio signal.

This may allow an improved distortion measure and may in particular allow improved audio encoding.

According to an optional feature of the invention, the first means is arranged to modify the frequency domain sensitivity characteristic in response to a temporal onset delay of a transient of the audio signal relative to the onset of an encoding segment of the audio encoding.

This may allow an improved distortion measure and may in particular allow improved audio encoding. According to an optional feature of the invention, the first means is arranged to modify the frequency domain sensitivity characteristic in response to a time envelope of the audio signal within a first frequency interval.

This may allow an improved distortion measure and may in particular allow improved audio encoding. According to another aspect of the invention there is provided an encoder as outlined above.

According to an optional feature of the invention, the encoder comprises means for generating frequency domain representations of the audio signal and frequency domain error signals for a plurality of encoding options and means for selecting an encoding option from the plurality of encoding options in response to distortion measures determined for the plurality of encoding options by the second means.

This may allow audio encoding with high quality while maintaining low complexity and/or low computational resource requirements.

According to another aspect of the invention there is provided, a method of determining a distortion measure for an audio encoding of an audio signal, the method comprising: generating a frequency domain sensitivity characteristic for the audio signal in response to a predetermined error estimate and a time domain analysis of the audio signal using a time domain model having a time scale smaller than a duration of an analysis segment of the audio encoding; and determining the distortion measure in response to a frequency domain representation of the audio signal, a frequency domain error signal and the frequency domain sensitivity characteristic.

According to another aspect of the invention there is provided, a method of encoding an audio signal comprising: generating a frequency domain sensitivity characteristic for the audio signal in response to a predetermined error estimate and a time domain analysis of the audio signal using a time domain model having a time scale smaller than a duration of an analysis segment of the audio encoding; for a plurality of encoding options determining a distortion measure in response to a frequency domain representation of the audio signal, a frequency domain error signal and the frequency domain sensitivity characteristic; selecting an encoding option from the plurality of encoding options in response to the distortion measures; and encoding the audio signal using the selected encoding option.

According to another aspect of the invention there is provided a computer program product for executing any of the methods outlined above.

According to another aspect of the invention there is provided, a method of transmitting an audio signal, the method comprising: generating a frequency domain sensitivity characteristic for the audio signal in response to a predetermined error estimate and a time domain analysis of the audio signal using a time domain model having a time scale smaller than a duration of an analysis segment of the audio encoding; for a plurality of encoding options determining a distortion measure in response to a frequency domain representation of the audio signal, a frequency domain error signal and the frequency domain sensitivity characteristic; selecting an encoding option from the plurality of encoding options in response to the distortion measures; encoding the audio signal using the selected encoding option; and transmitting the encoded audio signal.

According to another aspect of the invention there is provided a transmitter for transmitting an audio signal, the transmitter comprising: means for generating a frequency domain sensitivity characteristic for the audio signal in response to a predetermined error estimate and a time domain analysis of the audio signal using a time domain model having a time scale smaller than a duration of an analysis segment of the audio encoding; means for, for a plurality of encoding options, determining the distortion measure in response to a frequency domain representation of the audio signal, a frequency domain error signal and the frequency domain sensitivity characteristic; means for selecting an encoding option from the plurality of encoding options in response to the distortion measures; means for encoding the audio signal using the selected encoding option; and means for transmitting the encoded audio signal.

These and other aspects, features and advantages of the invention will be apparent from and elucidated with reference to the embodiment(s) described hereinafter.

Embodiments of the invention will be described, by way of example only, with reference to the drawings, in which

Fig. 1 illustrates a transmission system 100 for communication of an audio signal in accordance with some embodiments of the invention;

Fig. 2 illustrates an example of an audio encoder in accordance with some embodiments of the invention; and

Fig. 3 illustrates an example of an apparatus for determining a distortion measure in accordance with some embodiments of the invention.

The following description focuses on embodiments of the invention applicable to an Advanced Audio Coder (AAC). However, it will be appreciated that the invention is not limited to this application but may be applied to many other encoding standards and approaches.

Fig. 1 illustrates a transmission system 100 for communication of an audio signal in accordance with some embodiments of the invention. The transmission system 100 comprises a transmitter 101 which is coupled to a receiver 103 through a network 105 which specifically may be the Internet. In the specific example, the transmitter 101 is a signal recording device and the receiver 103 is a signal player device but it will be appreciated that in other embodiments a transmitter and receiver may be used in other applications and for other purposes. For example, the transmitter 101 and/or the receiver 103 may be part of a transcoding functionality and may e.g. provide interfacing to other signal sources or destinations. In the specific example where a signal recording function is supported, the transmitter 101 comprises a digitizer 107 which receives an analog signal that is converted to a digital PCM signal by sampling and analog-to-digital conversion.

The digitizer 107 is coupled to the encoder 109 of Fig. 1 which encodes the PCM signal in accordance with an encoding algorithm. The encoder 109 is coupled to a network transmitter 111 which receives the encoded signal and interfaces to the Internet 105. The network transmitter 111 may transmit the encoded signal to the receiver 103 through the Internet 105.

The receiver 103 comprises a network receiver 113 which interfaces to the Internet 105 and which is arranged to receive the encoded signal from the transmitter 101. The network receiver 113 is coupled to a decoder 115. The decoder 115 receives the encoded signal and decodes it in accordance with a decoding algorithm.

In the specific example where a signal playing function is supported, the receiver 103 further comprises a signal player 117 which receives the decoded audio signal from the decoder 115 and presents this to the user. Specifically, the signal player 117 may comprise a digital-to-analog converter, amplifiers and speakers as required for outputting the decoded audio signal.

Fig. 2 illustrates the encoder 109 in more detail. In the example, the encoder 109 generates an AAC (Advanced Audio Coder) compatible encoded signal. The encoder 109 comprises a receiver 201 which receives the PCM time domain audio signal that is to be encoded.

The receiver 201 is coupled to a transform processor 203 which transforms the time domain audio signal to a frequency domain representation. Specifically, the transform processor 203 can implement a Modified Discrete Cosine Transform (MDCT) thereby generating the encoding subbands as is well known for AAC encoders.

The transform processor 203 is coupled to an encoding unit 205 which performs the encoding of the frequency domain representation of the audio signal in accordance with the AAC standard. In particular, the encoding unit 205 performs a quantization of the data values of the individual subbands. The quantization applied to the individual subbands depends on the perceptual significance of the introduced quantization noise to the subbands. In the encoder 109 of Fig. 2, the encoding unit 205 evaluates a plurality of different possible quantization options (or settings). The encoding unit 205 is coupled to a distortion processor 207 which is operable to determine a distortion measure for a given quantization option. For each possible quantization option, a distortion measure is determined by the distortion processor 207 and the encoding unit 205 selects the quantization option which results in the lowest distortion. The encoding unit 205 then proceeds to encode the audio signal using the selected encoding option.
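The selection step described above can be sketched as a simple loop over candidate quantization options. This is a minimal illustration only: the uniform quantizer, the representation of an option as a tuple of per-subband step sizes, and the names `quantize` and `select_option` are assumptions made for the sketch and are not taken from the AAC standard or from the described encoder. The distortion measure is passed in as a callable so that any measure, including a perceptual one, could be used.

```python
def quantize(coeffs, steps):
    """Uniform quantization of each subband value with its own step size (assumed quantizer)."""
    return [round(c / s) * s for c, s in zip(coeffs, steps)]


def select_option(coeffs, options, distortion):
    """Return (best_option, best_distortion) over all candidate options.

    coeffs     : frequency domain subband values of one encoding segment
    options    : iterable of per-subband step-size tuples (hypothetical option format)
    distortion : callable(coeffs, error) -> float, lower is better
    """
    best_option, best_value = None, None
    for steps in options:
        quantized = quantize(coeffs, steps)
        # Frequency domain error signal for this option.
        error = [c - q for c, q in zip(coeffs, quantized)]
        value = distortion(coeffs, error)
        if best_value is None or value < best_value:
            best_option, best_value = steps, value
    return best_option, best_value


def squared_error(coeffs, error):
    """Placeholder measure for demonstration; not a perceptual measure."""
    return sum(e * e for e in error)


if __name__ == "__main__":
    coeffs = [1.2, 9.0]
    options = [(2.0, 0.5), (0.5, 4.0)]
    print(select_option(coeffs, options, squared_error))
```

In a practical encoder the candidate options would normally be constrained to a common bit budget, which this sketch does not model.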

The encoding unit 205 is furthermore coupled to a bit stream processor 209 which generates a bit stream comprising the encoded audio signal. The bit stream is then fed to the network transmitter 111 for transmission to the receiver 103.

The distortion processor 207 is arranged to determine a distortion measure which is indicative of the perceptual distortion that will be experienced by a human listener. Specifically, the distortion processor 207 includes an accurate perceptual model of human listening which allows the perceived distortion for a given quantization option to be determined.

In the embodiments, the distortion processor 207 comprises a temporal distortion processor 211 which generates a frequency domain sensitivity characteristic for the audio signal. Specifically, the temporal distortion processor 211 receives the time domain audio signal from the receiver 201 and evaluates a perceptual time domain model to generate the frequency domain sensitivity characteristic. Thus, the temporal distortion processor 211 takes into account the temporal characteristics of the audio signal to generate a frequency domain sensitivity characteristic which is an indication of the masking characteristics of the time domain audio signal and thus indicates a user's sensitivity to quantization errors. The frequency domain sensitivity characteristic is determined by a time domain evaluation using a time domain model that has a time scale smaller than the duration of an analysis segment of the encoder 109. In particular, the transform processor 203 and the encoding unit 205 operate on encoding segments that are individually transformed into the frequency domain and encoded. For an AAC encoder operating at 44.1 kHz sampling frequency, encoding segments are typically of the order of 3 to 20 msec. The time scales of the time domain model of the temporal distortion processor 211 are smaller than the segmentation duration and therefore allow temporal variations within the individual encoding segment to be taken into account when determining the frequency domain sensitivity characteristic. This allows for a determination of the frequency domain sensitivity characteristic which more closely reflects a user's perception of the quantization noise.

Furthermore, the frequency domain sensitivity characteristic is determined on the basis of a predetermined error estimate which does not depend on the specific audio signal or the quantization option, but rather is some estimate of the average quantization noise that can occur in a specific encoding subband. In the described embodiments, the frequency domain sensitivity characteristic is determined as an individual sensitivity or masking value for each encoding subband. Thus, the frequency domain sensitivity characteristic can indicate the sensitivity of a user to a quantization of a subband value of an encoding segment which reflects not only the frequency characteristics for the encoding segment but also the time domain characteristics within the encoding segment or even in other encoding segments.

The time domain model is typically very complex and is evaluated only once for each encoding segment. Thus, for each encoding segment, a single frequency domain sensitivity characteristic is determined. This characteristic is then fed to the frequency distortion processor 213 which determines a distortion measure by a frequency domain evaluation.

Specifically, the frequency distortion processor 213 receives a frequency domain representation of the audio signal from the encoding unit 205 as well as a frequency domain representation of the error signal resulting from a given quantization option. A total distortion measure is then determined by evaluating the error signal relative to the frequency domain audio signal taking into account the frequency domain sensitivity characteristic. This evaluation is relatively simple and has low computational resource requirements as it is only performed in the frequency domain and using the subbands generated by the transform processor 203. Accordingly, the encoding unit 205 feeds the frequency distortion processor 213 the frequency domain error signal for all the quantization options that are to be evaluated and determines a distortion measure for each. The resulting distortion measure reflects the temporal characteristics and masking effects through the use of the frequency domain sensitivity characteristic without requiring any separate time domain evaluation. Thus, an accurate distortion measure can be determined for all quantization options and the most appropriate quantization option can be selected. Additionally, this can be achieved using a high complexity time domain model while maintaining a low computational resource demand thereby allowing for a practical encoder implementation with significantly improved encoding performance.

Fig. 3 illustrates the distortion processor 207 in more detail. The distortion processor 207 and the psycho-acoustical model which is used in the example will be described in more detail in the following.

As mentioned, the distortion processor 207 comprises a temporal distortion processor 211 which implements time domain processing based on the time domain input signal that is provided to this block. At the output of the temporal distortion processor 211, a number of outputs are provided as inputs to the frequency distortion processor 213. In the example, these outputs closely resemble a masking curve presented on a perceptually relevant scale, such as the ERB ("Equivalent Rectangular Bandwidth") scale as described in B.R. Glasberg and B.C.J. Moore, (1990), 'Derivation of auditory filter shapes from notched-noise data', Hearing Research, Vol. 47, pp. 103-138.

Thus, the frequency distortion processor 213 receives the inputs from the temporal distortion processor 211 containing a representation of the masking curve. In addition there are separate inputs consisting of a frequency domain representation of the input audio signal and the quantization error signal determined for the quantization option for which the distortion measure is determined. The ability to operate on frequency domain representations of the input signals is a significant advantage of the implementation. Specifically, in many state-of-the-art audio encoders, such as AAC, some kind of transform of the input signal to the frequency domain is used for coding the signal (e.g. the Modified Discrete Cosine Transform (MDCT)). It is on this transform domain representation that quantization is performed and accordingly it is highly efficient to evaluate the perceptual distortion that is introduced by these quantization operations using the same signal representations without the need to first apply a signal transform to another domain.

Thus, although detailed time domain information is taken into account by the temporal distortion processor 211 in the encoder 109 of Fig. 2, this does not require that the evaluation of quantization options is also made in the time domain. Rather, the described approach allows the frequency domain data available in the encoder 109 to be directly evaluated. This provides a significant computational advantage as the determination of the appropriate quantization noise floor generally requires that a large number of quantization options are evaluated for each audio segment that needs to be encoded.
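The computational advantage can be illustrated with a short sketch in which the sensitivity characteristic is computed once per segment and then reused for every quantization option. The passage above does not give a formula for the distortion measure, so the sketch assumes a simple stand-in: the error power in each subband divided by a per-subband masking value, summed over subbands. The function name `distortion_measure` and the example values are hypothetical.

```python
def distortion_measure(error, masking):
    """Assumed stand-in measure: sum over subbands of error power / masking value.

    error   : frequency domain error signal for one quantization option
    masking : per-subband masking values (the sensitivity characteristic),
              computed once per encoding segment by the time domain model
    """
    return sum(abs(e) ** 2 / m for e, m in zip(error, masking))


if __name__ == "__main__":
    # Masking values for one segment; in the described system these would come
    # from the (expensive) time domain analysis, evaluated a single time.
    masking = [1.0, 4.0, 0.25]

    # Error signals of three candidate options; each evaluation is only a
    # cheap frequency domain pass over the subbands.
    option_errors = [
        [1.0, 2.0, 0.0],
        [0.5, 0.5, 0.5],
        [0.0, 4.0, 0.1],
    ]
    for err in option_errors:
        print(distortion_measure(err, masking))
```

The cost per option is linear in the number of subbands, whereas re-running a full time domain model for each option would repeat the filter bank and adaptation stages every time.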

The operation of the temporal distortion processor 211 and the frequency distortion processor 213 will be described in more detail in the following.

In the temporal distortion processor 211, a number of processing stages are present. The audio signal is fed to an auditory filter bank 301 (specifically a Basilar Membrane or auditory filter bank) which models properties of the basilar membrane that can be found in the cochlea of the human auditory system. In the specific embodiment, the filter bank comprises a number of gammatone filters which are implemented as complex IIR filters, such as for example described in P.I.M. Johannesma, (1972), 'The pre-response stimulus ensemble of neurons in the cochlear nucleus' in Proceedings of the symposium on hearing theory, IPO, Eindhoven, The Netherlands. These filters have a band-pass characteristic which follows the estimated auditory filter bandwidth as a function of the center frequency such as e.g. proposed in B.R. Glasberg and B.C.J. Moore, (1990), 'Derivation of auditory filter shapes from notched-noise data', Hearing Research, Vol. 47, pp. 103-138. In the example, the filters are separated uniformly on an ERB rate scale. Thus, in the example of Fig. 3, the filter bank 301 generates a plurality of psycho-acoustic bands. It will be appreciated, that many different alternatives for generating psycho-acoustic bands are possible, including an implementation based on a Fast Fourier Transform performed on overlapping time segments of the input signal followed by a spectral weighting function resembling the auditory filter shapes followed by a root mean square operation to estimate activity within each auditory filter.
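The uniform spacing of filters on an ERB rate scale can be sketched as follows. The sketch uses the bandwidth and ERB-rate expressions commonly attributed to the Glasberg and Moore (1990) reference cited above; the frequency range, the number of bands and all function names are assumptions chosen for illustration, and the gammatone filters themselves are not implemented here.

```python
import math


def erb_bandwidth(f_hz):
    """Equivalent rectangular bandwidth in Hz at center frequency f_hz."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)


def erb_rate(f_hz):
    """Position of f_hz on the ERB-rate scale."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)


def erb_rate_to_hz(e):
    """Inverse of erb_rate."""
    return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37


def center_frequencies(f_low, f_high, n_bands):
    """Center frequencies spaced uniformly on the ERB-rate scale (n_bands >= 2)."""
    e_low, e_high = erb_rate(f_low), erb_rate(f_high)
    step = (e_high - e_low) / (n_bands - 1)
    return [erb_rate_to_hz(e_low + i * step) for i in range(n_bands)]


if __name__ == "__main__":
    # Hypothetical range and band count, for illustration only.
    for fc in center_frequencies(100.0, 8000.0, 10):
        print(round(fc, 1), round(erb_bandwidth(fc), 1))
```

Each center frequency and its bandwidth would then parameterize one band-pass filter of the filter bank.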

The following processing is performed separately and individually for each psycho-acoustic band.

In particular, an envelope extractor 303 generates absolute signal values for the individual signals in the psycho-acoustic bands. Since the auditory filters of the filter bank 301 in the specific embodiment are implemented as complex filters, an envelope of the filtered outputs can be extracted simply by taking the absolute value of each complex filter output sample. This envelope will be used for determining the relevant temporal properties of the input signal to estimate their effect on the masking capabilities of the input signal. The envelope values are provided as an input to a forward masking processor 305 which comprises a series of five cascaded adaptation loops. As described in T. Dau and D. Puschel and A. Kohlrausch, (1996), 'A quantitative model of the 'effective' signal processing in the auditory system. I. Model structure', J. Acoust. Soc. Am., Vol. 99, pp. 3615-3622, such loops are able to model the effect of forward masking in the human auditory system with high accuracy. In the example, each adaptation loop consists of a gain control unit that is driven by the output of a low-pass filter (e.g. an RC network) operating on the input signal. The total effect of one adaptation loop is that a steady state input signal is compressed according to a square root transformation. When the input signal is switched off, the gain reduction will persist for some time due to the attenuation that is created by the low-pass filtering. This effect accounts for forward masking as seen in the human auditory system.
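A minimal sketch of five cascaded adaptation loops is given below, assuming a feedback structure in which each loop divides its input by a first-order low-pass filtered version of its own output. The time constants (5, 50, 129, 253 and 500 ms) are the values commonly quoted for the Dau et al. model and are an assumption here, as are the function name, the input floor and the initialization; none of these are specified in this description.

```python
import math

def adaptation_loops(envelope, fs,
                     taus=(0.005, 0.050, 0.129, 0.253, 0.500),
                     floor=1e-5):
    """Pass an envelope through cascaded adaptation loops (illustrative only).

    envelope: non-negative envelope samples of one psycho-acoustic band
    fs:       sampling rate in Hz
    taus:     assumed low-pass time constants in seconds, one per loop
    floor:    assumed minimum input level, avoids division by zero
    """
    # One-pole low-pass coefficient for each loop.
    coeffs = [math.exp(-1.0 / (fs * tau)) for tau in taus]
    # Start every loop in the steady state that corresponds to the floor level.
    states, s = [], floor
    for _ in taus:
        s = math.sqrt(s)
        states.append(s)
    out = []
    for x in envelope:
        v = max(x, floor)
        for k, a in enumerate(coeffs):
            v = v / states[k]                          # gain control
            states[k] = a * states[k] + (1.0 - a) * v  # low-pass of the output
        out.append(v)
    return out
```

In this sketch a single loop settles at the square root of a steady input, so five loops settle at the 32nd root, which is close to a logarithmic compression. Directly after an onset the low-pass states have not yet adapted and the output overshoots its steady-state value.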

The combined effect of the five adaptation loops can be represented by a transforming function T:

$$y(n) = T\big(x(n), x(n-1), x(n-2), \ldots\big) = T(x)$$

where x(n) is the input sample at time instance n, and y(n) is the output sample at the same time instance. Thus, the transformation that is imposed by the adaptation loops depends in principle on all the past samples of the input signal which for convenience is omitted from the notation T(x).

For a steady state signal, the transform of the adaptation loops is an approximately logarithmic transform of the input (x) to the output (y) of the adaptation loops:

$$y = T(x) \approx \alpha \ln(x) + \beta$$

This transform is mapped to a dB (decibel) scale using a linear mapping. In this way, an input signal that is presented at a certain level is mapped to the (approximately) corresponding dB level.

The extracted envelope (x) is also provided to a first weight processor 307 which generates a weight value for each psycho-acoustic band. The weight value is used to modify a predetermined error estimate to generate a sensitivity indication for the psycho-acoustic band. In the embodiment, the first weight processor 307 implements a number of functions. The first function is related to the net effect that is achieved by the five cascaded adaptation loops. The gains of all five adaptation loops can be combined into one total gain. The combined gain (as a function of time) can be obtained by dividing the sample-by-sample output values of the last adaptation loop, y, by the input signal of the first adaptation loop, x:

g(n) = y(n)/x(n)

This gain function is indicative of the momentary sensitivity of the auditory system to an input signal. When the gain is large (no adaptation), any change in the input signal will have a large effect on the output of the adaptation loops, i.e. the auditory system is very sensitive to changes in the input signal and a high weight value is determined. When the gain is small (much adaptation), a change in the input signal will have only a small effect on the output of the adaptation loops, i.e. the auditory system is rather insensitive to changes in the input signal and a low weight value is determined. This insensitivity can occur for example shortly after the offset of a sustained input signal (forward masking).

The second function performed by the first weight processor 307 is that of limiting or clipping the weight value. This function is related to the observation that the adaptation loops need some time to adapt to a sudden increase in level in the input signal. This implies that in response to a large transient, the masking effect thereof will not be instantaneously reflected because of the delay of the adaptation loops. As a consequence, the adaptation loops will show a large overshoot at the onset of a masker, which quickly reduces after the onset. This prediction is not in line with the actually observed masking effects for human listeners. These observations suggest that the masking effect is already present directly after onset of the masker. As a remedy for this discrepancy between observed and predicted masking effects, a clipping operation is performed by the first weight processor 307. As mentioned before, the steady-state effect of the adaptation loops is that they map the input signal to a dB scale. At the onset of a signal transient, an overshoot in the response of the adaptation loops can occur and this overshoot can be removed or reduced by clipping the signal with a directly measured dB value of the extracted envelope:

$$y = \min\big(T(x),\, 20\log_{10}(x)\big)$$

Note that this operation can be performed before the gain is derived. Accordingly, including clipping, the gain function can be expressed by:

$$g = \min\big(T(x),\, 20\log_{10}(x)\big) / x$$

The third function performed by the first weight processor 307 is related to the use of the gain function as a weight value for a predetermined estimated error signal. In each psycho-acoustic band, a predetermined error value is assumed and this error value is weighted by the first weight value generated by the first weight processor 307 (in the specific example, the predetermined error value is multiplied by the first weight value). When the gain (and thus in the example also the weight value) is large, the sensitivity to the error signal is large and vice versa. From auditory perception it is known that the detectability of an error signal is roughly constant for a constant error-signal-to-masker ratio. The output gain is normalized such that for the steady-state case, a constant error-signal-to-masker ratio leads to constant predicted detectability of the error signal. The normalized gain (g_e) that is used in this embodiment is:

$$g_e = \frac{\min\big(T(x),\, 20\log_{10}(x)\big)}{x \cdot 20\log_{10}(x)}$$
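The clipping and normalization steps can be sketched per sample as follows. The function name and argument names are hypothetical, and the sketch assumes that the adaptation loop output has already been mapped to the dB scale and that the envelope sample exceeds 1 so that its dB level is positive.

```python
import math

def normalized_gain(x, t_x):
    """Normalized gain g_e for one envelope sample (illustrative only).

    x:   envelope sample, assumed > 1 so that 20*log10(x) is positive
    t_x: adaptation-loop output T(x), assumed already mapped to dB
    """
    level_db = 20.0 * math.log10(x)   # directly measured dB level
    clipped = min(t_x, level_db)      # removes onset overshoot
    return clipped / (x * level_db)   # normalization
```

For an envelope sample of 100 (40 dB), a steady-state loop output of 40 dB and an overshooting output of 70 dB both give a gain of 1/100, whereas a loop output that is still adapted to an earlier louder signal, for example 10 dB, gives the lower gain 0.0025.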

As previously mentioned, the gain function that is derived is used as a weight value and is multiplied by the predetermined error estimate (ε) to generate a sensitivity value. Therefore the integrated product (d) of the normalized gain function and the estimated error signal in the steady-state case will be:

$$d = \sum_n \big(\varepsilon(n)\, g_e(n)\big)^2 = \sum_n \left(\varepsilon\, \frac{\min\big(T(x),\, 20\log_{10}(x)\big)}{x \cdot 20\log_{10}(x)}\right)^2 \approx \sum_n \left(\frac{\varepsilon}{x}\right)^2$$

due to the fact that in the steady state, the adaptation loops perform an approximate dB transform (20 log10(x)). As can be seen, the product, which is an estimate of the sensitivity of listeners to the error, is proportional to the error-signal-to-masker ratio, assuming that x serves as the masker.
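As a small numerical illustration of the steady-state case, the sketch below sums the squared product of an error envelope and a gain of 1/x over a segment. The function name and the sample values are assumptions made for illustration only.

```python
def sensitivity_index(error_env, gain):
    """Sum over time of the squared product of error envelope and gain."""
    return sum((e * g) ** 2 for e, g in zip(error_env, gain))

n = 100  # hypothetical number of samples in the segment

# Steady state: the normalized gain is approximately 1/x.
d_quiet = sensitivity_index([0.5] * n, [1.0 / 50.0] * n)   # error 0.5, masker 50
d_loud = sensitivity_index([5.0] * n, [1.0 / 500.0] * n)   # error 5, masker 500
```

Both cases have the same error-signal-to-masker ratio (0.01) and therefore yield the same index, which reflects the constant detectability described above.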

In the example of Fig. 3, the distortion processor 207 comprises an error store 309 which contains the predetermined error estimate. In particular, the error store 309 comprises a stored predetermined error estimate value for each of the psycho-acoustic bands. The predetermined error estimate (ε) represents the average temporal envelope of an error signal such as it would look after basilar membrane filtering. It is assumed that the error signal is centered within the pass band of the basilar membrane filter. The average temporal envelope of an error signal can be estimated very well in case of a transform coder. The quantization noise that is introduced in such a coder is typically smeared out across the whole segment to be encoded following (on average) the envelope of the prototypical filter shape. Only the level (and not the average shape) of the quantization noise depends on the signal that is quantized and the quantization step size used, and it is therefore possible for the temporal distortion processor 211 to use the same fixed error estimate for all the possible quantization options that may be applied.

In the embodiment, the temporal distortion processor 211 comprises a first multiplier 311 which multiplies the predetermined error estimate values and the first weight values from the first weight processor 307 to generate the sensitivity values d (one for each of the psycho-acoustic bands). The multiplied values may then be summed. The multiplication of the estimated error values and the weight values enables accurate modeling of the masking effect taking into account the dynamical properties of the input signal. Specifically, the integration of this multiplication over time represents an estimate of the ability of human listeners to hear the introduced error signal.

To exemplify this, let us consider the case that the input signal consists of a transient signal that is placed in the middle of the segment under consideration. The gain that is determined will be high in the interval just before the transient, because the adaptation loops will not have adapted to the transient signal and the clipping is not active. As a consequence, the model will be highly sensitive to the presence of an error signal. The level of the error signal will be proportional to the level of the transient that is present within the segment. However, the gain function just before the transient will reflect a sensitivity corresponding to a much lower input signal level. Thus, the prediction is that the multiplication of the gain function and the estimated error signal just before the onset of the transient will lead to very high values, i.e. high sensitivity to the estimated error signal. This property reflects the extreme sensitivity of the human auditory system to the occurrence of pre-echos in audio coding. Note that the pre-echo is only observed when the transient is placed after the beginning of a segment. When a segment is placed such that the transient starts immediately after the beginning of the window, no pre-echos will be present in this segment. Usually segments have a considerable overlap and for the given example it can be expected that in the preceding segment a considerable pre-echo can occur and therefore more bits need to be allocated in this segment to prevent pre-echos.

Shortly after the onset of the transient, the gain factor will be reduced considerably, first by the clipping and, if the transient has a sufficiently long sustain, later by the steady-state response of the adaptation loops. Thus for this part of the segment, much less sensitivity results, in line with insights from psycho-acoustics. When the transient is finished, the gain function will recover slowly. Even in the next encoding segment, the gain function can still be low as a result of the presence of a transient in the previous segment. Therefore, in the next segment, reduced sensitivity to the presence of an error signal can result, which predicts the effect of forward masking as discovered within the field of psycho-acoustics. Hence, it is a feature of the temporal distortion processor 211 that the frequency domain sensitivity characteristic depends on the temporal position of a transient of the audio signal within an encoding segment of the audio encoding. Thus, the perceptual model can adjust its masked threshold (and distortion predictions) depending on the position of the transient within a segment. In particular, the model provides for the temporal distortion processor 211 modifying the frequency domain sensitivity characteristic in response to a temporal onset delay of a transient of the audio signal relative to the onset of an encoding segment of the audio encoding.

Similarly, it is a feature of the implemented model that the frequency domain sensitivity characteristic is dependent on the time envelope of the audio signal within a given frequency interval.

In the case where a long duration masker is present within a large range of segments, the adaptation loops will be fully adapted to the input signal and the gain factors will be low corresponding to less sensitivity to the presence of an error signal.

The above described method is able to accurately predict the masking behavior of dynamically changing signals by constructing a sensitivity index (d) for each psycho-acoustic band using a predetermined estimated error signal. The approach can accurately predict how many bits are needed to avoid pre-echos and the method is also able to specify this for each frequency range separately such that extra bits are only allocated in frequency ranges where they are required.

In addition to the forward masking model implemented by the forward masking processor 305 and the first weight processor 307, the temporal distortion processor 211 furthermore implements a modulation model. In particular, the temporal distortion processor 211 comprises a second weight processor 313 which receives the filtered signal from the adaptation loops and generates a second weight value for the predetermined error estimate.

Thus, in order to model auditory modulation processing, the output of the adaptation loops is also used to determine the nature of the envelope modulations that are observed in the input signal. From psycho-acoustical measurements, it is known that signals with very flat temporal envelopes (such as very tonal signals) create comparably less masking than signals with a moderate degree of modulation in the envelope (such as a noise signal). The second weight processor 313 evaluates the degree of modulation within each psycho-acoustic band and generates an output weight value which is applied to the sensitivity index from the first multiplier 311. In particular, the signal from the first multiplier 311 is fed to a second multiplier 315 which scales the sensitivity index in accordance with the second weight value determined by the second weight processor 313.

In the specific embodiment, the following modified sensitivity values are generated:

where d_m is the sensitivity index adjusted to incorporate modulation effects, C is a calibration constant (a value of 2 is typically appropriate), σ_E is the standard deviation of the adaptation loop output, μ_E is the mean of the adaptation loop output, σ_N is the standard deviation of a band-pass Gaussian-noise envelope and μ_N is the mean of a band-pass Gaussian-noise envelope. The latter two variables can be derived analytically and are replaced in the right-hand side of the equation. There is a certain time interval on which the variables for the modification procedure are based. Typically an interval of 100-300 ms is appropriate considering auditory time processing.

An alternative method for deriving d_m may be based on a band-pass filtering of the envelope modulations resulting from the input audio signal. The band-pass filter can be chosen such that it corresponds to the modulations that are typically introduced by the quantizing operation in the encoder. The modulations are related to the prototypical window that is used. When strong modulations are inherently present in the envelope, sensitivity to the modulations introduced by the quantization step will be low. In contrast, when very few modulations are present, sensitivity will be very high. It will be appreciated that other means or algorithms for modifying the determined sensitivities may be used.
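The modulation statistics named above can be illustrated with a short sketch that computes the ratio of standard deviation to mean over an analysis interval. It does not reproduce the expression for d_m. The function and constant names are hypothetical, and the analytic noise reference assumes that the envelope of band-pass Gaussian noise is Rayleigh distributed, for which the ratio of standard deviation to mean is sqrt(4/π - 1).

```python
import math
import statistics

# Standard deviation over mean of a Rayleigh-distributed envelope
# (assumed model for a band-pass Gaussian-noise envelope).
RAYLEIGH_RATIO = math.sqrt(4.0 / math.pi - 1.0)

def modulation_ratio(loop_output):
    """Standard deviation divided by mean of the samples in one interval."""
    mu = statistics.fmean(loop_output)
    sigma = statistics.pstdev(loop_output)
    return sigma / mu

flat = modulation_ratio([2.0] * 50)            # tonal-like, flat envelope
modulated = modulation_ratio([1.0, 3.0] * 25)  # strongly modulated envelope
```

A flat envelope gives a ratio of zero, well below the noise reference of about 0.52, which corresponds to the weaker masking of tonal signals described above.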

The determined sensitivity values, d_m, are indicative of the human perceptual sensitivity to errors within the individual psycho-acoustic bands. The second multiplier 315 is coupled to a subband converter 317 which converts these sensitivities to corresponding sensitivities in the encoding subbands used by the encoder 109 and specifically to the subbands generated by the transform processor 203.

Specifically, the band-pass error signals that are introduced by an audio codec are not necessarily limited to a single psycho-acoustic band. Rather, due to the band-pass filtering that takes place in the auditory system, the error signal will not only have an effect on the auditory filter spectrally centered on the error signal, but also in adjacent auditory filters.

A spreading matrix S_fb can be used as a mapping of the encoding subbands (b) to the psycho-acoustic bands (f) and can take into account the spectral spreading that results from the particular window shape (or prototype filter) that is used by the transform processor 203. Given the sensitivity indices, d_m, which are determined for each psycho-acoustic band, the subband converter 317 determines the sensitivity in the subband domain of the encoder 109 and specifically transforms the sensitivity to a masking curve that is provided from the temporal distortion processor 211 to the frequency distortion processor 213.

Specifically, it is assumed that if there is an error signal with an energy of m_b in encoding subband b, then the product S_fb m_b reflects the energy that is seen within psycho-acoustic band f as a result of the error signal in band b. For each of the auditory filters, a sensitivity d_mf is calculated by the temporal distortion processor 211. The total sensitivity to the presence of the error signal in encoding subband b is then given by the summation of sensitivities in all psycho-acoustic bands:

$$D_b = \sum_f d_{mf}\, S_{fb}\, m_b$$

Assuming that a total sensitivity of D = 1 corresponds to a level of audibility of the error signal that is at the threshold of detectability, this value may be determined as:

$$D_b = 1 = \sum_f d_{mf}\, S_{fb}\, m_b$$

from which it follows that

$$\frac{1}{m_b} = \sum_f d_{mf}\, S_{fb}$$

where m_b represents the masking curve in encoding subband b. By considering all encoding subbands, a complete masking curve is obtained. This masking curve incorporates a number of features that take account of the temporal masking properties of the input signal and of the synthesis window that is used by the transform processor 203. Specifically, the matrix S_fb is determined in response to the transfer spectrum of the prototype filter used for the time domain to frequency domain transform.
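The conversion from per-band sensitivities to a masking curve can be sketched as below, taking the masking value of each encoding subband as the reciprocal of the spreading-weighted sum of the sensitivities. The function name, the list-of-lists layout of the spreading matrix and the numerical values are assumptions for illustration.

```python
def masking_curve(d_m, S):
    """Masking value per encoding subband (illustrative only).

    d_m: sensitivity per psycho-acoustic band f
    S:   spreading matrix, S[f][b] maps encoding subband b to band f
    """
    n_sub = len(S[0])
    return [1.0 / sum(d_m[f] * S[f][b] for f in range(len(d_m)))
            for b in range(n_sub)]

# Hypothetical example: two psycho-acoustic bands, two encoding subbands.
d_m = [2.0, 0.5]
S = [[1.0, 0.25],
     [0.0, 0.50]]
m = masking_curve(d_m, S)
```

An error energy equal to `m[b]` in subband b then gives a total sensitivity of exactly 1, i.e. an error at the assumed threshold of detectability.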

The processing performed by the temporal distortion processor 211 is only performed once for each encoding segment and is independent of the specific quantization option applied to the audio signal. The masking curve that results from this calculation is provided to the frequency distortion processor 213 and can be re-used for every individual encoding quantization option that is evaluated. Typically this is repeated for many different encoding options for each segment.

In the frequency distortion processor 213, the perceptual distortion calculations are performed for a particular segment based on the masking curve that is provided by the temporal distortion processor 211 and on frequency domain representations of the audio signal and the error signal for the specific quantization option being evaluated. Thus, the error signal and the audio signal can be provided in the encoding subband domain where the quantization is actually taking place and no requirement for any further conversion is introduced.

Specifically, the distortion measure can be determined as follows. The error signal energies in the encoding subbands, n_b, are divided by the masking curve values, m_b, and integrated across the complete spectrum. This result can be understood by considering that the total perceptual distortion, D, results from the addition of the perceptual distortions D_b resulting from the individual subbands:

$$D = \sum_b D_b = \sum_b \sum_f d_{mf}\, S_{fb}\, n_b = \sum_b n_b \sum_f d_{mf}\, S_{fb} = \sum_b \frac{n_b}{m_b} = \sum_b \frac{1}{m_b} \sum_{\omega=\omega_b}^{\omega_{b+1}-1} \Delta^2(\omega)$$

where ω is a frequency index referring to the frequency domain data (transform coefficients), ω_b is the starting frequency of a subband used by the proposed model, and Δ(ω) is the error spectrum (in the transform domain) due to quantization.
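The frequency domain evaluation can be sketched as follows: the error energy in each encoding subband is divided by the masking value of that subband and the results are summed. Because the masking curve does not depend on the quantization, the same curve can be reused for every candidate. The function name, the band layout and the candidate error spectra are hypothetical.

```python
def perceptual_distortion(error_spectrum, band_starts, masking):
    """Sum over subbands of error energy divided by masking value.

    error_spectrum: quantization error per transform coefficient
    band_starts:    start index of each subband plus one final end index
    masking:        masking value m_b per subband
    """
    total = 0.0
    for b, m_b in enumerate(masking):
        lo, hi = band_starts[b], band_starts[b + 1]
        n_b = sum(abs(c) ** 2 for c in error_spectrum[lo:hi])
        total += n_b / m_b
    return total

# Hypothetical example: four coefficients in two subbands.
masking = [0.5, 2.0]
band_starts = [0, 2, 4]
options = {
    "coarse": [0.1, 0.1, 0.2, 0.2],
    "fine": [0.05, 0.05, 0.1, 0.1],
}
scores = {name: perceptual_distortion(err, band_starts, masking)
          for name, err in options.items()}
best = min(scores, key=scores.get)
```

In an encoder the selection would typically also weigh the bit cost of each option; the sketch only ranks the candidates by distortion.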

In some embodiments, the frequency distortion processor 213 is furthermore arranged to adjust the distortion measure in response to a correlation characteristic between the frequency domain error signal and the audio signal.

Specifically, when the two signals are highly correlated, considerable energy modifications can occur within certain ERB bands which can cause very significant and perceptible artefacts. Therefore, in each ERB band the energy change is measured and the distortion derived for that spectral interval is increased when there is a considerable energy change.

In the specific embodiment the adjusted distortion measure D_a can be determined as:

where x(ω) is the spectrum of the original (unquantized) signal, and δ represents the number of subbands of the perceptual model that are taken together in the evaluation of the energy criterion. It can be noted that the energy of the difference between the audio signal and the error signal depends on the correlation between both signals. Typically the subbands are chosen such that they are one quarter of a critical band and δ is chosen to be four, such that the total range that is taken together is one auditory critical band in total. The numerator of the right hand division is then the change in energy in one critical band, and the max operator in the denominator is a normalization to ensure that relative energy changes are measured.

It will be appreciated that other adjustments of the distortion measure are possible including e.g. the use of different normalizations or different parameters. Also adjustments directly based on the correlation between the error signal and the input audio signal can be taken into account.

Thus, the described encoder provides a computationally efficient method for modeling the audibility of the quantization operation that is applied to an audio signal. The method takes into account knowledge that is included in the most advanced psycho-acoustical models. The model specifically processes the temporal properties of the input signal, such as inherent modulations in the input signal and transient behavior of the input signal, to more accurately determine the audibility of the quantization operation. In addition, the described approach takes into account correlations between quantization noise and the original signal. By taking into account the transient properties of the original signal, an accurate prediction of the audibility of pre-echos can be derived, and the proposed model can correctly inform the encoder of how many extra bits are needed in each subband to avoid pre-echos (quantization noise occurring before a transient) becoming audible. At the same time, it is known that after the end of a transient or sustained sound, the masking effect can persist for some time (forward masking). The proposed model can inform the encoder of the level of masking that can be expected after the offset of a transient or sustained sound.

By taking into account the modulation properties of the original signal, the weaker masking capabilities of a tonal signal, with very few inherent modulations, as compared to a noisy signal, with much stronger masking capabilities, can accurately be predicted. In existing models used in audio coding, modulations are determined based on the presence of peaks in the spectrum of the input signal, which is an indirect and inaccurate approach in comparison to the direct observation of the time signal in the present model.

It will be appreciated that the above description for clarity has described embodiments of the invention with reference to different functional units and processors. However, it will be apparent that any suitable distribution of functionality between different functional units or processors may be used without detracting from the invention. For example, functionality illustrated to be performed by separate processors or controllers may be performed by the same processor or controllers. Hence, references to specific functional units are only to be seen as references to suitable means for providing the described functionality rather than indicative of a strict logical or physical structure or organization.

The invention can be implemented in any suitable form including hardware, software, firmware or any combination of these. The invention may optionally be implemented at least partly as computer software running on one or more data processors and/or digital signal processors. The elements and components of an embodiment of the invention may be physically, functionally and logically implemented in any suitable way. Indeed the functionality may be implemented in a single unit, in a plurality of units or as part of other functional units. As such, the invention may be implemented in a single unit or may be physically and functionally distributed between different units and processors.

Although the present invention has been described in connection with some embodiments, it is not intended to be limited to the specific form set forth herein. Rather, the scope of the present invention is limited only by the accompanying claims. Additionally, although a feature may appear to be described in connection with particular embodiments, one skilled in the art would recognize that various features of the described embodiments may be combined in accordance with the invention. In the claims, the term comprising does not exclude the presence of other elements or steps.

Furthermore, although individually listed, a plurality of means, elements or method steps may be implemented by e.g. a single unit or processor. Additionally, although individual features may be included in different claims, these may possibly be advantageously combined, and the inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. Also the inclusion of a feature in one category of claims does not imply a limitation to this category but rather indicates that the feature is equally applicable to other claim categories as appropriate. Furthermore, the order of features in the claims does not imply any specific order in which the features must be worked and in particular the order of individual steps in a method claim does not imply that the steps must be performed in this order. Rather, the steps may be performed in any suitable order. In addition, singular references do not exclude a plurality. Thus references to "a", "an", "first", "second" etc. do not preclude a plurality. Reference signs in the claims are provided merely as a clarifying example and shall not be construed as limiting the scope of the claims in any way.

Claims

CLAIMS:

1. An apparatus for determining a distortion measure for an audio encoding of an audio signal, the apparatus comprising: first means (211) for generating a frequency domain sensitivity characteristic for the audio signal in response to a predetermined error estimate and a time domain analysis of the audio signal using a time domain model having a time scale smaller than a duration of an analysis segment of the audio encoding; second means (213) for determining the distortion measure in response to a frequency domain representation of the audio signal, a frequency domain error signal and the frequency domain sensitivity characteristic.

2. The apparatus of claim 1 wherein the first means (211) comprises means for dividing (301) the audio signal into psycho-acoustic bands and is arranged to perform the time domain analysis on the individual psycho-acoustic bands.

3. The apparatus of claim 2 wherein the first means (211) comprises means for determining an error sensitivity weighting (303, 305, 307) for each psycho-acoustic band and means for determining an error sensitivity value (307, 311) for each psycho-acoustic band in response to a predetermined band error estimate and the error sensitivity weighting for each psycho-acoustic band.

4. The apparatus of claim 2 further comprising conversion means (317) for generating an error sensitivity value for each frequency subband of the audio encoding in response to error sensitivity values of each of the psycho-acoustic bands.

5. The apparatus of claim 4 wherein the conversion means (317) is arranged to determine the error sensitivity value for each frequency subband of the audio encoding in response to a characteristic of an audio encoding filter of the audio encoding.

6. The apparatus of claim 1 wherein the time domain model comprises first weighting means (303, 305, 307) for determining a first weight value for the predetermined error estimate in response to a forward masking model.

7. The apparatus of claim 6 wherein the first weighting means (303, 305, 307) is arranged to reduce the first weight value for increasing signal levels of the audio signal.

8. The apparatus of claim 6 wherein the first weighting means (303, 305, 307) is arranged to limit the first weight value.

9. The apparatus of claim 6 wherein the first weighting means (303, 305, 307) is arranged to normalize the first weight value in response to a signal level of the audio signal.

10. The apparatus of claim 1 wherein the time domain model comprises second weighting means (313) for determining a second weight value for the predetermined error estimate in response to a temporal modulation model.

11. The apparatus of claim 10 wherein the second weighting means (313) is arranged to determine the second weight value in response to a temporal signal variance of the audio signal.

12. The apparatus of claim 1 wherein the second means (213) is arranged to adjust the distortion measure in response to a correlation characteristic between the frequency domain error signal and the audio signal.

13. The apparatus of claim 1 wherein the first means is arranged to modify the frequency domain sensitivity characteristic in response to a temporal onset delay of a transient of the audio signal relative to the onset of an encoding segment of the audio encoding.

14. The apparatus of claim 1 wherein the first means (211) is arranged to modify the frequency domain sensitivity characteristic in response to a time envelope of the audio signal within a first frequency interval.

15. An encoder comprising the apparatus (207) of claim 1.

16. The encoder of claim 15 comprising means (203, 205) for generating frequency domain representations of the audio signal and frequency domain error signals for a plurality of encoding options and means for selecting (205) an encoding option from the plurality of encoding options in response to distortion measures determined for the plurality of encoding options by the second means (213).

17. A method of determining a distortion measure for an audio encoding of an audio signal, the method comprising: generating a frequency domain sensitivity characteristic for the audio signal in response to a predetermined error estimate and a time domain analysis of the audio signal using a time domain model having a time scale smaller than a duration of an analysis segment of the audio encoding; determining the distortion measure in response to a frequency domain representation of the audio signal, a frequency domain error signal and the frequency domain sensitivity characteristic.

18. A method of encoding an audio signal comprising: generating a frequency domain sensitivity characteristic for the audio signal in response to a predetermined error estimate and a time domain analysis of the audio signal using a time domain model having a time scale smaller than a duration of an analysis segment of the audio encoding; for a plurality of encoding options determining a distortion measure in response to a frequency domain representation of the audio signal, a frequency domain error signal and the frequency domain sensitivity characteristic; selecting an encoding option from the plurality of encoding options in response to the distortion measures; and encoding the audio signal using the selected encoding option.

19. A computer program product for executing the method of claim 17 or 18.

20. A method of transmitting an audio signal, the method comprising: generating a frequency domain sensitivity characteristic for the audio signal in response to a predetermined error estimate and a time domain analysis of the audio signal using a time domain model having a time scale smaller than a duration of an analysis segment of the audio encoding; for a plurality of encoding options determining a distortion measure in response to a frequency domain representation of the audio signal, a frequency domain error signal and the frequency domain sensitivity characteristic; selecting an encoding option from the plurality of encoding options in response to the distortion measures; encoding the audio signal using the selected encoding option; and transmitting the encoded audio signal.

21. A transmitter for transmitting an audio signal, the transmitter comprising: means (211) for generating a frequency domain sensitivity characteristic for the audio signal in response to a predetermined error estimate and a time domain analysis of the audio signal using a time domain model having a time scale smaller than a duration of an analysis segment of the audio encoding; means (207) for, for a plurality of encoding options, determining the distortion measure in response to a frequency domain representation of the audio signal, a frequency domain error signal and the frequency domain sensitivity characteristic; means (205) for selecting an encoding option from the plurality of encoding options in response to the distortion measures; means (205) for encoding the audio signal using the selected encoding option; and means for transmitting (209) the encoded audio signal.