HK1228151B

HK1228151B - Method and device for compressing and decompressing sound field data of an area

Info

Publication number: HK1228151B
Application number: HK17101632.3A
Authority: HK
Inventors: Johannes Nowak; Christoph SLADECZEK
Original assignee: Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V.; Technische Universität Ilmenau
Priority date: 2013-11-14
Filing date: 2014-11-05
Publication date: 2020-03-20

Description

The present invention relates to audio technology and in particular to the compression of spatial sound field data.

The acoustic description of rooms is of great interest for controlling playback arrangements, such as headphones, speaker arrangements with, for example, two up to a moderate number of speakers, like 10 speakers, or also for speaker arrangements with a large number of speakers, as used in wave field synthesis (WFS).

In general, there are various approaches for spatial audio coding. One approach, for example, involves creating different channels for various speakers at predefined speaker positions, as is the case with MPEG Surround. This way, a listener who is positioned in a specific and ideally central location within the playback room experiences a sense of space for the reproduced sound field.

An alternative way to describe a space is to characterize it by its impulse response. For example, if a sound source is positioned somewhere within a room or area, this room or area can be measured using a circular microphone array in the case of a two-dimensional area, or using a spherical microphone array in the case of a three-dimensional area. For instance, when considering a spherical microphone array with a high number of microphones, such as 350 microphones, the measurement of the room would proceed as follows. An impulse is generated at a specific position inside or outside the microphone array. Subsequently, the response of each microphone to this impulse, i.e., the impulse response, is measured. Depending on how strong the reverberation characteristics are, a longer or shorter impulse response will be recorded. For example, measurements in large churches have shown that impulse responses can last over 10 seconds, depending on the scale.

A set of, for example, 350 impulse responses thus describes the acoustic characteristics of this room for the specific position of a sound source at which the impulse was generated. In other words, this set of impulse responses represents sound field data of the area, specifically for the case when a source is located at the position where the impulse was generated. To further measure the space, that is, to capture the acoustic properties of the room when a source is placed at another location, the described procedure must be repeated for each additional position, for example, outside the array (but also within the array). Therefore, for example, if one were to capture the sound field of a concert hall in which a string quartet is playing, with each musician positioned at four different locations, then 350 impulse responses would be measured for each of the four positions. These 4 x 350 = 1400 impulse responses would then represent the sound field data of the area.

Since the duration of impulse responses can take on quite significant values, and since possibly a more detailed representation of the acoustic properties of the room with respect to not only four but even more positions may be desired, a huge amount of impulse response data arises, especially when considering that impulse responses can indeed reach lengths of over 10 seconds.

Approaches for spatial audio coding include, for example, spatial audio coding (SAC) [1] and spatial audio object coding (SAOC) [2], which enable bit-rate efficient encoding of multi-channel audio signals or object-based spatial audio scenes. The spatial impulse response rendering (SIRR) [3] and the further development directional audio coding (DirAc) [4] are parametric coding methods based on a time-dependent sound arrival direction estimation (direction of arrival - DOA), as well as an estimation of diffuseness within frequency bands. Here, a distinction is made between non-diffuse and diffuse sound fields. In [5], lossless compression of spherical microphone array data and the encoding of higher-order Ambisonics signals are addressed. The compression is achieved by exploiting redundant data between channels (interchannel redundancy).

Studies in [6] show a separate consideration of early and late sound fields in binaural reproduction. For dynamic systems where head movements are considered, the filter length is optimized by convolving only the early sound field in real time. For the late sound field, a single filter for all directions is sufficient without reducing the perceived quality. In [7], head-related transfer functions (HRTFs) are represented on a sphere in the spherical harmonic domain. The influence of different accuracies using various orders of spherical harmonics on the interaural cross-correlation and the spatial-temporal correlation (spatio-temporal correlation) is analyzed. This is done in octave bands in a diffuse sound field. [1] Herre, J et al (2004) Spatial Audio Coding: Next-generation efficient and compatible coding of multi-channel audio AES Convention Paper 6186 presented at the 117th Convention,San Francisco, USA[2] Engdegard, J et al (2008) Spatial Audio Object Coding (SAOC) - The Upcoming MPEG Standard on Parametric Object Based Audio Coding, AES Convention Paper 7377 presented at the 125th Convention, Amsterdam, Netherlands[3] Merimaa J and Pulkki V (2003) Perceptually-based processing of directional room responses for multichannel loudspeaker reproduction, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics[4] Pulkki, V (2007) Spatial Sound Reproduction with Directional Audio Coding, J. Audio Eng. Soc., Vol. 55. No.6[5] Hellerud E et al (2008) Encoding Higher Order Ambisonics with AAC AES Convention Paper 7366 presented at the 125th Convention, Amsterdam,Netherlands [6] Lindau A, Kosanke L, Weinzierl S (2010) Perceptual evaluation of physical predictors of the mixing time in binaural room impulse responses AES Convention Paper presented at the 128th Convention, London, UK [7] Avni, A and Rafaely B (2009) Interaural cross correlation and spatial correlation in a sound field represented by spherical harmonics in Ambisonics Symposium 2009, Graz,

A encoder-decoder scheme for low bit rates is described in [8]. The encoder generates a composite audio information signal that describes the sound field to be reproduced and a direction vector or steering control signal. The spectrum is divided into subbands. For control, the dominant direction is evaluated in each subband. Based on the perceived spatial audio scene, a spatial audio coding framework in the frequency domain is described in [9]. Time-frequency dependent direction vectors describe the input audio scene. [10] describes a parametric, channel-based audio coding method in the time and frequency domains. In [11], a binaural cue coding (BCC) is described, which uses one or more object-based cue codes. These include direction, distance, and envelopment of an auditory scene. [12] refers to the processing of spherical array data for playback using Ambisonics.In this context, system distortions caused by measurement errors, such as noise, are to be compensated. In [13], a channel-based coding method is described that also refers to positions of loudspeakers as well as individual audio objects. In [14], a matrix-based coding method is presented, which enables the real-time transmission of higher order ambisonics sound fields with orders greater than 3. In [15], a method for encoding spatial audio data is described, which is independent of the playback system. The input material is divided into two groups, where the first group contains the audio that requires high localizability, while the second group is described using sufficiently low ambisonics orders for localization.In the first group, the signal is encoded into a set of mono channels with metadata. The metadata include time information indicating when the corresponding channel should be played back and direction information at each moment. During playback, the audio channels are decoded for conventional panning algorithms, which requires the playback system to be known. The audio in the second group is encoded into channels of different Ambisonics orders. During decoding, corresponding Ambisonics orders are used according to the playback system. [8] Dolby R M (1999) Low-bit-rate spatial coding method and system, EP 1677576 A3 [9] Goodwin M and Jot J-M (2007) Spatial audio coding based on universal spatial cues, US 8,379,868 B2 [10] Seefeldt A and Vinton M (2006) Controlling spatial audio coding parameters as a function of auditory events,EP 2296142 A2[11] Faller C (2005) Parametric coding of spatial audio with object-based side information, US 8340306 B2[12] Kordon S, Batke J-M, Krüger A (2011) Method and apparatus for processing signals of a spherical microphone array on a rigid sphere used for generating an ambisonics representation of the sound field, EP 2592845 A1[13] Corteel E and Rosenthal M (2011) Method and device for enhanced sound field reproduction of spatially encoded audio input signals, EP 2609759 A1[14] Abeling S et al (2010) Method and apparatus for generating and decoding sound field data including ambisonics sound field data of an order higher than three,EP 2451196 A1 [15] Arumi P and Sole A (2008) Method and apparatus for three-dimensional acoustic field encoding and optimal reconstruction,

The technical publication "Development and evaluation of a mixed-order Ambisonics playback system," by J. Käsbach, Technical University of Denmark, November 2010, reveals a mixed-order Ambisonics playback system, in which the decomposition according to spherical harmonic components of the three-dimensional sound field is supplemented with additional horizontal components. Taking into account the orthonormality properties of spherical functions, the maximum two-dimensional and three-dimensional orders for a given speaker array are determined. Based on this analysis, an alternative implementation of mixed order is proposed, which requires a truncated order of the inherent Legendre functions.

The technical publication "A New Mixed-Order Scheme for Ambisonic Signals," by Chris Travis, presented at the Ambisonics Symposium 2009, June 25, 2009, Graz, pages 1-6, refers to mixed-order systems that provide higher resolution in the horizontal plane than at the poles. A two-parameter scheme (#H#V) truncates the spherical harmonic expansion in a specific way. This results in resolution versus elevation curves that are flatter in and near the horizontal plane.

WO 2010/012478 A2 discloses a system for generating binaural signals based on a multichannel signal, which represents a plurality of channels intended for reproduction by a speaker configuration that has an assigned virtual sound source position for each channel.

The technical publication "Multichannel Audio Coding Based on Analysis by Synthesis," I. Elfitri et al., Proceedings of the IEEE, New York, Volume 99, No. 4, April 1, 2011, pages 657-670, describes a closed-loop encoder system based on an analysis-by-synthesis principle applied to the MPEG-Surround architecture.

The object of the present invention is to create a more efficient concept for handling, such as compressing or decompressing, sound field data of an area.

This task is solved by a device for compressing sound field data according to claim 1, a device for decompressing sound field data according to claim 13, a method for compressing sound field data according to claim 19, a method for decompressing sound field data according to claim 20, or a computer program according to claim 21.

A device for compressing sound field data of an area includes a divider for dividing the sound field data into a first portion and a second portion, and a subordinate converter for converting the first portion and the second portion into harmonic components, wherein the conversion is performed such that the second portion is converted into one or more harmonic components of a second order, and the first portion is converted into harmonic components of a first order, wherein the first order is higher than the second order, to obtain the compressed sound field data.

This enables the conversion of sound field data, such as the number of impulse responses into harmonic components, and this conversion itself can already lead to a significant data saving. Harmonic components, which can for example be obtained by means of a spatial spectral transformation, describe a sound field much more compactly than impulse responses. Furthermore, the order of the harmonic components is easily controllable. The zeroth-order harmonic component is merely an (omnidirectional) mono signal. It does not yet allow any directional description of the sound field. In contrast, the additional first-order harmonic components already allow a relatively rough directional representation analogous to beamforming. Second-order harmonic components allow an additional, even more precise description of the sound field with more directional information.For example, with Ambisonics, the number of components is equal to 2n+1, where n is the order. For the zeroth order, there is thus only a single harmonic component. For a representation up to the first order, there are already three harmonic components. For example, for a representation with fifth order, there are already 11 harmonic components, and it has been found that an order of 14 is sufficient, for instance, for 350 impulse responses. In other words, this means that 29 harmonic components describe the room just as well as 350 impulse responses. Already this conversion from 350 input channels to 29 output channels provides a compression gain.In addition, different parts of the sound field data, such as impulse responses with various orders, are also converted, since it has been found that not all parts need to be described with the same accuracy/order. An example is that the direction perception of human hearing is mainly derived from early reflections, while late/diffuse reflections in a typical impulse response contribute little or nothing to direction perception. In this example, the first part of the impulse response, which is the early part, will be converted into the harmonic components domain with a higher order, whereas the late diffuse part will be converted with a lower order, and sometimes even with an order of zero.

Another example is that the directional perception of human hearing is frequency-dependent. At low frequencies, the directional perception of human hearing is relatively weak. Therefore, for compression of sound field data, it is sufficient to map the low spectral range of the harmonic components into the harmonic component area with a relatively low order, while the frequency ranges of the sound field data, where the directional perception of human hearing is very high, are mapped with a high, preferably even maximum order. According to the invention, the sound field data are thus decomposed by means of a filter bank into individual subband sound field data, and these subband sound field data are then decomposed with different orders, wherein the first portion again comprises subband sound field data at higher frequencies.While the second portion contains subband acoustic field data at lower frequencies, extremely low frequencies can again be represented with an order of zero, meaning only a single harmonic component. In another example, the advantageous properties of temporal and frequency domain processing are combined. Thus, the early portion, which is already processed with a higher order, can be decomposed into spectral components, for which adapted orders can then be obtained for each individual band. Particularly when a decimating filter bank, such as a QMF filter bank (QMF = Quadrature Mirror Filter Bank), is used for the subband signals, the effort required to convert the subband acoustic field data into the harmonic component domain is further reduced.In addition, the differentiation of various portions of the sound field data with respect to the order to be calculated provides a significant reduction in computational effort, especially since the calculation of harmonic components, such as cylindrical or spherical harmonic components, strongly depends on up to which order these harmonic components are to be computed. For example, calculating harmonic components up to the second order requires significantly less computational effort and thus less computing time and battery power, particularly for mobile devices, compared to calculating harmonic components up to order 14. In the described embodiments, the converter is thus designed to process the portion, that is, the first portion of the sound field data,the one that is more important for the perception of direction in human hearing, to implement with a higher level than the second part, which is less important for the perception of the direction of a sound source than the first part.

The present invention can not only be used for a temporal decomposition of sound field data into components or for a spectral decomposition of sound field data into components, but also for an alternative decomposition, for example, spatial decomposition of the components, when considering that the direction perception of human hearing differs for sounds at different azimuth or elevation angles. For example, if the sound field data are provided as impulse responses or other sound field descriptions, each of which is assigned a specific azimuth/elevation angle, then the sound field data from azimuth/elevation angles where the direction perception of human hearing is stronger can be compressed with a higher order than a spatial component of the sound field data from another direction.

Alternatively or additionally, individual harmonics can be "thinned out," meaning, for example, with order 14, where there are 29 modes. Depending on human direction perception, certain modes that represent sound fields for unimportant sound incidence directions are omitted. In the case of microphone array measurements, this introduces some uncertainty because it is unknown which direction the head is oriented relative to the array sphere. However, when HRTFs are represented using spherical harmonics, this uncertainty is resolved.

Further decompositions of the sound field data, in addition to decompositions in temporal, spectral, or spatial directions, can also be used, such as a decomposition of the sound field data into a first and a second component in volume classes, etc.

In the case of implementation examples, acoustic problems are described using cylindrical or spherical coordinate systems, that is, by means of complete sets of orthonormal eigenfunctions, so-called cylindrical or spherical harmonic components. With higher spatial accuracy in describing the sound field, the data volume and processing time for data processing or manipulation increase. For high-quality audio applications, high accuracy is required, which leads to the problems of long computation times, particularly disadvantageous for real-time systems, large data volumes, which complicate the transmission of spatial sound field data, and high energy consumption due to intensive computing, especially on mobile devices.

All these disadvantages are alleviated or eliminated by embodiments of the invention, because due to the differentiation of the orders for calculating the harmonic components, the calculation times are reduced, compared to a case where all components are converted into harmonic components with the highest order. According to the invention, the large data volumes are reduced in that the representation by harmonic components is particularly more compact, and additionally different components are represented with different orders. The data volume reduction is achieved in that a low order, such as the first order, has only three harmonic components, whereas the highest order, for example, has 29 harmonic components, as illustrated by an example of an order of 14.

The reduced computing power and lower memory usage automatically reduce energy consumption, which is particularly significant when using acoustic field data on mobile devices.

In the case of implementation examples, the spatial sound field description is optimized in the cylindrical or spherical harmonic domain based on human spatial perception. In particular, a combination of time- and frequency-dependent calculation of the order of spherical harmonics depending on the spatial perception of human hearing leads to a significant reduction in effort without reducing the subjective quality of sound field perception. Naturally, the objective quality is reduced, as the present invention represents a lossy compression. However, this lossy compression is not critical, since the final receiver is the human ear, and because it is therefore irrelevant for transparent reproduction whether sound field components that are not perceived by the human ear are present or not in the reproduced sound field.

In other words, when reproducing/auralizing, the human hearing is the most important quality measure, whether using binaural methods, i.e., with headphones, or with speaker systems with few (e.g., stereo) or many speakers (e.g., WFS). According to the invention, the accuracy of harmonic components, such as cylindrical or spherical harmonics, is reduced in the time domain and/or frequency domain and/or in further domains, adapted to the human ear. This results in data and computation time reduction.

Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings. They show: Fig. 1a a block diagram of a device for compressing sound field data according to an embodiment; Fig. 1b a block diagram of a device for decompressing compressed sound field data of an area; Fig. 1c a block diagram of a device for compression with temporal decomposition; Fig. 1d a block diagram of an embodiment of a device for decompression in the case of temporal decomposition; Fig. 1e an alternative device for decompression as compared to Fig. 1d; Fig. 1f an example of the application of the invention with temporal and spectral decomposition, using 350 measured impulse responses as sound field data as an example; Fig. 2a a block diagram of a device for compression with spectral decomposition; Fig. 2b an example of a sub-sampled filter bank and subsequent translation of the sub-sampled subband sound field data; Fig. 2c a device for decompression for the example of spectral decomposition shown in Fig. 2a; Fig. 2d an alternative implementation of the decompressor for spectral decomposition; Fig. 3a a block diagram overview with a special analysis/synthesis coder according to another embodiment of the present invention; Fig. 3b a more detailed representation of an embodiment with temporal and spectral decomposition; Fig. 4 a schematic representation of an impulse response; Fig. 5 a block diagram of a converter from the time or spectral domain into the harmonic component domain with variable order; and Fig. 6 a representation of an example converter from the harmonic component domain into the time domain or spectral domain with subsequent auralization.

Fig. 1a shows a block diagram of a device or a method for compressing sound field data of an area, which are input at an input 10 into a splitter 100. The splitter 100 is designed to split the sound field data into a first portion 101 and a second portion 102. In addition, a converter is provided, which has two functionalities denoted by 140 or 180. In particular, the converter is designed to convert the first portion 101 as shown at 140, and to convert the second portion 102 as shown at 180. In particular, the converter converts the first portion 101 into one or more harmonic components 141 of a first order,While the converter 180 converts the second portion 102 into one or more harmonic components 182 of the second order. In particular, the first order, which is the underlying order of the harmonic components 141, is higher than the second order. In other words, this means that the converter 140 operating with a higher order emits more harmonic components 141 than the converter 180 operating with a lower order. The order n1, by which the converter 140 is driven, is therefore greater than the order n2, by which the converter 180 is driven. The converters 140, 180 can be controllable converters. Alternatively, the order may be fixed and thus pre-programmed.so that the inputs labeled n1 and n2 are not present in this embodiment.

Fig. 1b shows a device for decompressing compressed sound field data 20, which have first harmonic components of a first order and one or more second harmonic components of a second order, as for example output by Fig. 1a at 141, 182. However, the decompressed sound field data do not necessarily have to be in the "raw format" of the harmonic components 141, 142. Instead, a lossless entropy encoder, such as a Huffman encoder or an arithmetic encoder, could be provided in Fig. 1a, to further reduce the number of bits required to represent the harmonic components ultimately. Then, the data stream 20 fed into an input interface 200 would consist of entropy-coded harmonic components and optionally side information, as illustrated in Fig. 3a. In this case, an individual entropy decoder would be provided at the output of the input interface 200, which is adapted to the entropy encoder on the encoder side, i.e., with respect to Fig. 1a. Thus, the first harmonic components of the first order 201 and the second harmonic components of the second order 202, as shown in Fig. 1b, may still be entropy-coded, or already entropy-decoded, or actually the harmonic components in "raw form," as they are present at 141, 182 in Fig. 1a.

Both groups of harmonic components are fed into a decoder or converter/combiner 240. The block 240 is designed to decompress the compressed sound field data 201, 202 by using a combination of the first part and the second part and by performing a conversion from a harmonic component representation to a time-domain representation, thereby finally obtaining the decompressed representation of the sound field as shown at 240. The decoder 240, which can for example be designed as a signal processor, is thus configured to perform a conversion from the spherical harmonic component domain to the time domain, and to perform a combination. However, the order between the conversion and the combination may vary, as illustrated in FIGS. 1d, 1e or 2c, 2d for different examples.

Fig. 1c shows a device for compressing sound field data of an area according to an embodiment, where the splitter 100 is designed as a temporal splitter 100a. In particular, the temporal splitter 100a, which is an implementation of the splitter 100 of Fig. 1a, is configured to divide the sound field data into a first portion including first reflections in the area and a second portion including second reflections in the area, wherein the second reflections occur later in time than the first reflections. Referring to Fig. 4, the first portion 101, output by block 100a, thus corresponds to the impulse response section 310 of Fig.4, while the second late part represents section 320 of the impulse response from Fig. 4. The point in time of the division can, for example, be at 100 ms. However, there are also other possibilities for temporal division, such as earlier or later. Preferably, the division is made at the point where discrete reflections transition into diffuse reflections. This can be a different time point depending on the room, and there are concepts to create an optimal division here. On the other hand, the division into an early and a late part can also depend on the available data rate, meaning that the division time is always made smaller.The lower the bitrate. This is favorable with regard to the bitrate, because then the largest possible portion of the impulse response is converted into the harmonic components area with a low order.

The converter represented by blocks 140 and 180 in Fig. 1c is thus designed to convert the first portion 101 and the second portion 102 into harmonic components, wherein the converter particularly converts the second portion into one or more harmonic components 182 of a second order and converts the first portion 101 into harmonic components 141 of a first order, wherein the first order is higher than the second order, in order to finally obtain the compressed sound field, which can be output via an output interface 190 for purposes of transmission and/or storage.

Fig. 1d shows an implementation of the decompressor for the example of temporal division. In particular, the decompressor is trained to process the compressed sound field data using a combination of the first portion 201 with the first reflections and the second portion 202 with the late reflections, as well as converting from the harmonic component domain to the time domain. Fig. 1d shows an implementation where the combination takes place after the conversion. Fig. 1e shows an alternative implementation where the combination occurs before the conversion. In particular, the converter 241 is trained to convert high-order harmonic components into the time domain, while the converter 242 is trained to...to implement the low-order harmonic components into the time domain. With respect to Fig. 4, the output of the converter 241 thus provides something corresponding to area 210, while the converter 242 provides something corresponding to area 320; however, due to the lossy compression, the sections at the outputs of the bridges 241, 242 are not identical to the sections 310, 320. Nevertheless, there will be at least a perceptual similarity or identity between the section at the output of block 241 and section 310 of Fig. 4, whereas the section at the output of block 242, which corresponds to the later part 320 of the impulse response,There will be clear differences and thus only approximately represent the course of the impulse response. However, these deviations are not critical for human direction perception, because human direction perception is based scarcely or not at all on the late part or the diffuse reflections of the impulse response.

Fig. 1e shows an alternative implementation where the decoder first includes the combiner 245 and subsequently the converter 244. In this embodiment shown in Fig. 1e, the individual harmonic components are added together, and then the result of this addition is converted to finally obtain a time-domain representation. In contrast, in the implementation shown in Fig. 1d, the combination does not consist of an addition, but rather of a serialization, meaning that the output of block 241 will be arranged earlier in a decompressed impulse response than the output of block 242, thereby again obtaining an impulse response corresponding to Fig. 4, which can then be used for further purposes, such as auralization, i.e., processing audio signals to achieve the desired spatial impression.

Fig. 2a shows an alternative implementation of the present invention, in which a frequency-domain division is performed. In particular, the divider 100 of Fig. 1a is implemented as a filterbank in the embodiment of Fig. 2a, to filter at least a portion of the sound field data and to obtain sound field data in different filterbank channels 101, 102. In one embodiment where the time-domain division of Fig. 1a is not implemented, the filterbank receives both the early and late components. However, in an alternative embodiment, only the early component of the sound field data is fed into the filterbank, while the late component is not further spectrally decomposed.

The analyzer filter bank 100b is followed by a mixer, which can be formed from partial mixers 140a, 140b, 140c. The mixer 140a, 140b, 140c is designed to convert the sound field data in different filter bank channels using different orders for different filter bank channels, in order to obtain one or more harmonic components for each filter bank channel. In particular, the mixer is configured to perform a conversion with a first order for a first filter bank channel having a first center frequency, and to perform a conversion with a second order for a second filter bank channel having a second center frequency, wherein the first order is higher than the second order, and wherein the first center frequency,that is, fn, higher than the second center frequency f1, in order to finally obtain the compressed sound field representation. Generally, depending on the embodiment, a lower order can be used for the lowest frequency band compared to a middle frequency band. However, depending on the implementation, the highest frequency band, such as the filter bank channel with the center frequency fn shown in Fig. 2a, does not necessarily have to be implemented with a higher order than, for example, a middle channel. Instead, the highest order can be used in areas where direction perception is highest, such as in other areas, which may also include, for example, a certain high-frequency range.The lower order is because in these areas, the directional perception of human hearing is also lower.

Fig. 2b shows a more detailed implementation of the analysis filter bank 100b. In the embodiment shown in Fig. 2b, this includes a bandpass filter and further decimators 100c for each filter bank channel. For example, if a filter bank consisting of a bandpass filter and decimators with 64 channels is used, each decimator can decimate by a factor of 1/64, such that the total number of digital samples at the output of the decimators across all channels corresponds to the number of samples of a block of time-domain sound field data that has been decomposed by the filter bank. An exemplary filter bank may be a real or complex QMF (Quadrature Mirror Filter) filter bank. Each subband signal, preferably the early parts of the impulse responses, is then converted into harmonic components using the frequency shifters 140a to 140c, analogous to Fig. 2a, in order to finally obtain a description with cylindrical or preferably spherical harmonic components for various subband signals, which have different orders for different subband signals, i.e., a different number of harmonic components.

Fig. 2c and Fig. 2d show different implementations of the decompressor as shown in Fig. 1b, i.e., a different order of combination followed by downconversion in Fig. 2c, or first downconversion followed by combination as illustrated in Fig. 2d. In particular, the decompressor 240 of Fig. 1b, in the embodiment shown in Fig. 2c, again includes a combiner 245 that performs an addition of the different harmonic components from the various subbands to obtain an overall representation of the harmonic components, which is then converted into the time domain by the downconverter 244. Thus, the input signals to the combiner 245 are in the harmonic-component spectral domain, while the output of the combiner 345 represents a harmonic-component domain, from which a conversion into the time domain is obtained through the downconverter 244.

In the alternative embodiment shown in Fig. 2d, the individual harmonic components of each subband are first converted into the spectral domain by different converters 241a, 241b, 241c, so that the output signals of blocks 241a, 241b, 241c correspond to the output signals of blocks 140a, 140b, 140c of Fig. 2a or Fig. 2b. Then, these subband signals are processed in a subsequent synthesis filter bank, which in the case of downsampling on the encoder side (block 100c of Fig. 2b) can also have a high-rate function, i.e., an upsampling function. The synthesis filter bank then represents the combiner function of the decoder 240 of Fig. 1b. Thus, at the output of the synthesis filter bank, there is the decompressed sound field representation, which can be used for auralization, as will be shown.

Fig. 1f shows an example of the decomposition of impulse responses into harmonic components of different orders. The later sections are not spectrally decomposed but are processed overall with the zeroth order. The early sections of the impulse responses are spectrally decomposed. For example, the lowest band is processed with the first order, while the next band is already processed with the fifth order, and the last band, because it is most important for direction/space perception, is processed with the highest order, which in this example is order 14.

Fig. 3a shows the entire encoder/decoder scheme or the entire compressor/decompressor scheme of the present invention.

In particular, the compressor shown in the embodiment of FIG. 3a includes not only the functionalities of FIG. 1a designated as 1 or PENC, but also a decoder PDEC2, which can be designed as shown in FIG. 1b. Furthermore, the compressor also includes a control unit CTRL4, which is configured to compare decompressed sound field data received from decoder 2 with original sound field data, taking into account a psychoacoustic model, such as the PEAQ model standardized by the ITU.

As a result, the controller 4 generates optimized parameters for the division, such as the temporal division, the frequency-based division in the filter bank, or optimized parameters for the orders in the individual converters for the different parts of the sound field data, if these converters are designed to be controllable.

Control parameters, such as partitioning information, filter bank parameters or orders can then be transmitted together with a bitstream that contains the harmonic components to a decoder or decompressor, which is represented by 2 in Fig. 3a. The compressor 11 thus consists of a control block CTRL4 for codec control, a parameter encoder PENC1 and a parameter decoder PDEC2. The inputs 10 are data from microphone array measurements. The control block 4 initializes the encoder 1 and provides all parameters for encoding the array data. In the PENC block 1, the data are processed according to the described methodology of hearing-dependent partitioning in the time and frequency domains and prepared for data transmission.

Fig. 3b shows the schematic of data encoding and decoding. The input data 10 are first split by the splitter 100a into an early sound field 101 and a late sound field 102. The early sound field 101 is then decomposed by means of an n-band filter bank 100b into its spectral components f1 ... fn, which are each decomposed according to a spherical harmonic order adapted to human hearing (x-order SHD - SHD = Spherical Harmonics Decomposition). This decomposition into spherical harmonics represents a preferred embodiment, although any sound field decomposition generating harmonic components can also be used. Since the decomposition into spherical harmonic components requires different calculation times in each band depending on the order, it is preferred to correct the time delays using a delay line with delay blocks 306, 304. Thus, the frequency range in the reconstruction block 245, also called a combiner, is reconstructed and combined again with the late sound field in the further combiner 243 after it has been processed with a hearing-adapted low order.

The control block CTRL4 of Fig. 3a includes a spatial acoustic analysis module and a psychoacoustic module. This control block analyzes both the input data 10 and the output data of the decoder 2 from Fig. 3a, to adaptively adjust the coding parameters, which are also referred to as side information 300 in Fig. 3a, or which are directly provided by the compressor 11 to the encoder PENC1. From the input signals 10, spatial acoustic parameters are extracted, which together with the parameters of the used array configuration define the initial parameters of the encoding. These include both the timing of the separation between early and late sound field, also known as "mixing time" or "Mischzeit," as well as the parameters for the filter bank, such as corresponding orders of spherical harmonics. The output, which can, for example, be in the form of binaural impulse responses, as generated by the combiner 243, is fed into a psychoacoustic module with an auditory model that evaluates the quality and adjusts the coding parameters accordingly. Alternatively, the concept can also work with static parameters. In this case, the control module CTRL4 and the PEDC module 2 on the encoder or compressor side 11 are no longer required.

The invention is advantageous in that data and computational effort during the processing and transmission of circular and spherical array data can be reduced depending on human hearing. Furthermore, it is advantageous that the processed data can be integrated into existing compression methods, thus allowing additional data reduction. This is particularly beneficial in bandwidth-limited transmission systems, such as for mobile devices. Another advantage is the possible real-time processing of data in the spherical harmonic domain even at high orders. The present invention can be applied in many areas, especially in fields where the acoustic sound field is represented by cylindrical or spherical harmonics. This occurs, for example, in sound field analysis using circular or spherical arrays. If the analyzed sound field is to be auralized, the concept of the present invention can be used. In devices for simulating rooms, databases are used to store existing rooms. Here, the inventive concept enables space-saving and high-quality storage. There are playback methods based on spherical surface functions, such as Higher Order Ambisonics or binaural synthesis. In these cases, the present invention provides a reduction in computation time and data requirements. This can be particularly advantageous with regard to data transmission, for example in teleconferencing systems.

Fig. 5 shows an implementation of a converter 140 or 180 with adjustable order or with at least different orders, which can also be fixedly set.

The mapper comprises a time-frequency transformation block 502 and a subsequent space transformation block 504. The space transformation block 504 is designed to operate according to the calculation rule 508. In the calculation rule, n denotes the order. The calculation rule 508 is solved only once when the order is zero, or it is solved more frequently when the order, for example, goes up to order 5 or, as described above, up to order 14. In particular, the time-frequency transformation element 502 is designed to transform the impulse responses on the input lines 101, 102 into the frequency domain, preferably using the fast Fourier transform. Furthermore, only the single-sided spectrum is then forwarded to reduce computational effort. Then, a spatial Fourier transformation is performed in the space transformation block 504, as described in the textbook "Fourier Acoustics, Sound Radiation and Nearfield Acoustical Holography," published by Academic Press in 1999, by Earl G. Williams. Preferably, the space transformation 504 is optimized for sound field analysis and simultaneously provides high numerical accuracy and fast computation speed.

Figure 6 shows the preferred implementation of a converter from the harmonic component domain to the time domain, where as an alternative, a processor for decomposition into plane waves and beamforming 602 is illustrated, alternatively to an inverse space transformation implementation 604. The output signals of both blocks 602, 604 can be alternatively fed into a block 606 for generating impulse responses. The inverse space transformation 604 is designed to reverse the forward transformation in block 504. Alternatively, the decomposition into plane waves and beamforming in block 606 allows a large number of decomposition directions to be processed uniformly, which is advantageous for fast processing, especially for visualization or auralization.Preferably, block 602 receives radial filter coefficients as well as, depending on the implementation, additional beamforming or steering coefficients. These can either have a constant directivity or be frequency-dependent. Alternative input signals into block 602 can be modal radial filters, and in particular for spherical arrays or different configurations, such as an open sphere with omnidirectional microphones, an open sphere with cardioid microphones, and a rigid sphere with omnidirectional microphones. The impulse response generation block 606 generates impulse responses or time domain signals from data either from block 602 or from block 604. In particular, this block recombines the previously omitted negative parts of the spectrum, performs a fast inverse Fourier transform, and allows for resampling orResampling to the original sampling rate, if the input signal was downsampled at a certain point. Furthermore, a window option can be used.

Details about the functionality of blocks 502, 504, 602, 604, and 606 are described in the technical publication "SofiA Sound Field Analysis Toolbox" by Bernschütz et al., ICSA - International Conference on Spatial Audio, Detmold, November 10 to 13, 2011, wherein this technical publication is incorporated herein by reference in its entirety.

Block 606 can further be configured to output the complete set of decompressed impulse responses, for example, the lossy impulse responses, where block 608 would then output, for example, 350 impulse responses. However, depending on the auralization, it is preferred to output only the impulse responses finally needed for playback, which can be achieved by block 608, which provides a selection or interpolation for a specific playback scenario. For example, if stereo playback is intended, as shown in block 616, from the 350 retrieved impulse responses, the impulse response corresponding to the spatial direction of each respective stereo speaker is selected, depending on the placement of the two stereo speakers. This impulse response is then used to set a pre-filter of the corresponding speaker such that the pre-filter has a filter characteristic corresponding to this impulse response. The audio signal to be played back is then routed through the respective pre-filters to the two speakers and played back, thereby ultimately creating the desired spatial impression for a stereo auralization.

If among the available impulse responses there is no impulse response in a specific direction where a speaker is actually placed, then preferably the two or three nearest neighboring impulse responses are used, and an interpolation is performed.

In an alternative embodiment, where playback or auralization is performed by wave field synthesis 612, it is preferred to reproduce early and late reflections via virtual sources, as detailed in the Ph.D. thesis "Spatial Sound Design based on Measured Room Impulse Responses" by Frank Melchior from TU Delft in 2011, which is also incorporated herein by reference in its entirety.

In particular, the reflections of a source during wave field synthesis playback 612 are reproduced by four impulse responses at specific positions for early reflections and eight impulse responses at specific positions for late reflections. The selection block 608 then selects the 12 impulse responses for the 12 virtual positions. These impulse responses, together with their corresponding positions, are then fed into a wave field synthesis renderer, which can be arranged in block 612, and the wave field synthesis renderer calculates the speaker signals for the actually existing speakers using these impulse responses, so that they can then reproduce the corresponding virtual sources. Thus, for each speaker in the wave field synthesis playback system, an individual pre-filter is calculated, and finally, the audio signal to be output is filtered through this pre-filter before being emitted by the speaker, in order to achieve a corresponding playback with high-quality spatial effects.

An alternative implementation of the present invention consists in generating a headphone signal, that is, a binaural application, where the spatial impression of the area is to be created through headphone playback.

Although the preceding section mainly presented impulse responses as acoustic field data, any other acoustic field data, for example acoustic field data in terms of magnitude and vector, such as sound pressure and particle velocity at specific positions in space, can also be used. These acoustic field data can also be divided into more important and less important components with regard to human direction perception and converted into harmonic components. The acoustic field data may also include any type of impulse responses, such as, for example, Head-Related Transfer Function (HRTF) functions or Binaural Room Impulse Response (BRIR) functions, or impulse responses from a single discrete point to a predetermined position within the area.

It is preferable to sample a room using a spherical array. Then, the sound field is available as a set of impulse responses. In the time domain, the sound field is divided into early and late components. Subsequently, both parts are decomposed into their spherical or cylindrical harmonic components. Since the relative direction information is present in the early sound field, a higher order of spherical harmonics is calculated here compared to the late sound field, which is sufficient with a low order. The early part is relatively short, for example 100 ms, and is represented accurately, i.e., with many harmonic components, whereas the late part, for example 100 ms to 2 s or 10 s long, is represented with fewer or only a single harmonic component.

Another data reduction is achieved by splitting the early sound field into individual frequency bands before representing it as spherical harmonics. To this end, after separating in the time domain into early and late sound fields, the early sound field is decomposed into its spectral components using a filter bank. By sub-sampling the individual frequency bands, a data reduction is achieved which significantly accelerates the computation of the harmonic components. Additionally, for each frequency band, a perceptually sufficient order is used depending on human direction perception. Thus, for low-frequency bands, where human direction perception is low, low orders or even order zero for the lowest frequency band may be sufficient, while higher orders up to the maximum meaningful order with respect to the accuracy of the measured sound field are required for high bands. On the decoder or decompressor side, the complete spectrum is reconstructed. Subsequently, the early or late sound field is combined again. The data is now ready for auralization.

Although some aspects have been described in connection with a device, it is understood that these aspects also constitute a description of the corresponding method, so that a block or a component of a device can also be understood as a corresponding method step or as a feature of a method step. Similarly, aspects described in connection with one or more method steps also constitute a description of a corresponding block, details, or feature of a corresponding device. Some or all of the method steps can be performed by a hardware apparatus (or using a hardware apparatus), such as a microprocessor, a programmable computer, or an electronic circuit. In some embodiments, some or several of the main method steps can be performed by such an apparatus.

Depending on specific implementation requirements, embodiments of the invention can be implemented in hardware or software. The implementation can be carried out using a digital storage medium, for example, a floppy disk, a DVD, a Blu-ray disc, a CD, a ROM, a PROM, an EPROM, an EEPROM, or a flash memory, a hard drive, or another magnetic or optical storage device, on which electronically readable control signals are stored that can interact with a programmable computer system in such a way that the respective method is performed. Therefore, the digital storage medium can be computer-readable.

Some embodiments according to the invention therefore include a storage medium having electronically readable control signals that are capable of cooperating with a programmable computer system in such a way that one of the methods described herein is carried out.

Generally, embodiments of the present invention can be implemented as a computer program product with a program code, wherein the program code is effective to perform one of the methods when the computer program product runs on a computer.

The program code can, for example, also be stored on a machine-readable medium.

Other embodiments include a computer program for performing one of the methods described herein, wherein the computer program is stored on a machine-readable medium.

In other words, an embodiment of the inventive method is thus a computer program that comprises a program code for performing one of the methods described herein, when the computer program is executed on a computer.

Another embodiment of the inventive method is thus a data carrier (or a digital storage medium or a computer-readable medium) on which the computer program for performing one of the methods described herein is recorded.

Another embodiment of the inventive method is thus a data stream or a sequence of signals that represents or represent the computer program for performing one of the methods described herein. The data stream or sequence of signals can be configured, for example, to be transferred via a data communication connection, such as via the Internet.

Another embodiment includes a processing device, such as a computer or a programmable logic component, which is configured or adapted to perform one of the methods described herein.

Another embodiment includes a computer on which the computer program for performing one of the methods described herein is installed.

Another embodiment according to the invention includes a device or system configured to transmit a computer program for performing at least one of the methods described herein to a receiver. The transmission can, for example, be electronic or optical. The receiver can, for example, be a computer, a mobile device, a storage device, or a similar device. The device or system can, for example, include a file server for transmitting the computer program to the receiver.

In some embodiments, a programmable logic device (for example, a field-programmable gate array, an FPGA) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field-programmable gate array may work in conjunction with a microprocessor to perform one of the methods described herein. Generally, the methods are performed by any hardware device in some embodiments. This may be general-purpose hardware such as a computer processor (CPU), or specific hardware for the method, such as an ASIC.

The embodiments described above merely illustrate the principles of the present invention. It is understood that modifications and variations of the arrangements and details described herein will be apparent to those skilled in the art. Therefore, it is intended that the invention be limited solely by the scope of the following patent claims and not by the specific details presented herein based on the description and explanation of the embodiments.

Claims

Apparatus for compressing sound field data (10) of an area, comprising:
a divider (100) for dividing the sound field data into a first portion (101) and into a second portion (102); and

a converter (140, 180) for converting the first portion (101) and the second portion (102) into harmonic components (141, 182) of a sound field description, wherein the converter (140, 180) is configured to convert the second portion (102) into one or several harmonic components (141) of a second order, and to convert the first portion (101) into harmonic components of a first order, wherein the first order is higher than the second order, to obtain the compressed sound field data,

wherein the divider (100) is configured to perform spectral division and comprises a filterbank (100b) for filtering at least part of the sound field data (10) for obtaining sound field data in different filterbank channels (140a, 140b, 140c), and

wherein the converter is configured to compute, for a subband signal from a first filterbank channel (140c), which represents the first portion (101), of the different filterbank channels (140a, 140b, 140c), the harmonic components of the first order, and to compute, for a subband signal from a second filterbank channel (140a), which represents the second portion (102), of the different filterbank channels (140a, 140b, 140c), the harmonic components of the second order, wherein a center frequency (f_n) of the first filterbank channel (140a) is higher than a center frequency (f₁) of the second filterbank channel (140c).
Apparatus according to claim 1, wherein the converter (140, 180) is configured to compute the harmonic components of the first order, which is higher than the second order, for the first portion, which is more important for directional perception of the human hearing than the second portion.
Apparatus according to claims 1 or 2, wherein the divider (100) is configured to divide the sound field data (10) into the first portion including first reflections in the area and into the second portion including second reflections in the area, wherein the second reflections occur later in time than the first reflections.
Apparatus according to one of the previous claims, wherein the divider (100) is configured to divide the sound field data into the first portion including first reflections in the area and into the second portion including second reflections in the area, wherein the second reflections occur later in time than the first reflections, and wherein the divider (100) is further configured to decompose the first portion into spectral portions (101, 102) and to convert the spectral portions each into one or several harmonic components of different orders, wherein an order for a spectral portion with a higher frequency band is higher than an order for a spectral portion in a lower frequency band.
Apparatus according to one of the previous claims, further comprising an output interface (190) for providing the one or several harmonic components (182) of the second order and the harmonic components of the first order (141) together with side information (300) comprising an indication on the first order or the second order for transmission and storage.
Apparatus according to one of the previous claims, wherein the sound field data describe a three-dimensional area and the converter is configured to compute cylindrical harmonic components as the harmonic components, or wherein the sound field data (10) describe a three-dimensional area and the converter (140, 180) is configured to compute spherical harmonic components as the harmonic components.
Apparatus according to one of the previous claims, wherein the sound field data exist as a first number of discrete signals, wherein the converter (140, 180) for the first portion (101) and the second portion (102) provides a second total number of harmonic components, and wherein the second total number of harmonic components is smaller than the first number of discrete signals.
Apparatus according to one of the previous claims, wherein the divider (100) is configured to use, as sound field data (10), a plurality of different impulse responses that are allocated to different positions in the area.
Apparatus according to claim 8, wherein the impulse responses are head-related transfer functions (HRTF) or binaural room impulse responses (BRIR) functions or impulse responses of a respective discrete point in the area to a predetermined position in the area.
Apparatus according to one of the previous claims, further comprising:
a decoder (2) for decompressing the compressed sound field data by using a combination of the first and second portions and by using a conversion from a harmonic component representation into a time domain representation for obtaining a decompressed representation; and

a control (4) for controlling the divider (100) or the converter (140, 180) with respect to the first or second order, wherein the control (4) is configured to compare, by using a psychoacoustic module, the decompressed sound field data with the sound field data (10) and to control the divider (100) or the converter (140, 180) by using the comparison.
Apparatus according to claim 10, wherein the decoder is configured to convert the harmonic components of the second order and the harmonic components of the first order (241, 242) and to then perform a combination of the converted harmonic components, or wherein the decoder (2) is configured to combine the harmonic components of the second order and the harmonic components of the first order (245) and to convert a result of the combination in the combiner (245) from a harmonic component domain into the time domain (244).
Apparatus according to claim 10, wherein the decoder is configured to convert harmonic components of different spectral portions with different orders (140a, 140b), to compensate different processing times for different spectral portions (304, 306), and to combine spectral portions of the first portion converted into a time domain with the spectral components of the second portion converted into the time domain by serially arranging the same.
Apparatus for decompressing compressed sound field data comprising first harmonic components (141) of a sound field description up to a first order and one or several second harmonic components (182) of a sound filed description up to a second order, wherein the first order is higher than the second order, comprising:
an input interface (200) for obtaining the compressed sound field data; and

a processor (240) for processing the first harmonic components (201) and the second harmonic components (202) by using a combination of the first and the second portion and by using a conversion of a harmonic component representation into a time domain representation to obtain a decompressed illustration, wherein the first portion is represented by the first harmonic components and the second portion by the second harmonic components,

wherein the first harmonic components (HK_n) of the first order represent a first spectral domain, and the one or the several harmonic components (HK₁) of the second order represent a different spectral domain,

wherein the processor (240) is configured to convert the harmonic components (HK_n) of the first order into the spectral domain (241a) and to convert the one or the several second harmonic components (HK₁) of the second order into the spectral domain (241c), and to combine the converted harmonic components by means of a synthesis filterbank (245) to obtain a representation of sound field data in the time domain.
Apparatus according to claim 13, wherein the processor (240) comprises:
a combiner (245) for combining the first harmonic components and the second harmonic components to obtain combined harmonic components; and

a converter (244) for converting the combined harmonic components into the time domain.
Apparatus according to claim 13, wherein the processor comprises:
a converter (241, 242) for converting the first harmonic components and the second harmonic components into the time domain; and

a combiner (243, 245) for combining the harmonic components converted into the time domain for obtaining the decompressed sound field data.
Apparatus according to one of claims 13 to 15, wherein the processor (240) is configured to obtain information on a reproduction arrangement (610, 612, 614), and wherein the processor (240) is configured to compute the decompressed sound field data (602, 604, 606) and to select, based on the information on the reproduction arrangement, part of the sound field data of the decompressed sound field data for reproduction purposes (608), or wherein the processor is configured to compute only a part of the decompressed sound field data necessitated for the reproduction arrangement.
Apparatus according to one of claims 13 to 16, wherein the first harmonic components of the first order represent early reflections of the area and the second harmonic components of the second order represent late reflections of the area, and wherein the processor (240) is configured to add the first harmonic components and the second harmonic components and to convert a result of the addition into the time domain for obtaining the decompressed sound field data.
Apparatus according to one of claims 13 to 17, wherein the processor is configured to perform, for the conversion, an inverse room transformation (604) and an inverse Fourier transformation (606).
Method for compressing sound field data (10) of an area, comprising the steps of:
dividing (100) the sound field data into a first portion (101) and into a second portion (102), and

converting (140, 180) the first portion (101) and the second portion (102) into harmonic components (141, 182) of a sound field description, wherein the second portion (102) is converted into one or several harmonic components (141) of a second order, and wherein the first portion (101) is converted into harmonic components of a first order, wherein the first order is higher than the second order, to obtain the compressed sound field data,

wherein dividing (100) comprises spectral division by filtering with a filterbank (100b) for filtering at least part of the sound field data (10) for obtaining sound field data in different filterbank channels (140a, 140b, 140c), and

wherein converting represents a computation of the harmonic components of the first order for a subband signal from a first filterbank channel (140c), which represents the first portion (101), of the different filterbank channels (140a, 140b, 140c), and a computation of the harmonic components of the second order for a subband signal from a second filterbank channel (140a), which represents the second portion (102), of the different filterbank channels (140a, 140b, 140c), wherein a center frequency (f_n) of the first filterbank channel (140a) is higher than a center frequency (f₁) of the second filterbank channel (140c).
Method for decompressing compressed sound field data comprising first harmonic components (141) of a sound field description up to a first order and one or several second harmonic components (182) of a sound field description up to a second order, wherein the first order is higher than the second order, comprising steps of:
obtaining (200) the compressed sound field data; and

processing (240) the first harmonic components (201) and the second harmonic components (202) by using a combination of the first and second portions and by using a conversion from a harmonic component representation into a time domain representation to obtain a decompressed representation, wherein the first portion is represented by the first harmonic components and the second portion by the second harmonic components,

wherein the first harmonic components (HK_n) of the first order represent a first spectral domain, and the one or the several harmonic components (HK₁) of the second order represent a different spectral domain,

wherein processing (240) comprises converting the first harmonic components (HK_n) of the first order into the spectral domain and converting the one or the several second harmonic components (HK₁) of the second order into the spectral domain and combining the converted harmonic components by means of a synthesis filterbank (245) to obtain a representation of sound field data in the time domain.
Computer program for performing a method according to one of claims 19 to 20 when the method runs on a computer.