HK40045794B

HK40045794B - Method and apparatus for determining for the compression of an hoa data frame representation a lowest integer number of bits required for representing non-differential gain values

Info

Publication number: HK40045794B
Application number: HK42021035032.8A
Authority: HK
Inventors: Sven Kordon; Alexander Krüger
Original assignee: Dolby International AB
Priority date: 2014-06-27
Filing date: 2021-07-15
Publication date: 2023-03-10

Description

Method and apparatus for determining a lowest integer number of bits required to represent non-differential gain values for compression of an HOA data frame representation

The present application is a divisional application of the invention patent application No. 201580035094.9, filed on June 22, 2015 and entitled "Method and apparatus for determining a minimum integer number of bits required to represent non-differential gain values for compression of an HOA data frame representation".

Technical Field

The present invention relates to a method and apparatus for determining a minimum integer number of bits required to represent a non-differential gain value associated with a channel signal of a particular one of HOA data frames for compression of the HOA data frame representation.

Background

Higher Order Ambisonics, denoted HOA, offers one possibility to represent three-dimensional sound. Other techniques are Wave Field Synthesis (WFS) and channel-based approaches like 22.2. In contrast to channel-based methods, the HOA representation has the advantage of being independent of a specific loudspeaker set-up. This flexibility, however, comes at the expense of a decoding process that is required for playback of the HOA representation on a particular loudspeaker set-up. Compared to the WFS approach, where the number of required loudspeakers is usually very large, HOA may also be rendered to set-ups consisting of only a few loudspeakers. A further advantage of HOA is that the same representation can also be used without any modification for binaural rendering to headphones.

HOA is based on the representation of the spatial density of complex harmonic plane wave amplitudes by a truncated Spherical Harmonics (SH) expansion. Each expansion coefficient is a function of angular frequency, which can be equivalently represented by a time-domain function. Thus, without loss of generality, a complete HOA sound field representation can be assumed to consist of $O$ time-domain functions, where $O$ denotes the number of expansion coefficients. These time-domain functions are referred to hereinafter equivalently as HOA coefficient sequences or HOA channels.

The spatial resolution of the HOA representation improves with a growing maximum order $N$ of the expansion. Unfortunately, the number of expansion coefficients $O$ grows quadratically with the order $N$, in particular $O = (N+1)^2$. For example, typical HOA representations using order $N = 4$ require $O = 25$ HOA (expansion) coefficients. Given a desired single-channel sampling rate $f_S$ and a number of bits per sample $N_b$, the total bit rate for the transmission of an HOA representation is determined by $O \cdot f_S \cdot N_b$. Consequently, transmitting an HOA representation of order $N = 4$ with a sampling rate of $f_S = 48$ kHz and $N_b = 16$ bits per sample results in a bit rate of 19.2 Mbit/s, which is very high for many practical applications such as streaming. Compression of HOA representations is therefore highly desirable.
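The bit-rate figure quoted above can be reproduced with a few lines of arithmetic. The following is a minimal sketch; the function name `hoa_bit_rate` and its parameter names are illustrative and not part of the described method.

```python
def hoa_bit_rate(order: int, sample_rate_hz: int, bits_per_sample: int) -> int:
    """Raw bit rate (bit/s) of an uncompressed HOA representation."""
    num_coeffs = (order + 1) ** 2  # O = (N + 1)^2 coefficient sequences
    return num_coeffs * sample_rate_hz * bits_per_sample  # O * f_S * N_b


if __name__ == "__main__":
    rate = hoa_bit_rate(order=4, sample_rate_hz=48_000, bits_per_sample=16)
    print(f"{rate / 1e6:.1f} Mbit/s")  # 25 * 48000 * 16 bit/s = 19.2 Mbit/s
```

Because the number of coefficient sequences is quadratic in the order, doubling the order roughly quadruples the raw bit rate.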

Compression of HOA sound field representations was previously proposed in EP 2665208 A1, EP 2743922 A1 and EP 2800401 A1; see also ISO/IEC JTC1/SC29/WG11, N14264, WD1-HOA Text of MPEG-H 3D Audio, January 2014. These approaches have in common that they perform a sound field analysis and decompose the given HOA representation into a directional component and a residual ambient component. On the one hand, the final compressed representation is assumed to consist of a number of quantized signals, resulting from the perceptual coding of the directional and vector-based signals and of the relevant coefficient sequences of the ambient HOA component. On the other hand, it comprises additional side information related to the quantized signals, which is needed to reconstruct the HOA representation from its compressed version.

Before being passed to the perceptual encoder, these intermediate time-domain signals are required to have a maximum amplitude within the value range [-1, 1], a requirement arising from the implementation of currently available perceptual encoders. To satisfy this requirement when compressing HOA representations, a gain control processing unit is used ahead of the perceptual encoders, which smoothly attenuates or amplifies the input signals (see EP 2824661 A1 and the above-mentioned ISO/IEC JTC1/SC29/WG11 N14264 document). The resulting signal modification is assumed to be invertible and to be applied frame by frame, where in particular the change of the signal amplitudes between successive frames is assumed to be a power of 2. To facilitate the inversion of this signal modification in the HOA decompressor, corresponding normalization side information is included in the total side information. This normalization side information can consist of exponents to base 2, which describe the relative amplitude change between two successive frames. According to the above-mentioned ISO/IEC JTC1/SC29/WG11 N14264 document, these exponents are coded using a run length code, since small amplitude changes between successive frames are more likely than larger ones.

Disclosure of Invention

Using differentially coded amplitude changes to reconstruct the original signal amplitudes in the HOA decompression is feasible, for example, when a single file is decompressed from beginning to end without any temporal jumps. However, to facilitate random access, independent access units must be present in the coded representation (which is typically a bit stream) so that decompression can start from a desired position (or at least in its vicinity), independently of the information from previous frames. Such an independent access unit must contain the total absolute amplitude change (i.e. a non-differential gain value) caused by the gain control processing unit from the first frame up to the current frame. Assuming that the amplitude changes between two successive frames are a power of 2, it is sufficient to describe the total absolute amplitude change by an exponent to base 2. For an efficient coding of this exponent, it is essential to know the potential maximum gain of the signals before the gain control processing unit is applied. This knowledge, however, depends strongly on the specification of constraints on the value range of the HOA representations to be compressed. Unfortunately, the MPEG-H 3D Audio document ISO/IEC JTC1/SC29/WG11 N14264 only provides a description of the format of the input HOA representation, without setting any constraints on the value range.
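The relation between differential and non-differential gain values can be illustrated as a running sum of base-2 exponents. This is a minimal sketch only: the function names, the list-based storage and the sign convention (a negative exponent meaning attenuation) are assumptions made for illustration, not details taken from the text.

```python
from itertools import accumulate


def non_differential_exponents(differential_exponents):
    """Running sum of per-frame base-2 exponents.

    Entry k is the total exponent accumulated from the first frame up to
    frame k, i.e. the value an independent access unit at frame k would
    have to carry (assumed convention: negative = attenuation).
    """
    return list(accumulate(differential_exponents))


def total_gain(exponent: int) -> float:
    """Linear gain factor described by a base-2 exponent."""
    return 2.0 ** exponent


if __name__ == "__main__":
    diffs = [0, -1, -1, 0, 1]                  # hypothetical frame-to-frame changes
    totals = non_differential_exponents(diffs)
    print(totals)                              # [0, -1, -2, -2, -1]
    print(total_gain(totals[3]))               # 0.25
```

A decoder that starts at frame 3 cannot rebuild the value -2 from the differential values alone, which is why the access unit has to transmit it directly, and why the number of bits reserved for it matters.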

The problem to be solved by the invention is to provide the lowest integer number of bits required to represent non-differential gain values. This problem is solved by the method disclosed in claim 1. An apparatus that uses this method is disclosed in claim 2. Advantageous additional embodiments of the invention are disclosed in the respective dependent claims.

The invention establishes a relationship between the value range of the input HOA representation and the potential maximum gain of the signals before the gain control processing unit is applied within the HOA compressor.

Based on that relationship, the number of bits required for an efficient coding of the base-2 exponents is determined for a given specification of the value range of the input HOA representation. Within an access unit, these exponents describe the total absolute amplitude changes (i.e. the non-differential gain values) of the modified signals caused by the gain control processing unit from the first frame up to the current frame.

Furthermore, once the rules for calculating the required amount of bits for encoding the exponent are determined, the present invention uses a process for verifying whether a given HOA representation satisfies the required value range constraint, so that the given HOA representation can be correctly compressed.

In principle, the inventive method is suited for determining, for the compression of an HOA data frame representation, a lowest integer number $\beta_e$ of bits required for representing non-differential gain values for channel signals of specific ones of said HOA data frames, wherein each channel signal in each frame comprises a group of sample values, and wherein a differential gain value is assigned to each channel signal of each one of said HOA data frames, such differential gain value causing a change of the amplitudes of the sample values of a channel signal in a current HOA data frame with respect to the sample values of that channel signal in the previous HOA data frame, and wherein such gain-adapted channel signals are encoded in an encoder,

and wherein said HOA data frame representation was rendered in the spatial domain to $O$ virtual loudspeaker signals $w_j(t)$, where the positions of said virtual loudspeakers lie on a unit sphere and are targeted to be distributed uniformly on that unit sphere, said rendering being represented by the matrix multiplication $w(t) = \Psi^{-1} \cdot c(t)$, where $w(t)$ is a vector containing all virtual loudspeaker signals, $\Psi$ is the mode matrix of the virtual loudspeaker positions, and $c(t)$ is a vector of the corresponding HOA coefficient sequences of said HOA data frame representation,

and wherein said HOA data frame representation was normalized such that

$$\|w(t)\|_\infty = \max_{1 \le j \le O} |w_j(t)| \le 1 \quad \forall t .$$

The method comprises the following steps:

- forming said channel signals from the normalized HOA data frame representation by one or more of the following sub-steps a), b), c):

a) for representing predominant sound signals in said channel signals, multiplying said vector of HOA coefficient sequences $c(t)$ by a mixing matrix $A$ whose Euclidean norm is not greater than 1, wherein the mixing matrix $A$ represents a linear combination of coefficient sequences of said normalized HOA data frame representation;

b) for representing an ambient component $c_{AMB}(t)$ in said channel signals, subtracting said predominant sound signals from said normalized HOA data frame representation, selecting at least part of the coefficient sequences of said ambient component $c_{AMB}(t)$, wherein $\|c_{AMB}(t)\|_2^2 \le \|c(t)\|_2^2$, and transforming the resulting minimum ambient component $c_{AMB,MIN}(t)$ by computing $w_{MIN}(t) = \Psi_{MIN}^{-1} \cdot c_{AMB,MIN}(t)$, wherein $\|\Psi_{MIN}^{-1}\|_2 < 1$ and $\Psi_{MIN}$ is a mode matrix for said minimum ambient component $c_{AMB,MIN}(t)$;

c) selecting part of said HOA coefficient sequences $c(t)$, wherein the selected coefficient sequences relate to coefficient sequences of the ambient HOA component to which a spatial transform is applied, and the minimum order $N_{MIN}$ describing the number of said selected coefficient sequences is $N_{MIN} \le 9$;

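- setting said lowest integer number $\beta_e$ of bits required for representing said non-differential gain values for said channel signals to

$$\beta_e = \left\lceil \log_2\left( \left\lceil \log_2\left( \sqrt{K_{MAX}} \cdot O \right) \right\rceil + 1 \right) \right\rceil ,$$

wherein $K_{MAX} = \max_{1 \le N \le N_{MAX}} K(N, \Omega_1^{(N)}, \ldots, \Omega_O^{(N)})$, $N$ is the order, $N_{MAX}$ is a maximum order of interest, $\Omega_1^{(N)}, \ldots, \Omega_O^{(N)}$ are the directions of said virtual loudspeakers, $O = (N+1)^2$ is the number of HOA coefficient sequences, and $K$ is the ratio between the squared Euclidean norm $\|\Psi\|_2^2$ of said mode matrix and $O$.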
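The bit-count rule can be sketched numerically. The sketch below assumes that the rule has the form $\beta_e = \lceil \log_2(\lceil \log_2(\sqrt{K_{MAX}} \cdot O) \rceil + 1) \rceil$, i.e. that the largest exponent to be represented is $\lceil \log_2(\sqrt{K_{MAX}} \cdot O) \rceil$ and that all integers from 0 up to that value must be codable. The function name and the value $K_{MAX} = 1.5$ used in the example are hypothetical.

```python
import math


def lowest_bit_count(k_max: float, order: int) -> int:
    """Bits needed for the non-differential base-2 exponent (assumed rule)."""
    num_coeffs = (order + 1) ** 2                       # O = (N + 1)^2
    # Largest exponent: ceil(log2(sqrt(K_MAX) * O))
    e_max = math.ceil(math.log2(math.sqrt(k_max) * num_coeffs))
    # Bits required to code the integers 0 .. e_max
    return math.ceil(math.log2(e_max + 1))


if __name__ == "__main__":
    # K_MAX = 1.5 is a hypothetical value chosen only for this example.
    print(lowest_bit_count(1.5, order=4))    # 3
    print(lowest_bit_count(1.5, order=29))   # 4
```

Under these assumptions the result grows only doubly logarithmically with the number of coefficient sequences, so even high orders need very few bits.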

In principle, the inventive apparatus is suited for determining, for the compression of an HOA data frame representation, a lowest integer number $\beta_e$ of bits required for representing non-differential gain values for channel signals of specific ones of said HOA data frames, wherein each channel signal in each frame comprises a group of sample values, and wherein a differential gain value is assigned to each channel signal of each one of said HOA data frames, such differential gain value causing a change of the amplitudes of the sample values of a channel signal in a current HOA data frame with respect to the sample values of that channel signal in the previous HOA data frame, and wherein such gain-adapted channel signals are encoded in an encoder,

and wherein said HOA data frame representation was rendered in the spatial domain to $O$ virtual loudspeaker signals $w_j(t)$, where the positions of said virtual loudspeakers lie on a unit sphere and are targeted to be distributed uniformly on that unit sphere, said rendering being represented by the matrix multiplication $w(t) = \Psi^{-1} \cdot c(t)$, where $w(t)$ is a vector containing all virtual loudspeaker signals, $\Psi$ is the mode matrix of the virtual loudspeaker positions, and $c(t)$ is a vector of the corresponding HOA coefficient sequences of said HOA data frame representation,

and wherein said HOA data frame representation was normalized such that

$$\|w(t)\|_\infty = \max_{1 \le j \le O} |w_j(t)| \le 1 \quad \forall t .$$

The apparatus comprises:

- means which form said channel signals from the normalized HOA data frame representation by one or more of the following operations a), b), c):

a) for representing predominant sound signals in said channel signals, multiplying said vector of HOA coefficient sequences $c(t)$ by a mixing matrix $A$ whose Euclidean norm is not greater than 1, wherein the mixing matrix $A$ represents a linear combination of coefficient sequences of said normalized HOA data frame representation;

b) for representing an ambient component $c_{AMB}(t)$ in said channel signals, subtracting said predominant sound signals from said normalized HOA data frame representation, selecting at least part of the coefficient sequences of said ambient component $c_{AMB}(t)$, wherein $\|c_{AMB}(t)\|_2^2 \le \|c(t)\|_2^2$, and transforming the resulting minimum ambient component $c_{AMB,MIN}(t)$ by computing $w_{MIN}(t) = \Psi_{MIN}^{-1} \cdot c_{AMB,MIN}(t)$, wherein $\|\Psi_{MIN}^{-1}\|_2 < 1$ and $\Psi_{MIN}$ is a mode matrix for said minimum ambient component $c_{AMB,MIN}(t)$;

- means which set said lowest integer number $\beta_e$ of bits required for representing said non-differential gain values for said channel signals to

$$\beta_e = \left\lceil \log_2\left( \left\lceil \log_2\left( \sqrt{K_{MAX}} \cdot O \right) \right\rceil + 1 \right) \right\rceil ,$$

wherein $K_{MAX} = \max_{1 \le N \le N_{MAX}} K(N, \Omega_1^{(N)}, \ldots, \Omega_O^{(N)})$, $N$ is the order, $N_{MAX}$ is a maximum order of interest, $\Omega_1^{(N)}, \ldots, \Omega_O^{(N)}$ are the directions of said virtual loudspeakers, $O = (N+1)^2$ is the number of HOA coefficient sequences, and $K$ is the ratio between the squared Euclidean norm $\|\Psi\|_2^2$ of said mode matrix and $O$.

Drawings

Exemplary embodiments of the invention are described with reference to the accompanying drawings, in which:

Fig. 1 HOA compressor;

Fig. 2 HOA decompressor;

Fig. 3 scaling values $K$ for virtual directions $\Omega_j^{(N)}$, $1 \le j \le O$, for HOA orders $N = 1, \ldots, 29$;

Fig. 4 Euclidean norms of inverse mode matrices $\Psi^{-1}$ for virtual directions $\Omega_{MIN,d}$, $d = 1, \ldots, O_{MIN}$, for HOA orders $N_{MIN} = 1, \ldots, 9$;

Fig. 5 determination of the maximally allowed magnitude $\gamma_{dB}$ of signals of virtual loudspeakers at positions $\Omega_j^{(N)}$, $1 \le j \le O$, where $O = (N+1)^2$;

Fig. 6 spherical coordinate system.

Detailed Description

The following embodiments may be used in any combination or sub-combination, even if not explicitly described.

In the following, the principle of HOA compression and decompression is presented in order to provide a more detailed context in which the above-mentioned problem occurs. The basis for this presentation is the processing described in the MPEG-H 3D Audio document ISO/IEC JTC1/SC29/WG11 N14264 (see also EP 2665208 A1, EP 2800401 A1 and EP 2743922 A1). In N14264, the "directional component" is extended to a "predominant sound component". Like the directional component, the predominant sound component is assumed to be partly represented by directional signals, i.e. monaural signals with a corresponding direction from which they are assumed to impinge on the listener, together with some prediction parameters for predicting portions of the original HOA representation from the directional signals. In addition, the predominant sound component is assumed to be represented by "vector-based signals", i.e. monaural signals with a corresponding vector that defines the directional distribution of the vector-based signals.

HOA compression

Fig. 1 shows the overall architecture of the HOA compressor described in EP 2800401 A1. It has a spatial HOA encoding part, depicted in Fig. 1A, and a perceptual and source encoding part, depicted in Fig. 1B. The spatial HOA encoder provides a first compressed HOA representation consisting of $I$ signals together with side information describing how to create an HOA representation from them. The $I$ signals are perceptually encoded in the perceptual encoders and the side information is source encoded in the side information source coder, before the two coded representations are multiplexed.

Spatial HOA coding

In a first step, the current $k$-th frame $C(k)$ of the original HOA representation is input to a direction and vector estimation processing step or stage 11, which is assumed to provide two tuple sets. The first tuple set consists of tuples whose first element denotes the index of a directional signal and whose second element denotes the respective quantized direction. The second tuple set consists of tuples whose first element denotes the index of a vector-based signal and whose second element denotes the vector defining the directional distribution of that signal (i.e. how the HOA representation of the vector-based signal is computed).

Using both tuple sets, the initial HOA frame $C(k)$ is decomposed in an HOA decomposition step or stage 12 into the frame $X_{PS}(k-1)$ of all predominant sound (i.e. directional and vector-based) signals and the frame $C_{AMB}(k-1)$ of the ambient HOA component. Note the delay of one frame, which is caused by the overlap-add processing used to avoid blocking artifacts. Furthermore, the HOA decomposition step/stage 12 is assumed to output some prediction parameters $\zeta(k-1)$ describing how to predict portions of the original HOA representation from the directional signals, in order to enrich the predominant sound HOA component. Additionally, a target assignment vector $v_{A,T}(k-1)$ is assumed to be provided, containing information about the assignment of the predominant sound signals determined in the HOA decomposition step or stage 12 to the $I$ available channels. The affected channels can be assumed to be occupied, meaning that they are not available for transporting any coefficient sequences of the ambient HOA component in the respective time frame.

In an ambient component modification processing step or stage 13, the frame $C_{AMB}(k-1)$ of the ambient HOA component is modified according to the information provided by the target assignment vector $v_{A,T}(k-1)$. In particular, it is determined which coefficient sequences of the ambient HOA component are to be transmitted in the given $I$ channels, depending (among other aspects) on the information contained in the target assignment vector $v_{A,T}(k-1)$ about which channels are available and not already occupied by predominant sound signals.

In addition, if the index of the selected coefficient sequence changes between successive frames, a cross fade of the coefficient sequence is performed.

Furthermore, the first $O_{MIN}$ coefficient sequences of the ambient HOA component $C_{AMB}(k-2)$ are assumed to be always chosen to be perceptually coded and transmitted, where $O_{MIN} = (N_{MIN}+1)^2$, with $N_{MIN} \le N$ typically being a smaller order than that of the original HOA representation. In order to decorrelate these HOA coefficient sequences, they can be transformed in step/stage 13 to directional signals (i.e. general plane wave functions) impinging from some predefined directions $\Omega_{MIN,d}$, $d = 1, \ldots, O_{MIN}$.

Along with the modified ambient HOA component $C_{M,A}(k-1)$, a temporally predicted modified ambient HOA component $C_{P,M,A}(k-1)$ is computed in step/stage 13 and is used in the gain control steps or stages 15, 151 in order to allow a reasonable look-ahead, wherein the information about the modification of the ambient HOA component is directly related to the assignment of all possible types of signals to the available channels in the channel assignment step or stage 14. The final information about that assignment is assumed to be contained in the final assignment vector $v_A(k-2)$. In order to compute this vector in step/stage 13, the information contained in the target assignment vector $v_{A,T}(k-1)$ is used.

Using the information provided by the assignment vector $v_A(k-2)$, the channel assignment in step/stage 14 assigns the appropriate signals contained in frame $X_{PS}(k-2)$ and in frame $C_{M,A}(k-2)$ to the $I$ available channels, yielding the signal frames $y_i(k-2)$, $i = 1, \ldots, I$. In addition, the appropriate signals contained in frame $X_{PS}(k-1)$ and in frame $C_{P,AMB}(k-1)$ are also assigned to the $I$ available channels, yielding the predicted signal frames $y_{P,i}(k-1)$, $i = 1, \ldots, I$.

Each of the signal frames $y_i(k-2)$, $i = 1, \ldots, I$, is finally processed by a gain control step/stage 15, 151, resulting in exponents $e_i(k-2)$ and exception flags $\beta_i(k-2)$, $i = 1, \ldots, I$, and in signals $z_i(k-2)$, $i = 1, \ldots, I$, in which the signal gain is smoothly modified so as to achieve a value range that is suitable for the perceptual encoder step or stage 16. Step/stage 16 outputs the corresponding encoded signal frames. The predicted signal frames $y_{P,i}(k-1)$, $i = 1, \ldots, I$, allow a kind of look-ahead in order to avoid severe gain changes between successive blocks. In a side information source coder step or stage 17, the side information data, including the exponents $e_i(k-2)$, the exception flags $\beta_i(k-2)$, the prediction parameters $\zeta(k-1)$ and the assignment vector $v_A(k-2)$, are source coded, resulting in an encoded side information frame. In the multiplexer 18, the encoded signals of frame $(k-2)$ and the encoded side information data for that frame are combined, resulting in an output frame.

In the spatial HOA decoder, the gain modifications of steps/stages 15, 151 are assumed to be reverted by using the gain control side information, consisting of the exponents $e_i(k-2)$ and the exception flags $\beta_i(k-2)$, $i = 1, \ldots, I$.

HOA decompression

Fig. 2 shows the overall architecture of the HOA decompressor described in EP 2800401 A1. It consists of the counterparts of the HOA compressor components, arranged in reverse order, and comprises a perceptual and source decoding part, depicted in Fig. 2A, and a spatial HOA decoding part, depicted in Fig. 2B.

In the perceptual and source decoding part (representing a perceptual and side information source decoder), a demultiplexing step or stage 21 receives the input frame from the bit stream and provides the perceptually coded representation of the $I$ signals and the coded side information data describing how to create an HOA representation from them. The perceptually coded signals are decoded in a perceptual decoder step or stage 22, resulting in $I$ decoded signals. The coded side information data are decoded in a side information source decoder step or stage 23, resulting in the data sets, the exponents $e_i(k)$, the exception flags $\beta_i(k)$, the prediction parameters $\zeta(k+1)$ and an assignment vector $v_{AMB,ASSIGN}(k)$. Regarding the difference between $v_A$ and $v_{AMB,ASSIGN}$, see the above-mentioned MPEG document N14264.

Spatial HOA decoding

In the spatial HOA decoding part, each of the $I$ perceptually decoded signals is input, together with its associated gain correction exponent $e_i(k)$ and gain correction exception flag $\beta_i(k)$, to an inverse gain control processing step or stage 24, 241. The $i$-th inverse gain control processing step/stage provides a gain-corrected signal frame.

All $I$ gain-corrected signal frames are passed, together with the assignment vector $v_{AMB,ASSIGN}(k)$ and the two tuple sets defined above, to a channel reassignment step or stage 25. The assignment vector $v_{AMB,ASSIGN}(k)$ consists of $I$ components which indicate, for each transmission channel, whether it contains a coefficient sequence of the ambient HOA component and, if so, which one. In the channel reassignment step/stage 25, the gain-corrected signal frames are redistributed in order to reconstruct the frame of all predominant sound signals (i.e. all directional and vector-based signals) and the frame $C_{I,AMB}(k)$ of an intermediate representation of the ambient HOA component. Additionally, the set of indices of the coefficient sequences of the ambient HOA component that are active in the $k$-th frame is provided, along with the sets of coefficient indices of the ambient HOA component that have to be enabled, disabled, and to remain active in the $(k-1)$-th frame.

In the predominant sound synthesis step or stage 26, the HOA representation of the predominant sound component is computed from the frame of all predominant sound signals, using the tuple sets, the set $\zeta(k+1)$ of prediction parameters and the associated data sets.

In the ambience synthesis step or stage 27, the ambient HOA component frame is created from the frame $C_{I,AMB}(k)$ of the intermediate representation of the ambient HOA component, using the set of indices of the coefficient sequences of the ambient HOA component that are active in the $k$-th frame. A delay of one frame is introduced by the synchronization with the predominant sound HOA component.

Finally, in an HOA composition step or stage 28, the ambient HOA component frame and the frame of the predominant sound HOA component are superposed to provide the decoded HOA frame.

Thus, the spatial HOA decoder creates the reconstructed HOA representation from the $I$ signals and the side information.

If the ambient HOA component was transformed to directional signals at the encoding side, that transform is inverted at the decoder side in step/stage 27.

The potential maximum gain of the signals before the gain control processing steps/stages 15, 151 within the HOA compressor depends strongly on the value range of the input HOA representation. Hence, a meaningful value range for the input HOA representation is defined first, followed by conclusions about the potential maximum gain of the signals before they enter the gain control processing steps/stages.

Normalization of input HOA representation

To use the inventive processing, the (total) input HOA representation signal is normalized beforehand. For the HOA compression, a frame-wise processing is performed, where the $k$-th frame $C(k)$ of the original input HOA representation is defined with respect to the vector $c(t)$ of time-continuous HOA coefficient sequences, specified in equation (54) in the section "Basics of Higher Order Ambisonics", as

$$C(k) := \left[ c((kL+1)T_S) \;\; c((kL+2)T_S) \;\; \ldots \;\; c((k+1)LT_S) \right] \in \mathbb{R}^{O \times L} , \qquad (1)$$

where $k$ denotes the frame index, $L$ the frame length (in samples), $O = (N+1)^2$ the number of HOA coefficient sequences, and $T_S$ the sampling period.
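The frame-wise processing can be pictured as slicing the sampled HOA coefficient vectors into blocks of $L$ samples. The sketch below assumes that frame $k$ covers the sample indices $l = kL+1, \ldots, (k+1)L$ and that `samples[l - 1]` stores the vector $c(lT_S)$; both the indexing convention and the function name `hoa_frame` are assumptions made for illustration.

```python
def hoa_frame(samples, k, frame_length):
    """Return frame k as a list of L coefficient vectors.

    samples[l - 1] is assumed to hold c(l * T_S), a list of O values,
    so frame k spans the 0-based positions k*L .. (k+1)*L - 1.
    """
    start = k * frame_length
    frame = samples[start:start + frame_length]
    if len(frame) != frame_length:
        raise ValueError("not enough samples for a complete frame")
    return frame


if __name__ == "__main__":
    # Ten hypothetical samples of an order-0 representation (O = 1).
    samples = [[float(l)] for l in range(1, 11)]
    print(hoa_frame(samples, k=1, frame_length=4))  # [[5.0], [6.0], [7.0], [8.0]]
```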

As mentioned in EP 2824661 A1, a meaningful normalization of an HOA representation is, from a practical perspective, not achieved by imposing constraints on the value range of the individual HOA coefficient sequences, since these time-domain functions are not the signals that are actually played by the loudspeakers after rendering. Instead, it is more convenient to consider the rendering of the HOA representation to $O$ virtual loudspeaker signals $w_j(t)$, $1 \le j \le O$. The respective virtual loudspeaker positions are assumed to be expressed by means of a spherical coordinate system, where each position is assumed to lie on the unit sphere, i.e. to have a radius of 1. Hence, the positions can be equivalently expressed by the order-dependent directions $\Omega_j^{(N)} = (\theta_j^{(N)}, \phi_j^{(N)})$, $1 \le j \le O$, where $\theta_j^{(N)}$ and $\phi_j^{(N)}$ denote the inclinations and azimuths, respectively (see also Fig. 6 and its description for the definition of the spherical coordinate system). These directions should be distributed on the unit sphere as uniformly as possible; see, for example, J. Fliege, U. Maier, "A two-stage approach for computing cubature formulae for the sphere", 1999, for the computation of specific directions. Node numbers can be found at http://www.mathematik.uni-dortmund.de/lsx/research/project/fliege/nodes/nodes.html. These positions generally depend on the chosen definition of "uniform distribution on the sphere" and are therefore not unambiguous.

The advantage of defining a value range for the virtual loudspeaker signals, rather than for the HOA coefficient sequences, is that the value range of the former can intuitively be set equal to the interval [-1, 1], as is the case for conventional loudspeaker signals assuming a PCM representation. This leads to a spatially uniformly distributed quantization error, so that the quantization is advantageously applied in a domain that is relevant for actual listening. An important aspect in this context is that the number of bits per sample can be chosen as low as is typical for conventional loudspeaker signals (i.e. 16), which increases efficiency compared to the direct quantization of HOA coefficient sequences, where a higher number of bits per sample (e.g. 24 or even 32) is usually required.

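To elaborate the normalization process in the spatial domain, all virtual loudspeaker signals are collected in a vector as

$$w(t) := [w_1(t) \; \ldots \; w_O(t)]^T , \qquad (2)$$

where $(\cdot)^T$ denotes transposition. Denoting by $\Psi$ the mode matrix with respect to the virtual directions $\Omega_j^{(N)}$, $1 \le j \le O$, which is defined by

$$\Psi := [S_1 \; \ldots \; S_O] \in \mathbb{R}^{O \times O} \qquad (3)$$

with

$$S_j := \left[ S_0^0(\Omega_j^{(N)}) \;\; S_1^{-1}(\Omega_j^{(N)}) \;\; S_1^0(\Omega_j^{(N)}) \;\; S_1^1(\Omega_j^{(N)}) \;\; \ldots \;\; S_N^{N-1}(\Omega_j^{(N)}) \;\; S_N^N(\Omega_j^{(N)}) \right]^T , \qquad (4)$$

the rendering process can be formulated as the matrix multiplication

$$w(t) = \Psi^{-1} \cdot c(t) . \qquad (5)$$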

Using these definitions, a reasonable requirement on the virtual loudspeaker signals is

$$\|w(lT_S)\|_\infty = \max_{1 \le j \le O} |w_j(lT_S)| \le 1 \quad \forall l , \qquad (6)$$

which means that the magnitude of each virtual loudspeaker signal is required to lie within the range [-1, 1]. Time $t$ is represented by a sample index $l$ and the sample period $T_S$ of the sample values of said HOA data frames.

The total power of the loudspeaker signals consequently satisfies the condition

$$\|w(lT_S)\|_2^2 = \sum_{j=1}^{O} |w_j(lT_S)|^2 \le O \quad \forall l . \qquad (7)$$

The rendering and normalization of the HOA data frame representation is performed upstream of the input c (k) of fig. 1A.
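The value range requirement on the virtual loudspeaker signals, and the power bound that follows from it, can be checked directly on sampled data. This is a minimal sketch operating on already rendered virtual loudspeaker samples; the function names and the list-of-lists layout are illustrative assumptions, and the rendering step itself (multiplication by the inverse mode matrix) is not shown.

```python
def satisfies_value_range(w_samples, tol=0.0):
    """True if every virtual loudspeaker sample lies within [-1, 1].

    w_samples is an iterable of per-sample vectors [w_1, ..., w_O].
    """
    return all(abs(w_j) <= 1.0 + tol for w in w_samples for w_j in w)


def total_power(w):
    """Squared Euclidean norm of one sample vector [w_1, ..., w_O]."""
    return sum(w_j * w_j for w_j in w)


if __name__ == "__main__":
    ok = [[0.5, -1.0, 0.25, 0.0]]        # hypothetical samples, O = 4
    bad = [[0.5, 1.01, 0.0, 0.0]]
    print(satisfies_value_range(ok))     # True
    print(satisfies_value_range(bad))    # False
    # With all magnitudes at most 1, the total power cannot exceed O.
    print(total_power([1.0, -1.0, 1.0, -1.0]))  # 4.0
```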

Resulting value range of the signals before gain control

Assuming that the input HOA representation is normalized as described in the section on normalization of the input HOA representation, the value range of the signals $y_i$, $i = 1, \ldots, I$, which are input to the gain control processing unit 15, 151 in the HOA compressor, is considered in the following. These signals are created by assigning to the $I$ available channels one or more of the HOA coefficient sequences, or the predominant sound signals $x_{PS,d}$, $d = 1, \ldots, D$, and/or particular coefficient sequences of the ambient HOA component $c_{AMB,n}$, $n = 1, \ldots, O$, to part of which a spatial transform is applied. Hence, it is necessary to analyze the possible value range of these different signal types under the normalization assumption in equation (6). Since all kinds of signals are computed as intermediate results from the original HOA coefficient sequences, the possible value range of the latter is examined first.

The case in which only one or more HOA coefficient sequences are contained in the $I$ channels is not depicted in Fig. 1A and Fig. 2B; in that case, the HOA decomposition, the ambient component modification and the corresponding synthesis blocks are not required.

Resulting value range of the HOA representation

The time-continuous HOA representation is obtained from the virtual loudspeaker signals by

$$c(t) = \Psi \cdot w(t) , \qquad (8)$$

which is the inverse operation to that in equation (5).

Hence, the total power of all HOA coefficient sequences is bounded, using equations (8) and (7), as follows:

$$\|c(lT_S)\|_2^2 \le \|\Psi\|_2^2 \cdot \|w(lT_S)\|_2^2 \le \|\Psi\|_2^2 \cdot O . \qquad (9)$$

Under the assumption of N3D normalization of the spherical harmonic functions, the squared Euclidean norm of the mode matrix can be written as

||Ψ||₂² = K·O, (10a)

where

K = ||Ψ||₂² / O (10b)

denotes the ratio between the squared Euclidean norm of the mode matrix and the number O of HOA coefficient sequences. This ratio depends on the particular HOA order N and the particular virtual loudspeaker directions Ω_j^(N), 1 ≤ j ≤ O, which can be expressed by appending the corresponding parameter list to the ratio:

K = K(N, Ω_1^(N), ..., Ω_O^(N)). (10c)

Fig. 3 shows the values of K for HOA orders N = 1, ..., 29, for virtual directions Ω_j^(N), 1 ≤ j ≤ O, chosen according to the above-mentioned article by Fliege et al.
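As a non-limiting illustration, the ratio K = ||Ψ||₂² / O may be evaluated numerically as sketched below. The sketch is not taken from the described system: the function names are hypothetical, the Euclidean (spectral) norm is approximated by a plain power iteration, and the example uses a first-order (N = 1) mode matrix for four tetrahedral directions rather than the Fliege et al. directions. It further assumes that the first-order N3D mode vector for a unit direction (x, y, z) is [1, √3·y, √3·z, √3·x].

```python
import math


def spectral_norm(matrix, iterations=200):
    """Approximate the largest singular value by power iteration on M^T M."""
    rows, cols = len(matrix), len(matrix[0])
    vec = [1.0 + 0.01 * j for j in range(cols)]  # generic start vector
    lam = 0.0
    for _ in range(iterations):
        length = math.sqrt(sum(x * x for x in vec))
        vec = [x / length for x in vec]
        # u = M v, g = M^T u
        u = [sum(matrix[i][j] * vec[j] for j in range(cols)) for i in range(rows)]
        g = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        lam = math.sqrt(sum(x * x for x in g))  # estimate of the largest eigenvalue of M^T M
        if lam == 0.0:
            return 0.0
        vec = g
    return math.sqrt(lam)


def first_order_mode_matrix(directions):
    """Assumed N3D mode matrix for N = 1; one column per unit direction (x, y, z)."""
    s3 = math.sqrt(3.0)
    columns = [[1.0, s3 * y, s3 * z, s3 * x] for (x, y, z) in directions]
    return [[col[i] for col in columns] for i in range(4)]


def ratio_k(mode_matrix):
    """K = ||Psi||_2^2 / O, with O taken as the number of rows of Psi."""
    return spectral_norm(mode_matrix) ** 2 / len(mode_matrix)


# Hypothetical example: four tetrahedral virtual loudspeaker directions.
a = 1.0 / math.sqrt(3.0)
TETRAHEDRON = [(a, a, a), (a, -a, -a), (-a, a, -a), (-a, -a, a)]
K_TETRAHEDRON = ratio_k(first_order_mode_matrix(TETRAHEDRON))
```

For these four directions the columns of the mode matrix are mutually orthogonal with Euclidean norm N + 1 = 2, so the computed ratio is close to 1.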

Combining all previous arguments and considerations provides an upper bound for the amplitude of the HOA coefficient sequences as follows:

||c(lT_S)||_∞ ≤ ||c(lT_S)||₂ ≤ √K·O, (11)

where the first inequality follows directly from the norm definitions.

It is important to note that the condition in equation (6) implies the condition in equation (11), but the converse does not hold, i.e., equation (11) does not imply equation (6).

Another important aspect is that, under the assumption of approximately uniformly distributed virtual loudspeaker positions, the column vectors of the mode matrix Ψ, which represent the mode vectors with respect to the virtual loudspeaker positions, are almost orthogonal to each other and each have a Euclidean norm of N + 1. This property means that the spatial transform almost preserves the Euclidean norm except for a multiplicative constant, i.e.,

||c(lT_S)||₂≈(N+1)||w(lT_S)||₂。 (12)

The more the orthogonality assumption on the mode vectors is violated, the more the true norm ||c(lT_S)||₂ differs from the approximation in equation (12).

Value range of the primary sound signals

Both types of primary sound signals (directional and vector-based) have in common that their contribution to the HOA representation is described by a single vector v₁ with a Euclidean norm of N + 1, i.e.,

||v₁||₂ = N + 1. (13)

In the case of a directional signal, this vector corresponds to the mode vector with respect to a certain signal source direction Ω_S,1, i.e.,

v₁ = S(Ω_S,1). (14)

This vector describes, by means of an HOA representation, a directional beam into the signal source direction Ω_S,1. In the case of a vector-based signal, the vector v₁ is not constrained to be a mode vector with respect to any direction, and can therefore describe a more general directional distribution of the monaural vector-based signal.

In the following, the general case of D primary sound signals x_d(t), d = 1, ..., D, is considered. These signals can be collected in a vector x(t) according to

x(t) = [x₁(t) x₂(t) ... x_D(t)]^T. (16)

These signals have to be determined based on the matrix

V := [v₁ v₂ ... v_D], (17)

which is formed from all vectors v_d, d = 1, ..., D, that represent the directional distributions of the monaural primary sound signals x_d(t), d = 1, ..., D.

For a meaningful extraction of the primary sound signals x(t), the following constraints are formulated:

a) Each primary sound signal is obtained as a linear combination of the coefficient sequences of the original HOA representation, i.e.,

x(t) = A·c(t), (18)

where A ∈ ℝ^(D×O) denotes the mixing matrix.

b) The mixing matrix A should be chosen such that its Euclidean norm does not exceed the value "1", i.e.,

||A||₂ ≤ 1, (19)

and such that the squared Euclidean norm (or equivalently the power) of the residual between the original HOA representation and the HOA representation of the primary sound signals is not greater than the squared Euclidean norm (or equivalently the power) of the original HOA representation, i.e.,

||c(t) − V·x(t)||₂² ≤ ||c(t)||₂². (20)

By substituting equation (18) into equation (20), it can be seen that equation (20) is equivalent to the constraint

||I − V·A||₂ ≤ 1, (21)

where I denotes the identity matrix.

From the constraints in equations (18) and (19) and from the compatibility of the Euclidean matrix and vector norms, an upper bound for the amplitudes of the primary sound signals is obtained, using equations (18), (19) and (11), as

||x(lT_S)||_∞ ≤ ||x(lT_S)||₂ (22)

≤ ||A||₂·||c(lT_S)||₂ (23)

≤ √K·O. (24)

Thus, it is ensured that the primary sound signals remain within the same range as the original HOA coefficient sequences (compare equation (11)), i.e.,

||x(lT_S)||_∞ ≤ √K·O. (25)

Example for the choice of the mixing matrix

One example of how to determine a mixing matrix that satisfies the constraint (20) is to compute the primary sound signals such that the Euclidean norm of the residual after extraction is minimized, i.e.,

x(t) = argmin_x(t) ||V·x(t) − c(t)||₂. (26)

The solution to the minimization problem in equation (26) is given by

x(t) = V⁺·c(t), (27)

where (·)⁺ denotes the Moore-Penrose pseudo-inverse. By comparing equation (27) with equation (18), it follows that in this case the mixing matrix is equal to the Moore-Penrose pseudo-inverse of the matrix V, i.e., A = V⁺.

However, the matrix V still has to be chosen such that the constraint (19) is satisfied, i.e.,

||A||₂ = ||V⁺||₂ ≤ 1. (28)

In the case of directional signals only, where the matrix V is the mode matrix with respect to some source signal directions Ω_S,d, d = 1, ..., D, i.e.,

V = [S(Ω_S,1) S(Ω_S,2) ... S(Ω_S,D)], (29)

the constraint (28) can be satisfied by choosing the source signal directions Ω_S,d, d = 1, ..., D, such that the distance between any two neighboring directions is not too small.
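For illustration only, the least-squares extraction of equations (26) and (27) can be sketched for the simplest case of a single primary sound signal (D = 1), where the pseudo-inverse reduces to V⁺ = v₁^T / (v₁^T·v₁) and its Euclidean norm to 1 / ||v₁||₂. The helper names and the numerical values below are hypothetical and serve only to check the constraints (19) and (20) on one example.

```python
import math


def dot(a, b):
    """Inner product of two equally long sequences."""
    return sum(x * y for x, y in zip(a, b))


def norm2(a):
    """Euclidean norm of a vector."""
    return math.sqrt(dot(a, a))


def extract_single_signal(v, c):
    """Least-squares sample x minimizing ||v*x - c||_2 for D = 1: x = v^T c / (v^T v)."""
    return dot(v, c) / dot(v, v)


def pinv_norm_single(v):
    """Euclidean norm of V^+ = v^T / (v^T v), which equals 1 / ||v||_2."""
    return 1.0 / norm2(v)


def residual(v, x, c):
    """Residual c - v*x remaining after the extraction."""
    return [ci - vi * x for vi, ci in zip(v, c)]


# Hypothetical first-order example: ||v||_2 = N + 1 = 2, as in equation (13).
V1 = [1.0, 1.0, 1.0, 1.0]
C = [2.0, 0.0, 0.0, 0.0]
X = extract_single_signal(V1, C)
RESIDUAL = residual(V1, X, C)
```

In this example ||V⁺||₂ = 0.5, so the constraint (19) holds, and the residual norm does not exceed ||c||₂, in line with the constraint (20).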

Value range of the coefficient sequences of the ambient HOA component

The ambient HOA component is computed by subtracting the HOA representation of the primary sound signals from the original HOA representation, i.e.,

c_AMB(t) = c(t) − V·x(t). (30)

If the vector of primary sound signals x(t) is determined according to the criterion (20), it can be concluded that

||c_AMB(lT_S)||_∞ ≤ ||c_AMB(lT_S)||₂ (31)

= ||c(lT_S) − V·x(lT_S)||₂ (32)

≤ ||c(lT_S)||₂ (33)

≤ √K·O. (34)

Value range of the spatially transformed coefficient sequences of the ambient HOA component

Another aspect of the HOA compression processing proposed in EP 2743922 A1 and in the above-mentioned MPEG document N14264 is that the first O_MIN coefficient sequences of the ambient HOA component are always chosen to be assigned to the transport channels, where O_MIN = (N_MIN + 1)² and N_MIN ≤ N is typically a smaller order than that of the original HOA representation. In order to decorrelate these HOA coefficient sequences, they can be transformed into virtual loudspeaker signals impinging from some predefined directions Ω_MIN,d, d = 1, ..., O_MIN (in analogy to the concept described in the section Normalization of the input HOA representation).

Denoting by c_AMB,MIN(t) the vector of all coefficient sequences of the ambient HOA component with order index n ≤ N_MIN, and by Ψ_MIN the mode matrix with respect to the virtual directions Ω_MIN,d, d = 1, ..., O_MIN, the vector w_MIN(t) of all virtual loudspeaker signals is obtained by

w_MIN(t) = Ψ_MIN⁻¹·c_AMB,MIN(t). (35)

Thus, using the compatibility of the Euclidean matrix and vector norms,

||w_MIN(lT_S)||_∞ ≤ ||w_MIN(lT_S)||₂ (36)

≤ ||Ψ_MIN⁻¹||₂·||c_AMB,MIN(lT_S)||₂ (37)

≤ ||Ψ_MIN⁻¹||₂·||c_AMB(lT_S)||₂ (38)

≤ ||Ψ_MIN⁻¹||₂·√K·O. (39)

In the above-mentioned MPEG document N14264, the virtual directions Ω_MIN,d, d = 1, ..., O_MIN, are chosen according to the above-mentioned article by Fliege et al. Fig. 4 shows the Euclidean norm of the inverse Ψ_MIN⁻¹ of the mode matrix for orders N_MIN = 1, ..., 9. It can be seen that ||Ψ_MIN⁻¹||₂ < 1 for N_MIN = 1, ..., 9. However, this does not hold in general for N_MIN > 9, where ||Ψ_MIN⁻¹||₂ is typically much greater than "1". Nevertheless, at least for 1 ≤ N_MIN ≤ 9, the amplitudes of the virtual loudspeaker signals are bounded by

||w_MIN(lT_S)||_∞ ≤ √K·O. (40)

By constraining the input HOA representation to satisfy the condition (6), which requires that the amplitudes of the virtual loudspeaker signals created from the HOA representation do not exceed the value "1", it can be guaranteed that the amplitudes of the signals before gain control will not exceed the value √K·O (see equations (25), (34) and (40)) under the following conditions:

a) the vector of all primary sound signals x(t) is computed according to the equation/constraints (18), (19) and (20);

b) if the virtual loudspeaker positions defined in the above-mentioned article by Fliege et al. are used, the minimum order N_MIN, which determines the number O_MIN of first coefficient sequences of the ambient HOA component to which a spatial transform is applied, must be less than "9".

It can be further concluded that, for any order N up to a maximum order of interest N_MAX, i.e., 1 ≤ N ≤ N_MAX, the amplitudes of the signals before gain control will not exceed the value √K_MAX·O, where

K_MAX = max_{1≤N≤N_MAX} K(N, Ω_1^(N), ..., Ω_O^(N)).

In particular, it can be concluded from Fig. 3 that, if the virtual loudspeaker directions Ω_j^(N), 1 ≤ j ≤ O, for the initial spatial transform are assumed to be chosen according to the distribution in the Fliege et al. article, and if the maximum order of interest is additionally assumed to be N_MAX = 29 (as, for example, in MPEG document N14264), then the amplitudes of the signals before gain control will not exceed the value 1.5·O, since √K_MAX < 1.5 in this special case. That is, √K_MAX = 1.5 can be selected.

K_MAX depends on the maximum order of interest N_MAX and on the virtual loudspeaker directions Ω_j^(N), 1 ≤ j ≤ O, which can be expressed by

K_MAX = K_MAX({Ω_1^(N), ..., Ω_O^(N) | 1 ≤ N ≤ N_MAX}).

Thus, the minimum gain applied by the gain control in order to ensure that the signals before perceptual coding lie within the interval [−1, 1[ is given by 2^(e_MIN), where

e_MIN = −⌈log₂(√K_MAX·O)⌉ < 0.

In the case where the amplitudes of the signals before gain control are too small, it is proposed in MPEG document N14264 to amplify them smoothly by a factor of up to 2^(e_MAX), where e_MAX ≥ 0 is transmitted as side information in the coded HOA representation.

Thus, each exponent to the base "2" that describes, within an access unit, the total absolute amplitude change of a modified signal caused by the gain control processing unit from the first frame up to the current frame can assume any integer value within the interval [e_MIN, e_MAX]. Consequently, the (smallest integer) number β_e of bits required for coding it is given by

β_e = ⌈log₂(|e_MIN| + e_MAX + 1)⌉ = ⌈log₂(⌈log₂(√K_MAX·O)⌉ + e_MAX + 1)⌉. (42)

In the case where the amplitudes of the signals before gain control are not too small, equation (42) can be simplified to

β_e = ⌈log₂(|e_MIN| + 1)⌉ = ⌈log₂(⌈log₂(√K_MAX·O)⌉ + 1)⌉.

The number of bits β_e may be calculated at the input of the gain control step/stage 15.

Using the number of bits β_e for the exponent ensures that all possible absolute amplitude changes caused by the gain control processing unit of the HOA compressor can be captured, which allows decompression to start at some predefined entry points within the compressed representation.
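As a non-limiting numerical sketch of equation (42), the following Python functions compute e_MIN = −⌈log₂(√K_MAX·O)⌉ and the resulting number of bits β_e. The function and parameter names are hypothetical, and the example values (√K_MAX = 1.5 with orders N = 4 and N = 29) are chosen purely for illustration. The outer ceiling of the base-2 logarithm is evaluated with integer arithmetic to avoid rounding issues.

```python
import math


def min_exponent(sqrt_k_max, num_coeffs):
    """e_MIN = -ceil(log2(sqrt(K_MAX) * O)) for O coefficient sequences."""
    return -math.ceil(math.log2(sqrt_k_max * num_coeffs))


def exponent_bits(sqrt_k_max, num_coeffs, e_max=0):
    """beta_e = ceil(log2(|e_MIN| + e_MAX + 1)); e_max = 0 gives the simplified form."""
    num_values = abs(min_exponent(sqrt_k_max, num_coeffs)) + e_max + 1
    # For a positive integer n, ceil(log2(n)) equals (n - 1).bit_length().
    return (num_values - 1).bit_length()


# Hypothetical examples with sqrt(K_MAX) = 1.5.
E_MIN_ORDER_4 = min_exponent(1.5, (4 + 1) ** 2)      # O = 25
BITS_ORDER_4 = exponent_bits(1.5, (4 + 1) ** 2)
BITS_ORDER_29 = exponent_bits(1.5, (29 + 1) ** 2)    # O = 900
```

For O = 25 the product √K_MAX·O is 37.5, so e_MIN = −6 and the seven possible exponent values −6, ..., 0 require three bits. For O = 900 the product is 1350, so e_MIN = −11 and four bits are required.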

When decompression of the compressed HOA representation is started in the HOA decompressor, the non-differential gain values representing the total absolute amplitude changes, which are assigned to the side information of some data frames and are received from the demultiplexer 21 out of the received data stream, are used in the inverse gain control steps or stages 24, 22 to apply the correct gain control, in a manner inverse to the processing carried out in the gain control steps/stages 15, 151.

Other embodiments

When implementing a particular HOA compression/decompression system as described in the sections HOA compression, Spatial HOA encoding, HOA decompression and Spatial HOA decoding, the number of bits β_e used for coding the exponent has to be set according to equation (42) as a function of a scaling factor K_MAX,DES, which itself depends on the desired maximum order N_MAX,DES of the HOA representations to be compressed and on certain virtual loudspeaker directions Ω_DES,1^(N), ..., Ω_DES,O^(N), 1 ≤ N ≤ N_MAX.

For example, when assuming N_MAX,DES = 29 and choosing the virtual loudspeaker directions according to the Fliege et al. article, a reasonable choice is √K_MAX,DES = 1.5. In that case, correct compression is guaranteed for HOA representations of order N (1 ≤ N ≤ N_MAX) that are normalized according to the section Normalization of the input HOA representation using the same virtual loudspeaker directions Ω_DES,1^(N), ..., Ω_DES,O^(N). However, this guarantee cannot be given for an HOA representation that is also (for efficiency reasons) equivalently represented by virtual loudspeaker signals in PCM format, but for which the directions Ω_j^(N), 1 ≤ j ≤ O, of the virtual loudspeakers are chosen to be different from the virtual loudspeaker directions Ω_DES,j^(N) assumed at the system design stage.

Due to this different choice of virtual loudspeaker positions, even if the amplitudes of these virtual loudspeaker signals lie within the interval [−1, 1], it can no longer be guaranteed that the amplitudes of the signals before gain control will not exceed the value √K_MAX,DES·O. Therefore, it cannot be guaranteed that this HOA representation has the proper normalization for compression according to the processing described in MPEG document N14264.

In this case, it is advantageous to have a system that provides, based on knowledge of the virtual loudspeaker positions, the maximum allowed amplitude of the virtual loudspeaker signals, in order to ensure that the corresponding HOA representation is suitable for compression according to the processing described in MPEG document N14264. Such a system is shown in Fig. 5. It takes as input the virtual loudspeaker positions Ω_j^(N), 1 ≤ j ≤ O, where O = (N + 1)², and provides as output the maximum allowed amplitude γ_dB (measured in decibels) of the virtual loudspeaker signals. In step or stage 51, the mode matrix Ψ with respect to the virtual loudspeaker positions is computed according to equation (3). In a subsequent step or stage 52, the Euclidean norm ||Ψ||₂ of the mode matrix is computed. In a third step or stage 53, the amplitude γ is computed as the minimum of "1" and the quotient of the product of the square root of the number of virtual loudspeaker positions and the square root of K_MAX,DES, divided by the Euclidean norm of the mode matrix, i.e.,

γ = min(1, √O·√K_MAX,DES / ||Ψ||₂). (43)

The value in decibels is obtained by

γ_dB = 20·log₁₀(γ). (44)
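The computation carried out in step or stage 53 and the conversion to decibels can be sketched as follows. This is only an illustrative sketch under the assumption that the amplitude is γ = min(1, √O·√K_MAX,DES / ||Ψ||₂), as the verbal description above states, and that the Euclidean norm of the mode matrix has already been obtained in step or stage 52. The function names and the numerical values are hypothetical.

```python
import math


def max_allowed_amplitude(mode_matrix_norm, num_speakers, sqrt_k_max_des):
    """gamma = min(1, sqrt(O) * sqrt(K_MAX,DES) / ||Psi||_2)."""
    return min(1.0, math.sqrt(num_speakers) * sqrt_k_max_des / mode_matrix_norm)


def to_decibels(gamma):
    """gamma_dB = 20 * log10(gamma)."""
    return 20.0 * math.log10(gamma)


# Hypothetical values: O = 4 virtual loudspeakers, sqrt(K_MAX,DES) = 1.5.
GAMMA_LARGE_NORM = max_allowed_amplitude(4.0, 4, 1.5)  # 2 * 1.5 / 4 = 0.75
GAMMA_SMALL_NORM = max_allowed_amplitude(2.0, 4, 1.5)  # limited to 1
GAMMA_DB = to_decibels(GAMMA_LARGE_NORM)
```

With a mode matrix norm of 4 the allowed amplitude is reduced to 0.75, roughly −2.5 dB, whereas with a norm of 2 no reduction is necessary and γ remains 1 (0 dB).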

By way of explanation: from the above derivations it can be seen that, if the amplitudes of the HOA coefficient sequences do not exceed the value √K_MAX,DES·O, i.e., if

||c(lT_S)||_∞ ≤ √K_MAX,DES·O, (45)

all signals before the gain control processing unit will accordingly not exceed this value, which is the requirement for proper HOA compression.

It was found from equation (9) that the amplitude of the HOA coefficient sequence is limited by the following equation

||c(lT_S)||_∞≤||c(lT_S)||₂≤||Ψ||₂·||w(lT_S)||₂。 (46)

Therefore, if γ is set according to equation (43) and the virtual loudspeaker signals in PCM format satisfy

||w(lT_S)||_∞ ≤ γ, (47)

then, in analogy to equation (7), it follows that

||c(lT_S)||_∞ ≤ ||Ψ||₂·√O·γ ≤ √K_MAX,DES·O, (48)

and the requirement (45) is satisfied.

That is, the maximum amplitude value "1" in equation (6) is replaced by the maximum amplitude value γ in equation (47).

Basics of Higher Order Ambisonics

Higher Order Ambisonics (HOA) is based on the description of a sound field within a compact area of interest, which is assumed to be free of sound sources. In this case, the spatio-temporal behavior of the sound pressure p(t, x) at time t and position x within the area of interest is physically fully determined by the homogeneous wave equation. In the following, a spherical coordinate system as shown in Fig. 6 is assumed. In the coordinate system used, the x-axis points to the front, the y-axis points to the left, and the z-axis points to the top. A position in space x = (r, θ, φ)^T is represented by a radius r > 0 (i.e., the distance to the coordinate origin), an inclination angle θ ∈ [0, π] measured from the polar axis z, and an azimuth angle φ ∈ [0, 2π[ measured counterclockwise in the x-y plane from the x-axis. Furthermore, (·)^T denotes transposition.
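The coordinate convention described above may be illustrated by the following conversion from spherical to Cartesian coordinates. The function name is hypothetical, and the sketch assumes the usual relations x = r·sin θ·cos φ, y = r·sin θ·sin φ and z = r·cos θ, which are consistent with an inclination measured from the z-axis and an azimuth measured counterclockwise from the x-axis.

```python
import math


def spherical_to_cartesian(r, theta, phi):
    """Convert (radius, inclination from z-axis, azimuth from x-axis) to (x, y, z)."""
    x = r * math.sin(theta) * math.cos(phi)
    y = r * math.sin(theta) * math.sin(phi)
    z = r * math.cos(theta)
    return (x, y, z)


# Unit-radius examples: front, left and top.
FRONT = spherical_to_cartesian(1.0, math.pi / 2, 0.0)
LEFT = spherical_to_cartesian(1.0, math.pi / 2, math.pi / 2)
TOP = spherical_to_cartesian(1.0, 0.0, 0.0)
```

Under this assumption, θ = π/2 with φ = 0 yields the frontal direction (1, 0, 0), θ = π/2 with φ = π/2 yields the left direction (0, 1, 0), and θ = 0 yields the top direction (0, 0, 1), up to rounding.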

Then, it can be shown from the "Fourier Acoustics" textbook that the Fourier transform of the sound pressure with respect to time, denoted by F_t(·), i.e.,

P(ω, x) = F_t(p(t, x)) = ∫_{−∞}^{∞} p(t, x)·e^(−iωt) dt,

where ω denotes the angular frequency and i denotes the imaginary unit, can be expanded into a series of spherical harmonic functions according to

P(ω = k·c_s, r, θ, φ) = Σ_{n=0}^{N} Σ_{m=−n}^{n} A_n^m(k)·j_n(kr)·S_n^m(θ, φ),

where c_s denotes the speed of sound and k denotes the angular wavenumber, which is related to the angular frequency ω by k = ω/c_s. Furthermore, j_n(·) denotes the spherical Bessel functions of the first kind, and S_n^m(θ, φ) denotes the real-valued spherical harmonic functions of order n and degree m, which are defined in the section Definition of real-valued spherical harmonic functions. The expansion coefficients A_n^m(k) depend only on the angular wavenumber k. Note that it has been implicitly assumed that the sound pressure is spatially band-limited. The series is therefore truncated with respect to the order index n at an upper limit N, which is called the order of the HOA representation.

If the sound field is represented by a superposition of an infinite number of harmonic plane waves of different angular frequencies ω arriving from all possible directions specified by the angle tuple (θ, φ), it can be shown (see B. Rafaely, "Plane-wave decomposition of the sound field on a sphere by spherical convolution", J. Acoust. Soc. Am., vol. 4(116), pages 2149 to 2157, October 2004) that the corresponding plane wave complex amplitude function C(ω, θ, φ) can be expressed by the following spherical harmonic expansion:

C(ω = k·c_s, θ, φ) = Σ_{n=0}^{N} Σ_{m=−n}^{n} C_n^m(k)·S_n^m(θ, φ),

where the expansion coefficients C_n^m(k) are related to the expansion coefficients A_n^m(k) by

A_n^m(k) = i^n·C_n^m(k).

Assuming the individual coefficients C_n^m(k = ω/c_s) to be functions of the angular frequency ω, applying the inverse Fourier transform (denoted by F_t⁻¹(·)) provides the time-domain functions

c_n^m(t) = F_t⁻¹(C_n^m(ω/c_s)) = (1/(2π))·∫_{−∞}^{∞} C_n^m(ω/c_s)·e^(iωt) dω

for each order n and degree m.

These time-domain functions, referred to here as continuous-time HOA coefficient sequences, can be collected in a single vector c(t) by

c(t) = [c_0^0(t) c_1^(−1)(t) c_1^0(t) c_1^1(t) c_2^(−2)(t) c_2^(−1)(t) c_2^0(t) c_2^1(t) c_2^2(t) ... c_N^(N−1)(t) c_N^N(t)]^T.

The position index of an HOA coefficient sequence c_n^m(t) within the vector c(t) is given by n(n+1)+1+m. The total number of elements in the vector c(t) is given by O = (N+1)².
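The indexing rule stated above can be illustrated by the following minimal sketch, which maps an order n and degree m to the one-based position index n(n+1)+1+m and back. The function names are hypothetical, and the inverse mapping is an addition for illustration that follows from the fact that the coefficient sequences of order n occupy the positions n²+1 to (n+1)².

```python
import math


def position_index(n, m):
    """One-based position n(n+1)+1+m of the coefficient sequence c_n^m(t) in c(t)."""
    if n < 0 or abs(m) > n:
        raise ValueError("require n >= 0 and -n <= m <= n")
    return n * (n + 1) + 1 + m


def order_and_degree(index):
    """Inverse mapping: recover (n, m) from a one-based position index."""
    if index < 1:
        raise ValueError("position index starts at 1")
    n = math.isqrt(index - 1)
    m = index - 1 - n * (n + 1)
    return n, m


def num_coefficients(order):
    """Total number O = (N+1)^2 of HOA coefficient sequences for order N."""
    return (order + 1) ** 2
```

For example, position_index(2, -2) yields 5, and the last element c_N^N(t) always has the index (N+1)², which equals the total number O of elements.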

The final Ambisonics format provides the sampled version of c(t) using a sampling frequency f_S as

{c(lT_S)}_{l∈ℕ} = {c(T_S), c(2T_S), c(3T_S), c(4T_S), ...},

where T_S = 1/f_S denotes the sampling period. The elements of c(lT_S) are referred to as discrete-time HOA coefficient sequences, which can be shown to always be real-valued. This property also holds for the continuous-time versions c_n^m(t).

Definition of real-valued spherical harmonic functions

The real-valued spherical harmonic functions S_n^m(θ, φ) (assuming SN3D normalization according to chapter 3.1 of J. Daniel, "Représentation de champs acoustiques, application à la transmission et à la reproduction de scènes sonores complexes dans un contexte multimédia", PhD thesis, Université Paris 6, 2001) are given by

S_n^m(θ, φ) = √((2n+1)·(n−|m|)! / (n+|m|)!)·P_n,|m|(cos θ)·trg_m(φ),

where

trg_m(φ) = √2·cos(mφ) for m > 0, trg_m(φ) = 1 for m = 0, and trg_m(φ) = −√2·sin(mφ) for m < 0.

The associated Legendre functions P_n,m(x) are defined as

P_n,m(x) = (1 − x²)^(m/2)·(d^m/dx^m) P_n(x), m ≥ 0,

with the Legendre polynomial P_n(x) and, unlike in E. G. Williams, "Fourier Acoustics", volume 93 of Applied Mathematical Sciences, Academic Press, 1999, without the Condon-Shortley phase term (−1)^m.
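A minimal numerical sketch of real-valued spherical harmonic functions is given below for illustration. It assumes the form S_n^m(θ, φ) = √((2n+1)(n−|m|)!/(n+|m|)!)·P_n,|m|(cos θ)·trg_m(φ), with trg_m(φ) equal to √2·cos(mφ) for m > 0, 1 for m = 0 and −√2·sin(mφ) for m < 0, and associated Legendre functions P_n,m(x) = (1−x²)^(m/2)·d^m/dx^m P_n(x) without the Condon-Shortley phase term. All function names are hypothetical, and the polynomial-coefficient approach is chosen for readability rather than numerical robustness at high orders.

```python
import math


def legendre_coeffs(n):
    """Coefficients (lowest power first) of the Legendre polynomial P_n(x)."""
    prev, cur = [1.0], [0.0, 1.0]
    if n == 0:
        return prev
    for k in range(1, n):
        # Recurrence: (k+1) P_{k+1}(x) = (2k+1) x P_k(x) - k P_{k-1}(x)
        nxt = [0.0] * (k + 2)
        for i, coef in enumerate(cur):
            nxt[i + 1] += (2 * k + 1) * coef / (k + 1)
        for i, coef in enumerate(prev):
            nxt[i] -= k * coef / (k + 1)
        prev, cur = cur, nxt
    return cur


def associated_legendre(n, m, x):
    """P_{n,m}(x) = (1 - x^2)^(m/2) d^m/dx^m P_n(x) for m >= 0, no Condon-Shortley phase."""
    coeffs = legendre_coeffs(n)
    for _ in range(m):
        coeffs = [i * c for i, c in enumerate(coeffs)][1:]  # differentiate once
    value = sum(c * x ** i for i, c in enumerate(coeffs))
    return max(0.0, 1.0 - x * x) ** (m / 2.0) * value


def real_sh(n, m, theta, phi):
    """Real-valued spherical harmonic S_n^m(theta, phi) under the assumptions stated above."""
    am = abs(m)
    norm = math.sqrt((2 * n + 1) * math.factorial(n - am) / math.factorial(n + am))
    if m > 0:
        trg = math.sqrt(2.0) * math.cos(m * phi)
    elif m == 0:
        trg = 1.0
    else:
        trg = -math.sqrt(2.0) * math.sin(m * phi)
    return norm * associated_legendre(n, am, math.cos(theta)) * trg


def mode_vector(order, theta, phi):
    """Mode vector [S_0^0, S_1^-1, S_1^0, S_1^1, ..., S_N^N] for one direction."""
    return [real_sh(n, m, theta, phi) for n in range(order + 1) for m in range(-n, n + 1)]
```

With these assumptions, the squared Euclidean norm of mode_vector(N, θ, φ) evaluates to (N+1)² for any direction, which agrees with the Euclidean norm N + 1 of the mode vectors used in the derivations above.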

The inventive processing can be carried out by a single processor or electronic circuit, or by several processors or electronic circuits operating in parallel and/or operating on different parts of the inventive processing.

Instructions for operating the one or more processors may be stored in the one or more memories.

Claims

1. A method for decoding a compressed Higher Order Ambisonics (HOA) sound representation of a sound or sound field, the method comprising:

receiving a bitstream comprising a compressed HOA representation, wherein the bitstream comprises a number of HOA coefficients corresponding to the compressed HOA representation, and

decoding the compressed HOA representation based on a smallest integer β_e, wherein the smallest integer β_e is determined based on β_e = ⌈log₂(⌈log₂(√K_MAX·O)⌉ + 1)⌉,

wherein K_MAX = max_{1≤N≤N_MAX} K(N, Ω_1^(N), ..., Ω_O^(N)), N is the order, N_MAX is the maximum order of interest, Ω_1^(N), ..., Ω_O^(N) are the directions of the virtual loudspeakers, O = (N+1)² is the number of HOA coefficient sequences, and K is the ratio between the squared Euclidean norm ||Ψ||₂² of the mode matrix and O,

wherein the content of the first and second substances,

2. An apparatus for decoding a compressed Higher Order Ambisonics (HOA) sound representation of a sound or sound field, the apparatus comprising:

a processor configured to receive a bitstream containing a compressed HOA representation, wherein the bitstream comprises a number of HOA coefficients corresponding to the compressed HOA representation, and the processor is further configured to decode the compressed HOA representation based on a smallest integer β_e, wherein the smallest integer β_e is determined based on β_e = ⌈log₂(⌈log₂(√K_MAX·O)⌉ + 1)⌉,

wherein K_MAX = max_{1≤N≤N_MAX} K(N, Ω_1^(N), ..., Ω_O^(N)), N is the order, N_MAX is the maximum order of interest, Ω_1^(N), ..., Ω_O^(N) are the directions of the virtual loudspeakers, O = (N+1)² is the number of HOA coefficient sequences, and K is the ratio between the squared Euclidean norm ||Ψ||₂² of the mode matrix and O,

3. A non-transitory computer-readable storage medium containing instructions that, when executed by a processor, perform the method of claim 1.