
HK1218589B - Coding of audio scenes - Google Patents


Info

Publication number
HK1218589B
Authority
HK
Hong Kong
Prior art keywords
signals
audio objects
audio
matrix
reconstruction
Application number
HK16106570.7A
Other languages
Chinese (zh)
Other versions
HK1218589A1 (en)
Inventor
Heiko Purnhagen
Lars Villemoes
Leif Jonas Samuelsson
Toni Hirvonen
Original Assignee
Dolby International AB
Application filed by Dolby International AB
Priority claimed from PCT/EP2014/060727 (WO2014187986A1)
Publication of HK1218589A1
Publication of HK1218589B


Description

Encoding of audio scenes

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. Provisional Patent Application No. 61/827,246, filed May 24, 2013, which is hereby incorporated by reference in its entirety.

Technical Field

The invention disclosed herein generally relates to the field of audio encoding and decoding. In particular, it relates to the encoding and decoding of audio scenes comprising audio objects.

Background Art

There are audio coding systems for parametric spatial audio coding. For example, MPEG Surround describes a system for parametric spatial coding of multi-channel audio. MPEG SAOC (Spatial Audio Object Coding) describes a system for parametric coding of audio objects.

On the encoder side, these systems typically downmix the channels/objects into a downmix, which is typically a mono (one channel) or stereo (two channels) downmix, and extract side information that describes the properties of the channels/objects by means of parameters such as level differences and cross-correlations. The downmix and the side information are then encoded and sent to the decoder side. On the decoder side, the channels/objects are reconstructed, i.e. approximated, from the downmix under the control of the parameters of the side information.
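
A minimal numpy sketch of this parametric idea, assuming two object signals, an equal-weight mono downmix, and level-difference/cross-correlation parameters as the side information (the weights and the parameter set are illustrative, not the MPEG Surround/SAOC definitions):

```python
import numpy as np

def parametric_encode(x1, x2):
    """Toy parametric encoder: mono downmix plus level/correlation side info."""
    downmix = 0.5 * (x1 + x2)                     # assumed equal-weight downmix rule
    p1, p2 = np.mean(x1 ** 2), np.mean(x2 ** 2)   # per-object powers
    side_info = {
        "level_diff_db": 10 * np.log10(p1 / p2),             # inter-object level difference
        "correlation": np.mean(x1 * x2) / np.sqrt(p1 * p2),  # normalized cross-correlation
    }
    return downmix, side_info

rng = np.random.default_rng(0)
x1, x2 = rng.standard_normal(48000), 0.3 * rng.standard_normal(48000)
d, info = parametric_encode(x1, x2)
print(info)  # a decoder would approximate x1 and x2 from d under these parameters
```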

A drawback of these systems is that the reconstruction is typically mathematically complex and often relies on assumptions about properties of the audio content that are not explicitly described by the parameters sent as side information. Such assumptions may be, for example, that the channels/objects are considered uncorrelated unless a cross-correlation parameter is sent, or that the downmix of the channels/objects was generated in a specific way. Furthermore, the mathematical complexity and the need for additional assumptions increase significantly as the number of channels in the downmix increases.

Furthermore, the required assumptions are inherently reflected in the algorithmic details of the processing applied on the decoder side. This means that considerable intelligence must be included on the decoder side. This is a disadvantage because it is difficult to upgrade and improve the algorithms when the decoder is, for example, installed in a consumer device where upgrading is difficult or even impossible.

BRIEF DESCRIPTION OF THE DRAWINGS

Hereinafter, example embodiments will be described in more detail with reference to the accompanying drawings, in which:

Fig. 1 is a schematic diagram of an audio encoding/decoding system according to an example embodiment;

Fig. 2 is a schematic diagram of an audio encoding/decoding system with a legacy decoder according to an example embodiment;

Fig. 3 is a schematic diagram of the encoding side of an audio encoding/decoding system according to an example embodiment;

Fig. 4 is a flowchart of an encoding method according to an example embodiment;

Fig. 5 is a schematic diagram of an encoder according to an example embodiment;

Fig. 6 is a schematic diagram of the decoder side of an audio encoding/decoding system according to an example embodiment;

Fig. 7 is a flowchart of a decoding method according to an example embodiment;

Fig. 8 is a schematic diagram of the decoder side of an audio encoding/decoding system according to an example embodiment; and

Fig. 9 is a schematic diagram of the time-frequency transforms performed on the decoder side of an audio encoding/decoding system according to an example embodiment.

All the figures are schematic and generally show only the parts that are necessary to elucidate the invention, while other parts may be omitted or merely suggested. Unless otherwise indicated, the same reference numerals in different figures denote the same components.

DETAILED DESCRIPTION

In view of the above, it is an object to provide an encoder and a decoder, and associated methods, which provide less complex and more flexible reconstruction of audio objects.

I. Overview - Encoder

According to a first aspect, example embodiments propose an encoding method, an encoder, and a computer program product for encoding. The proposed method, encoder, and computer program product may generally have the same features and advantages.

According to an example embodiment, there is provided a method for encoding a time-frequency block of an audio scene comprising at least N audio objects. The method comprises: receiving the N audio objects; generating M downmix signals based on at least the N audio objects; generating a reconstruction matrix with matrix elements, the reconstruction matrix enabling reconstruction of at least the N audio objects from the M downmix signals; and generating a bitstream comprising the M downmix signals and at least some of the matrix elements of the reconstruction matrix.

The number N of audio objects may be equal to or greater than one. The number M of downmix signals may be equal to or greater than one.

With this method, a bitstream is thus generated which comprises the M downmix signals and, as side information, at least some of the matrix elements of the reconstruction matrix. By including the individual matrix elements of the reconstruction matrix in the bitstream, very little intelligence is required on the decoder side. For example, there is no need on the decoder side for complex computation of the reconstruction matrix based on transmitted object parameters and additional assumptions. The mathematical complexity on the decoder side is therefore significantly reduced. Furthermore, since the complexity of the method is independent of the number of downmix signals used, the flexibility regarding the number of downmix signals is increased compared to prior-art methods.

As used herein, an audio scene generally refers to a three-dimensional audio environment that comprises audio elements associated with positions in three-dimensional space, which may be rendered for playback on an audio system.

As used herein, an audio object refers to an element of an audio scene. An audio object typically comprises an audio signal and additional information such as the position of the object in three-dimensional space. The additional information is typically used to optimally render the audio object on a given playback system.

As used herein, a downmix signal refers to a signal which is a combination of at least the N audio objects. Other signals of the audio scene, such as bed channels (described below), may also be combined into the downmix signals. For example, the M downmix signals may correspond to a rendering of the audio scene for a given loudspeaker configuration, such as a standard 5.1 configuration. The number of downmix signals, denoted M herein, is typically (but not necessarily) less than the sum of the number of audio objects and bed channels, which explains why the M downmix signals are referred to as a downmix.

Audio encoding/decoding systems typically divide the time-frequency space into time-frequency blocks, for example by applying suitable filter banks to the input audio signals. A time-frequency block generally means a portion of the time-frequency space corresponding to a time interval and a frequency subband. The time interval may typically correspond to the duration of a time frame used in the audio encoding/decoding system. The frequency subband may typically correspond to one or several adjacent frequency subbands defined by the filter bank used in the encoding/decoding system. In the case where the frequency subband corresponds to several adjacent frequency subbands defined by the filter bank, this allows for non-uniform frequency subbands in the decoding process of the audio signal, for example wider frequency subbands for higher frequencies of the audio signal. In a broadband case, where the audio encoding/decoding system operates on the whole frequency range, the frequency subband of the time-frequency block may correspond to the whole frequency range. The above method discloses the encoding steps for encoding an audio scene during one such time-frequency block. However, it is to be understood that the method may be repeated for each time-frequency block of the audio encoding/decoding system. It is also to be understood that several time-frequency blocks may be encoded simultaneously. Typically, neighboring time-frequency blocks may overlap slightly in time and/or frequency. For example, an overlap in time may be equivalent to a linear interpolation of the elements of the reconstruction matrix in time, i.e. from one time interval to the next. However, this disclosure targets other parts of the encoding/decoding system, and any overlap in time and/or frequency between neighboring time-frequency blocks is left for the skilled person to implement.
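
To make the tiling concrete, here is a small sketch that stands in an STFT for the codec's (unspecified) filter bank and groups its bins into non-uniform frequency subbands, so that one block covers one time frame and one group of bins; the frame length and band edges are illustrative assumptions.

```python
import numpy as np

def tile_signal(x, frame_len=1024, band_edges=(0, 8, 24, 64, 513)):
    """Split a signal into time-frequency blocks: frames in time, grouped bins in frequency."""
    n_frames = len(x) // frame_len
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len)
    spec = np.fft.rfft(frames, axis=1)          # stand-in for the codec's filter bank
    tiles = {}
    for t in range(n_frames):
        for b in range(len(band_edges) - 1):
            lo, hi = band_edges[b], band_edges[b + 1]
            tiles[(t, b)] = spec[t, lo:hi]      # wider bin groups for higher frequencies
    return tiles

x = np.random.default_rng(1).standard_normal(4096)
tiles = tile_signal(x)
print(len(tiles), tiles[(0, 3)].shape)          # 16 blocks; the top band spans 449 bins
```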

According to an example embodiment, the M downmix signals are arranged in a first field of the bitstream using a first format, and the matrix elements are arranged in a second field of the bitstream using a second format, thereby allowing a decoder that only supports the first format to decode and play back the M downmix signals in the first field and to discard the matrix elements in the second field. This has the advantage that the M downmix signals in the bitstream are backwards compatible with legacy decoders that do not implement audio object reconstruction. In other words, a legacy decoder can still decode and play back the M downmix signals of the bitstream, for example by mapping each downmix signal to a channel output of the decoder.

According to an example embodiment, the method may further comprise the step of receiving position data corresponding to each of the N audio objects, wherein the M downmix signals are generated based on the position data. The position data typically associates each audio object with a position in three-dimensional space. The position of an audio object may vary with time. By using the position data when downmixing the audio objects, the audio objects are mixed into the M downmix signals in such a way that, for example, if the M downmix signals are listened to on a system with M output channels, the audio objects sound as if they were approximately located at their respective positions. This is advantageous, for example, when the M downmix signals are to be backwards compatible with legacy decoders.

According to an example embodiment, the matrix elements of the reconstruction matrix are time-varying and frequency-varying. In other words, the matrix elements of the reconstruction matrix may differ between different time-frequency blocks. In this way, great flexibility in the reconstruction of the audio objects is achieved.

According to an example embodiment, the audio scene further comprises a plurality of bed channels. This is common, for example, in cinema audio applications where the audio content comprises bed channels in addition to audio objects. In such a case, the M downmix signals may be generated based on at least the N audio objects and the plurality of bed channels. A bed channel generally means an audio signal that corresponds to a fixed position in three-dimensional space. For example, a bed channel may correspond to one of the output channels of the audio encoding/decoding system. As such, a bed channel may be interpreted as having an associated position in three-dimensional space equal to the position of one of the output loudspeakers of the audio encoding/decoding system. A bed channel may therefore be associated with a label that merely indicates the position of the corresponding output loudspeaker.

When the audio scene comprises bed channels, the reconstruction matrix may comprise matrix elements that enable reconstruction of the bed channels from the M downmix signals.

In some cases, an audio scene may comprise a vast number of audio objects. In order to reduce the complexity and the amount of data required to represent the audio scene, the audio scene may be simplified by reducing the number of audio objects. Thus, if the audio scene initially comprises K audio objects, where K > N, the method may further comprise the steps of receiving the K audio objects and reducing the K audio objects to the N audio objects by clustering the K audio objects into N clusters and representing each cluster by one audio object.

In order to simplify the scene, the method may further comprise the step of receiving position data corresponding to each of the K audio objects, wherein the clustering of the K objects into N clusters is based on the positional distances between the K objects as given by the position data of the K audio objects. For example, audio objects whose positions are close to each other in three-dimensional space may be clustered together.

As described above, example embodiments of the method are flexible with respect to the number of downmix signals used. In particular, the method may be used advantageously when there are more than two downmix signals, i.e. when M is greater than two. For example, five or seven downmix signals may be used, corresponding to conventional 5.1 or 7.1 audio setups. This is advantageous since, in contrast to prior-art systems, the mathematical complexity of the proposed encoding principle remains the same regardless of the number of downmix signals used.

In order to further improve the reconstruction of the N audio objects, the method may further comprise: forming L auxiliary signals from the N audio objects; including, in the reconstruction matrix, matrix elements that enable reconstruction of at least the N audio objects from the M downmix signals and the L auxiliary signals; and including the L auxiliary signals in the bitstream. The auxiliary signals thus serve as helper signals that may, for example, capture aspects of the audio objects that are difficult to reconstruct from the downmix signals. The auxiliary signals may also be based on the bed channels. The number L of auxiliary signals may be equal to or greater than one.

According to an example embodiment, an auxiliary signal may correspond to a particularly important audio object, such as an audio object representing dialogue. Thus, at least one of the L auxiliary signals may be equal to one of the N audio objects. This allows important objects to be rendered with higher quality than if they had to be reconstructed from the M downmix channels alone. In practice, the audio content provider may have prioritized and/or labeled some of the audio objects as objects that are preferably included individually as auxiliary signals. Furthermore, this makes modification/processing of these objects prior to rendering less prone to artifacts. As a trade-off between bitrate and quality, a mix of two or more audio objects may also be sent as an auxiliary signal. In other words, at least one of the L auxiliary signals may be formed as a combination of at least two of the N audio objects.

According to an example embodiment, the auxiliary signals represent signal dimensions of the audio objects that are lost in the generation of the M downmix signals, for example because the number of independent objects is typically greater than the number of downmix channels, or because two objects are associated with positions that cause them to be mixed into the same downmix signal. An example of the latter is when two objects are separated only vertically but share the same position when projected onto the horizontal plane, which means that the two objects will typically be rendered to the same downmix channel of a standard 5.1 surround loudspeaker setup, in which all loudspeakers lie in the same horizontal plane. Specifically, the M downmix signals span a hyperplane in the signal space. By forming linear combinations of the M downmix signals, only audio signals that lie in this hyperplane can be reconstructed. To improve the reconstruction, auxiliary signals that do not lie in the hyperplane may be included, thereby making it possible to also reconstruct signals that do not lie in the hyperplane. In other words, according to an example embodiment, at least one of the plurality of auxiliary signals does not lie in the hyperplane spanned by the M downmix signals. For example, at least one of the plurality of auxiliary signals may be orthogonal to the hyperplane spanned by the M downmix signals.
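
As a sketch of this geometric point, assuming plain sample vectors (all names below are illustrative): the part of an object that lies outside the hyperplane spanned by the downmix signals is exactly its least-squares residual with respect to the downmix, so that residual can serve as an auxiliary signal orthogonal to the hyperplane.

```python
import numpy as np

rng = np.random.default_rng(2)
D = rng.standard_normal((2, 1000))          # M = 2 downmix signals (rows)
s = rng.standard_normal(1000)               # one object, generally not in span(D)

# Least-squares projection of the object onto the hyperplane spanned by the downmix.
coeffs, *_ = np.linalg.lstsq(D.T, s, rcond=None)
projection = coeffs @ D
auxiliary = s - projection                  # the component the downmix cannot represent

print(np.round(auxiliary @ D.T, 6))         # ~[0. 0.]: the auxiliary is orthogonal to span(D)
```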

According to example embodiments, there is provided a computer-readable medium comprising computer code instructions adapted to carry out any method of the first aspect when run on a device having processing capability.

According to example embodiments, there is provided an encoder for encoding a time-frequency block of an audio scene comprising at least N audio objects, the encoder comprising: a receiving component configured to receive the N audio objects; a downmix generation component configured to receive the N audio objects from the receiving component and to generate M downmix signals based on at least the N audio objects; an analysis component configured to generate a reconstruction matrix with matrix elements, the reconstruction matrix enabling reconstruction of at least the N audio objects from the M downmix signals; and a bitstream generation component configured to receive the M downmix signals from the downmix generation component and the reconstruction matrix from the analysis component, and to generate a bitstream comprising the M downmix signals and at least some of the matrix elements of the reconstruction matrix.

II. Overview - Decoder

According to a second aspect, example embodiments propose a decoding method, a decoding device, and a computer program product for decoding. The proposed method, device, and computer program product may generally have the same features and advantages.

The advantages relating to the features and arrangements presented in the overview of the encoder above may generally be valid for the corresponding features and arrangements of the decoder.

According to an example embodiment, there is provided a method for decoding a time-frequency block of an audio scene comprising at least N audio objects, the method comprising the steps of: receiving a bitstream comprising M downmix signals and at least some of the matrix elements of a reconstruction matrix; generating the reconstruction matrix using the matrix elements; and reconstructing the N audio objects from the M downmix signals using the reconstruction matrix.

According to an example embodiment, the M downmix signals are arranged in a first field of the bitstream using a first format, and the matrix elements are arranged in a second field of the bitstream using a second format, thereby allowing a decoder that only supports the first format to decode and play back the M downmix signals in the first field and to discard the matrix elements in the second field.

According to an example embodiment, the matrix elements of the reconstruction matrix are time-varying and frequency-varying.

According to an example embodiment, the audio scene further comprises a plurality of bed channels, and the method further comprises reconstructing the bed channels from the M downmix signals using the reconstruction matrix.

According to an example embodiment, the number M of downmix signals is greater than two.

According to an example embodiment, the method further comprises: receiving L auxiliary signals formed from the N audio objects; and reconstructing the N audio objects from the M downmix signals and the L auxiliary signals using the reconstruction matrix, wherein the reconstruction matrix comprises matrix elements enabling reconstruction of at least the N audio objects from the M downmix signals and the L auxiliary signals.

According to an example embodiment, at least one of the L auxiliary signals is identical to one of the N audio objects.

According to an example embodiment, at least one of the L auxiliary signals is a combination of the N audio objects.

According to an example embodiment, the M downmix signals span a hyperplane, and at least one of the plurality of auxiliary signals does not lie in the hyperplane spanned by the M downmix signals.

According to an example embodiment, the at least one of the plurality of auxiliary signals that does not lie in the hyperplane is orthogonal to the hyperplane spanned by the M downmix signals.

As described above, audio encoding/decoding systems typically operate in the frequency domain. For this purpose, the audio encoding/decoding system performs time-to-frequency transforms of the audio signals using filter banks. Different types of time-to-frequency transforms may be used. For example, the M downmix signals may be represented with respect to a first frequency domain, and the reconstruction matrix may be represented with respect to a second frequency domain. In order to reduce the computational burden of the decoder, it is advantageous to choose the first and second frequency domains in a clever way. For example, the first and second frequency domains may be chosen to be the same frequency domain, such as a Modified Discrete Cosine Transform (MDCT) domain. In this way, a transform of the M downmix signals from the first frequency domain into the time domain and then into the second frequency domain can be avoided in the decoder. Alternatively, the first and second frequency domains may be chosen such that the transform from the first frequency domain to the second frequency domain can be implemented jointly, so that there is no need to go via the time domain between the first and second frequency domains.

The method may further comprise receiving position data corresponding to the N audio objects, and rendering the N audio objects using the position data to create at least one output audio channel. In this way, the reconstructed N audio objects are mapped to the output channels of the audio encoder/decoder system based on their positions in three-dimensional space.

The rendering is preferably performed in a frequency domain. In order to reduce the computational burden of the decoder, the frequency domain of the rendering is preferably chosen in a clever way with respect to the frequency domain in which the audio objects are reconstructed. For example, if the reconstruction matrix is represented with respect to a second frequency domain corresponding to a second filter bank, and the rendering is performed in a third frequency domain corresponding to a third filter bank, the second and third filter banks are preferably chosen to be at least partly the same filter bank. For example, the second and third filter banks may comprise a Quadrature Mirror Filter (QMF) domain. Alternatively, the second and third frequency domains may comprise an MDCT filter bank. According to an example embodiment, the third filter bank may consist of a sequence of filter banks, such as a QMF filter bank followed by a Nyquist filter bank. If so, at least one of the filter banks of the sequence (the first filter bank of the sequence) is equal to the second filter bank. In this way, the second and third filter banks may be said to be at least partly the same filter bank.

According to example embodiments, there is provided a computer-readable medium comprising computer code instructions adapted to carry out any method of the second aspect when run on a device having processing capability.

According to example embodiments, there is provided a decoder for decoding a time-frequency block of an audio scene comprising at least N audio objects, the decoder comprising: a receiving component configured to receive a bitstream comprising M downmix signals and at least some of the matrix elements of a reconstruction matrix; a reconstruction matrix generation component configured to receive the matrix elements from the receiving component and to generate the reconstruction matrix based on the matrix elements; and a reconstruction component configured to receive the reconstruction matrix from the reconstruction matrix generation component and to reconstruct the N audio objects from the M downmix signals using the reconstruction matrix.

III. Example Embodiments

Fig. 1 illustrates an encoding/decoding system 100 for encoding/decoding an audio scene 102. The encoding/decoding system 100 comprises an encoder 108, a bitstream generation component 110, a bitstream decoding component 118, a decoder 120, and a renderer 122.

The audio scene 102 is represented by one or more audio objects 106a, such as N audio objects, i.e. audio signals. The audio scene 102 may also comprise one or more bed channels 106b, i.e. signals that correspond directly to one of the output channels of the renderer 122. The audio scene 102 is further represented by metadata comprising position information 104. The position information 104 is used, for example, by the renderer 122 when rendering the audio scene 102. The position information 104 may associate the audio objects 106a, and possibly also the bed channels 106b, with spatial positions in three-dimensional space as a function of time. The metadata may further comprise other types of data that are useful for rendering the audio scene 102.

The encoding part of the system 100 comprises the encoder 108 and the bitstream generation component 110. The encoder 108 receives the audio objects 106a, the bed channels 106b (if present), and the metadata comprising the position information 104. Based on this, the encoder 108 generates one or more downmix signals 112, such as M downmix signals. By way of example, the downmix signals 112 may correspond to the channels [Lf Rf Cf Ls Rs LFE] of a 5.1 audio system. ("L" stands for left, "R" for right, "C" for center, "f" for front, "s" for surround, and "LFE" for low-frequency effects.)

The encoder 108 also generates side information. The side information comprises a reconstruction matrix. The reconstruction matrix comprises matrix elements 114 that enable reconstruction of at least the audio objects 106a from the downmix signals 112. The reconstruction matrix may further enable reconstruction of the bed channels 106b.

The encoder 108 transmits the M downmix signals 112 and at least some of the matrix elements 114 to the bitstream generation component 110. The bitstream generation component 110 generates a bitstream 116 comprising the M downmix signals 112 and at least some of the matrix elements 114 by performing quantization and encoding. The bitstream generation component 110 also receives the metadata comprising the position information 104 for inclusion in the bitstream 116.

The decoding part of the system comprises the bitstream decoding component 118 and the decoder 120. The bitstream decoding component 118 receives the bitstream 116 and performs decoding and dequantization to extract the M downmix signals 112 and the side information comprising at least some of the matrix elements 114 of the reconstruction matrix. The M downmix signals 112 and the matrix elements 114 are then input to the decoder 120, which based on them generates a reconstruction 106' of the N audio objects 106a and possibly also of the bed channels 106b. The reconstruction 106' of the N audio objects is thus an approximation of the N audio objects 106a and possibly also of the bed channels 106b.

For example, if the downmix signals 112 correspond to the channels [Lf Rf Cf Ls Rs LFE] of a 5.1 configuration, the decoder 120 may reconstruct the objects 106' using only the full-band channels [Lf Rf Cf Ls Rs], thereby ignoring the LFE. The same applies to other channel configurations. The LFE channel of the downmix 112 may be sent (essentially unmodified) to the renderer 122.

The reconstructed audio objects 106', together with the position information 104, are then input to the renderer 122. Based on the reconstructed audio objects 106' and the position information 104, the renderer 122 renders an output signal 124 having a format suitable for playback on a desired loudspeaker or headphone configuration. Typical output formats are a standard 5.1 surround setup (3 front loudspeakers, 2 surround loudspeakers, and 1 low-frequency effects (LFE) loudspeaker) or a 7.1+4 setup (3 front loudspeakers, 4 surround loudspeakers, 1 LFE loudspeaker, and 4 elevated loudspeakers).

In some embodiments, the original audio scene may comprise a vast number of audio objects. Processing a vast number of audio objects comes at the cost of high computational complexity. Also, the amount of side information (the position information 104 and the reconstruction matrix elements 114) to be embedded in the bitstream 116 depends on the number of audio objects. Typically, the amount of side information grows linearly with the number of audio objects. Thus, in order to save computational complexity and/or to reduce the bitrate required for encoding the audio scene, it is advantageous to reduce the number of audio objects prior to encoding. For this purpose, the audio encoder/decoder system 100 may further comprise a scene simplification module (not shown) arranged upstream of the encoder 108. The scene simplification module takes the original audio objects and possibly also the bed channels as input, and performs processing in order to output the audio objects 106a. The scene simplification module reduces the number of original audio objects, say K, to a more feasible number N of audio objects 106a by performing clustering. More precisely, the scene simplification module organizes the K original audio objects, and possibly also the bed channels, into N clusters. Typically, the clusters are defined based on the spatial proximity of the K original audio objects/bed channels in the audio scene. In order to determine the spatial proximity, the scene simplification module may take the position information of the original audio objects/bed channels as input. When the scene simplification module has formed the N clusters, it proceeds to represent each cluster by one audio object. For example, the audio object representing a cluster may be formed as a sum of the audio objects/bed channels forming part of the cluster. More specifically, the audio content of the audio objects/bed channels may be added to generate the audio content of the representative audio object. Furthermore, the positions of the audio objects/bed channels in the cluster may be averaged to give a position of the representative audio object. The scene simplification module includes the positions of the representative audio objects in the position data 104. Furthermore, the scene simplification module outputs the representative audio objects, which constitute the N audio objects 106a of Fig. 1.
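
A minimal sketch of such a scene-simplification step, assuming a plain k-means-style clustering on the object positions; the summation of member signals and the averaging of member positions follow the description above, while the function and variable names are illustrative.

```python
import numpy as np

def simplify_scene(signals, positions, n_clusters, n_iter=20, seed=0):
    """Cluster K objects into N clusters by position; each cluster becomes one object."""
    rng = np.random.default_rng(seed)
    centers = positions[rng.choice(len(positions), n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assign each object to the nearest cluster centre (Euclidean distance in 3-D).
        dist = np.linalg.norm(positions[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = positions[labels == c].mean(axis=0)
    # Representative object: sum of the member signals, average of the member positions.
    rep_signals = np.stack([signals[labels == c].sum(axis=0) for c in range(n_clusters)])
    return rep_signals, centers

K, N, T = 16, 4, 1000
rng = np.random.default_rng(3)
sig = rng.standard_normal((K, T))              # K original object signals
pos = rng.uniform(-1, 1, size=(K, 3))          # their (x, y, z) positions
obj_signals, obj_positions = simplify_scene(sig, pos, N)
print(obj_signals.shape, obj_positions.shape)  # (4, 1000) (4, 3)
```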

The M downmix signals 112 may be arranged in a first field of the bitstream 116 using a first format. The matrix elements 114 may be arranged in a second field of the bitstream 116 using a second format. In this way, a decoder that only supports the first format is able to decode and play back the M downmix signals 112 in the first field and to discard the matrix elements 114 in the second field.
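
The sketch below illustrates why such a layout is backwards compatible, assuming a toy length-prefixed container rather than the actual bitstream syntax: a legacy parser reads only the first field and skips the rest, while an object-aware parser would read both fields.

```python
import struct

def pack_frame(downmix_payload: bytes, matrix_payload: bytes) -> bytes:
    """Toy container: two length-prefixed fields (downmix in format 1, matrix in format 2)."""
    return (struct.pack(">I", len(downmix_payload)) + downmix_payload +
            struct.pack(">I", len(matrix_payload)) + matrix_payload)

def legacy_parse(frame: bytes) -> bytes:
    """A legacy decoder understands only the first field and discards everything after it."""
    n = struct.unpack(">I", frame[:4])[0]
    return frame[4:4 + n]                      # the coded downmix; the matrix field is ignored

frame = pack_frame(b"coded downmix", b"coded matrix elements")
print(legacy_parse(frame))                     # b'coded downmix'
```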

The audio encoder/decoder system 100 of Fig. 1 supports both the first format and the second format. More precisely, the decoder 120 is configured to interpret both the first and the second format, which means that it is capable of reconstructing the objects 106' based on the M downmix signals 112 and the matrix elements 114.

Fig. 2 illustrates an audio encoder/decoder system 200. The encoding part 108, 110 of the system 200 corresponds to that of Fig. 1. However, the decoding part of the audio encoder/decoder system 200 differs from that of the audio encoder/decoder system 100 of Fig. 1. The audio encoder/decoder system 200 comprises a legacy decoder 230 which supports the first format but not the second format. As a result, the legacy decoder 230 of the audio encoder/decoder system 200 is not capable of reconstructing the audio objects/bed channels 106a, 106b. However, since the legacy decoder 230 supports the first format, it can still decode the M downmix signals 112 to generate an output 224, which is a channel-based representation, such as a 5.1 representation, suitable for direct playback over a corresponding multi-channel loudspeaker setup. This property of the downmix signals is referred to as backwards compatibility, meaning that a legacy decoder that does not support the second format, i.e. cannot interpret the side information comprising the matrix elements 114, can still decode and play back the M downmix signals 112.

The operation of the encoder side of the audio encoding/decoding system 100 will now be described in more detail with reference to Fig. 3 and the flowchart of Fig. 4.

Fig. 3 illustrates the encoder 108 and the bitstream generation component 110 of Fig. 1 in more detail. The encoder 108 has a receiving component (not shown), a downmix generation component 318, and an analysis component 328.

In step E02, the receiving component of the encoder 108 receives the N audio objects 106a and the bed channels 106b, if present. The encoder 108 may also receive the position data 104. Using vector notation, the N audio objects may be represented by a vector S = [S1 S2 ... SN]^T, and the bed channels by a vector B. Together, the N audio objects and the bed channels may be represented by a vector A = [B^T S^T]^T.

In step E04, the downmix generation component 318 generates M downmix signals 112 from the N audio objects 106a and the bed channels 106b, if present. Using vector notation, the M downmix signals may be represented by a vector D = [D1 D2 ... DM]^T comprising the M downmix signals. In general, a downmix of a plurality of signals is a combination of the signals, such as a linear combination of the signals. By way of example, the M downmix signals may correspond to a particular loudspeaker configuration, such as the configuration of the loudspeakers [Lf Rf Cf Ls Rs LFE] in a 5.1 loudspeaker configuration.

The downmix generation component 318 may use the position information 104 when generating the M downmix signals, so that the objects are combined into the different downmix signals based on their positions in three-dimensional space. This is particularly relevant when the M downmix signals themselves correspond to a particular loudspeaker configuration, as in the example above. By way of example, the downmix generation component 318 may derive a rendering matrix Pd (corresponding to the rendering matrix applied in the renderer 122 of Fig. 1) based on the position information, and use it to generate the downmix according to D = Pd * [B^T S^T]^T.
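
A small numpy sketch of this downmix step, assuming a simple constant-power panning rule to build Pd from the object positions (the panning rule and the absence of bed channels are illustrative assumptions):

```python
import numpy as np

def downmix(A, Pd):
    """D = Pd * A, where the rows of A are the bed channels and audio objects."""
    return Pd @ A

N, M, T = 4, 2, 1000                   # 4 objects, a stereo downmix, 1000 samples
rng = np.random.default_rng(4)
A = rng.standard_normal((N, T))        # object signals as rows (no bed channels here)
x = rng.uniform(-1.0, 1.0, size=N)     # left/right position of each object in [-1, 1]

# Illustrative constant-power pan law: per-object gains derived from position only.
theta = (x + 1) * np.pi / 4
Pd = np.stack([np.cos(theta), np.sin(theta)])   # shape (M, N)
D = downmix(A, Pd)
print(Pd.shape, D.shape)               # (2, 4) (2, 1000)
```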

The N audio objects 106a and the bed channels 106b, if present, are also input to the analysis component 328. The analysis component 328 typically operates on individual time-frequency blocks of the input audio signals 106a, 106b. For this purpose, the N audio objects 106a and the bed channels 106b may be fed through a filter bank 338, for example a QMF bank, which performs a time-to-frequency transform of the input audio signals 106a, 106b. In particular, the filter bank 338 is associated with a plurality of frequency subbands. The frequency resolution of a time-frequency block corresponds to one or more of these frequency subbands. The frequency resolution of the time-frequency blocks may be non-uniform, i.e. it may vary with frequency. For example, a lower frequency resolution may be used for high frequencies, meaning that a time-frequency block in the high-frequency range may correspond to several of the frequency subbands defined by the filter bank 338.

In step E06, the analysis component 328 generates a reconstruction matrix, denoted R1 herein. The generated reconstruction matrix consists of a plurality of matrix elements. The reconstruction matrix R1 is such that it enables (approximate) reconstruction of the N audio objects 106a, and possibly also of the bed channels 106b, from the M downmix signals 112 in the decoder.

The analysis component 328 may take different approaches to generating the reconstruction matrix. For example, a minimum mean square error (MMSE) prediction approach may be used, which takes the N audio objects 106a/bed channels 106b as well as the M downmix signals 112 as input. This approach may be described as aiming to derive a reconstruction matrix that minimizes the mean square error of the reconstructed audio objects/bed channels. In particular, the approach reconstructs the N audio objects/bed channels using a candidate reconstruction matrix and compares the reconstructed audio objects/bed channels with the input audio objects 106a/bed channels 106b in terms of the mean square error. The candidate reconstruction matrix that minimizes the mean square error is selected as the reconstruction matrix, and its matrix elements 114 are the output of the analysis component 328.
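
A sketch of the MMSE idea, assuming the familiar least-squares closed form built from (cross-)covariances measured over one time-frequency block; the small regularization term is an added assumption for numerical safety, not part of the description above.

```python
import numpy as np

def mmse_reconstruction_matrix(A, D, eps=1e-9):
    """R1 minimizing E||A - R1*D||^2 over the block: R1 = C_AD * inv(C_DD)."""
    C_AD = A @ D.T                                   # cross-covariance objects/downmix
    C_DD = D @ D.T                                   # covariance of the downmix
    return C_AD @ np.linalg.inv(C_DD + eps * np.eye(D.shape[0]))

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 1000))                   # N = 4 objects (rows) in one block
Pd = rng.standard_normal((2, 4))                     # some downmix matrix, M = 2
D = Pd @ A                                           # the M downmix signals
R1 = mmse_reconstruction_matrix(A, D)
print(np.round(np.mean((A - R1 @ D) ** 2), 4))       # mean-square reconstruction error
```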

The MMSE approach requires estimates of correlation and covariance matrices involving the N audio objects 106a/bed channels 106b and the M downmix signals 112. With the approach described above, these correlations/covariances are measured based on the N audio objects 106a/bed channels 106b and the M downmix signals 112. In an alternative, model-based approach, the analysis component 328 takes the position data 104 as input instead of the M downmix signals 112. By making certain assumptions, for example that the N audio objects are mutually uncorrelated, and combining these assumptions with the downmix rule applied in the downmix generation component 318, the analysis component 328 can compute the correlation and covariance matrices required to carry out the MMSE approach described above.
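
A brief sketch of this model-based alternative, under the stated assumption of mutually uncorrelated objects with known (here, simply assumed) powers: the required covariances then follow from the downmix matrix alone, so the downmix signals themselves are not needed.

```python
import numpy as np

# Assumed per-object powers in this block, and the downmix rule used by the encoder.
object_powers = np.array([1.0, 0.5, 0.25, 2.0])     # diagonal of C_AA (objects uncorrelated)
Pd = np.array([[0.7, 0.0, 0.5, 0.3],
               [0.0, 0.7, 0.5, 0.3]])               # illustrative M x N downmix matrix

C_AA = np.diag(object_powers)
C_AD = C_AA @ Pd.T                                  # model-based cross-covariance
C_DD = Pd @ C_AA @ Pd.T                             # model-based downmix covariance
R1 = C_AD @ np.linalg.inv(C_DD)
print(np.round(R1, 3))                              # reconstruction matrix for this block
```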

The matrix elements 114 of the reconstruction matrix and the M downmix signals 112 are then input to the bitstream generation component 110. In step E08, the bitstream generation component 110 quantizes and encodes the M downmix signals 112 and at least some of the matrix elements 114 of the reconstruction matrix, and arranges them in the bitstream 116. In particular, the bitstream generation component 110 may arrange the M downmix signals 112 in a first field of the bitstream 116 using a first format. Furthermore, the bitstream generation component 110 may arrange the matrix elements 114 in a second field of the bitstream 116 using a second format. As previously described with reference to Fig. 2, this allows a legacy decoder that only supports the first format to decode and play back the M downmix signals 112 and to discard the matrix elements 114 in the second field.

Fig. 5 illustrates an alternative embodiment of the encoder 108. In comparison with the encoder shown in Fig. 3, the encoder 508 of Fig. 5 further enables one or more auxiliary signals to be included in the bitstream 116. For this purpose, the encoder 508 comprises an auxiliary signal generation component 548. The auxiliary signal generation component 548 receives the audio objects 106a/bed channels 106b and generates one or more auxiliary signals 512 based on them. The auxiliary signal generation component 548 may, for example, generate the auxiliary signals 512 as combinations of the audio objects 106a/bed channels 106b. Denoting the auxiliary signals by a vector C = [C1 C2 ... CL]^T, the auxiliary signals may be generated as C = Q * [B^T S^T]^T, where Q is a matrix that may be time-variant and frequency-variant. This includes the case where an auxiliary signal is equal to one or more of the audio objects, as well as the case where an auxiliary signal is a linear combination of audio objects. For example, an auxiliary signal may represent a particularly important object, such as dialogue.
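
A short sketch of forming auxiliary signals as C = Q * A, assuming one auxiliary equal to a dialogue object and one equal to an even mix of two other objects; the Q matrix shown is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((4, 1000))       # rows of A are the objects; index 1 is, say, dialogue

# Q picks out the dialogue object unchanged and an equal-weight mix of objects 2 and 3.
Q = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5]])     # L = 2 auxiliary signals
C = Q @ A                                # C = Q * [B^T S^T]^T in the notation above
print(C.shape)                           # (2, 1000)
```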

The purpose of the auxiliary signals 512 is to improve the reconstruction of the audio objects 106a/bed channels 106b in the decoder. More specifically, on the decoder side, the audio objects 106a/bed channels 106b may be reconstructed based on the M downmix signals 112 as well as the L auxiliary signals 512. The reconstruction matrix will therefore comprise matrix elements 114 that enable reconstruction of the audio objects/bed channels from the M downmix signals 112 as well as the L auxiliary signals.

Accordingly, the L auxiliary signals 512 may be input to the analysis component 328, so that the L auxiliary signals 512 are taken into account when generating the reconstruction matrix. The analysis component 328 may also send a control signal to the auxiliary signal generation component 548. For example, the analysis component 328 may control which audio objects/bed channels are included in the auxiliary signals and how they are included. In particular, the analysis component 328 may control the choice of the Q matrix. This control may, for example, be based on the MMSE approach described above, so that the auxiliary signals are chosen such that the reconstructed audio objects/bed channels are as close as possible to the audio objects 106a/bed channels 106b.

The operation of the decoder side of the audio encoding/decoding system 100 will now be described in more detail with reference to Fig. 6 and the flowchart of Fig. 7.

Fig. 6 illustrates the bitstream decoding component 118 and the decoder 120 of Fig. 1 in more detail. The decoder 120 comprises a reconstruction matrix generation component 622 and a reconstruction component 624.

In step D02, the bitstream decoding component 118 receives the bitstream 116. The bitstream decoding component 118 decodes and dequantizes the information in the bitstream 116 in order to extract the M downmix signals 112 and at least some of the matrix elements 114 of the reconstruction matrix.

The reconstruction matrix generation component 622 receives the matrix elements 114 and proceeds, in step D04, to generate a reconstruction matrix 614. The reconstruction matrix generation component 622 generates the reconstruction matrix 614 by arranging the matrix elements 114 at the appropriate positions in the matrix. If not all matrix elements of the reconstruction matrix are received, the reconstruction matrix generation component 622 may, for example, insert zeros in place of the missing elements.

The reconstruction matrix 614 and the M downmix signals are then input to the reconstruction component 624. The reconstruction component 624 then, in step D06, reconstructs the N audio objects and, if applicable, the bed channels. In other words, the reconstruction component 624 generates an approximation 106' of the N audio objects 106a/bed channels 106b.

By way of example, the M downmix signals may correspond to a particular loudspeaker configuration, such as the configuration of the loudspeakers [Lf Rf Cf Ls Rs LFE] in a 5.1 loudspeaker configuration. If so, the reconstruction component 624 may base the reconstruction of the objects 106' only on the downmix signals that correspond to the full-band channels of the loudspeaker configuration. As explained above, the band-limited signal (the low-frequency LFE signal) may be sent essentially unmodified to the renderer.

The reconstruction component 624 typically operates in the frequency domain. More precisely, the reconstruction component 624 operates on individual time-frequency blocks of its input signals. Therefore, the M downmix signals 112 are typically subjected to a time-to-frequency transform 623 before being input to the reconstruction component 624. The time-to-frequency transform 623 is typically the same as, or similar to, the transform 338 applied on the encoder side. For example, the time-to-frequency transform 623 may be a QMF transform.

In order to reconstruct the audio objects/bed channels 106', the reconstruction component 624 applies a matrix operation. More specifically, using the notation introduced above, the reconstruction component 624 may generate an approximation A' of the audio objects/bed channels as A' = R1 * D. The reconstruction matrix R1 may vary as a function of time and frequency. Thus, the reconstruction matrix may differ between the different time-frequency blocks processed by the reconstruction component 624.
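
A compact sketch of this decoder-side step, assuming the downmix and the reconstruction matrix are stored per (frame, band) block as described above; the dictionary layout is an illustrative choice.

```python
import numpy as np

def reconstruct_tiles(downmix_tiles, matrices):
    """Apply A' = R1 * D independently in every time-frequency block."""
    return {key: matrices[key] @ D for key, D in downmix_tiles.items()}

rng = np.random.default_rng(7)
keys = [(t, b) for t in range(2) for b in range(3)]              # 2 frames x 3 bands
downmix_tiles = {k: rng.standard_normal((2, 64)) for k in keys}  # M = 2 downmix rows per block
matrices = {k: rng.standard_normal((4, 2)) for k in keys}        # N x M matrix, varying per block
objects = reconstruct_tiles(downmix_tiles, matrices)
print(objects[(0, 0)].shape)                                     # (4, 64) reconstructed objects
```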

在从解码器120输出之前,重构的音频对象/音床声道106'通常被变换回时域625。The reconstructed audio objects/bed channels 106 ′ are typically transformed back to the time domain 625 before being output from the decoder 120 .

图8图示出当比特流116额外地包括辅助信号时的情况。与图7的实施例相比，比特流解码部件118现在额外地对来自比特流116的一个或更多个辅助信号512进行解码。辅助信号512被输入到重构部件624，辅助信号512在重构部件624处被包括在音频对象/音床声道的重构中。更具体地，重构部件624通过应用矩阵运算A'=R1*[D^T C^T]^T生成音频对象/音床声道。FIG. 8 illustrates the situation where the bitstream 116 additionally includes auxiliary signals. Compared to the embodiment of FIG. 7, the bitstream decoding component 118 now additionally decodes one or more auxiliary signals 512 from the bitstream 116. The auxiliary signals 512 are input to the reconstruction component 624, where they are included in the reconstruction of the audio objects/bed channels. More specifically, the reconstruction component 624 generates the audio objects/bed channels by applying the matrix operation A' = R1 * [D^T C^T]^T.
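Extending the previous sketch to the auxiliary-signal case is straightforward: the M downmix samples and the L auxiliary samples of each tile are stacked into a single vector before the multiplication, so that R1 has N rows and M + L columns. Again, the shapes are assumptions made for illustration.

import numpy as np

def reconstruct_with_aux(R, D, C):
    # Apply A' = R1 * [D^T C^T]^T per tile by stacking downmix and auxiliary
    # samples into one vector of length M + L.
    stacked = np.concatenate([D, C], axis=-1)     # (tiles, M + L)
    return np.einsum("tnk,tk->tn", R, stacked)    # (tiles, N)

tiles, N, M, L = 10, 4, 2, 1
R = np.random.randn(tiles, N, M + L)
D = np.random.randn(tiles, M)                     # downmix samples per tile
C = np.random.randn(tiles, L)                     # auxiliary signals 512 per tile
print(reconstruct_with_aux(R, D, C).shape)        # (10, 4)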

图9图示出在图1的音频编码/解码系统100的解码器侧使用的不同的时频变换。比特流解码部件118接收比特流116。解码和去量化部件918对比特流116进行解码和去量化，以提取位置信息104、M个下混信号112和重构矩阵的矩阵元素114。FIG. 9 illustrates different time-frequency transforms used at the decoder side of the audio encoding/decoding system 100 of FIG. 1. The bitstream decoding component 118 receives the bitstream 116. The decoding and dequantization component 918 decodes and dequantizes the bitstream 116 to extract the position information 104, the M downmix signals 112 and the matrix elements 114 of the reconstruction matrix.

在该阶段，通常在第一频域中表示M个下混信号112，第一频域对应于在本文中由T/F_C和F/T_C表示以分别用于从时域到第一频域的变换和从第一频域到时域的变换的第一组时频滤波器组。通常，对应于第一频域的滤波器组可以实现重叠窗变换，诸如MDCT和反MDCT。比特流解码部件118可以包括通过使用滤波器组F/T_C将M个下混信号112变换到时域的变换部件901。At this stage, the M downmix signals 112 are typically represented in a first frequency domain, which corresponds to a first set of time-frequency filter banks, denoted herein by T/F_C and F/T_C for the transform from the time domain to the first frequency domain and the transform from the first frequency domain to the time domain, respectively. Typically, the filter banks corresponding to the first frequency domain may implement overlapped windowed transforms, such as the MDCT and the inverse MDCT. The bitstream decoding component 118 may include a transform component 901 that transforms the M downmix signals 112 to the time domain by using the filter bank F/T_C.

解码器120，尤其是重构部件624通常关于第二频域处理信号。第二频域对应于在本文中由T/F_U和F/T_U表示的分别用于从时域到第二频域的变换和从第二频域到时域的变换的第二组时频滤波器组。因此，解码器120可以包括通过使用滤波器组T/F_U将在时域中表示的M个下混信号112变换到第二频域的变换部件903。当重构部件624已经通过在第二频域中执行处理而基于M个下混信号重构对象106'时，变换部件905可以通过使用滤波器组F/T_U将重构对象106'变换回时域。The decoder 120, and in particular the reconstruction component 624, typically processes signals with respect to a second frequency domain. The second frequency domain corresponds to a second set of time-frequency filter banks, denoted herein by T/F_U and F/T_U for the transform from the time domain to the second frequency domain and the transform from the second frequency domain to the time domain, respectively. The decoder 120 may therefore include a transform component 903 that transforms the M downmix signals 112, represented in the time domain, to the second frequency domain by using the filter bank T/F_U. When the reconstruction component 624 has reconstructed the objects 106' on the basis of the M downmix signals by performing processing in the second frequency domain, a transform component 905 may transform the reconstructed objects 106' back to the time domain by using the filter bank F/T_U.

呈现器122通常关于第三频域处理信号。第三频域对应于在本文中由T/F_R和F/T_R表示的分别用于从时域到第三频域的变换以及从第三频域到时域的变换的第三组时频滤波器组。因此，呈现器122可以包括通过使用滤波器组T/F_R将重构的音频对象106'从时域变换到第三频域的变换部件907。当呈现器122通过呈现部件922已经呈现输出声道124时，可以由变换部件909通过使用滤波器组F/T_R将输出声道变换到时域。The renderer 122 typically processes signals with respect to a third frequency domain. The third frequency domain corresponds to a third set of time-frequency filter banks, denoted herein by T/F_R and F/T_R for the transform from the time domain to the third frequency domain and the transform from the third frequency domain to the time domain, respectively. The renderer 122 may therefore include a transform component 907 that transforms the reconstructed audio objects 106' from the time domain to the third frequency domain by using the filter bank T/F_R. When the renderer 122 has rendered the output channels 124 by means of the rendering component 922, the output channels may be transformed to the time domain by a transform component 909 using the filter bank F/T_R.

从以上描述显而易见,音频编码/解码系统的解码器侧包括许多时频变换步骤。然而,如果以一定方式选择第一频域、第二频域和第三频域,则时频变换步骤中的一些步骤会变得冗余。As is apparent from the above description, the decoder side of the audio encoding/decoding system includes many time-frequency transformation steps. However, if the first frequency domain, the second frequency domain and the third frequency domain are selected in a certain way, some of the time-frequency transformation steps may become redundant.

例如，可以将第一频域、第二频域和第三频域中的一些选择成为一样的，或者可以共同地实现为从一个频域直接到另一频域而不通过它们之间的时域。后者的一个示例是以下情形：第二频域和第三频域的不同仅在于呈现器122中的变换部件907除了使用两个变换部件905和907共同的QMF滤波器组以外还使用奈奎斯特滤波器组以提高低频处的频率分辨率。在这种情形下，可以以奈奎斯特滤波器组的形式共同实现变换部件905和907，从而节省计算复杂度。For example, some of the first, second and third frequency domains may be chosen to be identical, or the corresponding transforms may be implemented jointly so that the signal passes directly from one frequency domain to the other without an intermediate time-domain representation. An example of the latter is the case where the second and third frequency domains differ only in that the transform component 907 in the renderer 122 uses, in addition to the QMF filter bank common to the two transform components 905 and 907, a Nyquist filter bank to increase the frequency resolution at low frequencies. In that case, the transform components 905 and 907 may be implemented jointly in the form of a Nyquist filter bank, thereby saving computational complexity.

在另一示例中,第二频域和第三频域是相同的。例如,第二频域和第三频域可以都是QMF频域。在这种情形下,变换部件905和907是冗余的并且可以被去除,从而节省计算复杂度。In another example, the second frequency domain and the third frequency domain are the same. For example, the second frequency domain and the third frequency domain can both be QMF frequency domains. In this case, transform components 905 and 907 are redundant and can be removed, thereby saving computational complexity.

根据另一示例,第一频域和第二频域可以是相同的。例如,第一频域和第二频域可以都是MDCT域。在这种情形下,可以去除第一变换部件901和第二变换部件903,从而节省计算复杂度。According to another example, the first frequency domain and the second frequency domain may be the same. For example, the first frequency domain and the second frequency domain may both be MDCT domains. In this case, the first transform component 901 and the second transform component 903 may be removed, thereby saving computational complexity.
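The simplifications discussed in the last three paragraphs can be summarized symbolically: whenever a synthesis step F/T_x is immediately followed by an analysis step T/F_x for the same frequency domain x, the pair is redundant and can be dropped. The sketch below is purely illustrative bookkeeping, not a filter-bank implementation, and the example chain is an assumption corresponding to the case where the second and third frequency domains are both QMF domains.

def remove_redundant_transforms(chain):
    # `chain` is a list of ("F/T", domain) and ("T/F", domain) steps in order.
    simplified = []
    for step in chain:
        if simplified and step[0] == "T/F" and simplified[-1] == ("F/T", step[1]):
            simplified.pop()          # F/T_x immediately followed by T/F_x cancels
        else:
            simplified.append(step)
    return simplified

# Decoder-side chain 901, 903, 905, 907, 909 when domains two and three are both QMF:
chain = [("F/T", "core"), ("T/F", "QMF"), ("F/T", "QMF"), ("T/F", "QMF"), ("F/T", "QMF")]
print(remove_redundant_transforms(chain))
# [('F/T', 'core'), ('T/F', 'QMF'), ('F/T', 'QMF')]  i.e. components 905 and 907 drop out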

等同物、扩展、替选方案以及其它Equivalents, extensions, alternatives and others

本领域技术人员在研究以上描述之后将会明白本公开内容的其它实施例。虽然本说明书和附图公开了实施例和示例,但本公开内容不限于这些具体示例。在不偏离由所附权利要求所定义的本公开内容的范围的情况下可以做出许多修改和变形。在权利要求中出现的任何附图标记不被理解为限制它们的范围。Other embodiments of the present disclosure will become apparent to those skilled in the art after studying the above description. Although this specification and the drawings disclose embodiments and examples, the present disclosure is not limited to these specific examples. Many modifications and variations may be made without departing from the scope of the present disclosure as defined by the appended claims. Any reference signs appearing in the claims are not to be construed as limiting their scope.

另外,根据对附图、公开内容和所附权利要求的研究,本领域技术人员在实践本公开内容时可以理解并实现对所公开的实施例的变型。在权利要求书中,词语“包括”不排除其它元件或步骤,并且不定冠词“一”不排除复数形式。在相互不同的从属权利要求中引述某些措施的事实不指示不可以使用这些措施的组合来获利。Furthermore, variations to the disclosed embodiments may be understood and effected by those skilled in the art in practicing the present disclosure, based on a study of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude plural reference. The fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

在上文中公开的系统和方法可以实现为软件、固件、硬件或者它们的组合。在硬件实现中,以上描述中提到的功能单元之间的任务划分不一定对应于实体单元的划分;相反地,一个物理部件可以具有多个功能,并且可以由若干物理部件共同执行一个任务。某些部件或全部部件可以被实现为由数字信号处理器或微处理器执行的软件,或者可以被实现为硬件或专用集成电路。这样的软件可以分布在可以包括计算机存储介质(或非暂态介质)和通信介质(或暂态介质)的计算机可读介质上。如为本领域技术人员所熟知的,术语计算机存储介质包括以任何方法或用于存储诸如计算机可读指令、数据结构、程序模块或其它数据的信息的技术实现的易失性介质和非易失性介质,可移除介质和不可移除介质。计算机存储介质包括但不限于RAM、ROM、EEPROM、闪速存储器或其它存储器技术、CD-ROM、数字多功能盘(DVD)或其它光盘存储装置、磁盒、磁带、磁盘存储器或其它磁存储装置、或者可用于存储期望信息并且可被计算机访问的任何其它介质。此外,技术人员熟知的是,通信介质通常包含计算机可读指令、数据结构、程序模块或诸如载波的调制数据信号中的其它数据,或者其它传输机制,并且包括任何信息传输介质。The systems and methods disclosed above can be implemented as software, firmware, hardware, or a combination thereof. In hardware implementations, the division of tasks between the functional units mentioned in the above description does not necessarily correspond to the division of physical units; on the contrary, a physical component can have multiple functions, and a task can be performed jointly by several physical components. Some or all components can be implemented as software executed by a digital signal processor or microprocessor, or can be implemented as hardware or an application-specific integrated circuit. Such software can be distributed on a computer-readable medium that can include computer storage media (or non-transitory media) and communication media (or transient media). As is well known to those skilled in the art, the term computer storage media includes volatile and non-volatile media, removable media, and non-removable media implemented in any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage devices, magnetic cassettes, magnetic tapes, disk storage or other magnetic storage devices, or any other medium that can be used to store desired information and can be accessed by a computer. Furthermore, as is well known to those skilled in the art, communication media typically embodies computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave, or other transport mechanism, and includes any information delivery media.

Claims (33)

1.一种对至少包括N个音频对象的音频场景的时频块进行编码的方法,所述方法包括:1. A method for encoding a time-frequency block of an audio scene comprising at least N audio objects, the method comprising: 接收所述N个音频对象;Receive the N audio objects; 基于至少所述N个音频对象生成M个下混信号;Generate M downmixed signals based on at least the N audio objects; 由用于根据所述M个下混信号重构至少所述N个音频对象的矩阵元素生成重构矩阵,其中,至少所述N个音频对象的近似能够被获得为至少所述M个下混信号的线性组合,所述重构矩阵的所述矩阵元素作为所述线性组合中的系数;以及A reconstruction matrix is generated from matrix elements used to reconstruct at least N audio objects based on the M downmixed signals, wherein the approximations of the at least N audio objects can be obtained as linear combinations of the at least M downmixed signals, and the matrix elements of the reconstruction matrix serve as coefficients in the linear combination; and 生成比特流,所述比特流包括所述M个下混信号和所述重构矩阵的所述矩阵元素中的至少一些矩阵元素。Generate a bitstream, the bitstream comprising at least some of the matrix elements of the M downmixed signals and the matrix elements of the reconstruction matrix. 2.根据权利要求1所述的方法,其中,使用第一格式将所述M个下混信号布置在所述比特流的第一字段中,并且使用第二格式将所述矩阵元素布置在所述比特流的第二字段中,从而允许仅支持所述第一格式的解码器解码和重放所述第一字段中的所述M个下混信号,并且丢弃所述第二字段中的所述矩阵元素。2. The method of claim 1, wherein the M downmixed signals are arranged in a first field of the bitstream using a first format, and the matrix elements are arranged in a second field of the bitstream using a second format, thereby allowing a decoder that only supports the first format to decode and replay the M downmixed signals in the first field, and discarding the matrix elements in the second field. 3.根据前述权利要求中的任一权利要求所述的方法,还包括步骤:接收对应于所述N个音频对象中的每个音频对象的位置数据,其中,基于所述位置数据生成所述M个下混信号。3. The method according to any one of the preceding claims further comprises the step of: receiving position data corresponding to each of the N audio objects, wherein the M downmix signals are generated based on the position data. 4.根据权利要求1或2所述的方法,其中,所述重构矩阵的所述矩阵元素是时变的和频变的。4. The method according to claim 1 or 2, wherein the matrix elements of the reconstructed matrix are time-varying and frequency-varying. 5.根据权利要求1或2所述的方法,其中,所述音频场景还包括多个音床声道,其中,基于至少所述N个音频对象和所述多个音床声道生成所述M个下混信号。5. The method according to claim 1 or 2, wherein the audio scene further includes a plurality of audio bed channels, wherein the M downmix signals are generated based on at least the N audio objects and the plurality of audio bed channels. 6.根据权利要求5所述的方法,其中,所述重构矩阵包括用于根据所述M个下混信号重构所述音床声道的矩阵元素,其中,所述N个音频对象和所述音床声道的近似能够被获得为至少所述M个下混信号的线性组合,所述重构矩阵的所述矩阵元素作为所述线性组合中的系数。6. The method of claim 5, wherein the reconstruction matrix comprises matrix elements for reconstructing the bed channel based on the M downmixing signals, wherein an approximation of the N audio objects and the bed channel can be obtained as a linear combination of at least the M downmixing signals, and the matrix elements of the reconstruction matrix serve as coefficients in the linear combination. 7.根据权利要求1或2所述的方法,其中,所述音频场景初始包括K个音频对象,其中K>N,所述方法还包括步骤:接收所述K个音频对象,并且通过将所述K个音频对象聚类成N个聚类并将每个聚类用一个音频对象代表,来将所述K个音频对象减少到所述N个音频对象。7. The method according to claim 1 or 2, wherein the audio scene initially includes K audio objects, where K>N, and the method further includes the steps of: receiving the K audio objects and reducing the K audio objects to the N audio objects by clustering the K audio objects into N clusters and representing each cluster with an audio object. 8.根据权利要求7所述的方法,还包括步骤:接收对应于所述K个音频对象中的每个音频对象的位置数据,其中,将所述K个对象聚类成N个聚类基于由所述K个音频对象的所述位置数据给出的所述K个对象之间的位置距离。8. 
The method of claim 7, further comprising the step of: receiving location data corresponding to each of the K audio objects, wherein the K objects are clustered into N clusters based on the location distances between the K objects given by the location data of the K audio objects. 9.根据权利要求1或2所述的方法,其中,下混信号的所述数目M大于2。9. The method according to claim 1 or 2, wherein the number M of the downmixed signals is greater than 2. 10.根据权利要求1或2所述的方法,还包括:10. The method according to claim 1 or 2, further comprising: 由所述N个音频对象形成L个辅助信号;The N audio objects form L auxiliary signals; 将用于根据所述M个下混信号和所述L个辅助信号重构至少所述N个音频对象的矩阵元素包括在所述重构矩阵中,其中,至少所述N个音频对象的近似能够被获得为所述M个下混信号和所述L个辅助信号的线性组合,所述重构矩阵的所述矩阵元素作为所述线性组合中的系数;以及The matrix elements used to reconstruct at least N audio objects based on the M downmix signals and the L auxiliary signals are included in the reconstruction matrix, wherein the approximation of the at least N audio objects can be obtained as a linear combination of the M downmix signals and the L auxiliary signals, and the matrix elements of the reconstruction matrix serve as coefficients in the linear combination; and 将所述L个辅助信号包括在所述比特流中。The L auxiliary signals are included in the bit stream. 11.根据权利要求10所述的方法,其中,所述L个辅助信号的至少之一与所述N个音频对象之一相同。11. The method of claim 10, wherein at least one of the L auxiliary signals is identical to one of the N audio objects. 12.根据权利要求10所述的方法,其中,所述L个辅助信号的至少之一被形成为所述N个音频对象中的至少两个音频对象的组合。12. The method of claim 10, wherein at least one of the L auxiliary signals is formed as a combination of at least two audio objects among the N audio objects. 13.根据权利要求10所述的方法,其中,所述M个下混信号跨超平面,并且其中,所述L个辅助信号的至少之一不位于被所述M个下混信号所跨的所述超平面中。13. The method of claim 10, wherein the M downmixing signals span a hyperplane, and wherein at least one of the L auxiliary signals is not located in the hyperplane spanned by the M downmixing signals. 14.根据权利要求13所述的方法,其中,所述L个辅助信号中的所述至少之一与被所述M个下混信号所跨的所述超平面正交。14. The method of claim 13, wherein at least one of the L auxiliary signals is orthogonal to the hyperplane spanned by the M downmixing signals. 15.一种计算机可读介质,其包括当在具有处理能力的装置上运行时适于执行根据权利要求1至14中的任一权利要求所述的方法的计算机代码指令。15. A computer-readable medium comprising computer code instructions adapted to perform the method according to any one of claims 1 to 14 when operated on a processing device. 16.一种对至少包括N个音频对象的音频场景的时频块进行编码的编码器,所述编码器包括:16. 
An encoder for encoding a time-frequency block of an audio scene comprising at least N audio objects, the encoder comprising: 接收部件,其被配置成接收所述N个音频对象;A receiving component, configured to receive the N audio objects; 下混生成部件,其被配置成接收来自所述接收部件的所述N个音频对象,以及基于至少所述N个音频对象生成M个下混信号;A downmixing generation unit is configured to receive the N audio objects from the receiving unit and to generate M downmixing signals based on at least the N audio objects; 分析部件,其被配置成由用于根据所述M个下混信号重构至少所述N个音频对象的矩阵元素生成重构矩阵,其中,至少所述N个音频对象的近似能够被获得为至少所述M个下混信号的线性组合,所述重构矩阵的所述矩阵元素作为所述线性组合中的系数;以及An analysis component is configured to generate a reconstruction matrix from matrix elements used to reconstruct at least the N audio objects based on the M downmixed signals, wherein an approximation of the at least N audio objects can be obtained as a linear combination of the at least M downmixed signals, and the matrix elements of the reconstruction matrix serve as coefficients in the linear combination; and 比特流生成部件,其被配置成接收来自所述下混生成部件的所述M个下混信号和来自所述分析部件的所述重构矩阵,以及生成包括所述M个下混信号和所述重构矩阵的所述矩阵元素中的至少一些矩阵元素的比特流。A bitstream generation unit is configured to receive the M downmixed signals from the downmixing generation unit and the reconstruction matrix from the analysis unit, and to generate a bitstream comprising at least some of the matrix elements of the M downmixed signals and the reconstruction matrix. 17.一种对至少包括N个音频对象的音频场景的时频块进行解码的方法,所述方法包括步骤:17. A method for decoding a time-frequency block of an audio scene comprising at least N audio objects, the method comprising the steps of: 接收包括M个下混信号和重构矩阵的至少一些矩阵元素的比特流;Receive a bit stream comprising at least some matrix elements of M downmixed signals and a reconstruction matrix; 使用所述矩阵元素生成所述重构矩阵;以及The reconstructed matrix is generated using the matrix elements; and 使用所述重构矩阵根据所述M个下混信号重构所述N个音频对象,其中,至少所述N个音频对象的近似能够被获得为至少所述M个下混信号的线性组合,所述重构矩阵的所述矩阵元素作为所述线性组合中的系数。The reconstruction matrix is used to reconstruct the N audio objects based on the M downmixed signals, wherein at least the approximations of the N audio objects can be obtained as linear combinations of at least the M downmixed signals, and the matrix elements of the reconstruction matrix serve as coefficients in the linear combination. 18.根据权利要求17所述的方法,其中,所述M个下混信号被使用第一格式布置在所述比特流的第一字段中,并且所述矩阵元素被使用第二格式布置在所述比特流的第二字段中,从而允许仅支持所述第一格式的解码器解码和重放所述第一字段中的所述M个下混信号,并且丢弃所述第二字段中的所述矩阵元素。18. The method of claim 17, wherein the M downmixed signals are arranged in a first field of the bitstream using a first format, and the matrix elements are arranged in a second field of the bitstream using a second format, thereby allowing a decoder that only supports the first format to decode and replay the M downmixed signals in the first field, and discarding the matrix elements in the second field. 19.根据权利要求17至18中的任一权利要求所述的方法,其中,所述重构矩阵的所述矩阵元素是时变的和频变的。19. The method according to any one of claims 17 to 18, wherein the matrix elements of the reconstructed matrix are time-varying and frequency-varying. 20.根据权利要求17至18中的任一权利要求所述的方法,其中,所述音频场景还包括多个音床声道,所述方法还包括使用所述重构矩阵根据所述M个下混信号重构所述音床声道,其中,所述N个音频对象和所述音床声道的近似能够被获得为至少所述M个下混信号的线性组合,所述重构矩阵的所述矩阵元素作为所述线性组合中的系数。20. 
The method according to any one of claims 17 to 18, wherein the audio scene further includes a plurality of audio bed channels, the method further includes reconstructing the audio bed channels using the reconstruction matrix based on the M downmixing signals, wherein an approximation of the N audio objects and the audio bed channels can be obtained as a linear combination of at least the M downmixing signals, the matrix elements of the reconstruction matrix serving as coefficients in the linear combination. 21.根据权利要求17至18中的任一权利要求所述的方法,其中,下混信号的数目M大于2。21. The method according to any one of claims 17 to 18, wherein the number M of downmixed signals is greater than 2. 22.根据权利要求17至18中的任一权利要求所述的方法,还包括:22. The method according to any one of claims 17 to 18, further comprising: 接收由所述N个音频对象形成的L个辅助信号;Receive L auxiliary signals formed by the N audio objects; 使用所述重构矩阵根据所述M个下混信号和所述L个辅助信号重构所述N个音频对象,其中,至少所述N个音频对象的近似被获得为所述M个下混信号和所述L个辅助信号的线性组合,所述重构矩阵的所述矩阵元素作为所述线性组合中的系数。The reconstruction matrix is used to reconstruct the N audio objects based on the M downmixed signals and the L auxiliary signals, wherein at least the approximations of the N audio objects are obtained as linear combinations of the M downmixed signals and the L auxiliary signals, and the matrix elements of the reconstruction matrix are used as coefficients in the linear combination. 23.根据权利要求22所述的方法,其中,所述L个辅助信号的至少之一与所述N个音频对象之一相同。23. The method of claim 22, wherein at least one of the L auxiliary signals is identical to one of the N audio objects. 24.根据权利要求22所述的方法,其中,所述L个辅助信号的至少之一是所述N个音频对象的组合。24. The method of claim 22, wherein at least one of the L auxiliary signals is a combination of the N audio objects. 25.根据权利要求22所述的方法,其中,所述M个下混信号跨超平面,并且其中,所述L个辅助信号的至少之一不位于被所述M个下混信号所跨的所述超平面中。25. The method of claim 22, wherein the M downmixing signals span a hyperplane, and wherein at least one of the L auxiliary signals is not located in the hyperplane spanned by the M downmixing signals. 26.根据权利要求25所述的方法,其中,不位于所述超平面中的所述L个辅助信号中的所述至少之一与被所述M个下混信号所跨的所述超平面正交。26. The method of claim 25, wherein at least one of the L auxiliary signals not located in the hyperplane is orthogonal to the hyperplane spanned by the M downmixing signals. 27.根据权利要求17至18中的任一权利要求所述的方法,其中,关于第一频域表示所述M个下混信号,并且其中,关于第二频域表示所述重构矩阵,所述第一频域和所述第二频域是相同的频域。27. The method according to any one of claims 17 to 18, wherein the M downmixed signals are represented with respect to a first frequency domain, and wherein the reconstruction matrix is represented with respect to a second frequency domain, the first frequency domain and the second frequency domain being the same frequency domain. 28.根据权利要求27所述的方法,其中,所述第一频域和所述第二频域是改进离散余弦变换MDCT域。28. The method of claim 27, wherein the first frequency domain and the second frequency domain are improved discrete cosine transform (MDCT) domains. 29.根据权利要求17至18中的任一权利要求所述的方法,还包括:接收对应于所述N个音频对象的位置数据,以及29. The method according to any one of claims 17 to 18, further comprising: receiving position data corresponding to the N audio objects, and 使用所述位置数据呈现所述N个音频对象以创建至少一个输出音频声道。The location data is used to render the N audio objects to create at least one output audio channel. 30.根据权利要求29所述的方法,其中,关于对应于第二滤波器组的第二频域表示所述重构矩阵,并且在对应于第三滤波器组的第三频域中执行所述呈现,其中,所述第二滤波器组和所述第三滤波器组是至少部分地相同的滤波器组。30. 
The method of claim 29, wherein the reconstruction matrix is represented with respect to a second frequency domain corresponding to a second filter group, and the rendering is performed in a third frequency domain corresponding to a third filter group, wherein the second filter group and the third filter group are at least partially the same filter group. 31.根据权利要求30所述的方法,其中,所述第二滤波器组和所述第三滤波器组包括正交镜像滤波器QMF滤波器组。31. The method of claim 30, wherein the second filter bank and the third filter bank comprise a quadrature mirror filter (QMF) filter bank. 32.一种计算机可读介质,其包括当在具有处理能力的装置上运行时适于执行根据权利要求17至31中的任一权利要求所述的方法的计算机代码指令。32. A computer-readable medium comprising computer code instructions adapted, when operated on a processing device, to perform the method according to any one of claims 17 to 31. 33.一种对至少包括N个音频对象的音频场景的时频块进行解码的解码器,所述解码器包括:33. A decoder for decoding a time-frequency block of an audio scene comprising at least N audio objects, the decoder comprising: 接收部件,其被配置成接收包括M个下混信号和重构矩阵的矩阵元素中的至少一些矩阵元素的比特流;A receiving unit configured to receive a bit stream comprising at least some matrix elements of a matrix including M downmixed signals and a reconstruction matrix; 重构矩阵生成部件,其被配置成接收来自所述接收部件的所述矩阵元素,并且基于所述矩阵元素生成所述重构矩阵;以及A reconstruction matrix generation component is configured to receive the matrix elements from the receiving component and generate the reconstruction matrix based on the matrix elements; and 重构部件,其被配置成接收来自所述重构矩阵生成部件的所述重构矩阵,并且使用所述重构矩阵根据所述M个下混信号重构所述N个音频对象,其中,至少所述N个音频对象的近似被获得为至少所述M个下混信号的线性组合,所述重构矩阵的所述矩阵元素作为所述线性组合中的系数。A reconstruction component is configured to receive the reconstruction matrix from the reconstruction matrix generation component and reconstruct the N audio objects based on the M downmixing signals using the reconstruction matrix, wherein at least the approximations of the N audio objects are obtained as linear combinations of at least the M downmixing signals, and the matrix elements of the reconstruction matrix serve as coefficients in the linear combination.
