HK40008788B

HK40008788B - Method for decoding audio scene, audio decoder and medium

Info

Publication number: HK40008788B
Application number: HK19132009.2A
Authority: HK
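Inventors: Heiko Purnhagen (海科·普尔哈根); Lars Villemoes (拉尔斯·维尔默斯); Leif Jonas Samuelsson (利夫·约纳什·萨穆埃尔松); Toni Hirvonen (托尼·希尔沃宁)
Original assignee: Dolby International AB (杜比国际公司)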
Priority date: 2013-05-24
Filing date: 2019-11-08
Publication date: 2024-03-15

Description

Methods for decoding audio scenes, audio decoders, and media

This application is a divisional of the invention patent application filed on May 23, 2014, with application number 201480030011.2, entitled "Encoding of Audio Scenes".

Cross-Reference to Related Applications

This application claims priority to U.S. Provisional Patent Application No. 61/827,246, filed May 24, 2013, which is incorporated herein by reference in its entirety.

Technical Field

The invention disclosed herein generally relates to the field of audio encoding and decoding. In particular, it relates to the encoding and decoding of an audio scene comprising audio objects.

Background

Audio coding systems exist for parametric spatial audio coding. For example, MPEG Surround describes a system for parametric spatial coding of multichannel audio. MPEG SAOC (Spatial Audio Object Coding) describes a system for parametric coding of audio objects.

On the encoder side, these systems typically downmix the channels/objects into a downmix, usually a mono (one channel) or stereo (two channels) downmix, and extract side information describing properties of the channels/objects by means of parameters such as level differences and cross-correlation. The downmix and the side information are then encoded and sent to the decoder side. On the decoder side, the channels/objects are reconstructed, that is, approximated, from the downmix under the control of the parameters of the side information.

A drawback of these systems is that the reconstruction is typically mathematically complex and often has to rely on assumptions about properties of the audio content that are not explicitly described by the parameters sent as side information. Such assumptions may, for example, be that the channels/objects are considered uncorrelated unless a cross-correlation parameter is sent, or that the downmix of the channels/objects is generated in a specific way. Furthermore, the mathematical complexity and the need for additional assumptions increase significantly as the number of channels of the downmix increases.

Moreover, the required assumptions are inherently reflected in the algorithmic details of the processing applied on the decoder side. This means that a considerable amount of intelligence has to be included on the decoder side. This is a drawback because it is difficult to upgrade and improve the algorithms once the decoders are deployed in, for example, consumer devices that are difficult or even impossible to upgrade.

Summary of the Invention

A first aspect of the invention relates to a method for decoding an audio scene represented by N audio signals, the method comprising: receiving a bitstream comprising M downmix signals and matrix elements of a reconstruction matrix; generating the reconstruction matrix using the matrix elements; and reconstructing the N audio signals from the M downmix signals using the reconstruction matrix, wherein approximations of the N audio signals are obtained as linear combinations of the M downmix signals, with the matrix elements of the reconstruction matrix as coefficients in the linear combinations, wherein M is less than N and M is equal to or greater than 1.
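
By way of illustration only, the linear combination recited above may be written as follows. The symbols are assumptions introduced for this sketch and do not appear in the original text: y_m denotes the m-th downmix signal, c_{n,m} the matrix element in row n and column m of the reconstruction matrix, and \hat{s}_n the approximation of the n-th audio signal.

```latex
\hat{s}_n(t) = \sum_{m=1}^{M} c_{n,m}\, y_m(t),
\qquad n = 1, \dots, N,
\qquad 1 \le M < N
```

In matrix form this reads \hat{S} = C Y, where C is the N x M reconstruction matrix and Y stacks the M downmix signals as rows.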

A second aspect of the invention relates to an audio decoder for decoding an audio scene represented by N audio signals, the audio decoder comprising: a receiver that receives a bitstream comprising M downmix signals and matrix elements of a reconstruction matrix; a reconstruction matrix generator that receives the matrix elements from the receiver and generates the reconstruction matrix based on the matrix elements; and a reconstructor that receives the reconstruction matrix from the reconstruction matrix generator and reconstructs the N audio signals from the M downmix signals using the reconstruction matrix, wherein approximations of the N audio signals are obtained as linear combinations of the M downmix signals, with the matrix elements of the reconstruction matrix as coefficients in the linear combinations, wherein M is less than N and M is equal to or greater than 1.

A third aspect of the invention relates to a non-transitory computer-readable medium comprising instructions that, when executed by a processor of an information processing system, cause the information processing system to perform a method according to an embodiment.

Brief Description of the Drawings

In the following, example embodiments are described in more detail with reference to the accompanying drawings, in which:

Figure 1 is a schematic diagram of an audio encoding/decoding system according to an example embodiment;

Figure 2 is a schematic diagram of an audio encoding/decoding system with a legacy decoder according to an example embodiment;

Figure 3 is a schematic diagram of the encoder side of an audio encoding/decoding system according to an example embodiment;

Figure 4 is a flowchart of an encoding method according to an example embodiment;

Figure 5 is a schematic diagram of an encoder according to an example embodiment;

Figure 6 is a schematic diagram of the decoder side of an audio encoding/decoding system according to an example embodiment;

Figure 7 is a flowchart of a decoding method according to an example embodiment;

Figure 8 is a schematic diagram of the decoder side of an audio encoding/decoding system according to an example embodiment; and

Figure 9 is a schematic diagram of time-frequency transformations performed on the decoder side of an audio encoding/decoding system according to an example embodiment.

All figures are schematic and generally show only the parts necessary to elucidate the invention, whereas other parts may be omitted or merely suggested. Unless otherwise indicated, like reference numerals refer to like parts in different figures.

Detailed Description

In view of the above, it is an object to provide an encoder and a decoder, and associated methods, that provide less complex and more flexible reconstruction of audio objects.

I. Overview - Encoder

According to a first aspect, example embodiments propose an encoding method, an encoder, and a computer program product for encoding. The proposed method, encoder, and computer program product may generally have the same features and advantages.

According to example embodiments, there is provided a method for encoding a time-frequency block of an audio scene comprising at least N audio objects. The method comprises: receiving the N audio objects; generating M downmix signals based on at least the N audio objects; generating a reconstruction matrix with matrix elements, the reconstruction matrix enabling reconstruction of at least the N audio objects from the M downmix signals; and generating a bitstream comprising the M downmix signals and at least some of the matrix elements of the reconstruction matrix.

The number N of audio objects may be equal to or greater than 1. The number M of downmix signals may be equal to or greater than 1.

With this method, a bitstream is generated that comprises the M downmix signals and, as side information, at least some of the matrix elements of the reconstruction matrix. Because the individual matrix elements of the reconstruction matrix are included in the bitstream, very little intelligence is required on the decoder side. For example, the decoder does not need to perform a complex computation of the reconstruction matrix based on transmitted object parameters and additional assumptions. The mathematical complexity on the decoder side is therefore significantly reduced. Moreover, because the complexity of the method does not depend on the number of downmix signals used, flexibility regarding the number of downmix signals is increased compared to prior art methods.
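
By way of illustration only, the following sketch shows one way an encoder might obtain such a reconstruction matrix for a single time-frequency block. The passage above does not state how the matrix elements are computed; the least-squares criterion used here, and all function and variable names, are assumptions made for this example.

```python
def gram(a, b):
    """Return the matrix of inner products between the rows of a and b."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]


def solve(g, rhs):
    """Solve g @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(g)
    aug = [list(g[i]) + list(rhs[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def reconstruction_matrix(objects, downmix):
    """Least-squares N x M matrix C such that C @ downmix approximates objects.

    Assumption: the downmix signals are linearly independent, so that the
    M x M Gram matrix is invertible.
    """
    g = gram(downmix, downmix)      # M x M
    cross = gram(downmix, objects)  # M x N
    c_transposed = solve(g, cross)  # M x N, the transpose of C
    return [list(col) for col in zip(*c_transposed)]


# Hypothetical block: N = 3 objects mixed into M = 2 downmix signals.
objects = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
mix = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]  # assumed downmix gains
downmix = [[sum(g * s[t] for g, s in zip(row, objects)) for t in range(4)]
           for row in mix]
matrix = reconstruction_matrix(objects, downmix)
```

The resulting N x M coefficients are what would be carried as side information; the decoder only needs to apply them.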

As used herein, an audio scene generally refers to a three-dimensional audio environment comprising audio elements that are associated with positions in three-dimensional space and that can be rendered for playback on an audio system.

As used herein, an audio object refers to an element of an audio scene. An audio object typically comprises an audio signal and additional information, such as the position of the object in three-dimensional space. The additional information is typically used to optimally render the audio object on a given playback system.

As used herein, a downmix signal refers to a signal that is a combination of at least the N audio objects. Other signals of the audio scene, such as bed channels (described below), may also be combined into the downmix signal. For example, the M downmix signals may correspond to a rendering of the audio scene for a given loudspeaker configuration, such as a standard 5.1 configuration. The number of downmix signals, denoted herein by M, is typically (but not necessarily) less than the sum of the number of audio objects and bed channels, which explains why the M downmix signals are referred to as a downmix.

Audio encoding/decoding systems typically divide the time-frequency space into time-frequency blocks, for example by applying suitable filter banks to the input audio signals. A time-frequency block generally means a portion of the time-frequency space corresponding to a time interval and a frequency sub-band. The time interval may typically correspond to the duration of a time frame used in the audio encoding/decoding system. The frequency sub-band may typically correspond to one or several neighboring frequency sub-bands defined by the filter bank used in the encoding/decoding system. Where the frequency sub-band corresponds to several neighboring frequency sub-bands defined by the filter bank, this allows non-uniform frequency sub-bands in the decoding of the audio signal, for example wider frequency sub-bands for higher frequencies of the audio signal. In the broadband case, where the audio encoding/decoding system operates on the whole frequency range, the frequency sub-band of the time-frequency block may correspond to the whole frequency range. The method above discloses the encoding steps for encoding an audio scene during one such time-frequency block. It is to be understood, however, that the method may be repeated for each time-frequency block of the audio encoding/decoding system. It is also to be understood that several time-frequency blocks may be encoded simultaneously. Typically, neighboring time-frequency blocks may overlap slightly in time and/or frequency. For example, an overlap in time may be equivalent to a linear interpolation of the elements of the reconstruction matrix in time, that is, from one time interval to the next. However, this disclosure targets other parts of the encoding/decoding system, and any overlap in time and/or frequency between neighboring time-frequency blocks is left for the skilled person to implement.
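
By way of illustration only, the linear interpolation of matrix elements from one time interval to the next could be sketched as follows. The function name and the number of interpolation steps are assumptions made for this example.

```python
def interpolate_matrix(prev, curr, num_steps):
    """Return num_steps matrices moving linearly from prev towards curr.

    The last returned matrix equals curr; prev itself is not repeated.
    """
    frames = []
    for step in range(1, num_steps + 1):
        w = step / num_steps  # interpolation weight, reaching 1.0 at the end
        frames.append([
            [(1 - w) * p + w * c for p, c in zip(row_prev, row_curr)]
            for row_prev, row_curr in zip(prev, curr)
        ])
    return frames


# Hypothetical 1 x 2 reconstruction matrices for two consecutive time intervals.
frames = interpolate_matrix([[0.0, 1.0]], [[1.0, 0.0]], 4)
```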

According to example embodiments, the M downmix signals are arranged in a first field of the bitstream using a first format, and the matrix elements are arranged in a second field of the bitstream using a second format, thereby allowing a decoder that only supports the first format to decode and play back the M downmix signals in the first field and to discard the matrix elements in the second field. This is advantageous in that the M downmix signals in the bitstream are backward compatible with legacy decoders that do not implement audio object reconstruction. In other words, a legacy decoder can still decode and play back the M downmix signals of the bitstream, for example by mapping each downmix signal to a channel output of the decoder.
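
By way of illustration only, the two-field arrangement could be sketched as follows. The container layout (a one-byte tag and a four-byte length before each payload), the tag values, and the payload contents are entirely hypothetical and are not taken from the original text.

```python
import struct

DOWNMIX_FIELD = 1  # hypothetical tag for the first field (downmix signals)
MATRIX_FIELD = 2   # hypothetical tag for the second field (matrix elements)


def pack_field(tag, payload):
    """Prefix a payload with a 1-byte tag and a 4-byte big-endian length."""
    return struct.pack(">BI", tag, len(payload)) + payload


def read_fields(frame):
    """Yield (tag, payload) pairs from a frame built with pack_field."""
    offset = 0
    while offset < len(frame):
        tag, size = struct.unpack_from(">BI", frame, offset)
        offset += 5  # size of the tag plus the length prefix
        yield tag, frame[offset:offset + size]
        offset += size


def legacy_decode(frame):
    """Keep only the downmix payload and discard every other field."""
    return b"".join(p for tag, p in read_fields(frame) if tag == DOWNMIX_FIELD)


frame = (pack_field(DOWNMIX_FIELD, b"core-audio")
         + pack_field(MATRIX_FIELD, struct.pack(">3f", 1.0, 0.0, 0.5)))
```

A decoder that understands both fields would read the second payload as well, whereas `legacy_decode` skips it using only the length prefix.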

According to example embodiments, the method may further comprise the step of receiving position data corresponding to each of the N audio objects, wherein the M downmix signals are generated based on the position data. The position data typically associates each audio object with a position in three-dimensional space. The position of an audio object may vary over time. By using the position data when downmixing the audio objects, the audio objects are mixed into the M downmix signals in such a way that if, for example, the M downmix signals are listened to on a system with M output channels, the audio objects sound as if they were approximately located at their respective positions. This is advantageous, for example, if the M downmix signals are to be backward compatible with a legacy decoder.

According to example embodiments, the matrix elements of the reconstruction matrix are time-variant and frequency-variant. In other words, the matrix elements of the reconstruction matrix may differ between time-frequency blocks. In this way, great flexibility in the reconstruction of the audio objects is achieved.

According to example embodiments, the audio scene further comprises a plurality of bed channels. This is common, for example, in cinema audio applications where the audio content comprises bed channels in addition to audio objects. In such cases, the M downmix signals may be generated based on at least the N audio objects and the plurality of bed channels. A bed channel generally means an audio signal that corresponds to a fixed position in three-dimensional space. For example, a bed channel may correspond to one of the output channels of the audio encoding/decoding system. As such, a bed channel may be interpreted as an audio object whose associated position in three-dimensional space equals the position of one of the output loudspeakers of the audio encoding/decoding system. A bed channel may therefore be associated with a label that merely indicates the position of the corresponding output loudspeaker.

When the audio scene comprises bed channels, the reconstruction matrix may comprise matrix elements that enable reconstruction of the bed channels from the M downmix signals.

In some situations, the audio scene may comprise a large number of objects. To reduce the complexity and the amount of data required to represent the audio scene, the audio scene may be simplified by reducing the number of audio objects. Thus, if the audio scene originally comprises K audio objects, where K > N, the method may further comprise the steps of receiving the K audio objects and reducing the K audio objects to the N audio objects by clustering the K audio objects into N clusters and representing each cluster by one audio object.

To simplify the scene, the method may further comprise the step of receiving position data corresponding to each of the K audio objects, wherein the clustering of the K objects into N clusters is based on the positional distances between the K objects as given by the position data of the K audio objects. For example, audio objects that are close to each other in three-dimensional space may be clustered together.
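
By way of illustration only, a distance-based clustering of K objects into N clusters could be sketched as follows. The original text does not name a clustering algorithm or state how a cluster is turned into one audio object; the agglomerative merging of the closest centroids and the summation of the member signals are assumptions made for this example, as are all names.

```python
import math


def centroid(points):
    """Return the mean position of a list of 3-D points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def cluster_objects(positions, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Returns a list of clusters, each a list of object indices.
    """
    clusters = [[i] for i in range(len(positions))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca = centroid([positions[i] for i in clusters[a]])
                cb = centroid([positions[i] for i in clusters[b]])
                d = math.dist(ca, cb)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))  # b > a, so index a stays valid
    return clusters


def merge_signals(signals, cluster):
    """Represent a cluster by the sample-wise sum of its member signals."""
    return [sum(samples) for samples in zip(*(signals[i] for i in cluster))]


# Hypothetical scene: K = 4 objects reduced to N = 2 clusters.
positions = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 5.0, 0.0), (5.0, 5.2, 0.0)]
signals = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
clusters = cluster_objects(positions, 2)
reduced = [merge_signals(signals, c) for c in clusters]
```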

As discussed above, example embodiments of the method are flexible with respect to the number of downmix signals used. In particular, the method may advantageously be used when there are more than two downmix signals, that is, when M is greater than two. For example, five or seven downmix signals corresponding to conventional 5.1 or 7.1 audio setups may be used. This is advantageous because, in contrast to prior art systems, the mathematical complexity of the proposed coding principles remains the same regardless of the number of downmix signals used.

To further improve the reconstruction of the N audio objects, the method may further comprise: forming L auxiliary signals from the N audio objects; including matrix elements in the reconstruction matrix that enable reconstruction of at least the N audio objects from the M downmix signals and the L auxiliary signals; and including the L auxiliary signals in the bitstream. The auxiliary signals thus serve as helper signals which, for example, may capture aspects of the audio objects that are difficult to reconstruct from the downmix signals. The auxiliary signals may also be based on the bed channels. The number of auxiliary signals may be equal to or greater than one.

According to one example embodiment, the auxiliary signals may correspond to particularly important audio objects, such as an audio object representing dialogue. Thus, at least one of the L auxiliary signals may be equal to one of the N audio objects. This allows the important objects to be rendered at higher quality than if they had to be reconstructed from the M downmix channels only. In practice, the audio content provider may have prioritized and/or labeled some of the audio objects as objects that are preferably included individually as auxiliary objects. This also makes modification or processing of these objects prior to rendering less prone to artifacts. As a compromise between bit rate and quality, a mix of two or more audio objects may also be sent as an auxiliary signal. In other words, at least one of the L auxiliary signals may be formed as a combination of at least two of the N audio objects.

According to one example embodiment, the auxiliary signals represent signal dimensions of the audio objects that were lost in the process of generating the M downmix signals, for example because the number of independent objects is typically higher than the number of downmix channels, or because two objects are associated with positions such that they are mixed into the same downmix signal. An example of the latter case is two objects that are only vertically separated and share the same position when projected onto the horizontal plane, which means that they would typically be rendered to the same downmix channels of a standard 5.1 surround loudspeaker setup, in which all loudspeakers lie in the same horizontal plane. Specifically, the M downmix signals span a hyperplane in signal space. By forming linear combinations of the M downmix signals, only audio signals that lie in the hyperplane can be reconstructed. To improve the reconstruction, auxiliary signals that do not lie in the hyperplane may be included, so that signals that do not lie in the hyperplane can also be reconstructed. In other words, according to example embodiments, at least one of the plurality of auxiliary signals does not lie in the hyperplane spanned by the M downmix signals. For example, at least one of the plurality of auxiliary signals may be orthogonal to the hyperplane spanned by the M downmix signals.
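
By way of illustration only, an auxiliary signal orthogonal to the hyperplane spanned by the downmix signals could be obtained by removing from an audio object its projection onto that hyperplane. The use of Gram-Schmidt orthogonalization, the choice of object, and all names below are assumptions made for this sketch.

```python
def dot(a, b):
    """Inner product of two equally long sample sequences."""
    return sum(x * y for x, y in zip(a, b))


def orthogonal_residual(signal, downmix, eps=1e-12):
    """Return the part of signal that is orthogonal to every downmix signal."""
    # Build an orthonormal basis of the span of the downmix signals.
    basis = []
    for y in downmix:
        v = list(y)
        for q in basis:
            coeff = dot(v, q)
            v = [vi - coeff * qi for vi, qi in zip(v, q)]
        norm = dot(v, v) ** 0.5
        if norm > eps:  # skip downmix signals that add no new direction
            basis.append([vi / norm for vi in v])
    # Subtract the projection of the signal onto that basis.
    residual = list(signal)
    for q in basis:
        coeff = dot(residual, q)
        residual = [ri - coeff * qi for ri, qi in zip(residual, q)]
    return residual


# Hypothetical case: two objects mixed into the same single downmix signal.
obj_a = [1.0, 0.0, 2.0, 0.0]
obj_b = [0.0, 1.0, 0.0, 3.0]
downmix = [[a + b for a, b in zip(obj_a, obj_b)]]
aux = orthogonal_residual(obj_a, downmix)
```

Together with the downmix signal, such a residual allows `obj_a` to be expressed exactly as a linear combination, which the downmix signal alone does not.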

According to example embodiments, there is provided a computer-readable medium comprising computer code instructions adapted to carry out any method of the first aspect when executed on a device having processing capability.

According to example embodiments, there is provided an encoder for encoding a time-frequency block of an audio scene comprising at least N audio objects, the encoder comprising: a receiving component configured to receive the N audio objects; a downmix generating component configured to receive the N audio objects from the receiving component and to generate M downmix signals based on at least the N audio objects; an analyzing component configured to generate a reconstruction matrix with matrix elements, the reconstruction matrix enabling reconstruction of at least the N audio objects from the M downmix signals; and a bitstream generating component configured to receive the M downmix signals from the downmix generating component and the reconstruction matrix from the analyzing component, and to generate a bitstream comprising the M downmix signals and at least some of the matrix elements of the reconstruction matrix.

II. Overview - Decoder

According to a second aspect, example embodiments propose a decoding method, a decoding device, and a computer program product for decoding. The proposed method, device, and computer program product may generally have the same features and advantages.

Advantages regarding features and setups presented in the overview of the encoder above generally apply to the corresponding features and setups of the decoder.

According to example embodiments, there is provided a method for decoding a time-frequency block of an audio scene comprising at least N audio objects, the method comprising the steps of: receiving a bitstream comprising M downmix signals and at least some matrix elements of a reconstruction matrix; generating the reconstruction matrix using the matrix elements; and reconstructing the N audio objects from the M downmix signals using the reconstruction matrix.
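
By way of illustration only, the decoding steps above could be sketched as follows for one time-frequency block. The representation of the received matrix elements as (row, column, value) entries and the treatment of elements absent from the bitstream as zero are assumptions made for this example, as are all names.

```python
def build_matrix(elements, n, m):
    """Fill an n x m reconstruction matrix from (row, column, value) entries.

    Assumption: elements that were not received are treated as zero.
    """
    matrix = [[0.0] * m for _ in range(n)]
    for row, col, value in elements:
        matrix[row][col] = value
    return matrix


def reconstruct(matrix, downmix):
    """Approximate N signals as linear combinations of the M downmix signals."""
    num_samples = len(downmix[0])
    return [
        [sum(c * y[t] for c, y in zip(row, downmix)) for t in range(num_samples)]
        for row in matrix
    ]


# Hypothetical block: M = 2 downmix signals, N = 3 audio objects.
downmix = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
elements = [(0, 0, 1.0), (1, 1, 1.0), (2, 0, 0.5), (2, 1, 0.5)]
matrix = build_matrix(elements, n=3, m=2)
objects = reconstruct(matrix, downmix)
```

The decoder performs only the matrix multiplication; no object parameters have to be interpreted.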

According to example embodiments, the M downmix signals are arranged in a first field of the bitstream using a first format, and the matrix elements are arranged in a second field of the bitstream using a second format, thereby allowing a decoder that only supports the first format to decode and play back the M downmix signals in the first field and to discard the matrix elements in the second field.

According to example embodiments, the matrix elements of the reconstruction matrix are time-variant and frequency-variant.

According to example embodiments, the audio scene further comprises a plurality of bed channels, and the method further comprises reconstructing the bed channels from the M downmix signals using the reconstruction matrix.

According to example embodiments, the number M of downmix signals is greater than 2.

According to example embodiments, the method further comprises: receiving L auxiliary signals formed from the N audio objects; and reconstructing the N audio objects from the M downmix signals and the L auxiliary signals using the reconstruction matrix, wherein the reconstruction matrix comprises matrix elements that enable reconstruction of at least the N audio objects from the M downmix signals and the L auxiliary signals.
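
By way of illustration only, reconstruction from both downmix and auxiliary signals could be sketched by stacking the M + L signals and applying an N x (M + L) reconstruction matrix. The column ordering (downmix signals first, auxiliary signals after) and all names are assumptions made for this example.

```python
def reconstruct_with_aux(matrix, downmix, aux):
    """Apply an N x (M + L) matrix to M downmix and L auxiliary signals."""
    inputs = list(downmix) + list(aux)  # assumed column order: downmix, then aux
    if any(len(row) != len(inputs) for row in matrix):
        raise ValueError("each matrix row needs M + L coefficients")
    num_samples = len(inputs[0])
    return [
        [sum(c * x[t] for c, x in zip(row, inputs)) for t in range(num_samples)]
        for row in matrix
    ]


# Hypothetical block: two objects [1, 2] and [2, 1] were summed into M = 1
# downmix signal, and the first object is also sent as L = 1 auxiliary signal.
downmix = [[3.0, 3.0]]
aux = [[1.0, 2.0]]
matrix = [[0.0, 1.0],    # object 1 = auxiliary signal
          [1.0, -1.0]]   # object 2 = downmix minus auxiliary signal
objects = reconstruct_with_aux(matrix, downmix, aux)
```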

根据示例实施例，L个辅助信号的至少之一与N个音频对象之一相同。According to the example embodiment, at least one of the L auxiliary signals is the same as one of the N audio objects.

根据示例实施例，L个辅助信号的至少之一是N个音频对象的组合。According to the example embodiment, at least one of the L auxiliary signals is a combination of N audio objects.

根据示例实施例，M个下混信号跨超平面，并且其中多个辅助信号的至少之一不位于被M下混信号所跨的超平面中。According to the example embodiment, M downmixed signals span a hyperplane, and at least one of the plurality of auxiliary signals is not located in the hyperplane spanned by the M downmixed signals.

根据示例实施例，不位于超平面中的多个辅助信号的至少之一正交于被M个下混信号所跨的超平面。According to an example embodiment, at least one of a plurality of auxiliary signals not located in the hyperplane is orthogonal to the hyperplane spanned by M downmixing signals.

如上所述，音频编码/解码系统通常在频域中工作。因此，音频编码/解码系统使用滤波器组执行音频信号的时频变换。可以使用不同类型的时频变换。例如，可以关于第一频域来表示M个下混信号并且可以关于第二频域来表示重构矩阵。为了减少解码器的计算负担，以聪明的方式选择第一频域和第二频域是有利的。例如，第一频域和第二频域可以被选择成相同的频域，诸如改进离散余弦变换(MDCF)域。以这种方式，可以避免在解码器中将M个下混信号从第一频域变换到时域然后变换到第二频域。可替选地，能够通过以下方式选择第一频域和第二频域：可以共同实现从第一频域到第二频域的变换，使得在第一频域与第二频域之间没有必要通过时域。As mentioned above, audio encoding/decoding systems typically operate in the frequency domain. Therefore, audio encoding/decoding systems use filter banks to perform time-frequency transformations of the audio signals. Different types of time-frequency transformations can be used. For example, the M downmixed signals can be represented with respect to a first frequency domain, and the reconstruction matrix can be represented with respect to a second frequency domain. To reduce the computational burden on the decoder, it is advantageous to choose the first and second frequency domains intelligently. For example, the first and second frequency domains can be chosen to be the same frequency domain, such as the improved discrete cosine transform (MDCF) domain. In this way, it is avoided in the decoder to transform the M downmixed signals from the first frequency domain to the time domain and then to the second frequency domain. Alternatively, the first and second frequency domains can be chosen such that the transformation from the first to the second frequency domain can be performed jointly, so that it is not necessary to go through the time domain between the first and second frequency domains.

该方法还可以包括接收对应于N个音频对象的位置数据，并且使用位置数据呈现N个音频对象以创建至少一个输出音频声道。以这种方式，基于重构的N个音频对象在三维空间中的位置将其映射到音频编码器/解码器系统的输出声道上。The method may also include receiving position data corresponding to N audio objects, and using the position data to represent the N audio objects to create at least one output audio channel. In this way, the reconstructed N audio objects are mapped onto the output channel of the audio encoder/decoder system based on their positions in three-dimensional space.

优选地在频域中执行呈现。为了减少解码器的计算负担，优选地以聪明的方式关于重构音频对象的频域来选择呈现的频域。例如，如果关于对应于第二滤波器组的第二频域表示重构矩阵，并且在对应于第三滤波器组的第三频域中执行呈现，则优选地将第二滤波器组和第三滤波器组选择成至少部分地为相同的滤波器组。例如，第二滤波器组和第三滤波器组可以包括正交镜像滤波器(QMF)域。可替选地，第二频域和第三频域可以包括MDCT滤波器组。根据示例实施例，第三滤波器组可以由一系列滤波器组组成，诸如QMF滤波器组，后接奈奎斯特滤波器组。如果这样，则序列的滤波器组中至少之一(序列的第一滤波器组)与第二滤波器组相同。以这种方式，可以说第二滤波器组和第三滤波器组至少部分地为相同的滤波器组。The rendering is preferably performed in the frequency domain. To reduce the computational burden on the decoder, the frequency domain for rendering is preferably selected intelligently with respect to the frequency domain of the reconstructed audio object. For example, if the reconstruction matrix is represented with respect to a second frequency domain corresponding to a second filter bank, and rendering is performed in a third frequency domain corresponding to a third filter bank, then the second and third filter banks are preferably selected to be at least partially the same filter banks. For example, the second and third filter banks may include a quadrature mirror filter (QMF) domain. Alternatively, the second and third frequency domains may include an MDCT filter bank. According to an example embodiment, the third filter bank may consist of a series of filter banks, such as a QMF filter bank followed by a Nyquist filter bank. If so, at least one of the filter banks in the sequence (the first filter bank in the sequence) is the same as the second filter bank. In this way, it can be said that the second and third filter banks are at least partially the same filter banks.

According to example embodiments, there is provided a computer-readable medium comprising computer code instructions adapted to carry out any method of the second aspect when executed on a device having processing capability.

According to example embodiments, there is provided a decoder for decoding a time/frequency tile of an audio scene which comprises at least N audio objects. The decoder comprises: a receiving component configured to receive a bitstream comprising M downmix signals and at least some matrix elements of a reconstruction matrix; a reconstruction matrix generating component configured to receive the matrix elements from the receiving component and to generate the reconstruction matrix based on them; and a reconstructing component configured to receive the reconstruction matrix from the reconstruction matrix generating component and to reconstruct the N audio objects from the M downmix signals using the reconstruction matrix.

III. Example embodiments

Figure 1 illustrates an encoding/decoding system 100 for encoding/decoding an audio scene 102. The encoding/decoding system 100 comprises an encoder 108, a bitstream generating component 110, a bitstream decoding component 118, a decoder 120, and a renderer 122.

The audio scene 102 is represented by one or more audio objects 106a, i.e. audio signals, such as N audio objects. The audio scene 102 may further comprise one or more bed channels 106b, i.e. signals that correspond directly to one of the output channels of the renderer 122. The audio scene 102 is further represented by metadata comprising position information 104. The position information 104 is used, for example, by the renderer 122 when rendering the audio scene 102. The position information 104 may associate the audio objects 106a, and possibly also the bed channels 106b, with a spatial position in three-dimensional space as a function of time. The metadata may further comprise other types of data useful for rendering the audio scene 102.

The encoding part of the system 100 comprises the encoder 108 and the bitstream generating component 110. The encoder 108 receives the audio objects 106a, the bed channels 106b (if present), and the metadata comprising the position information 104. Based on these, the encoder 108 generates one or more downmix signals 112, such as M downmix signals. By way of example, the downmix signals 112 may correspond to the channels [Lf Rf Cf Ls Rs LFE] of a 5.1 audio system ("L" stands for left, "R" for right, "C" for center, "f" for front, "s" for surround, and "LFE" for low-frequency effects).

The encoder 108 further generates side information. The side information comprises a reconstruction matrix. The reconstruction matrix comprises matrix elements 114 that enable reconstruction of at least the audio objects 106a from the downmix signals 112. The reconstruction matrix may further enable reconstruction of the bed channels 106b.

The encoder 108 transmits the M downmix signals 112 and at least some of the matrix elements 114 to the bitstream generating component 110. The bitstream generating component 110 generates a bitstream 116 comprising the M downmix signals 112 and at least some of the matrix elements 114 by performing quantization and encoding. The bitstream generating component 110 further receives the metadata comprising the position information 104 for inclusion in the bitstream 116.

The decoding part of the system comprises the bitstream decoding component 118 and the decoder 120. The bitstream decoding component 118 receives the bitstream 116 and performs decoding and dequantization to extract the M downmix signals 112 and the side information comprising at least some of the matrix elements 114 of the reconstruction matrix. The M downmix signals 112 and the matrix elements 114 are then input to the decoder 120, which, based on them, generates a reconstruction 106' of the N audio objects 106a and possibly also the bed channels 106b. The reconstruction 106' is thus an approximation of the N audio objects 106a and possibly also the bed channels 106b.

By way of example, if the downmix signals 112 correspond to the channels [Lf Rf Cf Ls Rs LFE] of a 5.1 configuration, the decoder 120 may use only the full-band channels [Lf Rf Cf Ls Rs] to reconstruct the objects 106', thereby ignoring the LFE. The same applies to other channel configurations. The LFE channel of the downmix 112 may be sent to the renderer 122 essentially unmodified.

The reconstructed audio objects 106' and the position information 104 are then input to the renderer 122. Based on the reconstructed audio objects 106' and the position information 104, the renderer 122 renders an output signal 124 in a format suitable for playback on a desired loudspeaker or headphone configuration. Typical output formats are a standard 5.1 surround setup (3 front loudspeakers, 2 surround loudspeakers, and 1 low-frequency effects (LFE) loudspeaker) or a 7.1+4 setup (3 front loudspeakers, 4 surround loudspeakers, 1 LFE loudspeaker, and 4 elevated loudspeakers).

In some embodiments, the original audio scene may comprise a large number of audio objects. Processing a large number of audio objects comes at the cost of high computational complexity. Moreover, the amount of side information (the position information 104 and the reconstruction matrix elements 114) to be embedded in the bitstream 116 depends on the number of audio objects; typically it grows linearly with that number. To save computational complexity and/or to reduce the bit rate needed to encode the audio scene, it is therefore advantageous to reduce the number of audio objects prior to encoding. For this purpose, the audio encoder/decoder system 100 may further comprise a scene simplification module (not shown) arranged upstream of the encoder 108. The scene simplification module takes the original audio objects, and possibly also bed channels, as input and performs processing to output the audio objects 106a. It reduces the number of original audio objects, say K, to a more manageable number N of audio objects 106a by performing clustering. More precisely, the scene simplification module organizes the K original audio objects, and possibly also the bed channels, into N clusters. Typically, the clusters are defined based on the spatial proximity of the K original audio objects/bed channels in the audio scene. To determine spatial proximity, the scene simplification module may take position information of the original audio objects/bed channels as input. Once it has formed the N clusters, the scene simplification module proceeds to represent each cluster by one audio object. For example, the audio object representing a cluster may be formed as the sum of the audio objects/bed channels belonging to the cluster. More specifically, the audio content of the audio objects/bed channels may be added to generate the audio content of the representative audio object, and the positions of the audio objects/bed channels in the cluster may be averaged to give the position of the representative audio object. The scene simplification module includes the positions of the representative audio objects in the position data 104 and outputs the representative audio objects, which constitute the N audio objects 106a of Figure 1.
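The clustering step described above can be sketched as follows. This is a minimal illustration, not the module of the embodiment: it assumes that N seed positions (`centroids`, a hypothetical input) are already available, assigns each original object to its nearest seed, sums the audio content per cluster, and averages the member positions. The function names are illustrative only.

```python
def nearest_centroid(position, centroids):
    """Index of the seed closest to `position` (squared Euclidean distance)."""
    def dist2(c):
        return sum((p - q) ** 2 for p, q in zip(position, c))
    return min(range(len(centroids)), key=lambda i: dist2(centroids[i]))


def simplify_scene(signals, positions, centroids):
    """Reduce K objects to N = len(centroids) representative objects.

    signals:   K lists of samples, all of equal length
    positions: K (x, y, z) tuples
    centroids: N (x, y, z) seed points (assumed given; hypothetical input)
    """
    n = len(centroids)
    length = len(signals[0])
    rep_signals = [[0.0] * length for _ in range(n)]
    members = [[] for _ in range(n)]
    for sig, pos in zip(signals, positions):
        k = nearest_centroid(pos, centroids)
        members[k].append(pos)
        # Audio content of a cluster is the sum of its members' content.
        rep_signals[k] = [a + b for a, b in zip(rep_signals[k], sig)]
    rep_positions = []
    for k in range(n):
        if members[k]:
            count = len(members[k])
            # Position of the representative object is the member average.
            rep_positions.append(tuple(sum(axis) / count for axis in zip(*members[k])))
        else:
            rep_positions.append(tuple(centroids[k]))  # empty cluster keeps its seed
    return rep_signals, rep_positions


# K = 3 original objects reduced to N = 2 representative objects.
rep_signals, rep_positions = simplify_scene(
    signals=[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    positions=[(0.0, 0.0, 0.0), (0.2, 0.0, 0.0), (1.0, 1.0, 1.0)],
    centroids=[(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
)
```

A practical clustering would also have to choose the seeds and may iterate; those choices are outside what the text specifies.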

The M downmix signals 112 may be arranged in a first field of the bitstream 116 using a first format. The matrix elements 114 may be arranged in a second field of the bitstream 116 using a second format. In this way, a decoder that supports only the first format is able to decode and play back the M downmix signals 112 in the first field and to discard the matrix elements 114 in the second field.

The audio encoder/decoder system 100 of Figure 1 supports both the first format and the second format. More precisely, the decoder 120 is configured to interpret the first and second formats, meaning that it is capable of reconstructing the objects 106' based on the M downmix signals 112 and the matrix elements 114.

Figure 2 illustrates an audio encoder/decoder system 200. The encoding part 108, 110 of the system 200 corresponds to that of Figure 1. However, the decoding part of the audio encoder/decoder system 200 differs from that of the audio encoder/decoder system 100 of Figure 1. The audio encoder/decoder system 200 comprises a legacy decoder 230 which supports the first format but not the second format. The legacy decoder 230 therefore cannot reconstruct the audio objects/bed channels 106a, 106b. However, since the legacy decoder 230 supports the first format, it can still decode the M downmix signals 112 to generate an output 224, which is a channel-based representation, such as a 5.1 representation, suitable for direct playback over a corresponding multichannel loudspeaker setup. This property of the downmix signals is referred to as backward compatibility: a legacy decoder which does not support the second format, i.e. which is incapable of interpreting the side information comprising the matrix elements 114, may still decode and play back the M downmix signals 112.
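The two-field arrangement and its backward-compatible behaviour can be illustrated with a small sketch. The frame layout used here (two length-prefixed byte fields) is purely hypothetical and is not the bitstream syntax of the embodiment; it only shows how a legacy decoder can read the first field and ignore the rest.

```python
import struct


def pack_frame(downmix_payload: bytes, matrix_payload: bytes) -> bytes:
    """Hypothetical frame: [len1][first field][len2][second field]."""
    return (struct.pack(">I", len(downmix_payload)) + downmix_payload
            + struct.pack(">I", len(matrix_payload)) + matrix_payload)


def read_field(frame: bytes, offset: int):
    """Read one length-prefixed field; return (payload, offset of next field)."""
    (length,) = struct.unpack_from(">I", frame, offset)
    start = offset + 4
    return frame[start:start + length], start + length


def legacy_decode(frame: bytes) -> bytes:
    """Supports only the first format: returns the downmix, ignores the rest."""
    downmix, _ = read_field(frame, 0)
    return downmix


def full_decode(frame: bytes):
    """Supports both formats: returns the downmix and the matrix elements."""
    downmix, offset = read_field(frame, 0)
    matrix, _ = read_field(frame, offset)
    return downmix, matrix


frame = pack_frame(b"DMX", b"MATRIX")
```

Here `legacy_decode(frame)` yields only the downmix payload, while `full_decode(frame)` yields both payloads from the same frame.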

The operation on the encoder side of the audio encoding/decoding system 100 will now be described in more detail with reference to Figure 3 and the flowchart of Figure 4.

Figure 3 illustrates the encoder 108 and the bitstream generating component 110 of Figure 1 in more detail. The encoder 108 has a receiving component (not shown), a downmix generating component 318, and an analyzing component 328.

In step E02, the receiving component of the encoder 108 receives the N audio objects 106a and the bed channels 106b (if present). The encoder 108 may also receive the position data 104. Using vector notation, the N audio objects may be denoted by a vector S = [S1 S2 ... SN]^T and the bed channels by a vector B. Together, the N audio objects and the bed channels may be denoted by a vector A = [B^T S^T]^T.

In step E04, the downmix generating component 318 generates the M downmix signals 112 from the N audio objects 106a and the bed channels 106b (if present). Using vector notation, the M downmix signals may be denoted by a vector D = [D1 D2 ... DM]^T. In general, a downmix of a plurality of signals is a combination of the signals, such as a linear combination. By way of example, the M downmix signals may correspond to a particular loudspeaker configuration, such as the configuration of the loudspeakers [Lf Rf Cf Ls Rs LFE] in a 5.1 loudspeaker configuration.

The downmix generating component 318 may use the position information 104 when generating the M downmix signals, so that the objects are combined into the different downmix signals based on their positions in three-dimensional space. This is particularly relevant when the M downmix signals themselves correspond to a specific loudspeaker configuration, as in the example above. By way of example, the downmix generating component 318 may derive a rendering matrix Pd (corresponding to a rendering matrix applied in the renderer 122 of Figure 1) based on the position information, and use it to generate the downmix according to D = Pd * [B^T S^T]^T.
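As a minimal sketch of the relation D = Pd * [B^T S^T]^T, the example below stacks the bed channels and the objects into A and multiplies by a rendering matrix. The matrix values are made up for illustration; how Pd is derived from the position information is not shown and is left as an assumption.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def downmix(pd, beds, objects):
    """D = Pd * A, where A stacks the bed channels on top of the objects."""
    a = beds + objects  # rows of A = [B^T S^T]^T, one row per signal
    return matmul(pd, a)


# One bed channel, two objects, three samples each, M = 2 downmix signals.
beds = [[1.0, 1.0, 1.0]]
objects = [[2.0, 0.0, 2.0], [0.0, 4.0, 0.0]]
pd = [[1.0, 0.5, 0.0],   # first downmix: bed plus half of object 1 (illustrative)
      [0.0, 0.5, 1.0]]   # second downmix: half of object 1 plus object 2
d = downmix(pd, beds, objects)
```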

The N audio objects 106a and the bed channels 106b (if present) are also input to the analyzing component 328. The analyzing component 328 typically operates on individual time/frequency tiles of the input audio signals 106a, 106b. For this purpose, the N audio objects 106a and the bed channels 106b may be fed through a filter bank 338, here a QMF bank, which performs a time-to-frequency transform of the input audio signals 106a, 106b. In particular, the filter bank 338 is associated with a plurality of frequency sub-bands. The frequency resolution of a time/frequency tile corresponds to one or more of these frequency sub-bands. The frequency resolution of the time/frequency tiles may be non-uniform, i.e. it may vary with frequency. For example, a lower frequency resolution may be used at high frequencies, meaning that a time/frequency tile in the high-frequency range may correspond to several of the frequency sub-bands defined by the filter bank 338.
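A non-uniform frequency resolution of this kind can be pictured as a grouping of filter-bank sub-bands into tiles that become wider towards high frequencies. The sketch below assumes 64 sub-bands and an arbitrary set of band boundaries; neither number is taken from the text.

```python
def group_subbands(num_subbands, band_starts):
    """Return one range of sub-band indices per time/frequency tile band."""
    bounds = list(band_starts) + [num_subbands]
    return [range(lo, hi) for lo, hi in zip(bounds, bounds[1:])]


# Assumed example: 64 sub-bands grouped into 12 bands of increasing width.
bands = group_subbands(64, [0, 1, 2, 3, 4, 6, 8, 12, 16, 24, 32, 48])
widths = [len(b) for b in bands]
```

With these assumed boundaries the lowest bands cover a single sub-band each, while the highest bands cover sixteen sub-bands each.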

In step E06, the analyzing component 328 generates a reconstruction matrix, denoted herein by R1. The generated reconstruction matrix is composed of a plurality of matrix elements. The reconstruction matrix R1 allows the N audio objects 106a, and possibly also the bed channels 106b, to be reconstructed (approximated) in the decoder from the M downmix signals 112.

The analyzing component 328 may take different approaches to generating the reconstruction matrix. For example, a minimum mean squared error (MMSE) prediction approach may be used which takes both the N audio objects 106a/bed channels 106b and the M downmix signals 112 as input. This approach can be described as aiming to find the reconstruction matrix that minimizes the mean squared error of the reconstructed audio objects/bed channels. In particular, the approach reconstructs the N audio objects/bed channels using a candidate reconstruction matrix and compares them to the input audio objects 106a/bed channels 106b in terms of the mean squared error. The candidate reconstruction matrix that minimizes the mean squared error is selected as the reconstruction matrix, and its matrix elements 114 are the output of the analyzing component 328.

The MMSE approach requires estimates of correlation and covariance matrices of the N audio objects 106a/bed channels 106b and the M downmix signals 112. In the approach above, these matrices are measured from the N audio objects 106a/bed channels 106b and the M downmix signals 112. In an alternative, model-based approach, the analyzing component 328 takes the position data 104 as input instead of the M downmix signals 112. By making certain assumptions, for example that the N audio objects are mutually uncorrelated, and combining them with the downmix rules applied in the downmix generating component 318, the analyzing component 328 can compute the correlation and covariance matrices needed to carry out the MMSE approach described above.
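For one time/frequency tile, the MMSE criterion has the well-known closed-form solution R1 = C_AD * inv(C_DD), where C_AD is the cross-correlation between the signals to be reconstructed and the downmix signals, and C_DD is the correlation of the downmix signals. The sketch below measures both matrices from real-valued samples and solves the resulting linear system. It is an illustration under assumptions: real-valued signals, a small regularization term `eps` (not mentioned in the text) to keep the system solvable, and hypothetical function names.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def solve(a, b):
    """Solve a * x = b (a is n x n, b is n x m) by Gauss-Jordan elimination."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def mmse_reconstruction_matrix(targets, downmix, eps=1e-9):
    """R1 minimizing the squared error between targets and R1 * downmix."""
    c_dd = matmul(downmix, transpose(downmix))
    for i in range(len(c_dd)):
        c_dd[i][i] += eps  # assumed regularization, not part of the text
    c_ad = matmul(targets, transpose(downmix))
    # R1 * C_DD = C_AD; C_DD is symmetric, so solve C_DD * R1^T = C_AD^T.
    return transpose(solve(c_dd, transpose(c_ad)))


# Two objects; the downmix signals are their sum and their difference.
objects = [[1.0, 0.0, 2.0, -1.0], [0.0, 1.0, 1.0, 3.0]]
downmix = [[1.0, 1.0, 3.0, 2.0], [1.0, -1.0, 1.0, -4.0]]
r1 = mmse_reconstruction_matrix(objects, downmix)
approx = matmul(r1, downmix)
```

In this toy case the downmix is an invertible mix of the objects, so the reconstruction is essentially exact; with fewer downmix signals than objects it is only an approximation.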

The matrix elements 114 of the reconstruction matrix and the M downmix signals 112 are then input to the bitstream generating component 110. In step E08, the bitstream generating component 110 quantizes and encodes the M downmix signals 112 and at least some of the matrix elements 114 of the reconstruction matrix, and arranges them in the bitstream 116. In particular, the bitstream generating component 110 may arrange the M downmix signals 112 in a first field of the bitstream 116 using a first format, and the matrix elements 114 in a second field of the bitstream 116 using a second format. As described above with reference to Figure 2, this allows a legacy decoder that supports only the first format to decode and play back the M downmix signals 112 and to discard the matrix elements 114 in the second field.

Figure 5 illustrates an alternative embodiment of the encoder 108. Compared to the encoder shown in Figure 3, the encoder 508 of Figure 5 additionally allows one or more auxiliary signals to be included in the bitstream 116. For this purpose, the encoder 508 comprises an auxiliary signal generating component 548. The auxiliary signal generating component 548 receives the audio objects 106a/bed channels 106b and, based on them, generates one or more auxiliary signals 512. The auxiliary signal generating component 548 may, for example, generate the auxiliary signals 512 as a combination of the audio objects 106a/bed channels 106b. Denoting the auxiliary signals by the vector C = [C1 C2 ... CL]^T, they may be generated as C = Q * [B^T S^T]^T, where Q is a matrix which may be time- and frequency-variant. This covers both the case where the auxiliary signals are equal to one or more of the audio objects and the case where they are linear combinations of the audio objects. For example, an auxiliary signal may represent a particularly important object, such as dialogue.

The role of the auxiliary signals 512 is to improve the reconstruction of the audio objects 106a/bed channels 106b in the decoder. More precisely, on the decoder side, the audio objects 106a/bed channels 106b may be reconstructed based on the M downmix signals 112 as well as the L auxiliary signals 512. The reconstruction matrix will therefore comprise matrix elements 114 that allow the audio objects/bed channels to be reconstructed from the M downmix signals 112 and the L auxiliary signals.

The L auxiliary signals 512 may therefore be input to the analyzing component 328 so that they are taken into account when generating the reconstruction matrix. The analyzing component 328 may also send a control signal to the auxiliary signal generating component 548. For example, the analyzing component 328 may control which audio objects/bed channels are included in the auxiliary signals and how they are included. In particular, the analyzing component 328 may control the choice of the Q matrix. The control may, for example, be based on the MMSE approach described above, so that the auxiliary signals are selected such that the reconstructed audio objects/bed channels are as close as possible to the audio objects 106a/bed channels 106b.

The operation on the decoder side of the audio encoding/decoding system 100 will now be described in more detail with reference to Figure 6 and the flowchart of Figure 7.

Figure 6 illustrates the bitstream decoding component 118 and the decoder 120 of Figure 1 in more detail. The decoder 120 comprises a reconstruction matrix generating component 622 and a reconstructing component 624.

In step D02, the bitstream decoding component 118 receives the bitstream 116. The bitstream decoding component 118 decodes and dequantizes the information in the bitstream 116 to extract the M downmix signals 112 and at least some of the matrix elements 114 of the reconstruction matrix.

The reconstruction matrix generating component 622 receives the matrix elements 114 and proceeds, in step D04, to generate a reconstruction matrix 614. It does so by arranging the matrix elements 114 at the appropriate positions in the matrix. If not all matrix elements of the reconstruction matrix are received, the reconstruction matrix generating component 622 may, for example, insert zeros in place of the missing elements.
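Step D04 can be sketched as filling a zero-initialized matrix with whatever elements were received. How element positions are signalled is not stated in the text; the `(row, column)` keys used below are a hypothetical addressing scheme chosen for illustration.

```python
def build_reconstruction_matrix(n_rows, n_cols, elements):
    """Place received elements in the matrix; missing elements stay zero.

    elements: dict mapping (row, column) -> value (hypothetical addressing)
    """
    matrix = [[0.0] * n_cols for _ in range(n_rows)]
    for (row, col), value in elements.items():
        matrix[row][col] = value
    return matrix


# Two rows (signals to reconstruct), three columns (downmix signals);
# only two of the six elements were transmitted.
r1 = build_reconstruction_matrix(2, 3, {(0, 0): 0.7, (1, 2): -0.2})
```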

The reconstruction matrix 614 and the M downmix signals are then input to the reconstructing component 624. In step D06, the reconstructing component 624 reconstructs the N audio objects and, if applicable, the bed channels. In other words, the reconstructing component 624 generates an approximation 106' of the N audio objects 106a/bed channels 106b.

By way of example, the M downmix signals may correspond to a particular loudspeaker configuration, such as the configuration of the loudspeakers [Lf Rf Cf Ls Rs LFE] in a 5.1 loudspeaker configuration. If so, the reconstructing component 624 may base the reconstruction of the objects 106' only on the downmix signals corresponding to the full-band channels of the loudspeaker configuration. As explained above, the band-limited signal (the low-frequency LFE signal) may be sent to the renderer essentially unmodified.

The reconstructing component 624 typically operates in the frequency domain. More precisely, it operates on individual time/frequency tiles of the input signals. The M downmix signals 112 are therefore typically subjected to a time-to-frequency transform 623 before being input to the reconstructing component 624. The time-to-frequency transform 623 is typically the same as, or similar to, the transform 338 applied on the encoder side. For example, the time-to-frequency transform 623 may be a QMF transform.

To reconstruct the audio objects/bed channels 106', the reconstructing component 624 applies a matrix operation. More specifically, using the notation introduced above, the reconstructing component 624 may generate an approximation A' of the audio objects/bed channels as A' = R1 * D. The reconstruction matrix R1 may vary as a function of time and frequency, so it may differ between the different time/frequency tiles processed by the reconstructing component 624.
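The tile-wise operation A' = R1 * D can be sketched as below. For brevity, each time/frequency tile is assumed to hold a single value per downmix signal, and tiles are keyed by a hypothetical `(time slot, band)` pair; a real tile would contain several sub-band samples that share the same matrix.

```python
def reconstruct_tile(r1, d):
    """A' = R1 * D for one tile; d holds one value per downmix signal."""
    return [sum(r * x for r, x in zip(row, d)) for row in r1]


def reconstruct(downmix_tiles, matrices):
    """Apply the reconstruction matrix belonging to each tile."""
    return {tile: reconstruct_tile(matrices[tile], d)
            for tile, d in downmix_tiles.items()}


# M = 2 downmix signals, three signals to reconstruct, two tiles.
downmix_tiles = {(0, 0): [1.0, 2.0], (0, 1): [0.5, -0.5]}
matrices = {
    (0, 0): [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    (0, 1): [[0.0, 1.0], [1.0, 0.0], [1.0, -1.0]],  # differs from tile (0, 0)
}
approx = reconstruct(downmix_tiles, matrices)
```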

Before being output from the decoder 120, the reconstructed audio objects/bed channels 106' are typically transformed back to the time domain (625).

Figure 8 illustrates the situation where the bitstream 116 additionally comprises auxiliary signals. Compared to the embodiment of Figure 7, the bitstream decoding component 118 now additionally decodes one or more auxiliary signals 512 from the bitstream 116. The auxiliary signals 512 are input to the reconstructing component 624, where they are included in the reconstruction of the audio objects/bed channels. More specifically, the reconstructing component 624 generates the audio objects/bed channels by applying the matrix operation A' = R1 * [D^T C^T]^T.
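The effect of an auxiliary signal can be seen in a small numerical sketch of A' = R1 * [D^T C^T]^T. All values, including the matrices Q, Pd and R1, are hand-picked for illustration and are not derived by any method from the text. The auxiliary signal is chosen equal to a dialogue object, so that object can be recovered exactly while the other signals remain approximations.

```python
def apply_matrix(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def reconstruct_with_aux(r1, d, c):
    """A' = R1 * [D^T C^T]^T: stack downmix and auxiliary values, then multiply."""
    return apply_matrix(r1, d + c)


# Encoder side, one tile: A = [bed, dialogue object, effects object].
a = [0.5, 2.0, -1.0]
q = [[0.0, 1.0, 0.0]]    # Q selects the dialogue object as auxiliary signal
pd = [[1.0, 0.5, 0.5]]   # mono downmix (illustrative weights)
c = apply_matrix(q, a)   # L = 1 auxiliary signal
d = apply_matrix(pd, a)  # M = 1 downmix signal

# Decoder side: columns of R1 correspond to [D, C].
r1 = [
    [1.0, -0.5],  # bed estimate: downmix minus the dialogue contribution
    [0.0, 1.0],   # dialogue: taken directly from the auxiliary signal
    [0.0, 0.0],   # effects object: not reconstructed in this toy matrix
]
a_hat = reconstruct_with_aux(r1, d, c)
```

The dialogue entry of `a_hat` equals the original value, whereas the bed and effects entries do not, which is the expected behaviour when only one downmix signal and one auxiliary signal are available.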

Figure 9 illustrates the different time/frequency transforms used on the decoder side of the audio encoding/decoding system 100 of Figure 1. The bitstream decoding component 118 receives the bitstream 116. A decoding and dequantizing component 918 decodes and dequantizes the bitstream 116 to extract the position information 104, the M downmix signals 112, and the matrix elements 114 of the reconstruction matrix.

At this stage, the M downmix signals 112 are typically represented in a first frequency domain, corresponding to a first set of time/frequency filter banks, denoted herein by T/F_C and F/T_C for the transform from the time domain to the first frequency domain and from the first frequency domain to the time domain, respectively. Typically, the filter banks corresponding to the first frequency domain implement an overlapping window transform, such as an MDCT and an inverse MDCT. The bitstream decoding component 118 may comprise a transform component 901 which transforms the M downmix signals 112 to the time domain using the filter bank F/T_C.

The decoder 120, and in particular the reconstructing component 624, typically processes signals with respect to a second frequency domain. The second frequency domain corresponds to a second set of time/frequency filter banks, denoted herein by T/F_U and F/T_U for the transform from the time domain to the second frequency domain and from the second frequency domain to the time domain, respectively. The decoder 120 may therefore comprise a transform component 903 which transforms the M downmix signals 112, represented in the time domain, to the second frequency domain using the filter bank T/F_U. Once the reconstructing component 624 has reconstructed the objects 106' from the M downmix signals by processing in the second frequency domain, a transform component 905 may transform the reconstructed objects 106' back to the time domain using the filter bank F/T_U.

The renderer 122 typically processes signals with respect to a third frequency domain. The third frequency domain corresponds to a third set of time/frequency filter banks, denoted herein by T/F_R and F/T_R for the transform from the time domain to the third frequency domain and from the third frequency domain to the time domain, respectively. The renderer 122 may therefore comprise a transform component 907 which transforms the reconstructed audio objects 106' from the time domain to the third frequency domain using the filter bank T/F_R. Once the renderer 122 has rendered the output channels 124 by means of a rendering component 922, the output channels may be transformed to the time domain by a transform component 909 using the filter bank F/T_R.

As is evident from the description above, the decoder side of the audio encoding/decoding system includes a number of time/frequency transform steps. However, if the first, second, and third frequency domains are selected in certain ways, some of these steps become redundant.

For example, some of the first, second, and third frequency domains may be chosen to be the same, or may be implemented jointly so as to go directly from one frequency domain to another without passing through the time domain in between. An example of the latter is the case where the only difference between the second and third frequency domains is that the transform component 907 in the renderer 122 uses a Nyquist filter bank, in addition to a QMF filter bank common to both transform components 905 and 907, to increase the frequency resolution at low frequencies. In that case, the transform components 905 and 907 can be implemented jointly in the form of a Nyquist filter bank, thereby saving computational complexity.

In another example, the second and third frequency domains are the same. For instance, both may be a QMF domain. In that case, the transform components 905 and 907 are redundant and may be removed, thereby saving computational complexity.

According to another example, the first and second frequency domains may be the same. For instance, both may be an MDCT domain. In that case, the first transform component 901 and the second transform component 903 may be removed, thereby saving computational complexity.
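To illustrate the case in which the first and second frequency domains coincide, the following minimal sketch assumes that the downmix signals are already available as frequency-domain coefficients and that one reconstruction matrix is given per time-frequency tile in that same domain. Under that assumption, reconstruction reduces to one matrix multiplication per tile, with no intermediate time-domain transform. The array layout, the function name `reconstruct_in_shared_domain`, and the random test data are illustrative assumptions and are not taken from the disclosure.

```python
import numpy as np


def reconstruct_in_shared_domain(downmix_tiles, recon_matrices):
    """Apply a per-tile reconstruction matrix to frequency-domain downmix tiles.

    downmix_tiles:  shape (frames, bands, M), downmix coefficients per tile
    recon_matrices: shape (frames, bands, N, M), one N x M matrix per tile
    returns:        shape (frames, bands, N), reconstructed object coefficients
    """
    # For every (frame, band) tile, multiply the N x M matrix by the M-vector.
    return np.einsum("tbnm,tbm->tbn", recon_matrices, downmix_tiles)


# Hypothetical sizes: 4 frames, 8 bands, M = 2 downmix signals, N = 5 objects.
rng = np.random.default_rng(0)
frames, bands, M, N = 4, 8, 2, 5
downmix_tiles = rng.standard_normal((frames, bands, M))
recon_matrices = rng.standard_normal((frames, bands, N, M))

objects_hat = reconstruct_in_shared_domain(downmix_tiles, recon_matrices)
```

Because the matrices are indexed by both frame and band, the same call covers reconstruction matrices whose elements vary over time and frequency.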

Equivalents, extensions, alternatives and miscellaneous

Further embodiments of the present disclosure will become apparent to a person skilled in the art after studying the description above. Even though the present description and drawings disclose embodiments and examples, the disclosure is not restricted to these specific examples. Numerous modifications and variations can be made without departing from the scope of the present disclosure, which is defined by the accompanying claims. Any reference signs appearing in the claims are not to be understood as limiting their scope.

Additionally, variations to the disclosed embodiments can be understood and effected by the skilled person in practicing the disclosure, from a study of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

The systems and methods disclosed above may be implemented as software, firmware, hardware, or a combination thereof. In a hardware implementation, the division of tasks between the functional units referred to in the above description does not necessarily correspond to the division into physical units; on the contrary, one physical component may have multiple functions, and one task may be carried out by several physical components in cooperation. Certain components or all components may be implemented as software executed by a digital signal processor or microprocessor, or be implemented as hardware or as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may comprise computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to a person skilled in the art, the term computer storage media includes both volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile discs (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. Further, it is well known to the skilled person that communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media.

The present disclosure also includes the following schemes.

(1) A method for encoding a time-frequency tile of an audio scene comprising at least N audio objects, the method comprising:

receiving the N audio objects;

generating M downmix signals based on at least the N audio objects;

generating a reconstruction matrix with matrix elements, the reconstruction matrix enabling reconstruction of at least the N audio objects from the M downmix signals; and

generating a bitstream comprising the M downmix signals and at least some of the matrix elements of the reconstruction matrix.

(2) The method according to scheme (1), wherein the M downmix signals are arranged in a first field of the bitstream using a first format, and the matrix elements are arranged in a second field of the bitstream using a second format, thereby allowing a decoder that supports only the first format to decode and play back the M downmix signals in the first field and to discard the matrix elements in the second field.

(3) The method according to any one of the preceding schemes, further comprising the step of: receiving position data corresponding to each of the N audio objects, wherein the M downmix signals are generated based on the position data.

(4) The method according to any one of the preceding schemes, wherein the matrix elements of the reconstruction matrix are time-varying and frequency-varying.

(5) The method according to any one of the preceding schemes, wherein the audio scene further comprises a plurality of bed channels, and wherein the M downmix signals are generated based on at least the N audio objects and the plurality of bed channels.

(6) The method according to scheme (5), wherein the reconstruction matrix comprises matrix elements enabling reconstruction of the bed channels from the M downmix signals.

(7) The method according to any one of the preceding schemes, wherein the audio scene initially comprises K audio objects, where K > N, the method further comprising the steps of: receiving the K audio objects, and reducing the K audio objects to the N audio objects by clustering the K audio objects into N clusters and representing each cluster by one audio object.

(8) The method according to scheme (7), further comprising the step of: receiving position data corresponding to each of the K audio objects, wherein the clustering of the K audio objects into N clusters is based on the positional distances between the K audio objects as given by the position data of the K audio objects.

(9) The method according to any one of the preceding schemes, wherein the number M of downmix signals is greater than two.

(10) The method according to any one of the preceding schemes, further comprising:

forming L auxiliary signals from the N audio objects;

including in the reconstruction matrix matrix elements that enable reconstruction of at least the N audio objects from the M downmix signals and the L auxiliary signals; and

including the L auxiliary signals in the bitstream.

(11) The method according to scheme (10), wherein at least one of the L auxiliary signals is identical to one of the N audio objects.

(12) The method according to any one of schemes (10) to (11), wherein at least one of the L auxiliary signals is formed as a combination of at least two of the N audio objects.

(13) The method according to any one of schemes (10) to (12), wherein the M downmix signals span a hyperplane, and wherein at least one of the plurality of auxiliary signals does not lie in the hyperplane spanned by the M downmix signals.

(14) The method according to scheme (13), wherein the at least one of the plurality of auxiliary signals is orthogonal to the hyperplane spanned by the M downmix signals.

(15) A computer-readable medium comprising computer code instructions adapted to carry out the method according to any one of schemes (1) to (14) when executed on a device having processing capability.

(16) An encoder for encoding a time-frequency tile of an audio scene comprising at least N audio objects, the encoder comprising:

a receiving component configured to receive the N audio objects;

a downmix generating component configured to receive the N audio objects from the receiving component and to generate M downmix signals based on at least the N audio objects;

an analyzing component configured to generate a reconstruction matrix with matrix elements, the reconstruction matrix enabling reconstruction of at least the N audio objects from the M downmix signals; and

a bitstream generating component configured to receive the M downmix signals from the downmix generating component and the reconstruction matrix from the analyzing component, and to generate a bitstream comprising the M downmix signals and at least some of the matrix elements of the reconstruction matrix.

(17) A method for decoding a time-frequency tile of an audio scene comprising at least N audio objects, the method comprising the steps of:

receiving a bitstream comprising M downmix signals and at least some matrix elements of a reconstruction matrix;

generating the reconstruction matrix using the matrix elements; and

reconstructing the N audio objects from the M downmix signals using the reconstruction matrix.

(18) The method according to scheme (17), wherein the M downmix signals are arranged in a first field of the bitstream using a first format, and the matrix elements are arranged in a second field of the bitstream using a second format, thereby allowing a decoder that supports only the first format to decode and play back the M downmix signals in the first field and to discard the matrix elements in the second field.

(19) The method according to any one of schemes (17) to (18), wherein the matrix elements of the reconstruction matrix are time-varying and frequency-varying.

(20) The method according to any one of schemes (17) to (19), wherein the audio scene further comprises a plurality of bed channels, the method further comprising reconstructing the bed channels from the M downmix signals using the reconstruction matrix.

(21) The method according to any one of schemes (17) to (20), wherein the number M of downmix signals is greater than two.

(22) The method according to any one of schemes (17) to (21), further comprising:

receiving L auxiliary signals formed from the N audio objects; and

reconstructing the N audio objects from the M downmix signals and the L auxiliary signals using the reconstruction matrix, wherein the reconstruction matrix comprises matrix elements enabling reconstruction of at least the N audio objects from the M downmix signals and the L auxiliary signals.

(23) The method according to scheme (22), wherein at least one of the L auxiliary signals is identical to one of the N audio objects.

(24) The method according to any one of schemes (22) to (23), wherein at least one of the L auxiliary signals is a combination of the N audio objects.

(25) The method according to any one of schemes (22) to (24), wherein the M downmix signals span a hyperplane, and wherein at least one of the plurality of auxiliary signals does not lie in the hyperplane spanned by the M downmix signals.

(26) The method according to scheme (25), wherein the at least one of the plurality of auxiliary signals that does not lie in the hyperplane is orthogonal to the hyperplane spanned by the M downmix signals.

(27) The method according to any one of schemes (17) to (26), wherein the M downmix signals are represented with respect to a first frequency domain, and wherein the reconstruction matrix is represented with respect to a second frequency domain, the first frequency domain and the second frequency domain being the same frequency domain.

(28) The method according to scheme (27), wherein the first frequency domain and the second frequency domain are a modified discrete cosine transform (MDCT) domain.

(29) The method according to any one of schemes (17) to (28), further comprising: receiving position data corresponding to the N audio objects; and

rendering the N audio objects using the position data to create at least one output audio channel.

(30) The method according to scheme (29), wherein the reconstruction matrix is represented with respect to a second frequency domain corresponding to a second filter bank, and the rendering is performed in a third frequency domain corresponding to a third filter bank, wherein the second filter bank and the third filter bank are at least partly the same filter bank.

(31) The method according to scheme (30), wherein the second filter bank and the third filter bank comprise a quadrature mirror filter (QMF) filter bank.

(32) A computer-readable medium comprising computer code instructions adapted to carry out the method according to any one of schemes (17) to (31) when executed on a device having processing capability.

(33) A decoder for decoding a time-frequency tile of an audio scene comprising at least N audio objects, the decoder comprising:

a receiving component configured to receive a bitstream comprising M downmix signals and at least some matrix elements of a reconstruction matrix;

a reconstruction matrix generating component configured to receive the matrix elements from the receiving component and to generate the reconstruction matrix based on the matrix elements; and

a reconstructing component configured to receive the reconstruction matrix from the reconstruction matrix generating component and to reconstruct the N audio objects from the M downmix signals using the reconstruction matrix.
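The encoding and decoding schemes above can be sketched end to end for a single time-frequency tile. The sketch below assumes a random downmix matrix `D`, a least-squares fit for the reconstruction matrix, and a single auxiliary signal taken as the part of one object that is orthogonal to the span of the downmix signals. None of these specific choices is prescribed by the schemes, and all names (`S`, `D`, `Y`, `fit_reconstruction_matrix`, `aux`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, T = 4, 2, 256                 # objects, downmix signals, samples in the tile

S = rng.standard_normal((N, T))     # N audio objects, one per row
D = rng.uniform(0.0, 1.0, (M, N))   # assumed downmix gains
Y = D @ S                           # M downmix signals


def fit_reconstruction_matrix(targets, basis):
    """Least-squares R such that R @ basis approximates targets."""
    # lstsq solves basis.T @ X = targets.T; R is the transpose of X.
    X, *_ = np.linalg.lstsq(basis.T, targets.T, rcond=None)
    return X.T


# Reconstruction from the downmix signals alone.
R = fit_reconstruction_matrix(S, Y)        # shape (N, M)
S_hat = R @ Y                              # decoder-side matrix multiplication

# Auxiliary signal orthogonal to the span of the downmix signals:
# take object 0 and remove its projection onto the rows of Y.
coeffs, *_ = np.linalg.lstsq(Y.T, S[0], rcond=None)
aux = (S[0] - coeffs @ Y)[np.newaxis, :]   # shape (1, T), so L = 1

Z = np.vstack([Y, aux])                    # M + L transmitted signals
R_aux = fit_reconstruction_matrix(S, Z)    # shape (N, M + L)
S_hat_aux = R_aux @ Z

err = np.linalg.norm(S - S_hat)
err_aux = np.linalg.norm(S - S_hat_aux)
```

In this sketch, object 0 lies in the span of the transmitted signals once the auxiliary signal is added, so it is reconstructed exactly, and the overall least-squares error cannot increase relative to using the downmix signals alone.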

Claims

1. A method for decoding an audio scene represented by N audio signals, the method comprising:

receiving a bitstream comprising M downmixed signals and matrix elements of a reconstruction matrix, wherein at least some of the matrix elements are zero-valued;

generating the reconstruction matrix using the matrix elements; and

reconstructing the N audio signals from the M downmixed signals using the reconstruction matrix, wherein the N audio signals are approximated as a linear combination of the M downmixed signals, and the matrix elements of the reconstruction matrix serve as coefficients in the linear combination,

wherein M is less than N, and M is equal to or greater than 1.

2. The method according to claim 1, further comprising: receiving L auxiliary signals in the bitstream, and reconstructing the N audio signals using the reconstruction matrix based on the M downmixed signals and the L auxiliary signals.

3. The method according to claim 1, wherein at least some of the M downmixed signals are formed by two or more of the N audio signals.

4. The method of claim 1, wherein at least some of the N audio signals are rendered to generate a three-dimensional audio environment.

5. The method of claim 1, wherein the audio scene includes a three-dimensional audio environment, the three-dimensional audio environment including audio elements that are associated with positions in three-dimensional space and that can be rendered for playback on an audio system.

6. The method of claim 1, wherein the M downmixed signals are arranged in a first field of the bitstream using a first format, and the matrix elements are arranged in a second field of the bitstream using a second format.

7. The method of claim 1, wherein the linear combination is formed by multiplying the matrix of the M downmixed signals with the reconstruction matrix.

8. The method of claim 1, further comprising: receiving L auxiliary signals, wherein the linear combination is formed by multiplying the matrix of the M downmixed signals and the L auxiliary signals with the reconstruction matrix.

9. The method of claim 1, wherein the M downmixed signals are decoded before the reconstruction is performed.

10. The method of claim 1, further comprising: receiving one or more audio bed channels in the bitstream, and reconstructing the N audio signals using the reconstruction matrix based on the M downmixed signals and the audio bed channels.

11. The method of claim 10, further comprising: receiving L auxiliary signals in the bitstream, and reconstructing the N audio signals using the reconstruction matrix based on the M downmixed signals, the L auxiliary signals, and the one or more audio bed channels.

12. The method of claim 11, wherein the one or more audio bed channels represent audio elements having fixed positions in the audio scene.

13. A non-transitory computer-readable medium comprising instructions that, when executed by a processor of an information processing system, cause the information processing system to perform the method according to claim 1.

14. An audio decoder for decoding an audio scene represented by N audio signals, the audio decoder comprising:

a receiver that receives a bitstream comprising M downmixed signals and matrix elements of a reconstruction matrix, wherein at least some of the matrix elements are zero-valued;

a reconstruction matrix generator that receives the matrix elements from the receiver and generates the reconstruction matrix based on the matrix elements; and

a reconstructor that receives the reconstruction matrix from the reconstruction matrix generator and uses the reconstruction matrix to reconstruct the N audio signals based on the M downmixed signals, wherein the N audio signals are approximated as a linear combination of the M downmixed signals, and the matrix elements of the reconstruction matrix serve as coefficients in the linear combination,

wherein M is less than N, and M is equal to or greater than 1.

15. The audio decoder of claim 14, wherein the receiver further receives L auxiliary signals in the bitstream, and the reconstructor reconstructs the N audio signals using the reconstruction matrix based on the M downmixed signals and the L auxiliary signals.

16. The audio decoder of claim 14, wherein at least some of the M downmixed signals are formed by two or more of the N audio signals.

17. The audio decoder of claim 14, wherein the M downmixed signals are arranged in a first field of the bitstream using a first format, and the matrix elements are arranged in a second field of the bitstream using a second format.

18. The audio decoder of claim 14, wherein the audio decoder is configured to render at least some of the N audio signals to generate a three-dimensional audio environment.

19. The audio decoder of claim 14, wherein the linear combination is formed by multiplying the matrix of the M downmixed signals with the reconstruction matrix.

20. The audio decoder of claim 14, wherein the audio decoder is configured to decode the M downmixed signals prior to performing the reconstruction.