HK1227166B

HK1227166B - Post-encoding bitrate reduction of multiple object audio

Info

Publication number: HK1227166B
Application number: HK17100763.6A
Authority: HK
Inventors: Z．菲左
Original assignee: Dts（英属维尔京群岛）有限公司
Priority date: 2014-03-06
Filing date: 2015-02-26
Publication date: 2020-12-24

Description

Post-encoding bit rate reduction for multi-object audio

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. Patent Application No. 14/199,706, filed on March 6, 2014, and entitled “Post-Encoding Bit Rate Reduction for Multi-Object Audio,” which is incorporated herein by reference in its entirety.

BACKGROUND ART

Audio compression techniques minimize the number of digital bits used to create a representation of an input audio signal. Uncompressed, high-quality digital audio signals often contain large amounts of data. The sheer size of these uncompressed signals often makes them undesirable or unsuitable for storage and transmission.

Compression techniques can be used to reduce the file size of digital signals. They reduce the digital storage space required to store an audio signal for future playback or transmission. In addition, these techniques can be used to generate a faithful representation of the audio signal at a reduced file size. This low-bit-rate version of the audio signal can then be transmitted at a low bit rate over a network channel with limited bandwidth. After transmission, the compressed version of the audio signal is decompressed to reconstruct a sonically acceptable representation of the input audio signal.

As a general rule, the quality of the reconstructed audio signal falls as the number of bits used to encode the input audio signal is reduced. In other words, the fewer bits used to encode the audio signal, the greater the difference between the reconstructed audio signal and the input audio signal. Conventional audio compression techniques fix the bit rate, and therefore the level of audio quality, at the time of compression encoding. The bit rate is the number of bits used to encode the input audio signal per time period. No further reduction in the bit rate can be achieved without either re-encoding the input audio signal at a lower bit rate, or decompressing the compressed audio signal and then re-compressing the decompressed signal at a lower bit rate. These conventional techniques are not "scalable" in situations where different applications require bitstreams encoded at different bit rates.

One technique used to create scalable bitstreams is differential encoding. Differential encoding encodes the input audio signal as a high-bitrate bitstream composed of subsets of low-bitrate bitstreams. The low-bitrate bitstreams are then used to construct the higher-bitrate bitstreams. Differential encoding requires extensive analysis of the bitstream being scaled and is computationally intensive. This computational intensity requires significant processing power to achieve real-time performance.

Another scalable coding technique uses multiple compression methods to create layered, scalable bitstreams. This approach uses a mix of compression techniques to cover a desired range of scalable bitrates. However, the limited scalability range and limited resolution make this layered approach unsuitable for many types of applications. For these reasons, the desired scenario of storing a single compressed audio bitstream and delivering content from this single bitstream at different bitrates is often difficult to implement.

SUMMARY OF THE INVENTION

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

Embodiments of a post-encoding bitrate reduction system and method generate one or more scaled compressed bitstreams from a single full file. The full file contains multiple audio object files that have previously been encoded individually. Thus, processing of the full file is performed after the audio object files have been encoded, and it relies on the scalability built into the full file.

The encoding process used for each encoded audio file is scalable, so that bits can be truncated from the frames of the encoded audio file to reduce the file size. This scalability allows data to be encoded at a specific bit rate, after which any percentage of the encoded data can be cut off or discarded while still retaining the ability to decode the encoded data correctly. For example, if the data is encoded at a bit rate of Z, half of each frame can be cut off or discarded to obtain half the bit rate (Z/2), and the result can still be decoded correctly.
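The truncation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the encoder's actual frame format: the function name `truncate_frame`, the byte-level granularity, and the assumption that the most important data sits at the front of each frame are hypothetical choices made for the example.

```python
def truncate_frame(frame: bytes, keep_fraction: float) -> bytes:
    """Keep only the leading part of a scalable frame.

    Assumes (for illustration) that the frame stores its most important
    data first, so the tail can be discarded without breaking decoding.
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be between 0 and 1")
    keep_bytes = int(len(frame) * keep_fraction)
    return frame[:keep_bytes]


# Stand-in for one frame encoded at bit rate Z (200 bytes of dummy data).
full_frame = bytes(range(200))

# Dropping half of the frame yields roughly half the bit rate (Z/2).
half_frame = truncate_frame(full_frame, 0.5)
```

Because only a slice is taken, no decoding or re-encoding step is involved, which is the property the text relies on.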

One instance where this fine-grained scalability, and the ability to work from a single encoded full file, is valuable is streaming to devices with different bandwidths. For example, if multiple audio object files are located on a server, embodiments of the present system and method encode these audio object files individually at some high bit rate that the content provider wants to achieve. If this content is then streamed to different, lower-bandwidth devices, such as cell phones, cars, and televisions, the bit rate needs to be reduced. While working from a single encoded full file, embodiments of the present system and method allow the bit rate to be adjusted to the bandwidth of each individual device. Each delivery is therefore tailored differently, yet a single file is used to deliver bitstreams at different bit rates. Furthermore, there is no need to re-encode the encoded audio object files.

Rather than re-encoding the audio object files, embodiments of the present system and method process a single version of the encoded full file and then scale down the bit rate. This bit rate scaling is accomplished without first decoding the full file back to its uncompressed form and then re-encoding the resulting uncompressed data at a different bit rate.

While encoding and compression are expensive and computationally demanding processes, the post-encoding bitrate scaling of embodiments of the present system and method is a very lightweight process. This means that embodiments of the present system and method impose much smaller server requirements than existing systems and methods that perform multiple encodings simultaneously to serve each different channel bitrate.

Embodiments of the present system and method generate a scaled compressed bitstream from a single full file. The full file, at a full bit rate, is created by merging multiple individually encoded audio object files. An audio object is a source signal for a specific sound or combination of sounds. In some embodiments, the full file includes layered metadata corresponding to the encoded audio object files. The layered metadata contains priority information for each encoded audio object file relative to the other encoded audio object files. For example, a dialogue audio object in a movie soundtrack may be weighted more heavily than a street noise audio object during the same time period. In some embodiments, the entire time length of each encoded audio object file is used in the full file. This means that even if an encoded audio object file contains silent periods, those periods are still included in the full file.

Each audio object file is segmented into data frames. A time period is selected, and the data frame activities of the data frames of each encoded audio file at that time period are compared with each other. This gives a data frame activity comparison across all encoded audio files at the selected time period. Then, based on the data frame activity comparison and, in some cases, the layered metadata, bits from an available bit pool are allocated to each data frame of the encoded audio object files during the selected time period. This produces a bit allocation for the selected time period. In some embodiments, the layered metadata includes encoded audio object file priorities, so that the files are ranked in order of priority or importance to the user. It should be noted that bits from the available bit pool are allocated to all data frames and all encoded audio object files at the selected time period. In other words, at a given time period every audio object file and its frames receive bits, but some files receive more bits than others based on their frame activity and other factors.
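The allocation step above can be illustrated with a short Python sketch. The text does not give the allocation formula, so the proportional rule used here (bits in proportion to activity multiplied by a priority weight) is an assumption, as are the name `allocate_bits`, the object names, and the example numbers.

```python
def allocate_bits(activities, bit_pool, priorities=None):
    """Split bit_pool across audio objects for one time period.

    activities: dict mapping object name -> measured frame activity.
    priorities: optional dict mapping object name -> priority weight
                (hypothetical stand-in for the layered metadata).
    Returns a dict mapping object name -> allocated bits.
    """
    if priorities is None:
        priorities = {name: 1.0 for name in activities}
    weights = {name: activities[name] * priorities[name] for name in activities}
    total = sum(weights.values())
    if total == 0:
        # No measurable activity anywhere: fall back to an even split.
        share = bit_pool // len(activities)
        return {name: share for name in activities}
    return {name: int(bit_pool * w / total) for name, w in weights.items()}


# Hypothetical activities for three objects in the same time period.
activities = {"dialogue": 6.0, "music": 3.0, "street_noise": 1.0}
allocation = allocate_bits(activities, bit_pool=1000)
```

With equal priorities the dialogue frame receives the largest share, and every active object still receives some bits, which matches the behaviour described in the text.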

The measurement of data frame activity can be based on any number of parameters available in the encoded bitstream. For example, audio level, video activity, and other measures of frame activity can be used. In some embodiments of the present system and method, data frame activity is measured on the encoder side and embedded in the bitstream, for example as one number per frame. In other embodiments, decoded frames can be analyzed for frame activity.

In some embodiments, data frame activity is compared between frames. Typically, during a certain time period, some data frames will have more activity, while other data frames will have less activity. The data frame comparison includes selecting a time period and then measuring the data frame activity within the data frames during that time period. Each frame of the encoded audio object is examined during the selected time period. The data frame activity in each data frame is then compared with the other frames to obtain a data frame activity comparison. This comparison is a measure of the activity of a particular data frame relative to the other data frames during that time period.

Embodiments of the present system and method then scale down the full file by truncating bits from the data frames according to the bit allocation to generate reduced frames. This bit truncation uses the scalability of the full file and truncates the bits in a data frame in reverse ranking order. The result is that each data frame keeps the number of bits assigned to it in the bit allocation, with lower-ranked bits truncated before higher-ranked bits. In some embodiments, the scalability of the frames within an encoded audio object file comes from extracting tones from a frequency domain representation of the audio object file to obtain a time domain residual signal that represents the audio object file with at least some tones removed. The extracted tones and the time domain residual signal are formatted into a plurality of data blocks, each comprising a plurality of bytes of data. Both the data blocks in a data frame of the encoded audio object file and the bits within each data block are sorted in order of psychoacoustic importance, giving a ranking order from the most important bit to the least important bit.
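As a rough sketch of truncation in reverse ranking order, the following Python function keeps the highest-ranked data blocks of a frame until a bit budget is reached and drops the rest. It works at whole-block granularity for simplicity, whereas the text describes ordering down to individual bits. The name `pare_frame` and the sample blocks are hypothetical.

```python
def pare_frame(blocks, bit_budget):
    """Keep the most important blocks that fit within bit_budget.

    blocks: list of bytes objects, ordered from most to least
            psychoacoustically important (assumed ordering).
    Returns the list of blocks that survive truncation.
    """
    kept = []
    used_bits = 0
    for block in blocks:
        block_bits = 8 * len(block)
        if used_bits + block_bits > bit_budget:
            # Everything from here on is lower ranked, so it is dropped.
            break
        kept.append(block)
        used_bits += block_bits
    return kept


# Hypothetical frame: 4-, 8-, and 6-byte blocks in decreasing importance.
frame_blocks = [b"tone", b"residual", b"detail"]
reduced = pare_frame(frame_blocks, bit_budget=100)
```

Because the blocks are already ranked, truncation is a single pass over the frame with no psychoacoustic analysis at scaling time.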

Bit-reduced encoded audio object files are obtained from the reduced frames. The bit-reduced encoded audio object files are then multiplexed together and packed into a scaled compressed bitstream such that the scaled compressed bitstream has a target bitrate that is less than or equal to the full bitrate, thereby achieving post-encoding bitrate reduction of the single full file.

The data frame activity measured for each data frame at the selected time period is compared to a silence threshold to determine whether a minimum amount of activity exists in the data frame. If the data frame activity of a particular data frame is less than or equal to the silence threshold, that data frame is designated a silent data frame, and the number of bits used to represent it is kept as is, without reducing any bits. If, on the other hand, the data frame activity of a particular data frame is greater than the silence threshold, the data frame activity is stored in a frame activity buffer. The available bit pool for the selected time period is determined by subtracting the bits used by silent data frames during the selected time period from the number of bits allocated to the selected time period.
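The silence test and the bit pool calculation above can be sketched as follows. This is an illustration under stated assumptions: the function name `available_bit_pool`, the tuple layout of the input, and the numeric values are all hypothetical.

```python
def available_bit_pool(frames, period_bits, silence_threshold):
    """Classify frames as silent or active and compute the bit pool.

    frames: dict mapping object name -> (activity, frame_bits).
    period_bits: total bits assigned to the selected time period.
    Returns (pool, activity_buffer, silent_frames).
    """
    silent_frames = {}    # object name -> bits kept untouched
    activity_buffer = {}  # object name -> activity of non-silent frames
    for name, (activity, frame_bits) in frames.items():
        if activity <= silence_threshold:
            silent_frames[name] = frame_bits
        else:
            activity_buffer[name] = activity
    # Silent frames keep all their bits; the remainder is shared out.
    pool = period_bits - sum(silent_frames.values())
    return pool, activity_buffer, silent_frames


# Hypothetical frames for one time period.
frames = {
    "dialogue": (0.8, 4000),
    "music": (0.4, 4000),
    "ambience": (0.0, 64),  # silent frame, already very small
}
pool, activity_buffer, silent_frames = available_bit_pool(
    frames, period_bits=5000, silence_threshold=0.01
)
```

The remaining pool would then be divided among the frames recorded in the activity buffer.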

In some embodiments, the scaled compressed bitstream is transmitted over a network channel at a bit rate that is less than or equal to the target bit rate. The bitstream is received by a receiving device and then decompressed to obtain decoded audio object files. In some cases, the decoded audio object files are mixed to create an audio object mix. The decoded audio objects can be mixed manually by a user or automatically. In addition, the encoded audio object files in the layered metadata can be prioritized based on their spatial position in the audio object mix. Moreover, two or more decoded audio object files can depend on each other for spatial masking based on their positions in the mix.

Embodiments of the present system and method can also be used to obtain multiple scaled compressed bitstreams from a single full file. This is accomplished by using a scalable bitstream encoder with fine-grained scalability to individually encode multiple audio object files at the full bit rate to obtain multiple encoded audio object files. This fine-grained scalability feature ranks the bits in each data frame of the encoded audio object files in order of psychoacoustic importance to human hearing. A full file is generated by merging the multiple independently encoded audio object files and the corresponding layered metadata. Each of the multiple encoded audio object files is persistent and exists for the entire duration of the full file.

A first scaled compressed bitstream at a first target bitrate is constructed from the full file, as is a second scaled compressed bitstream at a second target bitrate. This generates multiple scaled bitstreams at different target bitrates from a single full file without any re-encoding of the multiple encoded audio object files. The first target bitrate and the second target bitrate are different from each other, and both are less than the full bitrate. The first target bitrate is the maximum bitrate at which the first scaled compressed bitstream will be transmitted over a network channel.
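The idea of deriving several scaled bitstreams from one full file can be sketched as below. To keep the example short it truncates every frame by the same ratio, which is a deliberate simplification: the method described in this document allocates bits unevenly according to frame activity. The name `scale_stream` and the bitrates are hypothetical.

```python
def scale_stream(full_frames, full_bitrate, target_bitrate):
    """Return a scaled copy of full_frames for the given target bitrate.

    Uniform truncation is used for illustration only; the source frames
    are never modified or re-encoded.
    """
    if target_bitrate > full_bitrate:
        raise ValueError("target bitrate cannot exceed the full bitrate")
    keep = target_bitrate / full_bitrate
    return [frame[: int(len(frame) * keep)] for frame in full_frames]


# Stand-in for a full file: three 1000-byte frames encoded at 768 kbit/s.
full_frames = [bytes(1000) for _ in range(3)]
FULL_BITRATE = 768_000

# Two deliveries at different target bitrates from the same full file.
first_stream = scale_stream(full_frames, FULL_BITRATE, 384_000)
second_stream = scale_stream(full_frames, FULL_BITRATE, 192_000)
```

Both streams are produced from the same stored frames, so serving an additional target bitrate costs only another truncation pass.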

As described above, the data frame activities of the data frames of each of the multiple encoded audio files at a selected time period are compared with each other to obtain a data frame activity comparison. This data frame activity comparison and the first target bit rate are used to allocate bits to each of the data frames of the encoded audio object files at the selected time period, giving a bit allocation for the selected time period. The full file is scaled down by truncating bits from the data frames according to the bit allocation, in order to achieve the first target bit rate and obtain bit-reduced encoded audio object files. These bit-reduced encoded audio object files are multiplexed together and packed into the first scaled compressed bitstream at the first target bit rate. The first scaled compressed bitstream is transmitted to a receiving device at the first target bit rate and decoded to obtain decoded audio objects. These decoded audio objects are mixed to create an audio object mix.

It should be noted that alternative embodiments are possible and that, depending on the particular embodiment, the steps and elements discussed herein may be changed, added, or eliminated. These alternative embodiments include alternative steps and alternative elements that may be used, and structural changes that may be made, without departing from the scope of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

Referring now to the drawings, in which like reference numerals designate corresponding parts throughout:

FIG. 1 is a block diagram illustrating a general overview of embodiments of the post-encoding bit rate reduction system and method.

FIG. 2 is a block diagram illustrating a general overview of embodiments of the post-encoding bit rate reduction system that obtain multiple scaled compressed bitstreams from a single full file.

FIG. 3 is a block diagram showing details of a first embodiment of the post-encoding bit rate reduction system shown in FIGS. 1 and 2.

FIG. 4 is a block diagram showing details of a second embodiment of the post-encoding bit rate reduction system shown in FIGS. 1 and 2.

FIG. 5 is a block diagram illustrating an exemplary embodiment of the scalable bitstream encoder shown in FIGS. 1 and 4.

FIG. 6 is a block diagram illustrating an example of embodiments of the post-encoding bit rate reduction system and method implemented in a networked environment.

FIG. 7 is a block diagram showing details of the frame-by-frame hierarchical bit allocation module shown in FIG. 3.

FIG. 8 is a flow chart illustrating the general operation of embodiments of the post-encoding bit rate reduction system and method shown in FIGS. 1-7.

FIG. 9 is a flow chart showing details of a first embodiment of the post-encoding bit rate reduction system and method shown in FIGS. 1-8.

FIG. 10 illustrates an audio frame according to some embodiments of the post-encoding bit rate reduction system and method shown in FIGS. 1-9.

FIG. 11 shows an exemplary embodiment of a scalable frame of data generated by the scalable bitstream encoder shown in FIG. 1.

FIG. 12 shows an exemplary embodiment of dividing a full file into multiple frames and time periods.

FIG. 13 shows details of the frames of the full file within a time period.

DETAILED DESCRIPTION

In the following description of embodiments of the post-encoding bit rate reduction system and method, reference is made to the accompanying drawings. These drawings show, by way of illustration, specific examples of how embodiments of the post-encoding bit rate reduction system and method may be practiced. It should be understood that other embodiments may be used and structural changes may be made without departing from the scope of the claimed subject matter.

I. Introduction

An audio object is a source signal for a specific sound or combination of sounds. In some cases, an audio object also includes its associated rendering metadata. Rendering metadata is data accompanying an audio object that indicates how the audio object should be rendered in the audio space during playback. This metadata can include multidimensional audio space information, positions in space, and surround arrangement information.

Audio objects can represent various types of sound sources, such as individual instruments and vocals. Additionally, audio objects can include audio stems, which are sometimes called submixes, subgroups, or buses. An audio stem can also be a single track containing a group of audio content, such as a string section, a horn section, or street noise.

In traditional audio content production environments, audio objects are recorded. Professional audio engineers then mix the audio objects into a final master mix. The resulting mix is then delivered to the end user for playback. Generally speaking, this mix of audio objects is final, and the end user has little ability to make changes to it.

In contrast to traditional audio content production, multi-object audio (or its variations) allows the end user to mix audio objects after delivery. One way to control and specify this post-delivery mixing in a specific or suggested manner is by utilizing embedded metadata transmitted with the audio content. Another way is by providing user controls that allow the end user to directly manipulate and mix audio objects. Multi-object audio allows the end user to create unique and highly personalized audio presentations.

Multi-object audio can be stored as a file on a storage device and then transmitted in a bitstream when requested. The audio bitstream can be compressed or encoded to reduce the bit rate required to transmit the bitstream and the storage space required to store the file. Generally speaking, by way of explanation and not limitation, compressing a bitstream means that less information is used to represent the bitstream. Encoding a bitstream, on the other hand, means that the bitstream is represented in another form, such as by symbols. Encoding does not always compress the bitstream.

The encoded bitstream is transmitted over a network channel of limited bandwidth. Embodiments of the post-encoding bitrate reduction system and method take individually encoded audio objects and combine them with each other and with additional data to generate an encoded bitstream. When individually encoded audio objects are transmitted, the bandwidth of the encoded bitstream containing them often exceeds the capacity of the network channel. In that case, a bitstream with a lower bitrate that is unsuitable for the particular application may be transmitted over the network channel, which can reduce the quality of the received audio data.

This degradation in quality is particularly problematic when multiple streams of audio data (such as multiple audio objects) are multiplexed for simultaneous or near-simultaneous transmission over a common network channel. This is because, in some cases, the bandwidth of each encoded audio object is proportionally degraded, which does not take into account the relative content of each audio object or group of audio objects. For example, one audio object may contain music, while another may contain street noise. Proportionally degrading the bandwidth of each audio object will likely have a more detrimental effect on the music data than on the noise data.

There may be times when an encoded bitstream is being transmitted over a network channel at a particular bitrate and the channel conditions change. For example, the bandwidth of the channel may become constricted, requiring a lower transmission bitrate. In these situations, embodiments of the post-encoding bitrate reduction system and method can react to the change in network conditions by adjusting the scaling of the encoded bitstream's bitrate. For example, when the bandwidth of the network channel becomes limited, the bitrate of the encoded bitstream is reduced so that transmission over the network channel can continue. Rather than re-encoding the audio objects, embodiments of the present system and method process a single version of the encoded bitstream and then scale the bitrate down. The resulting scaled bitstream can then be transmitted over the network channel at the reduced bitrate.

Scenarios may arise where it is desirable to transmit a single encoded bitstream at different bitrates over various network channels. This may occur, for example, when each network channel has a different capacity and bandwidth, or when the bitstream is received by devices with different capabilities. In such cases, embodiments of the present system and method remove the need to encode or compress separately for each channel. Instead, a single version of the encoded bitstream is used, and the scaling of the bitrate is adjusted in response to the capacity of each channel.

The encoded bitstream can be processed in real time or substantially in real time. Substantially real-time processing occurs, for example, when the entire audio file or program is not accessible, such as during the broadcast of a live sporting event. Audio data can also be processed offline and played back in real time. This is possible when the entire audio file or program is accessible, such as in a video-on-demand application. An encoded audio bitstream can include multiple audio objects, some or all of which include sound information and associated metadata. This metadata can include, but is not limited to, position information (position in space, velocity, trajectory, and so on) and sound-wave characteristics (divergence, radiation parameters, and so on).

Each audio object or group of audio objects can be individually encoded using the same or different encoding techniques. Encoding can be performed on frames or blocks of the bitstream. A "frame" is a discrete segment of data in time that is used in the compression and encoding of audio signals. These data frames can be placed one after another in a serial sequence (like a strip of movie film) to create a compressed audio bitstream. Each frame has a fixed size and represents a constant time interval. The frame size depends on the pulse code modulation (PCM) sampling rate and the encoding bit rate.
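The dependence of frame size on sampling rate and bit rate can be made concrete with a small calculation. The sketch below assumes a fixed number of PCM samples per frame (1024 here), which is a hypothetical value chosen for illustration and is not stated in this document; the function names are likewise illustrative.

```python
def frame_duration_seconds(samples_per_frame: int, sample_rate_hz: int) -> float:
    """Time interval covered by one frame."""
    return samples_per_frame / sample_rate_hz


def frame_size_bits(bit_rate_bps: int, samples_per_frame: int, sample_rate_hz: int) -> int:
    """Bits per frame at a constant bit rate (integer arithmetic)."""
    return bit_rate_bps * samples_per_frame // sample_rate_hz


# Assumed example: 1024-sample frames, 48 kHz PCM, 192 kbit/s encoding.
duration = frame_duration_seconds(1024, 48_000)
size = frame_size_bits(192_000, 1024, 48_000)
```

Under these assumptions, halving the bit rate halves the number of bits per frame while the time interval each frame represents stays the same.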

Each data frame is typically preceded by a header containing information about the data that follows. The header may be followed by error detection and correction data, while the remainder of the frame contains the audio data. The audio data consists of PCM data and amplitude (volume) information at a specific point in time. To produce recognizable sound, tens of thousands of frames are played in sequence to produce frequencies.

Depending on the goals of a particular application, different frames (such as frames of the same object but occurring at different times) can be encoded using different bit rates based on, for example, the audio content of the frame. This approach is known as variable bit rate (VBR) encoding because the size of the encoded data changes over time. This approach can provide flexibility and improve the quality-to-bandwidth ratio of the encoded data. Alternatively, the frames can be encoded using the same bit rate. This approach is known as constant bit rate (CBR) encoding because the size of the encoded data is constant over time.

While it is possible to transmit audio objects independently in an unencoded and uncompressed form in order to maintain separation, this is usually not feasible because of the large bandwidth generally needed to send such large files. Therefore, some form of audio compression and encoding is frequently used to facilitate the economical delivery of multi-object audio to end users. It has been found that encoding an audio signal that contains audio objects to reduce its bit rate, while still maintaining appropriate acoustic separation between the audio objects, is difficult.

For example, some existing audio compression techniques for multiple audio objects are based on dependencies between the objects. In particular, joint coding techniques frequently exploit dependencies between audio objects based on factors such as position, spatial masking, and frequency masking. One challenge with these joint coding techniques, however, is that it is difficult to predict spatial and frequency masking between objects if the placement of the objects is not known before delivery.

Another type of existing audio compression technique is discrete object-based audio scene coding, which typically requires computationally expensive decoding and rendering systems as well as high transmission or data storage rates to carry multiple audio objects individually. Another type of coding technique for delivering multi-object audio is multi-channel spatial audio coding. Unlike discrete object-based audio scene coding, however, this spatial audio coding approach does not define separable audio objects. Therefore, a spatial audio decoder cannot separate the contribution of each audio object in the downmix audio signal.

Yet another technique for encoding multiple audio objects is Spatial Audio Object Coding (SAOC). However, SAOC cannot completely separate audio objects in the downmix signal that are concurrent in the time-frequency domain. Consequently, extensive amplification or attenuation of objects by an SAOC decoder, as might be required by interactive user controls, can noticeably degrade the audio quality of the reproduced scene.

It should be noted that, for didactic purposes and ease of illustration, this document refers primarily to the use of audio data. However, the features described herein can also be applied to other forms of data, including video data and data containing time-series signals such as seismic and medical data. Furthermore, the features described herein can also be applied to virtually any type of data manipulation, such as the storage and transmission of data.

II. System Overview

Embodiments of the post-encoding bit rate reduction system and method encode multiple audio object files individually and independently at a full bit rate. Embodiments of the system and method then merge these encoded audio object files with their associated hierarchical metadata to generate a full file. Multiple bitstreams can be obtained from the single full file. These bitstreams are at target bit rates that are less than or equal to the full bit rate. This change of bit rate, known as scaling, ensures that optimal quality is maintained at each scaled bit rate. In addition, the bit rate is scaled without first decoding the full file back to its uncompressed form and then re-encoding the resulting uncompressed data at a different bit rate.

As explained in detail below, this scaling is achieved in part as follows. First, the audio object files are individually encoded using a scalable bitstream encoder that orders the bits in each frame based on psychoacoustic importance. This scalable encoding also allows the bit rate to be changed in fine-grained steps by removing bits within a frame. Second, at each frame time interval, the corresponding frame activity within each object file is considered. Then, based on the relative relationship between these frame activity measures, embodiments of the system and method decide how much of the frame payload of each compressed object file is retained. In other words, each frame payload of an audio object file is scaled based on its measured multimedia frame activity and the relationship of that activity to all frame activities in all other audio object files that are to be multiplexed together.
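
The first step above, removing bits from a frame whose contents are ordered by importance, can be sketched as follows. This is a minimal illustration under the assumption that the most important data sits at the front of each frame payload, so that scaling reduces to truncating the tail. The function name and byte-level granularity are hypothetical and are not the encoder described here.

```python
def scale_frame(payload: bytes, keep_bits: int) -> bytes:
    """Keep the leading (most important) part of an importance-ordered payload.

    Assumes importance decreases from the first byte to the last, so the
    least important data is dropped first. No decoding or re-encoding occurs.
    """
    keep_bytes = max(0, min(len(payload), keep_bits // 8))
    return payload[:keep_bytes]


full_frame = bytes(range(32))            # stand-in for one encoded frame
medium = scale_frame(full_frame, 128)    # keeps the first 16 bytes
small = scale_frame(full_frame, 64)      # keeps the first 8 bytes
print(len(medium), len(small))
```

Because every scaled frame is a prefix of the full frame, several lower-rate versions can be derived from the same encoded data, and a smaller version is always contained in a larger one.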

FIG. 1 is a block diagram illustrating a general overview of embodiments of the post-encoding bit rate reduction system 100. The system 100 resides on a server computing device 110. Embodiments of the system 100 receive an audio signal 120 as input. The audio signal 120 can contain various types of content in a variety of forms and types. Moreover, the audio signal 120 may be analog, digital, or in some other form. It may be a signal that occurs in repeated discrete amounts, in a continuous stream, or in some other manner. The content of the input signal may be virtually anything, including audio data, video data, or both. In some embodiments, the audio signal 120 contains a plurality of audio object files.

Embodiments of the system 100 include a scalable bitstream encoder 130 that separately encodes each audio object file contained in the audio signal 120. It should be noted that the scalable bitstream encoder 130 may be a plurality of encoders. As shown in FIG. 1, the output of the scalable bitstream encoder 130 is M independently encoded audio object files, namely encoded audio object file (1) to encoded audio object file (M), where M is a non-zero positive integer. The encoded audio object files (1) to (M) are merged with the associated hierarchical metadata to obtain a full file 140.

Whenever a bitstream having a specific target bit rate 160 is desired, the full file 140 is processed by a bit reduction module 150 to produce the desired bitstream. The bit reduction module 150 processes the full file 140 to produce a scaled compressed bitstream 170 having a bit rate less than or equal to the target bit rate 160. Once the scaled compressed bitstream 170 has been generated, it can be sent to a receiving device 180. The server computing device 110 communicates with other devices (such as the receiving device 180) over a network 185. The server computing device 110 accesses the network 185 using a first communication link 190, and the receiving device 180 accesses the network 185 using a second communication link 195. In this manner, the scaled compressed bitstream 170 can be requested by, and sent to, the receiving device 180.

In the embodiment shown in FIG. 1, the network channel includes the first communication link 190, the network 185, and the second communication link 195. The network channel has a certain maximum bandwidth, which is communicated to the bit reduction module as the target bit rate 160. The scaled compressed bitstream 170 is transmitted over the network channel at or below the target bit rate so that the maximum bandwidth of the channel is not exceeded.

As noted above, in some cases it is desirable to transmit a single full file at different bit rates over multiple network channels having a variety of capabilities. FIG. 2 is a block diagram illustrating a general overview of embodiments of the post-encoding bit rate reduction system 100 obtaining multiple scaled compressed bitstreams from a single full file 140. As shown in FIG. 2, the full file 140 contains M encoded audio object files at a certain full bit rate. Specifically, FIG. 2 shows encoded audio object file (1) at the full bit rate, encoded audio object file (2) at the full bit rate, encoded audio object file (3) at the full bit rate, and any additional encoded audio object files (as indicated by the ellipsis), up to and including encoded audio object file (M) at the full bit rate.

Encoded audio object file (1) to encoded audio object file (M) are independently encoded at the full bit rate by the scalable bitstream encoder 130. The full bit rate is higher than the target bit rate 160. Typically, the target bit rate 160 is the bit rate used to transmit content over a network channel without exceeding the available bandwidth of the channel.

In some embodiments, the full file 140 uses a high bit rate to encode the M independently encoded audio object files, so that the full file 140 is quite large. This can be problematic if the content of the full file 140 is to be transmitted over a network channel having limited bandwidth. As explained in detail below, to mitigate the difficulties associated with sending large files (such as the full file 140) over a limited-bandwidth channel, the encoded audio object files (1) to (M) are processed by the bit reduction module 150 to create multiple scaled encoded bitstreams from the single full file 140. This is achieved in part by removing chunks of ordered data from the data frames based on a bit allocation.

Although a single target bit rate 160 is shown in FIG. 1, in some cases there may be multiple target bit rates. For example, it may be desirable to transmit the full file 140 over various network channels that each have a different bit rate. As shown in FIG. 2, there are N target bit rates 200, where N is a non-zero positive integer. The target bit rates 200 include target bit rate (1), target bit rate (2), and so on, up to target bit rate (N).

The bit reduction module 150 receives the target bit rate 160 in order to scale the bit rate of the full file 140 so that the resulting scaled encoded bitstream best fits a particular limited-bandwidth channel. The target bit rates 200 are typically sent from an Internet service provider (ISP) to inform embodiments of the system 100 and method of the bandwidth requirements and capabilities of the network channel over which the bitstream will be transmitted. The target bit rates 200 are less than or equal to the full bit rate.

In the exemplary embodiment of FIG. 2, the target bit rates 200 include N different target bit rates, where N is a non-zero positive integer that may be equal to, less than, or greater than M. The target bit rates 200 include target bit rate (1), target bit rate (2), additional target bit rates in some cases (as indicated by the ellipsis), and target bit rate (N). Typically the target bit rates 200 differ from one another, although they may be similar in some embodiments. It should also be noted that the target bit rates 200 can be sent together or individually over time.

The scaled compressed bitstreams shown in FIG. 2 correspond to the target bit rates 200. For example, target bit rate (1) is used to create scaled compressed bitstream (1) at target bit rate (1), and target bit rate (2) is used to create scaled compressed bitstream (2) at target bit rate (2). In some cases there are additional scaled compressed bitstreams at additional target bit rates (as indicated by the ellipsis), up to scaled encoded file (N), where N is the same non-zero positive integer described above. In some embodiments, individual target bit rates may be similar or identical, but typically they differ from one another.

It should be noted that, for didactic purposes, a specific number of encoded audio object files, target bit rates, and scaled compressed bitstreams are shown in FIG. 2. However, there are cases in which N=1 and M=1 and a single scaled compressed bitstream is obtained from the full file 140. In other embodiments, N may be a large number, with several scaled compressed bitstreams obtained from the full file 140. Moreover, the scaled compressed bitstreams may be created on the fly in response to a request from a client. Alternatively, the scaled compressed bitstreams may be created in advance and stored on a storage device.

III. System Details

The system details of the components of embodiments of the post-encoding bit rate reduction system 100 will now be discussed. These components include the bit reduction module 150, the scalable bitstream encoder 130, and a frame-by-frame hierarchical bit allocation module. In addition, decoding of the scaled compressed bitstream 170 on the receiving device 180 will be discussed. It should be noted that only a few of the ways in which the system may be implemented are detailed below. Many variations are possible.

FIG. 3 is a block diagram illustrating details of a first embodiment of the post-encoding bit rate reduction system 100 shown in FIGS. 1 and 2. In this particular embodiment, the audio object files have already been independently and individually encoded and are contained in the full file 140. The full file 140 is input to embodiments of the post-encoding bit rate reduction system 100. The system 100 receives the individually encoded audio object files 300 at the full bit rate for further processing.

The individually encoded audio object files 300 are processed by the bit reduction module 150. As explained in detail below, the bit reduction module 150 reduces the number of bits used to represent the encoded audio object files in order to achieve the target bit rates 200. The bit reduction module 150 receives the individually encoded audio object files 300 and processes them using a frame-by-frame hierarchical bit allocation module 310. The module 310 reduces the number of bits in each frame based on a hierarchical bit allocation scheme. The output of the module 310 is the bit-reduced encoded audio object files 320.

A statistical multiplexer 330 takes the bit-reduced encoded audio object files 320 and combines them. In some embodiments, the statistical multiplexer 330 allocates channel capacity or bandwidth (measured in number of bits) to each of the encoded audio object files 1 to M based at least in part on the hierarchical bit allocation scheme. In some embodiments, the encoded audio object files are variable bit rate (VBR) encoded data and the statistical multiplexer 330 outputs constant bit rate (CBR) encoded data.

In some embodiments, the statistical multiplexer 330 also takes into account other characteristics of the encoded audio object files during bit allocation. For example, the audio content of an encoded audio object file (such as music, speech, or noise) can be relevant. An encoded audio object file associated with a simple collision sound (such as noise) may require less bandwidth than an object associated with a music track. As another example, the loudness of an object can be used in bandwidth allocation (so that loud objects benefit from a larger bit allocation). As yet another example, the frequency content of the audio data associated with an object can also be used in bit allocation (so that wideband objects benefit from a larger bit allocation).
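
One way to picture how content type, loudness, and frequency content could influence the allocation is a weighted share of the channel capacity. The sketch below is only an illustration: the weight table, the 20 kHz reference bandwidth, the multiplicative combination, and all names are assumptions made for the example and are not taken from this description.

```python
# Hypothetical per-content weights; the description gives no numeric values.
CONTENT_WEIGHT = {"music": 1.0, "speech": 0.5, "noise": 0.25}


def object_weight(content: str, loudness: float, bandwidth_hz: float) -> float:
    """Combine content type, loudness (0..1) and audio bandwidth into one weight."""
    return CONTENT_WEIGHT.get(content, 0.5) * loudness * (bandwidth_hz / 20_000)


def share_channel(objects: list[dict], channel_bits: int) -> list[int]:
    """Split channel_bits among objects in proportion to their weights."""
    weights = [object_weight(**obj) for obj in objects]
    total = sum(weights)
    if total == 0:
        return [0] * len(objects)
    return [int(channel_bits * w / total) for w in weights]


objects = [
    {"content": "music", "loudness": 1.0, "bandwidth_hz": 20_000},
    {"content": "noise", "loudness": 1.0, "bandwidth_hz": 20_000},
]
shares = share_channel(objects, 1000)
print(shares)
```

In this sketch the music object receives four times the bits of the noise object at equal loudness and bandwidth; a quieter or narrower-band object would receive proportionally less.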

A bitstream packer 340 then processes the multiplexed bit-reduced encoded audio object files 320 and packs them into frames and containers for transmission. The output of the bitstream packer 340 is the scaled compressed bitstream 170, which contains frame payloads of variable size. The scaled compressed bitstream 170 is at a bit rate less than or equal to the target bit rate 160.

In some embodiments, the audio object files have not yet been encoded. FIG. 4 is a block diagram illustrating details of a second embodiment of the post-encoding bit rate reduction system 100 shown in FIGS. 1 and 2. Unencoded audio object files 400 are received by embodiments of the system 100. The scalable bitstream encoder 130 independently encodes each of the audio object files 400 to obtain the full file 140.

The full file 140 is input to the bit reduction module 150. The frame-by-frame hierarchical bit allocation module 310 processes the full file 140 to obtain the bit-reduced encoded audio object files 320. The statistical multiplexer 330 takes the bit-reduced encoded audio object files 320 and combines them. The bitstream packer 340 then processes the multiplexed bit-reduced encoded audio object files 320 and packs them into frames and containers for transmission. The output of the bitstream packer 340 is the scaled compressed bitstream 170, which contains frame payloads of variable size. The scaled compressed bitstream 170 is at a bit rate less than or equal to the target bit rate 160.

FIG. 5 is a block diagram illustrating an exemplary embodiment of the scalable bitstream encoder 130 shown in FIGS. 1 and 4. These embodiments of the scalable bitstream encoder 130 include a plurality of scalable bitstream encoders. In the exemplary embodiment shown in FIG. 5, the scalable bitstream encoder 500 contains M encoders, namely scalable bitstream encoder (1) to scalable bitstream encoder (M), where M is a non-zero positive integer. The input to the scalable bitstream encoder 500 is the audio signal 120. In these embodiments, the audio signal 120 contains a plurality of audio object files. In particular, the audio signal 120 includes M audio object files, namely audio object file (1) to audio object file (M).

In the exemplary embodiment shown in FIG. 5, the scalable bitstream encoder 500 contains M encoders, one for each of the M audio object files. Thus, there is an encoder for every audio object. In other embodiments, however, the number of scalable bitstream encoders may be less than the number of audio object files. Regardless of the number of scalable bitstream encoders, the encoders separately encode each of the audio object files to obtain the individually encoded audio object files 300, namely individually encoded audio object file (1) to individually encoded audio object file (M).

FIG. 6 is a block diagram illustrating an example of embodiments of the post-encoding bit rate reduction system 100 and method implemented in a networked environment. In FIG. 6, embodiments of the system 100 and method are shown implemented on a computing device in the form of a media database server 600. The media database server 600 may be virtually any device that includes a processor, such as a desktop computer, a notebook computer, or an embedded device such as a mobile phone.

In some embodiments, the system 100 and method are stored on the media database server 600 as a cloud-based service for cross-application, cross-device access. The server 600 communicates with other devices over the network 185. In some embodiments, one of the other devices is the receiving device 180. The media database server 600 accesses the network 185 using the first communication link 190, and the receiving device 180 accesses the network 185 using the second communication link 195. In this manner, the media database server 600 and the receiving device 180 can communicate and transfer data between each other.

The full file 140, which contains the encoded audio object files (1) to (M), is located on the media database server 600. The full file 140 is processed by the bit reduction module 150 to obtain the bit-reduced encoded audio object files 320. The bit-reduced encoded audio object files 320 are processed by the statistical multiplexer 330 and the bitstream packer 340 to generate the scaled compressed bitstream 170 at or below a target bit rate. The target bit rate is obtained from the target bit rates 200 shown in FIG. 2.

In the embodiment shown in FIG. 6, the full file 140 is shown stored on the media database server 600. As noted above, the full file 140 contains M encoded audio object files that were independently encoded at the full bit rate. As used in this document, the bit rate is defined as the rate of flow of binary digits through a communication link or channel. In other words, the bit rate describes the rate at which bits are transferred from one location to another. The bit rate is typically expressed as a number of bits per second.

The bit rate can indicate download speed: for a given bit rate, downloading a 3 gigabyte (GB) file takes more time than downloading a 1 gigabyte file. The bit rate can also indicate the quality of a media file. As an example, an audio file compressed at 192 kilobits per second (Kbps) will generally have better or higher quality (in the form of greater dynamic range and clarity) than the same audio file compressed at 128 Kbps. This is because more bits are used to represent the data for each second of playback. Thus, the quality of a multimedia file is measured and indicated by its associated bit rate.
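
The relationship between file size, bit rate, and transfer time can be checked with simple arithmetic. The sketch below assumes decimal units (1 GB = 10^9 bytes) and a hypothetical 8 Mbit/s channel; both the function name and the channel rate are illustrative choices, not values from this description.

```python
def transfer_seconds(size_bytes: int, bit_rate_bps: int) -> float:
    """Time needed to move size_bytes over a channel running at bit_rate_bps."""
    return size_bytes * 8 / bit_rate_bps


CHANNEL_BPS = 8_000_000                               # hypothetical 8 Mbit/s channel
t_3gb = transfer_seconds(3_000_000_000, CHANNEL_BPS)  # 3 GB file
t_1gb = transfer_seconds(1_000_000_000, CHANNEL_BPS)  # 1 GB file
print(t_3gb, t_1gb)

# Bits spent on each second of audio at two encoding rates.
extra_bits_per_second = 192_000 - 128_000
```

At a fixed channel rate the transfer time grows linearly with file size, while an encoding at 192 Kbps spends 64,000 more bits on every second of playback than one at 128 Kbps.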

In the embodiments shown in FIGS. 1-5, the encoded audio object files are encoded at a full bit rate that is greater than any of the target bit rates 200. This means that the encoded audio object files of the full file 140 have higher quality than the encoded audio object files contained in the scaled compressed bitstream 170 at any of the target bit rates 200.

The full file 140 and each of the encoded audio object files are input to embodiments of the post-encoding bit rate reduction system 100 and method. As discussed in detail below, embodiments of the system 100 and method use frame-by-frame bit reduction to reduce the number of bits used to represent the encoded audio object files. This is achieved without re-encoding the objects. The result is a bit-reduced file (not shown) containing a plurality of bit-reduced encoded audio object files 320. This means that at least some of the encoded audio object files of the full file 140 are represented as bit-reduced encoded audio object files 320 using fewer bits than in the full file 140. The individual bit-reduced encoded audio object files 320 are then processed by the statistical multiplexer 330 into a single signal and packed by the bitstream packer 340 into the scaled compressed bitstream 170. The scaled compressed bitstream 170 is at a bit rate less than or equal to the target bit rate. In addition, the target bit rate is less than the full bit rate.

The scaled compressed bitstream 170 is transmitted over the network 185 to the receiving device 180. This transmission typically occurs upon request by the receiving device 180, but many other scenarios are possible, including storing the scaled compressed bitstream 170 as a file on the media database server 600. The receiving device 180 may be any network-enabled computing device capable of storing or playing back the scaled compressed bitstream 170. Although the receiving device 180 is shown in FIG. 6 as residing on a different computing device than embodiments of the post-encoding bit rate reduction system 100 and method, it should be noted that in some embodiments they may reside on the same computing device (such as the media database server 600).

The receiving device 180 processes the received scaled compressed bitstream 170 using a demultiplexer 610, which separates the bitstream into its individual encoded audio object files. As shown in FIG. 6, these individual components include encoded audio object file (1), encoded audio object file (2), encoded audio object file (3), any other encoded audio object files that are present (as indicated by the ellipsis), up to and including encoded audio object file (M). Each of these individual encoded audio object files is sent to a scalable bitstream decoder 620 that is capable of decoding the encoded audio object files. In some embodiments, the scalable bitstream decoder 620 contains a separate decoder for each encoded audio object file.

As shown in FIG. 6, in some embodiments the scalable bitstream decoder 620 includes scalable decoder (1) (used to decode encoded audio object file (1)), scalable decoder (2) (used to decode encoded audio object file (2)), scalable decoder (3) (used to decode encoded audio object file (3)), other scalable decoders as needed (as indicated by the ellipsis), and scalable decoder (M) (used to decode encoded audio object file (M)). It should be noted that in other embodiments any number of scalable decoders may be used to decode the encoded audio object files.

The output of the scalable bitstream decoder 620 is a plurality of decoded audio object files. Specifically, the plurality of decoded audio object files includes decoded audio object file (1), decoded audio object file (2), decoded audio object file (3), any other decoded audio object files that may be needed (as indicated by the ellipsis), and decoded audio object file (M). At this point, the decoded audio object files may be stored for later use or used immediately. Either way, at least a portion of the decoded audio object files is input to a mixing device 630. Typically, the mixing device 630 is controlled by a user who mixes the decoded audio object files to generate a personalized audio object mix 640. In other embodiments, however, the mixing of the decoded audio object files may be handled automatically by embodiments of the system 100 and method. In still other embodiments, the audio object mix 640 is created by a third-party vendor.

FIG. 7 is a block diagram illustrating details of the frame-by-frame hierarchical bit allocation module 310 shown in FIG. 3. The module 310 receives the individually encoded audio object files 300 that have been encoded at the full bit rate. For a particular time period, each frame of each encoded audio object file in that time period is examined across all of the encoded audio object files for the particular time period 700. Hierarchical information 710 is input to a hierarchical module 720. The hierarchical information 710 includes data about how the frames should be prioritized and, ultimately, how bits should be allocated among the frames.

The bits available in a bit pool 730 are used by an allocation module 740 to determine how many bits are available to distribute among the frames during the time period. Based on the hierarchical information 710, the allocation module 740 allocates bits among the frames in that time period. These bits are allocated across encoded audio object files, subbands, and frames based on the hierarchical information 710.

The allocation module 740 generates a bit allocation 750 that indicates the number of bits assigned to each frame in the particular time period. Based on the bit allocation, a reduction module 760 prunes bits from each frame as needed to comply with the bit allocation 750 for that particular frame. This produces pruned frames 770 for the given time period. These pruned frames are combined to generate the bit-reduced encoded audio object files 320.

IV. Operational Overview

FIG. 8 is a flow diagram illustrating the general operation of embodiments of the post-encoding bit rate reduction system 100 and method shown in FIGS. 1-7. The operation begins by inputting a plurality of audio object files (block 800). These audio object files can include a source signal combined with its associated rendering metadata and can represent various sound sources. These sound sources may include individual instruments and vocals, as well as combinations of sound sources, such as an audio object for a drum kit that contains multiple tracks for the individual components of the kit.

Next, embodiments of the system 100 and method independently and separately encode each of the audio object files (block 810). This encoding uses one or more scalable bitstream encoders having a fine-grained scalability feature. Examples of scalable bitstream encoders having a fine-grained scalability feature are set forth in U.S. Patent No. 7,333,929, issued on February 19, 2008, entitled "Modular Scalable Compressed Audio Data Stream," and U.S. Patent No. 7,548,853, issued on June 16, 2009, entitled "Scalable Compressed Audio Bit Stream and Codec Using a Hierarchical Filterbank and Multichannel Joint Coding."

The system 100 and method combine the plurality of separately encoded audio object files, along with any hierarchical metadata 710, to generate the full file 140 (block 820). The full file 140 is encoded at the full bit rate. It should be emphasized that each audio object file is encoded separately so as to preserve the separation and isolation among the plurality of audio object files.

The hierarchical metadata can contain at least three types of hierarchy or priority. One of these types of priority, or any combination of them, may be included in the hierarchical metadata. The first type of priority is bit priority within a frame. In these cases the bits are placed in order of their psychoacoustic importance to human hearing. The second type of priority is frame priority within an audio object file. In these cases the importance or priority of a frame is based on the activity of the frame. If the frame activity is high compared to the other frames during the frame time interval, the frame is ranked higher in the hierarchy than lower-activity frames.

The third type of priority is audio object file priority within the full file. This includes both cross-object masking and user-defined priority. In cross-object masking, a particular audio object file may be masked by another audio object file based on where the audio objects are rendered in the audio space. In this case, the masking audio object file has a higher priority than the masked audio object file. In user-defined priority, a user may specify that one audio object file is more important to them than another. For example, for the audio soundtrack of a movie, an audio object file containing dialogue would be more important to a user than an audio object file containing street noise or one containing background music.

Based on a desired target bit rate, the full file 140 is processed by the bit reduction module 150 to produce the scaled compressed bitstream 170. The scaled compressed bitstream is generated without any re-encoding. Moreover, the scaled compressed bitstream is designed to be transmitted over a network channel at a bit rate equal to or less than the target bit rate.

The target bit rate is always less than the full bit rate. Moreover, it should be noted that each audio object is independently encoded at a full bit rate that exceeds any target bit rate 200. In cases where the target bit rate is not known before encoding, each audio object is encoded at the maximum available bit rate, or at a bit rate that exceeds the highest target bit rate anticipated for use during transmission.

To obtain the scaled compressed bitstream, embodiments of the system 100 and method divide the full file 140 into a series of frames. In some embodiments, each audio object file in the full file 140 is present throughout the entire duration of the file 140. This is true even if an audio object file contains periods of silence during playback.

Referring again to FIG. 8, embodiments of the system 100 and method select a frame time interval (or time period) and compare the frame activity of the frames during the selected time period (block 830). This frame time interval includes a frame from each of the audio objects. The frame-by-frame comparison for the selected time period generates a data frame activity comparison for that time period. In general, frame activity is a measure of how difficult it is to encode the audio in a frame. Frame activity can be determined in a number of ways. In some embodiments, frame activity is based on the number of extracted tones and the resulting frame residual energy. Other embodiments compute the entropy of the frame to arrive at the frame activity.
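The entropy-based variant mentioned above can be illustrated with a minimal sketch. It assumes, purely for illustration, that a frame is available as a list of quantized integer values and that plain Shannon entropy over those values stands in for "frame activity"; the description does not specify the exact entropy computation, and the function name is hypothetical.

```python
import math
from collections import Counter

def frame_entropy(samples):
    """Shannon entropy, in bits per symbol, of the values in one frame.

    Assumption: `samples` is a list of quantized integers; this is a
    stand-in for frame activity, not the patent's own formula.
    """
    if not samples:
        return 0.0
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

silent_frame = [0] * 8                   # one repeated value: nothing to encode
busy_frame = [0, 1, 2, 3, 4, 5, 6, 7]    # eight distinct, equally likely values
```

Under this measure a frame holding a single repeated value scores zero, while a frame whose values are all different scores highest, which matches the intuition that harder-to-encode frames have higher activity.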

Bits from the pool of available bits are assigned, or allocated, among the frames of the selected time period (block 840). The bits are allocated based on the data frame activity and the hierarchical metadata. Once the bit allocation among the frames for the selected time period is known, the bits are distributed among the frames. Each frame is then brought into compliance with its bit allocation by pruning the bits in excess of that frame's allocation, yielding pruned frames (block 850). As explained in detail below, this bit reduction is performed in an ordered manner such that the bits with the highest priority and importance are pruned last.

Performing this bit reduction on the plurality of pruned frames across the plurality of encoded audio object files generates the bit-reduced encoded audio object files 320 (block 860). The bit-reduced encoded audio object files 320 are then multiplexed together (block 870). The system 100 and method then pack the multiplexed bit-reduced encoded audio object files 320 using the bitstream packer 340 to obtain the scaled compressed bitstream 170 at the target bit rate (block 880).

In some cases the need may arise to transmit the encoded audio objects at several different bit rates. For example, if the full file is stored on the media database server 600, it may be requested by several clients that each have different bandwidth requirements. In this case, multiple scaled compressed bitstreams can be obtained from the single full file 140. Moreover, each scaled compressed bitstream can be at a different target bit rate, where each target bit rate is less than the full bit rate. All of this can be achieved without re-encoding the encoded audio object files.

Embodiments of the system 100 and method can then transmit one or more of the scaled compressed bitstreams to the receiving device 180 at a bit rate equal to or less than the target bit rate. The receiving device 180 then demultiplexes the received scaled compressed bitstream to obtain a plurality of bit-reduced encoded audio objects. Next, the system 100 and method decode these bit-reduced encoded audio objects using at least one scalable bit rate decoder to obtain a plurality of decoded audio object files. The decoded audio object files can then be mixed by an end user or a content provider, or mixed automatically, to generate the audio object mix 640.

V. Operational Details

Embodiments of the post-encoding bit rate reduction system 100 and method include embodiments that handle silent periods in the audio and embodiments that deliver a single full file to network channels of various bandwidths. The silent-period embodiments are directed to cases in which several audio object files may have considerable periods of time during which the audio is silent or at a very low level compared to the other audio object files. For example, audio content containing music may have long periods in which the vocal track is silent or at a very low level. When these audio object files are encoded with a constant bit rate audio codec, a considerable amount of the data payload is wasted on encoding the silent periods.

The system 100 and method use the fine-grained scalability of each encoded audio object file to mitigate any waste of data (or frame) payload during silent periods. This reduces the overall compressed data payload without affecting the quality of the reconstructed compressed audio. In some embodiments, an encoded audio object file has a start time and a stop time. The start time indicates the point in time at which the silence begins, and the stop time indicates the point in time at which the silence ends. In these cases the system 100 and method can mark the frames between the start and stop times as null frames. This allows the bits to be allocated to the frames of other audio object files during that time period.
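As a rough illustration of the null-frame marking described above, the sketch below flags every frame that lies entirely between a silence start time and stop time. The function name, the use of integer milliseconds, and the rule that a frame must fall wholly inside the silent interval are assumptions made for the example, not details taken from the description.

```python
def mark_null_frames(num_frames, frame_ms, silence_start_ms, silence_stop_ms):
    """Return one flag per frame: True if the frame can be marked as a null frame.

    Assumption: a frame qualifies only when it lies entirely inside the
    [silence_start_ms, silence_stop_ms] interval.
    """
    flags = []
    for index in range(num_frames):
        frame_start = index * frame_ms
        frame_end = frame_start + frame_ms
        flags.append(silence_start_ms <= frame_start and frame_end <= silence_stop_ms)
    return flags

# Five 20 ms frames with silence from 20 ms to 60 ms: frames 1 and 2 are null.
null_flags = mark_null_frames(5, 20, 20, 60)
```

The bits that would otherwise be spent on the flagged frames can then be returned to the pool for the other audio object files in the same time periods.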

In other scenarios, an on-the-fly bit rate reduction scheme may be needed in addition to, or instead of, the silent-period embodiments. This can occur, for example, when a single high-quality encoded audio file or bitstream containing a plurality of audio object files is stored on a server that must simultaneously serve clients with different connection bandwidths. The embodiments that deliver a single full file to network channels of various bandwidths use the fine-grained scalability feature of the audio file or bitstream to scale down the overall bit rate of the encoded audio object files while attempting to preserve as much of the overall quality as possible.

The operational details of embodiments of the system 100 and method will now be discussed. FIG. 9 is a flow diagram illustrating the details of a first embodiment of the post-encoding bit rate reduction system 100 and method shown in FIGS. 1-8. The operation begins by inputting a full file containing a plurality of separately encoded audio object files (block 900). Each of the plurality of encoded audio object files is segmented into data frames (block 905).

The system 100 and method then select a time period at the beginning of the full file (block 910). This time period ideally coincides with the temporal length of an individual frame. The selected time period starts at the beginning of the full file. The method processes the data frames of the selected time period and then processes the remainder of the data frames sequentially by taking the time periods in temporal order. In other words, the next time period selected is the one adjacent in time to the previous time period, and the method described above and below is used to process the data frames during each time period.

Next, the system 100 and method select the data frames of the plurality of encoded audio object files during the selected time period (block 915). The frame activity is measured for each of the data frames in the audio object files during the selected time period (block 920). As noted above, a variety of techniques may be used to measure the frame activity.

For each data frame during the time period, the system 100 and method determine whether the measured frame activity is greater than a silence threshold (block 925). If so, the frame activity for the data frame is stored in a frame activity buffer (block 930). If the measured frame activity is less than or equal to the silence threshold, the data frame is designated as a silent data frame (block 935). This designation means that the data frame has already been reduced to a minimum payload, and the bits in that frame are used to represent the data frame without any further reduction. The silent data frame is then stored in the frame activity buffer (block 940).
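The threshold test of blocks 925-940 can be sketched as follows. The object identifiers, the numeric activity values, and the choice to keep silent frames in a separate list (the description stores them in the same frame activity buffer) are assumptions made to keep the example short.

```python
def classify_frames(frame_activity, silence_threshold):
    """Split the frames of one time period into active and silent frames.

    frame_activity maps a hypothetical object id to its measured activity.
    Returns (activity_buffer, silent_ids).
    """
    activity_buffer = {}  # object id -> activity, for frames above the threshold
    silent_ids = []       # object ids whose frame is designated a silent data frame
    for object_id, activity in frame_activity.items():
        if activity > silence_threshold:
            activity_buffer[object_id] = activity
        else:
            # Less than or equal to the threshold: treated as a silent data frame.
            silent_ids.append(object_id)
    return activity_buffer, silent_ids

active, silent = classify_frames(
    {"dialog": 0.8, "music": 0.3, "ambience": 0.05}, silence_threshold=0.05
)
```

Note that a frame whose activity exactly equals the threshold is treated as silent, matching the "less than or equal to" wording above.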

The system 100 and method then compare the data frame activity stored in the frame activity buffer for each of the data frames of the selected time period with that of the other data frames for the current time period (block 945). This yields the data frame activity comparison. The system 100 and method then determine the number of available bits used by any silent frames during the time period (block 950). The number of available bits that can be assigned to the remaining data frames during the time period is then determined. This is done by subtracting the bits used by any silent data frames from the number of bits that have been assigned for use during the time period (block 955).

Bit allocation among the remaining data frames is performed by assigning the available bits to the data frames from each of the encoded audio object files in the selected time period (block 960). This bit allocation is performed based on the data frame activity comparison and the hierarchical metadata. Next, the ordered bits in the data frames are pruned to comply with the bit allocation (block 965). In other words, the bits are removed from a data frame in such a way that the least important bits are removed first and the most important bits are removed last. This continues until only the number of bits allocated to that particular frame remains. The result is a pruned data frame.
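Blocks 950-960 can be sketched as simple arithmetic: subtract the bits held by silent frames from the period's budget, then share the remainder among the active frames. The proportional-to-activity split used here is an assumption chosen for illustration; the description also draws on hierarchical metadata, which this sketch leaves out, and all names and numbers are hypothetical.

```python
def allocate_bits(period_budget, activity_buffer, silent_ids, silent_frame_bits):
    """Assign a bit allocation to every frame of one time period.

    Assumption: active frames share the remaining bits in proportion to
    their measured activity; silent frames keep a fixed minimum payload.
    """
    available = period_budget - silent_frame_bits * len(silent_ids)
    allocation = {object_id: silent_frame_bits for object_id in silent_ids}
    total_activity = sum(activity_buffer.values())
    for object_id, activity in activity_buffer.items():
        if total_activity > 0:
            # Floor division keeps the total at or below the budget.
            allocation[object_id] = available * activity // total_activity
        else:
            allocation[object_id] = 0
    return allocation

allocation = allocate_bits(
    period_budget=1000,
    activity_buffer={"dialog": 6, "music": 3},
    silent_ids=["ambience"],
    silent_frame_bits=50,
)
```

Because the split rounds down, the allocations sum to at most the period budget, which is consistent with transmitting at a bit rate equal to or less than the target.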

These pruned data frames are stored (block 970), and a determination is made as to whether there are more time periods (block 975). If so, the next sequential time period is selected (block 980). The process begins again by selecting the data frames of the plurality of encoded audio object files at the new time period (block 915). Otherwise, the pruned data frames are packed into the scaled compressed bitstream (block 985).

V.A. Frames and Containers

As discussed above, in some embodiments the full file 140 includes a plurality of encoded audio object files. Some or all of these encoded audio object files may contain any combination of audio data, sound information, and associated metadata. Moreover, in some embodiments the encoded audio object files can be divided or partitioned into data frames. The use of data frames, or frames, can be efficient for streaming applications. In general, a "frame" is a discrete segment of data that is created by a codec and used in encoding and decoding.

FIG. 10 illustrates an audio frame 1000 according to some embodiments of the post-encoding bit rate reduction system 100 and method shown in FIGS. 1-9. The frame 1000 includes a frame header 1010, which can be configured to indicate the start of the frame 1000, and a frame end 1020, which can be configured to indicate the end of the frame 1000. The frame 1000 also includes one or more encoded audio data blocks 1030 and corresponding metadata 1040. The metadata 1040 includes one or more segment header 1050 blocks, which can be configured to indicate the start of a new metadata segment. The metadata 1040 may include the hierarchical metadata 710 used by the hierarchical module 720.

Ungrouped audio objects can be included as object segments 1060. Grouped audio objects 1070 can include group start and end blocks. These blocks can be configured to indicate the start and end of a new group. In addition, the grouped audio objects 1070 can include one or more object segments. In some embodiments, the frame 1000 can then be encapsulated into a container (such as MP4).

In general, a "container" or wrapper format is a metafile format whose specification describes how different data elements and metadata coexist in a computer file. A container refers to the way the data is organized within the file, regardless of which coding scheme is used. Moreover, the container serves to "wrap" multiple bitstreams together and to synchronize the frames to ensure that they are presented in the proper order. If needed, the container can also take care of adding information for streaming servers, so that a streaming server knows when to send which part of the file. As shown in FIG. 10, the frame 1000 can be packed into a container 1080. Examples of digital container formats that can be used for the container 1080 include Transport Stream (TS), Material Exchange Format (MXF), Moving Picture Experts Group, Part 14 (MP4), and so forth.

V.B. Fine-Grained Bitstream Scalability

The structure and order of the elements placed in the scaled compressed bitstream 170 give the bitstream 170 a wide bit range and fine-grained scalability. This structure and order allow the bitstream 170 to be smoothly scaled by an external mechanism, such as the bit reduction module 150.

FIG. 11 illustrates an exemplary embodiment of a scalable frame of data produced by the scalable bitstream encoder 130 shown in FIG. 1. It should be noted that one or more other types of audio compression codecs, based on other decomposition rules, may be used to provide fine-grained scalability to embodiments of the post-encoding bit rate reduction system 100 and method. In these cases the other codecs will provide a different set of psychoacoustically relevant elements.

The scalable compressed bitstream 170 used in the example of FIG. 11 is made up of a number of Resource Interchange File Format (RIFF) data structures (called "chunks"). It should be noted that this is an exemplary embodiment and that other types of data structures may be used. The RIFF file format, which is well known to those skilled in the art, allows both the type of data carried by a chunk and the amount of data carried by the chunk to be identified. It should be noted that any bitstream format that carries information about the amount and type of data carried in its defined bitstream data structures can be used with embodiments of the system 100 and method.
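To make the chunk idea concrete, the sketch below walks a byte string laid out in the standard RIFF manner: a four-byte chunk identifier, a four-byte little-endian length, the payload, and a pad byte after odd-length payloads. Whether the bitstream of FIG. 11 follows the RIFF padding rule is not stated, and the chunk identifiers `GRD1` and `TON1` are hypothetical.

```python
import struct

def iter_chunks(data):
    """Yield (chunk_id, payload) for each RIFF-style chunk in `data`."""
    offset = 0
    while offset + 8 <= len(data):
        # Header: 4-byte identifier followed by a 4-byte little-endian length.
        chunk_id, size = struct.unpack_from("<4sI", data, offset)
        payload = data[offset + 8 : offset + 8 + size]
        yield chunk_id, payload
        # Standard RIFF pads odd-sized payloads to an even boundary.
        offset += 8 + size + (size & 1)

sample = (
    b"GRD1" + struct.pack("<I", 3) + b"abc" + b"\x00"  # odd length, one pad byte
    + b"TON1" + struct.pack("<I", 2) + b"xy"
)
parsed = list(iter_chunks(sample))
```

Because every chunk announces its own type and length, a scaling mechanism can skip, shorten, or drop chunks without understanding the audio data they carry.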

FIG. 11 illustrates the layout of a scalable bit rate frame chunk 1100 along with its sub-chunks, which include a grid 1 chunk 1105, a tone 1 chunk 1110, a tone 2 chunk 1115, a tone 3 chunk 1120, a tone 4 chunk 1125, and a tone 5 chunk 1130. In addition, the sub-chunks include a high-resolution grid chunk 1135, a time samples 1 chunk 1140, and a time samples 2 chunk 1145. These chunks make up the psychoacoustic data carried within the frame chunk 1100. Although FIG. 11 depicts only the chunk identification (ID) and chunk length for the frame chunk 1100, sub-chunk ID and sub-chunk length data are included within each sub-chunk.

FIG. 11 shows the order of the chunks in a frame of the scalable bitstream. These chunks contain the psychoacoustic audio elements produced by the scalable bitstream encoder 130 shown in FIG. 1. In addition to the chunks being arranged in order of psychoacoustic importance, the audio elements within the chunks are also arranged in order of psychoacoustic importance.

The last chunk in the frame is a null chunk 1150. It is used as padding in cases where the frame is required to be a constant or specific size. The null chunk 1150 therefore has no psychoacoustic relevance. As shown in FIG. 11, the least important psychoacoustic chunk is the time samples 2 chunk 1145. Conversely, the most important psychoacoustic chunk is the grid 1 chunk 1105. In operation, if the scalable bit rate frame chunk 1100 needs to be scaled down, data is removed starting with the least psychoacoustically relevant chunk at the end of the bitstream (the time samples 2 chunk 1145) and moving up the psychoacoustic relevance ranking. In FIG. 11 this corresponds to moving from right to left. This means that the most psychoacoustically relevant chunk (the grid 1 chunk 1105), which has the highest possible quality in the scalable bit rate frame chunk 1100, is the least likely to be removed.

It should be noted that the highest target bit rate (along with the highest audio quality) that the bitstream will be able to support is defined at the time of encoding. The lowest bit rate after scaling, however, can be defined by the level of audio quality that is acceptable for the application. The psychoacoustic elements that are removed do not each use the same number of bits. As an example, the scaling resolution for the exemplary embodiment shown in FIG. 11 ranges from 1 bit for the elements of lowest psychoacoustic importance to 32 bits for the elements of highest psychoacoustic importance.

It should also be noted that the mechanism for scaling the bitstream does not need to remove entire chunks at a time. As noted earlier, the audio elements within each chunk are arranged so that the most psychoacoustically important data is placed at the beginning of the scalable bit rate frame chunk 1100 (closest to the right side of FIG. 11). For this reason, audio elements can be removed from the end of a chunk by the scaling mechanism one element at a time, while maintaining the best possible audio quality as each element is removed from the scalable bit rate frame chunk 1100. This is what is meant by "fine-grained scalability."

The system 100 and method remove audio elements within a chunk as needed and then update the chunk length field of the particular chunk from which the audio elements were removed. In addition, the system 100 and method update the frame chunk length 1155 and the frame checksum 1160. With the updated chunk length field for each scaled chunk, together with the updated frame chunk length 1155 and updated frame checksum information, a decoder can properly process and decode the scaled bitstream. Moreover, the system 100 and method can automatically produce a fixed data rate audio output signal even when the bitstream contains chunks with missing audio elements and chunks that are missing from the bitstream entirely. A frame chunk identification (frame chunk ID 1165) is also contained in the scalable bit rate frame chunk 1100 for identification purposes. Finally, the frame chunk data 1170 contains (moving from right to left) the checksum 1160 through the null chunk 1150.
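The bookkeeping described above (drop trailing elements, then refresh the chunk length, frame length, and checksum) can be sketched as below. The frame layout, the fixed one-byte element size, and the use of CRC-32 from Python's `zlib` module as the checksum are all assumptions for illustration; the description does not name the checksum algorithm, and real elements range from 1 to 32 bits.

```python
import struct
import zlib

def drop_trailing_elements(payload, element_size, count):
    """Remove `count` elements from the end of a chunk payload."""
    keep = max(0, len(payload) - element_size * count)
    return payload[:keep]

def pack_frame(chunks):
    """Serialize chunks and prepend an updated frame length and checksum.

    Assumed layout: [frame length][CRC-32 of body][body], where the body is
    a run of (4-byte id, 4-byte length, payload) chunks.
    """
    body = b"".join(cid + struct.pack("<I", len(p)) + p for cid, p in chunks)
    return struct.pack("<II", len(body), zlib.crc32(body)) + body

# Hypothetical frame: the last chunk is the least psychoacoustically important.
chunks = [(b"GRD1", b"\x01\x02\x03\x04"), (b"TIM2", b"\x0a\x0b\x0c\x0d")]
last_id, last_payload = chunks[-1]
chunks[-1] = (last_id, drop_trailing_elements(last_payload, element_size=1, count=3))
scaled_frame = pack_frame(chunks)
```

Since each chunk length is recomputed from the shortened payload at packing time, a reader that trusts the length fields can parse the scaled frame without knowing that elements were removed.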

V.C. Bit Allocation

An example of allocating bits among the frames during a time period will now be discussed. It should be noted that this is only one of several ways in which the bit allocation can be performed. FIG. 12 illustrates an exemplary embodiment of dividing the full file 140 into a plurality of frames and time periods. As shown in FIG. 12, the full file 140 is divided into a plurality of frames for a plurality of audio objects. The x-axis is the time axis and the y-axis is the encoded audio object file number. In this example there are M encoded audio objects, where M is a positive integer. Moreover, in this example each encoded audio object file is present for the entire duration of the full file 140.

Looking from left to right across the time axis, it can be seen that each encoded audio object (numbered 1 through M) is divided into X frames, where X is a positive integer. Each box is denoted by the label F_M,X, where F is the frame, M is the audio object file number, and X is the frame number. For example, frame F_1,2 denotes the second frame of encoded audio object file (1).

As shown in FIG. 12, a time period 1200 corresponding to the length of a frame is defined for the full file 140. FIG. 13 illustrates the details of the frames of the full file 140 in the time period 1200. Each frame is shown with its ordered frequency components, which are ranked by their relative importance to the quality of the full file 140. It should be noted that the x-axis is frequency (in kHz) and the y-axis represents the magnitude of a particular frequency (in decibels). For example, in F_1,1 it can be seen that 7 kHz is the most important frequency component (in this example), followed by the 6 kHz and 8 kHz frequency components, respectively, and so forth. Thus, each frame of each audio object contains these ranked frequency components.

The target bit rate is used to determine the number of available bits for the time period 1200. In some embodiments, psychoacoustics (such as a masking curve) is used to distribute the available bits across the frequency components in a non-uniform manner. For example, the number of bits available for each of the 1, 19, and 20 kHz frequency components may be 64 bits, while 2048 bits may be available for each of the 7, 8, and 9 kHz frequency components. This is because, following the masking curve, the human ear is most sensitive to the 7, 8, and 9 kHz frequency components and relatively insensitive to very low and very high components, namely frequency components at 1 kHz and below and at 19 and 20 kHz. Although psychoacoustics is used here to determine the distribution of the available bits across the frequency range, it should be noted that many other techniques may be used to distribute the available bits.
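The arithmetic behind this paragraph can be sketched in two steps: convert the target bit rate into a bit budget for one time period, then split that budget across frequency components by weight. The 20 ms period, the 316,800 bit/s target, and the 1:32 weights are hypothetical values chosen only so that the result reproduces the 64-bit and 2048-bit figures quoted above; they are not taken from the description.

```python
def bits_for_period(target_bps, period_ms):
    """Whole bits that fit in one time period at the target bit rate."""
    return target_bps * period_ms // 1000

def distribute_bits(total_bits, weights):
    """Split total_bits across frequency components in proportion to weights."""
    weight_sum = sum(weights.values())
    return {khz: total_bits * w // weight_sum for khz, w in weights.items()}

# Hypothetical masking-style weights: mid frequencies count 32 times as much.
weights = {1: 1, 7: 32, 8: 32, 9: 32, 19: 1, 20: 1}
period_bits = bits_for_period(316_800, 20)
per_component = distribute_bits(period_bits, weights)
```

With these assumed inputs the period budget is 6336 bits, which divides into 64 bits for each of the 1, 19, and 20 kHz components and 2048 bits for each of the 7, 8, and 9 kHz components.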

Embodiments of the post-encoding bit rate reduction system 100 and method then measure, for each encoded audio object file, the frame activity of each frame in the corresponding time period 1200. The frame activities of the data frames of all encoded audio object files in time period 1200 are compared with one another. This is called the data frame activity comparison, which expresses the activity of each frame relative to the other frames during time period 1200.

In some embodiments, frames are assigned frame activity numbers. As an example, assume that the number of audio object files is 10, so that the frame activity numbers range from 1 to 10. In this example, 10 denotes the frame with the greatest frame activity during time period 1200, and 1 denotes the frame with the least frame activity. Many other techniques can be used to rank the frame activity of the frames during time period 1200. Based on the data frame activity comparison and the available bits from the bit pool, embodiments of the system 100 and method then allocate the available bits among the frames of the encoded audio object files for time period 1200.
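The ranking and allocation step can be sketched as follows. This is a minimal Python illustration under stated assumptions: frame activity is taken to be a single number per frame, and the bit pool is split in proportion to rank. The proportional rule and the names `rank_frame_activity` and `allocate_bits` are hypothetical, since the description leaves the exact allocation rule open.

```python
def rank_frame_activity(activities):
    """Assign frame activity numbers 1..N, where N marks the most active frame."""
    order = sorted(range(len(activities)), key=lambda i: activities[i])
    ranks = [0] * len(activities)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def allocate_bits(ranks, bit_pool):
    """Split the bit pool among frames in proportion to rank (assumed rule)."""
    total = sum(ranks)
    alloc = [bit_pool * r // total for r in ranks]
    # Hand any bits lost to integer rounding to the highest-ranked frames.
    leftover = bit_pool - sum(alloc)
    by_rank = sorted(range(len(ranks)), key=lambda i: -ranks[i])
    for idx in by_rank[:leftover]:
        alloc[idx] += 1
    return alloc


# Hypothetical activity measurements for three audio objects in one period.
ranks = rank_frame_activity([0.2, 0.9, 0.5])
allocation = allocate_bits(ranks, 601)
```

In this sketch the allocations always sum to the size of the bit pool, so no available bits go unused.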

The number of available bits and the data frame activity comparison are used by the system 100 and method to truncate bits in each frame as needed to match the allocated bits. The system 100 and method take advantage of the fine-grained scalability feature and of the fact that the bits are ranked in order of importance based on the hierarchical metadata. For example, referring to FIG. 13, assume that for F_1,1 there are only enough allocated bits to represent the first four frequency components. This means that the frequency components at 7, 6, 8, and 3 kHz will be included in the bit-reduced encoded bitstream. The 5 kHz frequency component of F_1,1 and those lower in the ranking are discarded.
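The truncation in the F_1,1 example can be sketched in Python as keeping the longest prefix of the ranked components that fits the allocation. The per-component bit sizes and the 1800-bit allocation below are hypothetical values chosen for illustration, as is the function name `truncate_frame`.

```python
def truncate_frame(ranked_components, allocated_bits):
    """Keep components in rank order until the next one would exceed the allocation."""
    kept = []
    used = 0
    for freq_khz, size_bits in ranked_components:
        if used + size_bits > allocated_bits:
            break  # this component and all lower-ranked ones are discarded
        kept.append((freq_khz, size_bits))
        used += size_bits
    return kept


# (frequency in kHz, size in bits), most important first; sizes are assumed.
frame_1_1 = [(7, 512), (6, 512), (8, 512), (3, 256), (5, 256), (9, 128)]
kept = truncate_frame(frame_1_1, 1800)
kept_freqs = [freq for freq, _ in kept]
```

Because the components are already sorted by importance, the truncation needs no decoding or re-encoding of the frame; it only drops the tail of the ranked data.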

In some embodiments, the data frame activity comparison is weighted by audio object importance. This information is contained in the hierarchical metadata 710. As an example, assume that encoded audio object file #2 is important to the audio signal, as may occur if the audio is a movie soundtrack and encoded audio object file #2 is the dialogue track. Even if encoded audio object file #9 has the highest relative frame activity ranking of 10 and encoded audio object file #2 has a ranking of 7, the ranking of encoded audio object file #2 can be raised to 10 because of the weighting for audio object importance. Many variations of the above techniques, as well as other techniques, can be used to allocate bits.
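One possible form of this weighting is sketched below in Python. The rule used here, which raises every object flagged as important to the highest rank present, is only an assumption that matches the dialogue-track example; the description does not fix a particular weighting function, and the name `weight_ranks` is hypothetical.

```python
def weight_ranks(ranks, important_objects):
    """Raise the rank of each important object to the highest rank present."""
    top = max(ranks.values())
    return {obj: (top if obj in important_objects else rank)
            for obj, rank in ranks.items()}


# Object number -> frame activity rank for one time period (assumed values).
ranks = {2: 7, 5: 3, 9: 10}
weighted = weight_ranks(ranks, important_objects={2})  # object 2 is dialogue
```

Here object #2 is promoted from 7 to 10 while the ranks of the other objects are unchanged. In practice the set of important objects would be derived from the priority information carried in the hierarchical metadata.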

VI. Alternative Embodiments and Exemplary Operating Environments

Many other variations than those described herein will be apparent from this document. For example, depending on the embodiment, certain acts, events, or functions of any of the methods and algorithms described herein can be performed in a different sequence, or can be added, merged, or left out altogether (for example, not all described acts or events are necessary for the practice of the methods and algorithms). Moreover, in certain embodiments, acts or events can be performed concurrently, such as through multi-threaded processing, interrupt processing, multiple processors or processor cores, or other parallel architectures, rather than sequentially. In addition, different tasks or processes can be performed by different machines and computing systems that can function together.

The various illustrative logical blocks, modules, methods, and algorithm processes and sequences described in connection with the embodiments disclosed herein can be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, and process actions have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the overall system. The described functionality can be implemented in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of this document.

The various illustrative logical blocks and modules described in connection with the embodiments disclosed herein can be implemented or performed by a machine, such as a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor can be a microprocessor, but in the alternative, the processor can be a controller, a microcontroller, a state machine, combinations of the same, or the like. A processor can also be implemented as a combination of computing devices, such as a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.

Embodiments of the post-encoding bit rate reduction system 100 and method described herein are operational within numerous types of general-purpose or special-purpose computing system environments or configurations. In general, a computing environment can include any type of computer system, including, but not limited to, a computer system based on one or more microprocessors, a mainframe computer, a digital signal processor, a portable computing device, a personal organizer, a device controller, a computational engine within an appliance, a mobile phone, a desktop computer, a mobile computer, a tablet computer, a smartphone, and an appliance with an embedded computer, to name a few.

Such computing devices can typically be found in devices having at least some minimum computational capability, including, but not limited to, personal computers, server computers, handheld computing devices, laptop or mobile computers, communications devices such as cell phones and PDAs, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, audio or video media players, and so forth. In some embodiments, the computing devices will include one or more processors. Each processor may be a specialized microprocessor, such as a digital signal processor (DSP), a very long instruction word (VLIW) processor, or other microcontroller, or may be a conventional central processing unit (CPU) having one or more processing cores, including specialized graphics processing unit (GPU) based cores in a multi-core CPU.

The process actions of a method, process, or algorithm described in connection with the embodiments disclosed herein can be embodied directly in hardware, in a software module executed by a processor, or in any combination of the two. The software module can be contained in computer-readable media that can be accessed by a computing device. The computer-readable media include both volatile and nonvolatile media that are either removable, non-removable, or some combination thereof. The computer-readable media are used to store information such as computer-readable or computer-executable instructions, data structures, program modules, or other data. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media.

Computer storage media include, but are not limited to, computer- or machine-readable media or storage devices such as Blu-ray discs (BD), digital versatile discs (DVDs), compact discs (CDs), floppy disks, tape drives, hard drives, optical drives, solid-state memory devices, RAM memory, ROM memory, EPROM memory, EEPROM memory, flash memory or other memory technology, magnetic cassettes, magnetic tapes, magnetic disk storage or other magnetic storage devices, or any other device that can be used to store the desired information and that can be accessed by one or more computing devices.

A software module can reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of non-transitory computer-readable storage medium, media, or physical computer storage known in the art. An exemplary storage medium can be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium can be integral to the processor. The processor and the storage medium can reside in an application-specific integrated circuit (ASIC). The ASIC can reside in a user terminal. Alternatively, the processor and the storage medium can reside as discrete components in a user terminal.

As used in this document, the phrase "non-transitory" means "enduring or long-lived." The phrase "non-transitory computer-readable media" includes any and all computer-readable media, with the sole exception of a transitory, propagating signal. This includes, by way of example and not limitation, non-transitory computer-readable media such as register memory, processor cache, and random-access memory (RAM).

Retention of information such as computer-readable or computer-executable instructions, data structures, program modules, and so forth can also be accomplished by using a variety of communication media to encode one or more modulated data signals, electromagnetic waves (such as carrier waves), or other transport mechanisms or communications protocols, and includes any wired or wireless information delivery mechanism. In general, these communication media refer to a signal that has one or more of its characteristics set or changed in such a manner as to encode information or instructions in the signal. For example, communication media include wired media, such as a wired network or direct-wired connection carrying one or more modulated data signals, and wireless media, such as acoustic, radio frequency (RF), infrared, laser, and other wireless media for transmitting, receiving, or both, one or more modulated data signals or electromagnetic waves. Combinations of any of the above should also be included within the scope of communication media.

Further, one or any combination of software, programs, or computer program products that embody some or all of the various embodiments of the post-encoding bit rate reduction system 100 and method described herein, or portions thereof, may be stored, received, transmitted, or read from any desired combination of computer- or machine-readable media or storage devices and communication media in the form of computer-executable instructions or other data structures.

Embodiments of the post-encoding bit rate reduction system 100 and method described herein may be further described in the general context of computer-executable instructions, such as program modules, being executed by a computing device. Generally, program modules include routines, programs, objects, components, data structures, and so forth, which perform particular tasks or implement particular abstract data types. The embodiments described herein may also be practiced in distributed computing environments where tasks are performed by one or more remote processing devices, or within a cloud of one or more devices, that are linked through one or more communications networks. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including media storage devices. Still further, the aforementioned instructions may be implemented, in part or in whole, as hardware logic circuits, which may or may not include a processor.

Conditional language used herein, such as, among others, "can," "might," "may," "e.g.," and the like, unless specifically stated otherwise or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements, and/or states. Thus, such conditional language is not generally intended to imply that features, elements, and/or states are in any way required for one or more embodiments, or that one or more embodiments necessarily include logic for deciding, with or without author input or prompting, whether these features, elements, and/or states are included or are to be performed in any particular embodiment. The terms "comprising," "including," "having," and the like are synonymous and are used inclusively, in an open-ended fashion, and do not exclude additional elements, features, acts, operations, and so forth. Also, the term "or" is used in its inclusive sense (and not in its exclusive sense) so that when used, for example, to connect a list of elements, the term "or" means one, some, or all of the elements in the list.

While the above detailed description has shown, described, and pointed out novel features as applied to various embodiments, it will be understood that various omissions, substitutions, and changes in the form and details of the devices or algorithms illustrated can be made without departing from the spirit of the disclosure. As will be recognized, certain embodiments of the inventions described herein can be embodied within a form that does not provide all of the features and benefits set forth herein, as some features can be used or practiced separately from others.

Moreover, although the subject matter has been described in language specific to structural features and methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims

1. A method performed by one or more processing devices for generating a scaled compressed bitstream from a single full file, comprising:

A full file with full bit rate is created by merging multiple audio object files that are encoded separately and independently using a scalable bitstream encoder that sorts the bits in each frame based on psychoacoustic importance, where the audio object is the source signal of a particular sound or combination of sounds.

Each encoded audio object file is divided into data frames;

The data frame activity of each encoded audio object file in each data frame of the selected time period is compared with the frame activities of all other audio object files to be multiplexed together, to obtain a data frame activity comparison of all encoded audio object files in the selected time period.

Based on the data frame activity comparison, bits from the available bit pool are allocated to each data frame of the encoded audio object files during the selected time period to obtain a bit allocation for the selected time period.

The full file is scaled down by truncating bits of the data frames according to the bit allocation to generate truncated frames;

Bit-reduced encoded audio object files are obtained from the truncated frames, and the bit-reduced encoded audio object files are multiplexed together; and

The multiplexed bit-reduced encoded audio object files are packed into a scaled compressed bitstream, such that the scaled compressed bitstream has a target bit rate lower than or equal to the full bit rate, in order to facilitate post-encoding bit rate reduction of the single full file.

2. The method of claim 1, further comprising:

The full file is created by merging multiple separately encoded audio object files and their corresponding hierarchical metadata, wherein the hierarchical metadata contains priority information for each encoded audio object file relative to other encoded audio object files; and

Based on the data frame activity comparison and the hierarchical metadata, bits from the available bit pool are allocated to each data frame to obtain the bit allocation for the selected time period.

3. The method of claim 1, wherein the entire duration of each encoded audio object file is used to create the full file.

4. The method of claim 1, further comprising allocating bits from the available bit pool to all data frames of all encoded audio object files for the selected time period.

5. The method of claim 2, further comprising:

Measure the data frame activity of each data frame within a selected time period; and

The data frame activity of each data frame is compared with a silence threshold to determine whether there is a minimum amount of activity in any data frame.

6. The method of claim 5, further comprising:

If the data frame activity of a specific data frame is less than or equal to the silence threshold, then that specific data frame is designated as a silent data frame having the minimum amount of activity, and the number of bits used to represent the silent data frame is kept unchanged without any reduction in bits; and

If the data frame activity of a specific data frame is greater than the silence threshold, the data frame activity is stored in a frame activity buffer.

7. The method of claim 6, further comprising determining the available bit pool for the selected time period by subtracting the bits used by silent data frames during the selected time period from the number of bits allocated to the selected time period.

8. The method of claim 2, further comprising truncating bits of the data frame in reverse rank order to reach the number of bits allocated to the data frame in the bit allocation, such that lower-ranked bits are truncated before higher-ranked bits.

9. The method of claim 8, further comprising:

Extract tones from the frequency domain representation of the audio object file to obtain a time domain residual signal representing the audio object file in which at least some tones have been removed;

The extracted tones and the time-domain residual signal are formatted into multiple data blocks, each containing multiple bytes of data; and

The data blocks and bits within the data blocks of the audio object file are sorted according to psychoacoustic importance to obtain a ranking order from the most important bit to the least important bit.

10. The method of claim 2, further comprising:

A scaled compressed bitstream is transmitted via a network channel at a bit rate less than or equal to the target bit rate; and

Receive and decode scaled compressed bitstreams to obtain decoded audio object files.

11. The method of claim 10, further comprising mixing the decoded audio object files to create an audio object mix, wherein two or more of the decoded audio object files are dependent on each other for spatial masking based on their position in the mix.

12. The method of claim 2, further comprising prioritizing the encoded audio object files in the hierarchical metadata based on spatial positioning in the audio object mix.

13. The method of claim 2, further comprising prioritizing the encoded audio object files based on the importance of each audio object file to the user in the audio object mix.

14. A method for obtaining multiple scaled compressed bitstreams from a single full file, comprising:

Multiple encoded audio object files are obtained by individually and independently encoding multiple audio object files at full bit rate using a scalable bitstream encoder with fine-grained scalability. The encoder ranks the bits in each data frame of the encoded audio object files in order of psychoacoustic importance to human hearing.

A full file with full bitrate is generated by merging multiple independently encoded audio object files and their corresponding hierarchical metadata.

Construct a first scaled compressed bitstream with a first target bit rate from the full file;

Constructing a second scaled compressed bitstream with a second target bit rate from the full file, such that multiple scaled bitstreams with different target bit rates are obtained from a single full file without any re-encoding of the multiple encoded audio object files;

The data frame activity of each encoded audio object file in each data frame of a selected time period is compared with the frame activities of all other audio object files to be multiplexed together, to obtain a data frame activity comparison.

Based on the data frame activity comparison and the first target bit rate, bits are allocated to each data frame of the encoded audio object files in the selected time period to obtain a bit allocation for the selected time period.

The full file is scaled down by truncating bits of the data frames according to the bit allocation, to achieve the first target bit rate and obtain bit-reduced encoded audio object files; and

The bit-reduced encoded audio object files are multiplexed and packed together into the first scaled compressed bitstream with the first target bit rate;

The first target bit rate and the second target bit rate are different from each other, and both are smaller than the full bit rate.

15. The method of claim 14, wherein the first target bit rate is the maximum bit rate at which the first scaled compressed bit stream will be transmitted.

16. The method of claim 15, wherein each of the plurality of encoded audio object files is persistent and exists throughout the entire duration of the full file.

17. The method of claim 14, further comprising:

The first scaled compressed bit stream is transmitted to the receiving device at the first target bit rate; and

Decode the first scaled compressed bitstream to obtain the decoded audio object.

18. The method of claim 17, further comprising mixing decoded audio objects to create an audio object mix.

19. A post-encoding bit rate reduction system, comprising:

The full file contains audio object files that have been encoded at full bit rate and combined with the corresponding hierarchical metadata to form the full file, each encoded separately and independently using a scalable bitstream encoder that sorts the bits in each frame based on psychoacoustic importance.

A bit reduction module is used to reduce the number of bits allocated to the data frames of the encoded audio object files, based on a comparison of the data frame activity of each data frame in a selected time period of each audio object file with the frame activities of all other audio object files to be multiplexed together, so as to obtain bit-reduced encoded audio objects.

Bitstream wrappers are used to arrange data frames of bit-reduced encoded audio objects in a container for transmission over computer networks; and

A multiplexer is used to combine containers containing bit-reduced encoded audio to generate a scaled compressed bitstream at a target bit rate, where the target bit rate is less than the full bit rate.

20. An audio signal receiving system, comprising:

A scaled compressed bitstream received over a network at a target bit rate, the bitstream comprising multiple bit-reduced encoded audio object files that were individually and independently encoded using a scalable bitstream encoder residing on a computing device, the scalable bitstream encoder ordering bits in each frame based on psychoacoustic importance, and the bitstream having had bits in data frames of a full file encoded at full bit rate truncated based on a data frame activity comparison and corresponding hierarchical metadata, wherein the target bit rate is less than or equal to the full bit rate, and wherein the data frame activity comparison comprises: comparing the data frame activity of each encoded audio object file in each data frame over a selected time period with the frame activities of all other audio object files to be multiplexed together, to obtain a data frame activity comparison of all encoded audio object files over the selected time period;

A demultiplexer is used to separate a scaled, compressed bitstream into multiple encoded audio object files; and

A scalable bitstream decoder that decodes encoded audio objects to obtain decoded audio objects.

21. The audio signal receiving system of claim 20, further comprising a mixing device for mixing decoded audio object files and generating an audio object mix.