HK1248913B

HK1248913B - Audio encoder and decoder with program loudness and boundary metadata

Info

Publication number: HK1248913B
Application number: HK18108488.2A
Authority: HK
Inventors: Michael Grant; Scott Gregory Norcross; Jeffrey Riedmiller; Michael Ward
Original assignee: Dolby Laboratories Licensing Corporation
Priority date: 2013-01-21
Filing date: 2018-07-03
Publication date: 2022-02-18

Description

Audio encoders and decoders utilizing program loudness and boundary metadata

This application is a divisional of the PCT application with international application number PCT/US2014/011672, filed on January 15, 2014 and entitled "Audio encoder and decoder using program loudness and boundary metadata", which entered the Chinese national phase on April 13, 2015 under national application number 201480002687.0.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. Provisional Patent Application No. 61/754,882, filed January 21, 2013, and U.S. Provisional Patent Application No. 61/824,010, filed May 16, 2013, the entire contents of each of which are hereby incorporated by reference.

Technical Field

The present invention relates to audio signal processing and, more particularly, to encoding and decoding of audio data bitstreams with metadata indicative of the loudness processing state of the audio content and of the location of audio program boundaries indicated by the bitstream. Some embodiments of the invention generate or decode audio data in one of the formats known as AC-3, Enhanced AC-3 (E-AC-3), or Dolby E.

Background Art

"Dolby", "Dolby Digital", "Dolby Digital Plus", and "Dolby E" are trademarks of Dolby Laboratories Licensing Corporation. Dolby Laboratories provides proprietary implementations of AC-3 and E-AC-3 known as "Dolby Digital" and "Dolby Digital Plus", respectively.

Audio data processing units typically operate in a blind fashion and pay no attention to the processing history that audio data has undergone before it is received. This may work in a processing framework in which a single entity performs all of the audio data processing and encoding for a variety of target media rendering devices, while a target media rendering device performs all of the decoding and rendering of the encoded audio data. However, this blind processing does not work well (or at all) where multiple audio processing units are scattered across a diverse network or are placed in tandem (i.e., in a chain) and are expected to optimally perform their respective types of audio processing. For example, some audio data may be encoded for high-performance media systems and may have to be converted, along a media processing chain, to a reduced form suitable for a mobile device. Accordingly, an audio processing unit may unnecessarily perform a type of processing on audio data that has already undergone that type of processing. For instance, a volume leveling unit may process an input audio clip regardless of whether the same or similar volume leveling has previously been performed on that clip. As a result, the volume leveling unit may perform leveling even when it is not necessary. This unnecessary processing may also cause removal and/or degradation of specific features when the content of the audio data is rendered.

A typical stream of audio data includes both audio content (e.g., one or more channels of audio content) and metadata indicative of at least one characteristic of the audio content. For example, in an AC-3 bitstream there are several audio metadata parameters that are specifically intended for use in changing the sound of the program delivered to a listening environment. One of the metadata parameters is the DIALNORM parameter, which is intended to indicate the mean level of dialog occurring in an audio program and is used to determine the audio playback signal level.

During playback of a bitstream comprising a sequence of different audio program segments (each having a different DIALNORM parameter), an AC-3 decoder uses the DIALNORM parameter of each segment to perform a type of loudness processing in which it modifies the playback level or loudness so that the perceived loudness of the dialog of the sequence of segments is at a consistent level. Each encoded audio segment (item) in a sequence of encoded audio items will (in general) have a different DIALNORM parameter, and the decoder can scale the level of each item so that the playback level or loudness of the dialog of each item is the same or very similar, although this may require applying different amounts of gain to different items during playback.
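By way of a non-limiting illustration, the per-item scaling described above can be sketched as follows. The mapping used here (a DIALNORM code of 1 to 31 denoting dialog at -1 to -31 dBFS, code 0 treated as 31, and a playback target of -31 dBFS) is an assumption taken from common AC-3 practice rather than from this description, and the function names and example values are hypothetical.

```python
def dialnorm_attenuation_db(dialnorm: int, target_db: int = -31) -> int:
    """Attenuation (in dB) that places the item's dialog at target_db.

    Assumption: code 1..31 means dialog at -1..-31 dBFS; code 0 is treated as 31.
    """
    if not 0 <= dialnorm <= 31:
        raise ValueError("DIALNORM is a five-bit value (0..31)")
    dialog_level_db = -(dialnorm if dialnorm != 0 else 31)
    return dialog_level_db - target_db  # 0 dB or more of attenuation


def linear_gain(attenuation_db: float) -> float:
    """Convert an attenuation in dB into a linear scale factor."""
    return 10 ** (-attenuation_db / 20)


# Three items with different DIALNORM values receive different gains,
# so that their dialog plays back at the same level.
items = {"movie": 27, "commercial": 15, "news": 31}
attenuations = {name: dialnorm_attenuation_db(code) for name, code in items.items()}
```

In this sketch the louder an item's dialog is declared to be, the more attenuation it receives, which is why an incorrect DIALNORM value translates directly into an incorrect playback level.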

DIALNORM is typically set by a user and is not generated automatically, although there is a default DIALNORM value if no value is set by the user. For example, a content creator may make loudness measurements with a device external to the AC-3 encoder and then transfer the result (indicative of the loudness of the spoken dialog of an audio program) to the encoder to set the DIALNORM value. Thus, there is reliance on the content creator to set the DIALNORM parameter correctly.

There are several reasons why the DIALNORM parameter in an AC-3 bitstream may be incorrect. First, each AC-3 encoder has a default DIALNORM value that is used during generation of the bitstream if a DIALNORM value is not set by the content creator. This default value may differ substantially from the actual dialog loudness level of the audio. Second, even if a content creator measures loudness and sets the DIALNORM value accordingly, a loudness measurement algorithm or meter may have been used that does not conform to the recommended AC-3 loudness measurement method, resulting in an incorrect DIALNORM value. Third, even if an AC-3 bitstream has been created with a DIALNORM value measured and set correctly by the content creator, the value may have been changed to an incorrect one during transmission and/or storage of the bitstream. For example, in television broadcast applications it is not uncommon for AC-3 bitstreams to be decoded, modified, and then re-encoded using incorrect DIALNORM metadata information. Thus, a DIALNORM value included in an AC-3 bitstream may be incorrect or inaccurate and may therefore have a negative impact on the quality of the listening experience.

Further, the DIALNORM parameter does not indicate the loudness processing state of the corresponding audio data (e.g., what type(s) of loudness processing have been performed on the audio data). Prior to the present invention, an audio bitstream had not included metadata, in a format of the type described in the present disclosure, indicative of the loudness processing state (e.g., the type(s) of loudness processing applied) of the audio content of the bitstream, or of the loudness processing state and loudness of that audio content. Loudness processing state metadata in such a format is useful for facilitating, in a particularly efficient manner, adaptive loudness processing of an audio bitstream and/or verification of the validity of the loudness processing state and loudness of the audio content.

Although the present invention is not limited to use with an AC-3 bitstream, an E-AC-3 bitstream, or a Dolby E bitstream, for convenience it will be described in embodiments in which it generates, decodes, or otherwise processes such a bitstream that includes loudness processing state metadata.

An AC-3 encoded bitstream comprises metadata and one to six channels of audio content. The audio content is audio data that has been compressed using perceptual audio coding. The metadata includes several audio metadata parameters that are intended for use in changing the sound of a program delivered to a listening environment.

Details of AC-3 (also known as Dolby Digital) coding are well known and are set forth in many published references, including:

ATSC Standard A52/A: Digital Audio Compression Standard (AC-3), Revision A, Advanced Television Systems Committee, 20 Aug. 2001; and

U.S. Patents 5,583,962, 5,632,005, 5,633,981, 5,727,119, and 6,021,386, all of which are hereby incorporated by reference in their entirety.

Details of Dolby Digital Plus (E-AC-3) coding are set forth in "Introduction to Dolby Digital Plus, an Enhancement to the Dolby Digital Coding System", AES Convention Paper 6191, 117th AES Convention, October 28, 2004.

Details of Dolby E coding are set forth in "Efficient Bit Allocation, Quantization, and Coding in an Audio Distribution System", AES Preprint 5068, 107th AES Conference, August 1999, and "Professional Audio Coder Optimized for Use with Video", AES Preprint 5033, 107th AES Conference, August 1999.

Each frame of an AC-3 encoded audio bitstream contains audio content and metadata for 1536 samples of digital audio. For a sampling rate of 48 kHz, this represents 32 milliseconds of digital audio, or a rate of 31.25 frames per second of audio.

Each frame of an E-AC-3 encoded audio bitstream contains audio content and metadata for 256, 512, 768, or 1536 samples of digital audio, depending on whether the frame contains one, two, three, or six blocks of audio data, respectively. For a sampling rate of 48 kHz, this represents 5.333, 10.667, 16, or 32 milliseconds of digital audio, respectively, or a rate of 187.5, 93.75, 62.5, or 31.25 frames per second of audio, respectively.
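As a worked check of these figures, the frame duration and frame rate follow directly from the number of samples per frame and the sampling rate. The short sketch below assumes 256 samples per audio block (as implied by the sample counts above); the function name is illustrative only.

```python
SAMPLES_PER_BLOCK = 256  # implied by 256/512/768/1536 samples for 1/2/3/6 blocks


def frame_timing(blocks_per_frame: int, sample_rate_hz: int = 48_000):
    """Return (samples per frame, frame duration in ms, frames per second)."""
    samples = SAMPLES_PER_BLOCK * blocks_per_frame
    duration_ms = 1000 * samples / sample_rate_hz
    frames_per_second = sample_rate_hz / samples
    return samples, duration_ms, frames_per_second


for blocks in (1, 2, 3, 6):
    samples, ms, fps = frame_timing(blocks)
    print(f"{blocks} block(s): {samples} samples, {ms:.3f} ms, {fps:.2f} frames/s")
```

A six-block frame corresponds to the AC-3 case of 1536 samples, i.e., 32 milliseconds at 48 kHz.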

As shown in FIG. 4, each AC-3 frame is divided into sections (segments), including: a Synchronization Information (SI) section, which contains (as shown in FIG. 5) a synchronization word (SW) and the first of two error correction words (CRC1); a Bitstream Information (BSI) section, which contains most of the metadata; between one and six audio blocks (AB0 to AB5), which contain data-compressed audio content (and can also include metadata); a waste bit segment (W), which contains any unused bits left over after the audio content is compressed; an Auxiliary (AUX) information section, which may contain more metadata; and the second of the two error correction words (CRC2). The waste bit segment (W) may also be referred to as a "skip field".

As shown in FIG. 7, each E-AC-3 frame is divided into sections (segments), including: a Synchronization Information (SI) section, which contains (as shown in FIG. 5) a synchronization word (SW); a Bitstream Information (BSI) section, which contains most of the metadata; between one and six audio blocks (AB0 to AB5), which contain data-compressed audio content (and can also include metadata); a waste bit segment (W), which contains any unused bits left over after the audio content is compressed (although only one waste bit segment is shown, a different waste bit segment would typically follow each audio block); an Auxiliary (AUX) information section, which may contain more metadata; and an error correction word (CRC). The waste bit segment (W) may also be referred to as a "skip field".

In an AC-3 (or E-AC-3) bitstream there are several audio metadata parameters that are specifically intended for use in changing the sound of the program delivered to a listening environment. One of the metadata parameters is the DIALNORM parameter, which is included in the BSI section.

As shown in FIG. 6, the BSI section of an AC-3 frame includes a five-bit parameter ("DIALNORM") indicating the DIALNORM value for the program. If the audio coding mode ("acmod") of the AC-3 frame is "0", indicating that a dual-mono or "1+1" channel configuration is in use, the BSI section also includes a five-bit parameter ("DIALNORM2") indicating the DIALNORM value for a second audio program carried in the same AC-3 frame.

The BSI segment also includes: a flag ("addbsie") indicating the presence (or absence) of additional bitstream information following the "addbsie" bit; a parameter ("addbsil") indicating the length of any additional bitstream information following the "addbsil" value; and up to 64 bits of additional bitstream information ("addbsi") following the "addbsil" value.

The BSI segment includes other metadata values not specifically shown in FIG. 6.

Summary of the Invention

In one class of embodiments, the invention is an audio processing unit that includes a buffer memory, an audio decoder, and a parser. The buffer memory stores at least one frame of an encoded audio bitstream. The encoded audio bitstream includes audio data and a metadata container. The metadata container includes a header, one or more metadata payloads, and protection data. The header includes a synchronization word identifying the start of the container. The one or more metadata payloads describe an audio program associated with the audio data. The protection data is located after the one or more metadata payloads and can be used to verify the integrity of the metadata container and of the one or more payloads within it. The audio decoder is coupled to the buffer memory and is capable of decoding the audio data. The parser is coupled to, or integrated with, the audio decoder and is capable of parsing the metadata container.

In typical embodiments, the method includes receiving an encoded audio bitstream that is segmented into one or more frames. The audio data is extracted from the encoded audio bitstream, along with a metadata container. The metadata container includes a header, followed by one or more metadata payloads, followed by protection data. Finally, the integrity of the container and of the one or more metadata payloads is verified by use of the protection data. The one or more metadata payloads may include a program loudness payload containing data indicative of the measured loudness of an audio program associated with the audio data.
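By way of a non-limiting illustration, the container layout and integrity check described above can be sketched as follows. Everything concrete in this sketch is an assumption made for illustration: the two-byte sync word value, the one-byte payload count, the (ID, size, data) payload framing, and the use of an HMAC-SHA256 tag as the protection data are hypothetical and are not specified in this passage.

```python
import hashlib
import hmac
import struct

SYNC_WORD = b"\x58\x38"  # hypothetical sync word marking the container start
TAG_LEN = 32             # hypothetical: protection data is an HMAC-SHA256 tag


def build_container(payloads, key: bytes) -> bytes:
    """Header (sync word, payload count), then payloads, then protection data."""
    body = SYNC_WORD + bytes([len(payloads)])
    for payload_id, data in payloads:
        body += struct.pack(">BH", payload_id, len(data)) + data
    return body + hmac.new(key, body, hashlib.sha256).digest()


def parse_container(blob: bytes, key: bytes):
    """Verify the protection data, then return the list of (id, data) payloads."""
    if len(blob) < len(SYNC_WORD) + 1 + TAG_LEN or blob[:2] != SYNC_WORD:
        raise ValueError("metadata container not found")
    body, tag = blob[:-TAG_LEN], blob[-TAG_LEN:]
    expected = hmac.new(key, body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("protection data does not match container contents")
    count, pos, payloads = body[2], 3, []
    for _ in range(count):
        payload_id, size = struct.unpack_from(">BH", body, pos)
        pos += 3
        payloads.append((payload_id, body[pos:pos + size]))
        pos += size
    return payloads
```

Because the protection data follows the payloads and covers both the header and the payloads, a parser in this sketch rejects the whole container if any payload byte has been altered in transit.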

A program loudness metadata payload embedded in an audio bitstream in accordance with typical embodiments of the invention, referred to as loudness processing state metadata ("LPSM"), may be authenticated and validated, e.g., to enable loudness regulatory entities to verify whether a particular program's loudness is already within a specified range and that the corresponding audio data itself has not been modified (thereby ensuring compliance with applicable regulations). A loudness value included in a data block comprising the loudness processing state metadata may be read out to verify this, instead of the loudness being computed again. In response to the LPSM, a regulatory agency may determine that the corresponding audio content complies (as indicated by the LPSM) with loudness statutory and/or regulatory requirements (e.g., the regulations promulgated under the Commercial Advertisement Loudness Mitigation Act, also known as the "CALM" Act) without needing to compute the loudness of the audio content.

The loudness measurements required for compliance with some loudness statutory and/or regulatory requirements (e.g., those promulgated under the CALM Act) are based on overall program loudness. Overall program loudness requires that a loudness measurement, of either the dialog level or the full-mix level, be made over an entire audio program. Thus, for program loudness measurements (e.g., at various stages in the broadcast chain) to verify compliance with typical legal requirements, it is essential that the measurements be made with knowledge of which audio data (and metadata) make up the entire audio program, and this typically requires knowledge of the locations of the start and the end of the program (e.g., during processing of a bitstream indicative of a sequence of audio programs).

In accordance with typical embodiments of the present invention, an encoded audio bitstream is indicative of at least one audio program (e.g., a sequence of audio programs), and the program boundary metadata and LPSM included in the bitstream enable the program loudness measurement to be reset at the end of a program, thus providing an automated way of measuring overall program loudness. Typical embodiments of the invention include program boundary metadata in an encoded audio bitstream in an efficient manner that allows accurate and robust determination of at least one boundary between consecutive audio programs indicated by the bitstream. Typical embodiments allow accurate and robust determination of a program boundary even when bitstreams indicative of different programs are spliced together (to generate the inventive bitstream) in a manner that truncates one or both of the spliced bitstreams (and thus discards program boundary metadata that had been included in at least one of the pre-splicing bitstreams).
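By way of a non-limiting illustration, resetting a loudness measurement at each program boundary can be sketched as follows. This is a deliberately simplified sketch: it reports a plain mean-square level in dB rather than a gated, frequency-weighted loudness measurement, and it represents the program boundary metadata as a single boolean per frame. Both simplifications, and all names, are assumptions made for illustration.

```python
import math


def per_program_levels(frames):
    """Return one level (dB) per program.

    frames: iterable of (mean_square_power, ends_program) pairs, where
    ends_program is True for the last frame of a program.
    """
    levels, energy, count = [], 0.0, 0
    for power, ends_program in frames:
        energy += power
        count += 1
        if ends_program:
            mean_power = max(energy / count, 1e-12)  # avoid log10(0) on silence
            levels.append(10 * math.log10(mean_power))
            energy, count = 0.0, 0  # reset the measurement for the next program
    return levels


# Two programs in sequence: a quiet one (two frames) and a louder one (one frame).
levels = per_program_levels([(0.01, False), (0.01, True), (0.1, True)])
```

Without the reset, the level reported for the second program would be contaminated by the frames of the first, which is the situation the boundary metadata is intended to avoid.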

In typical embodiments, the program boundary metadata in a frame of the inventive bitstream is a program boundary marker indicative of a frame count. Typically, the marker indicates the number of frames between the current frame (the frame that includes the marker) and a program boundary (the start or the end of the current audio program). In some preferred embodiments, program boundary markers are inserted in a symmetric, efficient manner at the beginning and at the end of each bitstream segment indicative of a single program (i.e., in frames occurring within some predetermined number of frames after the start of the segment, and in frames occurring within some predetermined number of frames before the end of the segment), so that when two such bitstream segments are concatenated (so as to indicate a sequence of two programs), the program boundary metadata can be present on both sides of the boundary between the two programs (e.g., symmetrically).
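By way of a non-limiting illustration, locating program boundaries from frame-count markers can be sketched as follows. The marker representation is hypothetical: each frame carries either no marker, a ("since_start", n) marker meaning that n frames of the current program precede it, or an ("until_end", n) marker meaning that n frames of the current program follow it. The sketch does not attempt to resolve markers that a truncating splice has made inconsistent.

```python
def locate_program_starts(markers):
    """Return the sorted indices of frames at which a program starts.

    markers: one entry per frame, each None, ("since_start", n), or ("until_end", n).
    """
    starts = set()
    for index, marker in enumerate(markers):
        if marker is None:
            continue
        kind, count = marker
        if kind == "since_start":
            starts.add(index - count)      # count frames back to the first frame
        elif kind == "until_end":
            starts.add(index + count + 1)  # frame after the last frame of the program
        else:
            raise ValueError(f"unknown marker kind: {kind!r}")
    return sorted(starts)


# Program A (4 frames) followed by program B (2 frames): markers on both
# sides of the boundary agree that B starts at frame 4.
intact = [("since_start", 0), None, ("until_end", 1), ("until_end", 0),
          ("since_start", 0), None]

# Program A truncated by a splice so that its end markers are lost: the
# markers at the start of program B still locate the boundary at frame 5.
spliced = [("since_start", 0), ("since_start", 1), None, None, None,
           ("since_start", 0), ("since_start", 1), None]
```

The second example shows why placing markers on both sides of a boundary is useful: a boundary remains detectable as long as the markers on at least one side survive the splice.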

To limit the data rate increase that results from including program boundary metadata in an encoded audio bitstream (which may be indicative of one audio program or of a sequence of audio programs), in typical embodiments program boundary markers are inserted in only a subset of the frames of the bitstream. Typically, the boundary marker insertion rate is a non-increasing function of increasing separation of each frame (in which a marker is inserted) from the program boundary nearest to that frame. Here, "boundary marker insertion rate" denotes the average ratio of the number of frames (indicative of a program) that include a program boundary marker to the number of frames (indicative of the program) that do not, where the average is a running average over a number (e.g., a relatively small number) of consecutive frames of the encoded audio bitstream.

In one class of embodiments, the boundary marker insertion rate is a logarithmically decreasing function of increasing distance (of each marker insertion location) from the nearest program boundary, and, for each marker-containing frame, the size of the marker in that frame is equal to or greater than the size of each marker in a frame located closer to the nearest program boundary (i.e., the size of the program boundary marker in each marker-containing frame is a non-decreasing function of increasing separation of that frame from the nearest program boundary).
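By way of a non-limiting illustration, one schedule with these properties can be sketched as follows. The power-of-two placement rule, the window parameter, and the bit-length rule for marker size are illustrative choices made for this sketch; they are assumptions and are not the schedule specified by this description.

```python
def has_marker(distance: int) -> bool:
    """Mark frames at distance 0, 1, 2, 4, 8, ... frames from the boundary."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return (distance & (distance - 1)) == 0


def marker_bits(distance: int) -> int:
    """Bits needed to encode the frame count; never shrinks as distance grows."""
    return max(1, distance.bit_length())


def marker_schedule(program_frames: int, window: int):
    """Indices of marker-carrying frames within `window` frames of either end."""
    marked = []
    for index in range(program_frames):
        distance = min(index, program_frames - 1 - index)
        if distance <= window and has_marker(distance):
            marked.append(index)
    return marked


schedule = marker_schedule(program_frames=40, window=16)
sizes = [marker_bits(d) for d in (0, 1, 2, 4, 8, 16)]
```

In this sketch the gap between successive markers doubles with distance from the boundary, so the number of markers within a given distance grows only logarithmically, the schedule is symmetric about the middle of the program, and the marker size increases only as the encoded frame count requires more bits.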

Another aspect of the invention is an audio processing unit (APU) configured to perform any embodiment of the inventive method. In another class of embodiments, the invention is an APU including a buffer memory (buffer) that stores (e.g., in a non-transitory manner) at least one frame of an encoded audio bitstream that has been generated by any embodiment of the inventive method. Examples of APUs include, but are not limited to, encoders (e.g., transcoders), decoders, codecs, pre-processing systems (pre-processors), post-processing systems (post-processors), audio bitstream processing systems, and combinations of such elements.

In another class of embodiments, the invention is an audio processing unit (APU) configured to generate an encoded audio bitstream comprising audio data segments and metadata segments, where the audio data segments are indicative of audio data, and each of at least some of the metadata segments includes loudness processing state metadata (LPSM) and optionally also program boundary metadata. Typically, at least one such metadata segment in a frame of the bitstream includes: at least one segment of LPSM indicating whether a first type of loudness processing has been performed on the audio data of the frame (i.e., the audio data in at least one audio data segment of the frame); and at least one other segment of LPSM indicating the loudness of at least some of the audio data of the frame (e.g., the dialog loudness of at least some of the audio data of the frame that are indicative of dialog). In one embodiment in this class, the APU is an encoder configured to encode input audio to generate encoded audio, and the audio data segments include the encoded audio. In typical embodiments in this class, each of the metadata segments has a preferred format to be described herein.

In some embodiments, each metadata segment of the encoded bitstream (an AC-3 bitstream or an E-AC-3 bitstream in some embodiments) that includes LPSM (e.g., LPSM and program boundary metadata) is included in a waste bit segment of a skip field segment of a frame of the bitstream (e.g., a waste bit segment W of the type shown in FIG. 4 or FIG. 7). In other embodiments, each such metadata segment is included as additional bitstream information in the "addbsi" field of the Bitstream Information ("BSI") segment of a frame of the bitstream, or in an auxiliary data field (e.g., an AUX segment of the type shown in FIG. 4 or FIG. 7) at the end of a frame of the bitstream. Each metadata segment including LPSM has the format specified herein with reference to Tables 1 and 2 below (i.e., it includes the core elements shown in Table 1 or a variation thereof, followed by a payload ID (identifying the metadata as LPSM) and a payload size value, followed by the payload (LPSM data having the format shown in Table 2, or a variation on Table 2, as described herein)).

In some embodiments, a frame may include one or two metadata segments, each of which includes LPSM; if the frame includes two metadata segments, one may be present in the addbsi field of the frame and the other in the AUX field of the frame.

In one class of embodiments, the invention is a method including the step of encoding audio data to generate an AC-3 or E-AC-3 encoded audio bitstream, where the encoding includes placing, in a metadata segment (of at least one frame of the bitstream), LPSM and program boundary metadata and optionally also other metadata for the audio program to which the frame belongs. In some embodiments, each such metadata segment is included in the addbsi field of the frame or in the auxdata field of the frame. In other embodiments, each such metadata segment is included in a waste bit segment of the frame. In some embodiments, each metadata segment that contains LPSM and program boundary metadata contains a core header (and optionally also additional core elements) and, after the core header (or the core header and other core elements), an LPSM payload (or container) segment having the following format:

a header, typically including at least one identification value (e.g., LPSM format version, length, period, count, and substream association values, as indicated in Table 2 set forth herein); and

after the header, the LPSM and the program boundary metadata. The program boundary metadata may include a program boundary frame count, a code value (e.g., an "offset_exist" value) indicating whether the frame includes only a program boundary frame count or both a program boundary frame count and an offset value, and (in some cases) an offset value. The LPSM may include:

至少一个会话指示值，其指示对应的音频数据指示会话还是不指示会话(例如对应的音频数据的哪些通道指示会话)。会话指示值可以指示对应的音频数据的通道的任意组合或者全部通道中是否存在会话；At least one session indication value indicating whether the corresponding audio data indicates a session or does not indicate a session (e.g., which channels of the corresponding audio data indicate a session). The session indication value may indicate whether a session exists in any combination of channels or in all channels of the corresponding audio data;

至少一个响度调节符合值，其指示对应的音频数据是否符合所指示的响度规则集合；At least one loudness regulation compliance value indicating whether the corresponding audio data complies with an indicated set of loudness regulations;

至少一个响度处理值，其指示已经对对应的音频数据执行的至少一种类型的响度处理；以及at least one loudness processing value indicating at least one type of loudness processing that has been performed on the corresponding audio data; and

至少一个响度值，其指示表征对应的音频数据的至少一个响度(例如峰值响度或者平均响度)。At least one loudness value indicating at least one loudness (e.g., peak loudness or average loudness) characteristic of the corresponding audio data.
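The payload layout listed above can be pictured as a small data structure. The following is a minimal sketch only: the class names, field names, and Python types are assumptions chosen for illustration and are not defined by the specification, which gives the actual fields in Table 2. The one rule it enforces is the one stated above, namely that the "offset_exist" coded value tells whether an offset value accompanies the program boundary frame count.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ProgramBoundaryMetadata:
    """Hypothetical container for program boundary metadata."""
    frame_count: int                      # program boundary frame count
    offset_exist: bool = False            # coded value: is an offset value present?
    offset_samples: Optional[int] = None  # offset value, only when offset_exist is set

    def __post_init__(self) -> None:
        # The offset value must be present exactly when offset_exist says so.
        if self.offset_exist != (self.offset_samples is not None):
            raise ValueError("offset_samples must be given exactly when offset_exist is set")


@dataclass
class LpsmPayload:
    """Hypothetical in-memory view of an LPSM payload (field names are illustrative)."""
    format_version: int                                            # header identification value
    dialog_channels: List[str] = field(default_factory=list)       # dialog indication value(s)
    regulation_compliant: bool = False                             # loudness regulation compliance value
    loudness_processing: List[str] = field(default_factory=list)   # loudness processing value(s)
    loudness_values: Dict[str, float] = field(default_factory=dict)  # e.g. peak or average loudness
    program_boundary: Optional[ProgramBoundaryMetadata] = None     # optional boundary metadata


# Usage: a payload whose boundary metadata carries both a frame count and an offset.
boundary = ProgramBoundaryMetadata(frame_count=12, offset_exist=True, offset_samples=512)
payload = LpsmPayload(format_version=1, dialog_channels=["C"], program_boundary=boundary)
```

Keeping the boundary metadata in its own optional object mirrors the text, where program boundary metadata is described as optional alongside the LPSM.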

在其他实施例中，编码比特流是一种并非AC-3比特流或者E-AC-3比特流的比特流，并且，包括LPSM(以及可选地还包括节目边界元数据)的元数据分段中的每个被包括在被保留用于存储附加数据的比特流的分段(或域、或时隙)中。包括LPSM的每个元数据分段可以具有与本文中在以下参考表1和表2指出的格式类似或相同的格式(即，其包括与表1所示的核心元素类似或相同的核心元素，之后跟随有效载荷ID(将元数据标识为LPSM)和有效载荷尺寸值，之后跟随有效载荷(具有与如本文中所述的如表2所示的格式或者表2的变型所示的格式类似或相同的格式的LPSM数据))。In other embodiments, the coded bitstream is a bitstream that is not an AC-3 bitstream or an E-AC-3 bitstream, and each of the metadata segments including the LPSM (and optionally also program boundary metadata) is included in a segment (or field, or time slot) of the bitstream reserved for storing additional data. Each metadata segment including the LPSM can have a format similar to or the same as the format indicated below with reference to Tables 1 and 2 herein (i.e., it includes core elements similar to or the same as those shown in Table 1, followed by a payload ID (identifying the metadata as an LPSM) and a payload size value, followed by a payload (LPSM data having a format similar to or the same as the format shown in Table 2 or a variation of Table 2 as described herein)).

在某些实施例中，编码比特流包括帧的序列，每个帧包括比特流信息(“BSI”)分段和auxdata域或时隙(如，编码比特流是AC-3比特流或者E-AC-3比特流)，其中，比特流信息(“BSI”)分段包括“addbsi”域(有时称为分段或时隙)，并且，每个帧包括音频数据分段(如，图4所示的帧的AB0-AB5分段)和元数据分段，其中，元数据分段表示音频数据，至少部分元数据分段的每个包括响度处理状态元数据(LPSM)以及可选地还包括节目边界元数据。LPSM以以下格式存在于比特流中。包括LPSM的元数据分段中的每个被包括在比特流的帧的BSI分段的“addbsi”域中，或者被包括在比特流的帧的auxdata域中，或者被包括在比特流的帧的浪费比特分段中。包括LPSM的每个元数据分段包括具有以下格式的LPSM有效载荷(或容器)分段：In some embodiments, a coded bitstream includes a sequence of frames, each frame including a bitstream information ("BSI") segment and an auxdata field or time slot (e.g., the coded bitstream is an AC-3 bitstream or an E-AC-3 bitstream), wherein the bitstream information ("BSI") segment includes an "addbsi" field (sometimes referred to as a segment or time slot), and each frame includes an audio data segment (e.g., segments AB0-AB5 of the frame shown in FIG. 4 ) and a metadata segment, wherein the metadata segment represents the audio data, and at least some of the metadata segments each include loudness processing state metadata (LPSM) and optionally program boundary metadata. The LPSM is present in the bitstream in the following format. Each metadata segment including the LPSM is included in the "addbsi" field of a BSI segment of a frame of the bitstream, or in the auxdata field of a frame of the bitstream, or in a wasted bits segment of a frame of the bitstream. Each metadata segment including the LPSM includes an LPSM payload (or container) segment having the following format:

首部(通常包括至少一个标识值，如以下表2中所示的LPSM格式版本、长度、周期、计数和子流关联值)；以及header (usually including at least one identification value, such as LPSM format version, length, period, count, and substream association value as shown in Table 2 below); and

在首部之后的LPSM以及可选的还有节目边界元数据。节目边界元数据可以包括节目边界帧计数、编码值和(在一些情况下)偏移值，编码值(例如“offset_exist”值)指示帧仅包含节目边界帧计数还是包括节目边界帧计数和偏移值二者。LPSM可以包括：Following the header is the LPSM and, optionally, program boundary metadata. The program boundary metadata may include a program boundary frame count, an encoded value, and (in some cases) an offset value. The encoded value (e.g., an "offset_exist" value) indicates whether the frame contains only a program boundary frame count or both a program boundary frame count and an offset value. The LPSM may include:

至少一个会话指示值(如，表2的参数“会话通道”)，其指示相应的音频数据是指示会话还是不指示会话(如，相应的音频数据的哪个通道指示会话)。会话指示值可以指示会话是否存在于相应的音频数据的通道的任意组合或所有通道中；At least one dialog indication value (e.g., the parameter "Dialog channel(s)" of Table 2), which indicates whether the corresponding audio data indicates dialog or does not indicate dialog (e.g., which channels of the corresponding audio data indicate dialog). The dialog indication value may indicate whether dialog is present in any combination of, or in all of, the channels of the corresponding audio data;

至少一个响度调节相符值(如，表2的参数“响度调节类型”)，其指示相应的音频数据是否与所指示的响度调节的集合相符；At least one loudness regulation compliance value (e.g., the parameter "Loudness Regulation Type" of Table 2), which indicates whether the corresponding audio data complies with an indicated set of loudness regulations;

至少一个响度处理值(如，表2的参数“会话选通的响度校正标志”、“响度校正类型”中的一个或更多个)，其指示已经对相应的音频数据执行的至少一种类型的响度处理；以及At least one loudness processing value (e.g., one or more of the parameters "Dialog Gated Loudness Correction Flag" and "Loudness Correction Type" of Table 2) indicating at least one type of loudness processing that has been performed on the corresponding audio data; and

至少一个响度值(如，表2的参数“ITU相对选通的响度”、“ITU语音选通的响度”、“ITU(EBU 3341)短期3s响度”和“真实峰值”中的一个或更多个)，其指示相应的音频数据的至少一个响度(如，峰值或者平均响度)特性。At least one loudness value (e.g., one or more of the parameters "ITU Relative Gated Loudness," "ITU Speech Gated Loudness," "ITU (EBU 3341) Short-Term 3s Loudness," and "True Peak" of Table 2) indicating at least one loudness (e.g., peak or average loudness) characteristic of corresponding audio data.

在本发明的专注、使用或者生成表示相应的音频数据的至少一个响度值的任意实施例中，响度值可以指示用于处理音频数据的响度和/或动态范围的至少一个响度测量特性。In any embodiment of the present invention that contemplates, uses, or generates at least one loudness value indicative of corresponding audio data, the loudness value may indicate at least one loudness measurement characteristic used to process the loudness and/or dynamic range of the audio data.

在一些实现中，比特流的帧的“addbsi”域或auxdata域或浪费比特分段中的每个元数据分段具有以下格式：In some implementations, each metadata segment in the "addbsi" field or auxdata field or wasted bits segment of a frame of a bitstream has the following format:

核心首部(通常包括标识元数据分段的开始的同步字，其后跟随标识值，如以下表1中所示的核心元素版本、长度和周期、扩展元素计数以及子流关联值)；以及Core header (typically includes a synchronization word identifying the start of the metadata segment, followed by identification values, such as the core element version, length and period, extension element count, and substream association value as shown in Table 1 below); and

在核心首部之后的至少一个保护值(如，HMAC摘要和音频指纹值，其中，HMAC摘要可以是基于整个帧的音频数据、核心元素和所有的扩展元素计算的256比特的HMAC摘要(使用SHA-2算法)，如表1所示，其用于响度处理状态元数据或相应的音频数据中的至少一个的解密、认证或验证中的至少一个)；以及At least one protection value following the core header (e.g., an HMAC digest and an audio fingerprint value, wherein the HMAC digest may be a 256-bit HMAC digest (using the SHA-2 algorithm) calculated based on the audio data of the entire frame, the core element, and all extended elements, as shown in Table 1, and used for at least one of decryption, authentication, or verification of at least one of the loudness processing state metadata or the corresponding audio data); and

如果元数据分段包括LPSM，则也在核心首部之后的LPSM有效载荷标识(ID)和LPSM有效载荷尺寸值，其将跟随的元数据标识为LPSM有效载荷并且指示LPSM有效载荷的尺寸。LPSM有效载荷分段(优选地具有上述格式)跟随LPSM有效载荷ID和LPSM有效载荷尺寸值。If the metadata segment includes an LPSM payload, an LPSM payload identification (ID) and an LPSM payload size value also follow the core header, identifying the following metadata as an LPSM payload and indicating the size of the LPSM payload. An LPSM payload segment (preferably having the format described above) follows the LPSM payload ID and LPSM payload size value.
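The protection value described above (a 256-bit HMAC digest computed with a SHA-2 algorithm over the frame's audio data, the core element, and all extension elements) can be illustrated with Python's standard `hmac` and `hashlib` modules. This is a minimal sketch under stated assumptions: the function names, the order in which the three inputs are fed to the HMAC, and the way the key is obtained are all hypothetical, since the text does not specify them. SHA-256 is used because it is the SHA-2 variant that yields a 256-bit digest.

```python
import hashlib
import hmac
from typing import List


def compute_protection_value(key: bytes, audio_data: bytes, core_element: bytes,
                             extension_elements: List[bytes]) -> bytes:
    """Return a 256-bit HMAC over the frame's audio data, core element, and extensions.

    The concatenation order used here is an assumption made for illustration.
    """
    mac = hmac.new(key, digestmod=hashlib.sha256)
    mac.update(audio_data)
    mac.update(core_element)
    for element in extension_elements:
        mac.update(element)
    return mac.digest()


def verify_protection_value(key: bytes, audio_data: bytes, core_element: bytes,
                            extension_elements: List[bytes], received: bytes) -> bool:
    """Recompute the digest and compare it in constant time with the received one."""
    expected = compute_protection_value(key, audio_data, core_element, extension_elements)
    return hmac.compare_digest(expected, received)


# Usage with placeholder bytes standing in for real frame contents.
key = b"shared-secret"  # hypothetical key; key distribution is outside this sketch
digest = compute_protection_value(key, b"audio", b"core", [b"ext0", b"ext1"])
```

A downstream unit that recomputes the digest and finds a mismatch would know that the metadata or the audio data changed after the digest was written, which is the authentication and validation role the text assigns to the protection value.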

在以上段落中描述的类型的一些实施例中，帧的auxdata域(或者“addbsi”域或者浪费比特分段)中的每个元数据分段具有三层结构：In some embodiments of the type described in the preceding paragraphs, each metadata segment in the auxdata field (or "addbsi" field or wasted bits segment) of a frame has a three-layer structure:

高层结构，包括：指示auxdata(或addbsi)域是否包括元数据的标志；指示存在的是什么类型的元数据的至少一个ID值；以及通常还包括指示存在多少比特的元数据(如，每种类型的元数据)的值(如果存在元数据)。可能存在的一种类型的元数据是LPSM，可能存在的另一种类型的元数据为节目边界元数据，可能存在的另一种类型的元数据是媒体研究元数据；A high-level structure that includes: a flag indicating whether the auxdata (or addbsi) field includes metadata; at least one ID value indicating what type(s) of metadata are present; and typically also a value indicating how many bits of metadata (e.g., of each type) are present, if metadata is present. One type of metadata that may be present is LPSM, another is program boundary metadata, and another is media research metadata;

中层结构，包括用于每个标识的类型的元数据的核心元素(如，对于每种标识的类型的元数据如上述类型的核心首部、保护值和有效载荷ID以及有效载荷尺寸值)；以及a middle-level structure including core elements of metadata for each identified type (e.g., metadata for each identified type such as a core header of the type described above, a protection value, and a payload ID and payload size value); and

低层结构，包括用于一个核心元素的每个有效载荷(如，如果核心元素将其标识为存在，则是LPSM有效载荷，和/或，如果核心元素将其标识为存在，则是另一种类型的元数据有效载荷)。A low-level structure, including each payload for a core element (e.g., an LPSM payload if the core element identifies it as present, and/or another type of metadata payload if the core element identifies it as present).

可以对在这样的三层结构中的数据值进行嵌套。例如，可以在由核心元素标识的有效载荷之后(从而在核心元素的核心首部之后)，包括用于LPSM有效载荷和/或由核心元素标识的另一元数据有效载荷的保护值。在一种示例中，核心首部可以标识LPSM有效载荷与另一元数据有效载荷，用于第一有效载荷(如，LPSM有效载荷)的有效载荷ID和有效载荷尺寸值可以跟随核心首部，第一有效载荷本身可以跟随该ID和尺寸值，用于第二有效载荷的有效载荷ID和有效载荷尺寸值可以跟随第一有效载荷，第二有效载荷本身可以跟随这些ID和尺寸值，并且，两种有效载荷之一或两者(或者核心元素值和两种有效载荷之一或两者)的保护值可以跟随最后的有效载荷。Data values in such a three-layer structure can be nested. For example, protection values for an LPSM payload and/or another metadata payload identified by a core element can be included after a payload identified by a core element (and thus after a core header of the core element). In one example, the core header can identify the LPSM payload and the other metadata payload, a payload ID and payload size value for a first payload (e.g., an LPSM payload) can follow the core header, the first payload itself can follow the ID and size values, a payload ID and payload size value for a second payload can follow the first payload, the second payload itself can follow these ID and size values, and a protection value for one or both of the two payloads (or the core element value and one or both of the two payloads) can follow the final payload.
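The ordering in the example above (core header, then an ID and size value ahead of each payload, then a trailing protection value) can be sketched as a simple byte layout. Everything concrete here is an assumption made for illustration: the one-byte header length, the one-byte payload count, the one-byte payload ID, the two-byte big-endian size, and the ID constants are hypothetical and are not the bit widths of Table 1 or Table 2.

```python
import struct
from typing import List, Tuple

LPSM_PAYLOAD_ID = 0x01   # hypothetical ID value for an LPSM payload
OTHER_PAYLOAD_ID = 0x02  # hypothetical ID value for another metadata payload


def pack_segment(core_header: bytes, payloads: List[Tuple[int, bytes]],
                 protection: bytes) -> bytes:
    """Serialize: core header, then (ID, size, payload) per payload, then protection value."""
    out = bytearray()
    out += struct.pack(">B", len(core_header)) + core_header
    out += struct.pack(">B", len(payloads))
    for payload_id, body in payloads:
        out += struct.pack(">BH", payload_id, len(body)) + body
    out += protection  # protection value follows the final payload
    return bytes(out)


def unpack_segment(data: bytes) -> Tuple[bytes, List[Tuple[int, bytes]], bytes]:
    """Parse a segment written by pack_segment back into its parts."""
    pos = 0
    (header_len,) = struct.unpack_from(">B", data, pos)
    pos += 1
    core_header = data[pos:pos + header_len]
    pos += header_len
    (count,) = struct.unpack_from(">B", data, pos)
    pos += 1
    payloads = []
    for _ in range(count):
        payload_id, size = struct.unpack_from(">BH", data, pos)
        pos += 3
        payloads.append((payload_id, data[pos:pos + size]))
        pos += size
    return core_header, payloads, data[pos:]


# Usage: two payloads followed by one protection value covering them.
segment = pack_segment(b"\x5a\x01", [(LPSM_PAYLOAD_ID, b"lpsm"), (OTHER_PAYLOAD_ID, b"xyz")], b"\xaa" * 4)
header, parsed, protection = unpack_segment(segment)
```

Because each payload is preceded by its own size value, a parser that does not recognize a payload ID can skip that payload and continue with the next one.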

在一些实施例中，帧的auxdata域(或“addbsi”域或浪费比特分段)中的元数据分段的核心元素包括核心首部(通常包括标识值，如核心元素版本)，并且在核心首部之后包括：指示指纹数据是否被包括用于元数据分段的元数据的值、指示是否存在外部数据(与对应于元数据分段的元数据的音频数据有关)的值、由核心元素标识的每种类型的元数据(如，LPSM和/或除了LPSM之外的类型的元数据)的有效载荷ID和有效载荷尺寸值、以及由核心元素标识的至少一种类型的元数据的保护值。元数据分段的元数据有效载荷跟随核心首部，并且(在一些情况下)被嵌套在核心元素的值内。In some embodiments, a core element of a metadata segment in an auxdata field (or "addbsi" field or wasted bit segment) of a frame includes a core header (typically including an identification value, such as a core element version), and after the core header includes: a value indicating whether fingerprint data is included for the metadata of the metadata segment, a value indicating whether external data (related to the audio data corresponding to the metadata of the metadata segment) is present, a payload ID and payload size value for each type of metadata identified by the core element (e.g., LPSM and/or metadata of a type other than LPSM), and a protection value for at least one type of metadata identified by the core element. The metadata payload of the metadata segment follows the core header and (in some cases) is nested within the value of the core element.

在另一优选格式中，编码比特流是杜比E比特流，并且，包括LPSM(以及可选地还包括节目边界元数据)的元数据分段中的每个被包括在杜比E保护带间隔的前N个样本位置中。In another preferred format, the encoded bitstream is a Dolby E bitstream, and each of the metadata segments including the LPSM (and optionally also program boundary metadata) is included in the first N sample positions of the Dolby E guard band interval.

在另一类型的实施例中，本发明是APU(如，解码器)，APU被耦接和配置来接收包括音频数据分段和元数据分段的编码音频比特流的，其中，音频数据分段表示音频数据，并且至少部分元数据分段中的每个包括响度处理状态元数据(LPSM)以及可选地还包括节目边界元数据，并且APU被耦接和配置来从比特流中提取LPSM，响应于音频数据来生成解码音频数据、以及使用LPSM对音频数据执行至少一个自适应响度处理操作。这种类型中的某些实施例还包括耦接至APU的后处理器，其中，后处理器被耦接和配置来使用LPSM对音频数据执行至少一个自适应响度处理操作。In another type of embodiment, the present invention is an APU (e.g., a decoder) coupled and configured to receive an encoded audio bitstream comprising audio data segments and metadata segments, wherein the audio data segments represent audio data and at least some of the metadata segments each comprise loudness processing state metadata (LPSM) and optionally program boundary metadata, and the APU is coupled and configured to extract the LPSM from the bitstream, generate decoded audio data in response to the audio data, and perform at least one adaptive loudness processing operation on the audio data using the LPSM. Some embodiments of this type further include a post-processor coupled to the APU, wherein the post-processor is coupled and configured to perform at least one adaptive loudness processing operation on the audio data using the LPSM.

在另一类型的实施例中，本发明是包括缓冲存储器(缓冲器)和耦接至缓冲器的处理子系统的音频处理单元(APU)。其中，APU被耦接成接收包括音频数据分段和元数据分段的编码音频比特流，其中，音频数据分段表示音频数据，并且至少部分元数据分段中的每个包括响度处理状态元数据(LPSM)以及可选地还包括节目边界元数据，缓冲器(如，以非暂时性方式)存储编码音频比特流的至少一个帧，并且处理子系统被配置成从比特流中提取LPSM以及使用LPSM对音频数据执行至少一个自适应响度处理操作。在这种类型中的典型的实施例中，APU是编码器、解码器和后处理器中的一种。In another type of embodiment, the present invention is an audio processing unit (APU) comprising a buffer memory (buffer) and a processing subsystem coupled to the buffer. The APU is coupled to receive an encoded audio bitstream comprising audio data segments and metadata segments, wherein the audio data segments represent audio data and at least some of the metadata segments each comprise loudness processing state metadata (LPSM) and optionally program boundary metadata. The buffer stores at least one frame of the encoded audio bitstream (e.g., in a non-transitory manner), and the processing subsystem is configured to extract the LPSM from the bitstream and perform at least one adaptive loudness processing operation on the audio data using the LPSM. In a typical embodiment of this type, the APU is one of an encoder, a decoder, and a post-processor.

在本发明的方法的一些实现中，所生成的音频比特流是AC-3编码比特流、E-AC-3比特流或者杜比E比特流中的一种，包括响度处理状态元数据以及其他元数据(如，DIALNORM元数据参数、动态范围控制元数据参数和其他元数据参数)。在方法的一些其他实现中，所生成的音频比特流是另一类型的编码比特流。In some implementations of the method of the present invention, the generated audio bitstream is one of an AC-3 encoded bitstream, an E-AC-3 bitstream, or a Dolby E bitstream, including loudness processing state metadata and other metadata (e.g., DIALNORM metadata parameters, dynamic range control metadata parameters, and other metadata parameters). In some other implementations of the method, the generated audio bitstream is another type of encoded bitstream.

本发明的各个方面包括被配置(或编程)成执行本发明的方法的任意实施例的系统或装置、以及(如，以非暂时性方式)存储用于实现本发明的方法或其步骤的任意实施例的代码的计算机可读介质(如，磁盘)。例如，本发明的系统可以是或者包括用软件或固件编程的可编程通用处理器、数字信号处理器或微处理器，和/或被配置成执行对数据的各种操作中的任意操作，包括发明的方法或其步骤的实施例。这样的通用处理器可以是或者包括如下计算机系统：其包括输入装置、存储器和处理电路，其被编程为(和/或被配置成)响应于向其传送的数据来执行本发明的方法(或步骤)的实施例。Various aspects of the present invention include systems or devices configured (or programmed) to perform any embodiment of the method of the present invention, and computer-readable media (e.g., disks) storing (e.g., in a non-transitory manner) code for implementing any embodiment of the method of the present invention or its steps. For example, the system of the present invention can be or include a programmable general-purpose processor, digital signal processor, or microprocessor programmed with software or firmware, and/or configured to perform any of a variety of operations on data, including embodiments of the method of the present invention or its steps. Such a general-purpose processor can be or include a computer system that includes an input device, memory, and processing circuitry that is programmed (and/or configured) to perform embodiments of the method (or steps) of the present invention in response to data transmitted thereto.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

图1是可以被配置成执行本发明的方法的实施例的系统的实施例的框图；FIG. 1 is a block diagram of an embodiment of a system that may be configured to perform an embodiment of the method of the present invention;

图2是作为本发明的音频处理单元的实施例的编码器的框图；FIG. 2 is a block diagram of an encoder that is an embodiment of the audio processing unit of the present invention;

图3是作为本发明的音频处理单元的实施例的解码器以及作为本发明的音频处理单元的另一实施例的与解码器耦接的后处理器的框图；FIG. 3 is a block diagram of a decoder that is an embodiment of the audio processing unit of the present invention, and of a post-processor coupled to the decoder that is another embodiment of the audio processing unit of the present invention;

图4是AC-3帧的图，包括其被划分成的分段；FIG. 4 is a diagram of an AC-3 frame, including the segments into which it is divided;

图5是AC-3帧的同步信息(SI)分段的图，包括其被划分成的分段；FIG. 5 is a diagram of the synchronization information (SI) segment of an AC-3 frame, including the segments into which it is divided;

图6是AC-3帧的比特流信息(BSI)分段的图，包括其被划分成的分段；FIG. 6 is a diagram of the bitstream information (BSI) segment of an AC-3 frame, including the segments into which it is divided;

图7是E-AC-3帧的图，包括其被划分成的分段；FIG. 7 is a diagram of an E-AC-3 frame, including the segments into which it is divided;

图8是包括具有根据本发明的实施例的格式的节目边界元数据的编码音频比特流的帧的图；FIG. 8 is a diagram of frames of an encoded audio bitstream that includes program boundary metadata having a format according to an embodiment of the present invention;

图9是图9的编码音频比特流的其他帧的图，这些帧中的一些帧包括具有根据本发明的实施例的格式的节目边界元数据；FIG. 9 is a diagram of other frames of the encoded audio bitstream of FIG. 8, some of which include program boundary metadata having a format according to an embodiment of the present invention;

图10是两个编码音频比特流的图：比特流(IEB)和另一个比特流(TB)，在比特流(IEB)中，节目边界(被标记为“边界”)与比特流的两个帧之间的过渡对准，而在另一个比特流(TB)中，节目边界(被标记为“真实边界”)偏离比特流的两个边界之间的过渡512个样本；以及FIG. 10 is a diagram of two encoded audio bitstreams: a bitstream (IEB) in which a program boundary (labeled “Boundary”) is aligned with a transition between two frames of the bitstream, and another bitstream (TB) in which a program boundary (labeled “True Boundary”) is offset by 512 samples from a transition between two frames of the bitstream; and

图11是示出了4个编码音频比特流的图形集合。图11的顶部处的比特流(被标记为“场景1”)指示包括节目边界元数据的第一音频节目(P1)，P1之后跟随还包括节目边界元数据的第二音频节目(P2)；第二比特流(被标记为“场景2”)指示包括节目边界元数据的第一音频节目(P1)，P1之后跟随不包括节目边界元数据的第二音频节目(P2)；第三比特流(被标记为“场景3”)指示包括节目边界元数据的被截短的第一音频节目(P1)，其已经与包括节目边界元数据的整个第二音频节目(P2)接合；第四比特流(被标记为“场景4”)指示包括节目边界元数据的被截短的第一音频节目(P1)和被截短的第二音频节目(P2)，其包括节目边界元数据并且已经与第一音频节目的一部分接合。FIG. 11 is a set of diagrams showing four encoded audio bitstreams. The bitstream at the top of FIG. 11 (labeled “Scenario 1”) indicates a first audio program (P1) including program boundary metadata, followed by a second audio program (P2) that also includes program boundary metadata; the second bitstream (labeled “Scenario 2”) indicates a first audio program (P1) including program boundary metadata, followed by a second audio program (P2) that does not include program boundary metadata; the third bitstream (labeled “Scenario 3”) indicates a truncated first audio program (P1) that includes program boundary metadata and has been spliced with an entire second audio program (P2) including program boundary metadata; and the fourth bitstream (labeled “Scenario 4”) indicates a truncated first audio program (P1) including program boundary metadata and a truncated second audio program (P2) that includes program boundary metadata and has been spliced with a portion of the first audio program.

符号和命名Notation and Nomenclature

贯穿本公开，包括在权利要求中，在广义上使用“对”信号或数据执行操作(如，对信号或数据进行滤波、缩放、变换或施加增益)的表述来表示直接对信号或数据、或者对信号或数据的已处理版本(如，对在对其执行该操作之前已经经历了初步的滤波或预处理的信号的版本)执行该操作。Throughout this disclosure, including in the claims, statements that refer to “performing an operation on” a signal or data (e.g., filtering, scaling, transforming, or applying a gain to the signal or data) are used broadly to mean performing the operation directly on the signal or data, or on a processed version of the signal or data (e.g., a version of the signal that has undergone preliminary filtering or preprocessing before the operation is performed on it).

贯穿本公开，包括在权利要求中，在广义上使用表述“系统”来表示装置、系统或子系统。例如，实现解码器的子系统可以被称为解码器系统，包括这样的子系统的系统(如，响应于多个输入生成X个输出信号的系统，其中，子系统生成M个输入，其他X-M个输入从外部源来接收)也可以被称为解码器系统。Throughout this disclosure, including in the claims, the expression "system" is used in a broad sense to refer to an apparatus, system, or subsystem. For example, a subsystem that implements a decoder may be referred to as a decoder system, and a system that includes such a subsystem (e.g., a system that generates X output signals in response to a plurality of inputs, where the subsystem generates M inputs and the other X-M inputs are received from external sources) may also be referred to as a decoder system.

贯穿本公开，包括在权利要求中，在广义上使用术语“处理器”来表示可编程或者否则可(利用软件或固件)配置以对数据(如，音频或视频或其他图像数据)执行操作的系统或装置。处理器的示例包括现场可编程门阵列(或者其他可配置集成电路或芯片组)、被编程和/或否则被配置成对音频或其他声音数据执行流水线处理的数字信号处理器、可编程通用处理器或计算机、以及可编程微处理器芯片或芯片组。Throughout this disclosure, including in the claims, the term "processor" is used in a broad sense to refer to a system or device that is programmable or otherwise configurable (using software or firmware) to perform operations on data (e.g., audio or video or other image data). Examples of processors include field programmable gate arrays (or other configurable integrated circuits or chipsets), digital signal processors that are programmed and/or otherwise configured to perform pipeline processing on audio or other sound data, programmable general-purpose processors or computers, and programmable microprocessor chips or chipsets.

贯穿本公开，包括在权利要求中，表述“音频处理器”和“音频处理单元”被可交换地使用，并且在广义上被用来表示被配置成处理音频数据的系统。音频处理单元的示例包括但不限于编码器(如，转码器)、解码器、编解码器、预处理系统、后处理系统和比特流处理系统(有时被称为比特流处理工具)。Throughout this disclosure, including in the claims, the expressions "audio processor" and "audio processing unit" are used interchangeably and are used in a broad sense to refer to a system configured to process audio data. Examples of audio processing units include, but are not limited to, encoders (e.g., transcoders), decoders, codecs, pre-processing systems, post-processing systems, and bitstream processing systems (sometimes referred to as bitstream processing tools).

贯穿本公开，包括在权利要求中，表述“处理状态元数据”(如，在表述“响度处理状态元数据”中)指与相应的音频数据(也包括处理状态元数据在内的音频数据流的音频内容)分离的且不同的数据。处理状态元数据与音频数据相关联，指示相应的音频数据的响度处理状态(如，已经对音频数据执行了什么类型的处理)，并且通常还指示音频数据的至少一个特征或特性。处理状态元数据与音频数据的关联是时间同步的。因此，当前(最近接收的或更新的)处理状态元数据指示相应的音频数据同时包括指示类型的音频数据处理的结果。在某些情况下，处理状态元数据可以包括处理历史和/或用在指示类型的处理中和/或根据指示类型的处理得到的参数中的一些或全部。附加地，处理状态元数据可以包括相应的音频数据的至少一个特征或特性，所述至少一个特征或特性已经根据音频数据计算出或从音频数据中提取到。处理状态元数据还可以包括不是与相应的音频数据的任意处理相关的或者不是从相应的音频数据的任意处理中得到的其他元数据。例如，可以通过具体的音频处理单元来添加第三方数据、乐曲(tracking)信息、标识符、专有权或标准信息、用户注解数据、用户偏好数据等，以传递给其他音频处理单元。Throughout this disclosure, including in the claims, the expression "processing state metadata" (e.g., as in the expression "loudness processing state metadata") refers to data that is separate and distinct from the corresponding audio data (the audio content of an audio data stream that also includes the processing state metadata). Processing state metadata is associated with audio data, indicates the loudness processing state of the corresponding audio data (e.g., what type(s) of processing have already been performed on the audio data), and typically also indicates at least one feature or characteristic of the audio data. The association of the processing state metadata with the audio data is time-synchronous. Thus, the current (most recently received or updated) processing state metadata indicates that the corresponding audio data contemporaneously comprises the results of the indicated type(s) of audio data processing. In some cases, processing state metadata may include processing history and/or some or all of the parameters that are used in and/or derived from the indicated type(s) of processing. Additionally, processing state metadata may include at least one feature or characteristic of the corresponding audio data that has been computed or extracted from the audio data. Processing state metadata may also include other metadata that is not related to, or derived from, any processing of the corresponding audio data. For example, third-party data, tracking information, identifiers, proprietary or standard information, user annotation data, user preference data, etc. may be added by a particular audio processing unit to be passed on to other audio processing units.

贯穿本公开，包括在权利要求中，表述“响度处理状态元数据”(或“LPSM”)表示如下处理状态元数据：其表示相应的音频数据的响度处理状态(如，已经对音频数据执行了什么类型的响度处理)，并且通常还表示相应的音频数据的至少一个特征或特性(如，响度)。响度处理状态元数据可以包括不是(即，当其被单独考虑时)响度处理状态元数据的数据(如，其他元数据)。Throughout this disclosure, including in the claims, the expression "loudness processing state metadata" (or "LPSM") refers to processing state metadata that indicates the loudness processing state of corresponding audio data (e.g., what type of loudness processing has been performed on the audio data) and typically also indicates at least one feature or characteristic of the corresponding audio data (e.g., loudness). Loudness processing state metadata may include data (e.g., other metadata) that is not (i.e., when considered alone) loudness processing state metadata.

贯穿本公开，包括在权利要求中，表述“通道”(或者“音频通道”)表示单声道音频信号。Throughout this disclosure, including in the claims, the expression "channel" (or "audio channel") denotes a monophonic audio signal.

贯穿本公开，包括在权利要求中，表述“音频节目”表示一组一个或更多个音频通道并且可选地还表示关联的元数据(例如描述期望的空间音频存在的元数据、和/或LPSM、和/或节目边界元数据)。Throughout this disclosure, including in the claims, the expression "audio program" denotes a set of one or more audio channels and optionally also associated metadata (e.g., metadata describing a desired spatial audio presentation, and/or LPSM, and/or program boundary metadata).

贯穿本公开，包括在权利要求中，表述“节目边界元数据”表示编码音频比特流的元数据，其中编码音频比特流指示至少一个音频节目(例如两个或更多个音频节目)，并且节目边界元数据指示至少一个所述音频节目的至少一个边界(开始和/或结束)的比特流中的位置。例如，(指示音频节目的编码音频比特流的)节目边界元数据可以包括指示节目的开始的位置(例如比特流的第N帧的开始，或者比特流的第N帧的第M个样本位置)的元数据、以及指示节目的结束的位置(例如比特流的第J帧的开始，或者比特流的第J帧的第K个样本位置)的附加元数据。Throughout this disclosure, including in the claims, the expression "program boundary metadata" refers to metadata of a coded audio bitstream, wherein the coded audio bitstream indicates at least one audio program (e.g., two or more audio programs), and the program boundary metadata indicates the position in the bitstream of at least one boundary (start and/or end) of at least one of the audio programs. For example, the program boundary metadata (of the coded audio bitstream indicating the audio program) may include metadata indicating the position of the start of the program (e.g., the start of the Nth frame of the bitstream, or the Mth sample position of the Nth frame of the bitstream), and additional metadata indicating the position of the end of the program (e.g., the start of the Jth frame of the bitstream, or the Kth sample position of the Jth frame of the bitstream).
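A boundary given as "the Mth sample position of the Nth frame" maps to an absolute sample position with one multiplication and one addition. The sketch below assumes zero-based frame and sample indices and a fixed frame length of 1536 samples, which is the length of an AC-3 frame (six audio blocks of 256 samples each); the function name and the indexing convention are assumptions for illustration, and other formats or frame lengths would need a different constant.

```python
SAMPLES_PER_FRAME = 1536  # assumed fixed frame length (AC-3: 6 blocks x 256 samples)


def boundary_sample_position(frame_index: int, sample_offset: int = 0,
                             samples_per_frame: int = SAMPLES_PER_FRAME) -> int:
    """Return the absolute sample position of a program boundary.

    frame_index is the zero-based frame number N; sample_offset is the
    zero-based sample position M within that frame (0 means the frame start).
    """
    if frame_index < 0:
        raise ValueError("frame_index must be non-negative")
    if not 0 <= sample_offset < samples_per_frame:
        raise ValueError("sample_offset must lie within one frame")
    return frame_index * samples_per_frame + sample_offset


# Usage: a boundary aligned with the start of frame 10, and one 512 samples into it.
aligned = boundary_sample_position(10)        # 15360
offset = boundary_sample_position(10, 512)    # 15872
```

The second call corresponds to the case where the true boundary does not coincide with a frame transition, so an offset value is needed in addition to the frame position.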

贯穿本公开，包括在权利要求中，使用术语“耦接(couples)”或“被耦接(coupled)”来表示直接连接或间接连接。因此，如果第一装置耦接至第二装置，则该连接可以是直接连接或者是通过其他装置和连接实现的间接连接。Throughout this disclosure, including in the claims, the terms "couple" or "coupled" are used to indicate either a direct connection or an indirect connection. Thus, if a first device couples to a second device, that connection may be a direct connection or an indirect connection via other devices and connections.

具体实施方式DETAILED DESCRIPTION

根据本发明的典型的实施例，被称为响度处理状态元数据(LPSM)的节目响度元数据的有效载荷以及可选地还有节目边界元数据被嵌入在音频比特流的元数据分段的一个或更多个保留的域(或时隙)中，该音频比特流在其他分段(音频数据分段)中也包括音频数据。通常，比特流的每个帧的至少一个分段包括LPSM，该帧的至少一个其他分段包括相应的音频数据(即，由LPSM指示其响度处理状态和响度的音频数据)。在一些实施例中，LPSM的数据量可以充分小以在不影响被分配用于承载音频数据的比特速率的情况下被承载。According to an exemplary embodiment of the present invention, a payload of program loudness metadata, referred to as loudness processing state metadata (LPSM), and optionally program boundary metadata, is embedded in one or more reserved fields (or time slots) of a metadata segment of an audio bitstream that also includes audio data in other segments (audio data segments). Typically, at least one segment of each frame of the bitstream includes the LPSM, and at least one other segment of the frame includes corresponding audio data (i.e., audio data whose loudness processing state and loudness are indicated by the LPSM). In some embodiments, the amount of LPSM data can be sufficiently small to be carried without affecting the bit rate allocated for carrying the audio data.

当两个或更多个音频处理单元需要遍及处理链(或内容生命周期)彼此串联工作时，在音频数据处理链中传送响度处理状态元数据特别有用。在音频比特流中不包括响度处理状态元数据的情况下，例如，当在链中使用两个或更多个音频编解码器并且在比特流的至媒体消耗装置(或者比特流的音频内容的渲染点)的行程期间不止一次施加单端音量调节时，可能出现严重的媒体处理问题，如质量、电平和空间的降级。Transmitting loudness processing state metadata in an audio data processing chain is particularly useful when two or more audio processing units need to work in tandem with each other throughout the processing chain (or content lifecycle). If loudness processing state metadata is not included in the audio bitstream, for example, when two or more audio codecs are used in the chain and single-ended volume adjustment is applied more than once during the bitstream's journey to a media consumption device (or the rendering point of the bitstream's audio content), serious media processing issues such as degradation in quality, level, and spatial quality can arise.

图1是示例性音频处理链(音频数据处理系统)的框图，其中，可以根据本发明的实施例配置系统的元件中的一个或更多个。该系统包括如所示出地那样耦接在一起的以下元件：预处理单元、编码器、信号分析和元数据校正单元、转码器、解码器和预处理单元。在所示出的系统的变型中，省略了其中一个或更多个元件，或者包括附加的音频数据处理单元。FIG. 1 is a block diagram of an exemplary audio processing chain (an audio data processing system), in which one or more of the elements of the system may be configured according to an embodiment of the present invention. The system includes the following elements, coupled together as shown: a pre-processing unit, an encoder, a signal analysis and metadata correction unit, a transcoder, a decoder, and a post-processing unit. In variations of the illustrated system, one or more of these elements are omitted, or additional audio data processing units are included.

在一些实现中，图1的预处理单元被配置成：接受包括音频内容在内的PCM(时域)样本作为输入；以及输出经处理的PCM样本。编码器可以被配置成：接受PCM样本作为输入；以及输出表示音频内容的编码比特流(如，压缩)音频比特流。表示音频内容的比特流的数据有时在本文中被称为“音频数据”。如果编码器根据本发明的典型的实施例来配置，则从编码器输出的音频比特流包括响度处理状态元数据(通常还有其他元数据，可选地包括节目边界元数据)以及音频数据。In some implementations, the pre-processing unit of FIG. 1 is configured to accept PCM (time-domain) samples comprising audio content as input and to output processed PCM samples. The encoder may be configured to accept the PCM samples as input and to output an encoded (e.g., compressed) audio bitstream indicative of the audio content. The data of the bitstream that are indicative of the audio content are sometimes referred to herein as "audio data." If the encoder is configured according to a typical embodiment of the present invention, the audio bitstream output from the encoder includes loudness processing state metadata (and typically also other metadata, optionally including program boundary metadata) as well as the audio data.

图1的信号分析和元数据校正单元可以接受一个或更多个编码音频比特流作为输入，并且通过执行信号分析(例如使用编码音频比特流中的节目边界元数据)来判定(如，验证)在每个编码音频比特流中的处理状态元数据是否正确。如果信号分析和元数据校正单元发现所包括的元数据无效，则其通常用根据信号分析获得的正确的值来替代错误的值。因此，从信号分析和元数据校正单元输出的每个编码音频比特流可以包括已校正(或未校正)处理状态元数据以及编码比特流音频数据。The signal analysis and metadata correction unit of FIG. 1 can accept one or more encoded audio bitstreams as input and determine (e.g., verify) whether the processing state metadata in each encoded audio bitstream is correct by performing signal analysis (e.g., using program boundary metadata in the encoded audio bitstream). If the signal analysis and metadata correction unit finds that the included metadata is invalid, it typically replaces the erroneous value with the correct value obtained from the signal analysis. Thus, each encoded audio bitstream output from the signal analysis and metadata correction unit can include corrected (or uncorrected) processing state metadata as well as the encoded audio data.
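The verify-then-replace behavior of the signal analysis and metadata correction unit can be sketched as a comparison between the loudness value stated in the metadata and a value measured from the audio. This is an illustrative sketch only: the function name, the 0.5 dB tolerance, and the use of a single loudness figure in LKFS are assumptions, and the measurement itself (for example, a program-level loudness measurement between two program boundaries) is assumed to happen elsewhere.

```python
from typing import Tuple


def correct_loudness_value(stated_lkfs: float, measured_lkfs: float,
                           tolerance_db: float = 0.5) -> Tuple[float, bool]:
    """Return (value_to_carry, was_corrected).

    The stated value is kept when it agrees with the measurement to within
    the (hypothetical) tolerance; otherwise the measured value replaces it.
    """
    if abs(stated_lkfs - measured_lkfs) <= tolerance_db:
        return stated_lkfs, False
    return measured_lkfs, True


# Usage: one value that passes verification and one that is replaced.
kept = correct_loudness_value(-24.0, -23.8)      # (-24.0, False)
replaced = correct_loudness_value(-24.0, -20.0)  # (-20.0, True)
```

Returning a flag along with the value lets the caller record whether the outgoing bitstream carries corrected or uncorrected metadata, matching the two cases named in the text.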

图1的转码器可以接受编码音频比特流作为输入，并且相应地输出已修改(如，不同地编码的)音频比特流(如，通过对输入流进行解码并且以不同的编码格式对解码流进行重新编码)。如果转码器根据本发明的典型的实施例来配置，则从转码器输出的音频比特流包括响度处理状态元数据(通常还有其他元数据)以及编码比特流音频数据。元数据已经被包括在比特流中。The transcoder of FIG. 1 can accept an encoded audio bitstream as input and, in response, output a modified (e.g., differently encoded) audio bitstream (e.g., by decoding the input stream and re-encoding the decoded stream in a different encoding format). If the transcoder is configured according to a typical embodiment of the present invention, the audio bitstream output from the transcoder includes loudness processing state metadata (and typically also other metadata) as well as the encoded audio data. The metadata has already been included in the bitstream.

The decoder of FIG. 1 may accept encoded (e.g., compressed) audio bitstreams as input, and output (in response) streams of decoded PCM audio samples. If the decoder is configured in accordance with a typical embodiment of the present invention, the output of the decoder in typical operation is or includes any of the following:

a stream of audio samples, and a corresponding stream of loudness processing state metadata (and typically also other metadata) extracted from the input encoded bitstream; or

a stream of audio samples, and a corresponding stream of control bits determined from loudness processing state metadata (and typically also other metadata) extracted from the input encoded bitstream; or

a stream of audio samples, without a corresponding stream of processing state metadata or of control bits determined from processing state metadata. In this last case, the decoder may extract loudness processing state metadata (and/or other metadata) from the input encoded bitstream and perform at least one operation on the extracted metadata (e.g., validation), even though it does not output the extracted metadata or control bits determined therefrom.

When configured in accordance with a typical embodiment of the present invention, the post-processing unit of FIG. 1 accepts a stream of decoded PCM audio samples and performs post-processing on it (i.e., volume adjustment of the audio content), using either the loudness processing state metadata (and typically also other metadata) received with the samples, or control bits received with the samples (determined by the decoder from the loudness processing state metadata and typically also from other metadata). The post-processing unit is typically also configured to render the post-processed audio content for playback by one or more speakers.

Typical embodiments of the present invention provide an enhanced audio processing chain in which audio processing units (e.g., encoders, decoders, transcoders, and pre- and post-processing units) adapt the respective processing they apply to audio data according to the contemporaneous state of the metadata, as indicated by the loudness processing state metadata respectively received by the audio processing units.

The audio data input to any audio processing unit of the system of FIG. 1 (e.g., the encoder or transcoder of FIG. 1) may include loudness processing state metadata (and optionally also other metadata) as well as audio data (e.g., encoded audio data). In accordance with an embodiment of the present invention, this metadata may have been included in the input audio by another element of the system of FIG. 1 (or by another source not shown in FIG. 1). The processing unit that receives the input audio (with metadata) may be configured to perform at least one operation on the metadata (e.g., validation) or in response to the metadata (e.g., adaptive processing of the input audio), and is typically also configured to include in its output audio the metadata, a processed version of the metadata, or control bits determined from the metadata.

A typical embodiment of the audio processing unit (or audio processor) of the present invention is configured to perform adaptive processing of audio data based on the state of the audio data as indicated by the loudness processing state metadata corresponding to the audio data. In some embodiments, the adaptive processing is (or includes) loudness processing if the metadata indicates that loudness processing, or processing similar to it, has not already been performed on the audio data, and is not (or does not include) loudness processing if the metadata indicates that such loudness processing, or processing similar to it, has already been performed on the audio data. In some embodiments, the adaptive processing is or includes metadata validation (e.g., performed in a metadata validation sub-unit) to ensure that the audio processing unit performs other adaptive processing of the audio data based on the state of the audio data as indicated by the loudness processing state metadata. In some embodiments, the validation determines the reliability of the loudness processing state metadata associated with (e.g., included in a bitstream with) the audio data.

For example, if the metadata is validated as reliable, the results of a type of previously performed audio processing may be reused, and a new execution of the same type of audio processing may be avoided. On the other hand, if the metadata is found to have been tampered with (or to be otherwise unreliable), the type of media processing purportedly performed previously (as indicated by the unreliable metadata) may be repeated by the audio processing unit, and/or other processing may be performed by the audio processing unit on the metadata and/or the audio data. The audio processing unit may also be configured to signal to other audio processing units downstream in the enhanced media processing chain that the loudness processing state metadata (e.g., as present in a media bitstream) is valid, if the unit determines that the processing state metadata is valid (e.g., based on a match between an extracted cryptographic value and a reference cryptographic value).

FIG. 2 is a block diagram of an encoder (100) which is an embodiment of the audio processing unit of the present invention. Any of the components or elements of encoder 100 may be implemented as one or more processes and/or one or more circuits (e.g., ASICs, FPGAs, or other integrated circuits), in hardware, software, or a combination of hardware and software. Encoder 100 comprises frame buffer 110, parser 111, decoder 101, audio state validator 102, loudness processing stage 103, audio stream selection stage 104, encoder 105, stuffer/formatter stage 107, metadata generation stage 106, dialog loudness measurement subsystem 108, and frame buffer 109, connected as shown. Typically, encoder 100 also includes other processing elements (not shown).

Encoder 100 (which is a transcoder) is configured to convert an input audio bitstream (which may be, for example, one of an AC-3 bitstream, an E-AC-3 bitstream, or a Dolby E bitstream) into an encoded output audio bitstream that includes loudness processing state metadata (and which may be, for example, another one of an AC-3 bitstream, an E-AC-3 bitstream, or a Dolby E bitstream), by performing adaptive and automated loudness processing using the loudness processing state metadata included in the input bitstream. For example, encoder 100 may be configured to convert an input Dolby E bitstream (a format typically used in production and broadcast facilities, but not in consumer devices that receive audio programs which have been broadcast to them) into an encoded output audio bitstream in AC-3 or E-AC-3 format (suitable for broadcasting to consumer devices).

The system of FIG. 2 also includes encoded audio delivery subsystem 150 (which stores and/or delivers the encoded bitstreams output from encoder 100) and decoder 152. An encoded audio bitstream output from encoder 100 may be stored by subsystem 150 (e.g., in the form of a DVD or Blu-ray disc), transmitted by subsystem 150 (which may implement a transmission link or network), or both stored and transmitted by subsystem 150. Decoder 152 is configured to decode an encoded audio bitstream (generated by encoder 100) that includes loudness processing state metadata and that it receives via subsystem 150, by extracting the loudness processing state metadata (LPSM) from each frame of the bitstream (and optionally also extracting program boundary metadata from the bitstream) and generating decoded audio data. Typically, decoder 152 is configured to perform adaptive loudness processing on the decoded audio data using the LPSM (and optionally also the program boundary metadata), and/or to forward the decoded audio data and LPSM to a post-processor configured to perform adaptive loudness processing on the decoded audio data using the LPSM (and optionally also the program boundary metadata). Typically, decoder 152 includes a buffer that stores (e.g., in a non-transitory manner) the encoded audio bitstream received from subsystem 150.

Various implementations of encoder 100 and decoder 152 may be configured to perform different embodiments of the method of the present invention.

Frame buffer 110 is a buffer memory coupled to receive an encoded input audio bitstream. In operation, buffer 110 stores (e.g., in a non-transitory manner) at least one frame of the encoded audio bitstream, and a sequence of the frames of the encoded audio bitstream is asserted from buffer 110 to parser 111.

Parser 111 is coupled and configured to extract loudness processing state metadata (LPSM), and optionally also program boundary metadata (and/or other metadata), from each frame of the encoded input audio in which such metadata is included; to assert at least the LPSM (and optionally also the program boundary metadata and/or other metadata) to audio state validator 102, loudness processing stage 103, stage 106, and subsystem 108; to extract audio data from the encoded input audio; and to assert the audio data to decoder 101. Decoder 101 of encoder 100 is configured to decode the audio data to generate decoded audio data, and to assert the decoded audio data to loudness processing stage 103, audio stream selection stage 104, subsystem 108, and typically also to state validator 102.

State validator 102 is configured to authenticate and validate the LPSM (and typically also other metadata) asserted to it. In some embodiments, the LPSM is (or is included in) a data block that has been included in the input bitstream (e.g., in accordance with an embodiment of the present invention). The block may comprise a cryptographic hash (a hash-based message authentication code, or "HMAC") for processing the LPSM (and optionally also other metadata) and/or the underlying audio data (provided from decoder 101 to validator 102). In these embodiments, the data block may be digitally signed, so that a downstream audio processing unit can authenticate and validate the processing state metadata relatively easily.

For example, the HMAC is used to generate a digest, and the protection value(s) included in the bitstream of the present invention may include the digest. The digest may be generated for an AC-3 frame as follows:

1. After the AC-3 data and LPSM are encoded, the frame data bytes (concatenated frame_data#1 and frame_data#2) and the LPSM data bytes are used as input to the hashing function HMAC. Other data that may be present in the auxdata field are not taken into consideration when computing the digest. Such other data may be bytes that belong neither to the AC-3 data nor to the LPSM data. The protection bits included in the LPSM may be excluded when computing the HMAC digest.

2. After the digest is computed, it is written into the bitstream in the field reserved for protection bits.

3. The last step in generating the complete AC-3 frame is the computation of the CRC check. The CRC is written at the very end of the frame and takes into account all data belonging to the frame, including the LPSM bits.
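The three steps above can be sketched as follows. This is a minimal illustration only, under assumptions the text does not state: SHA-256 is assumed as the underlying hash of the HMAC, the flat byte layout produced by `build_frame` is hypothetical and is not the real AC-3 frame syntax, and the CRC parameters in `crc16` (16-bit, polynomial 0x8005, zero initial value) are chosen for illustration. All function names are invented for this sketch.

```python
import hashlib
import hmac


def crc16(data: bytes, poly: int = 0x8005) -> int:
    """Bitwise CRC-16, MSB first, zero initial value (illustrative parameters)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ poly) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def lpsm_digest(key: bytes, frame_data_1: bytes, frame_data_2: bytes,
                lpsm_bytes: bytes) -> bytes:
    # Step 1: hash only the concatenated frame data and the LPSM bytes.
    # Other auxdata bytes and the protection bits themselves are left out.
    message = frame_data_1 + frame_data_2 + lpsm_bytes
    return hmac.new(key, message, hashlib.sha256).digest()


def build_frame(key: bytes, frame_data_1: bytes, frame_data_2: bytes,
                lpsm_bytes: bytes, other_aux: bytes = b"") -> bytes:
    # Step 2: place the digest in the (hypothetical) protection field.
    digest = lpsm_digest(key, frame_data_1, frame_data_2, lpsm_bytes)
    body = frame_data_1 + frame_data_2 + lpsm_bytes + digest + other_aux
    # Step 3: the CRC goes last and covers everything in the frame,
    # including the LPSM bytes and the digest.
    return body + crc16(body).to_bytes(2, "big")


def verify_digest(key: bytes, frame_data_1: bytes, frame_data_2: bytes,
                  lpsm_bytes: bytes, received_digest: bytes) -> bool:
    # Constant-time comparison of the recomputed and the received digest.
    expected = lpsm_digest(key, frame_data_1, frame_data_2, lpsm_bytes)
    return hmac.compare_digest(expected, received_digest)
```

A downstream unit holding the same key would recompute the digest from the received frame data and LPSM bytes and compare it with the protection field. Because `other_aux` is excluded from the digest, changing those bytes affects only the CRC.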

Other cryptographic methods, including but not limited to any of one or more non-HMAC cryptographic methods, may be used for validation of the LPSM (e.g., in validator 102) to ensure secure transmission and receipt of the LPSM and/or the underlying audio data. For example, validation using such a cryptographic method may be performed in each audio processing unit that receives an embodiment of the audio bitstream of the present invention, to determine whether the loudness processing state metadata and corresponding audio data included in the bitstream have undergone (and/or have resulted from) the specific loudness processing indicated by the metadata, and have not been modified after such specific loudness processing was performed.

State validator 102 asserts control data to audio stream selection stage 104, metadata generator 106, and dialog loudness measurement subsystem 108 to indicate the result of the validation operation. In response to the control data, stage 104 may select (and pass through to encoder 105) either:

the adaptively processed output of loudness processing stage 103 (e.g., when the LPSM indicates that the audio data output from decoder 101 have not undergone a specific type of loudness processing, and the control bits from validator 102 indicate that the LPSM is valid); or

the audio data output from decoder 101 (e.g., when the LPSM indicates that the audio data output from decoder 101 have already undergone the specific type of loudness processing that would be performed by stage 103, and the control bits from validator 102 indicate that the LPSM is valid).
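The selection performed by stage 104 can be sketched as a small decision function. This is an illustration only: the `Lpsm` type and its `loudness_processed` flag are hypothetical names, and the handling of invalid LPSM (falling back to the output of stage 103) is an assumption that follows the earlier statement that purportedly performed processing may be repeated when metadata is unreliable. The text itself gives only the two valid-LPSM cases as examples.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Lpsm:
    # Hypothetical flag: True if the LPSM says the specific type of loudness
    # processing that stage 103 would apply has already been performed.
    loudness_processed: bool


def select_stream(lpsm: Lpsm, lpsm_valid: bool,
                  decoded: Sequence[float],
                  processed: Sequence[float]) -> Sequence[float]:
    """Pick the audio that stage 104 passes on to encoder 105."""
    if lpsm_valid and lpsm.loudness_processed:
        # Trusted metadata says the work is done: pass decoder output through.
        return decoded
    # Either the LPSM reports no prior processing, or (assumption) the LPSM
    # is invalid and cannot be trusted: use the output of stage 103.
    return processed
```

For example, `select_stream(Lpsm(True), True, decoded, processed)` returns `decoded`, while any call with `lpsm_valid=False` returns `processed`.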

Stage 103 of encoder 100 is configured to perform adaptive loudness processing on the decoded audio data output from decoder 101, based on one or more audio data characteristics indicated by the LPSM extracted by decoder 101. Stage 103 may be an adaptive transform-domain real-time loudness and dynamic range control processor. Stage 103 may receive user input (e.g., user target loudness/dynamic range values or dialnorm values), other metadata input (e.g., one or more types of third-party data, track information, identifiers, proprietary or standard information, user annotation data, user preference data, etc.), and/or other input (e.g., from a fingerprinting process), and use such input to process the decoded audio data output from decoder 101. Stage 103 may perform adaptive loudness processing on decoded audio data (output from decoder 101) indicative of a single audio program (as indicated by program boundary metadata extracted by parser 111), and may reset the loudness processing in response to receiving decoded audio data (output from decoder 101) indicative of a different audio program, as indicated by program boundary metadata extracted by parser 111.

When the control bits from validator 102 indicate that the LPSM is invalid, dialog loudness measurement subsystem 108 may operate to determine, using the LPSM (and/or other metadata) extracted by decoder 101, the loudness of segments of the decoded audio (from decoder 101) that are indicative of dialog (or other speech). Operation of dialog loudness measurement subsystem 108 may be disabled when the control bits from validator 102 indicate that the LPSM is valid and the LPSM indicates a previously determined loudness of the dialog (or other speech) segments of the decoded audio (from decoder 101). Subsystem 108 may perform a loudness measurement on decoded audio data indicative of a single audio program (as indicated by program boundary metadata extracted by parser 111), and may reset the measurement in response to receiving decoded audio data indicative of a different audio program, as indicated by such program boundary metadata.
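The behavior described for subsystem 108 (measure per program, reset at a program boundary, and skip measurement when valid LPSM already carries a loudness value) might be sketched as below. The class and function names are hypothetical, the `new_program` flag is assumed to be derived elsewhere from the program boundary metadata, and the level computed here is a plain mean-square level in dB, not a K-weighted or speech-gated measurement.

```python
import math
from typing import Optional


class DialogLoudnessMeter:
    """Running mean-square level that restarts at each program boundary."""

    def __init__(self) -> None:
        self._energy_sum = 0.0
        self._blocks = 0

    def reset(self) -> None:
        self._energy_sum = 0.0
        self._blocks = 0

    def update(self, block_mean_square: float, new_program: bool = False) -> None:
        # A block that starts a different program discards the old statistics.
        if new_program:
            self.reset()
        self._energy_sum += block_mean_square
        self._blocks += 1

    def loudness_db(self) -> Optional[float]:
        if self._blocks == 0 or self._energy_sum <= 0.0:
            return None
        return 10.0 * math.log10(self._energy_sum / self._blocks)


def dialog_loudness(lpsm_valid: bool, lpsm_loudness_db: Optional[float],
                    meter: DialogLoudnessMeter) -> Optional[float]:
    # Valid LPSM that already carries a loudness value makes a new
    # measurement unnecessary; otherwise fall back to the measured value.
    if lpsm_valid and lpsm_loudness_db is not None:
        return lpsm_loudness_db
    return meter.loudness_db()
```

Keeping the reset inside `update` means a caller cannot accidentally mix blocks from two programs into one measurement.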

Useful tools exist (e.g., the Dolby LM100 loudness meter) for measuring the level of dialog in audio content conveniently and easily. Some embodiments of the APU of the present invention (e.g., stage 108 of encoder 100) are implemented to include such a tool (or to perform its functions) in order to measure the mean dialog loudness of the audio content of an audio bitstream (e.g., a decoded AC-3 bitstream asserted to stage 108 from decoder 101 of encoder 100).

If stage 108 is implemented to measure the true mean dialog loudness of audio data, the measurement may include a step of isolating segments of the audio content that predominantly contain speech. The audio segments that predominantly contain speech are then processed in accordance with a loudness measurement algorithm. For audio data decoded from an AC-3 bitstream, this algorithm may be a standard K-weighted loudness measure (in accordance with the international standard ITU-R BS.1770). Alternatively, other loudness measures may be used (e.g., those based on psychoacoustic models of loudness).

The isolation of speech segments is not essential for measuring the mean dialog loudness of audio data. However, it improves the accuracy of the measurement and typically provides more satisfactory results from a listener's perspective. Because not all audio content contains dialog (speech), a loudness measurement of the whole audio content may provide a sufficient approximation of the dialog level the audio would have had, had speech been present.

Metadata generator 106 generates (and/or passes through to stage 107) the metadata to be included by stage 107 in the encoded bitstream to be output from encoder 100. Metadata generator 106 may pass through to stage 107 the LPSM (and optionally also program boundary metadata and/or other metadata) extracted by decoder 101 and/or parser 111 (e.g., when the control bits from validator 102 indicate that the LPSM and/or other metadata are valid); or generate new LPSM (and optionally also program boundary metadata and/or other metadata) and assert the new metadata to stage 107 (e.g., when the control bits from validator 102 indicate that the LPSM and/or other metadata extracted by decoder 101 are invalid); or assert to stage 107 a combination of metadata extracted by decoder 101 and/or parser 111 and newly generated metadata. Metadata generator 106 may include, in the LPSM it asserts to stage 107 for inclusion in the encoded bitstream to be output from encoder 100, the loudness data generated by subsystem 108 and at least one value indicative of the type of loudness processing performed by subsystem 108.

Metadata generator 106 may generate protection bits (which may consist of or include a hash-based message authentication code, or "HMAC") that are useful for at least one of decryption, authentication, or validation of the LPSM (and optionally also other metadata) to be included in the encoded bitstream and/or of the underlying audio data to be included in the encoded bitstream. Metadata generator 106 may provide such protection bits to stage 107 for inclusion in the encoded bitstream.

In typical operation, dialog loudness measurement subsystem 108 processes the audio data output from decoder 101 to generate, in response, loudness values (e.g., gated and ungated dialog loudness values) and dynamic range values. In response to these values, metadata generator 106 may generate loudness processing state metadata (LPSM) for inclusion (by stuffer/formatter 107) in the encoded bitstream to be output from encoder 100.

Additionally, optionally, or alternatively, subsystems 106 and/or 108 of encoder 100 may perform additional analysis of the audio data to generate metadata indicative of at least one characteristic of the audio data, for inclusion in the encoded bitstream to be output from stage 107.

Encoder 105 encodes (e.g., by performing compression on) the audio data output from selection stage 104, and asserts the encoded audio to stage 107 for inclusion in the encoded bitstream to be output from stage 107.

Stage 107 multiplexes the encoded audio from encoder 105 and the metadata (including LPSM) from generator 106 to generate the encoded bitstream to be output from stage 107, preferably so that the encoded bitstream has the format specified by a preferred embodiment of the present invention.

Frame buffer 109 is a buffer memory that stores (e.g., in a non-transitory manner) at least one frame of the encoded audio bitstream output from stage 107. A sequence of the frames of the encoded audio bitstream is then asserted from buffer 109 to delivery system 150 as the output of encoder 100.

The LPSM generated by metadata generator 106 and included in the encoded bitstream by stage 107 is indicative of the loudness processing state of the corresponding audio data (e.g., what type(s) of loudness processing have been performed on the audio data) and of the loudness of the corresponding audio data (e.g., measured dialog loudness, gated and/or ungated loudness, and/or dynamic range).

Herein, "gating" of loudness and/or level measurements performed on audio data refers to a specific level or loudness threshold, where only computed values that exceed the threshold are included in the final measurement (e.g., short-term loudness values below -60 dBFS are ignored in the final measured value). Gating on an absolute value refers to a fixed level or loudness, whereas gating on a relative value refers to a value that depends on the current "ungated" measurement value.
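The distinction between absolute and relative gating can be illustrated with a short sketch. It assumes the input is a list of per-block loudness values in dB and averages them in the energy domain. The -60 dB default mirrors the example threshold above. The relative threshold is derived here from the mean of the blocks that passed the absolute gate, which is one possible reading of "the current ungated measurement" and is an assumption of this sketch, as are the function names.

```python
import math
from typing import Optional, Sequence


def energy_mean_db(values_db: Sequence[float]) -> float:
    """Average dB values in the energy domain and return the result in dB."""
    energies = [10.0 ** (v / 10.0) for v in values_db]
    return 10.0 * math.log10(sum(energies) / len(energies))


def gated_loudness(block_loudness_db: Sequence[float],
                   absolute_gate_db: float = -60.0,
                   relative_gate_offset_db: Optional[float] = None
                   ) -> Optional[float]:
    # Absolute gate: a fixed threshold; quieter blocks are ignored.
    kept = [v for v in block_loudness_db if v > absolute_gate_db]
    if not kept:
        return None
    if relative_gate_offset_db is not None:
        # Relative gate: the threshold depends on the measurement made so far
        # (here, the mean of the blocks that passed the absolute gate).
        threshold = energy_mean_db(kept) + relative_gate_offset_db
        kept = [v for v in kept if v > threshold]
        if not kept:
            return None
    return energy_mean_db(kept)
```

For instance, with blocks at -20, -20, and -80 dB, the absolute gate drops the -80 dB block, so it no longer pulls the average down.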

In some implementations of encoder 100, the encoded bitstream buffered in memory 109 (and output to delivery system 150) is an AC-3 bitstream or an E-AC-3 bitstream, and comprises audio data segments (e.g., the AB0 to AB5 segments of the frame shown in FIG. 4) and metadata segments, where the audio data segments are indicative of audio data and each of at least some of the metadata segments includes loudness processing state metadata (LPSM). Stage 107 inserts the LPSM (and optionally also program boundary metadata) into the bitstream in the following format. Each metadata segment that includes LPSM (and optionally also program boundary metadata) is included in a waste bit segment of the bitstream (e.g., the waste bit segment "W" shown in FIG. 4 or FIG. 7), or in the "addbsi" field of the bitstream information ("BSI") segment of a frame of the bitstream, or in the auxdata field at the end of a frame of the bitstream (e.g., the AUX segment shown in FIG. 4 or FIG. 7). A frame of the bitstream may include one or two metadata segments, each of which includes LPSM; if the frame includes two metadata segments, one may be present in the addbsi field of the frame and the other in the AUX field of the frame. In some embodiments, each metadata segment including LPSM includes an LPSM payload (or container) segment having the following format:

a header (typically including a syncword identifying the start of the LPSM payload, followed by at least one identification value, e.g., the LPSM format version, length, period, count, and substream association values shown in Table 2 below); and

after the header,

at least one dialog indication value (e.g., the parameter "Dialog channel" of Table 2) indicating whether the corresponding audio data indicates dialog or does not indicate dialog (e.g., which channels of the corresponding audio data indicate dialog);

at least one loudness processing value (e.g., one or more of the parameters "Dialog gated Loudness Correction flag" and "Loudness Correction Type" of Table 2) indicative of at least one type of loudness processing that has been performed on the corresponding audio data; and

at least one loudness value (e.g., one or more of the parameters "ITU Relative Gated Loudness," "ITU Speech Gated Loudness," "ITU (EBU 3341) Short-term 3s Loudness," and "True Peak" of Table 2) indicative of at least one loudness characteristic (e.g., peak or average loudness) of the corresponding audio data.

In some embodiments, each metadata segment that contains LPSM and program boundary metadata contains a core header (and optionally also additional core elements) and, after the core header (or the core header and other core elements), an LPSM payload (or container) segment having the following format:

a header, typically including at least one identification value (e.g., the LPSM format version, length, period, count, and substream association values, as shown in Table 2 set forth herein), and

after the header, the LPSM and the program boundary metadata. The program boundary metadata may include a program boundary frame count, a code value (e.g., an "offset_exist" value) indicating whether the frame includes only a program boundary frame count or both a program boundary frame count and an offset value, and (in some cases) an offset value.

在一些实现中，由级107插入到比特流的帧的浪费比特分段或者“addbsi”域或者auxdata域中的元数据分段中的每个具有以下格式：In some implementations, each of the wasted bits segments or metadata segments in the "addbsi" field or auxdata field inserted into a frame of the bitstream by stage 107 has the following format:

A core header (typically including a synchronization word identifying the start of the metadata segment, followed by identification values, e.g., the core element version, length and period, extension element count, and substream association values indicated in Table 1 below); and

At least one protection value after the core header (e.g., the HMAC digest and audio fingerprint values of Table 1) useful for at least one of decryption, authentication, or validation of at least one of the loudness processing state metadata or the corresponding audio data; and

If the metadata segment includes LPSM, an LPSM payload identification (ID) value and an LPSM payload size value, also after the core header, which identify the following metadata as an LPSM payload and indicate the size of the LPSM payload.

The LPSM payload (or container) segment (preferably having the format described above) follows the LPSM payload ID and LPSM payload size values.

In some embodiments, each metadata segment in the auxdata field (or "addbsi" field) of a frame has a three-level structure:

A high-level structure, including: a flag indicating whether the auxdata (or addbsi) field includes metadata; at least one ID value indicating what type(s) of metadata are present; and typically also a value indicating how many bits of metadata (e.g., of each type) are present (if metadata is present). One type of metadata that may be present is LPSM, another is program boundary metadata, and another is media research metadata (e.g., Nielsen Media Research metadata);

A middle-level structure, including a core element for each identified type of metadata (e.g., a core header, protection values, and LPSM payload ID and payload size values of the type described above, for each identified type of metadata); and

The data values in such a three-level structure can be nested. For example, the protection values for an LPSM payload and/or another metadata payload identified by a core element can be included after each payload identified by the core element (and thus after the core header of the core element). In one example, the core header identifies an LPSM payload and another metadata payload; the payload ID and payload size values for the first payload (e.g., the LPSM payload) follow the core header; the first payload itself follows those ID and size values; the payload ID and payload size values for the second payload follow the first payload; the second payload itself follows those ID and size values; and the protection values for both payloads (or for the core element values and both payloads) follow the last payload.
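The nesting order in the example above (ID, size, payload, repeated, then trailing protection values) can be sketched as a simple record walk. This is a minimal sketch under stated assumptions: byte-aligned fields, a one-byte payload ID, a one-byte payload size counted in bytes, and a payload count supplied by the caller. None of these widths come from the text (Table 1 defines the actual fields), and the names `Payload` and `split_payloads` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Payload:
    """One (ID, data) record following the core header (hypothetical type)."""
    payload_id: int
    data: bytes


def split_payloads(body: bytes, payload_count: int) -> Tuple[List[Payload], bytes]:
    """Walk `payload_count` (ID, size, payload) records in `body`.

    Returns the payloads and whatever bytes follow the last payload,
    which in the example above would be the protection values.
    Assumes 1-byte ID and 1-byte size fields (illustrative only).
    """
    payloads: List[Payload] = []
    pos = 0
    for _ in range(payload_count):
        if pos + 2 > len(body):
            raise ValueError("truncated payload ID/size")
        payload_id, size = body[pos], body[pos + 1]
        pos += 2
        if pos + size > len(body):
            raise ValueError("truncated payload")
        payloads.append(Payload(payload_id, body[pos:pos + size]))
        pos += size
    return payloads, body[pos:]
```

Because each payload is preceded by its own size, a reader that does not recognize a given payload ID can still skip over that payload and reach the next record or the trailing protection values.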

In some embodiments, if decoder 101 receives an audio bitstream generated in accordance with an embodiment of the invention with a cryptographic hash, the decoder is configured to parse and retrieve the cryptographic hash from a data block determined from the bitstream, the block including loudness processing state metadata (LPSM) and optionally also program boundary metadata. Validator 102 can use the cryptographic hash to validate the received bitstream and/or associated metadata. For example, if validator 102 finds the LPSM to be valid based on a match between a reference cryptographic hash and the cryptographic hash retrieved from the data block, it can disable operation of processor 103 on the corresponding audio data and cause selection stage 104 to pass through the (unchanged) audio data. Additionally, optionally, or alternatively, other types of cryptographic techniques can be used in place of a method based on a cryptographic hash.
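The comparison of a reference hash against the hash retrieved from the data block can be sketched with Python's standard `hmac` module. This is a minimal sketch, assuming HMAC-SHA-256, a key shared between the encoder and the validating unit, and a digest computed over the LPSM bytes followed by the audio bytes; the text does not specify the hash function, the key handling, or the exact data covered, and the function name `lpsm_is_valid` is hypothetical.

```python
import hashlib
import hmac


def lpsm_is_valid(key: bytes, lpsm: bytes, audio: bytes, received_digest: bytes) -> bool:
    """Return True if the digest carried in the data block matches a
    reference digest recomputed over the LPSM and audio bytes.

    Assumes HMAC-SHA-256 over lpsm + audio (illustrative choice only).
    """
    reference = hmac.new(key, lpsm + audio, hashlib.sha256).digest()
    # compare_digest performs a constant-time comparison.
    return hmac.compare_digest(reference, received_digest)
```

In terms of the description above, a `True` result corresponds to the case in which the validator may disable further loudness processing and pass the audio data through unchanged.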

Encoder 100 of FIG. 2 may determine (in response to LPSM, and optionally also program boundary metadata, extracted by decoder 101) that a post-processing/pre-processing unit has already performed a type of loudness processing on the audio data to be encoded (in elements 105, 106, and 107), and hence may create (in generator 106) loudness processing state metadata that includes the specific parameters used in and/or derived from the previously performed loudness processing. In some implementations, encoder 100 may create (and include in the encoded bitstream output therefrom) processing state metadata indicative of the processing history of the audio content, so long as the encoder is aware of the types of processing that have been performed on the audio content.

FIG. 3 is a block diagram of a decoder (200) which is an embodiment of the inventive audio processing unit, and of a post-processor (300) coupled to decoder 200. Post-processor (300) is also an embodiment of the inventive audio processing unit. Any of the components or elements of decoder 200 and post-processor 300 may be implemented as one or more processes and/or one or more circuits (e.g., ASICs, FPGAs, or other integrated circuits), in hardware, software, or a combination of hardware and software. Decoder 200 includes frame buffer 201, parser 205, audio decoder 202, audio state validation stage (validator) 203, and control bit generation stage 204, connected as shown. Typically, decoder 200 also includes other processing elements (not shown).

Frame buffer 201 (a buffer memory) stores (in a non-transitory manner) at least one frame of the encoded audio bitstream received by decoder 200. A sequence of the frames of the encoded audio bitstream is passed from buffer 201 to parser 205.

Parser 205 is coupled and configured to: extract loudness processing state metadata (LPSM), and optionally also program boundary metadata and other metadata, from each frame of the encoded input audio; pass at least the LPSM (and the program boundary metadata, if any is extracted) to audio state validator 203 and stage 204; pass the LPSM (and optionally also the program boundary metadata) as output (e.g., to post-processor 300); extract audio data from the encoded input audio; and pass the extracted audio data to decoder 202.

The encoded audio bitstream input to decoder 200 may be one of an AC-3 bitstream, an E-AC-3 bitstream, or a Dolby E bitstream.

The system of FIG. 3 also includes post-processor 300. Post-processor 300 includes frame buffer 301 and other processing elements (not shown), including at least one processing element coupled to buffer 301. Frame buffer 301 stores (e.g., in a non-transitory manner) at least one frame of the decoded audio bitstream received by post-processor 300 from decoder 200. The processing elements of post-processor 300 are coupled and configured to receive and adaptively process a sequence of the frames of the decoded audio bitstream output from buffer 301, using metadata (including LPSM values) output from decoder 202 and/or control bits output from stage 204 of decoder 200. Typically, post-processor 300 is configured to perform adaptive loudness processing on the decoded audio data using the LPSM values and optionally also program boundary metadata (e.g., based on the loudness processing state, and/or one or more audio data characteristics, indicated by the LPSM for audio data indicative of a single audio program).

Various implementations of decoder 200 and post-processor 300 are configured to perform different embodiments of the inventive method.

Audio decoder 202 of decoder 200 is configured to decode the audio data extracted by parser 205 to generate decoded audio data, and to pass the decoded audio data as output (e.g., to post-processor 300).

State validator 203 is configured to authenticate and validate the LPSM (and typically also other metadata) passed to it. In some embodiments, the LPSM is (or is included in) a data block that has been included in the input bitstream (e.g., in accordance with an embodiment of the present invention). The block may include a cryptographic hash (a hash-based message authentication code, or "HMAC") for processing the LPSM (and optionally also other metadata) and/or the underlying audio data (provided to validator 203 from parser 205 and/or decoder 202). In these embodiments the data block may be digitally signed, so that a downstream audio processing unit can relatively easily authenticate and validate the processing state metadata.

Other cryptographic methods, including but not limited to any of one or more non-HMAC cryptographic methods, may be used for validation of LPSM (e.g., in validator 203) to ensure secure transmission and receipt of the LPSM and/or the underlying audio data. For example, validation (using such a cryptographic method) can be performed in each audio processing unit that receives an embodiment of the inventive audio bitstream, to determine whether the loudness processing state metadata and corresponding audio data included in the bitstream have undergone (and/or have resulted from) the specific loudness processing indicated by the metadata and have not been modified since that loudness processing was performed.

State validator 203 passes control data to control bit generator 204, and/or passes the control data as output (e.g., to post-processor 300), to indicate the result of the validation operation. In response to the control data (and optionally also other metadata extracted from the input bitstream), stage 204 may generate (and pass to post-processor 300) either:

Control bits indicating that the decoded audio data output from decoder 202 has undergone a specific type of loudness processing (when the LPSM indicates that the audio data output from decoder 202 has undergone the specific type of loudness processing, and the control bits from validator 203 indicate that the LPSM is valid); or

Control bits indicating that the decoded audio data output from decoder 202 should undergo a specific type of loudness processing (e.g., when the LPSM indicates that the audio data output from decoder 202 has not undergone the specific type of loudness processing, or when the LPSM indicates that the audio data output from decoder 202 has undergone the specific type of loudness processing but the control bits from validator 203 indicate that the LPSM is not valid).
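The two outcomes described for stage 204 reduce to a small decision over two inputs: what the LPSM claims and whether the validator accepted the LPSM. A minimal sketch follows; the enum `ControlBit` and the function `generate_control_bit` are hypothetical names, and the actual encoding of the control bits is not specified in the text.

```python
from enum import Enum


class ControlBit(Enum):
    """Hypothetical labels for the two control outcomes of stage 204."""
    ALREADY_PROCESSED = "already_processed"  # loudness processing was done and is trusted
    SHOULD_PROCESS = "should_process"        # downstream loudness processing is needed


def generate_control_bit(lpsm_says_processed: bool, lpsm_valid: bool) -> ControlBit:
    """Map the LPSM claim and the validator result to a control outcome."""
    if lpsm_says_processed and lpsm_valid:
        return ControlBit.ALREADY_PROCESSED
    # Either the audio was never processed, or the claim could not be validated.
    return ControlBit.SHOULD_PROCESS
```

Only a claim of prior processing that also passes validation suppresses downstream processing; every other combination falls back to processing the audio.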

Alternatively, decoder 200 passes to post-processor 300 the metadata extracted by decoder 202 from the input bitstream and the LPSM (and optionally also program boundary metadata) extracted by parser 205 from the input bitstream, and post-processor 300 either performs loudness processing on the decoded audio data using the LPSM (and optionally also the program boundary metadata), or performs validation of the LPSM and then, if the validation indicates that the LPSM is valid, performs loudness processing on the decoded audio data using the LPSM (and optionally also the program boundary metadata).

In some embodiments, if decoder 200 receives an audio bitstream generated in accordance with an embodiment of the invention with a cryptographic hash, the decoder is configured to parse and retrieve the cryptographic hash from a data block determined from the bitstream, the block including loudness processing state metadata (LPSM). Validator 203 can use the cryptographic hash to validate the received bitstream and/or associated metadata. For example, if validator 203 finds the LPSM to be valid based on a match between a reference cryptographic hash and the cryptographic hash retrieved from the data block, it can signal a downstream audio processing unit (e.g., post-processor 300, which may be or include a volume leveling unit) to pass through the (unchanged) audio data of the bitstream. Additionally, optionally, or alternatively, other types of cryptographic techniques can be used in place of a method based on a cryptographic hash.

In some implementations of decoder 200, the encoded bitstream received (and buffered in memory 201) is an AC-3 bitstream or an E-AC-3 bitstream, and comprises audio data segments (e.g., the AB0 to AB5 segments of the frame shown in FIG. 4) and metadata segments, where the audio data segments are indicative of audio data, and each of at least some of the metadata segments includes loudness processing state metadata (LPSM) and optionally also program boundary metadata. Decoder stage 202 (and/or parser 205) is configured to extract from the bitstream LPSM (and optionally also program boundary metadata) having the following format. Each of the metadata segments that includes LPSM (and optionally also program boundary metadata) is included in a wasted bits segment of a frame of the bitstream, or in an "addbsi" field of the bitstream information ("BSI") segment of a frame of the bitstream, or in an auxdata field (e.g., the AUX segment shown in FIG. 4) at the end of a frame of the bitstream. A frame of the bitstream may include one or two metadata segments, each of which may include LPSM; if the frame includes two metadata segments, one may be present in the addbsi field of the frame and the other in the AUX field of the frame. In some embodiments, each metadata segment including LPSM includes an LPSM payload (or container) segment having the following format:

A header (typically including a synchronization word identifying the start of the LPSM payload, followed by identification values, e.g., the LPSM format version, length, period, count, and substream association values indicated in Table 2 below); and

After the header,

At least one loudness regulation compliance value (e.g., the parameter "Loudness Regulation Type" of Table 2) indicating whether the corresponding audio data complies with an indicated set of loudness regulations;

In some implementations, parser 205 (and/or decoder stage 202) is configured to extract, from a wasted bits segment, an "addbsi" field, or an auxdata field of a frame of the bitstream, each metadata segment having the following format:

A core header (typically including a synchronization word identifying the start of the metadata segment, followed by at least one identification value, e.g., the core element version, length and period, extension element count, and substream association values indicated in Table 1 below); and

At least one protection value after the core header (e.g., the HMAC digest and audio fingerprint values of Table 1) useful for at least one of decryption, authentication, or validation of at least one of the loudness processing state metadata or the corresponding audio data; and

If the metadata segment includes LPSM, an LPSM payload identification (ID) value and an LPSM payload size value, also after the core header, which identify the following metadata as an LPSM payload and indicate the size of the LPSM payload.

More generally, the encoded audio bitstream generated by preferred embodiments of the invention has a structure that provides a mechanism for labeling metadata elements and sub-elements as core (mandatory) or extension (optional) elements. This allows the data rate of the bitstream (including its metadata) to scale across numerous applications. The core (mandatory) elements of the preferred bitstream syntax should also be capable of signaling that extension (optional) elements associated with the audio content are present (in-band) and/or at a remote location (out-of-band).

Core elements are required to be present in every frame of the bitstream. Some sub-elements of core elements are optional and may be present in any combination. Extension elements are not required to be present in every frame (to limit bit rate overhead). Thus, extension elements may be present in some frames and not in others. Some sub-elements of an extension element are optional and may be present in any combination, whereas other sub-elements of an extension element may be mandatory (i.e., if the extension element is present in a frame of the bitstream).

In a class of embodiments, an encoded audio bitstream comprising a sequence of audio data segments and metadata segments is generated (e.g., by an audio processing unit that embodies the invention). The audio data segments are indicative of audio data, each of at least some of the metadata segments includes loudness processing state metadata (LPSM) and optionally also program boundary metadata, and the audio data segments are time-division multiplexed with the metadata segments. In preferred embodiments in this class, each of the metadata segments has a preferred format to be described herein.

In one preferred format, the encoded bitstream is an AC-3 bitstream or an E-AC-3 bitstream, and each of the metadata segments that includes LPSM is included (e.g., by stage 107 of a preferred implementation of encoder 100) as additional bitstream information in the "addbsi" field (shown in FIG. 6) of the bitstream information ("BSI") segment of a frame of the bitstream, or in an auxdata field of a frame of the bitstream, or in a wasted bits segment of a frame of the bitstream.

In this preferred format, each frame includes, in the addbsi field (or wasted bits segment) of the frame, a core element having the format shown in Table 1 below:

Table 1

In the preferred format, each addbsi (or auxdata) field or wasted bits segment that contains LPSM contains a core header (and optionally also additional core elements) and, after the core header (or the core header and other core elements), the following LPSM values (parameters):

A payload ID (identifying the metadata as LPSM) following the core element values (e.g., as specified in Table 1);

A payload size (indicating the size of the LPSM payload) following the payload ID; and

LPSM data (following the payload ID and payload size values) having the format shown in the following table (Table 2):

Table 2

In another preferred format of an encoded bitstream generated in accordance with the invention, the bitstream is an AC-3 bitstream or an E-AC-3 bitstream, and each of the metadata segments that includes LPSM (and optionally also program boundary metadata) is included (e.g., by stage 107 of a preferred implementation of encoder 100) in any of: a wasted bits segment of a frame of the bitstream; an "addbsi" field (shown in FIG. 6) of the bitstream information ("BSI") segment of a frame of the bitstream; or an auxdata field (e.g., the AUX segment shown in FIG. 4) at the end of a frame of the bitstream. A frame may include one or two metadata segments, each of which includes LPSM; if the frame includes two metadata segments, one may be present in the addbsi field of the frame and the other in the AUX field of the frame. Each metadata segment including LPSM has the format specified above with reference to Tables 1 and 2 (i.e., it includes the core elements specified in Table 1, followed by the payload ID (identifying the metadata as LPSM) and the payload size value described above, followed by the payload (the LPSM data, having the format indicated in Table 2)).

In another preferred format, the encoded bitstream is a Dolby E bitstream, and each of the metadata segments that includes LPSM (and optionally also program boundary metadata) is included in the first N sample locations of the Dolby E guard band interval. A Dolby E bitstream including such a metadata segment (which includes LPSM) preferably includes a value indicative of the LPSM payload length, signaled in the Pd word of the SMPTE 337M preamble (the SMPTE 337M Pa word repetition rate preferably remains identical to the associated video frame rate).

In a preferred format in which the encoded bitstream is an E-AC-3 bitstream, each of the metadata segments that includes LPSM (and optionally also program boundary metadata) is included (e.g., by stage 107 of a preferred implementation of encoder 100) as additional bitstream information in a wasted bits segment, or in the "addbsi" field of the bitstream information ("BSI") segment, of a frame of the bitstream. Additional aspects of encoding an E-AC-3 bitstream with LPSM in this preferred format are described next:

1. During generation of an E-AC-3 bitstream, while the E-AC-3 encoder (which inserts the LPSM values into the bitstream) is "active," for every frame (sync frame) generated, the bitstream should include a metadata block (including LPSM) carried in the addbsi field (or wasted bits segment) of the frame. The bits required to carry the metadata block should not increase the encoder bit rate (frame length);

2. Every metadata block (containing LPSM) should contain the following information:

1. loudness_correction_type_flag: where "1" indicates that the loudness of the corresponding audio data was corrected upstream from the encoder, and "0" indicates that the loudness was corrected by a loudness corrector embedded in the encoder (e.g., loudness processor 103 of encoder 100 of FIG. 2);

2. speech_channel: indicates which source channel(s) contain speech (over the previous 0.5 seconds). If no speech is detected, this shall be indicated as such;

3. speech_loudness: indicates the integrated speech loudness of each corresponding audio channel that contains speech (over the previous 0.5 seconds);

4. ITU_loudness: indicates the integrated ITU BS.1770-2 loudness of each corresponding audio channel; and

5. gain: the loudness composite gain(s) for reversal in a decoder (to demonstrate reversibility);

3. While the E-AC-3 encoder (which inserts the LPSM values into the bitstream) is "active" and is receiving an AC-3 frame with a "trust" flag, the loudness controller in the encoder (e.g., loudness processor 103 of encoder 100 of FIG. 2) should be bypassed. The "trusted" source dialnorm and DRC values should be passed through (e.g., by generator 106 of encoder 100) to the E-AC-3 encoder component (e.g., stage 107 of encoder 100). LPSM block generation continues, and loudness_correction_type_flag is set to "1". The loudness controller bypass sequence must be synchronized to the start of the decoded AC-3 frame in which the "trust" flag appears. The loudness controller bypass sequence should be implemented as follows: the leveler_amount control is decremented from a value of 9 to a value of 0 over 10 audio block periods (i.e., 53.3 milliseconds), and the leveler_back_end_meter control is placed into bypass mode (this operation should result in a seamless transition). A "trusted" bypass of the leveler means that the dialnorm value of the source bitstream is also reused at the output of the encoder (e.g., if the "trusted" source bitstream has a dialnorm value of -30, the output of the encoder should use -30 for the outbound dialnorm value);

4. While the E-AC-3 encoder (which inserts the LPSM values into the bitstream) is "active" and is receiving an AC-3 frame without the "trust" flag, the loudness controller embedded in the encoder (e.g., loudness processor 103 of encoder 100 of FIG. 2) should be active. LPSM block generation continues, and loudness_correction_type_flag is set to "0". The loudness controller activation sequence should be synchronized to the start of the decoded AC-3 frame in which the "trust" flag disappears. The loudness controller activation sequence should be implemented as follows: the leveler_amount control is incremented from a value of 0 to a value of 9 over 1 audio block period (i.e., 5.3 milliseconds), and the leveler_back_end_meter control is placed into "active" mode (this operation should result in a seamless transition and include a full reset of the back_end_meter); and

5. During encoding, a graphical user interface (GUI) should indicate the following parameters to the user: "Input Audio Program: [Trusted/Untrusted]" (the state of this parameter is based on the presence of the "trust" flag within the input signal); and "Real-time Loudness Correction: [Enabled/Disabled]" (the state of this parameter is based on whether the loudness controller embedded in the encoder is active).
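The timing figures in items 3 and 4 above can be checked with a short calculation, and the bypass ramp can be written out as one leveler_amount value per audio block. This is a minimal sketch that assumes 256 samples per audio block and a 48 kHz sample rate (the values consistent with the 5.3 ms and 53.3 ms figures quoted above) and a linear, one-step-per-block ramp; the function names are hypothetical, and the encoder's real control interface is not described in the text.

```python
SAMPLES_PER_BLOCK = 256   # assumed AC-3 audio block length in samples
SAMPLE_RATE_HZ = 48_000   # assumed sample rate


def block_period_ms(blocks: int = 1) -> float:
    """Duration of `blocks` audio blocks in milliseconds."""
    return blocks * SAMPLES_PER_BLOCK / SAMPLE_RATE_HZ * 1000.0


def bypass_ramp(start: int = 9, end: int = 0) -> list:
    """leveler_amount values for the bypass sequence, one per audio block.

    Decrementing from 9 to 0 inclusive yields 10 values, matching the
    10 audio block periods given for the bypass sequence (assumed linear).
    """
    return list(range(start, end - 1, -1))


if __name__ == "__main__":
    print(round(block_period_ms(1), 1))   # one audio block period, in ms
    print(round(block_period_ms(10), 1))  # ten audio block periods, in ms
    print(bypass_ramp())
```

Under these assumptions one block lasts 256 / 48000 s, or about 5.3 ms, and ten blocks last about 53.3 ms, which agrees with the durations stated for the activation and bypass sequences respectively.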

When decoding an AC-3 or E-AC-3 bitstream that has LPSM (in the preferred format) included in a wasted bits segment, or in the "addbsi" field of the bitstream information ("BSI") segment, of each frame of the bitstream, the decoder should parse the LPSM block data (in the wasted bits segment or addbsi field) and pass all of the extracted LPSM values to a graphical user interface (GUI). The set of extracted LPSM values is refreshed every frame.

In another preferred format of an encoded bitstream generated in accordance with the invention, the encoded bitstream is an AC-3 bitstream or an E-AC-3 bitstream, and each metadata segment which includes LPSM is included (e.g., by stage 107 of a preferred implementation of encoder 100) as additional bitstream information in the "addbsi" field (shown in FIG. 6) of the Bitstream Information ("BSI") segment of a frame of the bitstream (or in an AUX segment or a wasted bits segment). In this format (which is a variation on the format described above with reference to Tables 1 and 2), each addbsi (or AUX or wasted bits) field which contains LPSM contains the following LPSM values:

The core element shown in Table 1, followed by a payload ID (identifying the metadata as LPSM) and a payload size value, followed by the payload (LPSM data), which has the following format (similar to the mandatory elements shown in Table 2 above):

LPSM payload version: a 2-bit field which indicates the version of the LPSM payload;

dialchan: a 3-bit field which indicates whether the Left, Right, and/or Center channels of the corresponding audio data contain spoken dialog. The bit allocation of the dialchan field may be as follows: bit 0, which indicates the presence of dialog in the left channel, is stored in the most significant bit of the dialchan field; and bit 2, which indicates the presence of dialog in the center channel, is stored in the least significant bit of the dialchan field. Each bit of the dialchan field is set to "1" if the corresponding channel contains spoken dialog during the preceding 0.5 seconds of the program;

loudregtyp: a 4-bit field which indicates which loudness regulation standard the program loudness complies with. Setting the "loudregtyp" field to "000" indicates that the LPSM does not indicate loudness regulation compliance. For example, one value of this field (e.g., 0000) may indicate that compliance with a loudness regulation standard is not indicated, another value of this field (e.g., 0001) may indicate that the audio data of the program complies with the ATSC A/85 standard, and another value of this field (e.g., 0010) may indicate that the audio data of the program complies with the EBU R128 standard. In the example, if the field is set to any value other than "0000", the loudcorrdialgat and loudcorrtyp fields should follow in the payload;

loudcorrdialgat: a 1-bit field which indicates whether dialog-gated loudness correction has been applied. If the loudness of the program has been corrected using dialog gating, the value of the loudcorrdialgat field is set to "1". Otherwise, it is set to "0";

loudcorrtyp: a 1-bit field which indicates the type of loudness correction applied to the program. If the loudness of the program has been corrected with an infinite look-ahead (file-based) loudness correction process, the value of the loudcorrtyp field is set to "0". If the loudness of the program has been corrected using a combination of real-time loudness measurement and dynamic range control, the value of this field is set to "1";

loudrelgate: a 1-bit field which indicates whether relative gated loudness data (ITU) exists. If the loudrelgate field is set to "1", a 7-bit ituloudrelgat field should follow in the payload;

loudrelgat: a 7-bit field which indicates relative gated program loudness (ITU). This field indicates the overall loudness of the audio program, measured according to ITU-R BS.1770-3 without any gain adjustment due to applied dialnorm and dynamic range compression. Values of 0 to 127 are interpreted as -58 LKFS to +5.5 LKFS, in 0.5 LKFS steps;

loudspchgate: a 1-bit field which indicates whether speech-gated loudness data (ITU) exists. If the loudspchgate field is set to "1", a 7-bit loudspchgat field should follow in the payload;

loudspchgat: a 7-bit field which indicates speech-gated program loudness. This field indicates the overall loudness of the entire corresponding audio program, measured according to formula (2) of ITU-R BS.1770-3 without any gain adjustment due to applied dialnorm and dynamic range compression. Values of 0 to 127 are interpreted as -58 LKFS to +5.5 LKFS, in 0.5 LKFS steps;

loudstrm3se: a 1-bit field which indicates whether short-term (3-second) loudness data exists. If the field is set to "1", a 7-bit loudstrm3s field should follow in the payload;

loudstrm3s: a 7-bit field which indicates the ungated loudness of the preceding 3 seconds of the corresponding audio program, measured according to ITU-R BS.1771-1 without any gain adjustment due to applied dialnorm and dynamic range compression. Values of 0 to 256 are interpreted as -116 LKFS to +11.5 LKFS, in 0.5 LKFS steps;

truepke: a 1-bit field which indicates whether true peak loudness data exists. If the truepke field is set to "1", an 8-bit truepk field should follow in the payload; and

truepk: an 8-bit field which indicates the true peak sample value of the program, measured according to Annex 2 of ITU-R BS.1770-3 without any gain adjustment due to applied dialnorm and dynamic range compression. Values of 0 to 256 are interpreted as -116 LKFS to +11.5 LKFS, in 0.5 LKFS steps.
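The conditional field layout listed above can be illustrated with a short parsing sketch. This is only a sketch under stated assumptions, not the patented implementation: it assumes the fields are packed MSB-first with no padding, in exactly the order listed, and the names BitReader, LpsmPayload, parse_lpsm_payload and lkfs_7bit are hypothetical helpers introduced here for illustration. Only the 7-bit mapping stated for loudrelgat and loudspchgat (0 to 127 mapped to -58 LKFS to +5.5 LKFS in 0.5 LKFS steps) is converted; other fields are returned as raw codes.

```python
from dataclasses import dataclass
from typing import Optional


class BitReader:
    """Reads MSB-first bit fields from a bytes object (assumed packing)."""

    def __init__(self, data: bytes) -> None:
        self.data = data
        self.pos = 0  # current bit position

    def read(self, nbits: int) -> int:
        value = 0
        for _ in range(nbits):
            byte = self.data[self.pos // 8]
            value = (value << 1) | ((byte >> (7 - self.pos % 8)) & 1)
            self.pos += 1
        return value


def lkfs_7bit(code: int) -> float:
    """Map a 7-bit code (0..127) to LKFS: -58 LKFS plus 0.5 LKFS per step."""
    return -58.0 + 0.5 * code


@dataclass
class LpsmPayload:
    version: int
    dialchan: int
    loudregtyp: int
    loudcorrdialgat: Optional[int] = None
    loudcorrtyp: Optional[int] = None
    loudrelgat: Optional[int] = None
    loudspchgat: Optional[int] = None
    loudstrm3s: Optional[int] = None
    truepk: Optional[int] = None


def parse_lpsm_payload(data: bytes) -> LpsmPayload:
    r = BitReader(data)
    p = LpsmPayload(version=r.read(2), dialchan=r.read(3), loudregtyp=r.read(4))
    if p.loudregtyp != 0:       # correction fields follow only if a standard is indicated
        p.loudcorrdialgat = r.read(1)
        p.loudcorrtyp = r.read(1)
    if r.read(1):               # loudrelgate
        p.loudrelgat = r.read(7)
    if r.read(1):               # loudspchgate
        p.loudspchgat = r.read(7)
    if r.read(1):               # loudstrm3se
        p.loudstrm3s = r.read(7)
    if r.read(1):               # truepke
        p.truepk = r.read(8)
    return p
```

Each "exists" flag gates the field that follows it, so a payload that carries only relative gated loudness occupies 22 bits under these assumptions.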

In some embodiments, the core element of a metadata segment in a wasted bits segment or in an auxdata (or "addbsi") field of a frame of an AC-3 bitstream or an E-AC-3 bitstream comprises a core header (typically including identification values, e.g., a core element version), and after the core header: values indicative of whether fingerprint data (or other protection values) are included for metadata of the metadata segment, values indicative of whether external data (related to audio data corresponding to the metadata of the metadata segment) exists, payload ID and payload size values for each type of metadata (e.g., LPSM and/or metadata of a type other than LPSM) identified by the core element, and protection values for at least one type of metadata identified by the core element. The metadata payload(s) of the metadata segment follow the core header and are (in some cases) nested within values of the core element.

Typical embodiments of the invention include program boundary metadata in an encoded audio bitstream in an efficient manner which allows accurate and robust determination of at least one boundary between consecutive audio programs indicated by the bitstream. Typical embodiments allow accurate and robust determination of a program boundary in the sense that they allow accurate program boundary determination even in cases in which bitstreams indicative of different programs are spliced together (to generate the inventive bitstream) in a manner that truncates one or both of the spliced bitstreams (and thus discards program boundary metadata that had been included in at least one of the pre-splicing bitstreams).

Maximum robustness could be achieved by inserting a program boundary marker in every frame of a bitstream indicative of a program, but this would typically not be practical due to the associated increase in data rate. In typical embodiments, program boundary markers are inserted in only a subset of the frames of an encoded audio bitstream (which may be indicative of one audio program or a sequence of audio programs), and the boundary marker insertion rate is a non-increasing function of increasing separation of each of the bitstream's frames (in which a marker is inserted) from the program boundary nearest to that frame, where "boundary marker insertion rate" denotes the average ratio of the number of frames (indicative of a program) that include a program boundary marker to the number of frames (indicative of the program) that do not include a program boundary marker, the average being a running average over a number (e.g., a relatively small number) of consecutive frames of the encoded audio bitstream.

Increasing the boundary marker insertion rate (e.g., at locations in the bitstream closer to a program boundary) increases the data rate required for delivery of the bitstream. To compensate for this, the size (number of bits) of each inserted marker is preferably decreased as the boundary marker insertion rate is increased (e.g., so that the size of the program boundary marker in the Nth frame of the bitstream, where N is an integer, is a non-decreasing function of the distance (number of frames) between the Nth frame and the nearest program boundary). In a class of embodiments, the boundary marker insertion rate is a logarithmically decreasing function of increasing distance (of each marker insertion location) from the nearest program boundary, and for each marker-containing frame which includes one of the markers, the size of the marker in that frame is equal to or greater than the size of each marker in a frame located closer to the nearest program boundary than that frame. Typically, the size of each marker is determined by an increasing function of the number of frames from the marker's insertion location to the nearest program boundary.

For example, consider the embodiment of FIGS. 8 and 9, in which each column identified by a frame number (in the top row) indicates a frame of an encoded audio bitstream. The bitstream is indicative of an audio program having a first program boundary (indicative of the start of the program) and a second program boundary (indicative of the end of the program), the first program boundary occurring immediately to the left of the column identified by frame number "17" on the left side of FIG. 9, and the second program boundary occurring immediately to the right of the column identified by frame number "1" on the right side of FIG. 8. The program boundary markers included in the frames shown in FIG. 8 count down the number of frames between the current frame and the second program boundary. The program boundary markers included in the frames shown in FIG. 9 count up the number of frames between the current frame and the first program boundary.

In the embodiment of FIGS. 8 and 9, a program boundary marker is inserted only in each 2^N-th frame of the first X frames of the encoded bitstream after the start of the audio program indicated by the bitstream, and in each 2^N-th frame (of the last X frames of the bitstream) nearest to the end of the program indicated by the bitstream, where the program comprises Y frames, X is an integer less than or equal to Y/2, and N is a positive integer in the range from 1 to log₂(X). Thus (as indicated in FIGS. 8 and 9), a program boundary marker is inserted in the 2nd frame (N = 1) of the bitstream (the marker-containing frame nearest to the start of the program), in the 4th frame (N = 2), in the 8th frame (N = 3), and so on, and in the 8th frame from the end of the bitstream, in the 4th frame from the end of the bitstream, and in the 2nd frame from the end of the bitstream (the marker-containing frame nearest to the end of the program). In this example, the program boundary marker in the 2^N-th frame from the start (or end) of the program comprises log₂(2^(N+2)) = N + 2 binary bits, as indicated in FIGS. 8 and 9. Thus, the program boundary marker in the 2nd frame (N = 1) from the start (or end) of the program comprises log₂(2^(N+2)) = log₂(2^3) = 3 binary bits, the marker in the 4th frame (N = 2) from the start (or end) of the program comprises log₂(2^(N+2)) = log₂(2^4) = 4 binary bits, and so on.
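The insertion rule just described (a marker in each 2^N-th frame counted from a boundary, carrying N + 2 bits) can be tabulated with a few lines of code. The following is a minimal sketch; the function name marker_schedule and its parameter x_frames (the value X of the text) are hypothetical names chosen for illustration.

```python
from typing import List, Tuple


def marker_schedule(x_frames: int) -> List[Tuple[int, int]]:
    """Return (frame distance from boundary, marker size in bits) pairs.

    A marker sits in each 2^N-th frame for N = 1 .. floor(log2(X)),
    and the marker in the 2^N-th frame has N + 2 bits.
    """
    max_n = x_frames.bit_length() - 1  # floor(log2(X)) for X >= 1
    return [(2 ** n, n + 2) for n in range(1, max_n + 1)]


# For X = 16: markers in frames 2, 4, 8 and 16 from the boundary.
print(marker_schedule(16))
```

Markers become both sparser and larger as the distance from the boundary grows, which is the trade-off between robustness and data rate discussed above.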

In the example of FIGS. 8 and 9, the format of each program boundary marker is as follows. Each program boundary marker consists of a leading "1" bit, a sequence of "0" bits (either no "0" bits or one or more consecutive "0" bits) after the leading bit, and a two-bit trailing code. The trailing code is "11" for the markers in the last X frames of the bitstream (the frames nearest to the end of the program), as indicated in FIG. 8. The trailing code is "10" for the markers in the first X frames of the bitstream (the frames nearest to the start of the program), as indicated in FIG. 9. Thus, to read (decode) each marker, the number of zeros between the leading "1" bit and the trailing code is counted. If the trailing code is identified as "11", the marker indicates that there are (2^(Z+1) - 1) frames between the current frame (the frame which includes the marker) and the end of the program, where Z is the number of zeros between the marker's leading "1" bit and trailing code. The decoder can be efficiently implemented to ignore the first and last bits of each such marker, to determine the inverse of the sequence of the marker's other (intermediate) bits (e.g., if the sequence of intermediate bits is "0001", with the "1" bit being the last bit in the sequence, the inverted sequence of intermediate bits is "1000", with the "1" bit being the first bit in the inverted sequence), and to identify the binary value of the inverted sequence of intermediate bits as the index of the current frame (the frame in which the marker is included) relative to the end of the program. For example, if the inverted sequence of intermediate bits is "1000", this inverted sequence has the binary value 2^4 = 16, and the frame is identified as the 16th frame before the end of the program (as indicated in the column of FIG. 8 which describes frame "0").

If the trailing code is identified as "10", the marker indicates that there are (2^(Z+1) - 1) frames between the start of the program and the current frame (the frame which includes the marker), where Z is the number of zeros between the marker's leading "1" bit and trailing code. The decoder can be efficiently implemented to ignore the first and last bits of each such marker, to determine the inverse of the sequence of the marker's intermediate bits (e.g., if the sequence of intermediate bits is "0001", with the "1" bit being the last bit in the sequence, the inverted sequence of intermediate bits is "1000", with the "1" bit being the first bit in the inverted sequence), and to identify the binary value of the inverted sequence of intermediate bits as the index of the current frame (the frame in which the marker is included) relative to the start of the program. For example, if the inverted sequence of intermediate bits is "1000", this inverted sequence has the binary value 2^4 = 16, and the frame is identified as the 16th frame after the start of the program (as indicated in the column of FIG. 9 which describes frame "32").
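The marker format lends itself to a compact encoder and decoder. The sketch below is illustrative only and makes two assumptions: markers are represented as strings of "0" and "1" characters, and decoding uses the zero-counting rule stated above (Z zeros between the leading "1" and the trailing code give 2^(Z+1) - 1 intervening frames). Since a marker in the 2^N-th frame has N + 2 bits, it contains Z = N - 1 zeros. The names encode_marker and decode_marker are hypothetical.

```python
from typing import Tuple


def encode_marker(n: int, near_end: bool) -> str:
    """Build the marker for the 2^n-th frame from a program boundary.

    Layout: leading '1', then n - 1 zeros, then the trailing code
    ('11' near the program end, '10' near the program start).
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    return "1" + "0" * (n - 1) + ("11" if near_end else "10")


def decode_marker(bits: str) -> Tuple[str, int]:
    """Return (boundary kind, number of frames between frame and boundary)."""
    if len(bits) < 3 or bits[0] != "1":
        raise ValueError("marker must start with a leading '1' bit")
    zeros, tail = bits[1:-2], bits[-2:]
    if set(zeros) - {"0"}:
        raise ValueError("only '0' bits may precede the trailing code")
    if tail == "11":
        kind = "end"      # counts down to the end of the program
    elif tail == "10":
        kind = "start"    # counts up from the start of the program
    else:
        raise ValueError("unknown trailing code")
    z = len(zeros)
    return kind, 2 ** (z + 1) - 1


print(decode_marker(encode_marker(4, near_end=True)))
```

A decoder built this way needs no lookup table: the position of a frame relative to the boundary follows directly from the marker length.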

In the example of FIGS. 8 and 9, a program boundary marker is present only in each 2^N-th frame of the first X frames of the encoded bitstream after the start of the audio program indicated by the bitstream, and in each 2^N-th frame (of the last X frames of the bitstream) nearest to the end of the program indicated by the bitstream, where the program comprises Y frames, X is an integer less than or equal to Y/2, and N is a positive integer in the range from 1 to log₂(X). Inclusion of the program boundary markers adds an average bit rate of only 1.875 bits/frame to the bit rate required to transmit the bitstream without the markers.

In a typical implementation of the embodiment of FIGS. 8 and 9 in which the bitstream is an AC-3 encoded audio bitstream, each frame contains audio content and metadata for 1536 samples of digital audio. For a sampling rate of 48 kHz, this represents 32 milliseconds of digital audio, or a rate of 31.25 frames per second of audio. Thus, in such an embodiment, a program boundary marker in a frame separated by some number of frames ("X" frames) from a program boundary indicates that the boundary occurs 32X milliseconds after the end of the marker-containing frame (or 32X milliseconds before the start of the marker-containing frame).

In a typical implementation of the embodiment of FIGS. 8 and 9 in which the bitstream is an E-AC-3 encoded audio bitstream, each frame of the bitstream contains audio content and metadata for 256, 512, 768, or 1536 samples of digital audio, depending on whether the frame contains one, two, three, or six blocks of audio data, respectively. For a sampling rate of 48 kHz, this represents 5.333, 10.667, 16, or 32 milliseconds of digital audio, respectively, or a rate of 187.5, 93.75, 62.5, or 31.25 frames per second of audio, respectively. Thus, in such an embodiment (assuming that each frame is indicative of 32 milliseconds of digital audio), a program boundary marker in a frame separated by some number of frames ("X" frames) from a program boundary indicates that the boundary occurs 32X milliseconds after the end of the marker-containing frame (or 32X milliseconds before the start of the marker-containing frame).
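The frame durations and frame rates quoted for AC-3 and E-AC-3 follow directly from the frame size and the sampling rate. A minimal sketch of the arithmetic, with frame_timing as a hypothetical helper name:

```python
from typing import Tuple


def frame_timing(samples_per_frame: int, sample_rate_hz: int = 48000) -> Tuple[float, float]:
    """Return (frame duration in milliseconds, frames per second)."""
    duration_ms = 1000.0 * samples_per_frame / sample_rate_hz
    frames_per_second = sample_rate_hz / samples_per_frame
    return duration_ms, frames_per_second


for size in (256, 512, 768, 1536):
    ms, fps = frame_timing(size)
    print(f"{size:5d} samples: {ms:7.3f} ms, {fps:7.2f} frames/s")
```

At 48 kHz a 1536-sample frame lasts 32 ms (31.25 frames per second) and a 256-sample frame lasts about 5.333 ms (187.5 frames per second).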

In some embodiments in which a program boundary can occur within a frame of an audio bitstream (i.e., not aligned with the start or end of a frame), the program boundary metadata included in a frame of the bitstream includes a program boundary frame count (i.e., metadata indicative of the number of full frames between the start or end of the frame which contains the frame count and the program boundary) and an offset value. The offset value is indicative of an offset (typically a number of samples) between the start or end of the frame which contains the program boundary and the actual location of the program boundary within that frame.

An encoded audio bitstream may be indicative of a sequence of programs (soundtracks) of a corresponding sequence of video programs, and the boundaries of such audio programs tend to occur at the edges of video frames rather than at the edges of audio frames. Also, some audio codecs (e.g., E-AC-3 codecs) use audio frame sizes that are not aligned with video frames. Also, in some cases an initially encoded audio bitstream undergoes transcoding to generate a transcoded bitstream, and the initially encoded bitstream has a different frame size than the transcoded bitstream, so that a program boundary (determined by the initially encoded bitstream) is not guaranteed to occur at a frame boundary of the transcoded bitstream. For example, if the initially encoded bitstream (e.g., bitstream "IEB" of FIG. 10) has a frame size of 1536 samples per frame, and the transcoded bitstream (e.g., bitstream "TB" of FIG. 10) has a frame size of 1024 samples per frame, the transcoding process may cause the actual program boundary to occur not at a frame boundary of the transcoded bitstream but somewhere within one of its frames (e.g., 512 samples into a frame of the transcoded bitstream, as indicated in FIG. 10), due to the differing frame sizes of the different codecs. Embodiments of the invention in which the program boundary metadata included in a frame of an encoded audio bitstream includes an offset value as well as a program boundary frame count are useful in the three cases noted in this paragraph (as well as in other cases).
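The transcoding case can be made concrete with integer division. The sketch below assumes that the program boundary is known as an absolute sample position counted from the first sample of the stream, and that frames of the transcoded bitstream start at sample 0; locate_boundary is a hypothetical helper, not part of any codec API.

```python
from typing import Tuple


def locate_boundary(boundary_sample: int, frame_size: int) -> Tuple[int, int]:
    """Return (index of the frame containing the boundary, offset in samples)."""
    return divmod(boundary_sample, frame_size)


# A boundary at the end of the first 1536-sample frame of the initially
# encoded bitstream lies 512 samples into frame 1 of a bitstream that
# uses 1024-sample frames.
print(locate_boundary(1536, 1024))
```

A boundary that coincides with a frame edge in the initially encoded bitstream therefore generally needs both a frame count and a non-zero offset in the transcoded bitstream.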

The embodiment described above with reference to FIGS. 8 and 9 does not include an offset value (e.g., an offset field) in any of the frames of the encoded bitstream. In variations on this embodiment, an offset value is included in each frame of the encoded audio bitstream which includes a program boundary marker (e.g., in frames corresponding to the frames numbered 0, 8, 12, and 14 in FIG. 8, and the frames numbered 18, 20, 24, and 32 in FIG. 9).

In a class of embodiments, a data structure (in each frame of an encoded bitstream which contains the inventive program boundary metadata) includes a code value indicative of whether the frame includes only a program boundary frame count, or both a program boundary frame count and an offset value. For example, the code value may be the value of a single-bit field (referred to herein as an "offset_exist" field), the value "offset_exist" = 0 may indicate that no offset value is included in the frame, and the value "offset_exist" = 1 may indicate that both a program boundary frame count and an offset value are included in the frame.

In some embodiments, at least one frame of an AC-3 or E-AC-3 encoded audio bitstream includes a metadata segment which includes LPSM and program boundary metadata (and optionally also other metadata) for an audio program determined by the bitstream. Each such metadata segment (which may be included in an addbsi field, an auxdata field, or a wasted bits segment of the bitstream) contains a core header (and optionally also additional core elements) and, after the core header (or the core header and other core elements), an LPSM payload (or container) segment having the following format:

a header (typically including at least one identification value, e.g., LPSM format version, length, period, count, and substream association values), and

after the header, the program boundary metadata (which may include a program boundary frame count, a code value (e.g., an "offset_exist" value) indicative of whether the frame includes only a program boundary frame count or both a program boundary frame count and an offset value, and in some cases an offset value) and the LPSM. The LPSM may include:

at least one loudness value indicative of at least one loudness characteristic (e.g., peak or average loudness) of the corresponding audio data.

In some embodiments, the LPSM payload segment includes a code value (an "offset_exist" value) indicative of whether the frame includes only a program boundary frame count, or both a program boundary frame count and an offset value. For example, in one such embodiment, when such a code value indicates (e.g., when offset_exist = 1) that the frame includes a program boundary frame count and an offset value, the LPSM payload segment may include an offset value which is an 11-bit unsigned integer (i.e., having a value from 0 to 2047) and which indicates the number of additional audio samples between the signaled frame boundary (the boundary of the frame which includes the program boundary) and the actual program boundary. If the program boundary frame count indicates the number of frames (at the current frame rate) to the frame containing the program boundary, the precise location (in units of number of samples) of the program boundary (relative to the start or end of the frame which includes the LPSM payload segment) may be calculated as:

S = (frame_counter * frame size) + offset,

where S is the number of samples to the program boundary (from the start or end of the frame which includes the LPSM payload segment), "frame_counter" is the frame count indicated by the program boundary frame count, "frame size" is the number of samples per frame, and "offset" is the number of samples indicated by the offset value.
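The calculation of S can be sketched in a few lines. The function below is illustrative only; samples_to_boundary and its parameter names are hypothetical, and an absent offset value (offset_exist = 0) is modeled here by the default offset of 0.

```python
def samples_to_boundary(frame_counter: int, frame_size: int, offset: int = 0) -> int:
    """Compute S = (frame_counter * frame size) + offset, in samples."""
    if frame_counter < 0 or frame_size <= 0 or offset < 0:
        raise ValueError("frame_counter and offset must be >= 0, frame_size > 0")
    return frame_counter * frame_size + offset


# Three full 1536-sample frames plus 200 additional samples.
print(samples_to_boundary(3, 1536, 200))
```

With a frame count of 3, a frame size of 1536 samples, and an offset of 200 samples, the boundary lies 4808 samples away.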

Some embodiments in which the insertion rate of program boundary markers increases near the actual program boundary implement a rule that an offset value is never included in a frame if the frame is less than or equal to some number ("Y") of frames from the frame which includes the program boundary. Typically, Y = 32. For an E-AC-3 encoder which implements this rule (with Y = 32), the encoder never inserts an offset value in the final second of an audio program. In this case, the receiving device is responsible for maintaining a counter and thus performing its own offset calculation (in response to program boundary metadata, including an offset value, in a frame of the encoded bitstream which is more than Y frames from the frame containing the program boundary).

For programs whose audio programs are known to be "frame aligned" with the video frames of corresponding video programs (e.g., contribution feeds, which typically carry Dolby E encoded audio), it may be unnecessary to include offset values in the encoded bitstreams indicative of the audio programs. Thus, offset values will typically not be included in such encoded bitstreams.

With reference to FIG. 11, we next consider cases in which encoded audio bitstreams are spliced together to generate an embodiment of the inventive audio bitstream.

The bitstream at the top of FIG. 11 (labeled "Scenario 1") is indicative of an entire first audio program (P1) including program boundary metadata (program boundary markers, F), followed by an entire second audio program (P2) which also includes program boundary metadata (program boundary markers, F). The program boundary markers in the end portion of the first program (some of which are shown in FIG. 11) are identical or similar to those described with reference to FIG. 8, and determine the location of the boundary between the two programs (i.e., the boundary at the start of the second program). The program boundary markers in the beginning portion of the second program (some of which are shown in FIG. 11) are identical or similar to those described with reference to FIG. 9, and they also determine the location of the boundary. In typical embodiments, an encoder or decoder implements a counter (calibrated by the markers in the first program) which counts down to the program boundary, and the same counter (calibrated by the markers in the second program) counts up from the same program boundary. As indicated by the boundary counter graph in Scenario 1 of FIG. 11, the count down of such a counter (calibrated by the markers in the first program) reaches zero at the boundary, and the count up of the counter (calibrated by the markers in the second program) indicates the same location of the boundary.

The second bitstream from the top of FIG. 11 (labeled "Scenario 2") is indicative of an entire first audio program (P1) that includes program boundary metadata (program boundary markers, F), followed by an entire second audio program (P2) that does not include program boundary metadata. The program boundary markers in the end portion of the first program (some of which are shown in FIG. 11) are identical or similar to those described with reference to FIG. 8, and they determine the location of the boundary between the two programs (i.e., the boundary at the beginning of the second program), just as in Scenario 1. In typical embodiments, an encoder or decoder implements a counter (calibrated by the markers in the first program) that counts down to the program boundary, and the same counter (without further calibration) continues to count up from the program boundary (as indicated by the boundary counter graph for Scenario 2 in FIG. 11).

The third bitstream from the top of FIG. 11 (labeled "Scenario 3") is indicative of a truncated first audio program (P1) that includes program boundary metadata (program boundary markers, F) and that has been spliced with an entire second audio program (P2) that also includes program boundary metadata (program boundary markers, F). The splicing has removed the last "N" frames of the first program. The program boundary markers in the beginning portion of the second program (some of which are shown in FIG. 11) are identical or similar to those described with reference to FIG. 9, and they determine the location of the boundary (the splice) between the truncated first program and the entire second program. In typical embodiments, an encoder or decoder implements a counter (calibrated by the markers in the first program) that counts down toward the end of the untruncated first program, and the same counter (calibrated by the markers in the second program) counts up from the beginning of the second program. The beginning of the second program is the program boundary in Scenario 3. As indicated by the boundary counter graph for Scenario 3 in FIG. 11, the countdown of such a counter (calibrated by the program boundary metadata in the first program) is reset (in response to the program boundary metadata in the second program) before it would have reached zero (in response to the program boundary metadata in the first program). Thus, although the truncation of the first program (by the splice) prevents the counter from identifying the program boundary between the truncated first program and the beginning of the second program in response to the program boundary metadata in the first program alone (i.e., under calibration by that metadata), the program boundary metadata in the second program resets the counter, so that the reset counter correctly indicates the location of the program boundary between the truncated first program and the beginning of the second program (as the location corresponding to the "0" count of the reset counter).

The fourth bitstream in FIG. 11 (labeled "Scenario 4") is indicative of a truncated first audio program (P1) that includes program boundary metadata (program boundary markers, F) and a truncated second audio program (P2) that includes program boundary metadata (program boundary markers, F) and that has been spliced with the portion of the first audio program that was not truncated. The program boundary markers in the beginning portion of the entire (pre-truncation) second program (some of which are shown in FIG. 11) are identical or similar to those described with reference to FIG. 9, and the program boundary markers in the end portion of the entire (pre-truncation) first program (some of which are shown in FIG. 11) are identical or similar to those described with reference to FIG. 8. The splicing has removed the last "N" frames of the first program (and thus some of the program boundary markers that had been included therein before the splice) and the first "M" frames of the second program (and thus some of the program boundary markers that had been included therein before the splice). In typical embodiments, an encoder or decoder implements a counter (calibrated by the markers in the truncated first program) that counts down toward the end of the untruncated first program, and the same counter (calibrated by the markers in the truncated second program) counts up from the beginning of the truncated second program. As indicated by the boundary counter graph for Scenario 4 in FIG. 11, the countdown of such a counter (calibrated by the program boundary metadata in the first program) is reset (in response to the program boundary metadata in the second program) before it would have reached zero (in response to the program boundary metadata in the first program). The truncation of the first program (by the splice) prevents the counter from identifying the program boundary between the truncated first program and the beginning of the truncated second program in response to the program boundary metadata in the first program alone (i.e., under calibration by that metadata). In this scenario, however, the reset counter does not correctly indicate the location of the program boundary between the end of the truncated first program and the beginning of the truncated second program. Thus, truncation of both spliced bitstreams may prevent accurate determination of the boundary between them.
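The behavior of the boundary counter in the four scenarios can be illustrated with a small simulation. The following Python sketch is a hypothetical model, not the encoder or decoder of any embodiment. It assumes that each frame either carries no marker (`None`) or carries a marker expressed as a signed frame offset relative to the boundary it refers to (negative while the boundary lies ahead, positive once it has passed). This marker encoding, the function name `boundary_estimates`, and the frame counts are assumptions made for illustration only; the actual marker format is the one described with reference to FIGS. 8 and 9.

```python
from typing import List, Optional, Sequence


def boundary_estimates(markers: Sequence[Optional[int]]) -> List[Optional[int]]:
    """Return, for each frame, the frame index at which the counter
    currently places the program boundary (None before any calibration).

    markers[i] is None if frame i carries no boundary marker; otherwise it
    is a signed offset (hypothetical encoding) of frame i from the boundary.
    """
    counter: Optional[int] = None
    estimates: List[Optional[int]] = []
    for index, marker in enumerate(markers):
        if marker is not None:
            counter = marker          # a marker (re)calibrates the counter
        elif counter is not None:
            counter += 1              # otherwise the counter free-runs
        # The boundary is the frame at which the counter reads (or read) 0.
        estimates.append(None if counter is None else index - counter)
    return estimates


# Illustrative programs (assumed lengths and marker positions).
P1 = [None, None, -4, None, -2, None]  # 6 frames, markers count down to the end
P2 = [None, 1, None, 3, None]          # 5 frames, markers count up from the start
P2_UNMARKED = [None] * 5               # second program without boundary metadata

scenario1 = P1 + P2                    # both programs whole
scenario2 = P1 + P2_UNMARKED           # P2 carries no markers
scenario3 = P1[:-2] + P2               # last N = 2 frames of P1 removed
scenario4 = P1[:-2] + P2[2:]           # also first M = 2 frames of P2 removed

for name, stream in [("1", scenario1), ("2", scenario2),
                     ("3", scenario3), ("4", scenario4)]:
    print("Scenario", name, "final boundary estimate:",
          boundary_estimates(stream)[-1])
```

Under these assumptions the final estimate is frame 6 for Scenarios 1 and 2 and frame 4 for Scenario 3, each matching the true boundary. For Scenario 4 the estimate is frame 2 although the splice is at frame 4, because the surviving marker in the truncated second program still refers to the start of the untruncated program.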

Embodiments of the present invention may be implemented in hardware, firmware, or software, or a combination thereof (e.g., as a programmable logic array). Unless otherwise specified, the algorithms or processes included as part of the invention are not inherently related to any particular computer or other apparatus. In particular, various general-purpose machines may be used with programs written in accordance with the teachings herein, or it may be more convenient to construct more specialized apparatus (e.g., integrated circuits) to perform the required method steps. Thus, the invention may be implemented in one or more computer programs executing on one or more programmable computer systems (e.g., an implementation of any of the elements of FIG. 1, or encoder 100 of FIG. 2 (or an element thereof), or decoder 200 of FIG. 3 (or an element thereof), or post-processor 300 of FIG. 3 (or an element thereof)), each programmable computer system comprising at least one processor, at least one data storage system (including volatile and non-volatile memory and/or storage elements), at least one input device or port, and at least one output device or port. Program code is applied to input data to perform the functions described herein and to generate output information. The output information is applied to one or more output devices in a known manner.

Each such program may be implemented in any desired computer language (including machine, assembly, or high-level procedural, logical, or object-oriented programming languages) to communicate with a computer system. In any case, the language may be a compiled or interpreted language.

For example, when implemented by computer software instruction sequences, the various functions and steps of embodiments of the invention may be implemented by multithreaded software instruction sequences running in suitable digital signal processing hardware, in which case the various devices, steps, and functions of the embodiments may correspond to portions of the software instructions.

Each such computer program is preferably stored on or downloaded to a storage medium or device (e.g., solid-state memory or media, or magnetic or optical media) readable by a general-purpose or special-purpose programmable computer, for configuring and operating the computer when the storage medium or device is read by the computer system to perform the procedures described herein. The inventive system may also be implemented as a computer-readable storage medium configured with (i.e., storing) a computer program, where the storage medium so configured causes a computer system to operate in a specific and predefined manner to perform the functions described herein.

A number of embodiments of the invention have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention. Numerous modifications and variations of the present invention are possible in light of the above teachings. It is to be understood that, within the scope of the appended claims, the invention may be practiced otherwise than as specifically described herein.

Notes:

1. An audio processing unit, comprising:

a buffer memory for storing at least one frame of an encoded audio bitstream, wherein the encoded audio bitstream includes audio data and a metadata container, and wherein the metadata container includes a header, one or more metadata payloads, and protection data;

an audio decoder coupled to the buffer memory and configured to decode the audio data; and

a parser coupled to or integrated with the audio decoder and configured to parse the encoded audio bitstream,

wherein the header includes a synchronization word that identifies the start of the metadata container, the one or more metadata payloads describe an audio program associated with the audio data, the protection data is located after the one or more metadata payloads, and the protection data can be used to verify the integrity of the metadata container and of the one or more payloads within the metadata container.

2. The audio processing unit of Note 1, wherein the metadata container is stored in an AC-3 or E-AC-3 reserved data space selected from: a skip field, an auxdata field, an addbsi field, and combinations thereof.

3. The audio processing unit of Note 1 or 2, wherein the one or more metadata payloads include metadata indicating at least one boundary between consecutive audio programs.

4. The audio processing unit of Note 1 or 2, wherein the one or more metadata payloads include a program loudness payload containing data indicating a measured loudness of an audio program.

5. The audio processing unit of Note 4, wherein the program loudness payload includes a field indicating whether an audio channel contains spoken dialog.

6. The audio processing unit of Note 4, wherein the program loudness payload includes a field indicating the loudness measurement method that has been used to generate the loudness data contained in the program loudness payload.

7. The audio processing unit of Note 4, wherein the program loudness payload includes a field indicating whether the loudness of the audio program has been corrected using dialog gating.

8. The audio processing unit of Note 4, wherein the program loudness payload includes a field indicating whether the loudness of the audio program has been corrected using an infinite look-ahead or file-based loudness correction process.

9. The audio processing unit of Note 4, wherein the program loudness payload includes a field indicating the overall loudness of the audio program without any gain adjustment attributable to dynamic range compression.

10. The audio processing unit of Note 4, wherein the program loudness payload includes a field indicating the overall loudness of the audio program without any gain adjustment attributable to dialog normalization.

11. The audio processing unit of Note 4, wherein the audio processing unit is configured to perform adaptive loudness processing using the program loudness payload.

12. The audio processing unit of any one of Notes 1 to 11, wherein the encoded audio bitstream is an AC-3 bitstream or an E-AC-3 bitstream.

13. The audio processing unit of any one of Notes 4 to 11, wherein the audio processing unit is configured to extract the program loudness payload from the encoded audio bitstream and to authenticate or validate the program loudness payload.

14. The audio processing unit of any one of Notes 1 to 13, wherein each of the one or more metadata payloads includes a unique payload identifier, and the unique payload identifier is located at the beginning of each metadata payload.

15. The audio processing unit of any one of Notes 1 to 13, wherein the synchronization word is a 16-bit synchronization word having the value 0x5838.

16. A method for decoding an encoded audio bitstream, the method comprising:

receiving an encoded audio bitstream, the encoded audio bitstream being segmented into one or more frames;

extracting audio data and a metadata container from the encoded audio bitstream, the metadata container comprising a header, followed by one or more metadata payloads, followed by protection data; and

verifying the integrity of the container and of the one or more metadata payloads by using the protection data,

wherein the one or more metadata payloads include a program loudness payload containing data indicating a measured loudness of an audio program associated with the audio data.

17. The method of Note 16, wherein the encoded audio bitstream is an AC-3 bitstream or an E-AC-3 bitstream.

18. The method of Note 16, further comprising:

performing adaptive loudness processing on the audio data extracted from the encoded audio bitstream using the program loudness payload.

19. The method of Note 16, wherein the container is located in or extracted from an AC-3 or E-AC-3 reserved data space selected from: a skip field, an auxdata field, an addbsi field, and combinations thereof.

20. The method of Note 16, wherein the program loudness payload includes a field indicating whether an audio channel contains spoken dialog.

21. The method of Note 16, wherein the program loudness payload includes a field indicating the loudness measurement method that has been used to generate the loudness data contained in the program loudness payload.

22. The method of Note 16, wherein the program loudness payload includes a field indicating whether the loudness of the audio program has been corrected using dialog gating.

23. The method of Note 16, wherein the program loudness payload includes a field indicating whether the loudness of the audio program has been corrected using an infinite look-ahead or file-based loudness correction process.

24. The method of Note 16, wherein the program loudness payload includes a field indicating the overall loudness of the audio program without any gain adjustment attributable to dynamic range compression.

25. The method of Note 16, wherein the program loudness payload includes a field indicating the overall loudness of the audio program without any gain adjustment attributable to dialog normalization.

26. The method of Note 16, wherein the metadata container includes metadata indicating at least one boundary between consecutive audio programs.

27. The method of Note 16, wherein the metadata container is stored in one or more skip fields or waste bit segments of a frame.
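The container structure recited in Notes 1 and 14 to 16 (a header beginning with a synchronization word, one or more payloads each beginning with a payload identifier, and trailing protection data used for an integrity check) can be sketched as follows. This Python sketch is illustrative only. Apart from the 16-bit synchronization word value 0x5838, every detail is an assumption: the one-byte payload count, the one-byte payload identifier and length fields, the use of CRC-32 as stand-in protection data, and the function names `build_container` and `parse_container` are hypothetical and are not taken from the AC-3 or E-AC-3 bitstream syntax.

```python
import struct
import zlib
from typing import Dict

SYNC_WORD = 0x5838  # 16-bit synchronization word value from Note 15


def build_container(payloads: Dict[int, bytes]) -> bytes:
    """Serialize payloads into the hypothetical layout:
    sync word (2 bytes) | payload count (1) |
    [payload id (1) | length (1) | data]... | CRC-32 (4)."""
    body = struct.pack(">HB", SYNC_WORD, len(payloads))
    for payload_id, data in payloads.items():
        body += struct.pack(">BB", payload_id, len(data)) + data
    # Stand-in protection data: CRC-32 over the header and all payloads.
    return body + struct.pack(">I", zlib.crc32(body))


def parse_container(buf: bytes) -> Dict[int, bytes]:
    """Parse the hypothetical layout and verify it with the protection data."""
    sync, count = struct.unpack_from(">HB", buf, 0)
    if sync != SYNC_WORD:
        raise ValueError("synchronization word not found at container start")
    offset = 3
    payloads: Dict[int, bytes] = {}
    for _ in range(count):
        # Each payload starts with its identifier, followed by its length.
        payload_id, length = struct.unpack_from(">BB", buf, offset)
        offset += 2
        payloads[payload_id] = buf[offset:offset + length]
        offset += length
    # The protection data follows the last payload.
    (stored,) = struct.unpack_from(">I", buf, offset)
    if stored != zlib.crc32(buf[:offset]):
        raise ValueError("protection data does not match container contents")
    return payloads


container = build_container({1: b"\x10\x20"})
print(parse_container(container))
```

A decoder following this pattern would use the payloads only after the integrity check succeeds. A real implementation would use whatever protection scheme the bitstream format specifies rather than the CRC-32 assumed here.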

Claims

1. An audio processing unit, comprising:

a buffer memory for storing at least one frame of an encoded audio bitstream, wherein the encoded audio bitstream includes audio data and metadata segments, wherein the metadata segments are separate and distinct data from the audio data, and each metadata segment includes a header, one or more metadata payloads, and protection data;

an audio decoder coupled to the buffer memory and configured to decode the audio data; and

a parser coupled to or integrated with the audio decoder and configured to parse the encoded audio bitstream,

wherein the header includes a synchronization word that identifies the start of the metadata segment, followed by at least one identifier value, the one or more metadata payloads describe an audio program associated with the audio data, the protection data is located after the one or more metadata payloads, and the protection data can be used to verify the integrity of the metadata segment and of the one or more metadata payloads within the metadata segment.

2. The audio processing unit of claim 1, wherein the metadata segments are stored in an AC-3 or E-AC-3 reserved data space selected from the following: a skip field, an auxdata field, an addbsi field, and combinations thereof.

3. The audio processing unit of claim 1, wherein the one or more metadata payloads include metadata indicating at least one boundary between consecutive audio programs.

4. The audio processing unit of claim 1, wherein the one or more metadata payloads include a program loudness payload containing data indicating the measured loudness of an audio program.

5. The audio processing unit of claim 4, wherein the program loudness payload includes a field indicating whether an audio channel contains spoken dialog.

6. The audio processing unit of claim 4, wherein the program loudness payload includes a field indicating the loudness measurement method that has been used to generate the loudness data contained in the program loudness payload.

7. The audio processing unit of claim 4, wherein the program loudness payload includes a field indicating whether the loudness of the audio program has been corrected using dialog gating.

8. The audio processing unit of claim 4, wherein the program loudness payload includes a field indicating whether the loudness of the audio program has been corrected using a file-based loudness correction process.

9. The audio processing unit of claim 4, wherein the program loudness payload includes a field indicating the overall loudness of the audio program without any gain adjustment attributable to dynamic range compression.

10. The audio processing unit of claim 4, wherein the program loudness payload includes a field indicating the overall loudness of the audio program without any gain adjustment attributable to dialog normalization.

11. The audio processing unit of claim 4, wherein the audio processing unit is configured to perform adaptive loudness processing using the program loudness payload.

12. The audio processing unit according to any one of claims 1 to 11, wherein the encoded audio bitstream is an AC-3 bitstream or an E-AC-3 bitstream.

13. The audio processing unit according to any one of claims 4 to 11, wherein the audio processing unit is configured to extract the program loudness payload from the encoded audio bitstream and authenticate or verify the program loudness payload.

14. The audio processing unit of claim 1, wherein each of the one or more metadata payloads includes a unique payload identifier, and the unique payload identifier is located at the beginning of each metadata payload.

15. The audio processing unit according to claim 1, wherein the synchronization word is a 16-bit synchronization word with a value of 0x5838.

16. The audio processing unit according to any one of claims 1 to 11, wherein the at least one identifier value includes at least one of: a core element version, length, and period; an extended element count; and a substream association value.

17. A method for decoding an encoded audio bitstream, the method comprising:

receiving an encoded audio bitstream, the encoded audio bitstream being segmented into one or more frames;

extracting audio data and metadata segments from the encoded audio bitstream, wherein the metadata segments are separate and distinct data from the audio data, each metadata segment includes a header, followed by one or more metadata payloads, followed by protection data, and the header includes a synchronization word that identifies the start of the metadata segment, followed by at least one identifier value; and

verifying the integrity of the metadata segments and of the one or more metadata payloads by using the protection data,

wherein the one or more metadata payloads include a program loudness payload containing data indicating a measured loudness of an audio program associated with the audio data.

18. The method according to claim 17, wherein the encoded audio bitstream is an AC-3 bitstream or an E-AC-3 bitstream.

19. The method of claim 17, further comprising:

performing adaptive loudness processing on the audio data extracted from the encoded audio bitstream using the program loudness payload.

20. The method of claim 17, wherein the metadata segment is located in or extracted from an AC-3 or E-AC-3 reserved data space, the AC-3 or E-AC-3 reserved data space being selected from: a skip field, an auxdata field, an addbsi field, and combinations thereof.

21. The method of claim 17, wherein the program loudness payload includes a field indicating whether an audio channel contains spoken dialog.

22. The method of claim 17, wherein the program loudness payload includes a field indicating the loudness measurement method that has been used to generate the loudness data contained in the program loudness payload.

23. The method of claim 17, wherein the program loudness payload includes a field indicating whether the loudness of the audio program has been corrected using dialog gating.

24. The method of claim 17, wherein the program loudness payload includes a field indicating whether the loudness of the audio program has been corrected using a file-based loudness correction process.

25. The method of claim 17, wherein the program loudness payload includes a field indicating the overall loudness of the audio program without any gain adjustment attributable to dynamic range compression.

26. The method of claim 17, wherein the program loudness payload includes a field indicating the overall loudness of the audio program without any gain adjustment attributable to dialog normalization.

27. The method of claim 17, wherein the metadata segment includes metadata indicating at least one boundary between consecutive audio programs.

28. The method of claim 17, wherein the metadata segment is stored in one or more skip fields or waste bit segments of a frame.

29. The method according to any one of claims 17 to 28, wherein the at least one identifier value includes at least one of: a core element version, length, and period; an extended element count; and a substream association value.