HK1262874B - Binaural rendering for headphones using metadata processing - Google Patents
This application is a divisional application of the patent application with application number 201480060042.2, filed October 28, 2014, and entitled "Binaural rendering for headphones using metadata processing".
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to U.S. Provisional Patent Application No. 61/898,365, filed October 31, 2013, the entire contents of which are hereby incorporated by reference.
Technical Field
One or more implementations relate generally to audio signal processing, and more specifically to binaural rendering of channel- and object-based audio for headphone playback.
Background Art
Virtual rendering of spatial audio over a pair of speakers commonly involves the creation of a stereo binaural signal that represents the desired sound arriving at the listener's left and right ears, and is synthesized to simulate a particular audio scene in three-dimensional (3D) space that may contain a multitude of sources at different locations. For playback through headphones rather than speakers, binaural processing or rendering can be defined as a set of signal processing operations aimed at reproducing the intended 3D locations of sound sources through headphones by emulating the natural spatial listening cues of human subjects. Typical core components of a binaural renderer are head-related filtering, to reproduce direction-dependent cues, and distance-cue processing, which may involve modeling the influence of a real or virtual listening room or environment. One example of a present binaural renderer processes each of the 5 or 7 channels of a channel-based audio presentation in 5.1 or 7.1 surround as 5 or 7 virtual sound sources in a 2D space surrounding the listener. Binaural rendering is also commonly found in games or gaming audio hardware, in which case the processing may be applied to individual audio objects in the game based on their individual 3D positions.
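The head-related filtering described above can be pictured as a per-source convolution: each loudspeaker feed (virtual source) is filtered with a left-ear and a right-ear head-related impulse response (HRIR) and the results are summed into the two binaural channels. The sketch below is a minimal illustration only; the one-tap HRIRs used in the example stand in for real measured responses.

```python
import numpy as np

def render_binaural(channel_feeds, hrirs):
    """Sum HRIR-filtered channel feeds into a binaural (left, right) pair.

    channel_feeds: dict name -> 1-D signal array
    hrirs:         dict name -> (hrir_left, hrir_right) arrays
    """
    # Full convolution length for the longest feed/filter pair.
    n = max(len(sig) + len(hrirs[name][0]) - 1
            for name, sig in channel_feeds.items())
    left = np.zeros(n)
    right = np.zeros(n)
    for name, sig in channel_feeds.items():
        hl, hr = hrirs[name]
        yl = np.convolve(sig, hl)   # direction-dependent cue for left ear
        yr = np.convolve(sig, hr)   # direction-dependent cue for right ear
        left[:len(yl)] += yl
        right[:len(yr)] += yr
    return left, right
```

A real renderer would additionally apply distance cues and room modeling per source, as the paragraph above notes.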
Traditionally, binaural rendering is a form of blind post-processing applied to multi-channel or object-based audio content. Some of the processing involved in binaural rendering can have undesired and negative effects on the timbre of the content, such as smoothing of transients or adding excessive reverberation to dialog or to some effects and music elements. With the growing importance of headphone listening and the additional flexibility brought by object-based content (such as the Atmos™ system), there is a greater opportunity and need for mixers to create and encode such binaural rendering metadata at content creation time, for example to instruct the renderer to process parts of the content with different algorithms or with different settings. Present systems do not feature this capability, nor do they allow such metadata to be carried in the codec as an additional, specific headphone payload.
Present systems are also not optimized on the playback end of the pipeline, insofar as content is not configured to be received on a device together with additional metadata that can be provided on the fly to the binaural renderer. While real-time head tracking has previously been implemented and shown to improve binaural rendering, other features, such as automated continuous head-size sensing and room sensing, and other customization features that improve the quality of binaural rendering, have generally not been effectively and efficiently implemented in headphone-based playback systems.
Therefore, there is a need for a binaural renderer, running on the playback device, that combines production metadata with locally generated real-time metadata to provide the end user with the best possible experience when listening to channel- and object-based audio over headphones. In addition, for channel-based content, it is generally required that artistic intent be preserved by incorporating audio segmentation analysis.
The subject matter discussed in the background section should not be assumed to be prior art simply because it is mentioned in the background section. Similarly, problems mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also be inventions.
Summary of the Invention
Embodiments are described for systems and methods of virtually rendering object-based audio content and improving equalization in a headphone-based playback system. Embodiments include a method for rendering audio for playback through headphones comprising: receiving digital audio content; receiving binaural rendering metadata generated by an authoring tool processing the received digital audio content; receiving playback metadata generated by a playback device; and combining the binaural rendering metadata and the playback metadata to optimize playback of the digital audio content through the headphones. The digital audio content may include channel-based audio and object-based audio, the object-based audio including positional information for reproducing an intended location of a corresponding sound source in three-dimensional space relative to a listener. The method further comprises separating the digital audio content into one or more components based on content type, wherein the content type is selected from the group consisting of: dialog, music, audio effects, transient signals, and ambient signals. The binaural rendering metadata controls a plurality of channel and object characteristics including: position, size, gain adjustment, and content-dependent settings or processing presets; the playback metadata controls a plurality of listener-specific characteristics including head position, head orientation, head size, listening room noise level, listening room properties, and the position of the playback device or screen relative to the listener. The method may further comprise receiving one or more user input commands modifying the binaural rendering metadata, the user input commands controlling one or more characteristics including: boost emphasis, wherein boosted objects and channels may receive a gain increase; a preferred 1D (one-dimensional) sound radius or 3D scaling factor for object or channel positioning; and processing mode enablement (e.g., to toggle between traditional stereo and full processing of the content). The playback metadata may be generated in response to sensor data provided by an enabled headset housing a plurality of sensors, the enabled headset forming part of the playback device. The method may further comprise: separating the input audio into separate sub-signals, for example by content type, or demixing the input audio (channel-based and object-based) into constituent direct content and diffuse content, wherein the diffuse content comprises reverberant or reflected sound elements; and performing binaural rendering independently on the separate sub-signals.
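The combining step described above can be pictured as layering the device-side playback metadata over the authored binaural rendering metadata and deriving adjusted rendering parameters from both. The field names and the noise-dependent gain rule below are illustrative assumptions, not the actual metadata syntax of the system.

```python
def combine_metadata(authoring_md, playback_md):
    """Merge authored rendering metadata with device-generated playback metadata.

    Both arguments are plain dicts with hypothetical field names such as
    'gain_db', 'position', 'room_noise_db', 'head_yaw'.
    """
    combined = dict(authoring_md)    # content-creator intent first
    combined.update(playback_md)     # listener/device state layered on top
    # Example interaction: raise the authored gain when the listening room
    # is noisy (an invented rule, purely for illustration).
    if "room_noise_db" in playback_md:
        boost = max(0.0, (playback_md["room_noise_db"] - 40.0) / 10.0)
        combined["gain_db"] = authoring_md.get("gain_db", 0.0) + boost
    return combined
```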
Embodiments are also directed to a method for rendering audio for playback through headphones by: receiving content-dependent metadata that dictates how content elements are rendered through the headphones; receiving sensor data from at least one of a playback device coupled to the headphones and an enabled headset comprising the headphones; and modifying the content-dependent metadata with the sensor data to optimize the rendered audio with respect to one or more playback and user characteristics. The content-dependent metadata may be generated by an authoring tool operated by a content creator, wherein the content-dependent metadata dictates the rendering of audio signals comprising audio channels and audio objects. The content-dependent metadata controls a plurality of channel and object characteristics selected from the group consisting of: position, size, gain adjustment, boost emphasis, stereo/full toggling, 3D scaling factor, content-dependent settings, and other spatial and timbre properties of the rendered sound field. The method may further comprise formatting the sensor data into a metadata format compatible with the content-dependent metadata to generate playback metadata. The playback metadata controls a plurality of listener-specific characteristics selected from the group consisting of: head position, head orientation, head size, listening room noise level, listening room properties, and sound source device location. In an embodiment, the metadata format comprises a container holding one or more payload packets that conform to a defined syntax and encode digital audio definitions of corresponding audio content elements. The method may further comprise encoding the combined playback metadata and content-dependent metadata together with source audio content into a bitstream for processing in a rendering system; and decoding the encoded bitstream to extract one or more parameters derived from the content-dependent metadata and the playback metadata to generate control signals that modify the source audio content for playback through the headphones.
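The container-of-payload-packets idea described above can be pictured as a simple id-length-value framing: each payload is prefixed with an identifier and a byte length so a decoder can walk the container and dispatch each packet. The numeric layout below is invented for illustration and does not reflect the actual codec syntax.

```python
import struct

def pack_payloads(payloads):
    """payloads: list of (payload_id, body_bytes) -> one container blob."""
    out = bytearray()
    for pid, body in payloads:
        # 2-byte id, 4-byte length (big-endian), then the payload body.
        out += struct.pack(">HI", pid, len(body)) + body
    return bytes(out)

def unpack_payloads(blob):
    """Inverse of pack_payloads: walk the container and recover each packet."""
    payloads, off = [], 0
    while off < len(blob):
        pid, length = struct.unpack_from(">HI", blob, off)
        off += 6
        payloads.append((pid, blob[off:off + length]))
        off += length
    return payloads
```

A decoder that does not recognize a payload id can skip the packet using its length field, which is one reason such framings are common for extensible metadata.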
The method may further comprise performing one or more post-processing functions on the source audio content prior to playback through the headphones, wherein the post-processing functions comprise at least one of: downmixing from a plurality of surround sound channels to one of a binaural mix or a stereo mix, level management, equalization, timbre correction, and noise cancellation.
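One of the listed post-processing functions, downmixing a surround feed to a two-channel stereo mix, can be sketched with conventional -3 dB (0.7071) weights on the centre and surround channels. The coefficients here follow common practice but are chosen for illustration; the actual downmix rules of the system are not specified in this passage.

```python
import numpy as np

def downmix_51_to_stereo(ch):
    """Downmix a 5.1 feed to stereo.

    ch: dict with keys 'L', 'R', 'C', 'LFE', 'Ls', 'Rs' mapping to
    equal-length sample arrays. The LFE channel is dropped, as is common.
    """
    g = 0.7071  # approximately -3 dB
    left = ch["L"] + g * ch["C"] + g * ch["Ls"]
    right = ch["R"] + g * ch["C"] + g * ch["Rs"]
    return left, right
```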
Embodiments further relate to systems and articles of manufacture that execute processing instructions that perform or implement the methods described above.
Incorporated by Reference
Each publication, patent, and/or patent application mentioned in this specification is herein incorporated by reference in its entirety, to the same extent as if each individual publication and/or patent application were specifically and individually indicated to be incorporated by reference.
BRIEF DESCRIPTION OF THE DRAWINGS
In the following drawings, like reference numerals are used to refer to like elements. Although the following figures depict various examples, the one or more implementations are not limited to the examples depicted in the figures.
FIG. 1 illustrates an overall system that incorporates embodiments of a content creation, rendering, and playback system, under some embodiments.
FIG. 2A is a block diagram of an authoring tool used in an object-based headphone rendering system, under an embodiment.
FIG. 2B is a block diagram of an authoring tool used in an object-based headphone rendering system, under an alternative embodiment.
FIG. 3A is a block diagram of a rendering component used in an object-based headphone rendering system, under an embodiment.
FIG. 3B is a block diagram of a rendering component used in an object-based headphone rendering system, under an alternative embodiment.
FIG. 4 is a block diagram providing an overview of a dual-ended binaural rendering system, under an embodiment.
FIG. 5 illustrates an authoring tool GUI that may be used with embodiments of a headphone rendering system, under an embodiment.
FIG. 6 illustrates an enabled headset including one or more sensors that sense playback conditions for encoding as metadata used in a headphone rendering system, under an embodiment.
FIG. 7 illustrates a connection between a headset and a device including a headset sensor processor, under an embodiment.
FIG. 8 is a block diagram illustrating the different metadata components that may be used in a headphone rendering system, under an embodiment.
FIG. 9 illustrates functional components of a binaural rendering component for headphone processing, under an embodiment.
FIG. 10 illustrates a binaural rendering system for rendering audio objects in a headphone rendering system, under an embodiment.
FIG. 11 illustrates a more detailed representation of the binaural rendering system of FIG. 10, under an embodiment.
FIG. 12 is a system diagram illustrating the different tools used in an HRTF modeling system for use in a headphone rendering system, under an embodiment.
FIG. 13 illustrates a data structure that enables delivery of metadata to a headphone rendering system, under an embodiment.
FIG. 14 illustrates an example case of three impulse response measurements for each ear in an embodiment of a headphone equalization process.
FIG. 15A illustrates a circuit for calculating free-field sound transmission, under an embodiment.
FIG. 15B illustrates a circuit for calculating headphone sound transmission, under an embodiment.
DETAILED DESCRIPTION
Systems and methods are described for virtual rendering of object-based content over headphones, and for a metadata delivery and processing system for such virtual rendering, though applications are not so limited. Aspects of the one or more embodiments described herein may be implemented in an audio or audio-visual system that processes source audio information in a mixing, rendering, and playback system that includes one or more computers or processing devices executing software instructions. Any of the described embodiments may be used alone or together with one another in any combination. Although various embodiments may have been motivated by various deficiencies with the prior art, which may be discussed or alluded to in one or more places in the specification, the embodiments do not necessarily address any of these deficiencies. In other words, different embodiments may address different deficiencies that may be discussed in the specification. Some embodiments may only partially address some deficiencies or just one deficiency that may be discussed in the specification, and some embodiments may not address any of these deficiencies.
Embodiments are directed to an audio content generation and playback system that optimizes the rendering and playback of object- and/or channel-based audio through headphones. FIG. 1 illustrates an overall system that incorporates embodiments of a content creation, rendering, and playback system, under some embodiments. As shown in system 100, an authoring tool 102 is used by a creator to generate audio content for playback through one or more devices 104 for a user to listen to through headphones 116 or 118. The device 104 is generally a small computer or mobile telecommunication device, or a portable audio or music player, that runs applications allowing for the playback of audio content. Such a device may be a mobile phone or audio (e.g., MP3) player 106, a tablet computer (e.g., Apple iPad or similar device) 108, a music console 110, a notebook computer 111, or any similar audio playback device. The audio may comprise music, dialog, effects, or any digital audio that may be desired to be listened to over headphones, and such audio may be streamed wirelessly from a content source, played back locally from a storage medium (e.g., disc, flash drive, etc.), or generated locally. In the following description, the term "headphone" usually refers specifically to a close-coupled playback device worn by the user directly over his or her ears, or to an in-ear listening device; it may also refer generally, as in the terms "headphone processing" or "headphone rendering," to at least some of the processing performed to render a signal intended for playback on headphones.
In an embodiment, the audio processed by the system may comprise channel-based audio, object-based audio, or object- and channel-based audio (e.g., hybrid or adaptive audio). The audio comprises, or is associated with, metadata that dictates how the audio is rendered for playback on specific endpoint devices and listening environments. Channel-based audio generally refers to an audio signal plus metadata in which the position is coded as a channel identifier, where the audio is formatted for playback through a predefined set of speaker zones with associated nominal surround-sound locations (e.g., 5.1, 7.1, and so on); object-based means one or more audio channels with a parametric source description, such as apparent source position (e.g., 3D coordinates), apparent source width, etc. The term "adaptive audio" may be used to mean channel-based and/or object-based audio signals plus metadata that renders the audio signals based on the playback environment, using an audio stream plus metadata in which the position is coded as a 3D position in space. In general, the listening environment may be any open, partially enclosed, or fully enclosed area, such as a room, but embodiments described herein are generally directed to playback through headphones or other proximate endpoint devices. Audio objects can be considered groups of sound elements that may be perceived to emanate from a particular physical location or locations in the environment, and such objects can be static or dynamic. The audio objects are controlled by metadata that, among other things, details the position of the sound at a given point in time, and upon playback they are rendered according to the positional metadata. In a hybrid audio system, channel-based content (e.g., "beds") may be processed in addition to audio objects, where beds are effectively channel-based sub-mixes or stems. These can be delivered for final playback (rendering) and can be created in different channel-based configurations such as 5.1 and 7.1.
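The audio-object idea above, a signal accompanied by time-varying positional metadata, can be pictured with a small data structure. The field names and the keyframe convention below are illustrative assumptions, not the actual object format.

```python
from dataclasses import dataclass, field

@dataclass
class AudioObject:
    """An audio object: a signal reference plus positional metadata."""
    signal_id: str
    content_type: str                 # e.g. "dialog", "music", "ambience"
    # (time_s, x, y, z) keyframes, sorted by time.
    positions: list = field(default_factory=list)

    def position_at(self, t):
        """Return the most recent keyframe position at time t (no interpolation)."""
        current = self.positions[0][1:]
        for ts, x, y, z in self.positions:
            if ts <= t:
                current = (x, y, z)
        return current
```

A renderer would query `position_at` (typically with interpolation) at each block of samples to place the object in 3D space, which is the "rendered according to the positional metadata" step described above.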
As shown in FIG. 1, the headphones utilized by a user may be legacy or passive headphones 118 that include only non-powered transducers that simply recreate the audio signal, or they may be enabled headphones 116 that include sensors and other components (powered or non-powered) that provide certain operational parameters back to the renderer for further processing and optimization of the audio content. Headphones 116 or 118 may be embodied in any appropriate close-ear device, such as open or closed headphones, over-ear or in-ear headphones, earbuds, earpads, noise-canceling, isolation, or other types of headphone devices. Such headphones may be wired or wireless with regard to their connection to the sound source or device 104.
In an embodiment, the audio content from authoring tool 102 includes stereo or channel-based audio (e.g., 5.1 or 7.1 surround sound) in addition to object-based audio. For the embodiment of FIG. 1, a renderer 112 receives the audio content from the authoring tool and provides certain functions that optimize the audio content for playback through device 104 and headphones 116 or 118. In an embodiment, the renderer 112 includes a pre-processing stage 113, a binaural rendering stage 114, and a post-processing stage 115. The pre-processing stage 113 generally performs, among other functions, certain segmentation operations on the input audio, such as segmenting it based on its content type; the binaural rendering stage 114 generally combines and processes the metadata associated with the channel and object components of the audio and generates a binaural stereo output, or a multi-channel audio output comprising binaural stereo plus additional low-frequency outputs; and the post-processing component 115 generally performs downmixing, equalization, gain/loudness/dynamic range control, and other functions prior to transmission of the audio signal to the device 104. It should be noted that while the renderer will most likely produce a two-channel signal in most cases, it can be configured to provide more than two channels of input to specific enabled headphones, for example to deliver a separate bass channel (similar to the LFE .1 channel in traditional surround sound). The enabled headphones may have specific sets of drivers to reproduce bass components separately from the mid- to higher-frequency sound.
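The three renderer stages described above can be pictured as a simple chain. Each stage in this sketch is a stub standing in for the real processing (segmentation, HRTF filtering, and level management respectively); the 0.5 headroom gain is an invented placeholder.

```python
def preprocess(audio):
    # Pre-processing stage 113: e.g. segment by content type
    # (stubbed here as a pass-through).
    return audio

def binaural_render(audio):
    # Binaural rendering stage 114: e.g. per-object HRTF filtering
    # (stubbed: duplicate the signal to left/right ears).
    return {"left": list(audio), "right": list(audio)}

def postprocess(binaural):
    # Post-processing stage 115: e.g. level management / EQ
    # (stubbed: apply a fixed headroom gain).
    g = 0.5
    return {k: [g * x for x in v] for k, v in binaural.items()}

def render_pipeline(audio):
    """Chain the three stages in the order described in the text."""
    return postprocess(binaural_render(preprocess(audio)))
```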
It should be noted that the components of FIG. 1 generally represent the main functional blocks of the audio generation, rendering, and playback systems, and certain functions may be incorporated as part of one or more other components. For example, one or more portions of the renderer 112 may be partially or wholly incorporated in the device 104. In this case, the audio player or tablet (or other device) may include a renderer component integrated within the device. Similarly, the enabled headphones 116 may include at least some functions associated with the playback device and/or the renderer. In such a case, fully integrated headphones may include an integrated playback device (e.g., a built-in content decoder, such as an MP3 player) as well as integrated rendering components. Additionally, one or more components of the renderer 112, such as the pre-processing component 113, may be implemented at least in part in the authoring tool, or as part of a separate pre-processing component.
FIG. 2A is a block diagram of an authoring tool used in an object-based headphone rendering system, under an embodiment. As shown in FIG. 2A, input audio 202 from an audio source (e.g., live sources, recordings, etc.) is input to a digital audio workstation (DAW) 204 for processing by a sound engineer. The input audio 202 is typically in digital form, and if analog audio is used, an A/D (analog-to-digital) conversion step (not shown) is required. The audio typically comprises object- and channel-based content, such as may be used in an adaptive audio system (e.g., Dolby Atmos), and often comprises several different types of content. The input audio may be segmented through an optional audio segmentation pre-process that divides (or segments) the audio based on its content type, so that different types of audio can be rendered differently. For example, dialog may be rendered differently than transient signals or ambient signals. The DAW 204 may be implemented as a workstation for editing and processing the segmented or unsegmented digital audio 202, and may include a mixing console, control surface, audio converter, data storage, and other appropriate elements. In an embodiment, the DAW is a processing platform that runs digital audio software that provides comprehensive editing functions as well as an interface for one or more plug-in programs, such as a panner plug-in, in addition to other functions (such as equalizers, synthesizers, effects, and so on). The panner plug-in shown in DAW 204 performs a panning function configured to distribute each object signal to a specific speaker pair or position in 2D/3D space in a manner that conveys the desired position of each corresponding object signal to the listener.
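The panning function described above can be sketched, for the simplest case of a single speaker pair, as constant-power panning: the object signal is weighted by cosine/sine gains so perceived loudness stays roughly constant as the source moves. A full 2D/3D panner would generalize this to speaker triplets or quads; this sketch is illustrative only.

```python
import math

def constant_power_pan(sample, pan):
    """Constant-power pan of a sample between a speaker pair.

    pan in [-1, 1]: -1 = fully left, +1 = fully right.
    Returns (left_gain_applied, right_gain_applied) sample values.
    """
    theta = (pan + 1.0) * math.pi / 4.0   # map pan to [0, pi/2]
    return sample * math.cos(theta), sample * math.sin(theta)
```

Because cos²θ + sin²θ = 1, the summed power of the two outputs is independent of the pan position, which is the property that keeps a moving source at a steady loudness.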
In the authoring tool 102a, the processed audio from the DAW 204 is input to a binaural rendering component 206. This component comprises audio processing functions that generate a binaural audio output 210 along with binaural rendering metadata 208 and spatial media type metadata 212. The audio 210 and the metadata components 208 and 212 together form an encoded audio bitstream with a binaural metadata payload 214. In general, the audio component 210 comprises channel- and object-based audio that is passed to the bitstream 214 along with the metadata components 208 and 212; it should be noted, however, that the audio component 210 may be standard multi-channel audio, binaurally rendered audio, or a combination of both audio types. The binaural rendering component 206 also includes a binaural metadata input function that directly generates a headphone output 216 for direct connection to headphones. For the embodiment of FIG. 2A, the metadata for binaural rendering is generated at mix time within the authoring tool 102a. In an alternative embodiment, as shown with reference to FIG. 2B, the metadata may be generated at encode time. As shown in FIG. 2A, a mixer 203 uses applications or tools to create the audio data along with the binaural and spatial metadata. The mixer 203 provides input to the DAW 204; alternatively, it may also provide input directly to the binaural rendering process 206. In an embodiment, the mixer receives the headphone audio output 216 so that the mixer can monitor the effect of the audio and metadata inputs. This effectively constitutes a feedback loop in which the mixer receives the headphone-rendered audio through the headphone output to determine whether any input changes are required. The mixer 203 may be a human-operated device, such as a mixing console or computer, or it may be a remotely controlled or pre-programmed automated process.
FIG. 2B is a block diagram of an authoring tool used in an object-based headphone rendering system, under an alternative embodiment. In this embodiment, the metadata for binaural rendering is generated at encode time, and the encoder runs a content classifier and metadata generator to produce additional metadata from legacy channel-based content. For the authoring tool 102b of FIG. 2B, legacy multi-channel content 220 that does not include any audio objects, but only channel-based audio, is input to an encoding tool and rendering headphone emulation component 226. Object-based content 222 is also input separately to this component. The channel-based legacy content 220 may first be input to an optional audio segmentation pre-processor 224 for division into different content types for individual rendering. In the authoring tool 102b, the binaural rendering component 226 includes a headphone emulation function that generates a binaural audio output 230 along with binaural rendering metadata 228 and spatial media type metadata 232. The audio 230 and the metadata components 228 and 232 together form an encoded audio bitstream with a binaural metadata payload 236. As stated above, the audio component 230 generally comprises channel- and object-based audio that is passed to the bitstream 236 along with the metadata components 228 and 232; it should be noted, however, that the audio component 230 may be standard multi-channel audio, binaurally rendered audio, or a combination of both audio types. When legacy content is input, the output encoded audio bitstream may contain explicitly separated sub-component audio data, or metadata implicitly describing the content types, which allows a receiving endpoint to perform the segmentation and process each sub-component appropriately. The binaural rendering component 226 also includes a binaural metadata input function that directly generates a headphone output 234 for direct connection to headphones. As shown in FIG. 2B, an optional mixer (human or process) 223 may be included to monitor the headphone output 234 and to input and modify the audio data and metadata inputs, which may be provided directly to the rendering process 226.
关于内容类型和内容分类器的操作,音频一般被分类为多个定义的内容类型(诸如对话、音乐、周围环境、特殊效果等)中的一个。对象可以在其整个持续时间内改变内容类型,但是在任何特定的时间点,它一般仅是一种内容类型。在实施例中,内容类型被表达为对象在任何时间点为特定的内容类型的概率。因此,例如,持久的对话对象将被表达为百分之百概率的对话对象,而从对话变换到音乐的对象可以被表达为百分之五十的对话/百分之五十的音乐。处理具有不同内容类型的对象可以通过以下操作来执行:即,对它们的每种内容类型的相应概率进行平均,选择一组对象内的最主导(dominant)的对象、或单个对象随着时间、或内容类型度量的某一其它的逻辑组合的内容类型概率。内容类型也可以被表达为n维向量(其中,n是不同内容类型的总数,例如,在对话/音乐/周围环境/效果的情况下为四)。内容类型元数据可以被体现为组合的内容类型元数据定义,其中,内容类型的组合反映被组合的概率分布(例如,音乐、语音等的概率的向量)。With respect to content types and the operation of content classifiers, audio is generally classified into one of a number of defined content types (such as dialogue, music, ambiance, special effects, etc.). An object can change content types throughout its duration, but at any given point in time, it is generally of only one content type. In embodiments, content type is expressed as the probability of an object being of a particular content type at any point in time. Thus, for example, a persistent dialogue object would be expressed as a dialogue object with a 100% probability, while an object that transitions from dialogue to music could be expressed as 50% dialogue/50% music. Objects with different content types can be processed by averaging their respective probabilities for each content type, selecting the content type probability of the most dominant object within a group of objects, or a single object over time, or some other logical combination of content type metrics. Content type can also be expressed as an n-dimensional vector (where n is the total number of different content types, e.g., four in the case of dialogue/music/ambiance/effects). Content type metadata can be embodied as a combined content type metadata definition, where the combination of content types reflects the probability distribution of the combined content types (e.g., a vector of probabilities of music, speech, etc.).
关于音频的分类,在实施例中,所述处理以每一个时间帧为基础进行操作以分析信号、识别信号的特征、以及将识别的特征与已知类的特征进行比较以便确定对象的特征与特定类的特征匹配的程度。基于特征与特定类匹配的程度,分类器可以识别对象属于特定类的概率。例如,如果在时间t=T,对象的特征与对话特征非常好地匹配,则该对象将被以高的概率分类为对话。如果在时间=T+N,对象的特征与音乐特征非常好地匹配,则该对象将被以高的概率分类为音乐。最后,如果在时间T=T+2N,对象的特征与对话或音乐均不特别好地匹配,则该对象可能被分类为50%音乐和50%对话。因此,在实施例中,基于内容类型概率,音频内容可以被分成与不同内容类型对应的不同的子信号。这例如通过按由计算的媒体类型概率决定(drive)的比例将原始信号的一些百分比发送到每个子信号(以宽带为基础或者以每一个频率子带为基础)来实现。Regarding audio classification, in an embodiment, the process operates on a per-timeframe basis to analyze the signal, identify features of the signal, and compare the identified features with features of known classes to determine the degree to which the features of an object match those of a particular class. Based on the degree to which the features match those of a particular class, the classifier can determine the probability that the object belongs to that particular class. For example, if at time t=T, the features of an object match very well with the features of conversation, then the object will be classified as conversation with a high probability. If at time=T+N, the features of an object match very well with the features of music, then the object will be classified as music with a high probability. Finally, if at time T=T+2N, the features of an object do not match either conversation or music particularly well, then the object may be classified as 50% music and 50% conversation. Thus, in an embodiment, based on the content type probabilities, the audio content can be divided into different sub-signals corresponding to different content types. This is achieved, for example, by sending a percentage of the original signal to each sub-signal (on a broadband basis or per frequency sub-band basis) in proportions determined by the calculated media type probabilities.
参照图1,来自制作工具102的输出被输入到呈现器112以供呈现为用于通过耳机或其它端点设备回放的音频输出。图3A是实施例下的基于对象的耳机呈现系统中使用的呈现组件112a的框图。图3A更详细地示出呈现器112的预处理113、双耳呈现114和后处理115子组件。从制作工具102,元数据和音频以编码音频比特流301的形式被输入到处理或预处理组件。元数据302被输入到元数据处理组件306,并且音频304被输入到可选的音频分段预处理器308。如参照图2A和2B所示的,音频分段可以由制作工具通过预处理器202或224执行。如果这样的音频分段不是由制作工具执行,则呈现器可以通过预处理器308执行该任务。处理的元数据和分段的音频然后被输入到双耳呈现组件310。该组件执行某些耳机特定的呈现功能,诸如3D定位、距离控制、头部大小处理等。双耳呈现的音频然后被输入到音频后处理器314,该音频后处理器314应用某些音频操作,诸如水平(level)管理、均衡、噪声补偿或消除等。后处理的音频然后被输出312以供通过耳机116或118回放。对于耳机或回放设备104安装有用于向呈现器反馈的传感器和/或麦克风的实施例,麦克风和传感器数据316被输入回元数据处理组件306、双耳呈现组件310或音频后处理组件314中的至少一个。对于未安装有传感器的标准耳机,头部跟踪可以用更简单的伪随机产生的头部“抖动(jitter)”代替,该头部“抖动”模仿连续改变的小的头部移动。这允许回放点处的任何相关的环境或操作数据被呈现系统使用以进一步修改音频以抵消或增强某些回放状况。Referring to FIG1 , output from the authoring tool 102 is input to the renderer 112 for rendering as audio output for playback via headphones or other endpoint devices. FIG3A is a block diagram of a rendering component 112a used in an object-based headphone rendering system under an embodiment. FIG3A illustrates in greater detail the pre-processing 113, binaural rendering 114, and post-processing 115 subcomponents of the renderer 112. From the authoring tool 102, metadata and audio are input to the processing or pre-processing components in the form of an encoded audio bitstream 301. Metadata 302 is input to a metadata processing component 306, and audio 304 is input to an optional audio segmentation pre-processor 308. As shown in FIG2A and FIG2B , audio segmentation can be performed by the authoring tool via pre-processors 202 or 224. If such audio segmentation is not performed by the authoring tool, the renderer can perform this task via pre-processor 308. The processed metadata and segmented audio are then input to a binaural rendering component 310. This component performs certain headphone-specific rendering functions, such as 3D positioning, distance control, head size handling, etc. 
The binaurally rendered audio is then input to an audio post-processor 314, which applies certain audio operations, such as level management, equalization, noise compensation or cancellation, etc. The post-processed audio is then output 312 for playback through headphones 116 or 118. For embodiments where the headphones or playback device 104 are equipped with sensors and/or microphones for providing feedback to the renderer, microphone and sensor data 316 is input back to at least one of the metadata processing component 306, the binaural rendering component 310, or the audio post-processing component 314. For standard headphones without sensors, head tracking can be replaced with a simpler pseudo-randomly generated head "jitter" that simulates small, continuously changing head movements. This allows any relevant environmental or operational data at the playback point to be used by the rendering system to further modify the audio to offset or enhance certain playback conditions.
如以上提及的,音频的分段可以由制作工具或呈现器执行。对于音频被预分段(pre-segment)的实施例,呈现器直接处理该音频。图3B是该替代实施例下的基于对象的耳机呈现系统中使用的呈现组件的框图。如对于呈现器112b所示的,来自制作工具的编码音频比特流321以其输入到元数据处理组件326的元数据322和输入到双耳呈现组件330的音频324的组成部分提供。对于图3B的实施例,音频在适当的制作工具中被音频预分段处理202或224预分段。双耳呈现组件330执行某些耳机特定的呈现功能,诸如3D定位、距离控制、头部大小处理等。双耳呈现的音频然后被输入到音频后处理器334,该音频后处理器334应用某些音频操作,诸如水平管理、均衡、噪声补偿或消除等。后处理的音频然后被输出332以供通过耳机116或118回放。对于耳机或回放设备104安装有用于向呈现器反馈的传感器和/或麦克风的实施例,麦克风和传感器数据336被输入回元数据处理组件326、双耳呈现组件330或音频后处理组件334中的至少一个。图2A、2B、3A和3B的制作和呈现系统允许内容制作者使用制作工具102在内容创建时创建特定的双耳呈现元数据并且对该特定的双耳呈现元数据进行编码。这允许音频数据被用于指示呈现器利用不同的算法或利用不同的设置对音频内容的部分进行处理。在实施例中,制作工具102表示允许内容创建者(制作者)选择或创建用于回放的音频内容并且对构成该音频内容的声道和/或对象中的每一个定义某些特性的工作站或计算机应用。制作工具可以包括混合器类型控制台接口或混合控制台的图形用户接口(GUI)表示。图5示出实施例下的可以与耳机呈现系统的实施例一起使用的制作工具GUI。如在GUI显示500中可以看到的,多个不同的特性被允许由制作者设置,诸如增益水平、低频特性、均衡、声像、对象位置和密度、延迟、渐淡(fade)等。对于所示的实施例,通过使用供制作者指定设置值的虚拟滑块来便于用户输入,尽管其它的虚拟化的或直接的输入手段也是可以的,诸如直接文本输入、电位计设置、旋转拨号盘等。由用户输入的参数设置中的至少一些被编码为与相关的声道或音频对象相关联的元数据以供与音频内容一起传输。在实施例中,元数据可以在音频系统中的编解码器(编码器/解码器)电路中被打包为附加的特定的耳机有效载荷的一部分。使用使能设备,对某些操作状况和环境状况(例如,头部跟踪、头部大小感测、房间感测、周围环境状况、噪声水平等)进行编码的实时元数据可以被即时提供给双耳呈现器。双耳呈现器组合制作的元数据内容和实时地本地产生的元数据以为用户提供优化的收听体验。一般地,由制作工具提供的对象控件和用户输入接口允许用户控制某些重要的耳机特定的参数,诸如双耳和立体声旁路(stereo-bypass)动态呈现模式、LFE(低频元素)增益和对象增益、媒体智能和内容相关控制。更具体地,可以使用耳间时间延迟和立体声幅度或强度声像、或者整个双耳呈现的组合(即,耳间时间延迟和水平以及频率相关的频谱线索的组合),在立体声(Lo/Ro)、矩阵化立体声(Lt/Rt)之间以内容类型为基础或以对象为基础来选择呈现模式。另外,频率交叉(cross over)点可以被指定以回到低于给定频率的立体声处理。低频增益也可以被指定以使低频成分或LFE内容衰减。如下面更详细地描述的,低频内容也可以被单独地传输到使能耳机。其它元数据可以以每一内容类型或每一声道/对象为基础指定,诸如一般通过直接/混响增益和频率相关的混响时间以及耳间目标互相关性描述的房间模型。它还可以包括房间的其它更详细的建模(例如,早期反射位置、增益和晚期混响增益)。它还可以包括直接指定的对特定房间响应进行建模的滤波器。其它元数据包括扭曲(warp)到屏幕标志(其控制对象如何被重新映射以适合作为距离的函数的观看角度和屏幕纵横比)。最后,收听者相对(relative)标志(即,是否应用头部跟踪信息)、优选的缩放(指定用于呈现内容的“虚拟房间”的默认大小/纵横比,其用于缩放对象位置以及重新映射到屏幕(根据设备屏幕大小和到设备的距离))、以及控制距离衰减定律(例如,1/(1+rα))的距离模型指数也是可以的。还可以用信号发送可应用于不同声道/对象或者根据内容类型而应用的参数组或“预设”。As mentioned above, audio segmentation can be performed by either the production tool or the renderer. For embodiments in which the audio is pre-segmented, the renderer processes the audio directly. 
FIG3B is a block diagram of the rendering components used in the object-based headphone rendering system under this alternative embodiment. As shown for renderer 112 b, an encoded audio bitstream 321 from the production tool is provided as a component of metadata 322 that is input to a metadata processing component 326 and audio 324 that is input to a binaural rendering component 330. For the embodiment of FIG3B , the audio is pre-segmented in the appropriate production tool by audio pre-segmentation processing 202 or 224. The binaural rendering component 330 performs certain headphone-specific rendering functions, such as 3D positioning, distance control, head size processing, etc. The binaurally rendered audio is then input to an audio post-processor 334, which applies certain audio operations, such as level management, equalization, noise compensation or cancellation, etc. The post-processed audio is then output 332 for playback through headphones 116 or 118. For embodiments in which headphones or playback device 104 are equipped with sensors and/or microphones for providing feedback to a renderer, microphone and sensor data 336 are input back to at least one of metadata processing component 326, binaural rendering component 330, or audio post-processing component 334. The production and rendering system of Figures 2A, 2B, 3A, and 3B allows content producers to use production tool 102 to create specific binaural rendering metadata when creating content and to encode this specific binaural rendering metadata. This allows audio data to be used to instruct the renderer to utilize different algorithms or utilize different settings to process parts of the audio content. In an embodiment, production tool 102 represents a workstation or computer application that allows content creators (producers) to select or create audio content for playback and to define certain characteristics for each of the channels and/or objects that constitute the audio content. 
The production tool can include a graphical user interface (GUI) representation of a mixer-type console interface or a mixing console. Figure 5 illustrates an embodiment of a production tool GUI that can be used together with an embodiment of a headphone rendering system. As can be seen in GUI display 500, a number of different characteristics are allowed to be set by the producer, such as gain level, low-frequency characteristics, equalization, panning, object position and density, delay, fade, etc. For the illustrated embodiment, user input is facilitated by using virtual sliders for the producer to specify setting values, although other virtualized or direct input methods are also possible, such as direct text input, potentiometer settings, rotary dials, etc. At least some of the parameter settings entered by the user are encoded as metadata associated with the relevant channels or audio objects for transmission with the audio content. In embodiments, the metadata can be packaged as part of an additional, specific headphone payload in the codec (encoder/decoder) circuitry within the audio system. Using an enabling device, real-time metadata encoding certain operating and environmental conditions (e.g., head tracking, head size sensing, room sensing, ambient conditions, noise levels, etc.) can be provided to the binaural renderer in real time. The binaural renderer combines the produced metadata content with the real-time locally generated metadata to provide the user with an optimized listening experience. Generally, the object controls and user input interface provided by the authoring tool allow the user to control certain important headphone-specific parameters, such as binaural and stereo-bypass dynamic rendering modes, LFE (low-frequency element) gain and object gain, media intelligence, and content-dependent control. 
More specifically, the rendering mode can be selected based on content type, stereo (Lo/Ro), matrixed stereo (Lt/Rt), or object-based, using a combination of interaural time delay and stereo amplitude or intensity panning, or the entire binaural rendering (i.e., a combination of interaural time delay and level and frequency-dependent spectral cues). In addition, frequency crossover points can be specified to revert to stereo processing below a given frequency. Low-frequency gain can also be specified to attenuate low-frequency components or LFE content. As described in more detail below, low-frequency content can also be transmitted separately to enabled headphones. Other metadata can be specified on a per-content type or per-channel/object basis, such as a room model, typically described by direct/reverberant gain and frequency-dependent reverberation time, as well as interaural target cross-correlations. It may also include other more detailed modeling of the room (e.g., early reflection positions, gains, and late reverberation gains). It may also include directly specified filters that model specific room responses. Other metadata include a warp-to-screen flag (which controls how objects are remapped to fit the viewing angle and screen aspect ratio as a function of distance). Finally, a listener-relative flag (i.e., whether head tracking information is applied), a preferred scaling (specifying a default size/aspect ratio of the "virtual room" used to render the content, which is used to scale object positions and remap to the screen (according to the device screen size and distance to the device)), and a distance model exponent that controls the distance decay law (e.g., 1/(1+r α )) are also possible. Groups of parameters or "presets" that can be applied to different channels/objects or based on the type of content can also be signaled.
如关于制作工具和/或呈现器的预分段组件所示的,不同类型的内容(例如,对话、音乐、效果等)可以基于制作者的意图和最佳的呈现配置而被不同地处理。基于类型或其它显著特性对内容的分割可以在制作期间先验地实现(例如,通过手动地将分割的对话保存在它们自己的一组音轨或对象中),或者在接收设备中呈现之前即时后验地实现。附加的媒体智能工具可以在制作期间被用于根据不同的特性对内容进行分类并且产生可以携载不同的多组呈现元数据的附加声道或对象。例如,在知晓stem(音乐、对话、拟音(Foley)、效果等)和相关联的环绕(例如,5.1)混合之后,可以针对内容创建处理训练媒体分类器以开发识别不同的stem混合比例的模型。相关联的源分割技术可以被利用以使用从媒体分类器得到的加权功能来提取近似的stem。从提取的stem,将被编码为元数据的双耳参数可以在制作期间被应用。在实施例中,在终端用户设备中应用镜像处理,由此使用解码的元数据参数将创建与内容创建期间基本上类似的体验。As shown with respect to the pre-segmentation components of the production tools and/or renderers, different types of content (e.g., dialogue, music, effects, etc.) can be treated differently based on the producer's intent and the optimal presentation configuration. Segmentation of content based on type or other salient characteristics can be achieved a priori during production (e.g., by manually saving segmented dialogue into their own set of tracks or objects), or a posteriori immediately prior to presentation in the receiving device. Additional media intelligence tools can be used during production to classify content according to different characteristics and generate additional channels or objects that can carry different sets of presentation metadata. For example, given the knowledge of the stems (music, dialogue, Foley, effects, etc.) and the associated surround (e.g., 5.1) mix, a media classifier can be trained for the content creation process to develop a model that recognizes different stem mix ratios. Associated source segmentation techniques can be utilized to extract approximate stems using a weighting function derived from the media classifier. From the extracted stems, binaural parameters, which will be encoded as metadata, can be applied during production. In an embodiment, a mirroring process is applied in the end-user device, whereby using the decoded metadata parameters will create a substantially similar experience as during content creation.
在实施例中,对于现有的工作室制作工具的扩展包括双耳监视和元数据记录。在制作时捕捉的典型的元数据包括:对于每个声道和音频对象的声道/对象位置/大小信息、声道/对象增益调整、内容相关元数据(可以基于内容类型而改变)、指示设置(诸如立体声/左/右呈现应当替代双耳而被使用)的旁路标志、交叉点和指示低于交叉点的低音频率必须被旁路和/或衰减的水平、以及描述直接/混响增益和频率相关混响时间或其它特性(诸如早期反射和晚期混响增益)的房间模型信息。其它内容相关元数据可以提供扭曲到屏幕功能,所述扭曲到屏幕功能重新映射图像以适合屏幕纵横比或改变作为距离的函数的观看角度。头部跟踪信息可以被应用于提供收听者相对体验。实现根据衰减定律(例如,1/(1+rα))控制距离的距离模型指数的元数据也可以被使用。这些仅表示可以通过元数据编码的某些特性,并且其它特性也可以被编码。In an embodiment, extensions to existing studio production tools include binaural monitoring and metadata recording. Typical metadata captured during production includes: channel/object position/size information for each channel and audio object, channel/object gain adjustments, content-specific metadata (which can vary based on content type), a bypass flag indicating whether a setting (such as stereo/left/right rendering should be used instead of binaural), a crossover point and a level indicating that bass frequencies below the crossover point must be bypassed and/or attenuated), and room model information describing direct/reverberation gain and frequency-dependent reverberation time or other characteristics (such as early reflection and late reverberation gain). Other content-specific metadata can provide a warp-to-screen function that remaps the image to fit the screen aspect ratio or changes the viewing angle as a function of distance. Head tracking information can be used to provide a listener-relative experience. Metadata implementing a distance model exponent that controls distance according to a decay law (e.g., 1/(1+ rα )) can also be used. These represent only some of the characteristics that can be encoded in metadata, and others may also be encoded.
图4是实施例下的提供双端双耳呈现系统的概览的框图。在实施例中,系统400提供内容相关元数据和影响不同类型的音频内容将如何被呈现的呈现设置。例如,原始音频内容可以包括不同的音频元素,诸如对话、音乐、效果、周围环境声音、瞬态等。这些元素中的每一个可以被以不同的方式最佳地呈现,而不是将它们限制为全部以唯一一种方式呈现。对于系统400的实施例,音频输入401包括多声道信号、基于对象的声道、或声道加对象的混合音频。音频被输入到编码器402,该编码器402添加或修改与音频对象和声道相关联的元数据。如系统400中所示的,音频被输入到耳机监视组件410,该耳机监视组件410应用用户可调整参数化工具来控制耳机处理、均衡、下混以及适合于耳机回放的其它特性。用户优化的参数集(M)然后被编码器402作为元数据或附加元数据嵌入以形成被发送到解码器404的比特流。解码器404对元数据以及基于对象和声道的音频的用于控制耳机处理和下混组件406的参数集M进行解码,该耳机处理和下混组件406生成到耳机的耳机优化和下混(例如,5.1到立体声)的音频输出408。尽管某个内容相关处理已在目前的系统和后处理链中被实现,但是它一般尚未被应用于诸如图4的系统400中所示的双耳呈现。Fig. 4 is a block diagram of an overview of a two-end binaural presentation system provided under an embodiment. In an embodiment, system 400 provides content-related metadata and influences how different types of audio content will be presented. For example, the original audio content can include different audio elements, such as dialogue, music, effects, ambient sounds, transients, etc. Each of these elements can be best presented in different ways, rather than being limited to all being presented in a unique way. For an embodiment of system 400, audio input 401 includes a multi-channel signal, an object-based channel, or a mixed audio of a channel plus an object. The audio is input to encoder 402, which adds or modifies metadata associated with the audio object and the channel. As shown in system 400, the audio is input to headphone monitoring component 410, which uses user-adjustable parameterization tools to control headphone processing, equalization, downmixing, and other characteristics suitable for headphone playback. The user-optimized parameter set (M) is then embedded by encoder 402 as metadata or additional metadata to form a bitstream that is sent to decoder 404. The decoder 404 decodes metadata and parameter sets M of object- and channel-based audio for controlling a headphone processing and downmixing component 406, which generates a headphone-optimized and downmixed (e.g., 5.1 to stereo) audio output 408 for headphones. 
Although some content-dependent processing has been implemented in current systems and post-processing chains, it has generally not been applied to binaural renderings such as shown in the system 400 of FIG. 4 .
如图4所示,某些元数据可以由耳机监视组件410提供,该耳机监视组件410提供控制耳机特定的回放的特定的用户可调整参数化工具。这样的组件可以被配置为为用户提供对针对被动地回放发送的音频内容的旧有耳机118的耳机呈现的一定程度的控制。可替代地,端点设备可以是使能耳机116,其包括传感器和/或一定程度的处理能力以产生元数据或信号数据,该元数据或信号数据可以被编码为兼容的元数据以进一步修改制作的元数据,以针对通过耳机的呈现对音频内容进行优化。因此,在内容的接收端,呈现被即时执行,并且可以考虑本地产生的传感器阵列数据,该传感器阵列数据可以由耳麦(headset)或耳麦附连的实际的移动设备104产生,并且这样的硬件产生的元数据可以进一步与内容创建者在制作时创建的元数据组合以增强双耳呈现体验。As shown in Figure 4, certain metadata can be provided by a headphone monitoring component 410, which provides specific user-adjustable parameterization tools for controlling headphone-specific playback. Such a component can be configured to provide a user with a certain degree of control over the headphone presentation of an old headphone 118 for passively playing back the transmitted audio content. Alternatively, the endpoint device can be an enabling headset 116, which includes sensors and/or a certain degree of processing power to generate metadata or signal data, which can be encoded as compatible metadata to further modify the produced metadata to optimize the audio content for presentation through the headphones. Therefore, at the receiving end of the content, the presentation is performed instantly and can take into account locally generated sensor array data, which can be generated by the headset or the actual mobile device 104 to which the headset is attached, and the metadata generated by such hardware can be further combined with the metadata created by the content creator at the time of production to enhance the binaural presentation experience.
如上所述,在一些实施例中,低频内容可以被单独地传输到允许多于一个立体声输入(通常3或4个音频输入)的使能耳机,或者被编码和调制到携载到具有唯一立体声输入的耳麦的主要立体声波形的较高频中。这将允许进一步的低频处理在耳机中发生(例如,路由到针对低频优化的特定驱动器)。这样的耳机可以包括低频特定的驱动器和/或滤波器加交叉和放大电路以优化低频信号的回放。As described above, in some embodiments, low-frequency content can be transmitted separately to enabled headphones that allow more than one stereo input (typically 3 or 4 audio inputs), or encoded and modulated into the higher frequencies of the primary stereo waveform carried to a headset with a single stereo input. This would allow further low-frequency processing to occur in the headphones (e.g., routing to specific drivers optimized for low frequencies). Such headphones may include low-frequency-specific drivers and/or filter-plus-crossover and amplification circuitry to optimize playback of low-frequency signals.
在实施例中,在回放侧提供从耳机到耳机处理组件的链路以使得能够手动识别耳机以用于耳机的自动耳机预设加载或其它配置。这样的链路可以被实现为例如图4中从耳机到耳机处理406的无线或有线链路。该识别可以被用于配置目标耳机或者将特定的内容或特别呈现的内容发送到特定的一组耳机,如果多个目标耳机正被使用的话。耳机标识符可以被体现为任何适当的字母数字或二进制码,该字母数字或二进制码被呈现处理作为元数据的一部分或者单独的数据处理操作而处理。In an embodiment, a link is provided on the playback side from the headphones to the headphone processing component to enable manual identification of the headphones for automatic headphone preset loading or other configuration of the headphones. Such a link can be implemented, for example, as a wireless or wired link from the headphones to the headphone processing 406 in FIG. 4 . This identification can be used to configure the target headphones or to send specific content or specially rendered content to a specific group of headphones if multiple target headphones are being used. The headphone identifier can be embodied as any suitable alphanumeric or binary code that is processed by the presentation processing as part of the metadata or as a separate data processing operation.
图6示出实施例下的使能耳机,该使能耳机包括感测回放状况以供编码为在耳机呈现系统中使用的元数据的一个或多个传感器。各种传感器可以被布置成可以被用于在呈现时将即时元数据提供给呈现器的传感器阵列。对于图6的示例耳机600,除了其它适当的传感器之外,传感器包括测距传感器(诸如红外IR或飞行时间TOF照相机)602、张力/头部大小传感器604、陀螺仪传感器606、外部麦克风(或对)610、周围环境噪声消除处理器608、内部麦克风(或对)612。如图6所示,传感器阵列可以包括音频传感器(即,麦克风)以及数据传感器(例如,朝向、大小、张力/应力和测距传感器)两者。特别地,为了与耳机一起使用,朝向数据可以被用于根据收听者的头部运动“锁定”或旋转空间音频对象,张力传感器或外部麦克风可以被用于推断收听者的头部的大小(例如,通过监视位于耳杯上的两个外部麦克风处的音频互相关)并且调整相关的双耳呈现参数(例如,耳间时间延迟、肩部反射定时等)。测距传感器602可以被用于在移动A/V回放的情况下评估到显示器的距离并且校正屏上对象的位置以考虑距离相关的观看角度(即,随着屏幕越接近收听者,越宽地呈现对象)或者调整全局增益和房间模型以传递适当的距离呈现。如果音频内容是在范围可以从小型移动电话(例如,2-4”屏幕大小)到平板(例如,7-10”屏幕大小)、再到膝上型计算机(例如,15-17”屏幕大小)的设备上回放的A/V内容的一部分,则这样的传感器功能是有用的。另外,传感器也可以被用于自动地检测和设置左和右音频输出到正确的换能器的路由,而不需要耳机上的特定的先验朝向或明确的“左/右”标记。FIG6 illustrates an embodiment of an enabled headset that includes one or more sensors for sensing playback conditions for encoding into metadata for use in a headphone rendering system. The various sensors can be arranged into a sensor array that can be used to provide real-time metadata to a renderer during rendering. For the example headset 600 of FIG6 , the sensors include, among other appropriate sensors, a range sensor (such as an infrared (IR) or time-of-flight (TOF) camera) 602, a strain/head size sensor 604, a gyroscope sensor 606, an external microphone (or pairs) 610, an ambient noise cancellation processor 608, and an internal microphone (or pairs) 612. As shown in FIG6 , the sensor array can include both audio sensors (i.e., microphones) and data sensors (e.g., orientation, size, strain/stress, and range sensors). Specifically, for use with headphones, orientation data can be used to “lock” or rotate spatial audio objects based on the listener’s head movement, and strain sensors or external microphones can be used to infer the listener’s head size (e.g., by monitoring audio cross-correlation at two external microphones located on the ear cups) and adjust relevant binaural rendering parameters (e.g., interaural time delay, shoulder reflex timing, etc.). 
The distance sensor 602 can be used to assess the distance to the display in the case of mobile A/V playback and correct the position of on-screen objects to account for distance-dependent viewing angles (i.e., objects are rendered wider as the screen gets closer to the listener) or to adjust global gain and room model to deliver appropriate distance rendering. Such sensor functionality is useful if the audio content is part of A/V content that can be played back on devices ranging from small mobile phones (e.g., 2-4" screen size) to tablets (e.g., 7-10" screen size) to laptop computers (e.g., 15-17" screen size). Additionally, the sensor can also be used to automatically detect and route the left and right audio outputs to the correct transducers without requiring a specific a priori orientation or explicit "left/right" markings on the headphones.
如图1所示,发送到耳机116或118的音频或A/V内容可以通过手持或便携式设备104提供。在实施例中,设备104本身可以包括一个或多个传感器。例如,如果设备是手持游戏控制台或游戏控制器,则某些陀螺仪传感器和加速度计可以被提供以跟踪对象移动和位置。对于该实施例,耳麦连接的设备104也可以提供附加的传感器数据(诸如朝向、头部大小、照相机等)作为设备元数据。As shown in Figure 1, the audio or A/V content sent to the headset 116 or 118 can be provided by a handheld or portable device 104. In an embodiment, the device 104 itself may include one or more sensors. For example, if the device is a handheld game console or game controller, certain gyroscope sensors and accelerometers may be provided to track object movement and position. For this embodiment, the device 104 to which the headset is connected may also provide additional sensor data (such as orientation, head size, camera, etc.) as device metadata.
对于该实施例,实现某些耳机到设备通信手段。例如,耳麦可以通过有线或无线数字链路或模拟音频链路(麦克风输入)连接到设备,在这种情况下,元数据将被频率调制和添加到模拟麦克风输入。图7示出实施例下的耳机和包括耳机传感器处理器702的设备104之间的连接。如系统700中所示的,耳机600通过有线或无线链路将某些传感器、音频和麦克风数据701发送到耳机传感器处理器702。来自处理器702的处理的数据可以包括具有元数据的模拟音频704或空间音频输出706。如图7所示,每个连接包括耳机、处理器和输出之间的双向链路。这允许传感器和麦克风数据在耳机和设备之间发送以供创建或修改适当的元数据。除了硬件产生的元数据之外,还可以提供用户控制以补充或产生适当的元数据,如果元数据不可通过硬件传感器阵列获得的话。示例用户控制可以包括:提升(elevation)强调、双耳开/关切换、优选声音半径或大小、以及其它类似的特性。这样的用户控制可以通过与耳机处理器组件、回放设备和/或耳机相关联的硬件或软件接口元件来提供。For this embodiment, certain headset-to-device communication methods are implemented. For example, the headset can be connected to the device via a wired or wireless digital link or an analog audio link (microphone input), in which case metadata will be frequency modulated and added to the analog microphone input. Figure 7 shows the connection between the headset and the device 104, including the headset sensor processor 702, under an embodiment. As shown in system 700, the headset 600 sends certain sensor, audio, and microphone data 701 to the headset sensor processor 702 via a wired or wireless link. The processed data from the processor 702 may include analog audio 704 with metadata or spatial audio output 706. As shown in Figure 7, each connection includes a bidirectional link between the headset, the processor, and the output. This allows sensor and microphone data to be sent between the headset and the device for creation or modification of appropriate metadata. In addition to the metadata generated by the hardware, user controls can also be provided to supplement or generate appropriate metadata if metadata is not available through the hardware sensor array. Example user controls may include: elevation emphasis, binaural on/off switching, preferred sound radius or size, and other similar features. Such user control may be provided through hardware or software interface elements associated with the headset processor assembly, playback device, and/or headset.
图8是示出实施例下的可以在耳机呈现系统中使用的不同的元数据成分的框图。如图800所示,被耳机处理器806处理的元数据包括制作的元数据(诸如由制作工具102和混合控制台500生成的元数据)和硬件产生的元数据804。硬件产生的元数据804可以包括用户输入的元数据、由设备808提供的或者从从设备808发送的数据产生的设备侧元数据、和/或由耳机810提供的或者从从耳机810发送的数据产生的耳机侧元数据。8 is a block diagram illustrating various metadata components that may be used in a headphone rendering system under an embodiment. As shown in diagram 800, the metadata processed by the headphone processor 806 includes authored metadata (such as metadata generated by the authoring tool 102 and the mixing console 500) and hardware-generated metadata 804. The hardware-generated metadata 804 may include user-entered metadata, device-side metadata provided by the device 808 or generated from data sent from the device 808, and/or headphone-side metadata provided by the headphone 810 or generated from data sent from the headphone 810.
在实施例中,制作的802和/或硬件产生的804元数据在呈现器112的双耳呈现组件114中被处理。元数据提供对特定的音频声道和/或对象的控制以优化通过耳机116或118的回放。图9示出实施例下的用于耳机处理的双耳呈现组件的功能组件。如系统900中所示的,解码器902输出多声道信号或声道加对象音轨,连同用于控制由耳机处理器904执行的耳机处理的解码的参数集M。耳机处理器904还从基于照相机的或基于传感器的跟踪设备910接收某些空间参数更新906。跟踪设备910是测量与用户的头部相关联的某些角度和位置参数(r、θ、Ф)的面部跟踪或头部跟踪设备。空间参数可以对应于距离和某些朝向角度,诸如偏航(yaw)、倾斜和滚动(roll)。原始的空间参数集x可以随着传感器数据910被处理而更新。这些空间参数更新Y然后被传递到耳机处理器904以供进一步修改参数集M。处理的音频数据然后被发送到后处理级908,该后处理级908执行某些音频处理,诸如音色校正、滤波、下混和其它相关处理。音频然后被均衡器912均衡,并且被发送到耳机。在实施例中,如在下面的描述中更详细地描述的,均衡器912可以使用或不使用压力分配比(PDR)变换来执行均衡。In an embodiment, the produced 802 and/or hardware-generated 804 metadata is processed in the binaural rendering component 114 of the renderer 112. The metadata provides control over specific audio channels and/or objects to optimize playback through headphones 116 or 118. FIG9 illustrates the functional components of a binaural rendering component for headphone processing, under an embodiment. As shown in system 900, a decoder 902 outputs a multi-channel signal or a channel-plus-object track, along with a decoded parameter set M for controlling the headphone processing performed by a headphone processor 904. The headphone processor 904 also receives certain spatial parameter updates 906 from a camera-based or sensor-based tracking device 910. The tracking device 910 is a face tracking or head tracking device that measures certain angular and positional parameters (r, θ, φ) associated with the user's head. The spatial parameters may correspond to distance and certain heading angles, such as yaw, pitch, and roll. The original set of spatial parameters x may be updated as the sensor data 910 is processed. These spatial parameter updates Y are then passed to the headphone processor 904 for further modification of the parameter set M. The processed audio data is then sent to the post-processing stage 908, which performs certain audio processing, such as timbre correction, filtering, downmixing, and other related processing. 
The audio is then equalized by the equalizer 912 and sent to the headphones. In an embodiment, as described in more detail in the description below, the equalizer 912 may or may not use a pressure distribution ratio (PDR) transform to perform equalization.
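The way head-tracking updates could feed the spatial parameters can be sketched as follows. This is an illustrative assumption rather than the patent's specified algorithm: the function name `apply_head_tracking` and the ZYX (yaw-pitch-roll) rotation convention are hypothetical. The idea is that rotating each virtual source position by the inverse of the sensed head orientation keeps the source stable in the room as the listener's head turns.

```python
import numpy as np

def apply_head_tracking(obj_pos, yaw, pitch, roll):
    """Rotate an object position into the listener's head frame.

    obj_pos: (x, y, z) in room coordinates; yaw/pitch/roll in radians
    from the head tracker. Applying the inverse head rotation keeps
    virtual sources fixed in the room while the head moves.
    """
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    # ZYX (yaw-pitch-roll) rotation matrix of the head
    rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    head_rot = rz @ ry @ rx
    # inverse rotation equals the transpose for an orthonormal matrix
    return head_rot.T @ np.asarray(obj_pos, dtype=float)
```

A source directly ahead of the room (positive x) moves to the listener's right side when the head turns 90 degrees to the left, which is the behavior a world-stable virtualizer needs.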
Figure 10 illustrates a binaural rendering system for rendering audio objects in the headphone rendering system, under an embodiment. Figure 10 shows some of the signal components as they are processed through the binaural headphone processor. As shown in diagram 1000, the object audio component is input to an unmixer 1002, which separates the direct and diffuse components of the audio (e.g., separating the direct from the reverberant path). The direct component is input to a downmix component 1006, which downmixes surround channels (e.g., 5.1 surround) to stereo using phase-shift information. The direct component is also input to a direct-content binaural renderer 1008. The two two-channel components are then input to a dynamic timbre equalizer 1012. For object-based audio input, object positions and user control signals are input to a virtualizer steerer component 1004. This produces scaled object positions, which are input to the binaural renderer 1008 along with the direct component. The diffuse component of the audio is input to a separate binaural renderer 1010 and combined with the rendered direct content by a summing circuit before being output as two-channel output audio.
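The unmixer's direct/diffuse separation is not specified at this level of detail in the text. As a rough illustration of the underlying idea, separating correlated (direct) from uncorrelated (diffuse) content, the sketch below weights the mid signal of a stereo pair by the per-frame inter-channel correlation; all names and the frame-wise (rather than time-frequency-tile) processing are simplifying assumptions of this sketch.

```python
import numpy as np

def split_direct_diffuse(left, right, frame=1024, eps=1e-12):
    """Rough per-frame direct/diffuse split of a stereo pair.

    The per-frame inter-channel correlation coefficient weights the
    mid signal: strongly correlated frames are treated as direct,
    uncorrelated frames as diffuse. Real unmixers typically work per
    time-frequency tile; this operates per time frame for brevity.
    """
    n = (min(len(left), len(right)) // frame) * frame
    direct = np.zeros(n)
    diffuse = np.zeros(n)
    for s in range(0, n, frame):
        l, r = left[s:s + frame], right[s:s + frame]
        rho = np.dot(l, r) / (np.sqrt(np.dot(l, l) * np.dot(r, r)) + eps)
        rho = max(0.0, rho)          # treat anti-phase content as diffuse
        mid = 0.5 * (l + r)
        direct[s:s + frame] = rho * mid
        diffuse[s:s + frame] = mid - direct[s:s + frame]
    return direct, diffuse
```

With identical left and right channels the correlation is 1, so the signal is classified entirely as direct, which matches the intuition that a dry center-panned source carries no diffuse energy.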
Figure 11 shows a more detailed representation of the binaural rendering system of Figure 10, under an embodiment. As shown in diagram 1100 of Figure 11, multi-channel and object-based audio is input to an unmixer 1102 for separation into direct and diffuse components. Direct content is processed by a direct binaural renderer 1118, while diffuse content is processed by a diffuse binaural renderer 1120. After downmixing 1116 and timbre equalization 1124 of the direct content, the diffuse and direct audio components are then combined by a summing circuit for post-processing, such as by a headphone equalizer 1122 and other possible circuits. As shown in Figure 11, certain user input and feedback data are used to modify the binaural rendering of the diffuse content in the diffuse binaural renderer 1120. For an embodiment of system 1100, playback environment sensors 1106 provide data regarding listening room properties and a noise estimate (ambient sound level), head/face tracking sensors 1108 provide head position, orientation, and size data, device tracking sensors 1110 provide device position data, and user input 1112 provides playback radius data. This data may be provided by sensors located in the headphones 116 and/or the device 104.
The various sensor data and user input data are combined with the content metadata, which provides object position and room parameter information, in the virtualizer steerer component 1104. This component also receives direct and diffuse energy information from the unmixer 1102. The virtualizer steerer 1104 outputs data including object positions, head position/orientation/size, room parameters, and other relevant information to the diffuse-content binaural renderer 1120. In this way, the diffuse content of the input audio is adjusted to accommodate the sensor and user input data.
Although the best performance of the virtualizer steerer is achieved when sensor data, user input data, and content metadata are all received, beneficial performance can be achieved even in the absence of one or more of these inputs. For example, when processing legacy content (e.g., an encoded bitstream that does not contain binaural rendering metadata) for playback through traditional headphones (e.g., headphones that do not include the various sensors, microphones, etc.), beneficial results can still be obtained by providing the direct and diffuse energy outputs of the unmixer 1102 to the virtualizer steerer 1104 to generate control information for the diffuse-content binaural renderer 1120, even in the absence of one or more of the other inputs to the virtualizer steerer.
In an embodiment, the rendering system 1100 of Figure 11 allows the binaural headphone renderer to efficiently provide personalization based on interaural time difference (ITD), interaural level difference (ILD), and sensed head size. ILD and ITD are important cues for azimuth, the angle of an audio signal relative to the head when the signal is generated in the horizontal plane. The ITD is defined as the difference in a sound's arrival time between the two ears, and the ILD effect uses the difference in sound level entering the ears to provide a localization cue. It is generally accepted that the ITD is used to localize low-frequency sounds and the ILD is used to localize high-frequency sounds, while both are used for content containing both high and low frequencies.
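As one concrete example of how a sensed head size could drive ITD personalization, Woodworth's classic spherical-head approximation gives the ITD as a function of source azimuth and head radius. The formula is from the general psychoacoustics literature, not quoted from the patent text, and is shown here only to illustrate the head-size dependence:

```python
import numpy as np

def woodworth_itd(azimuth_rad, head_radius=0.0875, c=343.0):
    """Woodworth's spherical-head ITD approximation.

    ITD = (r / c) * (sin(theta) + theta) for a source at azimuth theta
    (radians, front horizontal plane, 0 <= theta <= pi/2); head_radius
    defaults to a nominal ~8.75 cm and c is the speed of sound in m/s.
    A larger measured head radius yields a proportionally larger ITD,
    which is the personalization lever described above.
    """
    return (head_radius / c) * (np.sin(azimuth_rad) + azimuth_rad)
```

For a source at 90 degrees azimuth this yields roughly 0.66 ms for an average head, consistent with the commonly cited maximum ITD of about 0.6–0.7 ms.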
The rendering system 1100 also allows source distance control and room models to be accommodated. It further allows direct extraction and processing of diffuse/reverberant (dry/wet) content, optimization of room reflections, and timbre matching.
HRTF Model
In spatial audio reproduction, certain sound source cues are virtualized. For example, sounds intended to be heard from behind the listener can be produced by speakers physically located behind them, and as such, all listeners perceive these sounds as coming from behind. With virtual spatial rendering through headphones, on the other hand, the perception of audio from behind is governed by the head-related transfer functions (HRTFs) used to generate the binaural signal. In an embodiment, the metadata-based headphone processing system 100 may include certain HRTF modeling mechanisms. The basis of such a system is generally built on a structural model of the head and torso. This approach allows algorithms to be built on a core model in a modular fashion; in this approach, the modular algorithms are referred to as "tools." In addition to providing ITD and ILD cues, the model approach provides a reference point relative to the position of the ears on the head (and, more broadly, for the tools built on the model). The system can be tuned or modified according to the anthropometric characteristics of the user. A further benefit of the modular approach is that certain features can be emphasized in order to amplify specific spatial cues. For example, certain cues can be exaggerated beyond those that acoustic binaural filters would give an individual.
Figure 12 is a system diagram illustrating the different tools used in an HRTF modeling system for use in a headphone rendering system, under an embodiment. As shown in Figure 12, after at least some of the input components are filtered 1202, certain inputs (including azimuth, elevation, fs, and range) are input to a modeling stage 1204. In an embodiment, the filter stage 1202 may include a snowman filter model, which consists of a spherical head on top of a spherical body and accounts for the contributions of the torso as well as the head to the HRTF. The modeling stage 1204 computes a pinna model and a torso model, and the left and right (l, r) components are post-processed 1206 for final output 1208.
Metadata Structure
As described above, the audio content processed by the headphone playback system includes channels, objects, and associated metadata that provides the spatial and processing cues necessary to optimize rendering of the audio through headphones. Such metadata can be generated as authored metadata from the authoring tools and as hardware-generated metadata from one or more endpoint devices. Figure 13 illustrates a data structure that enables metadata delivery for the headphone rendering system, under an embodiment. In an embodiment, the metadata structure of Figure 13 is configured to supplement metadata delivered in other portions of the bitstream, which may be packaged according to known channel-based audio formats (such as the Dolby Digital AC-3 or Enhanced AC-3 bitstream syntax). As shown in Figure 13, the data structure consists of a container 1300 that holds one or more data payloads 1304. Each payload is identified within the container by a unique payload identifier value that provides an unambiguous indication of the type of data present in the payload. The order of payloads within the container is undefined. Payloads may be stored in any order, and a parser must be able to parse the entire container to extract the relevant payloads and to ignore payloads that are either irrelevant or unsupported.
Protection data 1306, which can be used by a decoding device to verify that the container and the payload within the container are error-free, follows the last payload in the container. An initial portion 1302 containing synchronization, version, and key-ID information precedes the first payload in the container.
The data structure supports extensibility through the use of versioning and identifiers for specific payload types. Metadata payloads may be used to describe the nature or configuration of an audio program delivered in an AC-3 or Enhanced AC-3 (or other type of) bitstream, or may be used to control audio processing algorithms designed to further process the output of the decoding process.
Containers can be defined using different programming structures, based on implementation preference. The following table shows an example syntax for a container under an embodiment.
The following table shows an example of a possible syntax for the variable bits used in the example container syntax provided above.
The following table shows an example of a possible syntax for the payload configuration of the example container syntax provided above.
The above syntax definitions are provided as example implementations and are not intended to be limiting, as many other program structures may be used. In an embodiment, a method referred to as variable bits is used to encode many of the fields within the payload data and container structure. This method enables efficient encoding of small field values while remaining extensible enough to express arbitrarily large field values. When variable_bits coding is used, a field consists of one or more groups of n bits, with each group followed by a 1-bit read_more field. At a minimum, an n-bit encoding requires n+1 bits to be transmitted. All fields coded using variable_bits are interpreted as unsigned integers. Various other coding aspects may be implemented according to practices and methods known to those of ordinary skill in the art. The tables above and Figure 13 show example metadata structures, formats, and program content. It should be noted that these are intended to represent one example embodiment of the metadata representation, and other metadata definitions and content are also possible.
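The framing rule just described (groups of n bits, each followed by a 1-bit read_more flag, minimum n+1 bits per field) can be sketched as follows. The excerpt does not give the exact rule for accumulating a value across groups, so this sketch assumes the simple convention of concatenating the n-bit groups most-significant first; the actual bitstream syntax may differ.

```python
def encode_variable_bits(value, n):
    """Encode an unsigned integer as groups of n bits, each group
    followed by a 1-bit read_more flag (1 = another group follows)."""
    assert value >= 0
    groups = []
    while True:
        groups.append(value & ((1 << n) - 1))   # lowest n bits
        value >>= n
        if value == 0:
            break
    bits = []
    for i, g in enumerate(reversed(groups)):    # most-significant group first
        bits.extend((g >> (n - 1 - j)) & 1 for j in range(n))
        bits.append(1 if i < len(groups) - 1 else 0)   # read_more flag
    return bits

def decode_variable_bits(bits, n):
    """Decode one variable_bits field; returns (value, bits_consumed)."""
    value, pos = 0, 0
    while True:
        group = 0
        for j in range(n):
            group = (group << 1) | bits[pos + j]
        value = (value << n) | group
        read_more = bits[pos + n]
        pos += n + 1
        if not read_more:
            return value, pos
```

Small values fit in a single n+1-bit group, while arbitrarily large values simply take more groups, which is the efficiency/extensibility trade-off the text describes.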
Headphone EQ and Correction
As shown in Figure 1, certain post-processing functions 115 may be performed by the renderer 112. As shown by element 912 of Figure 9, one such post-processing function comprises headphone equalization. In an embodiment, equalization may be performed by obtaining blocked-ear-canal impulse response measurements for different headphone placements on each ear. Figure 14 shows an example case of three impulse response measurements for each ear in an embodiment of the headphone equalization process. The equalization post-processing computes the fast Fourier transform (FFT) of each response and performs an RMS (root mean square) average of the resulting responses. The responses may be variable, octave-smoothed, ERB-smoothed, and so on. The processing then computes the inverse of the RMS average |F(ω)|, subject to constraints (+/- x dB) limiting the inverse magnitude response at mid and high frequencies. The processing then determines a time-domain filter.
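The equalization steps just described (FFT each measured response, RMS-average the magnitudes, invert under a +/- x dB constraint, derive a time-domain filter) might be sketched as below. The linear-phase inverse-FFT construction and the omission of smoothing are simplifying assumptions of this sketch, not details taken from the text.

```python
import numpy as np

def headphone_eq_filter(impulse_responses, nfft=1024, limit_db=12.0):
    """Sketch of the headphone EQ flow: FFT each measured impulse
    response, RMS-average the magnitude responses, invert, clamp the
    inverse to +/- limit_db, and return a linear-phase time-domain
    filter via the inverse FFT.
    """
    # magnitude response of each measurement
    mags = np.abs([np.fft.rfft(ir, nfft) for ir in impulse_responses])
    rms_avg = np.sqrt(np.mean(mags ** 2, axis=0))
    inv = 1.0 / np.maximum(rms_avg, 1e-9)       # guard against divide-by-zero
    lim = 10 ** (limit_db / 20.0)
    inv = np.clip(inv, 1.0 / lim, lim)          # constrain to +/- limit_db
    # zero-phase spectrum -> symmetric (linear-phase) FIR, centered
    fir = np.fft.irfft(inv, nfft)
    return np.roll(fir, nfft // 2)
```

A practical implementation would apply octave or ERB smoothing to `rms_avg` before inversion, and would typically restrict the +/- limit_db clamp to the mid/high frequency range rather than the whole spectrum.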
The post-processing may also include a closed-to-open transform function. This pressure distribution ratio (PDR) method involves designing, for closed-back headphones, a transform that matches the acoustic impedance between the eardrum and the free field, through a modification of how the measurements for free-field sound transmission are obtained according to the direction of the earliest-arriving sound. This indirectly makes it possible to match the eardrum pressure signals between closed-back headphone and free-field equivalent conditions without requiring complicated eardrum measurements.
Figure 15A shows a circuit for computing free-field sound transmission under an embodiment. Circuit 1500 is based on a free-field acoustic impedance model. In this model, P1(ω) is the Thevenin pressure measured at the entrance of the blocked ear canal using a loudspeaker placed at θ degrees about the median plane (e.g., approximately 30 degrees to the left and front of the listener), the measurement involving extraction of the direct sound from the measured impulse response. The measurement of P2(ω) may be made at the entrance of the ear canal, or at some distance (x mm) inside the ear canal from the opening (or at the eardrum), with the same loudspeaker in the same placement used to measure P1(ω) (that measurement likewise involving extraction of the direct sound from the measured impulse response).
For this model, the ratio P2(ω)/P1(ω) is computed as follows:
Figure 15B shows a circuit for computing headphone sound transmission under an embodiment. Circuit 1510 is based on a headphone acoustic impedance simulation model. In this model, P4(ω) is measured at the entrance of the blocked ear canal using a steady-state (RMS-averaged) headphone measurement, and the measurement of P5(ω) is made at the entrance to the ear canal, or at some distance inside the ear canal from the opening (or at the eardrum), with the same headphone placement used to measure P4(ω).
For this model, the ratio P5(ω)/P4(ω) is computed as follows:
The pressure distribution ratio (PDR) can then be computed using the following formula:
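The three formulas referenced above did not survive extraction. Based on the two transfer ratios defined in the surrounding text, a conventional way to combine them into a pressure ratio, stated here as an assumption consistent with the free-field-to-headphone matching literature rather than quoted from the patent, would be:

```latex
\mathrm{PDR}(\omega) \;=\; \frac{P_2(\omega)\,/\,P_1(\omega)}{P_5(\omega)\,/\,P_4(\omega)}
```

That is, the free-field transmission ratio of Figure 15A divided by the headphone transmission ratio of Figure 15B, so that equalizing by the PDR maps the closed-back headphone condition onto the free-field equivalent condition.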
Aspects of the methods and systems described herein may be implemented in an appropriate computer-based sound processing network environment for processing digital or digitized audio files. Portions of the adaptive audio system may include one or more networks comprising any desired number of individual machines, including one or more routers (not shown) that serve to buffer and route the data transmitted among the computers. Such a network may be built on various different network protocols, and may be the Internet, a wide area network (WAN), a local area network (LAN), or any combination thereof. In an embodiment in which the network comprises the Internet, one or more machines may be configured to access the Internet through web browser programs.
One or more of the components, blocks, processes, or other functional components may be implemented through a computer program that controls execution of a processor-based computing device of the system. It should also be noted that the various functions disclosed herein may be described, in terms of their behavioral, register-transfer, logic-component, and/or other characteristics, using any number of combinations of hardware and firmware, and/or as data and/or instructions embodied in various machine-readable or computer-readable media. Computer-readable media in which such formatted data and/or instructions may be embodied include, but are not limited to, physical (non-transitory), non-volatile storage media in various forms, such as optical, magnetic, or semiconductor storage media.
Unless the context clearly requires otherwise, throughout the description and the claims, the words "comprise," "comprising," and the like are to be construed in an inclusive sense, as opposed to an exclusive or exhaustive sense; that is to say, in the sense of "including, but not limited to." Words using the singular or plural number also include the plural or singular number, respectively. Additionally, the words "herein," "hereunder," "above," "below," and words of similar import refer to this application as a whole and not to any particular portions of this application. When the word "or" is used in reference to a list of two or more items, that word covers all of the following interpretations of the word: any of the items in the list, all of the items in the list, and any combination of the items in the list.
While one or more implementations have been described by way of example and in terms of specific embodiments, it is to be understood that the one or more implementations are not limited to the disclosed embodiments. To the contrary, they are intended to cover various modifications and similar arrangements as would be apparent to those skilled in the art. Therefore, the scope of the appended claims should be accorded the broadest interpretation so as to encompass all such modifications and similar arrangements.
Claims (12)
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US61/898,365 | 2013-10-31 | | |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1262874A1 (en) | 2020-01-24 |
| HK1262874B (en) | 2021-11-19 |