HK1248910B - System and method for capturing, encoding, distributing, and decoding immersive audio

Info

Publication number: HK1248910B
Application number: HK18108150.9A
Authority: HK
Inventors: M‧M‧古德文; J-M‧卓特; M‧沃尔什
Original assignee: Dts公司
Priority date: 2015-01-30
Filing date: 2016-01-29
Publication date: 2022-03-11

Description

Systems and methods for capturing, encoding, distributing, and decoding immersive audio

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Patent Application No. 62/110,211, filed January 30, 2015, entitled “System and Method for Capturing and Encoding a 3-D Audio Soundfield,” both of which are incorporated herein by reference in their entireties.

Background Art

As dedicated recording devices become more portable and more affordable, and as recording capabilities become more prevalent in everyday devices (such as smartphones), the capture of audio content (often in combination with video) has become increasingly common. The quality of video capture has improved steadily and has outpaced the quality of audio capture. Video capture on modern mobile devices is typically high-resolution and DSP-intensive, but the accompanying audio content is generally captured in mono, with low fidelity and little additional processing.

To capture spatial cues, many existing audio recording techniques employ at least two microphones. As a general rule, recording a 360-degree horizontal surround audio scene requires at least three audio channels, while recording a three-dimensional audio scene requires at least four audio channels. Although multi-channel audio capture is used for immersive audio recording, the more widespread consumer audio delivery technologies and distribution frameworks currently available are limited to transmitting two-channel audio. In standard two-channel stereo reproduction, the stored or transmitted left and right audio channels are intended to be played back directly on left and right loudspeakers or headphones, respectively.

To play back an immersive audio recording, it may be necessary to render the recorded spatial audio information in a variety of playback configurations. These playback configurations include headphones, front sound-bar loudspeakers, front discrete loudspeaker pairs, 5.1 horizontal surround loudspeaker arrays, and three-dimensional loudspeaker arrays that include height channels. Regardless of the playback configuration, it is desirable to reproduce for the listener a spatial audio scene that is a substantially accurate representation of the captured audio scene. In addition, it is advantageous to provide an audio storage or transmission format that is agnostic to the particular playback configuration.

One such configuration-agnostic format is the B-format. The B-format includes the following signals: (1) W: a pressure signal corresponding to the output of an omnidirectional microphone; (2) X: front-to-back directional information corresponding to the output of a forward-pointing "figure-of-eight" microphone; (3) Y: side-to-side directional information corresponding to the output of a left-pointing "figure-of-eight" microphone; and (4) Z: up-to-down directional information corresponding to the output of an upward-pointing "figure-of-eight" microphone.
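The four B-format signals described above can be illustrated by panning a single mono sample to a given direction. This is only a minimal sketch: the function name `encode_b_format`, the angle convention (azimuth measured counter-clockwise from the front, so that +90 degrees is to the left), and the 1/sqrt(2) gain on W are assumptions chosen for illustration and are not specified in the text.

```python
import math


def encode_b_format(sample, azimuth_deg, elevation_deg=0.0):
    """Pan one mono sample to first-order B-format (W, X, Y, Z).

    Assumed conventions (not taken from the text): azimuth is measured
    counter-clockwise from the front (+90 degrees = left), elevation is
    positive upward, and W carries the traditional 1/sqrt(2) gain.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample / math.sqrt(2.0)                 # omnidirectional pressure
    x = sample * math.cos(az) * math.cos(el)    # forward figure-of-eight
    y = sample * math.sin(az) * math.cos(el)    # leftward figure-of-eight
    z = sample * math.sin(el)                   # upward figure-of-eight
    return w, x, y, z
```

Under these assumptions, a source directly in front contributes only to W and X, while a source directly to the left contributes only to W and Y.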

B-format audio signals can be spatially decoded for immersive audio playback on headphones or on flexible loudspeaker configurations. B-format signals can be obtained directly, or derived from standard near-coincident microphone arrangements that include omnidirectional, bidirectional, or unidirectional microphones. In particular, the 4-channel A-format is obtained from a tetrahedral arrangement of cardioid microphones and can be converted to the B-format via a 4×4 linear matrix. In addition, the 4-channel B-format can be converted to a two-channel Ambisonic UHJ format that is compatible with standard 2-channel stereo reproduction. However, the two-channel Ambisonic UHJ format is not sufficient to enable faithful three-dimensional immersive audio or horizontal surround reproduction.
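The 4×4 A-format to B-format conversion mentioned above can be sketched as a fixed sum-and-difference matrix. The capsule naming (front-left-up, front-right-down, back-left-down, back-right-up) and the 0.5 scaling are assumptions for illustration; practical converters also apply frequency-dependent equalization, which this sketch omits.

```python
# Rows give W, X, Y, Z; columns are the assumed capsule order
# FLU, FRD, BLD, BRU (front-left-up, front-right-down, ...).
A_TO_B_MATRIX = [
    [0.5,  0.5,  0.5,  0.5],   # W: sum of all capsules
    [0.5,  0.5, -0.5, -0.5],   # X: front minus back
    [0.5, -0.5,  0.5, -0.5],   # Y: left minus right
    [0.5, -0.5, -0.5,  0.5],   # Z: up minus down
]


def a_to_b_format(flu, frd, bld, bru):
    """Convert one frame of A-format capsule samples to (W, X, Y, Z)."""
    frame = (flu, frd, bld, bru)
    return tuple(sum(c * s for c, s in zip(row, frame)) for row in A_TO_B_MATRIX)
```

For instance, equal signals on the two front capsules and silence on the two back capsules yield energy in W and X only, as expected for a frontal sound.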

Other methods have been proposed for encoding multiple audio channels representing a surround or immersive sound scene into a reduced-data format for storage and/or distribution, which can subsequently be decoded to enable faithful reproduction of the original audio scene. One such method is time-domain phase-amplitude matrix encoding/decoding. The encoder in this method linearly combines the input channels, with specific amplitude and phase relationships, into a smaller set of encoded channels. The decoder combines the encoded channels with specific amplitudes and phases in an attempt to recover the original channels. However, because of the reduced intermediate channel count, the spatial localization fidelity of the reproduced audio scene may be degraded compared to the original audio scene.

One method for improving the spatial localization fidelity of the reproduced audio scene is frequency-domain phase-amplitude matrix decoding, which decomposes a matrix-encoded two-channel audio signal into a time-frequency representation. The method then spatializes each time-frequency component separately. The time-frequency decomposition provides a high-resolution representation of the input audio signal in which individual sources are represented more discretely than in the time domain. As a result, this method can improve the spatial fidelity of the subsequently decoded signal when compared to time-domain matrix decoding.

Another approach to data reduction for multi-channel audio representations is spatial audio coding. In this approach, the input channels are combined into a reduced-channel format (possibly even mono), and some side information about the spatial characteristics of the audio scene is also included. The parameters in the side information can be used to spatially decode the reduced-channel format into a multi-channel signal that faithfully approximates the original audio scene.

The phase-amplitude matrix coding and spatial audio coding methods described above are often concerned with encoding multi-channel soundtracks created in recording studios. Furthermore, they sometimes involve a requirement that the reduced-channel encoded audio signal be a viable listening alternative to the fully decoded version, so that direct playback is an option and a custom decoder is not required.

Sound field coding is an endeavor similar to spatial audio coding that focuses on capturing and encoding a "live" audio scene and accurately reproducing that scene through a playback system. Existing approaches to sound field coding rely on specific microphone configurations to accurately capture directional sources. Moreover, they rely on various analysis techniques to properly handle directional and diffuse sources. However, the microphone configurations required for sound field coding are often impractical for consumer devices. Modern consumer devices typically have significant design constraints on the number and placement of microphones, which can result in configurations that do not match the requirements of current sound field coding methods. Sound field analysis methods are also often computationally intensive and lack the scalability needed to support lower-complexity implementations.

Summary of the Invention

This summary is provided to introduce, in a simplified form, a selection of concepts that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

Embodiments of the sound field coding system and method relate to the processing of audio signals, and more specifically to the capture, encoding, and reproduction of three-dimensional (3-D) audio sound fields. Embodiments of the system and method are used to capture a 3-D sound field representing an immersive audio scene. The capture is performed using an arbitrary microphone array configuration. For efficient storage and distribution, the captured audio is encoded into a generic spatially encoded signal (SES) format. In some embodiments, the method for spatially decoding the SES format for reproduction is agnostic to the microphone array configuration used to capture the audio in the 3-D sound field.

Currently, there is no end-to-end system that enables the flexible capture, distribution, and reproduction of immersive audio recordings encoded in a generic digital audio format compatible with standard two-channel and multi-channel reproduction systems. In particular, because adopting standard multi-channel microphone array configurations is impractical in consumer mobile devices (such as smartphones or cameras), there is a need for a method of spatially encoding, from flexible multi-channel microphone array configurations, two-channel or multi-channel immersive audio signals that are compatible with legacy playback systems.

Embodiments of the system and method include processing a plurality of microphone signals by selecting a microphone configuration, having a plurality of microphones, for capturing a 3-D sound field. The microphones are used to capture sound from at least one audio source. The microphone configuration defines a microphone directivity for each of the plurality of microphones used in the audio capture. The microphone directivity is defined relative to a reference direction.

Embodiments of the system and method also include selecting a virtual microphone configuration containing a plurality of virtual microphones. The virtual microphone configuration is used to encode spatial information about the position of the audio source relative to the reference direction. The system and method also include calculating spatial encoding coefficients based on the microphone configuration and the virtual microphone configuration. The spatial encoding coefficients are used to convert the microphone signals into a spatially encoded signal (SES). The SES includes virtual microphone signals, which are obtained by combining the microphone signals using the spatial encoding coefficients.
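The idea of obtaining virtual microphone signals as weighted combinations of the captured microphone signals can be sketched for one simple case. The sketch below assumes that the captured signals are already horizontal B-format components (W, X, Y), with W carrying a 1/sqrt(2) gain, and that each virtual microphone has a first-order pattern p + (1 - p)·cos(angle off axis). The function names and these conventions are hypothetical; the sketch does not reproduce how the coefficients are calculated for an arbitrary microphone configuration.

```python
import math


def virtual_mic_coefficients(azimuth_deg, p):
    """Coefficients over (W, X, Y) for one horizontal virtual microphone.

    The assumed pattern is p + (1 - p) * cos(angle off axis), so p = 0.5
    gives a cardioid. W is assumed to carry a 1/sqrt(2) gain.
    """
    az = math.radians(azimuth_deg)
    return [p * math.sqrt(2.0), (1.0 - p) * math.cos(az), (1.0 - p) * math.sin(az)]


def spatially_encode(mic_frame, coefficients):
    """Combine one frame of N microphone samples into T encoded samples.

    `coefficients` is a T x N list of rows, one row per virtual microphone.
    """
    return [sum(c * m for c, m in zip(row, mic_frame)) for row in coefficients]


# Two virtual cardioids pointing left (+90 degrees) and right (-90 degrees).
COEFFICIENTS = [virtual_mic_coefficients(90.0, 0.5),
                virtual_mic_coefficients(-90.0, 0.5)]
```

With these coefficients, a unit-amplitude source at the far left (W = 1/sqrt(2), X = 0, Y = 1) appears at full level in the left virtual microphone and is nulled in the right one, which is the kind of left-right cue a two-channel encoded signal can carry.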

It should be noted that alternative embodiments are possible, and that the steps and elements discussed herein may be changed, added, or eliminated depending on the particular embodiment. These alternative embodiments include alternative steps and alternative elements that may be used, and structural changes that may be made, without departing from the scope of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

Referring now to the drawings, in which like reference numerals designate corresponding parts throughout:

FIG. 1 is an overview block diagram of an embodiment of a sound field coding system according to the present invention.

FIG. 2A is a block diagram illustrating details of the capture component, encoding component, and distribution component of the embodiment of the sound field coding system shown in FIG. 1.

FIG. 2B is a block diagram illustrating an embodiment of a portable capture device having microphones arranged in a non-standard configuration.

FIG. 3 is a block diagram illustrating details of the decoding and playback components of the embodiment of the sound field coding system shown in FIG. 1.

FIG. 4 shows a general block diagram of an embodiment of a sound field coding system according to the present invention.

FIG. 5 is a block diagram depicting in greater detail an embodiment of a system similar to that described in FIG. 4, where T = 2.

FIG. 6 is a block diagram illustrating in greater detail the spatial decoder and renderer shown in FIG. 5.

FIG. 7 is a block diagram illustrating a spatial encoder in the case of T = 2 transmitted signals and no side information.

FIG. 8 is a block diagram illustrating an alternative embodiment of the spatial encoder shown in FIG. 7.

FIG. 9A shows a specific example embodiment of a spatial encoder in which an A-format signal is captured and converted to B-format, and a 2-channel spatially encoded signal is derived from the B-format.

FIG. 9B shows the directivity patterns of the B-format W, X, and Y components in the horizontal plane.

FIG. 9C shows the directivity patterns of three supercardioid virtual microphones derived by combining the B-format W, X, and Y components.

FIG. 10 shows an alternative embodiment of the system shown in FIG. 9A, in which the B-format signal is converted to a 5-channel surround sound signal.

FIG. 11 shows an alternative embodiment of the system shown in FIG. 9A, in which the B-format signal is converted to a Directional Audio Coding (DirAC) representation.

FIG. 12 is a block diagram depicting in greater detail an embodiment of a system similar to that described in FIG. 11.

FIG. 13 is a block diagram illustrating yet another embodiment of a spatial encoder, which transforms a B-format signal into the frequency domain and encodes it as a 2-channel stereo signal.

FIG. 14 is a block diagram illustrating an embodiment of a spatial encoder in which the input microphone signals are first decomposed into direct and diffuse components.

FIG. 15 is a block diagram illustrating an embodiment of a spatial encoding system and method that includes a wind noise detector.

FIG. 16 shows a system for capturing N microphone signals and converting them, prior to spatial encoding, to an M-channel format suitable for editing.

FIG. 17 shows an embodiment of the system and method by which the captured audio scene is modified as part of the spatial decoding process.

FIG. 18 is a flow diagram illustrating the general operation of an embodiment of the capture component of a sound field coding system according to the present invention.

DETAILED DESCRIPTION

In the following description of embodiments of the sound field coding system and method, reference is made to the accompanying drawings. These drawings show, by way of illustration, specific examples of how embodiments of the system and method may be practiced. It is understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the claimed subject matter.

I. System Overview

Embodiments of the sound field coding system and method described herein capture a sound field representing an immersive audio scene using an arbitrary microphone array configuration. For efficient storage and distribution, the captured audio is encoded into a generic spatially encoded signal (SES) format. In a preferred embodiment of the present invention, the method for spatially decoding the SES format for reproduction is agnostic to the microphone array configuration used. Storage and distribution can be implemented using existing methods for two-channel audio (for example, commonly used digital media distribution or streaming networks). The SES format can be played back on a standard two-channel stereo reproduction system or, alternatively, reproduced with high spatial fidelity on a flexible playback configuration (if an appropriate SES decoder is available). The SES encoding format enables spatial decoding configured to achieve faithful reproduction of the original immersive audio scene in a variety of playback configurations (for example, headphones or surround sound systems).

Embodiments of the sound field coding system and method provide flexible and scalable techniques for capturing and encoding a three-dimensional sound field with an arbitrary microphone configuration. This differs from existing methods in that no specific microphone configuration is required. In addition, the SES encoding format described herein supports high-quality two-channel playback without requiring a spatial decoder. This differs from other three-dimensional sound field coding methods (such as the Ambisonic B-format or DirAC), which are generally not concerned with providing faithful immersive 3-D audio playback directly from the encoded audio signal. Moreover, those coding methods may be unable to provide high-quality playback unless side information is included in the encoded signal. In embodiments of the system and method described herein, side information is optional.

Capture, Encoding and Distribution System

FIG. 1 is an overview block diagram of an embodiment of a sound field coding system 100. The system 100 includes a capture component 110, a distribution component 120, and a playback component 130. In the capture component, an input microphone or, preferably, a microphone array receives audio signals. The capture component 110 accepts microphone signals 135 from a variety of microphone configurations. By way of example, these configurations include mono, stereo, 3-microphone surround, 4-microphone periphonic (such as Ambisonic B-format), or arbitrary microphone configurations. A first symbol 138 indicates that any one of the microphone signal formats can be selected as the input. The microphone signals 135 are input to an audio capture component 140. In some embodiments of the system 100, the audio capture component 140 processes the microphone signals 135 to remove undesired environmental noise (such as stationary background noise or wind noise).

The captured audio signals are input to a spatial encoder 145. These audio signals are spatially encoded into a spatially encoded signal (SES) format suitable for subsequent storage and distribution. The resulting SES is passed to a storage/transmission component 150 of the distribution component 120. In some embodiments, the storage/transmission component 150 encodes the SES with an audio waveform encoder (such as MP3 or AAC) to reduce the storage requirement or transmission data rate without modifying the spatial cues encoded in the SES. In the distribution component 120, the audio is stored or provided to playback devices over a distribution network.

In the playback component 130, a variety of playback devices are depicted. As depicted by a second symbol 152, any one of the playback devices can be selected. FIG. 1 shows a first playback device 155, a second playback device 160, and a third playback device 165. For the first playback device 155, the SES is spatially decoded for optimal playback over headphones. For the second playback device 160, the SES is spatially decoded for optimal playback over a stereo system. For the third playback device 165, the SES is spatially decoded for optimal playback over a multi-channel loudspeaker system. In common usage scenarios, as will be understood by those skilled in the art and as shown in the following figures, audio capture, distribution, and playback can occur in conjunction with video.

FIG. 2A is a block diagram illustrating details of the capture component 110 of the sound field coding system 100 shown in FIG. 1. In the capture component 110, the recording device supports both a four-microphone array connected to a first audio capture subcomponent 200 and a two-microphone array connected to a second audio capture subcomponent 210. The outputs of the first audio capture subcomponent 200 and the second audio capture subcomponent 210 are provided to a first spatial encoder subcomponent 220 and a second spatial encoder subcomponent 230, respectively, where they are encoded into the spatially encoded signal (SES) format. It should be noted that embodiments of the system 100 are not limited to two-microphone or four-microphone arrays. In other cases, other microphone configurations are similarly supported with appropriate spatial encoders. In some embodiments, an audio bitstream encoder 240 encodes the SES generated by the first spatial encoder subcomponent 220 or by the second spatial encoder subcomponent 230. The encoded signal output from the encoder 240 is packed into an audio bitstream 250.

In some embodiments, video is included in the capture component 110. As shown in FIG. 2A, a video capture component 260 captures a video signal, and a video encoder 270 encodes the video signal to produce a video bitstream. An A/V multiplexer 280 multiplexes the audio bitstream 250 with the associated video bitstream. The multiplexed audio and video bitstream is stored or transmitted in the storage/transmission component 150 of the distribution component 120. The bitstream data can be stored temporarily as a data file on the capture device, on a local media server, or in a computer network, and made available for transmission or distribution.

In some embodiments, the first audio capture subcomponent 200 captures an Ambisonic B-format signal, and the SES encoding performed by the first spatial encoder subcomponent 220 is a conventional B-format to UHJ two-channel stereo encoding, as described, for example, in Michael Gerzon, "Ambisonics in multichannel broadcasting and video," JAES, Vol. 33, No. 11, pp. 859–871, November 1985. In an alternative embodiment, the first spatial encoder subcomponent 220 performs frequency-domain spatial encoding of the B-format signal into a two-channel SES which, unlike the two-channel UHJ format, can retain three-dimensional spatial audio cues. In yet another embodiment, the microphones connected to the first audio capture subcomponent 200 are arranged in a non-standard configuration.

FIG. 2B is a diagram illustrating an embodiment of a portable capture device 201 having microphones arranged in a non-standard configuration. The portable capture device 201 in FIG. 2B includes microphones 202, 203, 204, and 205 for audio capture and a camera 206 for video capture. In a portable device (such as a smartphone), the placement of the microphones on the device 201 may be constrained by industrial design considerations or other factors. Because of such constraints, the microphones 202, 203, 204, and 205 may be arranged in a way that is not a standard microphone configuration (such as the recording microphone configurations recognized by those skilled in the art). Indeed, the configuration may be specific to the particular capture device. FIG. 2B merely provides an example of such a device-specific configuration. It should be noted that various other embodiments are possible and are not limited to this particular microphone configuration. Moreover, embodiments of the invention are applicable to arbitrary microphone configurations.

In an alternative embodiment, only two microphone signals are captured (by the second audio capture subcomponent 210) and spatially encoded (by the second spatial encoder subcomponent 230). This limitation to two microphone channels may occur, for example, when a product design decision is made to minimize device manufacturing cost. In this case, the fidelity of the spatial information encoded in the SES may be compromised accordingly. For example, the SES may lack up-versus-down or front-versus-back discrimination cues. However, in an advantageous embodiment of the invention, for the same originally captured sound field, the left-versus-right discrimination cues encoded in the SES produced by the second spatial encoder subcomponent 230 are substantially equivalent to those encoded in the SES produced by the first spatial encoder subcomponent 220 (as perceived by a listener in a standard two-channel stereo playback configuration). Therefore, the SES format remains compatible with standard two-channel stereo reproduction regardless of the capture microphone array configuration.

In some embodiments, the first spatial encoder subcomponent 220 also produces spatial audio side information, or metadata, that is included in the SES. In some embodiments, this side information is derived from a frequency-domain analysis of the inter-channel relationships between the captured microphone signals. Such spatial audio side information is incorporated into the audio bitstream by the audio bitstream encoder 240 and subsequently stored or transmitted, so that it can optionally be retrieved in the playback component and used to optimize spatial audio reproduction fidelity.

More generally, in some embodiments, the digital audio bitstream produced by the audio bitstream encoder 240 is formatted to include a two-channel or multi-channel backward-compatible audio downmix signal along with optional extensions (referred to herein as "side information"), which may include metadata and additional audio channels. An example of such an audio coding format is described in U.S. Patent Application US 2014-0350944 A1, entitled "Encoding and reproduction of three dimensional audio soundtracks," which is incorporated herein by reference in its entirety.

Although it is often useful to perform the spatial encoding before multiplexing the audio and video, as depicted in FIG. 2A (for legacy and compatibility purposes), in other embodiments the originally captured multi-channel audio signal can be multiplexed "as is" with the video, and the SES encoding can take place at some later stage in the delivery chain. For example, the spatial encoding (including optional side information extraction) can be performed offline on a network-based computer. This approach can allow more advanced signal analysis computations than could be achieved if the spatial encoding computations were implemented on the processor of the original recording device.

In some embodiments, the two-channel SES encoded by the audio bitstream encoder 240 contains the spatial audio cues captured in the original sound field. In some embodiments, the audio cues take the form of inter-channel amplitude and phase relationships that, within the fidelity limits imposed by the geometry of the microphone array and the number of microphones, are substantially insensitive to the particular microphone array configuration employed on the capture device. The two-channel SES can later be decoded by extracting the encoded spatial audio cues and rendering audio signals that are optimal for reproducing, on the available playback device, the spatial cues representing the original audio scene.

FIG. 3 is a block diagram illustrating details of the playback component 130 of the sound field coding system 100 shown in FIG. 1. The playback component 130 receives a media bitstream from the storage/transmission component 150 of the distribution component 120. In embodiments where the received bitstream includes both an audio bitstream and a video bitstream, the bitstreams are demultiplexed by an A/V demultiplexer (demuxer) 300. The video bitstream is provided to a video decoder 310 for decoding and playback on a monitor 320. The audio bitstream is provided to an audio bitstream decoder 330, which recovers the original encoded SES either exactly or in a form that preserves the spatial cues encoded in the SES. For example, in some embodiments, the audio bitstream decoder 330 includes an audio waveform decoder that is reciprocal to the audio waveform encoder optionally included in the audio bitstream encoder 240.

In some embodiments, the decoded SES output from the decoder 330 includes a two-channel stereo signal compatible with standard two-channel stereo reproduction. This signal can be provided directly to a legacy playback system 340 (such as a pair of loudspeakers) without any further decoding or processing other than digital-to-analog conversion and amplification of the individual left and right audio signals. As described previously, the backward-compatible stereo signal included in the SES is such that it provides a viable reproduction of the originally captured audio scene on the legacy playback system 340. In alternative embodiments, the legacy playback system 340 may be a multi-channel playback system (such as a 5.1 or 7.1 surround sound reproduction system), and the decoded SES provided by the audio bitstream decoder 330 may include a multi-channel signal directly compatible with the legacy playback system 340.

In embodiments where the decoded SES is provided directly to a two-channel or multi-channel legacy playback system 340, any auxiliary information included in the audio bitstream (such as additional metadata or audio waveform channels) can simply be ignored by the audio bitstream decoder 330. The entire playback component 130 can therefore be a legacy audio or A/V playback device, such as any existing mobile phone or computer. In some embodiments, the capture component 110 and the distribution component 120 are backward compatible with any legacy audio or video media playback device.

In some embodiments, an optional spatial audio decoder is applied to the SES output from the audio bitstream decoder 330. As shown in FIG. 3, an SES headphone decoder 350 performs SES decoding for headphone output and playback by headphones 355. An SES stereo decoder 360 performs SES decoding to generate a stereo loudspeaker output to a stereo loudspeaker playback system 365. An SES multi-channel decoder 370 performs SES decoding to generate a multi-channel loudspeaker output to a multi-channel loudspeaker playback system 375. Each of these SES decoders performs a decoding algorithm tailored specifically to the corresponding playback configuration. Embodiments of the playback component 130 include one or more of the above SES decoders for arbitrary playback configurations. Regardless of the playback configuration, these SES decoders require no information about the original capture or recording configuration. For example, in some embodiments, the SES decoder includes an Ambisonic UHJ to B-format decoder followed by a B-format spatial decoder tailored to the particular playback configuration, as described, for example, in Michael Gerzon, "Ambisonics in multichannel broadcasting and video," JAES, Vol. 33, No. 11, pp. 859-871, November 1985.

By way of example, in embodiments supporting headphone playback, the SES is decoded by the SES headphone decoder 350 to output a binaural signal that reproduces the encoded audio scene. This is achieved by decoding the embedded spatial audio cues and applying appropriate directional filtering, such as head-related transfer functions (HRTFs). In some embodiments, this may involve a UHJ to B-format decoder followed by a binaural transcoder. The decoder may also support head tracking, so that the orientation of the reproduced audio scene is adjusted automatically during headphone playback to compensate continuously for changes in the listener's head orientation, thereby reinforcing the listener's illusion of being immersed in the originally captured sound field.
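As an illustration of the head-tracking compensation mentioned above, the following is a minimal sketch, assuming the compensation is applied to a horizontal B-format representation before binaural transcoding. The text does not state where in the decoder the compensation happens, so this placement, the function name `rotate_b_format_yaw`, and the counterclockwise azimuth convention are all assumptions made for illustration.

```python
import math


def rotate_b_format_yaw(w, x, y, z, head_yaw_rad):
    """Counter-rotate one B-format sample (W, X, Y, Z) by the listener's head yaw.

    Assumed convention (not stated in the source): azimuth is measured
    counterclockwise from the front, so a source at azimuth theta
    contributes x ~ cos(theta) and y ~ sin(theta).
    """
    c = math.cos(head_yaw_rad)
    s = math.sin(head_yaw_rad)
    # Rotating the scene by -yaw keeps sources fixed in the room
    # while the head turns by +yaw.
    x_rot = c * x + s * y
    y_rot = -s * x + c * y
    # W (omnidirectional) and Z (vertical) are unaffected by a yaw rotation.
    return w, x_rot, y_rot, z
```

With this convention, a source directly to the listener's side at azimuth 90 degrees ends up straight ahead once the listener turns the head by 90 degrees toward it.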

As an example of an embodiment in which the playback component 130 is connected to a two-channel loudspeaker system (such as stand-alone loudspeakers or loudspeakers built into a laptop or tablet computer, a television, or a soundbar enclosure), the SES is first spatially decoded by the SES stereo decoder 360. In some embodiments, the decoder 360 includes an SES decoder equivalent to the SES headphone decoder 350, whose binaural output signal may be further processed by an appropriate crosstalk cancellation circuit to provide a faithful reproduction of the spatial cues encoded in the SES, tailored to the particular two-channel loudspeaker playback configuration.

As an example of an embodiment in which the playback component 130 is connected to a multi-channel loudspeaker system, the SES is first spatially decoded by the SES multi-channel decoder 370. The configuration of the multi-channel loudspeaker playback system 375 may be a standard 5.1 or 7.1 surround sound configuration, or any arbitrary surround sound or immersive three-dimensional configuration including, for example, height channels (such as a 22.2 system configuration).

The operations performed by the SES multi-channel decoder 370 may include reformatting the two-channel or multi-channel signal included in the SES. This reformatting is performed in order to reproduce faithfully the spatial audio scene encoded in the SES, according to the loudspeaker output layout and any optional additional metadata or auxiliary information included in the SES. In some embodiments, the SES includes a two-channel or multi-channel UHJ or B-format signal, and the SES multi-channel decoder 370 includes a spatial decoder optimized for the particular playback configuration.

In other embodiments where the SES includes a backward-compatible two-channel stereo signal that is viable for standard two-channel stereo playback, alternative two-channel encoding/decoding schemes may be employed to overcome the known limitations of UHJ encoding/decoding in terms of spatial audio fidelity. For example, to achieve improved spatial cue resolution and preserve three-dimensional information, the SES encoder may use a two-channel frequency-domain phase-amplitude encoding method that performs spatial encoding in multiple frequency bands. In addition, combining such a spatial encoding method with optional metadata extraction in the SES encoder makes it possible to further improve the fidelity and accuracy of the reproduced audio scene relative to the originally captured sound field.

In some embodiments, the SES decoder resides on a playback device that has a default playback configuration best suited to an assumed listening scenario. For example, headphone reproduction may be the assumed listening scenario for a mobile device or camera, so the SES decoder may be configured with headphones as the default decoding format. As another example, a 7.1 multi-channel surround system may be the assumed playback configuration for a home theater listening scenario, so an SES decoder residing on a home theater device may be configured with 7.1 multi-channel surround as the default playback configuration.

II. System Details and Alternative Embodiments

System details of various embodiments of the sound field coding system 100 and method will now be discussed. It should be noted that only a few of the many ways in which the components, systems, and codecs can be implemented are described in detail below. Many variations of those shown and described herein are possible.

Flexible immersive audio capture and spatial encoding embodiments

FIG. 4 illustrates a general block diagram of an embodiment of the spatial encoder and decoder in the sound field coding system 100. Referring to FIG. 4, N audio signals are captured by N microphones, respectively, to obtain N microphone signals. Each of the N microphones has a directivity pattern that characterizes its response as a function of frequency and of direction relative to a reference direction. In the spatial encoder 410, the N signals are combined into T signals such that each of the T signals has a prescribed directivity pattern associated with it.

In some embodiments, the spatial encoder 410 also generates auxiliary information S, represented by the dashed line in FIG. 4. In some embodiments, the auxiliary information S includes spatial audio metadata and/or additional audio waveform signals. The T signals, together with the optional auxiliary information S, form the spatially encoded signal (SES). The SES is transmitted or stored for subsequent use or distribution. In preferred embodiments, T is less than N, so that encoding the N microphone signals into T transmission signals reduces the amount of data needed to represent the audio scene captured by the N microphones.

In some preferred embodiments, the auxiliary information S consists of spatial cues stored at a data rate lower than that of the T audio transmission signals. Including the auxiliary information S therefore generally does not increase the total SES data rate substantially. The spatial decoder and renderer 420 converts the SES into Q playback signals optimized for a target playback system (not shown). The target playback system may be headphones, a two-channel loudspeaker system, a five-channel loudspeaker system, or some other playback configuration.

It should be noted that in FIG. 4 the number of transmission signals T is depicted as 2 without loss of generality. Other design choices for the number of transmission channels fall within the scope of the invention. For example, in some embodiments T may be chosen to be 1. In these embodiments, the transmission signal may be a monophonic downmix of the N captured signals, and some spatial auxiliary information S may be included in the SES to encode the spatial cues representing the captured sound field. In other embodiments, T may be chosen to be greater than 2. When T is greater than 1, including spatial cues in the auxiliary information S is not required, because the spatial cues can be encoded in the T audio signals themselves. By way of example, the spatial cues may be mapped to inter-channel amplitude and phase differences between the T transmission signals.

FIG. 5 is a block diagram depicting in more detail an embodiment of the system 100 similar to that described with reference to FIG. 4, where T = 2. In these embodiments, the N microphone signals are input to the spatial encoder 410. The spatial cues are encoded into the T transmission signals by the spatial encoder 410, and the auxiliary information S may be omitted altogether. In some embodiments, as described previously in connection with FIG. 1 and FIG. 2, the two-channel SES is perceptually coded using a standard waveform coder (such as MP3 or AAC), is readily distributed over available digital distribution media or network and broadcast infrastructure, and is played back directly in a standard two-channel stereo configuration (using headphones or loudspeakers). An important advantage of such embodiments is that the encoding and transmission system supports playback over commonly available two-channel stereo systems without requiring a spatial decoding and rendering process.

Some embodiments of the system 100 include a single microphone (N = 1). It should be noted that in these embodiments no spatial information is captured, because there is no spatial diversity in the microphone signal. In these cases, pseudo-stereo techniques (such as those described, for example, in Orban, "A Rational Technique for Synthesizing Pseudo-Stereo From Monophonic Sources," JAES 18(2), 1970) may be employed in the spatial encoder 410 to generate, from the monophonic captured audio signal, a two-channel SES suitable for producing an artificial spatial impression when played back directly over a standard stereo reproduction system.

Some embodiments of the system 100 include the spatial decoder and renderer 420. In some preferred embodiments, the function of the spatial decoder and renderer 420 is to optimize the spatial fidelity of the reproduced audio scene for the particular playback configuration in use. For example, the spatial decoder and renderer 420 provides one or more of the following: (a) two output channels optimized for immersive 3-D audio reproduction in headphone playback (for example, using HRTF-based virtualization techniques); (b) two output channels optimized for immersive 3-D audio reproduction in playback over two loudspeakers (for example, using virtualization and crosstalk cancellation techniques); and (c) five output channels optimized for immersive 3-D audio or surround sound reproduction in playback over five loudspeakers. These are representative examples of reproduction formats. In some embodiments, as explained in more detail below, the spatial decoder and renderer 420 is configured to provide playback signals optimized for reproduction over any arbitrary reproduction system.

FIG. 6 is a block diagram illustrating in more detail an embodiment of the spatial decoder and renderer 420 shown in FIG. 4 and FIG. 5. As shown in FIG. 6, the spatial decoder and renderer 420 includes a spatial decoder 600 and a renderer 610. The SES, shown without loss of generality, includes T = 2 channels and optional auxiliary information S. The decoder 600 first decodes the SES into P audio signals. In an example embodiment, the decoder 600 outputs a 5-channel matrix-decoded signal. The P audio signals are then processed to form Q playback signals optimized for the playback configuration of the reproduction system. In one example embodiment, the SES is a two-channel UHJ-encoded signal, the decoder 600 is a conventional Ambisonic UHJ to B-format converter, and the renderer 610 further decodes the B-format signal for the Q-channel playback configuration.

FIG. 7 is a block diagram illustrating SES capture and encoding with T = 2 transmission signals and no auxiliary information. In these embodiments, the spatial encoder 410 is designed to encode the N microphone signals into a stereo signal. As explained above, the choice of T = 2 is compatible with widely used perceptual audio waveform coders (such as AAC or MP3), audio distribution media, and reproduction systems. The N microphones may be coincident, near-coincident, or non-coincident microphones. The microphones may be built into a single device, such as a camera, a smartphone, a field recorder, or an accessory for such a device. In addition, the N microphone signals may be synchronized across multiple homogeneous or heterogeneous devices or device accessories.

In some embodiments, the T = 2 transmission channels are encoded to simulate coincident virtual microphone signals, because coincidence (time alignment of the signals) is advantageous for high-quality spatial decoding. In embodiments using non-coincident microphones, a provision for time alignment, based on analyzing the direction of arrival and applying a corresponding compensation, may be incorporated in the SES encoder. In alternative embodiments, the stereo signal may be derived to correspond to a binaural or non-coincident microphone recording signal, depending on the spatial audio reproduction use case and application associated with the intended decoder.

FIG. 8 is a block diagram illustrating an embodiment of the spatial encoder 410 shown in FIG. 4 through FIG. 7. As shown in FIG. 8, the N microphone signals are input to a spatial analyzer and converter 800, in which they are first converted into an intermediate format consisting of M signals. The M signals are then encoded by a renderer 810 into two channels for transmission. The embodiment shown in FIG. 8 is advantageous when the intermediate M-channel format is better suited to processing by the renderer 810 than the N microphone signals are. In some embodiments, the conversion to the M intermediate channels may incorporate an analysis of the N microphone signals. Furthermore, in some embodiments, the spatial conversion process 800 may include multiple conversion steps and intermediate formats.

Details of specific embodiments

FIG. 9A illustrates a specific example embodiment of the spatial encoder 410 and method shown in FIG. 7, in which A-format microphone signal capture is used. The original 4-channel A-format microphone signal can readily be converted into an Ambisonic B-format signal (W, X, Y, Z) by an A-format to B-format converter 900. Alternatively, a microphone that provides a B-format signal directly may be used, in which case the A-format to B-format converter 900 is unnecessary.
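The text does not spell out the A-format to B-format conversion, so the following is only a sketch of the widely used sum-and-difference matrix for a tetrahedral capsule array. The capsule names (`flu`, `frd`, `bld`, `bru` for front-left-up, front-right-down, back-left-down, back-right-up), the 0.5 normalization, and the function name are assumptions; practical converters also apply frequency-dependent equalization to compensate for capsule spacing, which is omitted here.

```python
def a_to_b_format(flu, frd, bld, bru):
    """Convert one sample of four tetrahedral capsule signals to (W, X, Y, Z).

    Assumed capsule orientation: front-left-up, front-right-down,
    back-left-down, back-right-up. The 0.5 gain is an arbitrary
    normalization chosen for this illustration.
    """
    w = 0.5 * (flu + frd + bld + bru)  # omnidirectional pressure
    x = 0.5 * (flu + frd - bld - bru)  # front minus back
    y = 0.5 * (flu - frd + bld - bru)  # left minus right
    z = 0.5 * (flu - frd - bld + bru)  # up minus down
    return w, x, y, z
```

For instance, a signal picked up only by the two front capsules yields equal W and X components and no Y or Z component.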

Various virtual microphone directivity patterns can be formed from the B-format signal. In this embodiment, a B-format to supercardioid converter block 910 converts the B-format signal into a set of three supercardioid microphone signals formed using these equations:

where, for example, the design parameters are set to θ_S = π and p = 0.33. W is the omnidirectional pressure signal of the B-format, X is the front-back figure-of-eight signal of the B-format, and Y is the left-right figure-of-eight signal of the B-format. The Z signal of the B-format (up-down figure-of-eight) is not used in this conversion. V_L is a virtual left microphone signal corresponding to a supercardioid whose directivity pattern is steered to -60 degrees (-π/3 radians) in the horizontal plane, V_R is a virtual right microphone signal corresponding to a supercardioid whose directivity pattern is steered to +60 degrees (+π/3 radians) in the horizontal plane, and V_S is a virtual surround microphone signal corresponding to a supercardioid whose directivity pattern is steered to +180 degrees (θ_S = π radians) in the horizontal plane. The parameter p = 0.33 is chosen according to the desired directivity of the virtual microphone signals.
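The conversion equations themselves are not reproduced in the text here, so the following is only a sketch that assumes the common first-order virtual-microphone form V = p·√2·W + (1 − p)·(X·cos θ + Y·sin θ), with W carrying the conventional -3 dB gain. That form, the helper names, and the plane-wave encoder used for checking are assumptions; only the steering angles and p = 0.33 come from the description. Which side a negative angle points to depends on the sign convention of the Y axis, which the text does not state.

```python
import math

# Steering angles and pattern parameter taken from the description.
THETA_L = -math.pi / 3   # -60 degrees
THETA_R = math.pi / 3    # +60 degrees
THETA_S = math.pi        # +180 degrees
P = 0.33


def virtual_mic(w, x, y, theta, p=P):
    """Form a first-order virtual microphone steered to azimuth theta.

    Assumed formula (not confirmed by the source): p weights the
    omnidirectional part and (1 - p) the figure-of-eight part.
    """
    return p * math.sqrt(2.0) * w + (1.0 - p) * (x * math.cos(theta) + y * math.sin(theta))


def encode_plane_wave(s, azimuth):
    """Hypothetical helper: horizontal B-format (W, X, Y) of a plane wave."""
    return s / math.sqrt(2.0), s * math.cos(azimuth), s * math.sin(azimuth)


def b_format_to_three_mics(w, x, y):
    """Return (V_L, V_R, V_S) for one B-format sample; Z is not used."""
    return (virtual_mic(w, x, y, THETA_L),
            virtual_mic(w, x, y, THETA_R),
            virtual_mic(w, x, y, THETA_S))
```

Under these assumptions each virtual microphone has unit gain for a plane wave arriving on its own axis and a gain of 2p − 1 for a wave arriving from the opposite direction.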

FIG. 9B illustrates the directivity patterns of the B-format components on a linear scale. Plot 920 shows the directivity pattern of the omnidirectional W component. Plot 930 shows the directivity pattern of the front-back X component, where 0 degrees is the forward direction. Plot 940 shows the directivity pattern of the left-right Y component.

FIG. 9C illustrates the directivity patterns of the supercardioid virtual microphones of this embodiment on a dB scale. Plot 950 shows the directivity pattern of V_L, the virtual microphone steered to -60 degrees. Plot 960 shows the directivity pattern of V_R, the virtual microphone steered to +60 degrees. Plot 970 shows the directivity pattern of V_S, the virtual microphone steered to +180 degrees.

The spatial encoder 410 converts the resulting 3-channel supercardioid signal (V_L, V_R, V_S) produced by the converter 910 into a two-channel SES. This is achieved using the following phase-amplitude matrix encoding equations:

L_T = aV_L + jbV_S

R_T = aV_R - jbV_S

where L_T denotes the encoded left-channel signal, R_T denotes the encoded right-channel signal, j denotes a 90-degree phase shift, a and b are the 3:2 matrix encoding weights, and V_L, V_R, and V_S are the left-channel, right-channel, and surround-channel virtual microphone signals, respectively. In some embodiments, the 3:2 matrix encoding weights may be chosen with a = 1 and with b set so that the total power of the 3-channel signal (V_L, V_R, V_S) is preserved in the encoded SES. As will be apparent to those skilled in the art, the above matrix encoding equations have the effect of converting the set of three virtual microphone directivity patterns associated with the 3-channel signal (V_L, V_R, V_S), shown in FIG. 9C, into a pair of complex-valued virtual microphone directivity patterns associated with the two-channel SES (L_T, R_T).
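The 3:2 equations above can be sketched on complex frequency-domain values, where multiplying by `1j` is exactly a 90-degree phase shift. Working on complex values is an assumption made for brevity; a time-domain implementation would realize j with a 90-degree phase-shift filter. The default b = 1/√2 is also an assumption: it is the value that preserves total power for a = 1 when the three inputs are mutually uncorrelated, since the surround signal appears in both output channels with power b² each.

```python
import math


def encode_3_to_2(v_l, v_r, v_s, a=1.0, b=1.0 / math.sqrt(2.0)):
    """Phase-amplitude matrix encode (V_L, V_R, V_S) into (L_T, R_T).

    Inputs are complex frequency-domain values (or phasors). The default
    b = 1/sqrt(2) is an assumed power-preserving choice, not a value
    quoted from the source.
    """
    l_t = a * v_l + 1j * b * v_s   # surround enters left with +90 degrees
    r_t = a * v_r - 1j * b * v_s   # and right with -90 degrees
    return l_t, r_t
```

A surround-only input therefore appears in the two output channels with equal magnitude and opposite phase, which is the cue a phase-amplitude matrix decoder can use to steer it to the rear.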

The embodiment depicted in FIG. 9A and described above implements a low-complexity spatial encoder that may be suitable for low-power devices and applications. Note that, within the scope of the invention, alternative directivity patterns for the intermediate 3-channel representation may be formed from the B-format signal. The resulting two-channel SES is suitable for spatial decoding by a phase-amplitude matrix decoder, such as the spatial decoder 600 shown in FIG. 6.

FIG. 10 illustrates a specific example embodiment of the spatial encoder 410 and method shown in FIG. 7, in which the B-format signal is converted into a 5-channel surround sound signal (L, R, C, L_S, R_S). Here L denotes the front left channel, R the front right channel, C the front center channel, L_S the left surround channel, and R_S the right surround channel. As in FIG. 9A, the A-format microphone signal is input to an A-format to B-format converter 1000 and converted into a B-format signal. This 4-channel B-format signal is processed by a B-format to multi-channel format converter 1010, which in some embodiments is a multi-channel B-format decoder. Next, the spatial encoder converts the 5-channel surround sound signal produced by the converter 1010 into a two-channel SES, in an embodiment using the following phase-amplitude matrix encoding equations:

L_T = a₁L + a₂R + a₃C + ja₄L_S - ja₅R_S

R_T = a₂L + a₁R + a₃C - ja₅L_S + ja₄R_S

where L_T and R_T denote the left and right SES signals, respectively, output by the spatial encoder. In some embodiments, the matrix encoding coefficients may be chosen with a₁ = 1 and a₂ = 0. Alternative sets of matrix encoding coefficients may be used, depending on the desired spatial distribution of the front and surround channels in the two-channel encoded signal. As in the spatial encoder embodiment of FIG. 9A, the resulting two-channel SES is suitable for spatial decoding by a phase-amplitude matrix decoder, such as the spatial decoder 600 shown in FIG. 6.
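The 5:2 equations above can be sketched in the same way on complex frequency-domain values. Only a₁ = 1 and a₂ = 0 come from the description; the values of a₃, a₄, and a₅ in `EXAMPLE_COEFFS` are hypothetical placeholders chosen so that each input channel keeps unit power in the encoded pair, and the function name is likewise an assumption.

```python
import math

# a1 and a2 follow the description; a3, a4, a5 are hypothetical
# placeholder values used only to make the example concrete.
EXAMPLE_COEFFS = dict(a1=1.0, a2=0.0, a3=1.0 / math.sqrt(2.0), a4=0.8, a5=0.6)


def encode_5_to_2(l, r, c, l_s, r_s, a1, a2, a3, a4, a5):
    """Phase-amplitude matrix encode (L, R, C, L_S, R_S) into (L_T, R_T).

    Inputs are complex frequency-domain values; 1j is the 90-degree shift.
    """
    l_t = a1 * l + a2 * r + a3 * c + 1j * a4 * l_s - 1j * a5 * r_s
    r_t = a2 * l + a1 * r + a3 * c - 1j * a5 * l_s + 1j * a4 * r_s
    return l_t, r_t
```

With these placeholder coefficients a center-only input is split equally and in phase between L_T and R_T, while a left-surround-only input appears out of phase between the two channels and louder in L_T.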

In the embodiment shown in FIG. 10, the B-format signal is converted into a 5-channel intermediate surround sound format. It will be apparent, however, that any horizontal surround or three-dimensional intermediate multi-channel format may be used within the scope of the invention. In these cases, the operation of the converter 1010 and the spatial encoder 410 can readily be configured according to the assumed set of directions assigned to the individual intermediate channels.

FIG. 11 illustrates a specific example embodiment of the spatial encoder 410 and method shown in FIG. 7, in which the B-format signal is converted into a Directional Audio Coding (DirAC) representation. Specifically, as shown in FIG. 11, the A-format microphone signal is input to an A-format to B-format converter 1100. The resulting B-format signal is converted into a DirAC-encoded signal by a B-format to DirAC format converter 1110, as described, for example, in Pulkki, "Spatial Sound Reproduction with Directional Audio Coding," JAES, Vol. 55, No. 6, pp. 503-516, June 2007. The spatial encoder 410 then converts the DirAC-encoded signal into a two-channel SES. In one embodiment, this is achieved by converting the frequency-domain DirAC waveform data into a two-channel representation obtained, for example, by the method described in Jot, "Two-Channel Matrix Surround Encoding for Flexible Interactive 3-D Audio Reproduction," presented at the 125th AES Convention, October 2008. The resulting SES is suitable for spatial decoding by a phase-amplitude matrix decoder, such as the spatial decoder 600 shown in FIG. 6.

DirAC encoding includes a frequency-domain analysis that discriminates between the direct and diffuse components of the sound field. In a spatial encoder according to the present invention (such as the spatial encoder 410), the two-channel encoding is performed within the frequency-domain representation in order to take full advantage of the DirAC analysis. This yields a higher degree of spatial fidelity than conventional time-domain phase-amplitude matrix encoding techniques, such as those used in the spatial encoder embodiments described in connection with FIGS. 9A and 10.

FIG. 12 is a block diagram illustrating in greater detail an embodiment of the conversion of A-format microphone signals to an SES. As shown in FIG. 12, the A-format microphone signals are converted to a B-format signal by an A-format to B-format converter 1200. The B-format signal is converted to the frequency domain by a time-frequency transform 1210. The transform 1210 is at least one of a short-time Fourier transform, a wavelet transform, a subband filter bank, or some other operation that transforms a time-domain signal into a time-frequency representation. Next, a B-format to DirAC format converter 1220 converts the B-format signal to a DirAC format signal. The DirAC signal is input to the spatial encoder 410 and spatially encoded into a two-channel SES that is still represented in the frequency domain. The signal is converted back to the time domain by a frequency-time transform 1240, which is the inverse of the time-frequency transform 1210 or, where a perfect inverse is not possible or feasible, an approximation of that inverse. It should be noted that, to improve the fidelity of the spatial encoding, both the forward time-frequency transform and its inverse may be incorporated in any of the encoder embodiments according to the present invention.
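The per-band analysis performed by a B-format to DirAC conversion can be sketched as follows. This is a minimal illustration, not the converter 1220 itself: it assumes the traditional B-format convention in which W carries the source signal scaled by 1/sqrt(2), it follows the intensity-based direction and diffuseness estimates commonly associated with DirAC, and the function names `encode_plane_wave` and `dirac_parameters` are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)

def encode_plane_wave(s, azimuth, elevation=0.0):
    """B-format (W, X, Y, Z) of one source sample s (assumed convention: W = s / sqrt(2))."""
    ce = math.cos(elevation)
    return (s / SQRT2,
            s * math.cos(azimuth) * ce,
            s * math.sin(azimuth) * ce,
            s * math.sin(elevation))

def dirac_parameters(frames):
    """Estimate (azimuth, elevation, diffuseness) for one frequency band.

    frames: list of (W, X, Y, Z) complex values taken from the same
    frequency bin over consecutive time frames.
    """
    ix = iy = iz = 0.0   # time-summed intensity-like vector
    energy = 0.0         # time-summed energy-like quantity
    for w, x, y, z in frames:
        wc = complex(w).conjugate()
        ix += SQRT2 * (wc * x).real
        iy += SQRT2 * (wc * y).real
        iz += SQRT2 * (wc * z).real
        energy += abs(w) ** 2 + 0.5 * (abs(x) ** 2 + abs(y) ** 2 + abs(z) ** 2)
    if energy == 0.0:
        return 0.0, 0.0, 1.0  # silence: report as fully diffuse
    norm = math.sqrt(ix * ix + iy * iy + iz * iz)
    azimuth = math.atan2(iy, ix)
    elevation = math.atan2(iz, math.hypot(ix, iy))
    # Clamp to [0, 1] to absorb floating-point rounding.
    diffuseness = min(1.0, max(0.0, 1.0 - norm / energy))
    return azimuth, elevation, diffuseness

# A single source at 60 degrees azimuth is reported as fully direct.
frames = [encode_plane_wave(1 + 0.5j, math.pi / 3),
          encode_plane_wave(-0.3j, math.pi / 3)]
print(dirac_parameters(frames))
```

Under these assumptions a single plane wave gives a diffuseness near 0, while equal-energy contributions arriving from opposite directions in successive frames give a diffuseness near 1.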

FIG. 13 is a block diagram illustrating yet another embodiment of the spatial encoder 410, which transforms the B-format signal into the frequency domain before spatial encoding. Referring to FIG. 13, the A-format microphone signals are input to an A-format to B-format converter 1300. The resulting signal is transformed from the time domain into the frequency domain by a time-frequency transformer 1310. The signal is then encoded by a B-format dominance-based encoder 1320. In one embodiment, the SES is a two-channel stereo signal encoded according to the following equations:

$L_T = a_L W + b_L X + c_L Y + d_L Z$

$R_T = a_R W + b_R X + c_R Y + d_R Z$

where the coefficients (a_L, b_L, c_L, d_L) are time-dependent and frequency-dependent coefficients determined from the frequency-domain 3-D dominance direction computed from the B-format signals (W, X, Y, Z), such that if the sound field consists of a single sound source S at a given 3-D position, the resulting encoded signals are given by:

$L_T = k_L S$

$R_T = k_R S$

where k_L and k_R are complex factors chosen so that the amplitude and phase differences between the left and right channels map uniquely to the 3-D position. Example mapping formulas for this purpose are proposed, for example, in Jot, "Two-Channel Matrix Surround Encoding for Flexible Interactive 3-D Audio Reproduction," presented at the 125th AES Convention, October 2008. Such 3-D encoding may also be performed for other channel formats. The encoded signal is transformed from the frequency domain back to the time domain by a frequency-time transformer 1330.
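The single-source property described above can be checked with a small sketch. It assumes the B-format convention W = S / sqrt(2) and picks one of many coefficient sets that satisfy the property for a given dominance direction; the split of the gain between W and the directional components, the function names, and the example values of k_L and k_R are all illustrative assumptions rather than the mapping of the cited reference.

```python
import math

def encode_plane_wave(s, azimuth, elevation):
    """B-format (W, X, Y, Z) of a single source s (assumed convention: W = s / sqrt(2))."""
    ce = math.cos(elevation)
    return (s / math.sqrt(2.0),
            s * math.cos(azimuth) * ce,
            s * math.sin(azimuth) * ce,
            s * math.sin(elevation))

def dominance_coefficients(k, azimuth, elevation):
    """Return (a, b, c, d) such that a*W + b*X + c*Y + d*Z equals k*s
    when the sound field is a single source in the given direction."""
    ce = math.cos(elevation)
    ux = math.cos(azimuth) * ce
    uy = math.sin(azimuth) * ce
    uz = math.sin(elevation)
    # Half of the target gain comes from W, half from the directional components.
    return (k / math.sqrt(2.0), 0.5 * k * ux, 0.5 * k * uy, 0.5 * k * uz)

def encode_bin(coeffs, bformat):
    """Apply one row of the encoding equations to one time-frequency bin."""
    return sum(c * v for c, v in zip(coeffs, bformat))

# Placeholder complex factors for one direction (not a real mapping formula).
k_left, k_right = 0.8, 0.6j
s = 0.3 - 0.7j
azimuth, elevation = 1.1, 0.4
b = encode_plane_wave(s, azimuth, elevation)
l_t = encode_bin(dominance_coefficients(k_left, azimuth, elevation), b)
r_t = encode_bin(dominance_coefficients(k_right, azimuth, elevation), b)
print(abs(l_t - k_left * s), abs(r_t - k_right * s))  # both differences are negligible
```

In a full encoder the dominance direction would be estimated per time-frequency bin from (W, X, Y, Z), and k_L and k_R would be derived from that direction by the chosen mapping.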

An audio scene may consist of discrete sound sources (such as talkers or musical instruments) or diffuse sounds (such as rain, applause, or reverberation). Some sounds may be partially diffuse, for example the rumble of a large engine. In a spatial encoder, it may be beneficial to process discrete sounds, which arrive at the microphones from distinct directions, differently from diffuse sounds.

FIG. 14 is a block diagram illustrating an embodiment of the spatial encoder 410 in which the input microphone signals are first decomposed into direct and diffuse components. The direct and diffuse components are then encoded separately so as to preserve their different spatial characteristics. An example method for direct/diffuse decomposition of multichannel audio signals is described, for example, in Thompson et al., "Direct-Diffuse Decomposition of Multichannel Signals Using a System of Pairwise Correlations," presented at the 133rd AES Convention (October 2012). It should be understood that direct/diffuse decomposition can be used in conjunction with the various spatial encoding systems described earlier.

Audio signals captured by microphones in outdoor settings may be corrupted by wind noise. In some cases, wind noise may severely degrade the signal quality on one or more microphones. In these and other situations, it is beneficial to include a wind noise detection module. FIG. 15 is a block diagram illustrating an embodiment of the system 100 and method that includes a wind noise detector. As shown in FIG. 15, N microphone signals are input to an adaptive spatial encoder 1500. A wind noise detector 1510 provides an estimate of the wind noise energy or energy ratio in each microphone. Severely corrupted microphone signals can be adaptively excluded from the channel combinations used in the encoder. Partially corrupted microphones, on the other hand, can be given reduced weight in the encoding combinations to control the amount of wind noise in the encoded signal. In some cases (such as when capturing a fast-moving outdoor action scene), the adaptive encoding based on wind noise detection can be configured to convey at least some portion of the wind noise in the encoded audio signal.
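The exclusion and down-weighting behavior can be sketched as a mapping from per-microphone wind-noise energy ratios to combination weights. The two thresholds, the linear ramp between them, and the function name are hypothetical choices made for illustration; the source does not specify how the weights are derived.

```python
def wind_adaptive_weights(wind_ratios, exclude_above=0.8, full_weight_below=0.2):
    """Map wind-noise energy ratios (0..1, one per microphone) to encoder weights.

    Ratios at or above exclude_above drop the microphone entirely,
    ratios at or below full_weight_below leave it untouched, and
    ratios in between are reduced linearly.
    """
    weights = []
    for ratio in wind_ratios:
        if ratio >= exclude_above:
            weights.append(0.0)   # severely corrupted: exclude
        elif ratio <= full_weight_below:
            weights.append(1.0)   # essentially clean: full weight
        else:
            span = exclude_above - full_weight_below
            weights.append((exclude_above - ratio) / span)
    return weights

# Three microphones: clean, partially corrupted, severely corrupted.
print(wind_adaptive_weights([0.0, 0.5, 0.9]))
```

Raising `exclude_above` toward 1.0 is one way to let some wind noise through, as might be wanted for an outdoor action scene.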

Adaptive encoding can also be useful to account for blockage of one or more microphones from the acoustic environment, for example by a finger of the device user or by dirt accumulated on the device. A blocked microphone provides poor signal capture, and spatial information derived from its signal may be misleading because of the low signal level. Detection of a blockage condition can be used to exclude blocked microphones from the encoding process.

In some embodiments, it may be desirable to perform editing operations on the audio scene before the signal is encoded for storage or distribution. Such editing operations may include zooming in or out relative to a certain sound source, removing unwanted sound components (such as background noise), and adding sound objects to the scene. FIG. 16 illustrates a system for capturing N microphone signals and converting them to an M-channel format suitable for editing.

Specifically, the N microphone signals are input to a spatial analyzer and converter 1600. The resulting M-channel signal output by the converter 1600 is provided to an audio scene editor 1610, which is controlled by a user to effect the desired modifications to the scene. After the modifications are made, the scene is spatially encoded by a spatial encoder 1620. For illustration purposes, FIG. 16 shows a two-channel SES format. Alternatively, the N microphone signals may be provided directly to the editing tool.

In embodiments where the capture device is configured to provide only the two-channel SES format, the SES may be decoded to a multichannel format suitable for editing and then re-encoded for storage or distribution. Because the additional decoding and encoding steps may introduce some degradation in spatial fidelity, it is preferable to enable editing operations on the multichannel format before two-channel spatial encoding. In some embodiments, the device may be configured to output the two-channel SES concurrently with the M-channel format or the N microphone signals intended for editing.

In some embodiments, the SES may be imported into a nonlinear video editing suite and manipulated in the same way as a conventional stereo movie capture. Provided that no spatially deleterious audio processing effects are applied to the content, the spatial integrity of the resulting content will remain intact after editing. SES decoding and reformatting may also be applied as part of the video editing suite. For example, if the content is being burned to a DVD or Blu-ray disc, multichannel speaker decoding and reformatting may be applied and the result encoded in a multichannel format for subsequent multichannel playback. Alternatively, the audio content may be authored "as is" for conventional stereo playback on any compatible playback hardware. In that case, SES decoding may be applied on the playback device if the appropriate reformatting algorithm is present on the device.

FIG. 17 illustrates an embodiment of the system and method in which the captured audio scene is modified as part of the decoding process. More specifically, N microphone signals are encoded by a spatial encoder 1700 into an SES, which in some embodiments includes side information S. The SES is stored, transmitted, or both. A spatial decoder 1710 is used to decode the encoded SES, and a renderer 1720 provides Q playback signals. Scene modification parameters are used by the decoder 1710 to modify the audio scene.

In some preferred embodiments, the scene modification occurs at a point in the decoding process where the modification can be carried out efficiently. For example, in a virtual reality application that uses headphones for audio rendering, it is critical that the spatial cues of the sound scene be updated in real time according to the motion of the user's head, so that the perceived localization of sound objects matches that of their visual counterparts. To achieve this, a head-tracking device is used to detect the orientation of the user's head. The virtual audio rendering is then continuously updated based on these estimates so that the reproduced sound scene appears independent of the listener's head motion.

The estimate of the head orientation can be incorporated in the decoding process of the spatial decoder 1710 so that the renderer 1720 reproduces a stable audio scene. This is equivalent to rotating the scene before decoding, or to rendering to a rotated intermediate format (the P channels output by the spatial decoder) before virtualization. In embodiments where side information is included in the SES, such scene rotation may include manipulation of the spatial metadata included in the side information.
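For a scene held in first-order B-format, compensating for a head turn about the vertical axis reduces to a 2-D rotation of the X and Y components. The sketch below assumes X points forward, Y points to the left, azimuth and yaw are measured counterclockwise when seen from above, and only yaw is tracked; the function name is hypothetical and the source does not prescribe B-format for this step.

```python
import math

def rotate_bformat_yaw(w, x, y, z, head_yaw):
    """Counter-rotate a B-format sample by the tracked head yaw (radians).

    A source at azimuth phi in the room ends up at azimuth phi - head_yaw
    relative to the head, so the rendered scene stays fixed in the room.
    W and Z are unaffected by a rotation about the vertical axis.
    """
    c, s = math.cos(head_yaw), math.sin(head_yaw)
    return w, x * c + y * s, -x * s + y * c, z

# A source directly to the left (azimuth 90 degrees); the listener turns
# 90 degrees to the left, so the source should now appear straight ahead.
w, x, y, z = rotate_bformat_yaw(1 / math.sqrt(2.0), 0.0, 1.0, 0.0, math.pi / 2)
print(round(x, 6), round(y, 6))
```

Pitch and roll would require a full 3-D rotation of (X, Y, Z), which this sketch leaves out.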

Other modifications of interest that may be supported in the spatial decoding process include warping the width of the audio scene and audio zoom. In some embodiments, the decoded audio signal may be spatially warped to match the field of view of the original video recording. For example, if the original video used a wide-angle lens, the audio scene may be stretched across a similar angular arc in order to better match audio and visual cues. In some embodiments, the audio may be modified to zoom in on a spatial region of interest or to zoom out from a region; audio zoom may be combined with a video zoom modification.

In some embodiments, the decoder may modify the spatial characteristics of the decoded signal in order to steer or emphasize the decoded signal at a particular spatial location. This can allow the salience of certain auditory events, such as dialogue, to be increased or reduced. In some embodiments, this may be facilitated by the use of a speech detection algorithm.

III. Operational Overview

Embodiments of the sound field coding system 100 and method use an arbitrary microphone array configuration to capture a sound field representing an immersive audio scene. The captured audio is encoded in a generic SES format that is agnostic to the microphone array configuration used.

FIG. 18 is a flow diagram illustrating the general operation of an embodiment of the capture component 110 of the sound field coding system 100 shown in FIGS. 1-17. The operation begins by selecting a microphone configuration that includes a plurality of microphones (block 1800). These microphones are used to capture sound from at least one audio source. The microphone configuration defines a microphone directivity pattern for each microphone relative to a reference direction. In addition, a virtual microphone configuration that includes a plurality of virtual microphones is selected (block 1810).

The method calculates spatial encoding coefficients based on the microphone configuration and the virtual microphone configuration (block 1820). The spatial encoding coefficients are used to convert the microphone signals from the plurality of microphones into a spatially encoded signal (block 1830). The output of the system 100 is the spatially encoded signal (block 1840). This signal contains encoded spatial information about the position of the audio source relative to the reference direction.

As noted above, various other embodiments of the system 100 and method are disclosed herein. By way of example, and not limitation, referring again to FIG. 7, the spatial encoder 410 can be generalized from an N:2 spatial encoder to an N:T spatial encoder. Moreover, within the scope of the present invention, various other embodiments can be realized for an encoder that produces a two-channel SES (L_T, R_T) compatible both with direct two-channel stereo playback and with a phase-amplitude matrix decoder configured for immersive audio reproduction in flexible playback configurations. In embodiments using a standard microphone configuration (such as Ambisonic A-format or B-format), the two-channel encoding equations can be specified based on the directivity patterns defined for the microphone format.

More generally, in embodiments where the microphones may be placed in a non-standard configuration because of device design constraints or the ad hoc nature of a network of devices, the spatially encoded signals may be derived by combining the microphone signals based on the relative microphone positions and the measured or estimated directivities of the microphones. The combinations may be formed to best approximate the prescribed directivity patterns suitable for two-channel SES encoding. Given the directivity patterns of N microphones mounted on a respective recording device or accessory (where a directivity pattern is a complex amplitude factor characterizing the response of the microphone as a function of frequency f and 3-D position), a set of coefficients k_Ln(f) and k_Rn(f) may be optimized for each microphone, at each frequency, to form virtual microphone directivity patterns for the left and right SES channels:

where the coefficient optimization is carried out to minimize an error criterion between the resulting left and right virtual microphone directivity patterns and the prescribed left and right directivity patterns for each encoded channel.
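The approximation targeted by the coefficient optimization can be written out under an assumed notation. The symbols $G_n(f,\Omega)$ for the directivity pattern of the n-th microphone, $D_L(f,\Omega)$ and $D_R(f,\Omega)$ for the prescribed left and right patterns, and $\Omega$ for a 3-D direction are introduced here for illustration only and are not taken from the source. The sum of squared errors over sampled directions $\Omega_j$ is shown as one possible error criterion, with the right channel treated in the same way.

```latex
\[
\sum_{n=1}^{N} k_{Ln}(f)\, G_n(f,\Omega) \approx D_L(f,\Omega),
\qquad
\sum_{n=1}^{N} k_{Rn}(f)\, G_n(f,\Omega) \approx D_R(f,\Omega)
\]
\[
\min_{k_{L1}(f),\dots,k_{LN}(f)} \;
\sum_{j} \Bigl|\, \sum_{n=1}^{N} k_{Ln}(f)\, G_n(f,\Omega_j) - D_L(f,\Omega_j) \Bigr|^{2}
\]
```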

In some embodiments, the microphone responses can be combined to form the prescribed virtual microphone directivity patterns exactly, in which case equality holds in the above expressions. For example, in the embodiments described in connection with FIGS. 9B and 9C, the B-format microphone responses are combined to achieve the prescribed virtual microphone responses exactly. In some embodiments, the coefficient optimization may be carried out using an optimization method such as least-squares approximation.
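An exact combination of this kind can be illustrated with a first-order virtual microphone formed from horizontal B-format components. The sketch assumes the convention W = S / sqrt(2) and a pattern parameter p (1 for omnidirectional, 0.5 for cardioid, 0 for figure-of-eight); it is a generic construction, not a reproduction of the specific embodiments of FIGS. 9B and 9C, and the function name is hypothetical.

```python
import math

def virtual_microphone(w, x, y, look_azimuth, p=0.5):
    """First-order horizontal virtual microphone pointing at look_azimuth (radians).

    For a unit source at azimuth phi the output is
    p + (1 - p) * cos(phi - look_azimuth), i.e. exactly the prescribed pattern.
    """
    directional = x * math.cos(look_azimuth) + y * math.sin(look_azimuth)
    return p * math.sqrt(2.0) * w + (1.0 - p) * directional

# Unit source at 30 degrees azimuth, probed with virtual cardioids.
phi = math.radians(30.0)
w, x, y = 1 / math.sqrt(2.0), math.cos(phi), math.sin(phi)
on_axis = virtual_microphone(w, x, y, phi)
side = virtual_microphone(w, x, y, phi + math.pi / 2)
rear = virtual_microphone(w, x, y, phi + math.pi)
print(round(on_axis, 6), round(side, 6), round(rear, 6))
```

A left and right pair of such virtual microphones, aimed in different directions, is one simple way to obtain two prescribed patterns from the same B-format signal.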

The two-channel SES encoding equations are then given by:

$L_T(f,t) = \sum_{n=1}^{N} k_{Ln}(f)\, S_n(f,t)$

$R_T(f,t) = \sum_{n=1}^{N} k_{Rn}(f)\, S_n(f,t)$

where L_T(f, t) and R_T(f, t) denote the frequency-domain representations of the left and right SES channels, respectively, and S_n(f, t) denotes the frequency-domain representation of the n-th microphone signal.
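Applying the optimized coefficients amounts to a weighted sum of the microphone signals in each time-frequency bin. The sketch below shows that step for a single bin; the function name and the example coefficient values are hypothetical.

```python
def encode_ses_bin(mic_bins, k_left, k_right):
    """Combine N microphone values S_n(f, t) from one time-frequency bin
    into the left and right SES values using coefficients k_Ln(f) and k_Rn(f)."""
    if not (len(mic_bins) == len(k_left) == len(k_right)):
        raise ValueError("need one left and one right coefficient per microphone")
    l_t = sum(k * s for k, s in zip(k_left, mic_bins))
    r_t = sum(k * s for k, s in zip(k_right, mic_bins))
    return l_t, r_t

# Three microphones, arbitrary example values for one bin.
mic_bins = [1 + 1j, 2, -1j]
k_left = [1, 0.5, 0]
k_right = [0, 0.5, 1j]
print(encode_ses_bin(mic_bins, k_left, k_right))
```

Running this over every frequency bin and time frame, followed by the inverse time-frequency transform, yields the time-domain two-channel SES.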

Similarly, in some embodiments according to FIG. 4, optimal directivity patterns may be formed for T virtual microphones corresponding to T encoded signals, where T is not equal to two. In embodiments according to FIG. 8, optimal directivity patterns may be formed for M virtual microphones corresponding to M channels in an intermediate format, where each channel in the intermediate format has a prescribed directivity pattern; the M channels in the intermediate format are subsequently encoded to two channels. In other embodiments, the M intermediate channels may be encoded to T channels, where T is not equal to two.

From the above description of the various embodiments, it should be understood that the invention may be used to encode any microphone format and, furthermore, that if the microphone format provides directionally selective responses, the spatial encoding and decoding can preserve that directional selectivity. Other microphone formats that may be incorporated in the capture and encoding system include, but are not limited to, XY stereo microphones and non-coincident microphones, which may be time-aligned based on frequency-domain spatial analysis to support matrix encoding and decoding.

From the description of the frequency-domain operations incorporated in the various embodiments above, it should be understood that a frequency-domain analysis may be carried out in conjunction with any of the embodiments in order to increase the spatial fidelity of the encoding process. In other words, frequency-domain processing will result in a decoded scene that matches the captured scene more accurately than a purely time-domain approach, at the cost of the additional computation needed for the time-frequency transform, the frequency-domain analysis, and the inverse transform after spatial encoding.

IV. Exemplary Operating Environment

Many variations other than those described herein will be apparent from this document. For example, depending on the embodiment, certain acts, events, or functions of any of the methods and algorithms described herein can be performed in a different sequence, or can be added, merged, or left out altogether (such that not all described acts or events are necessary for the practice of the methods and algorithms). Moreover, in certain embodiments, acts or events can be performed concurrently rather than sequentially, such as through multithreaded processing, interrupt processing, multiple processors or processor cores, or other parallel architectures. In addition, different tasks or processes can be performed by different machines and computing systems that can function together.

The various illustrative logical blocks, modules, methods, and algorithm processes and sequences described in connection with the embodiments disclosed herein can be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, and process actions have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the overall system. The described functionality can be implemented in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of this document.

The various illustrative logical blocks and modules described in connection with the embodiments disclosed herein can be implemented or performed by a machine, such as a general-purpose processor, a processing device, a computing device having one or more processing devices, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor or processing device can be a microprocessor, but in the alternative the processor can be a controller, a microcontroller, a state machine, combinations of the same, or the like. A processor can also be implemented as a combination of computing devices, such as a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.

Embodiments of the sound field coding system and method described herein are operational within numerous types of general-purpose or special-purpose computing system environments or configurations. In general, a computing environment can include any type of computer system, including, but not limited to, a computer system based on one or more microprocessors, a mainframe computer, a digital signal processor, a portable computing device, a personal organizer, a device controller, a computational engine within an appliance, a mobile phone, a desktop computer, a mobile computer, a tablet computer, a smartphone, and an appliance with an embedded computer, to name a few.

Such computing devices can typically be found in devices having at least some minimum computational capability, including, but not limited to, personal computers, server computers, handheld computing devices, laptop or mobile computers, communications devices (such as cell phones and PDAs), multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, audio or video media players, and so forth. In some embodiments, the computing devices will include one or more processors. Each processor may be a specialized microprocessor, such as a digital signal processor (DSP), a very long instruction word (VLIW) processor, or other microcontroller, or may be a conventional central processing unit (CPU) having one or more processing cores, including specialized graphics processing unit (GPU)-based cores in a multi-core CPU.

The process actions of a method, process, or algorithm described in connection with the embodiments disclosed herein can be embodied directly in hardware, in a software module executed by a processor, or in any combination of the two. The software module can be contained in computer-readable media that can be accessed by a computing device. The computer-readable media include both volatile and nonvolatile media that are removable, non-removable, or some combination thereof. The computer-readable media are used to store information such as computer-readable or computer-executable instructions, data structures, program modules, or other data. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media.

Computer storage media include, but are not limited to, computer- or machine-readable media or storage devices such as Blu-ray discs (BD), digital versatile discs (DVDs), compact discs (CDs), floppy disks, tape drives, hard drives, optical drives, solid-state memory devices, RAM memory, ROM memory, EPROM memory, EEPROM memory, flash memory or other memory technology, magnetic cassettes, magnetic tapes, magnetic disk storage or other magnetic storage devices, or any other device that can be used to store the desired information and that can be accessed by one or more computing devices.

A software module can reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of non-transitory computer-readable storage medium or physical computer storage known in the art. An exemplary storage medium can be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium can be integral to the processor. The processor and the storage medium can reside in an application-specific integrated circuit (ASIC). The ASIC can reside in a user terminal. Alternatively, the processor and the storage medium can reside as discrete components in a user terminal.

The phrase "non-transitory" as used in this document means "enduring or long-lived." The phrase "non-transitory computer-readable media" includes any and all computer-readable media, with the sole exception of a transitory, propagating signal. This includes, by way of example and not limitation, non-transitory computer-readable media such as register memory, processor cache, and random-access memory (RAM).

Retention of information such as computer-readable or computer-executable instructions, data structures, program modules, and so forth can also be accomplished by using a variety of communication media to encode one or more modulated data signals, electromagnetic waves (such as carrier waves), or other transport mechanisms or communications protocols, and includes any wired or wireless information delivery mechanism. In general, these communication media refer to a signal that has one or more of its characteristics set or changed in such a manner as to encode information or instructions in the signal. For example, communication media include wired media, such as a wired network or direct-wired connection carrying one or more modulated data signals, and wireless media, such as acoustic, radio frequency (RF), infrared, laser, and other wireless media for transmitting, receiving, or both transmitting and receiving one or more modulated data signals or electromagnetic waves. Combinations of any of the above should also be included within the scope of communication media.

Further, one or any combination of software, programs, or computer program products that embody some or all of the various embodiments of the sound field coding systems and methods described herein, or portions thereof, may be stored, received, transmitted, or read from any desired combination of computer- or machine-readable media or storage devices and communication media, in the form of computer-executable instructions or other data structures.

Embodiments of the sound field coding systems and methods described herein may be further described in the general context of computer-executable instructions, such as program modules, being executed by a computing device. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The embodiments described herein may also be practiced in distributed computing environments in which tasks are performed by one or more remote processing devices, or within a cloud of one or more devices, that are linked through one or more communication networks. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including media storage devices. Still further, the aforementioned instructions may be implemented, in part or in whole, as hardware logic circuits, which may or may not include a processor.

Conditional language used herein, such as, among others, "can," "might," "may," and "for example," unless expressly stated otherwise or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements, and/or states. Thus, such conditional language is not generally intended to imply that features, elements, and/or states are in any way required for one or more embodiments, or that one or more embodiments necessarily include logic for deciding, with or without author input or prompting, whether these features, elements, and/or states are included or are to be performed in any particular embodiment. The terms "comprising," "including," "having," and the like are synonymous, are used inclusively in an open-ended fashion, and do not exclude additional elements, features, acts, operations, and so forth. Also, the term "or" is used in its inclusive sense (and not in its exclusive sense), so that when used, for example, to connect a list of elements, the term "or" means one, some, or all of the elements in the list.

While the above detailed description has shown, described, and pointed out novel features as applied to various embodiments, it will be understood that various omissions, substitutions, and changes in the form and details of the devices or algorithms illustrated may be made without departing from the scope of the present disclosure. As will be recognized, certain embodiments of the invention described herein may be embodied in a form that does not provide all of the features and benefits set forth herein, as some features may be used or practiced separately from others.

Moreover, although the subject matter has been described in language specific to structural features and methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims

1. A method for processing signals from multiple capture microphones, comprising:

selecting a capture microphone configuration having a plurality of capture microphones for capturing sound from at least one audio source, the capture microphone configuration defining a capture microphone directivity of each of the plurality of capture microphones relative to a reference direction;

selecting a virtual microphone configuration having a plurality of virtual microphones, the virtual microphone configuration being used to encode spatial information about the position of the at least one audio source relative to the reference direction, the virtual microphone configuration defining a directivity of each of the plurality of virtual microphones relative to the reference direction;

calculating spatial encoding coefficients based on the capture microphone configuration and the virtual microphone configuration; and

converting the plurality of capture microphone signals into a spatially encoded signal that includes virtual microphone signals;

wherein each of the virtual microphone signals is obtained from the capture microphone signals using the spatial encoding coefficients; and

wherein the capture microphone directivity is a complex amplitude factor that characterizes the microphone's response as a function of frequency and of the 3D position of the at least one audio source.

2. The method of claim 1, wherein the spatial information is encoded in one of the following forms: (a) inter-channel amplitude differences; or (b) inter-channel phase differences.

3. The method of claim 2, further comprising selecting a virtual microphone configuration having a plurality of virtual microphones, the virtual microphone configuration being used to encode spatial information about the position of the audio source relative to a reference direction.

4. The method of claim 1, wherein the plurality of capture microphone signals are A-format microphone signals, the method further comprising converting the A-format microphone signals into B-format microphone signals.

5. The method of claim 4, further comprising forming a virtual microphone directional pattern from a B-format microphone signal.

6. The method of claim 5, further comprising using the following equation to form a virtual microphone directional pattern:

where θ_L, θ_R, θ_S, and p are design parameters; W is the omnidirectional pressure signal in B-format; X is the front-back figure-eight signal in B-format; Y is the left-right figure-eight signal in B-format; and V_L, V_R, and V_S are, respectively, the virtual left, virtual right, and virtual surround microphone signals, each corresponding to a supercardioid in the horizontal plane.

7. The method of claim 6, further comprising selecting the design parameter p based on the desired directivity of the virtual microphone signals.

8. A method for processing an audio signal comprising signals from multiple capture microphones, comprising:

selecting a capture microphone configuration having a plurality of capture microphones for capturing sound from an audio source, the capture microphone configuration defining a capture microphone directivity of each of the plurality of capture microphones relative to a reference direction;

calculating spatial encoding coefficients based on the capture microphone configuration; and

converting the plurality of capture microphone signals into a spatially encoded signal using the spatial encoding coefficients, wherein the spatially encoded signal is a two-channel spatially encoded signal that carries encoded spatial information about the position of the audio source relative to the reference direction;

wherein the capture microphone directivity is a complex amplitude factor that characterizes the microphone's response as a function of frequency and of the 3D position of the audio source.

9. The method of claim 8, wherein the spatially encoded signal is a phase-amplitude spatially encoded signal.

10. A method for processing signals from multiple capture microphones, comprising:

converting the plurality of capture microphone signals into a spatially encoded signal using spatial encoding coefficients, wherein the spatially encoded signal has at least two channels and carries encoded spatial information about the position of the audio source relative to a reference direction;

11. The method of claim 10, wherein the spatial information is conveyed in part in the form of positional audio metadata.
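
By way of illustration only, the formation of virtual microphone signals from B-format signals recited in claims 5 to 7 may be sketched as follows. The equation of claim 6 is not reproduced in this text, so the sketch assumes the conventional first-order form V = p·√2·W + (1 − p)(cos θ·X + sin θ·Y), with the traditional B-format convention in which W carries a gain of 1/√2. The function names, the azimuth convention (positive toward the left), and the example angles are hypothetical and are not taken from the claims.

```python
import math


def encode_b_format(sample, azimuth):
    """Encode one sample of a horizontal plane-wave source into (W, X, Y).

    Assumption: traditional B-format, where W carries a 1/sqrt(2) gain,
    X is the front-back figure-eight and Y is the left-right figure-eight.
    """
    w = sample / math.sqrt(2.0)
    x = sample * math.cos(azimuth)
    y = sample * math.sin(azimuth)
    return w, x, y


def virtual_microphone(w, x, y, theta, p):
    """First-order virtual microphone aimed at azimuth theta (radians).

    p blends the omnidirectional and figure-eight components:
    p = 1 is omnidirectional, p = 0.5 is a cardioid, p = 0 is a figure-eight.
    """
    figure_eight = math.cos(theta) * x + math.sin(theta) * y
    return p * math.sqrt(2.0) * w + (1.0 - p) * figure_eight


def form_virtual_microphones(w, x, y, theta_l, theta_r, theta_s, p):
    """Return the (V_L, V_R, V_S) virtual microphone signals."""
    return (
        virtual_microphone(w, x, y, theta_l, p),
        virtual_microphone(w, x, y, theta_r, p),
        virtual_microphone(w, x, y, theta_s, p),
    )


if __name__ == "__main__":
    # Hypothetical design parameters: left and right microphones at +/-60
    # degrees, surround microphone at 180 degrees, p near a supercardioid.
    w, x, y = encode_b_format(1.0, math.radians(60.0))
    v_l, v_r, v_s = form_virtual_microphones(
        w, x, y, math.radians(60.0), math.radians(-60.0), math.pi, 0.37
    )
    print(v_l, v_r, v_s)
```

Under these assumptions, a source at azimuth φ appears in a virtual microphone aimed at θ with gain p + (1 − p)·cos(φ − θ), so selecting p as in claim 7 trades the width of the front lobe against rear rejection. A value of p near 0.37 is commonly associated with a supercardioid pattern.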