
HK1222470B - Hybrid waveform-coded and parametric-coded speech enhancement - Google Patents


Info

Publication number
HK1222470B
HK1222470B (application HK16110573.6A)
Authority
HK
Hong Kong
Prior art keywords
audio
speech
content
enhancement
channel
Application number
HK16110573.6A
Other languages
Chinese (zh)
Other versions
HK1222470A1 (en)
Inventor
Jeroen Koppens
Hannes Muesch
Original Assignee
Dolby Laboratories Licensing Corporation
Dolby International AB
Application filed by Dolby Laboratories Licensing Corporation and Dolby International AB
Priority claimed from PCT/US2014/052962 (WO2015031505A1)
Publication of HK1222470A1
Publication of HK1222470B


Description

Hybrid Waveform-Coded and Parametric-Coded Speech Enhancement

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. Provisional Patent Application No. 61/870,933, filed August 28, 2013, U.S. Provisional Patent Application No. 61/895,959, filed October 25, 2013, and U.S. Provisional Patent Application No. 61/908,664, filed November 25, 2013, each of which is incorporated herein by reference in its entirety.

Technical Field

The present invention relates to audio signal processing and, more particularly, to enhancement of the speech content of an audio program relative to the program's other content, where the speech enhancement is "hybrid" in the sense that it comprises waveform-coded enhancement (or relatively more waveform-coded enhancement) under some signal conditions and parametric-coded enhancement (or relatively more parametric-coded enhancement) under other signal conditions. Other aspects are the encoding, decoding, and rendering of audio programs that include data sufficient to enable such hybrid speech enhancement.

Background Art

In movies and television, dialogue and narration are often presented together with other, non-speech audio such as music, effects, or ambience from sporting events. In many cases, the speech and non-speech sounds are captured separately and mixed together under the control of a sound engineer. The sound engineer selects the level of the speech relative to the level of the non-speech in a manner appropriate for the majority of listeners. However, some listeners, e.g., those with hearing impairment, have difficulty understanding the speech content of an audio program (with its engineer-determined speech-to-non-speech mixing ratio) and would prefer the speech to be mixed at a higher relative level.

A problem exists in enabling these listeners to increase the audibility of an audio program's speech content relative to that of its non-speech content.

One current approach is to provide the listener with two high-quality audio streams. One stream carries the primary-content audio (mainly speech) and the other carries the secondary-content audio (the remainder of the audio program, excluding the speech), and the user is given control over the mixing process. Unfortunately, this scheme is impractical because it does not build on the current practice of transmitting fully mixed audio programs. In addition, it requires approximately twice the bandwidth of current broadcast practice, because two independent audio streams, each at broadcast quality, must be delivered to the user.

Another speech enhancement method (referred to herein as "waveform-coded" enhancement) is described in U.S. Patent Application Publication No. 2010/0106507 A1, published April 29, 2010, assigned to Dolby Laboratories, Inc. and naming Hannes Muesch as inventor. In waveform-coded enhancement, the speech-to-background (non-speech) ratio of an original audio mix of speech and non-speech content (sometimes referred to as the main mix) is increased by adding to the main mix a reduced-quality version (low-quality replica) of the clean speech signal that has been sent to the receiver alongside the main mix. To reduce the bandwidth overhead, the low-quality replica is typically coded at a very low bit rate. Because of the low-bit-rate coding, coding artifacts are associated with the low-quality replica, and these artifacts are clearly audible when the replica is rendered and auditioned in isolation; thus the low-quality replica has an objectionable quality when auditioned alone. Waveform-coded enhancement attempts to hide these coding artifacts by adding the low-quality replica to the main mix only during times when the level of the non-speech components is high enough that the artifacts are masked by the non-speech components. As will be detailed later, limitations of this approach include the following: the amount of speech enhancement typically cannot be constant over time, and audio artifacts may become audible when the background (non-speech) components of the main mix are weak or when their frequency-amplitude spectrum differs greatly from that of the coding noise.

In accordance with waveform-coded enhancement, an audio program (for delivery to a decoder for decoding and subsequent rendering) is encoded as a bitstream that includes the low-quality speech replica (or an encoded version thereof) as a sidestream of the main mix. The bitstream may include metadata indicating a scaling parameter that determines the amount of waveform-coded speech enhancement to be performed (i.e., the scaling parameter determines the scaling factor to be applied to the low-quality speech replica before the scaled replica is combined with the main mix, or a maximum value of such a scaling factor that will ensure masking of the coding artifacts). When the current value of the scaling factor is 0, the decoder does not perform speech enhancement on the corresponding segment of the main mix. Although the current value of the scaling parameter (or the current maximum value it may attain) is typically determined in the encoder (since it is typically generated by a computationally intensive psychoacoustic model), it could also be generated in the decoder. In the latter case, no metadata indicating the scaling parameter needs to be sent from the encoder to the decoder; instead, the decoder may determine, from the main mix, the ratio of the power of the mix's speech content to the power of the mix, and may implement a model that determines the current value of the scaling parameter in response to the current value of that power ratio.
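As a purely illustrative sketch (not part of the disclosed embodiments), the operations described above amount to adding a scaled low-quality speech replica to the main mix; the function names and the simple decoder-side power-ratio model below are assumptions, not the disclosure's psychoacoustic model:

```python
import numpy as np

def waveform_coded_enhancement(main_mix, low_quality_replica, scale):
    # Add the scaled low-quality speech replica to the main mix.
    # A scaling factor of 0 leaves the segment unenhanced.
    return np.asarray(main_mix) + scale * np.asarray(low_quality_replica)

def scaling_from_power_ratio(speech_power, mix_power, max_scale=1.0):
    # Hypothetical decoder-side model: derive the scaling parameter from
    # the ratio of the mix's speech power to the total mix power, applying
    # no extra boost when speech already dominates and up to max_scale
    # when it does not.
    ratio = speech_power / max(mix_power, 1e-12)
    return max_scale * (1.0 - min(ratio, 1.0))
```

In a real system the scaling parameter (or its maximum) would typically be computed at the encoder by a psychoacoustic masking model and conveyed as metadata, as described above.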

Another method for enhancing the intelligibility of speech in the presence of competing audio (background), referred to herein as "parametric-coded" enhancement, is to segment the original audio program (typically a soundtrack) into time/frequency tiles and to boost the tiles according to the ratio of the power (or level) of their speech content to that of their background content, thereby enhancing the speech component relative to the background. The underlying idea of this method is akin to that of guided spectral-subtraction noise suppression. In an extreme example of the method, in which all tiles with an SNR (i.e., the ratio of the power or level of the speech component to that of the competing sound content) below a predetermined threshold are fully suppressed, the method has been shown to provide robust speech intelligibility enhancement. In applying this method to broadcasting, the speech-to-background ratio (SNR) may be inferred by comparing the original audio mix (of speech and non-speech content) to the mix's speech component. The inferred SNR may then be transformed into a suitable set of enhancement parameters that are transmitted alongside the original audio mix. At the receiver, these parameters may (optionally) be applied to the original audio mix to obtain a signal indicative of enhanced speech. As will be detailed later, parametric-coded enhancement works best when the speech signal (the speech component of the mix) dominates the background signal (the non-speech component of the mix).
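A minimal sketch of the per-tile processing just described (illustrative only; the SNR floor, the boost amount, and the function name are assumptions, and a real system would infer the SNR at the encoder and transmit the resulting gains as enhancement parameters):

```python
import numpy as np

def parametric_tile_gains(mix_power, speech_power, snr_floor_db=-6.0, boost_db=6.0):
    # Per time/frequency tile: infer the speech-to-background SNR by
    # comparing the mix's power with that of its speech component, then
    # fully suppress tiles below the SNR floor (the "extreme example"
    # described above) and boost the remaining tiles by a fixed amount.
    mix_power = np.asarray(mix_power, dtype=float)
    speech_power = np.asarray(speech_power, dtype=float)
    background_power = np.maximum(mix_power - speech_power, 1e-12)
    snr_db = 10.0 * np.log10(np.maximum(speech_power, 1e-12) / background_power)
    return np.where(snr_db < snr_floor_db, 0.0, 10.0 ** (boost_db / 20.0))
```

The resulting gain per tile would be applied multiplicatively to that tile of the original mix at the receiver.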

Waveform-coded enhancement requires that a low-quality replica of the audio program's speech component be available at the receiver. To limit the data overhead incurred in sending that replica alongside the main audio mix, the replica is coded at a very low bit rate and exhibits coding distortions. These coding distortions are likely to be masked by the original audio when the level of the non-speech components is high. When the coding distortions are masked, the quality of the resulting enhanced audio is very good.

Parametric-coded enhancement is based on parsing the main audio mix signal into time/frequency tiles and applying a suitable gain or attenuation to each of those tiles. The data rate needed to relay these gains to the receiver is low compared with that of waveform-coded enhancement. However, because of the limited temporal-spectral resolution of the parameters, the speech, when mixed with non-speech audio, cannot be manipulated without also affecting the non-speech audio. Parametric-coded enhancement of the speech content of an audio mix therefore introduces modulation into the mix's non-speech content, and this modulation ("background modulation") may become objectionable upon playback of the speech-enhanced mix. Background modulation is most likely to be objectionable when the speech-to-background ratio is very low.

The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have previously been conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, unless otherwise indicated, it should not be assumed that issues identified with respect to one or more approaches have been recognized in any prior art on the basis of this section.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar elements, and in which:

FIG. 1 is a block diagram of a system configured to generate prediction parameters for reconstructing the speech content of a single-channel mixed-content signal (having speech and non-speech content).

FIG. 2 is a block diagram of a system configured to generate prediction parameters for reconstructing the speech content of a multi-channel mixed-content signal (having speech and non-speech content).

FIG. 3 is a block diagram of a system including an encoder configured to perform an embodiment of the inventive encoding method to generate an encoded audio bitstream indicative of an audio program, and a decoder configured to decode the encoded audio bitstream and to perform speech enhancement (in accordance with an embodiment of the inventive method).

FIG. 4 is a block diagram of a system configured to render a multi-channel mixed-content audio signal, including by performing conventional speech enhancement thereon.

FIG. 5 is a block diagram of a system configured to render a multi-channel mixed-content audio signal, including by performing conventional parametric-coded speech enhancement thereon.

FIG. 6 and FIG. 6A are block diagrams of systems configured to render a multi-channel mixed-content audio signal, including by performing an embodiment of the inventive speech enhancement method thereon.

FIG. 7 is a block diagram of a system for performing an embodiment of the inventive encoding method using an auditory masking model.

FIG. 8A and FIG. 8B illustrate example process flows, and

FIG. 9 illustrates an example hardware platform on which a computer or computing device as described herein may be implemented.

DETAILED DESCRIPTION

Example embodiments relating to hybrid waveform-coded and parametric-coded speech enhancement are described herein. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are not described in exhaustive detail, in order to avoid unnecessarily occluding, obscuring, or confusing the present invention.

Example embodiments are described herein according to the following outline:

1. General Overview

2. Symbols and Terminology

3. Generation of Prediction Parameters

4. Speech Enhancement Operations

5. Speech Rendering

6. Mid/Side Representation

7. Example Process Flows

8. Implementation Mechanisms - Hardware Overview

9. Equivalents, Extensions, Alternatives, and Miscellaneous

1. General Overview

This overview presents a basic description of some aspects of embodiments of the present invention. It should be noted that this overview is not an extensive or exhaustive summary of aspects of the embodiments. Moreover, it should be noted that this overview is not intended to be understood as identifying any particularly significant aspects or elements of the embodiments, nor as delineating any scope of the embodiments in particular or the invention in general. This overview merely presents some concepts relating to the example embodiments in a condensed and simplified format, and should be understood as merely a conceptual prelude to the more detailed description of the example embodiments that follows below. Note that, although separate embodiments are discussed herein, any combination of the embodiments and/or partial embodiments discussed herein may be combined to form further embodiments.

The inventors have recognized that the respective strengths and weaknesses of parametric-coded enhancement and waveform-coded enhancement can offset each other, and that conventional speech enhancement can be substantially improved by a hybrid enhancement method that employs parametric-coded enhancement (or a blend of parametric-coded and waveform-coded enhancement) under some signal conditions and waveform-coded enhancement (or a different blend of parametric-coded and waveform-coded enhancement) under other signal conditions. Typical embodiments of the inventive hybrid enhancement method provide more stable and better-quality speech enhancement than can be achieved by either parametric-coded or waveform-coded enhancement alone.

In a class of embodiments, the inventive method includes the steps of: (a) receiving a bitstream indicative of an audio program including speech having an unenhanced waveform and other audio content, wherein the bitstream includes: audio data indicative of the speech content and the other audio content (where the audio data has been generated by mixing speech data with non-speech data); waveform data indicative of a reduced-quality version of the speech (the waveform data typically comprising fewer bits than the speech data), wherein the reduced-quality version has a second waveform similar (e.g., at least substantially similar) to the unenhanced waveform and would have an objectionable quality if auditioned in isolation; and parametric data, wherein the parametric data together with the audio data determine parametrically constructed speech, the parametrically constructed speech being a parametrically reconstructed version of the speech that at least substantially matches (e.g., is a good approximation of) the speech; and (b) performing speech enhancement on the bitstream in response to a blend indicator, thereby generating data indicative of a speech-enhanced audio program, including by combining the audio data with a combination of low-quality speech data determined from the waveform data and reconstructed speech data, wherein the combination is determined by the blend indicator (e.g., the combination has a sequence of states determined by a sequence of current values of the blend indicator), and the reconstructed speech data is generated in response to at least some of the parametric data and at least some of the audio data. The speech-enhanced audio program has fewer audible speech enhancement artifacts (e.g., speech enhancement artifacts that are better masked, and hence less audible, when the speech-enhanced audio program is rendered and auditioned) than either a purely waveform-coded speech-enhanced audio program determined by combining only the low-quality speech data (indicative of the reduced-quality version of the speech) with the audio data, or a purely parametric-coded speech-enhanced audio program determined from the parametric data and the audio data.
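The combining operation of step (b), i.e., combining the audio data with a blend-indicator-determined combination of the low-quality speech data and the reconstructed speech data, might be sketched as follows (purely illustrative; the linear blend and all names are assumptions, not the claimed implementation):

```python
def hybrid_enhance(audio, low_quality_speech, reconstructed_speech, alpha, gain=1.0):
    # alpha represents the blend-indicator state for this segment:
    #   1.0 -> purely waveform-coded enhancement,
    #   0.0 -> purely parametric-coded enhancement,
    # with intermediate values blending the two speech contributions.
    blended = [alpha * w + (1.0 - alpha) * p
               for w, p in zip(low_quality_speech, reconstructed_speech)]
    # Add the blended speech contribution to the audio data of the segment.
    return [a + gain * b for a, b in zip(audio, blended)]
```

In practice the blend state would vary per segment (and possibly per frequency band), following the sequence of blend-indicator values.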

Herein, "speech enhancement artifacts" (or "speech enhancement coding artifacts") denotes distortion (typically a measurable distortion) of an audio signal (indicative of a speech signal and a non-speech audio signal) caused by a representation of the speech signal (e.g., a waveform-coded speech signal, or parametric data in conjunction with the mixed-content signal).

In some embodiments, the blend indicator (which may have a sequence of values, e.g., one for each of a sequence of bitstream segments) is included in the bitstream received in step (a). Some embodiments include the step of generating the blend indicator (e.g., in a receiver that receives and decodes the bitstream) in response to the bitstream received in step (a).

It should be understood that the expression "blend indicator" is not intended to require that the blend indicator be a single parameter or value (or a single sequence of parameters or values) for each segment of the bitstream. Rather, it is contemplated that in some embodiments the blend indicator (for a segment of the bitstream) may be a set of two or more parameters or values (e.g., for each segment, a parametric-coded enhancement control parameter and a waveform-coded enhancement control parameter), or a sequence of sets of parameters or values.

In some embodiments, the blend indicator for each segment may be a sequence of values indicating blending per frequency band of the segment.

Waveform data and parametric data need not be provided for (e.g., included in) each segment of the bitstream, and speech enhancement need not be performed on each segment of the bitstream using both waveform data and parametric data. For example, in some cases at least one segment may include waveform data only (and the combination determined by the blend indicator for each such segment may consist of waveform data only), and at least one other segment may include parametric data only (and the combination determined by the blend indicator for each such segment may consist of reconstructed speech data only).

It is typically contemplated that the encoder generates the bitstream including by encoding (e.g., compressing) the audio data, without applying the same encoding to the waveform data or the parametric data. Thus, when the bitstream is delivered to a receiver, the receiver typically parses the bitstream to extract the audio data, the waveform data, and the parametric data (and the blend indicator, if it is delivered in the bitstream), but decodes only the audio data. The receiver typically performs speech enhancement on the decoded audio data (using the waveform data and/or parametric data) without applying to the waveform data or parametric data the same decoding process that is applied to the audio data.

Typically, the combination of the waveform data and the reconstructed speech data (indicated by the blend indicator) changes over time, with each combination state pertaining to the speech content and other audio content of a corresponding segment of the bitstream. The blend indicator is generated such that the current combination state (of waveform data and reconstructed speech data) is at least partially determined by signal characteristics of the speech content and other audio content in the corresponding segment of the bitstream (e.g., the ratio of the power of the speech content to the power of the other audio content). In some embodiments, the blend indicator is generated such that the current combination state is determined by signal characteristics of the speech content and other audio content in the corresponding segment of the bitstream. In some embodiments, the blend indicator is generated such that the current combination state is determined both by signal characteristics of the speech content and other audio content in the corresponding segment of the bitstream and by the amount of coding artifacts in the waveform data.

Step (b) may include the steps of: performing waveform-coded speech enhancement by combining (e.g., mixing or blending) at least some of the low-quality speech data with the audio data of at least one segment of the bitstream; and performing parametric-coded speech enhancement by combining the reconstructed speech data with the audio data of at least one segment of the bitstream. A combination of waveform-coded and parametric-coded speech enhancement is performed on at least one segment of the bitstream by blending both the low-quality speech data and the parametrically constructed speech for the segment with the segment's audio data. Under some signal conditions, only one (not both) of waveform-coded and parametric-coded enhancement is performed (in response to the blend indicator) on a segment (or on each of more than one segment) of the bitstream.

Herein, the expression "SNR" (signal-to-noise ratio) will be used to denote the ratio of the power (or the difference in level) of the speech content of a segment of an audio program (or of the entire program) to that of the non-speech content of the segment or program, or the ratio of the power (or the difference in level) of the speech content of a segment of the program (or the entire program) to that of the entire (speech and non-speech) content of the segment or program.

In a class of embodiments, the inventive method implements "blind" temporal-SNR-based switching between parametric-coded and waveform-coded enhancement of segments of an audio program. In this context, "blind" denotes that the switching is not perceptually guided by a complex auditory masking model (e.g., of a type to be described herein), but rather by a sequence of SNR values (blend indicators) corresponding to segments of the program. In one embodiment in this class, hybrid-coded speech enhancement is achieved by temporal switching between parametric-coded and waveform-coded enhancement, such that either parametric-coded or waveform-coded enhancement (but not both) is performed on each segment of the audio program on which speech enhancement is performed. Recognizing that waveform-coded enhancement performs best under low-SNR conditions (on segments having low SNR values) and that parametric-coded enhancement performs best at good SNRs (on segments having high SNR values), the switching decision is typically based on the ratio of speech (dialogue) to the remaining audio in the original audio mix.

Embodiments that implement "blind" temporal-SNR-based switching typically include the steps of: segmenting the unenhanced audio signal (the original audio mix) into consecutive time slices (segments), and for each segment determining the SNR between the segment's speech content and its other audio content (or between its speech content and its total audio content); and, for each segment, comparing the SNR to a threshold and providing a parametric-coded enhancement control parameter for the segment (i.e., the segment's blend indicator indicates that parametric-coded enhancement should be performed) when the SNR is greater than the threshold, or providing a waveform-coded enhancement control parameter for the segment (i.e., the segment's blend indicator indicates that waveform-coded enhancement should be performed) when the SNR is not greater than the threshold. Typically, the unenhanced audio signal is delivered (e.g., transmitted) to a receiver with the control parameters included as metadata, and the receiver performs (on each segment) the type of speech enhancement indicated by the segment's control parameter. Thus, the receiver performs parametric-coded enhancement on each segment whose control parameter is a parametric-coded enhancement control parameter, and waveform-coded enhancement on each segment whose control parameter is a waveform-coded enhancement control parameter.
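The per-segment decision rule just described can be sketched as follows (illustrative only; the 0 dB threshold and the names are assumptions, not values taken from the disclosure):

```python
import numpy as np

def choose_enhancement_mode(speech_segment, other_segment, threshold_db=0.0):
    # "Blind" temporal-SNR-based switching: compute the segment SNR from
    # the speech content and the other audio content, compare it with a
    # threshold, and select parametric-coded enhancement above the
    # threshold and waveform-coded enhancement otherwise.
    speech_power = float(np.mean(np.square(speech_segment)))
    other_power = float(np.mean(np.square(other_segment)))
    snr_db = 10.0 * np.log10(max(speech_power, 1e-12) / max(other_power, 1e-12))
    return "parametric" if snr_db > threshold_db else "waveform"
```

The resulting per-segment decision would be conveyed to the receiver as a control parameter in the metadata.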

If one is willing to bear the cost of transmitting (with each segment of the original audio mix) both waveform data (for implementing waveform-coded speech enhancement) and parametric-coded enhancement parameters for the original (unenhanced) mix, a higher degree of speech enhancement can be achieved by applying both waveform-coded and parametric-coded enhancement to individual segments of the mix. Thus, in a class of embodiments, the inventive method implements "blind" temporal-SNR-based blending between parametric-coded and waveform-coded enhancement of segments of an audio program. In this context too, "blind" denotes that the switching is not perceptually guided by a complex auditory masking model (e.g., of a type to be described herein), but rather by a sequence of SNR values corresponding to segments of the program.

实现基于“盲”时间SNR的混和的实施方式通常包括以下步骤:将未增强的音频信号(原始音频混合)分割成连续的时间片(片段);针对每个片段来确定片段的语音内容与其他音频内容之间(或者语音内容与总音频内容之间)的SNR;以及针对每个片段来设置混和控制指示符,其中,混和控制指示符的值由片段的SNR确定(是片段的SNR的函数)。An implementation of "blind" temporal SNR-based mixing generally includes the following steps: segmenting the unenhanced audio signal (the original audio mix) into consecutive time slices (segments); determining, for each segment, the SNR between the speech content and other audio content of the segment (or between the speech content and the total audio content); and setting a mixing control indicator for each segment, wherein the value of the mixing control indicator is determined by (is a function of) the SNR of the segment.

在一些实施方式中,方法包括确定(例如,接收请求)语音增强的总量(“T”)的步骤,混和控制指示符是使得T=αPw+(1-α)Pp的每个片段的参数α,其中,Pw是下述的片段的波形编码增强:如果使用针对片段所设置的波形数据将该片段的波形编码增强应用于片段的未增强的音频内容则将产生预定的总增强量T(其中,片段的语音内容具有未增强的波形,片段的波形数据指示片段的语音内容的降低品质的版本,该降低品质的版本具有与未增强的波形类似(例如,至少基本上类似)的波形,并且当被单独地呈现和感知时,语音内容的降低品质的版本具有令人讨厌的品质),Pp是下述的参数编码增强:如果使用针对片段所设置的参数数据将该参数编码增强应用于片段的未增强的音频内容则将产生预定总增强量T(其中,片段的参数数据与片段的未增强的音频内容一起来确定片段的语音内容的参数重构版本)。在一些实施方式中,片段中的每一个的混和控制指示符是包括相关片段的每个频带的参数的这样的参数的集合。In some embodiments, the method includes the step of determining (e.g., receiving a request for) a total amount of speech enhancement ("T"), the blending control indicator being a parameter α for each segment such that T = αPw + (1-α)Pp, where Pw is the waveform coded enhancement for the segment that would produce a predetermined total enhancement amount T if the waveform coded enhancement for the segment were applied to the unenhanced audio content of the segment using waveform data set for the segment (wherein the speech content of the segment has an unenhanced waveform, the waveform data for the segment indicates a degraded version of the speech content of the segment, the degraded version having a waveform that is similar (e.g., at least substantially similar) to the unenhanced waveform, and the degraded version of the speech content has an objectionable quality when presented and perceived alone), and Pp is the parametric coded enhancement that would produce a predetermined total enhancement amount T if the parametric coded enhancement were applied to the unenhanced audio content of the segment using parameter data set for the segment (wherein the parameter data for the segment, together with the unenhanced audio content of the segment, determine a parametric reconstructed version of the speech content of the segment). In some embodiments, the mixing control indicator for each of the segments is a set of such parameters including parameters for each frequency band of the relevant segment.
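One plausible way to derive the per-segment parameter α from the segment SNR is a piecewise-linear mapping between two thresholds: low SNR favors waveform-coded enhancement (α toward 1, since louder background audio masks the coding noise of the reduced-quality speech copy), while high SNR favors parametric-coded enhancement (α toward 0). The threshold values below are illustrative assumptions, not values taken from the patent:

```python
def blend_alpha(snr_db, snr_low=-6.0, snr_high=6.0):
    # Map a segment's speech-to-other SNR (dB) to the blending parameter
    # alpha in T = alpha * Pw + (1 - alpha) * Pp.  At or below snr_low the
    # segment uses pure waveform-coded enhancement (alpha = 1); at or
    # above snr_high it uses pure parametric-coded enhancement
    # (alpha = 0); in between, alpha is interpolated linearly.
    if snr_db <= snr_low:
        return 1.0
    if snr_db >= snr_high:
        return 0.0
    return (snr_high - snr_db) / (snr_high - snr_low)
```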

当未增强的音频信号与作为元数据的控制参数一起被递送(例如,被发送)至接收器时,接收器可以(对每个片段)执行由片段的控制参数所指示的混合语音增强。替选地,接收器根据未增强的音频信号生成控制参数。When the unenhanced audio signal is delivered (e.g., transmitted) to a receiver along with the control parameters as metadata, the receiver can (for each segment) perform the hybrid speech enhancement indicated by the control parameters for the segment. Alternatively, the receiver generates the control parameters based on the unenhanced audio signal.

在一些实施方式中,接收器(对未增强的音频信号的每个片段)执行参数编码增强(以通过由片段的值(1-α)所缩放的增强Pp所确定的量)和波形编码增强(以通过由片段的参数α所缩放的增强Pw所确定的量)的组合,使得参数编码增强与波形编码增强的组合生成预定的总增强量:In some embodiments, the receiver performs (for each segment of the unenhanced audio signal) a combination of parametric coded enhancement (by an amount determined by the enhancement Pp scaled by the value (1-α) for the segment) and waveform coded enhancement (by an amount determined by the enhancement Pw scaled by the parameter α for the segment), such that the combination of parametric coded enhancement and waveform coded enhancement generates the predetermined total amount of enhancement:

T=αPw+(1-α)Pp (1)T=αPw+(1-α)Pp (1)
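A receiver-side sketch of equation (1) follows, assuming a single-coefficient parametric predictor (speech approximated as p times the mix) and a reduced-quality speech copy delivered as side data. The variable names and the decibel gain convention are assumptions for illustration; a real decoder would use the full transmitted parameter sets:

```python
import numpy as np

def hybrid_enhance(mix, speech_copy, p, alpha, total_gain_db):
    # Add total_gain_db of speech boost to the segment, split between the
    # waveform-coded path (weight alpha, using the transmitted
    # reduced-quality speech copy) and the parametric path (weight
    # 1 - alpha, using speech reconstructed from the mix as p * mix),
    # per T = alpha * Pw + (1 - alpha) * Pp.
    g = 10.0 ** (total_gain_db / 20.0) - 1.0   # extra speech amplitude
    waveform_part = alpha * g * speech_copy
    parametric_part = (1.0 - alpha) * g * (p * mix)
    return mix + waveform_part + parametric_part
```

When the speech copy and the parametric reconstruction both match the true speech, any α in [0, 1] yields the same total boost T, which is the point of equation (1).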

在另一类实施方式中,通过听觉掩蔽模型来确定要对音频信号的每个片段执行的波形编码增强和参数编码增强的组合。在该类的一些实施方式中,要对音频节目的片段执行的波形编码增强和参数编码增强的混和的最佳混和比率使用刚好防止编码噪声变得可听见的最高的波形编码增强量。应当理解,解码器中的编码噪声可听性总是统计估计的形式,并且不能被精确地确定。In another class of embodiments, the combination of waveform coded enhancement and parametric coded enhancement to be performed on each segment of the audio signal is determined by an auditory masking model. In some embodiments of this class, the optimal mix ratio for the blend of waveform coded enhancement and parametric coded enhancement to be performed on a segment of an audio program uses the highest amount of waveform coded enhancement that just keeps the coding noise from becoming audible. It should be understood that coding-noise audibility in the decoder is always a statistical estimate and cannot be determined precisely.

在该类中的一些实施方式中,音频数据的每个片段的混和指示符指示要对片段执行的波形编码增强和参数编码增强的组合,并且该组合至少基本上等于由听觉掩蔽模型针对片段所确定的波形编码最大化组合,其中,波形编码最大化组合指定确保语音增强音频节目的相应片段中的编码噪声(由于波形编码增强而引起)并非令人讨厌地听得见(例如,听不见的)的最大相对波形编码增强量。在一些实施方式中,确保语音增强音频节目的片段中的编码噪声不听起来令人讨厌的最大相对波形编码增强量是以下最大相对量,该最大相对量确保(对音频数据的相应片段)要执行的波形编码增强和参数编码增强的组合生成片段的预定总量的语音增强,和/或(其中,参数编码增强的伪声被包括在由听觉掩蔽模型所执行的评估中)其可以使得(由于波形编码增强而引起的)编码伪声能够超过参数编码增强的伪声而听得见(当这是合意的时)(例如,在(由于波形编码增强而引起的)听得见的编码伪声与参数编码增强的听得见的伪声相比而较不令人讨厌的情况下)。In some embodiments of this class, the blend indicator for each segment of the audio data indicates a combination of waveform coding enhancement and parametric coding enhancement to be performed on the segment, and the combination is at least substantially equal to a waveform coding maximization combination determined for the segment by an auditory masking model, wherein the waveform coding maximization combination specifies the maximum relative amount of waveform coding enhancement that ensures that coding noise (caused by the waveform coding enhancement) in the corresponding segment of the speech-enhanced audio program is not objectionably audible (e.g., inaudible).
In some embodiments, the maximum relative amount of waveform coding enhancement that ensures that coding noise in a segment of the speech-enhanced audio program does not sound objectionable is the maximum relative amount that ensures that the combination of waveform coding enhancement and parametric coding enhancement to be performed (on the corresponding segment of the audio data) generates the predetermined total amount of speech enhancement for the segment, and/or that allows the coding artifacts (due to the waveform coding enhancement) to be audible over the artifacts of the parametric coding enhancement when that is preferable (e.g., when the audible coding artifacts due to the waveform coding enhancement are less objectionable than the audible artifacts of the parametric coding enhancement); in the latter case, the artifacts of the parametric coding enhancement are included in the evaluation performed by the auditory masking model.

在通过使用听觉掩蔽模型来更精确地预测降低品质的语音复本(要用于实现波形编码增强)中的编码噪声如何被主要节目的音频混合掩蔽并且据此选择混和比率,来确保编码噪声不变得令人讨厌地听得见(例如,不变得听得见)的同时,可以增大本发明的混合编码方案中的波形编码增强的贡献。The contribution of waveform coding enhancement in the hybrid coding scheme of the present invention can be increased while ensuring that the coding noise does not become objectionably audible (e.g., does not become audible) by using an auditory masking model to more accurately predict how the coding noise in a degraded speech replica (to be used to implement waveform coding enhancement) will be masked by the audio mix of the main program and selecting the mixing ratio accordingly.

使用听觉掩蔽模型的一些实施方式包括以下步骤:将未增强音频信号(原始音频混合)分割成连续的时间片(片段);提供每个片段中的语音的降低品质的复本(用于波形编码增强)以及每个片段的参数编码增强参数(用于参数编码增强);对于每个片段,使用听觉掩蔽模型来确定在编码伪声不变得令人讨厌地听得见的情况下可以应用的最大量的波形编码增强;以及生成波形编码增强(以不超过使用片段的听觉掩蔽模型所确定的最大量的波形编码增强以及至少基本上与使用片段的听觉掩蔽模型所确定的最大量的波形编码增强匹配的量)和参数编码增强的组合的指示符(针对未增强音频信号的每个片段),使得波形编码增强和参数编码增强的组合生成片段的预定总量的语音增强。Some embodiments using an auditory masking model include the steps of: segmenting an unenhanced audio signal (the original audio mixture) into consecutive time slices (segments); providing a degraded replica of the speech in each segment (for waveform coded enhancement) and parametric coded enhancement parameters for each segment (for parametric coded enhancement); for each segment, using the auditory masking model to determine the maximum amount of waveform coded enhancement that can be applied without coding artifacts becoming unpleasantly audible; and generating an indicator (for each segment of the unenhanced audio signal) of a combination of waveform coded enhancement (in an amount that does not exceed and at least substantially matches the maximum amount of waveform coded enhancement determined using the auditory masking model for the segment) and parametric coded enhancement such that the combination of waveform coded enhancement and parametric coded enhancement generates a predetermined total amount of speech enhancement for the segment.
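The per-segment selection described above can be sketched as follows. The auditory masking model itself is abstracted away here: it is assumed to have already produced, for each segment, a masked-noise power budget, and the coding noise of the waveform path is assumed to scale with the square of the waveform weight. Both assumptions, and all names below, are illustrative:

```python
import numpy as np

def choose_blend_indicators(coding_noise_power, mask_budget, alpha_cap=1.0):
    # For each segment, pick the largest waveform-coding weight alpha
    # whose scaled coding noise (alpha**2 * noise power) still fits under
    # the masking budget predicted by the auditory masking model; the
    # parametric weight is then 1 - alpha, so each pair delivers the
    # predetermined total amount of enhancement.
    indicators = []
    for noise, budget in zip(coding_noise_power, mask_budget):
        if noise <= 0.0:
            alpha = alpha_cap
        else:
            alpha = min(alpha_cap, float(np.sqrt(budget / noise)))
        indicators.append((alpha, 1.0 - alpha))
    return indicators
```

The same loop applies per band rather than per segment when each time slice is further split into frequency bands, as described below.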

在一些实施方式中,每个指示符被包括(例如,由编码器)在比特流中,该比特流还包括指示未增强音频信号的编码音频数据。In some embodiments, each indicator is included (eg, by an encoder) in a bitstream that also includes encoded audio data indicative of the unenhanced audio signal.

在一些实施方式中,未增强音频信号被分割成连续的时间片并且每个时间片被分割成频带,对于每个时间片中的每个频带,使用听觉掩蔽模型确定在编码伪声不变得令人讨厌地听得见的情况下可以应用的最大量的波形编码增强,针对未增强音频信号的每个时间片的每个频带生成指示符。In some embodiments, the unenhanced audio signal is segmented into consecutive time slices and each time slice is segmented into frequency bands, and for each frequency band in each time slice, an auditory masking model is used to determine the maximum amount of waveform coding enhancement that can be applied without coding artifacts becoming unpleasantly audible, and an indicator is generated for each frequency band of each time slice of the unenhanced audio signal.

可选地,方法还包括以下步骤:响应于每个片段的指示符来(对未增强音频信号的每个片段)执行由指示符所确定的波形编码增强和参数编码增强的组合,使得波形编码增强和参数编码增强的组合生成片段的预定总量的语音增强。Optionally, the method further comprises the steps of performing (for each segment of the unenhanced audio signal) in response to the indicator for each segment a combination of waveform coded enhancement and parametric coded enhancement as determined by the indicator, such that the combination of waveform coded enhancement and parametric coded enhancement generates a predetermined total amount of speech enhancement for the segment.

在一些实施方式中,将音频内容编码在诸如环绕声配置、5.1扬声器配置、7.1扬声器配置、7.2扬声器配置等的参考音频通道配置(或表示)的编码音频信号中。参考配置可以包括音频通道如立体声通道、左前通道和右前通道、环绕通道、扬声器通道、对象通道等。携载语音内容的通道中的一个或更多个可以不是中间/侧(M/S)音频通道表示的通道。如本文中所使用的,M/S音频通道表示(或简称为M/S表示)包括至少中间通道和侧通道。在示例实施方式中,中间通道表示左通道和右通道(例如,等同地被加权等)之和,而侧通道表示左通道和右通道之差,其中,左通道和右通道可以被视为两个通道例如前中央通道和前左通道的任意组合。In some embodiments, the audio content is encoded in an encoded audio signal of a reference audio channel configuration (or representation), such as a surround sound configuration, a 5.1 speaker configuration, a 7.1 speaker configuration, a 7.2 speaker configuration, or the like. The reference configuration may include audio channels such as stereo channels, left and right front channels, surround channels, speaker channels, object channels, or the like. One or more of the channels carrying the voice content may not be channels of a mid/side (M/S) audio channel representation. As used herein, an M/S audio channel representation (or simply M/S representation) includes at least a mid channel and a side channel. In an example embodiment, the mid channel represents the sum of the left and right channels (e.g., equally weighted, etc.), while the side channel represents the difference between the left and right channels, where the left and right channels can be considered as any combination of two channels, such as a front center channel and a front left channel.
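The mid/side representation just defined can be written as an orthogonal 2x2 transform of a left/right pair. The 1/sqrt(2) normalization below is one common convention (the text only specifies sum and difference), chosen here so that the transform is its own inverse:

```python
import numpy as np

def lr_to_ms(left, right):
    # Mid channel: (equally weighted) sum of the left and right channels.
    # Side channel: their difference.
    s = 1.0 / np.sqrt(2.0)
    return s * (left + right), s * (left - right)

def ms_to_lr(mid, side):
    # The inverse transform has the same form because the transform
    # matrix is orthogonal and symmetric.
    s = 1.0 / np.sqrt(2.0)
    return s * (mid + side), s * (mid - side)
```

For speech panned to the phantom center (identical in both channels), the side channel is exactly zero, which is why a single mid-channel parameter set can carry most of the speech-enhancement information.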

在一些实施方式中,节目的语音内容可以与非语音内容混合,并且可以被分布在参考音频通道配置中的两个或更多个非M/S通道如左通道和右通道、左前通道和右前通道等上。语音内容可以但并不要求被表示在立体声内容中的幻象中心处,在所述立体声内容中,语音内容在两个非M/S通道如左通道和右通道等中同样响亮。立体声内容可以包括不一定同样响亮或者甚至出现在两个通道中的非语音内容。In some embodiments, the speech content of a program may be mixed with non-speech content and may be distributed across two or more non-M/S channels in a reference audio channel configuration, such as left and right channels, left front and right front channels, etc. The speech content may, but is not required to, be represented at the phantom center in stereo content where the speech content is equally loud in both non-M/S channels, such as left and right channels, etc. Stereo content may include non-speech content that is not necessarily equally loud or even present in both channels.

在一些方法中,用于与在其上分布有语音内容的多个非M/S音频通道相对应的用于语音增强的非M/S控制数据、控制参数等的多个集合作为全部音频元数据的一部分从音频编码器被发送至下游音频解码器。用于语音增强的非M/S控制数据、控制参数等的多个集合中的每一个与在其上分布有语音内容的多个非M/S音频通道的特定音频通道相对应,并且可以由下游音频解码器使用来控制与特定音频通道有关的语音增强操作。如本文中所使用的,非M/S控制数据、控制参数等的集合指代用于非M/S表示如在其中如本文中所描述的音频信号被编码的参考配置的音频通道中的语音增强操作的控制数据、控制参数等。In some methods, multiple sets of non-M/S control data, control parameters, etc. for speech enhancement corresponding to multiple non-M/S audio channels on which speech content is distributed are sent from an audio encoder to a downstream audio decoder as part of overall audio metadata. Each of the multiple sets of non-M/S control data, control parameters, etc. for speech enhancement corresponds to a specific audio channel of the multiple non-M/S audio channels on which speech content is distributed and can be used by a downstream audio decoder to control speech enhancement operations related to the specific audio channel. As used herein, a set of non-M/S control data, control parameters, etc. refers to control data, control parameters, etc. for speech enhancement operations in an audio channel of a reference configuration in which an audio signal as described herein is encoded.

在一些实施方式中,M/S语音增强元数据——除了非M/S控制数据、控制参数等的一个或更多个集合以外或者代替非M/S控制数据、控制参数等的一个或更多个集合——作为音频元数据的一部分从音频编码器被发送至下游音频解码器。M/S语音增强元数据可以包括用于语音增强的M/S控制数据、控制参数等的一个或更多个集合。如本文中所使用的,M/S控制数据、控制参数等的集合指代用于M/S表示的音频通道中的语音增强操作的控制数据、控制参数等。在一些实施方式中,用于语音增强的M/S语音增强元数据与编码在参考音频通道配置中的混合内容一起被音频编码器发送至下游音频解码器。在一些实施方式中,用于M/S语音增强元数据中的语音增强的M/S控制数据、控制参数等的集合的数目可以比在其上分布有混合内容中的语音内容的参考音频通道表示中的多个非M/S音频通道的数目少。在一些实施方式中,甚至当混合内容中的语音内容被分布在参考音频通道配置中的两个或更多个非M/S音频通道如左通道和右通道等上时,用于语音增强的M/S控制数据、控制参数等的仅一个集合——例如,与M/S表示的中间通道相对应——作为M/S语音增强元数据被音频编码器发送至下游解码器。可以使用用于语音增强的M/S控制数据、控制参数等的单个集合来实现针对两个或更多个非M/S音频通道如左通道和右通道等中的所有通道的语音增强操作。在一些实施方式中,可以使用参考配置与M/S表示之间的转换矩阵来应用用于如本文中所描述的语音增强的基于M/S控制数据、控制参数等的语音增强操作。In some embodiments, M/S speech enhancement metadata—in addition to or in lieu of one or more sets of non-M/S control data, control parameters, etc.—is transmitted from an audio encoder to a downstream audio decoder as part of the audio metadata. The M/S speech enhancement metadata may include one or more sets of M/S control data, control parameters, etc. for speech enhancement. As used herein, a set of M/S control data, control parameters, etc. refers to control data, control parameters, etc. used for speech enhancement operations in audio channels of an M/S representation. In some embodiments, the M/S speech enhancement metadata for speech enhancement is transmitted from an audio encoder to a downstream audio decoder along with the mixed content encoded in a reference audio channel configuration. In some embodiments, the number of sets of M/S control data, control parameters, etc. for speech enhancement in the M/S speech enhancement metadata may be less than the number of non-M/S audio channels in the reference audio channel representation over which the speech content in the mixed content is distributed. 
In some embodiments, even when the speech content in the mixed content is distributed across two or more non-M/S audio channels, such as a left channel and a right channel, in a reference audio channel configuration, only one set of M/S control data, control parameters, and the like for speech enhancement—e.g., corresponding to a center channel of an M/S representation—is sent by the audio encoder to the downstream decoder as M/S speech enhancement metadata. A single set of M/S control data, control parameters, and the like for speech enhancement can be used to implement speech enhancement operations for all of the two or more non-M/S audio channels, such as a left channel and a right channel. In some embodiments, a conversion matrix between the reference configuration and the M/S representation can be used to apply speech enhancement operations based on the M/S control data, control parameters, and the like for speech enhancement as described herein.
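The conversion-matrix idea can be illustrated by folding a single mid-channel enhancement gain into one 2x2 matrix that acts directly on the left/right pair, so the decoder never has to materialize the M/S signals. The scalar-gain model and the normalization are assumptions for illustration; an actual decoder would fold in the full set of transmitted M/S control parameters:

```python
import numpy as np

# L/R -> M/S conversion matrix; it is orthogonal and symmetric, so it is
# its own inverse.
T = np.array([[1.0, 1.0],
              [1.0, -1.0]]) / np.sqrt(2.0)

def ms_gain_as_lr_matrix(mid_gain, side_gain=1.0):
    # Fold "convert to M/S, apply per-channel gains, convert back" into a
    # single matrix H = T_inv @ diag(gains) @ T applied to [left, right].
    return T @ np.diag([mid_gain, side_gain]) @ T

def enhance_lr(left, right, mid_gain):
    out = ms_gain_as_lr_matrix(mid_gain) @ np.vstack([left, right])
    return out[0], out[1]
```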

如本文中所描述的技术可以用于以下情况中:语音内容被平移在左通道和右通道的幻象中心处,语音内容未被完全平移至中央(例如,左通道和右通道两者中不同样响亮)等。在示例中,这些技术可以用于以下情况中:语音内容的大百分比(例如,70+%、80+%、90+%等)的能量在中间信号或M/S表示的中间通道中。在另一个示例中,(例如,空间等)转换如平移、旋转等可以用来将参考配置中的不等同的语音内容转换成M/S配置中的等同或基本上等同的语音内容。表示平移、旋转等的呈现向量、转换矩阵等可以用作语音增强操作的一部分或者可以与语音增强操作结合使用。The techniques described herein can be used in situations where the speech content is panned at the phantom center of the left and right channels, where the speech content is not fully panned to the center (e.g., the left and right channels are not equally loud), etc. In an example, these techniques can be used in situations where a large percentage (e.g., 70+%, 80+%, 90+%, etc.) of the speech content's energy is in the middle signal or middle channel of the M/S representation. In another example, (e.g., spatial, etc.) transformations such as translations, rotations, etc. can be used to convert unequal speech content in a reference configuration into equivalent or substantially equivalent speech content in an M/S configuration. Rendering vectors, transformation matrices, etc. representing translations, rotations, etc. can be used as part of or in conjunction with speech enhancement operations.

在一些实施方式中(例如,混合模式等),语音内容的版本(例如,降低的版本等)作为M/S表示中的仅中间通道信号或者中间通道信号和侧通道信号两者,连同可能具有非M/S表示的参考音频通道配置中所发送的混合内容一起被发送至下游音频解码器。在一些实施方式中,当语音内容的版本作为M/S表示中的仅中间通道信号被发送至下游音频解码器时,对中间通道信号进行操作(例如,执行转换等)以基于中间通道信号来生成非M/S音频通道配置(例如,参考配置等)的一个或更多个非M/S通道中的信号部分的、相应呈现向量也被发送至下游音频解码器。In some embodiments (e.g., a hybrid mode, etc.), a version of the speech content (e.g., a reduced version, etc.) is sent to a downstream audio decoder as only the mid channel signal, or as both the mid channel signal and the side channel signal, in an M/S representation, along with the mixed content sent in a reference audio channel configuration that may have a non-M/S representation. In some embodiments, when the version of the speech content is sent to a downstream audio decoder as only the mid channel signal in an M/S representation, corresponding rendering vectors, which operate on (e.g., transform, etc.) the mid channel signal to generate, based on it, the signal portions in one or more non-M/S channels of a non-M/S audio channel configuration (e.g., a reference configuration, etc.), are also sent to the downstream audio decoder.

在一些实施方式中,实现音频节目的片段的参数编码增强(例如,独立通道对话预测、多通道对话预测等)与波形编码增强之间的基于“盲”时间SNR切换的对话/语音增强算法(例如,在下游音频解码器等中)至少部分地在M/S表示中操作。In some implementations, a dialog/speech enhancement algorithm (e.g., in a downstream audio decoder, etc.) that implements "blind" temporal SNR switching between parametric coding enhancement (e.g., independent channel dialog prediction, multi-channel dialog prediction, etc.) and waveform coding enhancement of segments of an audio program operates at least partially in the M/S representation.

如本文中所描述的至少部分地在M/S表示中实现语音增强操作的技术可以用于独立通道预测(例如,在中间通道等中)、多通道预测(例如,在中间通道和侧通道等中)等。这些技术还可以用来同时支持对一个对话、两个或更多个对话的语音增强。控制参数、控制数据等如预测参数、增益、呈现向量等的零个集合、一个或更多个另外的集合可以作为M/S语音增强元数据的一部分被设置在编码音频信号中以支持另外的对话。The techniques described herein for implementing speech enhancement operations at least partially in the M/S representation can be used for independent channel prediction (e.g., in the mid channel, etc.), multi-channel prediction (e.g., in the mid channel and side channels, etc.), etc. These techniques can also be used to support speech enhancement for one dialog, two, or more dialogs simultaneously. Zero, one, or one or more additional sets of control parameters, control data, and the like, such as prediction parameters, gains, rendering vectors, etc., can be provided in the encoded audio signal as part of the M/S speech enhancement metadata to support additional dialogs.

在一些实施方式中,(例如,从编码器输出等的)编码音频信号的语义支持M/S标记从上游音频编码器至下游音频解码器的传输。当要至少部分地使用利用M/S标记所发送的M/S控制数据、控制参数等来执行语音增强操作时,M/S标记出现/被设置。例如,当M/S标记被设置时,在根据语音增强算法(例如,独立通道对话预测、多通道对话预测、基于波形的、波形参数混合等)中的一个或更多个、使用如利用M/S标记所接收的M/S控制数据、控制参数等应用M/S语音增强操作之前,接收方音频解码器可以首先将非M/S通道中的立体声信号(例如,来自左通道和右通道等)转换成M/S表示的中间通道和侧通道。在执行M/S语音增强操作之后,可以将M/S表示中的语音增强信号转换回至非M/S通道。In some embodiments, the semantics of the encoded audio signal (e.g., output from an encoder, etc.) support the transmission of an M/S flag from an upstream audio encoder to a downstream audio decoder. The M/S flag is present/set when a speech enhancement operation is to be performed, at least in part, using M/S control data, control parameters, etc. transmitted using the M/S flag. For example, when the M/S flag is set, the receiving audio decoder may first convert the stereo signals in non-M/S channels (e.g., from the left and right channels, etc.) into mid and side channels in an M/S representation before applying an M/S speech enhancement operation according to one or more speech enhancement algorithms (e.g., independent channel dialogue prediction, multi-channel dialogue prediction, waveform-based, waveform parameter mixing, etc.) using the M/S control data, control parameters, etc. received using the M/S flag. After performing the M/S speech enhancement operation, the speech-enhanced signals in the M/S representation may be converted back to non-M/S channels.
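The decoder-side flow just described, namely detect the flag, convert to M/S, enhance, convert back, might look like the following sketch. The flag name, the scalar gain model, and the normalization are assumptions; a real decoder would apply the full set of transmitted M/S control data and parameters rather than one gain:

```python
import numpy as np

def decode_segment(left, right, ms_flag, speech_gain):
    # When the M/S flag is set, the received M/S control data applies in
    # the mid/side domain: convert the stereo pair to mid/side, enhance
    # the mid channel (where the speech is assumed concentrated), then
    # convert back.  Otherwise apply the enhancement per non-M/S channel.
    if ms_flag:
        s = 1.0 / np.sqrt(2.0)
        mid, side = s * (left + right), s * (left - right)
        mid = speech_gain * mid
        return s * (mid + side), s * (mid - side)
    return speech_gain * left, speech_gain * right
```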

在一些实施方式中,要根据本发明来增强其语音内容的音频节目包括扬声器通道但是不包括任何对象通道。在其他实施方式中,要根据本发明增强其语音内容的音频节目是包括至少一个对象通道以及可选地至少一个扬声器通道的基于对象的音频节目(典型地为基于多通道对象的音频节目)。In some embodiments, the audio program whose speech content is to be enhanced according to the present invention includes speaker channels but does not include any object channels. In other embodiments, the audio program whose speech content is to be enhanced according to the present invention is an object-based audio program (typically a multi-channel object-based audio program) that includes at least one object channel and, optionally, at least one speaker channel.

本发明的另一个方面是以下系统,该系统包括:编码器,其被配置(例如,被编程)为响应于指示包括语音内容和非语音内容的节目的音频数据,执行本发明编码方法的任何实施方式以生成包括编码音频数据、波形数据和参数数据(以及此外可选地音频数据的每个片段的混和指示符(例如,混和指示数据))的比特流;以及解码器,其被配置成对比特流进行解析以恢复编码音频数据(以及此外可选地每个混和指示符)并且对编码音频数据进行解码以恢复音频数据。替代地,解码器被配置成响应于所恢复的音频数据而生成音频数据的每个片段的混和指示符。解码器被配置成响应于每个混和指示符对所恢复的音频数据执行混合语音增强。Another aspect of the present invention is a system comprising: an encoder configured (e.g., programmed) to, in response to audio data indicating a program including speech content and non-speech content, perform any embodiment of the encoding method of the present invention to generate a bitstream comprising encoded audio data, waveform data, and parameter data (and optionally, a mixing indicator (e.g., mixing indicator data) for each segment of the audio data); and a decoder configured to parse the bitstream to recover the encoded audio data (and optionally, each mixing indicator) and decode the encoded audio data to recover the audio data. Alternatively, the decoder is configured to generate a mixing indicator for each segment of the audio data in response to the recovered audio data. The decoder is configured to perform hybrid speech enhancement on the recovered audio data in response to each mixing indicator.

本发明的另一个方面是被配置成执行本发明方法的任何实施方式的解码器。在另一类实施方式中,本发明是包括存储(例如,以非暂态方式)已经通过本发明方法的任何实施方式所生成的编码音频比特流的至少一个片段(例如,帧)的缓冲存储器(缓冲器)的解码器。Another aspect of the present invention is a decoder configured to perform any embodiment of the method of the present invention. In another class of embodiments, the present invention is a decoder comprising a buffer memory (buffer) storing (e.g., in a non-transient manner) at least one segment (e.g., frame) of a coded audio bitstream that has been generated by any embodiment of the method of the present invention.

本发明的其他方面包括被配置(例如,被编程)成执行本发明方法的任何实施方式的系统或装置(例如,编码器、解码器或处理器)以及存储用于实现本发明方法或其步骤的任何实施方式的代码的计算机可读介质(例如,磁盘)。例如,本发明系统可以是或者包括使用软件或固件被编程成和/或以其他方式被配置成对数据执行包括本发明方法或其步骤的实施方式的多种操作中的任何操作的可编程通用处理器、数字信号处理器或微处理器。这样的通用处理器可以是或者包括以下计算机系统,该计算机系统包括被编程(和/或以其他方式被配置)成响应于设定(assert)至该计算机系统的数据来执行本发明方法(或其步骤)的实施方式的输入装置、存储器和处理电路。Other aspects of the present invention include systems or devices (e.g., encoders, decoders, or processors) configured (e.g., programmed) to perform any embodiment of the method of the present invention and computer-readable media (e.g., disks) storing code for implementing any embodiment of the method of the present invention or its steps. For example, the system of the present invention may be or include a programmable general-purpose processor, digital signal processor, or microprocessor that is programmed and/or otherwise configured using software or firmware to perform any of a variety of operations on data including embodiments of the method of the present invention or its steps. Such a general-purpose processor may be or include a computer system that includes an input device, memory, and processing circuitry that is programmed (and/or otherwise configured) to perform embodiments of the method of the present invention (or its steps) in response to data asserted to the computer system.

在一些实施方式中,如本文中所描述的机构形成媒体处理系统的一部分,包括但不限于:音视频装置、平板TV、手持装置、游戏机、电视、家庭影院系统、平板、移动装置、膝上型计算机、笔记本计算机、蜂窝无线电话、电子书阅读器、销售点终端、桌面型计算机、计算机工作站、计算机信息站、各种其他种类的终端和媒体处理单元等。In some embodiments, the mechanisms as described herein form part of a media processing system, including, but not limited to: audiovisual devices, flat-panel TVs, handheld devices, game consoles, televisions, home theater systems, tablets, mobile devices, laptop computers, notebook computers, cellular radiotelephones, e-book readers, point-of-sale terminals, desktop computers, computer workstations, computer kiosks, various other kinds of terminals and media processing units, and the like.

对本领域的技术人员而言,对本文中所描述的一般原理和特征和优选实施方式的各种修改将是显而易见的。因此,本公开内容并不意在受限于所示的实施方式,而是意在符合与本文中所描述的原理和特征一致的最宽的范围。Various modifications to the general principles and features and preferred embodiments described herein will be apparent to those skilled in the art. Therefore, the present disclosure is not intended to be limited to the embodiments shown, but is intended to be consistent with the broadest scope consistent with the principles and features described herein.

2.符号和术语2. Symbols and terminology

贯穿包括权利要求在内的本公开内容,术语“对话”和“语音”作为同义词可互换地被用来表示作为由人类(或者虚拟世界中的角色)沟通的形式所感知的音频信号内容。Throughout this disclosure, including the claims, the terms "dialogue" and "speech" are used interchangeably as synonyms to refer to audio signal content perceived as a form of communication by a human (or a character in a virtual world).

贯穿包括权利要求在内的本公开内容,表达“对”信号或数据执行操作(例如,对信号或数据进行滤波、缩放、转换、或者应用增益)在广义上被用来表示对信号或数据直接执行操作或者对信号或数据的经处理的版本(例如,对在对其执行操作之前已经经历初步滤波或预处理的信号的版本)执行操作。Throughout this disclosure, including the claims, the expression "performing an operation on" a signal or data (e.g., filtering, scaling, converting, or applying a gain to the signal or data) is used broadly to mean performing the operation directly on the signal or data or on a processed version of the signal or data (e.g., a version of the signal that has undergone preliminary filtering or preprocessing before the operation is performed on it).

贯穿包括权利要求在内的本公开内容,表达“系统”在广义上被用来表示装置、系统或子系统。例如,实现解码器的子系统可以被称为解码器系统,包括这样的子系统(例如,响应于多个输入生成X个输出信号的系统,其中,子系统生成其中的M个输入,而另外X-M个输入从外部源被接收)的系统还可以被称为解码器系统。Throughout this disclosure, including the claims, the expression "system" is used in a broad sense to denote a device, system, or subsystem. For example, a subsystem that implements a decoder may be referred to as a decoder system, and a system that includes such a subsystem (e.g., a system that generates X output signals in response to multiple inputs, in which the subsystem generates M of the inputs and the other X-M inputs are received from an external source) may also be referred to as a decoder system.

贯穿包括权利要求在内的本公开内容,术语“处理器”在广义上被用来表示可编程或者以其他方式可配置(例如,使用软件或固件)成对数据(例如,音频、或者视频或其他图像数据)执行操作的系统或装置。处理器的示例包括现场可编程门阵列(或其他可配置集成电路或芯片组)、被编程和/或以其他方式被配置成对音频或其他声音数据执行流水线处理的数字信号处理器、可编程通用处理器或计算机、以及可编程微处理器芯片或芯片组。Throughout this disclosure, including the claims, the term "processor" is used in a broad sense to refer to a system or device that is programmable or otherwise configurable (e.g., using software or firmware) to perform operations on data (e.g., audio, or video or other image data). Examples of processors include field programmable gate arrays (or other configurable integrated circuits or chipsets), digital signal processors that are programmed and/or otherwise configured to perform pipeline processing on audio or other sound data, programmable general-purpose processors or computers, and programmable microprocessor chips or chipsets.

贯穿包括权利要求在内的本公开内容,表达“音频处理器”和“音频处理单元”可互换地被使用,并且在广义上,表示被配置成处理音频数据的系统。音频处理单元的示例包括但不限于编码器(例如,转码器)、解码器、编解码器、预处理系统、后处理系统、以及比特流处理系统(有时称为比特流处理工具)。Throughout this disclosure, including the claims, the expressions "audio processor" and "audio processing unit" are used interchangeably and, in a broad sense, refer to a system configured to process audio data. Examples of audio processing units include, but are not limited to, encoders (e.g., transcoders), decoders, codecs, pre-processing systems, post-processing systems, and bitstream processing systems (sometimes referred to as bitstream processing tools).

贯穿包括权利要求在内的本公开内容,表达“元数据”指代与相应音频数据(还包括元数据的比特流的音频内容)分立且不同的数据。元数据与音频数据相关联,并且表示音频数据的至少一个特征或特性(例如,已经对音频数据或者由音频数据所指示的对象的轨迹执行了什么类型的处理或者应该执行什么类型的处理)。元数据与音频数据的关联是时间同步的。因此,当前(最近所接收或更新的)元数据可以指示相应音频数据同时具有所指示的特征和/或包括音频数据处理的所指示类型的结果。Throughout this disclosure, including the claims, the expression "metadata" refers to data that is separate and distinct from corresponding audio data (the audio content of the bitstream that also includes the metadata). Metadata is associated with audio data and represents at least one feature or characteristic of the audio data (e.g., what type of processing has been performed or should be performed on the audio data or the trajectory of an object indicated by the audio data). The association of metadata with the audio data is time-synchronized. Thus, current (most recently received or updated) metadata can indicate that the corresponding audio data simultaneously has the indicated features and/or includes the results of the indicated type of audio data processing.

贯穿包括权利要求在内的本公开内容,术语“耦接(couples)”或“耦接(coupled)”被用来表示直接或间接连接。因此,如果第一装置耦接至第二装置,则连接可以通过直接连接或者通过经由其他装置和连接的间接连接。Throughout this disclosure, including the claims, the terms "couples" or "coupled" are used to mean either a direct or indirect connection. Thus, if a first device couples to a second device, that connection may be through a direct connection or through an indirect connection via other devices and connections.

贯穿包括权利要求在内的本公开内容,以下表达具有下面的定义:Throughout this disclosure, including the claims, the following expressions have the following definitions:

-扬声器(speaker)和扩音器(loudspeaker)同义地被用来表示任何发出声音的转换器。该定义包括实现为多个转换器(例如,低频扬声器和高频扬声器)的扩音器;- Speaker and loudspeaker are used synonymously to mean any transducer that produces sound. This definition includes loudspeakers implemented as multiple transducers (e.g., a woofer and a tweeter);

-扬声器馈送:要直接应用于扩音器的音频信号,或者要应用于串联的放大器和扩音器的音频信号;- Loudspeaker feed: audio signal to be applied directly to a loudspeaker, or to an amplifier and loudspeaker connected in series;

-通道(或“音频通道”):单通道音频信号。通常,这样的信号可以以这样的方式被呈现,使得等同于将信号直接应用于在期望位置或标称位置处的扩音器。如通常是具有物理扩音器的情况,期望位置可以是静止的,或可以是动态的;Channel (or "audio channel"): a single-channel audio signal. Typically, such a signal can be presented in such a way that it is equivalent to applying the signal directly to a loudspeaker at a desired or nominal position. The desired position can be static, as is typically the case with physical loudspeakers, or it can be dynamic;

-音频节目:一个或更多个音频通道的集合(至少一个扬声器通道和/或至少一个对象通道)以及此外可选地相关联的元数据(例如,描述期望的空间音频表示的元数据);- audio program: a set of one or more audio channels (at least one loudspeaker channel and/or at least one object channel) and further optionally associated metadata (e.g. metadata describing the desired spatial audio representation);

-扬声器通道(或者“扬声器馈送通道”):与命名扩音器(在期望位置或标称位置处)相关联的或者与限定的扬声器配置内的命名扬声器区相关联的音频通道。扬声器通道以这样的方式被呈现,使得等同于直接向命名扩音器(在期望位置或标称位置处)或者命名扬声器区中的扬声器应用音频信号;- Loudspeaker channels (or "loudspeaker feed channels"): audio channels associated with named loudspeakers (at desired or nominal positions) or associated with named loudspeaker zones within a defined loudspeaker configuration. Loudspeaker channels are presented in such a way as to be equivalent to applying audio signals directly to the named loudspeakers (at desired or nominal positions) or to the loudspeakers in the named loudspeaker zones;

-对象通道:指示由音频源(有时称为音频“对象”)发出的声音的音频通道。通常,对象通道确定参数音频源描述(例如,指示参数音频源描述被包括在对象通道中或者设置有对象通道的元数据)。源描述可以确定由源发出的声音(作为时间的函数)、作为时间的函数的源的表观位置(例如,三维空间坐标)、以及可选地至少一个表征源的附加参数(例如,表观源大小或宽度);- Object channel: An audio channel that indicates the sound emitted by an audio source (sometimes referred to as an audio "object"). Typically, an object channel determines a parametric audio source description (e.g., metadata indicating that a parametric audio source description is included in or provided with the object channel). The source description may determine the sound emitted by the source (as a function of time), the apparent location of the source as a function of time (e.g., three-dimensional spatial coordinates), and optionally at least one additional parameter characterizing the source (e.g., apparent source size or width);

-基于对象的音频节目:包括一个或更多个对象通道的集合(以及此外可选地包括至少一个扬声器通道)以及此外可选地相关联的元数据(例如,指示发出由对象通道所指示的声音的音频对象的轨迹的元数据,或者以其他方式指示由对象通道所指示的声音的期望的空间音频表示的元数据,或者指示作为由对象通道所指示的声音的源的至少一个音频对象的标识的元数据)的音频节目;以及- object-based audio program: an audio program comprising a set of one or more object channels (and optionally at least one loudspeaker channel) and optionally associated metadata (e.g., metadata indicating the trajectory of an audio object emitting the sound indicated by the object channel, or otherwise indicating a desired spatial audio representation of the sound indicated by the object channel, or metadata indicating the identity of at least one audio object that is the source of the sound indicated by the object channel); and

-呈现:将音频节目转变成一个或更多个扬声器馈送的处理,或者将音频节目转变成一个或更多个扬声器馈送并且使用一个或更多个扩音器将该扬声器馈送转变成声音的处理(在后一种情况下,在本文中呈现有时被称为“由”扩音器呈现)。可以通过直接对在期望位置处的物理扬声器应用信号来平常地呈现(“在”期望位置处)音频通道,或者可以使用要被设计成基本上等同于(对于听者而言)这样的平常呈现的多个虚拟化技术之一来呈现一个或更多个音频通道。在该后一种情况下,每个音频通道可以被转变成要应用于一般不同于期望位置的已知位置中的扩音器的一个或更多个扬声器馈送,使得由扩音器响应于馈送所发出的声音将被感知为从期望位置发出。这样的虚拟化技术的示例包括经由耳机的双耳呈现(例如,使用为耳机佩戴者模拟高达7.1环绕声通道的杜比耳机处理)以及波场合成。Rendering: The process of converting an audio program into one or more speaker feeds, or the process of converting an audio program into one or more speaker feeds and converting the speaker feeds into sound using one or more loudspeakers (in the latter case, rendering is sometimes referred to herein as being rendered "by" the loudspeakers). An audio channel may be rendered ordinarily ("at" a desired location) by applying a signal directly to a physical loudspeaker at the desired location, or one or more audio channels may be rendered using one of a number of virtualization techniques designed to be substantially equivalent (to a listener) to such an ordinary rendering. In the latter case, each audio channel may be converted into one or more speaker feeds to be applied to loudspeakers in known locations, generally different from the desired location, so that the sound emitted by the loudspeakers in response to the feeds will be perceived as emanating from the desired location. Examples of such virtualization techniques include binaural rendering via headphones (e.g., using Dolby Headphone processing to simulate up to 7.1 surround sound channels for a headphone wearer) and wave field synthesis.

将参照图3、图6和图7来描述本发明的编码、解码和语音增强方法的实施方式以及被配置成实现方法的系统。With reference to FIGS. 3, 6 and 7, embodiments of the encoding, decoding and speech enhancement methods of the present invention, and systems configured to implement the methods, will be described.

3.预测参数的生成3. Generation of prediction parameters

为了执行语音增强(包括根据本发明的实施方式的混合语音增强),需要访问要增强的语音信号。如果在要执行语音增强时语音信号不可用(与要增强的混合信号的语音内容和非语音内容的混合分立地不可用),则可以使用参数技术来创建对可用混合中的语音的重构。In order to perform speech enhancement (including hybrid speech enhancement according to embodiments of the present invention), access to the speech signal to be enhanced is required. If the speech signal is not available (separately from the mix of speech content and non-speech content of the mix to be enhanced) at the time speech enhancement is to be performed, parametric techniques can be used to create a reconstruction of the speech of the available mix.

一种用于混合内容信号(指示语音内容与非语音内容的混合)的语音内容的参数重构的方法基于重构信号的每个时间-频率分块中的语音功率,并且根据以下公式生成参数:A method for parametric reconstruction of speech content of a mixed-content signal (indicative of a mixture of speech content and non-speech content) is based on the speech power in each time-frequency block of the reconstructed signal and generates parameters according to the following formula:
pn,b=√(Σs,f|Ds,f|²/Σs,f|Ms,f|²) (1)

其中,pn,b是分块的参数(参数编码语音增强值),pn,b具有时间索引n和频率带索引b,值Ds,f表示分块的时隙s和频率仓(bin)f中的语音信号,值Ms,f表示分块的同一时隙和频率仓中的混合内容信号,求和针对分块中的s和f的所有值。可以与混合内容信号自身一起(作为元数据)递送参数pn,b,以使得接收器能够重构混合内容信号的每个片段的语音内容。where pn,b is the parameter (parametric-coded speech enhancement value) of the block, pn,b having time index n and frequency band index b, the value Ds,f denotes the speech signal in time slot s and frequency bin f of the block, the value Ms,f denotes the mixed content signal in the same time slot and frequency bin of the block, and the summations are over all values of s and f in the block. The parameters pn,b can be delivered (as metadata) along with the mixed content signal itself, to enable the receiver to reconstruct the speech content of each segment of the mixed content signal.

如图1所描绘的,可以通过以下操作来确定每个参数pn,b:对其语音内容要被增强的混合内容信号(“混合音频”)执行时域到频域的转换;对语音信号(混合内容信号的语音内容)执行时域到频域的转换;在分块中的所有时隙和频率仓上,对语音信号的每个时间-频率分块(具有时间索引n和频率带索引b)的能量求积分;在分块中的所有时隙和频率仓上,对混合内容信号的相应时间-频率分块的能量求积分;以及将第一积分的结果除以第二积分的结果,以生成分块的参数pn,b。As depicted in FIG. 1, each parameter pn,b can be determined by: performing a time-domain to frequency-domain transform on the mixed content signal ("mixed audio") whose speech content is to be enhanced; performing a time-domain to frequency-domain transform on the speech signal (the speech content of the mixed content signal); integrating the energy of each time-frequency block of the speech signal (having time index n and frequency band index b) over all time slots and frequency bins in the block; integrating the energy of the corresponding time-frequency block of the mixed content signal over all time slots and frequency bins in the block; and dividing the result of the first integration by the result of the second integration to generate the parameter pn,b for the block.

当将混合内容信号的每个时间-频率分块乘以分块的参数pn,b时,所得到信号具有与混合内容信号的语音内容相似的频谱和时间包络。When each time-frequency block of the mixed content signal is multiplied by the parameters pn ,b of the block, the resulting signal has a spectral and temporal envelope similar to the speech content of the mixed content signal.
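As an illustrative sketch (not part of the patent text), the per-tile computation described above can be expressed with NumPy. The function names, the array layout, and the small `eps` regularizer are our own assumptions:

```python
import numpy as np

def prediction_params(D_tiles, M_tiles, eps=1e-12):
    """Compute one parametric speech-enhancement value p[n, b] per
    time-frequency block as the square root of the ratio of speech
    energy to mixed-content energy, integrated over the slots s and
    bins f of the block.

    D_tiles, M_tiles: arrays of shape (n_frames, n_bands, slots, bins)
    holding the (complex) STFT values D[s, f] and M[s, f] of each block.
    """
    d_energy = np.sum(np.abs(D_tiles) ** 2, axis=(2, 3))  # integrate |D|^2 over s, f
    m_energy = np.sum(np.abs(M_tiles) ** 2, axis=(2, 3))  # integrate |M|^2 over s, f
    return np.sqrt(d_energy / (m_energy + eps))           # p[n, b]

def reconstruct_speech(M_tiles, p):
    # Scale every block of the mix by its parameter p[n, b]; the result
    # approximates the spectral/temporal envelope of the speech content.
    return M_tiles * p[:, :, None, None]
```

Multiplying the mixed-content blocks by p[n, b], as in the paragraph above, yields a signal whose envelope follows the speech content of the mix.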

典型音频节目——例如立体声或5.1通道音频节目——包括多个扬声器通道。通常,每个通道(或者通道的子集中的每一个)指示语音内容和非语音内容,并且混合内容信号确定每个通道。可以将所描述的参数语音重构方法独立地应用于每个通道以重构所有通道的语音内容。可以使用每个通道的适当的增益将重构语音信号(针对通道中的每一个有一个重构语音信号)添加至相应混合内容通道信号,以获得对语音内容的期望的增强。A typical audio program, such as a stereo or 5.1 channel audio program, includes multiple speaker channels. Typically, each channel (or each of a subset of channels) indicates speech content and non-speech content, and a mixed content signal determines each channel. The described parametric speech reconstruction method can be applied independently to each channel to reconstruct the speech content of all channels. The reconstructed speech signal (one for each of the channels) can be added to the corresponding mixed content channel signal using an appropriate gain for each channel to achieve the desired enhancement of the speech content.

多通道节目的混合内容信号(通道)可以被表示为信号向量的集合,其中,每个向量元素是与特定参数集合即帧(n)中的时隙(s)和参数带(b)中的所有频率仓(f)相对应的时间-频率分块的汇集。三通道混合内容信号的向量的这样的集合的示例是:The mixed content signals (channels) of a multi-channel program can be represented as a set of signal vectors, where each vector element is a collection of time-frequency blocks corresponding to a specific set of parameters, i.e., time slots (s) in a frame (n) and all frequency bins (f) in a parameter band (b). An example of such a set of vectors for a three-channel mixed content signal is:
Mn,b=[Mc1,n,b,Mc2,n,b,Mc3,n,b]T (2)

其中,ci表示通道。该示例假定三个通道,但是通道的数目是任意量。Here, ci represents channels. This example assumes three channels, but the number of channels can be any number.

类似地,多通道节目的语音内容可以被表示为1×1矩阵的集合(其中,语音内容包括仅一个通道)Dn,b。混合内容信号的每个矩阵元素与标量值的乘法产生每个子元素与标量值的乘积。因此,通过针对每个n和b计算下面的公式来获得每个分块的重构语音值Similarly, the speech content of a multi-channel program can be represented as a set of 1×1 matrices (where the speech content includes only one channel) Dn ,b . The multiplication of each matrix element of the mixed content signal with a scalar value produces the product of each sub-element with a scalar value. Therefore, the reconstructed speech value for each block is obtained by calculating the following formula for each n and b:

Dr,n,b=diag(P)·Mn,b (4)D r,n,b =diag(P)·M n,b (4)

其中,P是其元素是预测参数的矩阵。(所有分块的)重构语音还可以被表示为:Where P is a matrix whose elements are prediction parameters. The reconstructed speech (of all blocks) can also be expressed as:

Dr=diag(P)·M (5)D r =diag(P)·M (5)

多通道混合内容信号的多个通道中的内容引起通道之间的相干性,可以利用该相干性对语音信号做出较好的预测。通过使用(例如,常规类型的)最小均方差(MMSE)预测器,可以将通道与预测参数进行组合,以根据均方差(MSE)标准使用最小误差来重构语音内容。如图2所示,假定三通道混合内容输入信号,这样的MMSE预测器(在频域中操作)响应于混合内容输入信号以及指示混合内容输入信号的语音内容的单个输入语音信号,迭代地生成预测参数pi(其中,索引i是1、2或3)的集合。The content in the multiple channels of a multi-channel mixed content signal gives rise to inter-channel coherence that can be exploited to make a better prediction of the speech signal. By using a minimum mean square error (MMSE) predictor (e.g., of conventional type), the channels can be combined with the prediction parameters to reconstruct the speech content with minimum error according to a mean square error (MSE) criterion. As shown in FIG. 2, assuming a three-channel mixed content input signal, such an MMSE predictor (operating in the frequency domain) iteratively generates a set of prediction parameters p i (where index i is 1, 2, or 3) in response to the mixed content input signal and a single input speech signal indicating the speech content of the mixed content input signal.

根据混合内容输入信号的每个通道的分块(具有相同的索引n和索引b的每个分块)所重构的语音值是由每个通道的权重参数所控制的混合内容信号的每个通道(i=1,2或3)的内容(Mci,n,b)的线性组合。这些权重参数是具有相同的索引n和b的分块的预测参数pi。因此,根据混合内容信号的所有通道的所有片重构的语音是:The speech value reconstructed from each channel block of the mixed content input signal (each block with the same index n and index b) is a linear combination of the content (M ci,n,b ) of each channel (i=1, 2, or 3) of the mixed content signal, controlled by the weight parameters of each channel. These weight parameters are the prediction parameters p i for the blocks with the same index n and b. Therefore, the speech reconstructed from all slices of all channels of the mixed content signal is:

Dr=p1·Mc1+p2·Mc2+p3·Mc3 (6)D r =p 1 ·M c1 +p 2 ·M c2 +p 3 ·M c3 (6)

或者以下面的信号矩阵形式:Or in the form of the following signal matrix:

Dr=PM (7)D r =PM (7)

例如,当语音在混合内容信号的多个通道中相干地呈现而背景(非语音)声音在通道之间不相干时,通道的相加组合将有利于语音的能量。与通道独立重构相比,对于两个通道,这将导致3dB更好的语音分离。作为另一个示例,当语音内容在一个通道中呈现并且背景声音在多个通道中相干呈现时,通道的相减组合将(部分地)消除背景声音,而保留语音。For example, when speech is coherently present in multiple channels of a mixed content signal and background (non-speech) sounds are incoherent between channels, the additive combination of the channels will favor the energy of the speech. Compared to independent reconstruction of the channels, this will result in 3dB better speech separation for two channels. As another example, when speech content is present in one channel and background sounds are coherently present in multiple channels, the subtractive combination of the channels will (partially) eliminate the background sounds while preserving the speech.
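The least-squares solution behind formulas (6) and (7) can be sketched as follows. This is an illustrative reconstruction only: the names and array shapes are assumptions, and `np.linalg.lstsq` stands in for whatever MMSE solver an actual implementation uses:

```python
import numpy as np

def mmse_prediction_params(D, M_ch):
    """Least-squares estimate of the per-channel prediction parameters
    p_i minimising |D - sum_i p_i * M_ci|^2 over one parameter block.

    D:    vector of speech STFT values in the block, shape (k,)
    M_ch: matrix of mixed-content values, shape (n_channels, k),
          one row per channel.
    """
    # Solving the normal equations of the MSE criterion via lstsq
    # for numerical robustness.
    p, *_ = np.linalg.lstsq(M_ch.T, D, rcond=None)
    return p

def reconstruct_from_channels(p, M_ch):
    # Dr = p1*Mc1 + p2*Mc2 + p3*Mc3, i.e. formula (6) above.
    return p @ M_ch
```

With coherent speech across channels, the solver naturally produces the additive (or subtractive) channel combinations described in the paragraph above.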

在一类实施方式中,本发明方法包括以下步骤:(a)接收指示包括具有未增强的波形的语音以及其他音频内容的音频节目的比特流,其中,比特流包括:指示语音内容和其他音频内容的未增强的音频数据;指示语音的降低品质版本的波形数据,其中,语音的降低品质版本具有与未增强的波形相似(例如,至少基本上相似)的第二波形,并且如果单独地被试听则降低品质版本将具有令人讨厌的品质;以及参数数据,其中,参数数据与未增强音频数据一起确定参数重构语音,并且该参数重构语音是至少基本上与语音匹配(例如,是良好近似)的、语音的参数重构版本;以及(b)响应于混和指示符对比特流执行语音增强,从而生成指示语音增强音频节目的数据,包括通过将未增强的音频数据与根据波形数据所确定的低品质语音数据和重构语音数据的组合进行组合,其中,该组合由混和指示符(例如,该组合具有由混和指示符的当前值序列所确定的状态序列)确定,重构的语音数据响应于参数数据中的至少一些以及未增强音频数据中的至少一些而生成,与通过将仅低品质语音数据与未增强的音频数据进行组合确定的纯波形编码语音增强音频节目或者根据参数数据和未增强的音频数据所确定的纯参数编码语音增强音频节目相比,语音增强音频节目具有不太听得见语音增强编码伪声(例如,更好地被掩蔽的语音增强编码伪声)。In one class of embodiments, the method of the present invention includes the following steps: (a) receiving a bitstream indicative of an audio program including speech having an unenhanced waveform and other audio content, wherein the bitstream includes: unenhanced audio data indicative of the speech content and the other audio content; waveform data indicative of a reduced-quality version of the speech, wherein the reduced-quality version of the speech has a second waveform that is similar (e.g., at least substantially similar) to the unenhanced waveform and would have an objectionable quality if listened to alone; and parameter data, wherein the parameter data, together with the unenhanced audio data, determines parametrically reconstructed speech, and the parametrically reconstructed speech is a parametrically reconstructed version of the speech that at least substantially matches (e.g., is a good approximation of) the speech; and (b) performing speech enhancement on the bitstream in response to a blend indicator, thereby generating data indicative of a speech-enhanced audio program, including by combining the unenhanced audio data with a combination of low-quality speech data determined from the waveform data and reconstructed speech data, wherein the combination is determined by the blend indicator (e.g., the combination has a sequence of states determined by a sequence of current values of the blend indicator), the reconstructed speech data being generated in response to at least some of the parameter data and at least some of the unenhanced audio data, and the speech-enhanced audio program having less audible speech-enhancement coding artifacts (e.g., better-masked speech-enhancement coding artifacts) than a purely waveform-coded speech-enhanced audio program determined by combining only the low-quality speech data with the unenhanced audio data, or a purely parametric-coded speech-enhanced audio program determined from the parameter data and the unenhanced audio data.

在一些实施方式中,混和指示符(其可以具有值序列,例如针对比特流片段序列中的每一个的一个值序列)被包括在步骤(a)中所接收的比特流中。在其他实施方式中,混和指示符响应于比特流而生成(例如,在接收比特流并且对比特流进行解码的接收器中)。In some embodiments, the blend indicator (which may have a sequence of values, e.g., one for each of the sequence of bitstream segments) is included in the bitstream received in step (a). In other embodiments, the blend indicator is generated in response to the bitstream (e.g., in a receiver that receives and decodes the bitstream).

应当理解,表达“混和指示符”并不意在表示比特流的每个片段的单个参数或值(或者单个参数或值序列)。相反地,可以想到,在一些实施方式中,(比特流的片段的)混和指示符可以是两个或更多个参数或值的集合(例如,对于每个片段,参数编码增强控制参数和波形编码增强控制参数)。在一些实施方式中,每个片段的混和指示符可以是指示按片段的每个频带进行混和的值序列。It should be understood that the expression "blend indicator" is not intended to mean a single parameter or value (or a single sequence of parameters or values) for each segment of the bitstream. Rather, it is contemplated that in some embodiments, the blend indicator (of a segment of the bitstream) may be a set of two or more parameters or values (e.g., a parametric-coded enhancement control parameter and a waveform-coded enhancement control parameter for each segment). In some embodiments, the blend indicator for each segment may be a sequence of values indicating blending per frequency band of the segment.

无需为(例如,被包括在)比特流的每个片段设置波形数据和参数数据,或者无需被用于对比特流的每个片段执行语音增强。例如,在一些情况下,至少一个片段可以包括仅波形数据(以及由每个这样的片段的混和指示符所确定的组合可以包括仅波形数据)并且至少一个另外的片段可以包括仅参数数据(以及由每个这样的片段的混和指示符所确定的组合可以包括仅重构语音数据)。Waveform data and parameter data need not be provided for (e.g., included in) every segment of a bitstream, or need not be used to perform speech enhancement on every segment of a bitstream. For example, in some cases, at least one segment may include only waveform data (and the combination determined by the blend indicator for each such segment may include only waveform data) and at least one other segment may include only parameter data (and the combination determined by the blend indicator for each such segment may include only reconstructed speech data).

可以想到,在一些实施方式中,编码器生成比特流,包括通过对未增强音频数据而非波形数据或参数数据进行编码(例如,压缩)。因此,当比特流被递送至接收器时,接收器将对比特流进行解析以提取未增强的音频数据、波形数据以及参数数据(如果其在比特流中被递送,则以及混和指示符),但是将对仅未增强的音频数据进行解码。在不对波形数据或参数数据应用与对音频数据应用的解码处理相同的解码处理的情况下,接收器将对经解码的、未增强的音频数据(使用波形数据和/或参数数据)执行语音增强。It is contemplated that in some embodiments, the encoder generates a bitstream including by encoding (e.g., compressing) the unenhanced audio data rather than the waveform data or parameter data. Thus, when the bitstream is delivered to a receiver, the receiver will parse the bitstream to extract the unenhanced audio data, waveform data, and parameter data (and the blend indicator, if delivered in the bitstream), but will decode only the unenhanced audio data. The receiver will perform speech enhancement on the decoded, unenhanced audio data (using the waveform data and/or parameter data) without applying the same decoding processing to the waveform data or parameter data as was applied to the audio data.

通常,波形数据与重构语音数据的组合(由混和指示符所指示)随时间而变化,具有与比特流的相对应的片段的语音内容和其他音频内容有关的每个组合状态。混和指示符被生成为:使得(波形数据和重构语音数据的)当前组合状态由比特流的相应片段中的语音内容和其他音频内容(例如,语音内容的功率与其他音频内容的功率的比)的信号特性确定。Typically, the combination of waveform data and reconstructed speech data (indicated by the mixing indicator) changes over time, with each combination state being related to the speech content and other audio content of the corresponding segment of the bitstream. The mixing indicator is generated such that the current combination state (of the waveform data and the reconstructed speech data) is determined by signal characteristics of the speech content and other audio content (e.g., the ratio of the power of the speech content to the power of the other audio content) in the corresponding segment of the bitstream.

步骤(b)可以包括以下步骤:通过将低品质语音数据中的至少一些与比特流的至少一个片段的未增强的音频数据进行组合(例如,混合或混和)执行波形编码语音增强;以及通过将重构语音数据与比特流的至少一个片段的未增强的音频数据进行组合执行参数编码语音增强。通过将片段的低品质语音数据和重构语音数据两者与片段的未增强的音频数据进行混和对比特流的至少一个片段执行波形编码语音增强与参数编码语音增强的组合。在一些信号条件下,对比特流的片段(或者对多于一个片段中的每一个)执行(响应于混和指示符)波形编码语音增强和参数编码语音增强中的仅一个(而不是两者)。Step (b) may include the steps of: performing waveform coded speech enhancement by combining (e.g., mixing or blending) at least some of the low-quality speech data with unenhanced audio data for at least one segment of the bitstream; and performing parametric coded speech enhancement by combining reconstructed speech data with the unenhanced audio data for at least one segment of the bitstream. The combination of waveform coded speech enhancement and parametric coded speech enhancement is performed on at least one segment of the bitstream by blending both the low-quality speech data and the reconstructed speech data for the segment with the unenhanced audio data for the segment. Under some signal conditions, only one (but not both) of waveform coded speech enhancement and parametric coded speech enhancement is performed on the segment of the bitstream (or on each of more than one segment) (in response to the blending indicator).
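The per-segment combination described above can be sketched as follows (illustrative only; the function name, the weight semantics, and the `boost` factor are our assumptions, with the two weights understood as being conveyed by the blend indicator for the segment):

```python
import numpy as np

def enhance_segment(mix, low_quality_speech, reconstructed_speech,
                    w_waveform, w_parametric, boost=1.0):
    """Blend both speech contributions into the unenhanced mix of one
    segment. w_waveform / w_parametric are combination weights; either
    may be zero, in which case only one enhancement type is applied
    (pure waveform-coded or pure parametric-coded enhancement)."""
    return mix + boost * (w_waveform * low_quality_speech
                          + w_parametric * reconstructed_speech)
```

Setting one weight to zero per segment reproduces the switching behaviour, while non-zero weights for both reproduce the blended case.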

4.语音增强操作4. Voice enhancement operation

在本文中,“SNR”(信噪比)被用来表示音频节目的片段(或整个节目)的语音分量(即,语音内容)的功率(或水平)与该片段或节目的非语音分量(即,非语音内容)的功率(或水平)之比,或者与该片段或节目的整个(语音和非语音)内容的功率(或水平)之比。在一些实施方式中,根据音频信号(其要经历语音增强)以及指示该音频信号的语音内容的分立的信号(例如,已经为了在波形编码增强中使用而生成的语音内容的低品质复本)导出SNR。在一些实施方式中,根据音频信号(其要经历语音增强)并且根据参数数据(其已经为了在该音频信号的参数编码增强中使用而被生成)导出SNR。As used herein, "SNR" (signal-to-noise ratio) is used to refer to the ratio of the power (or level) of the speech component (i.e., speech content) of a segment of an audio program (or of the entire program) to the power (or level) of the non-speech component (i.e., non-speech content) of the segment or program, or to the power (or level) of the entire (speech and non-speech) content of the segment or program. In some embodiments, the SNR is derived from an audio signal (to undergo speech enhancement) and a separate signal indicative of the speech content of the audio signal (e.g., a low-quality copy of the speech content that has been generated for use in waveform-coded enhancement). In some embodiments, the SNR is derived from an audio signal (to undergo speech enhancement) and from parameter data (which has been generated for use in parametric-coded enhancement of the audio signal).

在一类实施方式中,本发明方法实现音频节目的片段的参数编码增强与波形编码增强之间的基于“盲”时间SNR的切换。在本上下文中,“盲”表示切换并不由(例如,本文中要描述的类型的)复杂听觉掩蔽模型感知地指引,而是由与节目的片段相对应的SNR值序列(混和指示符)指引。在该类的一种实施方式中,通过参数编码增强与波形编码增强(响应于混和指示符,例如,在图3的编码器的子系统29中所生成的混和指示符,其指示应当对相应音频数据执行仅参数编码增强或者波形编码增强)之间的时间切换实现混合编码语音增强,使得对执行了语音增强的音频节目的每个片段执行参数编码增强或者波形编码增强(而非参数编码增强和波形编码增强两者)。意识到在低SNR(对具有低SNR值的片段)的条件下波形编码增强表现得最好并且在良好的SNR(对具有高SNR值的片段)的条件下参数编码增强表现得最好,切换决定通常基于原始音频混合中的语音(对话)与剩余音频的比。In one class of embodiments, the present method implements "blind" temporal SNR-based switching between parametric-coded and waveform-coded enhancement for segments of an audio program. In this context, "blind" means that the switching is not perceptually guided by a complex auditory masking model (e.g., of the type described herein), but rather by a sequence of SNR values corresponding to the segments of the program (a blend indicator). In one embodiment of this class, hybrid coded speech enhancement is implemented by temporally switching between parametric-coded and waveform-coded enhancement (in response to a blend indicator, e.g., one generated in subsystem 29 of the encoder of FIG. 3, indicating that only parametric-coded or only waveform-coded enhancement should be performed on the corresponding audio data), such that each segment of the audio program for which speech enhancement is performed undergoes either parametric-coded or waveform-coded enhancement (but not both). Recognizing that waveform-coded enhancement performs best under low SNR conditions (for segments with low SNR values) and that parametric-coded enhancement performs best under good SNR conditions (for segments with high SNR values), the switching decision is typically based on the ratio of speech (dialogue) to the remaining audio in the original audio mix.

实现基于“盲”时间SNR的切换的实施方式通常包括以下步骤:将未增强的音频信号(原始音频混合)分割成连续时间片(片段),为每个片段确定片段的语音内容与其他音频内容之间(或者语音内容与总音频内容之间)的SNR;以及对于每个片段,将SNR与阈值进行比较并且当SNR大于阈值时为片段设置参数编码增强控制参数(即,片段的混和指示符指示应当执行参数编码增强),或者当SNR不大于阈值时为片段设置波形编码增强控制参数(即,片段的混和指示符指示应当执行波形编码增强)。Implementations of "blind" temporal SNR-based switching generally include the following steps: segmenting the unenhanced audio signal (the original audio mix) into consecutive time slices (segments), and determining for each segment the SNR between the segment's speech content and the other audio content (or between the speech content and the total audio content); and, for each segment, comparing the SNR with a threshold and setting a parametric-coded enhancement control parameter for the segment when the SNR is greater than the threshold (i.e., the segment's blend indicator indicates that parametric-coded enhancement should be performed), or setting a waveform-coded enhancement control parameter for the segment when the SNR is not greater than the threshold (i.e., the segment's blend indicator indicates that waveform-coded enhancement should be performed).
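A minimal sketch of the blind switching decision, assuming per-segment speech and non-speech signals are available at the encoder; the function names and the 0 dB threshold are hypothetical, not values from the patent:

```python
import numpy as np

def snr_db(speech, other, eps=1e-12):
    # Power ratio of the speech content to the other audio content
    # of one segment, expressed in dB.
    ps = np.mean(speech ** 2)
    pn = np.mean(other ** 2)
    return 10.0 * np.log10((ps + eps) / (pn + eps))

def blind_snr_switch(speech_segs, other_segs, threshold_db=0.0):
    """For each segment, emit 'parametric' when its SNR exceeds the
    threshold and 'waveform' otherwise -- the per-segment control
    parameter of the blind temporal-SNR switching scheme."""
    return ['parametric' if snr_db(s, o) > threshold_db else 'waveform'
            for s, o in zip(speech_segs, other_segs)]
```

The resulting labels would be carried as metadata with the unenhanced audio so the receiver can apply the indicated enhancement type per segment.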

当未增强音频信号与作为元数据所包括的控制参数一起被递送(例如,发送)至接收器时,接收器可以(对每个片段)执行由片段的控制参数所指示的语音增强的类型。因此,接收器对控制参数是参数编码增强控制参数的每个片段执行参数编码增强,并且对控制参数是波形编码增强控制参数的每个片段执行波形编码增强。When the unenhanced audio signal is delivered (e.g., transmitted) to a receiver along with the control parameters included as metadata, the receiver may perform (for each segment) the type of speech enhancement indicated by the control parameters for the segment. Thus, the receiver performs parametric coding enhancement for each segment for which the control parameters are parametric coding enhancement control parameters, and performs waveform coding enhancement for each segment for which the control parameters are waveform coding enhancement control parameters.

如果愿意承担传输(关于原始音频混合的每个片段)波形数据(用于实现波形编码语音增强)以及关于原始(未增强)混合的参数编码增强参数两者的成本,那么可以通过对混合的各个分量应用波形编码增强和参数编码增强两者实现较高程度的语音增强。因此,在一类实施方式中,本发明方法实现音频节目的片段的参数编码增强与波形编码增强之间的基于“盲”时间SNR的混和。同样,在此上下文中,“盲”表示混和并不由(例如,本文中要描述的类型的)复杂听觉掩蔽模型感知地指引,而是由与节目的片段相对应的SNR值序列指引。If one is willing to incur the cost of transmitting (for each segment of the original audio mix) both waveform data (for implementing waveform-coded speech enhancement) and parametric-coded enhancement parameters for the original (unenhanced) mix, then a higher degree of speech enhancement can be achieved by applying both waveform-coded and parametric-coded enhancement to individual segments of the mix. Thus, in one class of embodiments, the present method implements "blind" temporal SNR-based blending between parametric-coded and waveform-coded enhancement for segments of an audio program. Again, "blind" in this context means that the blending is not perceptually guided by a complex auditory masking model (e.g., of the type to be described herein), but rather by a sequence of SNR values corresponding to the segments of the program.

实现基于“盲”时间SNR的混和的实施方式通常包括以下步骤:将未增强音频信号(原始音频混合)分割成连续时间片(片段),并且为每个片段确定片段的语音内容与其他音频内容之间(或者语音内容与总音频内容之间)的SNR;确定(例如,接收请求)语音增强的总量(“T”);以及为每个片段设置混和控制参数,其中,混和控制参数的值由片段的SNR确定(是片段的SNR的函数)。Implementations of "blind" temporal SNR-based blending generally include the following steps: segmenting the unenhanced audio signal (the original audio mix) into consecutive time slices (segments), and determining for each segment the SNR between the segment's speech content and the other audio content (or between the speech content and the total audio content); determining (e.g., receiving a request for) a total amount of speech enhancement ("T"); and setting a blend control parameter for each segment, wherein the value of the blend control parameter is determined by (is a function of) the segment's SNR.

例如,音频节目的片段的混和指示符可以是在图3的编码器的子系统29中为片段所生成的混和指示符参数(或参数集合)。For example, the mixing indicator for a segment of an audio program may be a mixing indicator parameter (or set of parameters) generated for the segment in subsystem 29 of the encoder of FIG. 3 .

混和控制指示符可以是使得T=αPp+(1-α)Pw的每个片段的参数α,其中,Pw是下述的波形编码增强:如果使用针对片段所设置的波形数据将该波形编码增强应用于片段的未增强音频内容,则将产生预定的总增强量T(其中,片段的语音内容具有未增强的波形,片段的波形数据指示片段的语音内容的降低品质的版本,降低品质的版本具有与未增强波形相似(例如,至少基本上相似)的波形,当被单独地呈现和感知时,语音内容的降低品质的版本具有令人讨厌的品质),Pp是下述的参数编码增强:如果使用针对片段所设置的参数数据将该参数编码增强应用于片段的未增强音频内容,则将产生预定的总增强量T(其中,片段的参数数据与片段的未增强音频内容一起确定片段的语音内容的参数重构版本)。The blend control indicator may be a parameter α for each segment such that T = αPp + (1-α)Pw, where Pw is the waveform-coded enhancement that would produce the predetermined total amount of enhancement T if applied to the unenhanced audio content of the segment using the waveform data provided for the segment (wherein the speech content of the segment has an unenhanced waveform, the waveform data for the segment is indicative of a reduced-quality version of the speech content of the segment, the reduced-quality version having a waveform that is similar (e.g., at least substantially similar) to the unenhanced waveform, and the reduced-quality version of the speech content being of objectionable quality when rendered and perceived alone), and Pp is the parametric-coded enhancement that would produce the predetermined total amount of enhancement T if applied to the unenhanced audio content of the segment using the parameter data provided for the segment (wherein the parameter data for the segment, together with the unenhanced audio content of the segment, determine a parametrically reconstructed version of the speech content of the segment).

当未增强音频信号与作为元数据的控制参数一起被递送(例如,发送)至接收器时,接收器可以(对每个片段)执行由片段的控制参数所指示的混合语音增强。替代地,接收器根据未增强音频信号生成控制参数。When the unenhanced audio signal is delivered (e.g., sent) to a receiver along with the control parameters as metadata, the receiver can (for each segment) perform the hybrid speech enhancement indicated by the control parameters for the segment. Alternatively, the receiver generates the control parameters based on the unenhanced audio signal.

在一些实施方式中,接收器(对未增强音频信号的每个片段)执行参数编码增强Pp(由片段的参数α缩放)与波形编码增强Pw(由片段的值(1-α)缩放)的组合,使得所缩放的参数编码增强和所缩放的波形编码增强的组合生成如表达式(1)(T=αPp+(1-α)Pw)中的预定总量的增强。In some embodiments, the receiver performs (for each segment of the unenhanced audio signal) a combination of the parametric-coded enhancement Pp (scaled by the segment's parameter α) and the waveform-coded enhancement Pw (scaled by the segment's value (1-α)), such that the combination of the scaled parametric-coded enhancement and the scaled waveform-coded enhancement produces the predetermined total amount of enhancement as in expression (1) (T=αPp+(1-α)Pw).

片段的SNR与α之间的关系的示例如下:α是SNR的非递减函数,α的范围是0到1,当片段的SNR小于或等于阈值(“SNR_poor”)时,α的值为0,当SNR大于或等于较大阈值(“SNR_high”)时,α的值为1。当SNR良好时,α高,导致很大部分的参数编码增强。当SNR不良时,α低,导致很大部分的波形编码增强。应当选择饱和点的位置(SNR_poor和SNR_high)以调节波形编码增强算法和参数编码增强算法两者的具体实现。An example of the relationship between the SNR of a segment and α is as follows: α is a non-decreasing function of the SNR, and α ranges from 0 to 1, with the value of α being 0 when the SNR of the segment is less than or equal to a threshold ("SNR_poor") and the value of α being 1 when the SNR is greater than or equal to a larger threshold ("SNR_high"). When the SNR is good, α is high, resulting in a large portion of parametric coding enhancement. When the SNR is poor, α is low, resulting in a large portion of waveform coding enhancement. The locations of the saturation points (SNR_poor and SNR_high) should be selected to adjust the specific implementation of both the waveform coding enhancement algorithm and the parametric coding enhancement algorithm.
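The α mapping described above can be sketched as a piecewise-linear function of the segment SNR. The saturation-point values used here are placeholders only; as the text notes, they should be tuned to the specific waveform-coded and parametric-coded enhancement implementations:

```python
def alpha_from_snr(snr_db, snr_poor=-6.0, snr_high=6.0):
    """Non-decreasing mapping from segment SNR (in dB) to the blend
    parameter alpha: 0 at or below SNR_poor, 1 at or above SNR_high,
    and linear in between."""
    if snr_db <= snr_poor:
        return 0.0
    if snr_db >= snr_high:
        return 1.0
    return (snr_db - snr_poor) / (snr_high - snr_poor)
```

Low SNR thus drives the blend toward waveform-coded enhancement, and high SNR toward parametric-coded enhancement, matching the behaviour described in the paragraph above.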

在另一类实施方式中,要对音频信号的每个片段执行的波形编码增强和参数编码增强的组合由听觉掩蔽模型确定。在该类的一些实施方式中,要对音频节目的片段执行的波形编码增强和参数编码增强的混和的最佳混和比率使用刚好使编码噪声不变得听得见的最高波形编码增强量。In another class of embodiments, the combination of waveform coded enhancement and parametric coded enhancement to be performed on each segment of the audio signal is determined by an auditory masking model. In some embodiments of this class, an optimal mix ratio of the mix of waveform coded enhancement and parametric coded enhancement to be performed on the segments of the audio program uses the highest amount of waveform coded enhancement that just causes the coding noise to not become audible.

在上述基于盲SNR的混和实施方式中,从SNR获得片段的混和比率,SNR被假定成指示音频混合掩蔽下述编码噪声的能力:要为波形编码增强所使用的语音的降低品质版本(复本)中的编码噪声。基于盲SNR方法的优点是实现的简单性以及编码器处的低计算负荷。然而,SNR是编码噪声将在多大程度上被掩蔽的不可靠的预测器,因此必须应用大的安全裕度以确保编码噪声将一直仍然被掩蔽。这意味着至少在一些时间内,被混和的降低品质语音复本的水平低于其本来能够达到的水平;或者,如果将裕度设置得较严格,则有时编码噪声会变得听得见。当通过使用更准确地预测降低品质语音复本中的编码噪声如何被主要节目的音频混合掩蔽并且据此选择混和比率的听觉掩蔽模型确保编码噪声不变得听得见时,可以增大本发明的混和编码方案中的波形编码增强的贡献。In the blind SNR-based blending implementations described above, the blend ratio of a segment is derived from the SNR, with the SNR assumed to indicate the ability of the audio mix to mask the coding noise in the reduced-quality version (copy) of the speech to be used for waveform-coded enhancement. The advantages of the blind SNR-based approach are simplicity of implementation and a low computational load at the encoder. However, the SNR is an unreliable predictor of how well the coding noise will be masked, and a large safety margin must be applied to ensure that the coding noise will remain masked at all times. This means that at least some of the time the level of the blended reduced-quality speech copy is lower than it could be, or, if the margin is set more tightly, that the coding noise sometimes becomes audible. The contribution of waveform-coded enhancement in the hybrid coding scheme of the present invention can be increased by using an auditory masking model that more accurately predicts how the coding noise in the reduced-quality speech copy is masked by the audio mix of the main program, and by selecting the blend ratio accordingly, thereby ensuring that the coding noise does not become audible.

使用听觉掩蔽模型的特定实施方式包括以下步骤:将未增强音频信号(原始音频混合)分割成连续时间片(片段),设置每个片段中的语音的降低品质复本(用于在波形编码增强中使用)以及每个片段的参数编码增强参数(用于在参数编码增强中使用);使用听觉掩蔽模型针对片段中的每一个来确定可以被应用但伪声不变得听得见的最大波形编码增强量;以及生成波形编码增强(以不超过使用听觉掩蔽模型针对片段所确定的最大波形编码增强量以及优选地至少基本上与使用听觉掩蔽模型针对片段所确定的最大波形编码增强量匹配的量)和参数编码增强的组合的混和指示符(未增强音频信号的每个片段的),使得波形编码增强与参数编码增强的组合生成片段的预定语音增强总量。A specific embodiment using an auditory masking model includes the steps of: segmenting an unenhanced audio signal (the original audio mixture) into consecutive time slices (segments), setting a reduced quality replica of the speech in each segment (for use in waveform coded enhancement) and parametric coded enhancement parameters for each segment (for use in parametric coded enhancement); determining, for each of the segments, using the auditory masking model, a maximum amount of waveform coded enhancement that can be applied without artifacts becoming audible; and generating a mixed indicator (for each segment of the unenhanced audio signal) of a combination of waveform coded enhancement (in an amount that does not exceed the maximum amount of waveform coded enhancement determined for the segment using the auditory masking model and preferably at least substantially matches the maximum amount of waveform coded enhancement determined for the segment using the auditory masking model) and parametric coded enhancement, such that the combination of waveform coded enhancement and parametric coded enhancement produces a predetermined total amount of speech enhancement for the segment.
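Assuming the masking model supplies a per-band masking threshold Θ(f, t) for a segment and a per-band power estimate of the coding noise in the reduced-quality speech copy, the largest inaudible waveform-coded contribution can be sketched as below. Both inputs and the function name are our assumptions; the actual masking model (element 11 of FIG. 7) is not specified here:

```python
import numpy as np

def max_waveform_gain(noise_power, masking_threshold):
    """Largest scale factor g for the reduced-quality speech copy such
    that the scaled coding noise g^2 * N(f, t) stays at or below the
    masking threshold Theta(f, t) in every band, so that the coding
    artifacts are predicted to remain inaudible. Both arguments are
    per-band power spectra for one segment."""
    ratios = masking_threshold / np.maximum(noise_power, 1e-12)
    return float(np.sqrt(np.min(ratios)))  # limited by the worst band
```

The blend indicator for the segment would then be chosen so the waveform-coded contribution does not exceed this gain, with parametric-coded enhancement supplying the remainder of the target enhancement.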

In some embodiments, each such blend indicator is included (e.g., by an encoder) in a bitstream that also includes encoded audio data indicative of the unenhanced audio signal. For example, subsystem 29 of encoder 20 of FIG. 3 may be configured to generate such blend indicators, and subsystem 28 of encoder 20 may be configured to include the blend indicators in the bitstream to be output from encoder 20. For another example, blend indicators may be generated (e.g., in subsystem 13 of the FIG. 7 encoder) from the gmax(t) parameters generated by subsystem 14 of the FIG. 7 encoder, and subsystem 13 of the FIG. 7 encoder may be configured to include the blend indicators in the bitstream to be output from the FIG. 7 encoder (or subsystem 13 may include the gmax(t) parameters generated by subsystem 14 in the bitstream to be output from the FIG. 7 encoder, and a receiver that receives and parses the bitstream may be configured to generate the blend indicators in response to the gmax(t) parameters).

Optionally, the method also includes the step of performing, on each segment of the unenhanced audio signal, in response to the blend indicator for the segment, the combination of waveform-coded enhancement and parametric-coded enhancement determined by the blend indicator, such that the combination of waveform-coded enhancement and parametric-coded enhancement generates the predetermined total amount of speech enhancement for the segment.

An example of an embodiment of the inventive method employing an auditory masking model will be described with reference to FIG. 7. In this example, a mix A(t) of speech and background audio (the unenhanced audio mix) is determined (in element 10 of FIG. 7) and passed to an auditory masking model (implemented by element 11 of FIG. 7) that predicts a masking threshold Θ(f,t) for each segment of the unenhanced audio mix. The unenhanced audio mix A(t) is also provided to encoding element 13 for encoding for transmission.

The masking threshold generated by the model indicates, as a function of frequency and time, the auditory excitation that any signal must exceed in order to be audible. Such masking models are well known in the art. The speech component s(t) of each segment of the unenhanced audio mix A(t) is encoded (by low-bitrate audio coder 15) to generate a reduced-quality copy s'(t) of the segment's speech content. The reduced-quality copy s'(t) (which comprises fewer bits than the original speech s(t)) can be conceptualized as the sum of the original speech s(t) and coding noise n(t). The coding noise can be separated from the reduced-quality copy for analysis by subtracting (in element 16) the time-aligned speech signal s(t) from the reduced-quality copy. Alternatively, the coding noise may be directly available from the audio coder.
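The decomposition s'(t) = s(t) + n(t) can be demonstrated with a toy example. The "codec" below is a crude uniform quantizer standing in for a real low-bitrate speech coder such as CELP; it is illustrative only, but it shows how the coding noise is isolated by subtracting the time-aligned original speech from the reduced-quality copy.

```python
import numpy as np

def crude_codec(s, step=0.25):
    """Stand-in low-bitrate coder: uniform quantization is the noise source."""
    return np.round(s / step) * step

t = np.linspace(0.0, 1.0, 1000)
s = np.sin(2 * np.pi * 5 * t)   # original speech segment s(t)
s_prime = crude_codec(s)        # reduced-quality copy s'(t)
n = s_prime - s                 # coding noise n(t), as in element 16
```

By construction the noise magnitude is bounded by half the quantizer step, and adding n(t) back to s(t) exactly reproduces the reduced-quality copy.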

The coding noise n(t) is multiplied in element 17 by a scaling factor g(t), and the scaled coding noise is passed to an auditory model (implemented by element 18) that predicts the auditory excitation N(f,t) generated by the scaled coding noise. Such excitation models are known in the art. In a final step, the auditory excitation N(f,t) is compared with the predicted masking threshold Θ(f,t), and the maximum scaling factor gmax(t) that ensures the coding noise is masked, i.e., the maximum value of g(t) for which N(f,t) < Θ(f,t), is found (in element 14). If the auditory model is nonlinear, this may need to be done iteratively (as shown in FIG. 2), by iterating over the value of g(t) applied to the coding noise n(t) in element 17; if the auditory model is linear, it may be done in a simple feed-forward step. The resulting scaling factor gmax(t) is the maximum scaling factor that can be applied to the reduced-quality speech copy s'(t) before the coding artifacts in the scaled, reduced-quality speech copy become audible in the mix of the scaled, reduced-quality speech copy gmax(t)·s'(t) with the unenhanced audio mix A(t), when the scaled copy is added to the corresponding segment of the unenhanced audio mix.
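The iterative search in element 14 can be sketched under drastically simplified stand-in models: here the "excitation" of the scaled noise is just its per-band power, and the masking threshold is a fixed per-band value. A real system would use a psychoacoustic excitation model; this sketch only illustrates the search for the largest g that keeps every band masked.

```python
import numpy as np

def find_g_max(noise_bands, thresholds, g_grid=np.linspace(0.0, 4.0, 401)):
    """Largest g such that the excitation of g*n stays below the masking
    threshold in every band (grid search over candidate gains)."""
    g_max = 0.0
    for g in g_grid:
        excitation = (g ** 2) * noise_bands  # power scales with g^2
        if np.all(excitation < thresholds):
            g_max = g
        else:
            break  # gains are tried in increasing order
    return g_max

# Illustrative two-band noise and threshold values:
g_max = find_g_max(noise_bands=np.array([0.10, 0.05]),
                   thresholds=np.array([0.40, 0.40]))
```

With these numbers the first band is binding: g² · 0.10 < 0.40 requires g < 2, so the search stops just below 2.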

The FIG. 7 system also includes element 12, which is configured to generate (in response to the unenhanced audio mix A(t) and the speech s(t)) the parametric-coded enhancement parameters p(t) for performing parametric-coded speech enhancement on each segment of the unenhanced audio mix.

The parametric-coded enhancement parameters p(t) for each segment of the audio program, as well as the reduced-quality speech copy s'(t) generated in coder 15 and the factor gmax(t) generated in element 14, are also provided to encoding element 13. Element 13 generates an encoded audio bitstream indicative of, for each segment of the audio program, the unenhanced audio mix A(t), the parametric-coded enhancement parameters p(t), the reduced-quality speech copy s'(t), and the factor gmax(t), and this encoded audio bitstream may be transmitted or otherwise delivered to a receiver.

In the example, speech enhancement is performed on each segment of the unenhanced audio mix A(t) as follows (e.g., in a receiver to which the encoded output of element 13 has been delivered), to apply a predetermined (e.g., requested) total amount of enhancement T using the segment's scaling factor gmax(t). The encoded audio program is decoded to extract, for each segment of the audio program, the unenhanced audio mix A(t), the parametric-coded enhancement parameters p(t), the reduced-quality speech copy s'(t), and the factor gmax(t). For each segment, the waveform-coded enhancement Pw is determined to be the waveform-coded enhancement that would produce the predetermined total amount of enhancement T if applied to the segment's unenhanced audio content using the segment's reduced-quality speech copy s'(t). The parametric-coded enhancement Pp is determined to be the parametric-coded enhancement that would produce the predetermined total amount of enhancement T if applied to the segment's unenhanced audio content using the parametric data provided for the segment (where the segment's parametric data, with respect to the segment's unenhanced audio content, determine a parametrically reconstructed version of the segment's speech content). For each segment, a combination of parametric-coded enhancement (in an amount scaled by the segment's parameter α2) and waveform-coded enhancement (in an amount determined by the segment's value α1) is performed, such that the combination of parametric-coded enhancement and waveform-coded enhancement generates the predetermined total amount of enhancement using the maximum amount of waveform-coded enhancement allowed by the model: T = α1·Pw + α2·Pp, where the factor α1 is the maximum value, not exceeding the segment's gmax(t), that enables the indicated equation (T = α1·Pw + α2·Pp) to be satisfied, and the parameter α2 is the minimum non-negative value that enables the indicated equation to be satisfied.
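The per-segment solve for α1 and α2 can be sketched as follows. This is a simplified illustration: treating Pw, Pp, and gmax(t) as plain scalars (and the enhancements as additive) is an assumption for the example, not the patent's implementation.

```python
# Sketch of the decode-side split T = a1*Pw + a2*Pp: a1 is made as large as
# the segment's g_max allows, and a2 covers whatever remains.

def split_enhancement(T, Pw, Pp, g_max):
    a1 = min(g_max, T / Pw) if Pw > 0 else 0.0       # max waveform share
    a2 = max(0.0, (T - a1 * Pw) / Pp) if Pp > 0 else 0.0  # min parametric share
    return a1, a2

a1, a2 = split_enhancement(T=1.0, Pw=1.0, Pp=1.0, g_max=0.7)
```

With a masking-limited g_max of 0.7, waveform-coded enhancement contributes 70% and parametric enhancement tops up the remaining 30%, exactly meeting T.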

In an alternative embodiment, the artifacts of the parametric-coded enhancement are included in the assessment (performed by the auditory masking model), so that the coding artifacts (due to the waveform-coded enhancement) are allowed to become audible when they are favorable over the artifacts of the parametric-coded enhancement.

In variations on the FIG. 7 embodiment (and embodiments similar to the FIG. 7 embodiment that employ an auditory masking model), sometimes referred to as auditory-model-guided multiband splitting embodiments, the relationship between the waveform-coded enhancement coding noise N(f,t) of the reduced-quality speech copy and the masking threshold Θ(f,t) may not be uniform across all frequency bands. For example, the spectral characteristics of the waveform-coded enhancement coding noise may be such that in a first frequency region the noise is about to exceed the masking threshold, while in a second frequency region it is far below the masking threshold. In the FIG. 7 embodiment, the maximum contribution of waveform-coded enhancement is determined by the coding noise in the first frequency region, and the maximum scaling factor g that can be applied to the reduced-quality speech copy is determined by the coding noise and masking characteristics in the first frequency region. This is smaller than the maximum scaling factor g that could be applied if the determination were based only on the second frequency region. Overall performance can be improved if the principle of temporal blending is applied separately in the two frequency regions.

In one implementation of auditory-model-guided multiband splitting, the unenhanced audio signal is divided into M contiguous, non-overlapping frequency bands, and the principle of temporal blending (i.e., hybrid speech enhancement using a blend of waveform-coded and parametric-coded enhancement in accordance with embodiments of the invention) is applied independently in each of the M bands. An alternative implementation divides the spectrum into a low band below a cutoff frequency fc and a high band above the cutoff frequency fc. The low band is always enhanced using waveform-coded enhancement, and the high band is always enhanced using parametric-coded enhancement. The cutoff frequency varies over time and is always chosen to be as high as possible under the constraint that the waveform-coded enhancement coding noise at the predetermined total amount of speech enhancement T is below the masking threshold. In other words, the maximum cutoff frequency at any time is:

max(fc | T·N(f&lt;fc, t) &lt; Θ(f, t))    (8)
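The cutoff selection of equation (8) can be sketched as a scan over candidate band edges. The noise and threshold spectra below are illustrative stand-in values; the sketch merely returns the highest cutoff such that the waveform-coding noise, scaled to the total enhancement T, stays below the masking threshold in every band under the cutoff.

```python
import numpy as np

def max_cutoff(noise, threshold, freqs, T):
    """Highest f_c with T*N(f, t) < Theta(f, t) for all bands below f_c."""
    fc = 0.0  # 0.0 means no band can use waveform-coded enhancement
    for i in range(1, len(freqs) + 1):
        if np.all(T * noise[:i] < threshold[:i]):
            fc = freqs[i - 1]
        else:
            break
    return fc

freqs = np.array([500.0, 1000.0, 2000.0, 4000.0, 8000.0])   # band edges (Hz)
noise = np.array([0.01, 0.02, 0.05, 0.20, 0.30])             # stand-in N(f,t)
thresh = np.array([0.10, 0.10, 0.10, 0.10, 0.10])            # stand-in Theta
fc = max_cutoff(noise, thresh, freqs, T=1.5)
```

Here T·N first exceeds the threshold in the 4 kHz band, so the cutoff settles at 2 kHz: waveform coding below, parametric enhancement above.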

The embodiments described above have assumed that the available means of preventing waveform-coded enhancement coding artifacts from becoming audible are adjusting the blending ratio (between waveform-coded and parametric-coded enhancement) or scaling back the total amount of enhancement. An alternative approach controls the amount of waveform-coded enhancement coding noise through a variable allocation of the bitrate used to generate the reduced-quality speech copy. In an example of this alternative embodiment, a constant base amount of parametric-coded enhancement is applied, and additional waveform-coded enhancement is applied to reach the desired (predetermined) total amount of enhancement. The reduced-quality speech copy is coded at a variable bitrate, and this bitrate is selected as the lowest bitrate that keeps the waveform-coded enhancement coding noise below the masking threshold of the parametrically enhanced main audio.

In some embodiments, the audio program whose speech content is to be enhanced in accordance with the invention includes speaker channels but does not include any object channel. In other embodiments, the audio program whose speech content is to be enhanced in accordance with the invention is an object-based audio program (typically a multichannel object-based audio program) comprising at least one object channel and, optionally, also at least one speaker channel.

Other aspects of the invention include: an encoder configured to perform any embodiment of the inventive encoding method to generate an encoded audio signal in response to an audio input signal (e.g., in response to audio data indicative of a multichannel audio input signal); a decoder configured to decode such an encoded signal and perform speech enhancement on the decoded audio content; and a system including such an encoder and such a decoder. The FIG. 3 system is an example of such a system.

The FIG. 3 system includes encoder 20, which is configured (e.g., programmed) to perform an embodiment of the inventive encoding method to generate an encoded audio signal in response to audio data indicative of an audio program. Typically, the program is a multichannel audio program. In some embodiments, the multichannel audio program comprises only speaker channels. In other embodiments, the multichannel audio program is an object-based audio program comprising at least one object channel and, optionally, also at least one speaker channel.

The audio data include data indicative of mixed audio content (a mix of speech and non-speech content), identified as "mixed audio" data in FIG. 3, and data indicative of the speech content of the mixed audio content, identified as "speech" data in FIG. 3.

The speech data undergo a time-domain-to-frequency-domain (QMF) transform in stage 21, and the resulting QMF components are provided to enhancement parameter generation element 23. The mixed audio data undergo a time-domain-to-frequency-domain (QMF) transform in stage 22, and the resulting QMF components are provided to element 23 and to encoding subsystem 27.

The speech data are also provided to subsystem 25, which is configured to generate waveform data indicative of a low-quality copy of the speech data (sometimes referred to herein as a "reduced-quality" or "low-quality" speech copy) for use in waveform-coded speech enhancement of the mixed (speech and non-speech) content determined by the mixed audio data. The low-quality speech copy comprises fewer bits than the original speech data, has objectionable quality when rendered and perceived in isolation, and, when rendered, is indicative of speech having a waveform similar (e.g., at least substantially similar) to the waveform of the speech indicated by the original speech data. Methods of implementing subsystem 25 are known in the art. Examples are code-excited linear prediction (CELP) speech coders such as AMR and G729.1, typically operated at low bitrates (e.g., 20 kbps), or modern hybrid coders such as MPEG Unified Speech and Audio Coding (USAC). Alternatively, frequency-domain coders may be used; examples include Siren (G722.1), MPEG 2 Layer II/III, and MPEG AAC.

Hybrid speech enhancement performed in accordance with typical embodiments of the invention (e.g., in subsystem 43 of decoder 40) includes the step of performing, on the waveform data, the inverse of the encoding performed (e.g., in subsystem 25 of encoder 20) to generate the waveform data, thereby recovering the low-quality copy of the speech content of the mixed audio signal to be enhanced. The recovered low-quality copy of the speech is then used (together with the parametric data, and data indicative of the mixed audio signal) to perform the remaining steps of the speech enhancement.

Element 23 is configured to generate parametric data in response to the data output from stages 21 and 22. The parametric data, together with the original mixed audio data, determine parametrically constructed speech that is a parametrically reconstructed version of the speech indicated by the original speech data (i.e., the speech content of the mixed audio data). The parametrically reconstructed version of the speech at least substantially matches (e.g., is a good approximation of) the speech indicated by the original speech data. The parametric data determine a set of parametric-coded enhancement parameters p(t) for performing parametric-coded speech enhancement on each segment of the unenhanced mixed content determined by the mixed audio data.

Blend indicator generation element 29 is configured to generate a blend indicator ("BI") in response to the data output from stages 21 and 22. It is contemplated that the audio program indicated by the bitstream output from encoder 20 will undergo hybrid speech enhancement (e.g., in decoder 40) to determine a speech-enhanced audio program, including by combining the unenhanced audio data of the original program with a combination of the low-quality speech data (determined from the waveform data) and the parametric data. The blend indicator determines such a combination (e.g., the combination has a sequence of states determined by a sequence of current values of the blend indicator), such that the speech-enhanced audio program has fewer audible speech-enhancement coding artifacts (e.g., speech-enhancement coding artifacts that are better masked) than either a purely waveform-coded speech-enhanced audio program determined by combining only the low-quality speech data with the unenhanced audio data, or a purely parametric-coded speech-enhanced audio program determined by combining only the parametrically constructed speech with the unenhanced audio data.

In variations on the FIG. 3 embodiment, the blend indicator employed for the inventive hybrid speech enhancement is not generated in the inventive encoder (and is not included in the bitstream output from the encoder), but is instead generated (e.g., in a variation on receiver 40) in response to the bitstream output from the encoder (which bitstream does include the waveform data and the parametric data).

It should be understood that the expression "blend indicator" is not intended to denote a single parameter or value (or a single sequence of parameters or values) for each segment of the bitstream. Rather, it is contemplated that in some embodiments a blend indicator (for a segment of the bitstream) may be a set of two or more parameters or values (e.g., for each segment, a parametric-coded enhancement control parameter and a waveform-coded enhancement control parameter).

Encoding subsystem 27 generates encoded audio data indicative of the audio content of the mixed audio data (typically, a compressed version of the mixed audio data). Encoding subsystem 27 typically implements the inverse of the transform performed in stage 22, as well as other encoding operations.

Formatting stage 28 is configured to assemble the parametric data output from element 23, the waveform data output from element 25, the blend indicator generated in element 29, and the encoded audio data output from subsystem 27 into an encoded bitstream indicative of the audio program. The bitstream (which in some implementations may have E-AC-3 or AC-3 format) includes the unencoded parametric data, waveform data, and blend indicator.

The encoded audio bitstream (the encoded audio signal) output from encoder 20 is provided to delivery subsystem 30. Delivery subsystem 30 is configured to store the encoded audio signal generated by encoder 20 (e.g., to store data indicative of the encoded audio signal) and/or to transmit the encoded audio signal.

Decoder 40 is coupled and configured (e.g., programmed) to: receive the encoded audio signal from subsystem 30 (e.g., by reading or retrieving data indicative of the encoded audio signal from storage in subsystem 30, or by receiving the encoded audio signal as transmitted by subsystem 30); decode data indicative of the mixed (speech and non-speech) audio content of the encoded audio signal; and perform hybrid speech enhancement on the decoded mixed audio content. Decoder 40 is typically configured to generate and output (e.g., to a rendering system, not shown in FIG. 3) a speech-enhanced decoded audio signal indicative of a speech-enhanced version of the mixed audio content input to encoder 20. Alternatively, it includes such a rendering system coupled to receive the output of subsystem 43.

Buffer 44 (a buffer memory) of decoder 40 stores (e.g., in a non-transitory manner) at least one segment (e.g., frame) of the encoded audio signal (bitstream) received by decoder 40. In typical operation, a sequence of segments of the encoded audio bitstream is provided to buffer 44 and passed from buffer 44 to deformatting stage 41.

Deformatting (parsing) stage 41 of decoder 40 is configured to parse the encoded bitstream from delivery subsystem 30, to extract from it the parametric data (generated by element 23 of encoder 20), the waveform data (generated by element 25 of encoder 20), the blend indicator (generated in element 29 of encoder 20), and the encoded mixed (speech and non-speech) audio data (generated in encoding subsystem 27 of encoder 20).

The encoded mixed audio data are decoded in decoding subsystem 42 of decoder 40, and the resulting decoded mixed (speech and non-speech) audio data are provided to hybrid speech enhancement subsystem 43 (and are optionally output from decoder 40 without undergoing speech enhancement).

In response to control data (including the blend indicator) extracted from the bitstream by stage 41 (or generated in stage 41 in response to metadata included in the bitstream), and in response to the parametric data and waveform data extracted by stage 41, speech enhancement subsystem 43 performs hybrid speech enhancement, in accordance with an embodiment of the invention, on the decoded mixed (speech and non-speech) audio data from decoding subsystem 42. The speech-enhanced audio signal output from subsystem 43 is indicative of a speech-enhanced version of the mixed audio content input to encoder 20.

In various implementations of encoder 20 of FIG. 3, subsystem 23 may generate any of the described examples of the prediction parameters pi, for each block of each channel of the mixed audio input signal, for use in reconstructing the speech component of the decoded mixed audio signal (e.g., in decoder 40).

Using a speech signal indicative of the speech content of the decoded mixed audio signal (e.g., the low-quality copy of the speech generated by subsystem 25 of encoder 20, or a reconstruction of the speech content generated using the prediction parameters pi generated by subsystem 23 of encoder 20), speech enhancement may be performed (e.g., in subsystem 43 of decoder 40 of FIG. 3) by mixing the speech signal with the decoded mixed audio signal. The amount of speech enhancement can be controlled by applying a gain to the speech being added (mixed in). For a 6 dB enhancement, the speech can be added with 0 dB gain (assuming the speech in the speech-enhanced mix has the same level as the transmitted or reconstructed speech signal). The speech-enhanced signal is:

Me = M + g·Dr    (9)

In some embodiments, to achieve a speech enhancement gain of G, the following mixing gain is applied:

g = 10^(G/20) − 1    (10)
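A quick worked check of equations (9) and (10): for a target enhancement of G = 6 dB, the mixing gain g comes out very close to 1 (i.e., roughly 0 dB), so adding the speech at its original level approximately doubles the speech amplitude in the mix, which is about +6 dB.

```python
def mixing_gain(G_db):
    """Equation (10): mixing gain g for a speech enhancement gain of G dB."""
    return 10 ** (G_db / 20) - 1

g = mixing_gain(6.0)                       # close to 1.0, i.e. ~0 dB
speech_amplitude = 1.0
enhanced = speech_amplitude + g * speech_amplitude   # equation (9), speech term
```

The enhanced speech amplitude lands just under 2x, consistent with the text's observation that a 6 dB enhancement corresponds to adding the speech with 0 dB gain.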

In the case of independent channel speech reconstruction, the speech-enhanced mix Me is obtained as:

Me = M·(1 + diag(P)·g)    (11)
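Equation (11) can be illustrated with a single frequency bin of a three-channel mix: each channel is simply boosted by a factor 1 + pᵢ·g, where pᵢ is that channel's prediction parameter. The numeric values below are illustrative, not derived from the patent.

```python
import numpy as np

M = np.array([0.5, 1.0, 0.25])   # one bin of a 3-channel mix (L, C, R)
p = np.array([0.1, 0.8, 0.1])    # per-channel prediction parameters
g = 1.0                          # mixing gain from equation (10)

# Element-wise form of Me = M * (1 + diag(P) * g):
M_e = M * (1.0 + p * g)
```

Channels with a large prediction parameter (here the center, carrying most of the speech) receive most of the boost, while the others are nearly unchanged.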

In the examples above, the speech contribution in each channel of the mixed audio signal is reconstructed with the same energy. When the speech has been transmitted as a side signal (e.g., as a low-quality copy of the speech content of the mixed audio signal), or when the speech is reconstructed using multiple channels (as with an MMSE predictor), the speech-enhancement mixing requires speech rendering information, so that the speech being mixed in has the same distribution over the different channels as the speech component already present in the mixed audio signal to be enhanced.

This rendering information can be provided by a rendering parameter ri for each channel; with three channels, it can be expressed as a rendering vector R of the form:

R = [r1  r2  r3]T    (12)

The speech-enhanced mix is then:

Me = M + R·g·Dr    (13)

When, in the presence of multiple channels, the speech (to be mixed with each channel of the mixed audio signal) is reconstructed using the prediction parameters pi, the previous equation can be written as:

Me = M + R·g·P·M = (I + R·g·P)·M    (14)
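The matrix form of equation (14) can be sketched for one frequency bin of a three-channel mix: the speech estimate is formed from all channels via a row vector P of prediction parameters, then rendered back onto the channels by a column vector R. All the numbers below are illustrative stand-ins.

```python
import numpy as np

M = np.array([[0.5], [1.0], [0.25]])   # 3-channel mix, one bin (column vector)
P = np.array([[0.1, 0.8, 0.1]])        # 1x3 prediction: speech estimate = P @ M
R = np.array([[0.0], [1.0], [0.0]])    # render the speech to the center channel
g = 1.0                                # mixing gain

# Me = M + R*g*(P @ M), which equals (I + R*g*P) @ M:
M_e = M + R * g * (P @ M)
```

Only the center channel is boosted (by the predicted speech level), and the outer-product identity (I + R·g·P)·M gives the same result.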

where I is the identity matrix.

5. Speech Rendering

FIG. 4 is a block diagram of a speech rendering system that implements conventional speech-enhancement mixing of the form:

Me = M + R·g·Dr    (15)

In FIG. 4, the three-channel mixed audio signal to be enhanced is in (or is transformed into) the frequency domain. The frequency components of the left channel are provided to an input of mixing element 52, the frequency components of the center channel are provided to an input of mixing element 53, and the frequency components of the right channel are provided to an input of mixing element 54.

The speech signal to be mixed with the mixed audio signal (to enhance it) may have been transmitted as a side signal (e.g., as a low-quality copy of the speech content of the mixed audio signal), or may be reconstructed from prediction parameters pi transmitted with the mixed audio signal. The speech signal is represented by frequency-domain data (e.g., comprising frequency components generated by transforming a time-domain signal into the frequency domain); these frequency components are provided to an input of mixing element 51, in which they are multiplied by the gain parameter g.

The output of element 51 is provided to rendering subsystem 50. Also provided to rendering subsystem 50 are CLD (channel level difference) parameters, CLD1 and CLD2, which have been transmitted with the mixed audio signal. The CLD parameters (for each segment of the mixed audio signal) describe how the speech signal is mixed into the channels of that segment of the mixed audio content. CLD1 indicates a panning coefficient for one pair of speaker channels (e.g., it defines the panning of the speech between the left and center channels), and CLD2 indicates a panning coefficient for another pair of speaker channels (e.g., it defines the panning of the speech between the center and right channels). Accordingly, rendering subsystem 50 provides (to element 52) data indicative of R·g·Dr for the left channel (the speech content, scaled by the gain parameter and the rendering parameter for the left channel), and this data is summed with the left channel of the mixed audio signal in element 52. Rendering subsystem 50 provides (to element 53) data indicative of R·g·Dr for the center channel (the speech content, scaled by the gain parameter and the rendering parameter for the center channel), and this data is summed with the center channel of the mixed audio signal in element 53. Rendering subsystem 50 provides (to element 54) data indicative of R·g·Dr for the right channel (the speech content, scaled by the gain parameter and the rendering parameter for the right channel), and this data is summed with the right channel of the mixed audio signal in element 54.

分别使用元件52、53和54的输出来驱动左扬声器L、中央扬声器C和右扬声器R。The outputs of elements 52, 53 and 54 are used to drive a left speaker L, a centre speaker C and a right speaker R, respectively.

图5是实现以下形式的常规语音增强混合的语音呈现系统的框图:FIG5 is a block diagram of a speech rendering system that implements a conventional speech enhancement mix of the following form:

Me=M+R·g·P·M=(I+R·g·P)·M (16) Me =M+R·g·P·M=(I+R·g·P)·M (16)

在图5中,要增强的三通道混合音频信号处于(或者被转换成)频域中。左通道的频率分量被设定至混合元件52的输入,中央通道的频率分量被设定至混合元件53的输入,右通道的频率分量被设定至混合元件54的输入。In Figure 5, the three-channel mixed audio signal to be enhanced is in (or converted into) the frequency domain. The frequency components of the left channel are set to the input of the mixing element 52, the frequency components of the center channel are set to the input of the mixing element 53, and the frequency components of the right channel are set to the input of the mixing element 54.

根据与混合音频信号一起被发送的预测参数pi来重构(如所指示的)要与混合音频信号进行混合的语音信号。使用预测参数p1来重构来自混合音频信号的第一(左)通道的语音,使用预测参数p2来重构来自混合音频信号的第二(中央)通道的语音,使用预测参数p3来重构来自混合音频信号的第三(右)通道的语音。语音信号由频域数据表示,这些频率分量被设定至混合元件51的输入,在混合元件51中,将这些频率分量与增益参数g相乘。The speech signal to be mixed with the mixed audio signal is reconstructed (as indicated) based on the prediction parameters p i sent along with the mixed audio signal. The speech from the first (left) channel of the mixed audio signal is reconstructed using the prediction parameters p 1 , the speech from the second (center) channel of the mixed audio signal is reconstructed using the prediction parameters p 2 , and the speech from the third (right) channel of the mixed audio signal is reconstructed using the prediction parameters p 3. The speech signal is represented by frequency domain data, and these frequency components are set to the input of the mixing element 51, in which these frequency components are multiplied by the gain parameter g.
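As an illustration of the parametric reconstruction just described, the sketch below forms each value of the reconstructed speech as a linear combination of the three mix channels weighted by the prediction parameters p1, p2 and p3. The function name, the tile layout and the numeric values are illustrative assumptions, not part of the patent.

```python
# Illustrative sketch (names and tile layout are assumptions): reconstruct
# speech from a 3-channel mix as a per-tile linear combination weighted by
# the prediction parameters p1, p2, p3.

def reconstruct_speech(mix_tiles, p):
    """mix_tiles: list of (left, center, right) values, one per
    time-frequency tile; p: (p1, p2, p3) prediction parameters for this
    segment. Returns the reconstructed speech value for each tile."""
    p1, p2, p3 = p
    return [p1 * l + p2 * c + p3 * r for (l, c, r) in mix_tiles]

# A mix whose speech sits mostly in the center channel: prediction
# parameters close to (0, 1, 0) recover approximately the center content.
tiles = [(0.1, 0.8, 0.1), (0.0, 0.5, 0.0)]
speech = reconstruct_speech(tiles, (0.0, 1.0, 0.0))
```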

元件51的输出被设定至呈现子系统55。还被设定至呈现子系统的是已经与混合音频信号一起被发送的CLD(通道水平差)参数、CLD1和CLD2。(针对混合音频信号的每个片段的)CLD参数描述了如何将语音信号混合至混合音频信号内容的所述片段的通道。CLD1表示一对扬声器通道的平移系数(例如,其限定语音在左通道与中央通道之间的平移),CLD2表示另一对扬声器通道的平移系数(例如,其限定语音在中央通道与右通道之间的平移)。因此,呈现子系统55设定(至元件52)指示左通道的R·g·P·M的数据(与混合音频内容的左通道进行混合的重构语音内容,由左通道的增益参数和呈现参数进行缩放),并且在元件52中将该数据与混合音频信号的左通道进行求和。呈现子系统55设定(至元件53)指示中央通道的R·g·P·M的数据(与混合音频内容的中央通道进行混合的重构语音内容,由中央通道的增益参数和呈现参数进行缩放),并且在元件53中将该数据与混合音频信号的中央通道进行求和。呈现子系统55设定(至元件54)指示右通道的R·g·P·M的数据(与混合音频内容的右通道进行混合的重构语音内容,由右通道的增益参数和呈现参数进行缩放),并且在元件54中将该数据与混合音频信号的右通道进行求和。The output of element 51 is set to the rendering subsystem 55. Also set to the rendering subsystem are the CLD (channel level difference) parameters, CLD1 and CLD2, which have been sent along with the mixed audio signal. The CLD parameters (for each segment of the mixed audio signal) describe how the speech signal is mixed into the channels of that segment of the mixed audio signal content. CLD1 represents the panning coefficients for one pair of speaker channels (e.g., defining the panning of speech between the left and center channels), and CLD2 represents the panning coefficients for another pair of speaker channels (e.g., defining the panning of speech between the center and right channels). Therefore, the rendering subsystem 55 sets (to element 52) data indicating the R·g·P·M of the left channel (the reconstructed speech content mixed with the left channel of the mixed audio content, scaled by the gain parameters and rendering parameters of the left channel), and this data is summed with the left channel of the mixed audio signal in element 52.
The rendering subsystem 55 sets (to element 53) data indicative of the R·g·P·M of the center channel (the reconstructed speech content mixed with the center channel of the mixed audio content, scaled by the gain parameter and rendering parameters of the center channel), and sums this data with the center channel of the mixed audio signal in element 53. The rendering subsystem 55 sets (to element 54) data indicative of the R·g·P·M of the right channel (the reconstructed speech content mixed with the right channel of the mixed audio content, scaled by the gain parameter and rendering parameters of the right channel), and sums this data with the right channel of the mixed audio signal in element 54.

分别使用元件52、53和54的输出来驱动左扬声器L、中央扬声器C和右扬声器R。The outputs of elements 52, 53 and 54 are used to drive a left speaker L, a centre speaker C and a right speaker R, respectively.

CLD(通道水平差)参数通常与扬声器通道信号一起被发送(例如,以确定应当呈现不同通道的水平之间的比率)。在本发明的一些实施方式中以新颖的方式使用CLD参数(例如,以在语音增强音频节目的扬声器通道之间平移所增强的语音)。CLD (channel level difference) parameters are typically sent along with speaker channel signals (e.g., to determine the ratio between the levels at which different channels should be rendered). In some embodiments of the present invention, CLD parameters are used in novel ways (e.g., to pan the enhanced speech between speaker channels of a speech enhancement audio program).

在典型实施方式中,呈现参数ri是(或者指示)语音的上混合系数,描述语音信号如何被混合至要增强的混合音频信号的通道。可以使用通道水平差参数(CLD)将这些系数有效地发送至语音增强器。一个CLD表示两个扬声器的平移系数。例如,In a typical embodiment, the rendering parameters ri are (or indicate) the upmix coefficients of the speech, describing how the speech signal is mixed into the channels of the mixed audio signal to be enhanced. These coefficients can be sent efficiently to the speech enhancer using channel level difference (CLD) parameters. One CLD represents the panning coefficient for two speakers. For example,

CLD = β2/β1
其中,β1表示在平移期间瞬时的第一扬声器的扬声器馈送的增益,β2表示在平移期间瞬时的第二扬声器的扬声器馈送的增益。当CLD=0时,平移完全针对第一扬声器,而当CLD接近无穷大时,平移完全朝向第二扬声器。使用在dB范围中所限定的CLD,有限数目的量化水平可以足够描述平移。Where β1 represents the gain of the speaker feed of the first speaker at the instant during the panning, and β2 represents the gain of the speaker feed of the second speaker at the instant during the panning. When CLD=0, the panning is entirely directed to the first speaker, while when CLD approaches infinity, the panning is entirely towards the second speaker. Using CLD defined in dB range, a limited number of quantization levels can adequately describe the panning.
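The panning behaviour described above can be illustrated numerically. The sketch below assumes the linear definition cld = β2/β1 together with power-normalized gains (β1² + β2² = 1); both are assumptions consistent with the stated limits (cld = 0 pans fully to the first speaker, cld approaching infinity pans fully to the second), not a definitive transcription of the patent's formula.

```python
import math

# Sketch under the assumption cld = beta2 / beta1 (a linear ratio); gains
# are power-normalized so that beta1**2 + beta2**2 == 1.

def gains_from_cld(cld):
    norm = math.sqrt(1.0 + cld * cld)
    beta1 = 1.0 / norm   # gain of the speaker feed for the first speaker
    beta2 = cld / norm   # gain of the speaker feed for the second speaker
    return beta1, beta2

def cld_db(cld):
    # Expressing the ratio in the dB domain, as the text notes, lets a
    # limited number of quantization levels describe the panning.
    return 20.0 * math.log10(cld)
```

With cld = 0 the first speaker receives all of the signal; a very large cld pans essentially everything to the second speaker.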

使用两个CLD可以限定在三个扬声器之间进行平移。可以如下根据呈现系数来导出CLD:Using two CLDs, we can define panning between the three loudspeakers. The CLDs can be derived from the rendering coefficients as follows:

其中,r̃i是归一化呈现系数。where r̃i are the normalized rendering coefficients.

然后,可以通过以下等式根据CLD重构呈现系数:The presentation coefficients can then be reconstructed from the CLD using the following equation:

如在本文中别处所指出的,波形编码语音增强使用要增强的混合内容信号的语音内容的低品质复本。低品质复本通常以低比特率被编码并且作为侧信号与混合内容信号一起被发送,因此,低品质复本通常包括显著的编码伪声。因此,在具有低SNR(即,由混合内容信号所指示的语音与所有其他声音之间的低比率)的情况下,波形编码语音增强提供良好的语音增强性能,而在具有高SNR的情况下通常提供差的性能(即,导致不期望的听得见的编码伪声)。As noted elsewhere herein, waveform coded speech enhancement uses a low-quality copy of the speech content of the mixed content signal to be enhanced. The low-quality copy is typically encoded at a low bit rate and transmitted as a side signal along with the mixed content signal, and therefore, typically includes significant coding artifacts. Consequently, waveform coded speech enhancement provides good speech enhancement performance in situations with a low SNR (i.e., a low ratio between the speech indicated by the mixed content signal and all other sounds), but typically provides poor performance (i.e., results in undesirable audible coding artifacts) in situations with a high SNR.

相反地,当挑选出(要增强的混合内容信号的)语音内容(例如,其被设置为多通道混合内容信号中的仅中央通道的内容)或者混合内容信号以其他方式具有高SNR时,参数编码语音增强提供良好的语音增强性能。In contrast, parametric coding speech enhancement provides good speech enhancement performance when the speech content (of the mixed content signal to be enhanced) is singled out (e.g., it is set to be the content of only the center channel in a multi-channel mixed content signal) or the mixed content signal otherwise has a high SNR.

因此,波形编码语音增强和参数编码语音增强具有互补的性能。基于要增强其语音内容的信号的特性,本发明的一类实施方式将两种方法进行混和以发挥各自的优势。Waveform-coded speech enhancement and parametric-coded speech enhancement thus have complementary performance. Based on the characteristics of the signal whose speech content is to be enhanced, one class of embodiments of the present invention blends the two methods to exploit their respective strengths.

图6是该类实施方式中的被配置成执行混合语音增强的语音呈现系统的框图。在一种实现中,图3的解码器40的子系统43实现图6系统(除了图6中所示的三个扬声器以外)。混合(hybrid)语音增强(混合(mixing))可以由下式来描述FIG6 is a block diagram of a speech rendering system configured to perform hybrid speech enhancement in one embodiment of the present invention. In one implementation, subsystem 43 of decoder 40 of FIG3 implements the system of FIG6 (except for the three speakers shown in FIG6). Hybrid speech enhancement (mixing) can be described by the following equation:

Me=R·g1·Dr+(I+R·g2·P)·M (23) Me =R·g 1 ·D r +(I+R·g 2 ·P)·M (23)

其中,R·g1·Dr是由常规的图4系统所实现的类型的波形编码语音增强,R·g2·P·M是由常规的图5系统所实现的类型的参数编码语音增强,参数g1和g2控制整体增强增益以及两种语音增强方法之间的平衡(trade-off)。参数g1和g2的定义的示例是:Where R· g1 · Dr is waveform coded speech enhancement of the type implemented by the conventional system of FIG4, R· g2 ·P·M is parametric coded speech enhancement of the type implemented by the conventional system of FIG5, and the parameters g1 and g2 control the overall enhancement gain and the trade-off between the two speech enhancement methods. An example of the definition of the parameters g1 and g2 is:

g1=αc·(10G/20-1) (24)g 1 = α c ·(10 G/20 -1) (24)

g2=(1-αc)·(10G/20-1) (25)g 2 =(1-α c )·(10 G/20 -1) (25)

其中,参数αc限定波形编码语音增强方法与参数编码语音增强方法之间的平衡。当值αc=1时,仅语音的低品质复本用于波形编码语音增强。当αc=0时,参数编码增强模式对增强作出全部贡献。0到1之间的αc值对两种方法进行混和。在一些实现中,αc是宽带参数(应用于音频数据的所有频带)。可以在各个频带内应用相同的原理,使得使用每个频带的参数αc的不同值以频率相关方式对混和进行优化。The parameter αc defines the balance between the waveform-coded speech enhancement method and the parametric-coded speech enhancement method. When the value αc = 1, only the low-quality copy of the speech is used, i.e. waveform-coded speech enhancement alone. When αc = 0, the parametric coding enhancement mode makes the full contribution to the enhancement. Values of αc between 0 and 1 blend the two methods. In some implementations, αc is a broadband parameter (applied to all frequency bands of the audio data). The same principle can be applied within individual frequency bands, so that the blend is optimized in a frequency-dependent manner using a different value of the parameter αc for each band.
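Equations (24) and (25) can be transcribed directly; only the function name and the example per-band loop below are illustrative.

```python
# Direct transcription of equations (24) and (25).

def enhancement_gains(alpha_c, gain_db):
    """alpha_c in [0, 1] balances waveform-coded (g1) against
    parametric-coded (g2) enhancement; gain_db is the overall boost G."""
    lin = 10.0 ** (gain_db / 20.0) - 1.0
    g1 = alpha_c * lin           # scales the low-quality speech copy Dr
    g2 = (1.0 - alpha_c) * lin   # scales the parametric reconstruction P*M
    return g1, g2

# alpha_c = 1: pure waveform-coded enhancement (g2 vanishes).
# alpha_c = 0: pure parametric-coded enhancement (g1 vanishes).
# A frequency-dependent blend simply evaluates this once per band:
per_band = [enhancement_gains(a, 6.0) for a in (1.0, 0.5, 0.0)]
```

Note that g1 + g2 is constant for a given G, so the overall enhancement gain is preserved whatever the blend.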

在图6中,要增强的三通道混合音频信号处于(或者被转换成)频域中。左通道的频率分量被设定至混合元件65的输入,中央通道的频率分量被设定至混合元件66的输入,右通道的频率分量被设定至混合元件67的输入。In Figure 6, the three-channel mixed audio signal to be enhanced is in (or converted into) the frequency domain. The frequency components of the left channel are set to the input of the mixing element 65, the frequency components of the center channel are set to the input of the mixing element 66, and the frequency components of the right channel are set to the input of the mixing element 67.

要与混合音频信号进行混合(以增强混合音频信号)的语音信号包括:已经根据与混合音频信号(例如,作为侧信号)一起(根据波形编码语音增强)被传输的波形数据而生成的混合音频信号的语音内容的低品质复本(在图6中标识为“语音”),以及根据混合音频信号和与混合音频信号一起(根据参数编码语音增强)被传输的预测参数pi所重构的重构语音信号(其从图6的参数编码语音重构元件68输出)。语音信号由频域数据(例如,其包括通过将时域信号转换成频域所生成的频率分量)表示。低品质语音复本的频率分量被设定至混合元件61的输入,在混合元件61中,将低品质语音复本的频率分量乘以增益参数g1。参数重构语音信号的频率分量从元件68的输出被设定至混合元件62的输入,在混合元件62中,将参数重构语音信号的频率分量乘以增益参数g2。在替选实施方式中,在时域中而不是在如图6实施方式中的频域中执行要实现语音增强所执行的混合。The speech signal to be mixed with the mixed audio signal (to enhance the mixed audio signal) includes: a low-quality replica of the speech content of the mixed audio signal (labeled "Speech" in FIG. 6) generated based on waveform data transmitted along with the mixed audio signal (e.g., as a side signal) (according to waveform-coded speech enhancement), and a reconstructed speech signal (output from parametrically coded speech reconstruction element 68 of FIG. 6) reconstructed based on the mixed audio signal and prediction parameters pi transmitted along with the mixed audio signal (according to parametrically coded speech enhancement). The speech signal is represented by frequency domain data (e.g., comprising frequency components generated by converting a time domain signal into the frequency domain). The frequency components of the low-quality speech replica are set to the input of mixing element 61, where they are multiplied by the gain parameter g1, consistent with equation (23). The frequency components of the parametrically reconstructed speech signal are set from the output of element 68 to the input of mixing element 62, where they are multiplied by the gain parameter g2. In an alternative embodiment, the mixing performed to achieve speech enhancement is performed in the time domain rather than in the frequency domain as in the embodiment of FIG. 6.

求和元件63对元件61和元件62的输出进行求和以生成要与混合音频信号进行混合的语音信号,并且该语音信号从元件63的输出被设定至呈现子系统64。还被设定至呈现子系统64的是已经与混合音频信号一起被发送的CLD(通道水平差)参数、CLD1和CLD2。(针对混合音频信号的每个片段的)CLD参数描述了如何将语音信号混合至混合音频信号内容的所述片段的通道。CLD1表示一对扬声器通道的平移系数(例如,其限定语音在左通道与中央通道之间的平移),CLD2表示另一对扬声器通道的平移系数(例如,其限定语音在中央通道与右通道之间的平移)。因此,呈现子系统64设定(至元件52)指示左通道的R·g1·Dr+(R·g2·P)·M的数据(与混合音频内容的左通道进行混合的增强语音内容,由左通道的增益参数和呈现参数进行缩放),并且在元件52中将该数据与混合音频信号的左通道进行求和。呈现子系统64设定(至元件53)指示中央通道的R·g1·Dr+(R·g2·P)·M的数据(与混合音频内容的中央通道进行混合的增强语音内容,由中央通道的增益参数和呈现参数进行缩放),并且在元件53中将该数据与混合音频信号的中央通道进行求和。呈现子系统64设定(至元件54)指示右通道的R·g1·Dr+(R·g2·P)·M的数据(与混合音频内容的右通道进行混合的增强语音内容,由右通道的增益参数和呈现参数进行缩放),并且在元件54中将该数据与混合音频信号的右通道进行求和。Summing element 63 sums the outputs of elements 61 and 62 to generate the speech signal to be mixed with the mixed audio signal, and this speech signal is set from the output of element 63 to rendering subsystem 64. Also set to rendering subsystem 64 are the CLD (channel level difference) parameters, CLD1 and CLD2, which have been sent along with the mixed audio signal. The CLD parameters (for each segment of the mixed audio signal) describe how the speech signal is mixed into the channels of that segment of the mixed audio signal content. CLD1 represents the panning coefficient for one pair of speaker channels (e.g., it defines the panning of speech between the left and center channels), and CLD2 represents the panning coefficient for another pair of speaker channels (e.g., it defines the panning of speech between the center and right channels). Thus, the rendering subsystem 64 sets (to element 52) data indicative of R·g1·Dr+(R·g2·P)·M for the left channel (the enhancement speech content mixed with the left channel of the mixed audio content, scaled by the gain parameter and rendering parameters of the left channel), and sums this data with the left channel of the mixed audio signal in element 52. The rendering subsystem 64 sets (to element 53) data indicative of R·g1·Dr+(R·g2·P)·M for the center channel (the enhancement speech content mixed with the center channel of the mixed audio content, scaled by the gain parameter and rendering parameters of the center channel), and sums this data with the center channel of the mixed audio signal in element 53. The rendering subsystem 64 sets (to element 54) data indicative of R·g1·Dr+(R·g2·P)·M for the right channel (the enhancement speech content mixed with the right channel of the mixed audio content, scaled by the gain parameter and rendering parameters of the right channel), and sums this data with the right channel of the mixed audio signal in element 54.
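A scalar sketch of the hybrid mix of equation (23), Me = R·g1·Dr + (I + R·g2·P)·M, for a single time-frequency tile of a three-channel mix. All names and numbers are illustrative, and the per-channel rendering matrix R is reduced to one coefficient per channel for brevity.

```python
# Scalar sketch of equation (23) for one time-frequency tile.
# Variable names are illustrative, not taken from the patent.

def hybrid_enhance(mix, speech_copy, p, r, g1, g2):
    """mix: per-channel tile values M; speech_copy: low-quality copy Dr;
    p: prediction parameters (one per channel); r: rendering/panning
    coefficients (one per channel); g1, g2: blend gains."""
    # Parametric reconstruction of the speech from the mix: P applied to M.
    reconstructed = sum(pi * mi for pi, mi in zip(p, mix))
    # Blend the waveform copy and the reconstruction, pan via r, add to M.
    boost = g1 * speech_copy + g2 * reconstructed
    return [mi + ri * boost for mi, ri in zip(mix, r)]

mix = [0.2, 1.0, 0.2]  # L, C, R tile values
out = hybrid_enhance(mix, speech_copy=0.9, p=(0.0, 1.0, 0.0),
                     r=(0.0, 1.0, 0.0), g1=0.5, g2=0.5)
```

With speech predicted from, and rendered back to, the center channel only, the side channels pass through unchanged while the center channel is boosted by the blended speech estimate.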

分别使用元件52、53和54的输出来驱动左扬声器L、中央扬声器C和右扬声器R。The outputs of elements 52, 53 and 54 are used to drive a left speaker L, a centre speaker C and a right speaker R, respectively.

当参数αc被约束成具有值αc=0或者值αc=1时,图6系统可以实现基于时间SNR的切换。在以下的强的比特率约束情况下这样的实现尤其有用:低品质语音复本数据可以被发送或者参数数据可以被发送,但是低品质语音复本数据和参数数据两者不能一起被发送。例如,在一种这样的实现中,仅在αc=1的片段中将低品质语音复本与混合音频信号(例如,作为侧信号)一起发送,并且仅在αc=0的片段中将预测参数pi与混合音频信号(例如,作为侧信号)一起发送。The system of FIG6 can implement temporal SNR-based switching when the parameter αc is constrained to have a value of αc =0 or a value of αc =1. Such an implementation is particularly useful in the following strong bit rate constraint situation: either the low-quality speech replica data can be sent or the parameter data can be sent, but the low-quality speech replica data and the parameter data cannot be sent together. For example, in one such implementation, the low-quality speech replica is sent with the mixed audio signal (e.g., as a side signal) only in the segments where αc =1, and the prediction parameters p i are sent with the mixed audio signal (e.g., as a side signal) only in the segments where αc =0.

切换(由图6的该实现中的元件61和62所实现)基于片段中的语音内容与所有其他音频内容之间的比率(SNR)(该比率又确定αc的值)来确定要对每个片段执行波形编码增强还是参数编码增强。这样的实现可以使用SNR的阈值来决定要选择哪种方法:The switch (implemented by elements 61 and 62 in this implementation of FIG. 6) determines whether waveform-coded or parametric-coded enhancement is to be performed on each segment based on the ratio (SNR) between the speech content and all other audio content in the segment (which in turn determines the value of αc). Such an implementation may use a threshold value for the SNR to decide which method to select:

αc = 1 (SNR ≤ τ); αc = 0 (SNR > τ)

其中,τ是阈值(例如,τ可以等于0)。where τ is a threshold value (e.g., τ may be equal to 0).

当SNR在数个帧的时间内处于阈值附近时,图6的一些实现使用滞后作用来阻止在波形编码增强模式与参数编码增强模式之间快速交替切换。When the SNR hovers around the threshold for several frames, some implementations of FIG. 6 use hysteresis to prevent rapid alternation between the waveform coding enhancement mode and the parametric coding enhancement mode.
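The switching-with-hysteresis behaviour described above might be sketched as follows; the frame-counting scheme and the `hold` length are assumptions, not taken from the patent.

```python
# Sketch of hard switching with hysteresis (all names/values illustrative):
# alpha_c is 1 (waveform-coded) at low SNR and 0 (parametric-coded) at high
# SNR, and only flips after the SNR has stayed across the threshold tau for
# `hold` consecutive frames.

def switch_alpha(snr_frames, tau=0.0, hold=3):
    alpha = 1 if snr_frames and snr_frames[0] <= tau else 0
    run = 0
    out = []
    for snr in snr_frames:
        want = 1 if snr <= tau else 0
        run = run + 1 if want != alpha else 0
        if run >= hold:          # sustained crossing: commit the switch
            alpha, run = want, 0
        out.append(alpha)
    return out
```

A single frame poking above the threshold (frame 3 below) does not flip the mode; only a sustained run does.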

当使得参数αc能够具有0到1范围内的任意实值(0和1也包括在内)时,图6系统可以实现基于时间SNR的混和。When the parameter α c is allowed to have any real value in the range of 0 to 1 (including 0 and 1), the system of Figure 6 can achieve time-SNR-based mixing.

图6系统的一种实现使用(要增强的混合音频信号的片段的SNR的)两个目标值τ1和τ2,超过这两个目标值,一种方法(波形编码增强或者参数编码增强)总是被视为提供最佳性能。在这些目标之间,使用插值来确定片段的参数αc的值。例如,可以使用线性插值来确定片段的参数αc的值:One implementation of the system of FIG6 uses two target values τ 1 and τ 2 (of the SNR of a segment of the mixed audio signal to be enhanced), above which one method (either waveform coding enhancement or parametric coding enhancement) is always considered to provide the best performance. Between these targets, interpolation is used to determine the value of the parameter α c for the segment. For example, linear interpolation can be used to determine the value of the parameter α c for the segment:

替选地,可以使用其他适当的插值方案。当SNR不可用时,在许多实现中可以使用预测参数来提供SNR的近似值。Alternatively, other suitable interpolation schemes may be used.When the SNR is not available, in many implementations the prediction parameters may be used to provide an approximation of the SNR.
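One plausible realization of the interpolation between the two targets is sketched below; the orientation of τ1 and τ2, the linear form, and the clipping are assumptions rather than a transcription of the patent's formula.

```python
# Sketch of SNR-guided blending (assumed form): full waveform-coded
# enhancement (alpha_c = 1) at or below tau2, full parametric-coded
# enhancement (alpha_c = 0) at or above tau1, linear in between.

def blend_alpha(snr, tau1=6.0, tau2=-6.0):
    if snr >= tau1:
        return 0.0
    if snr <= tau2:
        return 1.0
    return (tau1 - snr) / (tau1 - tau2)
```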

在另一类实施方式中,通过听觉掩蔽模型确定要对音频信号的每个片段执行的波形编码增强和参数编码增强的组合。在该类的典型实施方式中,要对音频节目的片段执行的波形编码增强和参数编码增强的混和的最佳混和比率使用刚好防止编码噪声变得听见的最高波形编码增强量。在本文中,参照图7来描述使用听觉掩蔽模型的本发明方法的实施方式的示例。In another class of embodiments, the combination of waveform-coded enhancement and parametric-coded enhancement to be performed on each segment of the audio signal is determined by an auditory masking model. In a typical embodiment of this class, the optimal blend ratio of the waveform-coded enhancement and parametric-coded enhancement to be performed on the segments of the audio program uses the highest amount of waveform-coded enhancement that just prevents the coding noise from becoming audible. An example of an embodiment of the method of the present invention using an auditory masking model is described herein with reference to FIG.

更一般地,下面的考虑涉及以下实施方式:使用听觉掩蔽模型来确定要对音频信号的每个片段执行的波形编码增强和参数编码增强的组合(例如,混和)。在这样的实施方式中,对指示要称为未增强音频混合的语音与背景音频的混合A(t)的数据进行设置并且根据听觉掩蔽模型(例如,由图7的元件11所实现的模型)对其进行处理。模型预测了未增强音频混合的每个片段的掩蔽阈值Θ(f,t)。可以将具有时间索引n和频带索引b的未增强音频混合的每个时间-频率分块的掩蔽阈值表示为Θn,bMore generally, the following considerations relate to the following embodiments: use an auditory masking model to determine the combination (e.g., mixing) of waveform coding enhancement and parametric coding enhancement to be performed on each segment of an audio signal. In such an embodiment, the data indicating the mixture A(t) of the speech and background audio to be referred to as an unenhanced audio mix are set and processed according to an auditory masking model (e.g., a model implemented by element 11 of Fig. 7). The model predicts a masking threshold θ(f, t) for each segment of the unenhanced audio mix. The masking threshold for each time-frequency block of the unenhanced audio mix with time index n and band index b can be represented as θn , b .

掩蔽阈值Θn,b指示:对于帧n和频带b,可以添加多少失真而不会听得见。令εD,n,b为低品质语音复本(要用于波形编码增强)的编码误差(即,量化噪声),并且令εP,n,b为参数预测误差。The masking threshold Θn,b indicates how much distortion can be added without becoming audible for frame n and frequency band b. Let εD,n,b be the coding error (i.e., quantization noise) of the low-quality speech replica (to be used for waveform-coded enhancement), and let εP,n,b be the parametric prediction error.

该类中的一些实施方式实现到由未增强音频混合内容最佳掩蔽的方法(波形编码增强或参数编码增强)的硬切换:Some embodiments in this class implement a hard switch to a method (waveform coded enhancement or parametric coded enhancement) that is best masked by the unenhanced audio mix content:
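One plausible form of such a hard switch is sketched below; comparing the summed above-threshold error energies per frame is an assumption about the rule's exact shape, not a transcription of it.

```python
# Sketch (assumed form): pick the method whose error is better masked,
# i.e. whose total above-threshold ("audible") error energy across the
# bands b is smaller for this frame n.

def masked_switch(eps_d, eps_p, theta):
    """eps_d[b]: waveform coding error; eps_p[b]: parametric prediction
    error; theta[b]: masking threshold. Returns alpha_c (1 or 0)."""
    audible_d = sum(max(d - t, 0.0) for d, t in zip(eps_d, theta))
    audible_p = sum(max(p - t, 0.0) for p, t in zip(eps_p, theta))
    return 1 if audible_d <= audible_p else 0
```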

在许多实际情况中,在生成语音增强参数时,准确的参数预测误差εP,n,b可能不可用,这是因为这些误差可能是在未增强混合被编码之前生成的。特别地,参数编码方案可以对来自混合内容通道的语音的参数重构的误差具有显著影响。In many practical situations, the exact parametric prediction errors εP,n,b may not be available when the speech enhancement parameters are generated, because these may be generated before the unenhanced mix is encoded. In particular, the parametric coding scheme can have a significant impact on the error of the parametric reconstruction of speech from the mixed content channels.

因此,当(要用于波形编码增强的)低品质语音复本中的编码伪声未被混合内容掩蔽时,一些替选实施方式在参数编码语音增强(与波形编码增强)中进行混合:Therefore, some alternative implementations perform a hybrid in parametric coded speech enhancement (with waveform coded enhancement) when the coding artifacts in the low-quality speech replica (to be used for waveform coded enhancement) are not masked by the mixed content:

其中,τa是失真阈值,超出该失真阈值,仅应用参数编码增强。当整体失真大于整体掩蔽可能(potential)时,该解决方案开始波形编码增强和参数编码增强的混和。实际上,这意味着失真已经是听得见的。因此,可以使用具有比0更高的值的第二阈值。替选地,可以使用关注未被掩蔽的时间-频率分块而不是平均行为的准则。where τa is the distortion threshold, beyond which only parametric coding enhancement is applied. This solution starts blending waveform-coded and parametric-coded enhancement when the overall distortion exceeds the overall masking potential. In practice, this means that the distortion is already audible; therefore, a second threshold with a value higher than 0 can be used. Alternatively, a criterion can be used that focuses on the unmasked time-frequency blocks rather than on the average behavior.
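A sketch of a masking-driven blend governed by the distortion threshold τa follows. The linear roll-off from waveform-coded toward parametric-coded enhancement as the total distortion approaches the masking potential is an assumed form, not the patent's exact formula.

```python
# Sketch (assumed form): alpha_c falls from 1 toward 0 as the total
# waveform-coding distortion approaches tau_a times the total masking
# potential; beyond that point only parametric enhancement remains.

def masked_blend_alpha(eps_d, theta, tau_a=1.0):
    """eps_d[b]: waveform coding error per band; theta[b]: masking
    threshold per band; tau_a: distortion threshold."""
    distortion = sum(eps_d)
    potential = tau_a * sum(theta)
    if potential <= 0.0:
        return 0.0
    return max(0.0, min(1.0, 1.0 - distortion / potential))
```

Fully masked coding noise (zero distortion) yields αc = 1, i.e. the highest amount of waveform-coded enhancement, matching the stated design goal.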

类似地,当(要用于波形编码增强的)低品质语音复本中的失真(编码伪声)太高时,可以将该方法与SNR指引的混和规则进行组合。该方法的优点在于:在SNR非常低的情况下,当其产生比低品质语音复本的失真更多听得见的噪声时,不使用参数编码增强模式。Similarly, when the distortion (coding artifacts) in the low-quality speech copy (to be used for waveform coding enhancement) is too high, this method can be combined with an SNR-guided blending rule. The advantage of this method is that in very low SNR situations, when it produces more audible noise than the distortion of the low-quality speech copy, the parametric coding enhancement mode is not used.

在另一种实施方式中,当在每个这样的时间-频率分块中检测到频谱空洞(spectral hole)时,对一些时间-频率分块执行的语音增强的类型偏离由上述示例方案(或类似方案)所确定的语音增强类型。例如通过在参数重构中对相应分块中的能量进行评估可以检测频谱空洞,而在(要用于波形编码增强的)低品质语音复本中能量为0。如果该能量超过阈值,则可以将其视为相关音频。在这些情况下,可以将分块的参数αc设置成0(或者,取决于SNR,分块的参数αc可以朝向0偏置)。In another embodiment, when a spectral hole is detected in each such time-frequency block, the type of speech enhancement performed on some time-frequency blocks deviates from the type of speech enhancement determined by the above-described example scheme (or similar scheme). For example, spectral holes can be detected by evaluating the energy in the corresponding block during parameter reconstruction, while the energy is zero in the low-quality speech copy (to be used for waveform coding enhancement). If this energy exceeds a threshold, it can be considered as relevant audio. In these cases, the parameter α c of the block can be set to 0 (or, depending on the SNR, the parameter α c of the block can be biased towards 0).
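The spectral-hole adjustment described above might look like the following sketch; the energy threshold and all names are illustrative assumptions.

```python
# Sketch: when a tile's low-quality speech copy has zero energy while the
# parametric reconstruction carries energy above a threshold, a spectral
# hole is detected and alpha_c for that tile is forced to 0 (parametric).

def adjust_for_spectral_holes(alpha_c, copy_energy, recon_energy, thr=1e-4):
    out = []
    for a, ec, er in zip(alpha_c, copy_energy, recon_energy):
        if ec == 0.0 and er > thr:   # relevant audio missing from the copy
            out.append(0.0)
        else:
            out.append(a)
    return out
```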

在一些实施方式中,本发明的编码器能够在以下模式中的任意所选之一中操作:In some embodiments, the encoder of the present invention is capable of operating in any selected one of the following modes:

1.独立通道参数——在该模式下,传输包括语音的每个通道的参数集合。使用这些参数,接收编码音频节目的解码器可以对节目执行参数编码语音增强以将这些通道中的语音加强任意量。用于传输参数集合的示例比特率是0.75kbps至2.25kbps。1. Independent Channel Parameters - In this mode, parameter sets are transmitted for each channel, including speech. Using these parameters, a decoder receiving an encoded audio program can perform parametric speech enhancement on the program to increase the speech in those channels by any amount. Example bit rates for transmitting parameter sets are 0.75 kbps to 2.25 kbps.

2.多通道语音预测——在该模式下,以线性组合对混合内容的多个通道进行组合来预测语音信号。传输每个通道的参数集合。使用这些参数,接收编码音频节目的解码器可以对节目执行参数编码语音增强。将附加的位置数据与编码音频节目一起传输以使得能够将所加强的语音呈现回混合。用于传输参数集合和位置数据的示例比特率是每对话1.5kbps至6.75kbps。2. Multi-channel Speech Prediction - In this mode, multiple channels of the mixed content are combined in a linear combination to predict the speech signal. A parameter set is transmitted for each channel. Using these parameters, a decoder receiving the encoded audio program can perform parametric speech enhancement on the program. Additional position data is transmitted along with the encoded audio program to enable the enhanced speech to be rendered back into the mix. Example bit rates for transmitting the parameter sets and position data are 1.5 kbps to 6.75 kbps per dialogue.

3.波形编码语音——在该模式下,通过任何适当的方式将音频节目的语音内容的低品质复本与常规音频内容(例如,作为分立的比特流)单独地并行传输。接收编码音频节目的解码器可以通过将语音内容的分立的低品质复本混合至主混合中来对节目执行波形编码语音增强。通常,以0dB的增益混合语音的低品质复本将使幅度加倍,从而将语音加强6dB。此外,对于该模式,位置数据被传输,使得将语音信号正确地分布在相关通道中。用于传输语音的低品质复本和位置数据的示例比特率大于每对话20kbps。3. Waveform-Coded Speech - In this mode, a low-quality copy of the speech content of an audio program is transmitted separately and in parallel with the regular audio content (e.g., as a separate bitstream) by any suitable means. A decoder receiving the encoded audio program can perform waveform-coded speech enhancement on the program by mixing the separate low-quality copy of the speech content into the main mix. Typically, mixing in the low-quality copy of the speech with a gain of 0 dB doubles the amplitude and thus boosts the speech by 6 dB. In addition, for this mode, position data is transmitted so that the speech signal is correctly distributed among the relevant channels. An example bit rate for transmitting the low-quality copy of the speech and the position data is greater than 20 kbps per dialogue.
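The 6 dB figure follows from doubling the amplitude: mixing in the speech copy at unity (0 dB) gain adds it to the identical speech already present in the mix, and 20·log10(2) ≈ 6.02 dB. A quick check (the function name is illustrative):

```python
import math

# Adding a side copy of the speech at linear gain g on top of the identical
# speech already in the mix scales the speech amplitude by (1 + g).

def boost_db(copy_gain_linear):
    return 20.0 * math.log10(1.0 + copy_gain_linear)

# 0 dB side-copy gain -> amplitude factor 2 -> about 6 dB of enhancement.
print(round(boost_db(1.0), 2))  # → 6.02
```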

4.波形参数混合——在该模式下,将音频节目的语音内容的低品质复本(用于对节目执行波形编码语音增强)和每个包括语音的通道的参数集合(用于对节目执行参数编码语音增强)两者与节目的未增强混合(语音与非语音)音频内容并行传输。当语音的低品质复本的比特率降低时,在该信号中更多编码伪声变得听得见,并且减小了传输所需要的带宽。此外,还传输了下述混和指示符:该混和指示符使用语音的低品质复本和参数集合来确定要对节目的每个片段执行的波形编码语音增强和参数编码语音增强的组合。在接收器处,对节目执行混合语音增强,包括通过:执行由混和指示符所确定的波形编码语音增强和参数编码语音增强的组合,从而生成指示语音增强音频节目的数据。此外,还将位置数据与节目的未增强的混和音频内容一起传输以指示要在哪里呈现语音信号。该方法的优点在于:如果接收器/解码器丢弃语音的低品质复本并且仅应用参数集合来执行参数编码增强,则可以降低所要求的接收器/解码器复杂度。用于传输语音的低品质复本、参数集合、混和指示符和位置数据的示例比特率是每对话8至24kbps。4. Waveform Parametric Hybrid - In this mode, a low-quality copy of the audio program's speech content (used to perform waveform-coded speech enhancement on the program) and a parameter set for each channel containing speech (used to perform parametric-coded speech enhancement on the program) are both transmitted in parallel with the program's unenhanced mixed (speech and non-speech) audio content. As the bit rate of the low-quality copy of the speech is reduced, more coding artifacts become audible in the signal, and the bandwidth required for transmission is reduced. In addition, a hybrid indicator is transmitted that uses the low-quality copy of the speech and the parameter set to determine the combination of waveform-coded speech enhancement and parametric-coded speech enhancement to be performed on each segment of the program. At the receiver, hybrid speech enhancement is performed on the program, including by performing the combination of waveform-coded speech enhancement and parametric-coded speech enhancement determined by the hybrid indicator, thereby generating data indicating the speech-enhanced audio program. In addition, position data is transmitted along with the program's unenhanced mixed audio content to indicate where the speech signal is to be presented. An advantage of this approach is that the required receiver/decoder complexity can be reduced if the receiver/decoder discards the low-quality copy of the speech and only applies the parameter sets to perform parametric coding enhancement. 
An example bit rate for transmitting the low-quality copy of the speech, the parameter sets, the mixing indicator, and the position data is 8 to 24 kbps per conversation.

出于实践原因,可以将语音增强增益限制成0至12dB范围。可以将编码器实现成:能够进一步借助于比特流字段来进一步减小该范围的上限。在一些实施方式中,(从编码器输出的)编码节目的语法将支持(除了节目的非语音内容以外的)多个同时的可增强对话,使得可以分立地重构和呈现每个对话。在这些实施方式中,在后面的模式下,将用于(来自不同空间位置处的多个源的)同时对话的语音增强呈现在单个位置处。For practical reasons, the speech enhancement gain can be limited to a range of 0 to 12 dB. The encoder can be implemented to be able to further reduce the upper limit of this range by means of bitstream fields. In some embodiments, the syntax of the encoded program (output from the encoder) will support multiple simultaneous enhanceable conversations (in addition to the non-speech content of the program) so that each conversation can be reconstructed and presented separately. In these embodiments, in the latter mode, speech enhancement for simultaneous conversations (from multiple sources at different spatial locations) will be presented at a single location.

在编码音频节目是基于对象的音频节目的一些实施方式中,可以选择(最大总数中的)一个或更多个对象簇来进行语音增强。可以将CLD值对包括在编码节目中以供语音增强和呈现系统使用,以在对象簇之间平移所增强的语音。类似地,在编码音频节目包括常规5.1格式的扬声器通道的一些实施方式中,可以选择前扬声器通道中的一个或更多个以进行语音增强。In some embodiments where the encoded audio program is an object-based audio program, one or more object clusters (out of a maximum total) can be selected for speech enhancement. CLD value pairs can be included in the encoded program for use by the speech enhancement and rendering system to shift the enhanced speech between the object clusters. Similarly, in some embodiments where the encoded audio program includes speaker channels in a conventional 5.1 format, one or more of the front speaker channels can be selected for speech enhancement.

本发明的另一个方面是用于对已经根据本发明的编码方法的实施方式生成的编码音频信号进行解码并执行混合语音增强的方法(例如,由图3的解码器40所执行的方法)。Another aspect of the present invention is a method for decoding an encoded audio signal that has been generated according to an embodiment of the encoding method of the present invention and performing hybrid speech enhancement (eg, the method performed by the decoder 40 of FIG. 3 ).

可以以硬件、固件或软件或者两者的组合(例如,作为可编程逻辑阵列)来实现本发明。除非另有说明,否则作为本发明的一部分所包括的算法或处理并不固有地与任何特定计算机或其他设备相关。具体地,可以与根据本文中的教示所编写的程序一起使用各种通用机器,或者更便利的是,可以构造执行所要求的方法步骤的更专用的设备(例如,集成电路)。因此,可以以在一个或更多个可编程计算机系统(例如,实现图3的编码器20或图7的编码器或图3的解码器40的计算机系统)上执行的一个或更多个计算机程序来实现本发明,每个可编程计算机系统包括至少一个处理器、至少一个数据存储系统(包括易失性和非易失性存储器和/或存储元件)、至少一个输入装置或端口、以及至少一个输出装置或端口。对输入数据应用程序代码以执行本文中所描述的功能并且生成输出信息。以已知的方式对一个或更多个输出装置应用输出信息。The present invention can be implemented in hardware, firmware or software or a combination of the two (e.g., as a programmable logic array). Unless otherwise stated, the algorithm or processing included as part of the present invention is not inherently related to any particular computer or other device. Specifically, various general-purpose machines can be used together with the program written according to the teachings herein, or more conveniently, a more specialized device (e.g., an integrated circuit) that performs the required method steps can be constructed. Therefore, the present invention can be implemented with one or more computer programs executed on one or more programmable computer systems (e.g., a computer system that implements the encoder 20 of Figure 3 or the encoder of Figure 7 or the decoder 40 of Figure 3), each programmable computer system including at least one processor, at least one data storage system (including volatile and non-volatile memory and/or storage element), at least one input device or port, and at least one output device or port. Application code is applied to the input data to perform the functions described herein and generate output information. Output information is applied to one or more output devices in a known manner.

可以以与计算机系统进行通信的任何期望的计算机语言(包括机器语言、汇编语言、或者高级过程语言、逻辑语言、或者面向对象的编程语言)来实现每个这样的程序。在任何情况下,语言可以是编译语言或解释语言。Each such program can be implemented in any desired computer language for communicating with a computer system, including machine language, assembly language, or high-level procedural language, logical language, or object-oriented programming language. In any case, the language can be a compiled language or an interpreted language.

例如,当由计算机软件指令序列实现时,可以通过在适当的数字信号处理硬件中运行的多线程软件指令序列来实现本发明的实施方式的各种功能和步骤,在这种情况下,实施方式的各种装置、步骤和功能可以对应于软件指令的一部分。For example, when implemented by a sequence of computer software instructions, the various functions and steps of the embodiments of the present invention may be implemented by a multi-threaded software instruction sequence running in appropriate digital signal processing hardware. In this case, the various devices, steps and functions of the embodiments may correspond to a portion of the software instructions.

优选地,每个这样的计算机程序被存储在能够由通用或专用可编程计算机读取的存储介质或装置(例如,固态存储器或介质,或者磁介质或光介质)上或者被下载至能够由通用或专用可编程计算机读取的存储介质或装置(例如,固态存储器或介质,或者磁介质或光介质),以当执行本文中所描述的过程的计算机系统读取存储介质或装置时对计算机进行配置和操作。还可以将本发明系统实现为配置有(即,存储)计算机程序的计算机可读存储介质,其中,如此配置的存储介质使计算机系统以特定且预定义方式进行操作以执行本文中所描述的功能。Preferably, each such computer program is stored on or downloaded to a storage medium or device (e.g., solid-state memory or media, or magnetic or optical media) that can be read by a general or special purpose programmable computer to configure and operate the computer when the storage medium or device is read by a computer system that performs the processes described herein. The present invention system can also be implemented as a computer-readable storage medium configured with (i.e., storing) a computer program, wherein the storage medium so configured causes the computer system to operate in a specific and predefined manner to perform the functions described herein.

已经描述了本发明的许多实施方式。然而,应当理解,在不偏离本发明的精神和范围的情况下,可以作出各种修改。鉴于上面的教示,本发明的大量修改和变更是可以的。应当理解,在所附权利要求的范围内,可以以与如本文中具体描述的方式不同的方式来实践本发明。A number of embodiments of the present invention have been described. However, it will be appreciated that various modifications may be made without departing from the spirit and scope of the present invention. In light of the above teachings, numerous modifications and variations of the present invention are possible. It will be appreciated that, within the scope of the appended claims, the present invention may be practiced in other ways than as specifically described herein.

6.中间/侧表示6. Middle/Side Representation

音频解码器可以至少部分地基于M/S表示中的控制数据、控制参数等来执行如本文中所描述的语音增强操作。上游音频编码器可以生成M/S表示中的控制数据、控制参数等,并且音频解码器从由上游音频编码器所生成的编码音频信号中提取M/S表示中的控制数据、控制参数等。The audio decoder may perform speech enhancement operations as described herein based at least in part on the control data, control parameters, etc. in the M/S representation. The upstream audio encoder may generate the control data, control parameters, etc. in the M/S representation, and the audio decoder extracts the control data, control parameters, etc. in the M/S representation from the encoded audio signal generated by the upstream audio encoder.

在根据混合内容预测语音内容(例如,一个或更多个对话等)的参数编码增强模式中,如以下表达式中所示,可以使用单个矩阵H一般地表示语音增强操作:In a parametric coding enhancement mode where speech content (e.g., one or more dialogues, etc.) is predicted from mixed content, the speech enhancement operation can be generally represented using a single matrix H as shown in the following expression:
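The expression referenced here acts on the two-channel mixed content; a sketch of the general form of expression (30), writing the mixed content signals in channels c1 and c2 as Mc1 and Mc2 (the notation used later for the non-M/S channels) and the speech-enhanced outputs as Me,c1 and Me,c2 (assumed names):

```latex
\begin{bmatrix} M_{e,c_1} \\ M_{e,c_2} \end{bmatrix}
= \mathbf{H}
\begin{bmatrix} M_{c_1} \\ M_{c_2} \end{bmatrix}
\tag{30}
```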

其中,左手侧(LHS)表示通过如由矩阵H所表示的语音增强操作对右手侧(RHS)的原始混合内容信号进行操作而生成的语音增强混合内容信号。Therein, the left hand side (LHS) represents a speech enhanced mixed content signal generated by operating the original mixed content signal on the right hand side (RHS) by the speech enhancement operation as represented by the matrix H.

出于说明的目的,语音增强混合内容信号(例如,表达式(30)的LHS等)和原始混合内容信号(例如,由表达式(30)中的H所操作的原始混合内容信号等)中的每个包括分别在两个通道c1和c2中具有语音增强混合内容和原始混合内容的两个分量信号。两个通道c1和c2可以是基于非M/S表示的非M/S音频通道(例如,左前通道、右前通道等)。应当注意,在各种实施方式中,语音增强混合内容信号和原始混合内容信号中的每个还可以包括在除了两个非M/S通道c1和c2以外的通道(例如,环绕通道、低频效果通道等)中具有非语音内容的分量信号。还应当注意,在各种实施方式中,语音增强混合内容信号和原始混合内容信号中的每个可能包括在一个通道、如表达式(30)中所示的两个通道、或者多于两个通道中具有语音内容的分量信号。如本文中所描述的语音内容可以包括一个对话、两个对话或更多个对话。For purposes of illustration, each of the speech enhancement mixed content signal (e.g., the LHS of Expression (30), etc.) and the original mixed content signal (e.g., the original mixed content signal operated by H in Expression (30), etc.) includes two component signals having speech enhancement mixed content and original mixed content in two channels c1 and c2 , respectively. The two channels c1 and c2 may be non-M/S audio channels (e.g., left front channel, right front channel, etc.) based on non-M/S representation. It should be noted that, in various embodiments, each of the speech enhancement mixed content signal and the original mixed content signal may also include component signals having non-speech content in channels other than the two non-M/S channels c1 and c2 (e.g., surround channels, low frequency effects channels, etc.). It should also be noted that, in various embodiments, each of the speech enhancement mixed content signal and the original mixed content signal may include component signals having speech content in one channel, two channels as shown in Expression (30), or more than two channels. The speech content as described herein may include one dialogue, two dialogues, or more dialogues.

在一些实施方式中,如由表达式(30)中的H所表示的语音增强操作可以用于(例如,如由SNR引导混和规则等所指引)混合内容中的语音内容与其他(例如,非语音等)内容之间的SNR值相对高的混合内容的时间片(片段)。In some embodiments, a speech enhancement operation as represented by H in expression (30) can be used for time slices (segments) of mixed content in which the SNR value between the speech content and other (e.g., non-speech, etc.) content is relatively high (e.g., as guided by an SNR-guided mixing rule, etc.).

如以下表达式所示,可以将矩阵H重写/扩展为表示M/S表示中的增强操作的矩阵HMS在右边乘以从非M/S表示到M/S表示的正向转换矩阵并且在左边乘以该正向转换矩阵的逆(其包括因子1/2)的乘积:As shown in the following expression, the matrix H can be rewritten/expanded to represent the augmentation operation in the M/S representation as a matrix HMS multiplied on the right by the forward conversion matrix from non-M/S representation to M/S representation and on the left by the inverse of the forward conversion matrix (which includes a factor of 1/2):
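With the forward transform taking the sum and difference of the two channels (as described next) and its inverse carrying the factor 1/2, expression (31) plausibly reads:

```latex
\mathbf{H}
= \frac{1}{2}
\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}
\mathbf{H}_{MS}
\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}
\tag{31}
```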

其中,矩阵HMS右边的示例转换矩阵基于正向转换矩阵将M/S表示中的中间通道混合内容信号限定为两个通道c1和c2中的两个混合内容信号之和,并且将M/S表示中的侧通道混合内容信号限定为两个通道c1和c2中的两个混合内容信号之差。应当注意,在各种实施方式中,还可以使用除了表达式(31)中所示的示例转换矩阵以外的其他转换矩阵(例如,向不同的非M/S通道分配不同的权重等),以将混合内容信号从一种表示转换为不同的表示。例如,考虑这样的对话增强情形:对话并非在幻象中心呈现,而是以不相等的权重λ1和λ2在两个信号之间平移。如以下表达式所示,可以将M/S转换矩阵修改成使侧信号中对话分量的能量最小:Wherein, the example transformation matrix to the right of the matrix HMS defines the mid-channel mixed content signal in the M/S representation as the sum of the two mixed content signals in the two channels c1 and c2 based on the forward transformation matrix, and defines the side channel mixed content signal in the M/S representation as the difference between the two mixed content signals in the two channels c1 and c2. It should be noted that in various embodiments, other transformation matrices other than the example transformation matrix shown in expression (31) can also be used (e.g., assigning different weights to different non-M/S channels, etc.) to transform the mixed content signal from one representation to a different representation. For example, consider dialogue enhancement where the dialogue is not rendered at the phantom center but is instead panned between the two signals with unequal weights λ1 and λ2. As shown in the following expression, the M/S transformation matrix can be modified to minimize the energy of the dialogue component in the side signal:
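A sketch of the modified transform: if the dialogue is panned into channels c1 and c2 with weights λ1 and λ2, taking the side signal as λ2 times the first channel minus λ1 times the second cancels the dialogue component in the side signal; this explicit form (and the tag (32)) is an assumption consistent with the stated goal:

```latex
\begin{bmatrix} m \\ s \end{bmatrix}
=
\begin{bmatrix} \lambda_1 & \lambda_2 \\ \lambda_2 & -\lambda_1 \end{bmatrix}
\begin{bmatrix} M_{c_1} \\ M_{c_2} \end{bmatrix}
\tag{32}
```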

在示例实施方式中,如以下表达式所示,可以将代表M/S表示中的增强操作的矩阵HMS定义为对角化(例如,厄米特矩阵等)矩阵:In an example embodiment, the matrix HMS representing the augmentation operation in the M/S representation may be defined as a diagonalized (eg, Hermitian matrix, etc.) matrix as shown in the following expression:
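A sketch of expression (33), with each diagonal entry passing the mixed content through and adding g times the speech predicted from that channel (the form 1 + g·p is an assumption consistent with the surrounding description):

```latex
\mathbf{H}_{MS} =
\begin{bmatrix} 1 + g\,p_1 & 0 \\ 0 & 1 + g\,p_2 \end{bmatrix}
\tag{33}
```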

其中,p1和p2分别表示中间通道和侧通道预测参数。预测参数p1和p2中的每一个可以包括针对M/S表示中的相应混合内容信号的时间-频率分块的时变预测参数集合,以被用于根据混合内容信号重构语音内容。例如,如表达式(10)所示,增益参数g对应于语音增强增益G。Where p1 and p2 represent the mid-channel and side-channel prediction parameters, respectively. Each of the prediction parameters p1 and p2 can include a set of time-varying prediction parameters for the time-frequency partition of the corresponding mixed content signal in the M/S representation, to be used to reconstruct the speech content from the mixed content signal. For example, as shown in Expression (10), the gain parameter g corresponds to the speech enhancement gain G.

在一些实施方式中,在参数通道独立增强模式下执行M/S表示中的语音增强操作。在一些实施方式中,使用中间通道信号和侧通道信号两者中的预测语音内容或者使用仅中间通道信号中的预测语音内容来执行M/S表示中的语音增强操作。出于说明的目的,如以下表达式所示,使用仅中间通道中的混合内容信号来执行M/S表示中的语音增强操作:In some embodiments, the speech enhancement operation in the M/S representation is performed in a parametric channel-independent enhancement mode. In some embodiments, the speech enhancement operation in the M/S representation is performed using the predicted speech content in both the mid-channel signal and the side-channel signal, or using the predicted speech content in only the mid-channel signal. For illustrative purposes, the speech enhancement operation in the M/S representation is performed using the mixed content signal in only the mid-channel, as shown in the following expression:
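Under the same assumed form as expression (33), restricting the enhancement to the mid channel (the side channel passed through unchanged) gives a sketch of the expression referenced here; the tag (34) is inferred from the surrounding numbering:

```latex
\mathbf{H}_{MS} =
\begin{bmatrix} 1 + g\,p_1 & 0 \\ 0 & 1 \end{bmatrix}
\tag{34}
```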

其中,预测参数p1包括针对M/S表示的中间通道中的混合内容信号的时间-频率分块的单个预测参数集合,以被用于根据仅中间通道中的混合内容信号重构语音内容。The prediction parameter p1 comprises a single set of prediction parameters for the time-frequency partition of the mixed content signal in the middle channel of the M/S representation, so as to be used to reconstruct the speech content from the mixed content signal in the middle channel only.

基于表达式(33)中所给出的对角化矩阵HMS,还可以将如由表达式(31)所表示的参数增强模式下的语音增强操作进一步缩减成以下表达式,该表达式提供了表达式(30)中的矩阵H的明确示例:Based on the diagonalized matrix H MS given in Expression (33), the speech enhancement operation in the parameter enhancement mode as represented by Expression (31) can be further reduced to the following expression, which provides a clear example of the matrix H in Expression (30):
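Carrying out the product of expression (31) with an assumed mid-channel-only diagonal matrix (entries 1 + g·p1 and 1) yields an explicit candidate for the matrix H of expression (30); a sketch:

```latex
\mathbf{H} =
\begin{bmatrix}
1 + \tfrac{1}{2}\,g\,p_1 & \tfrac{1}{2}\,g\,p_1 \\
\tfrac{1}{2}\,g\,p_1 & 1 + \tfrac{1}{2}\,g\,p_1
\end{bmatrix}
```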

在波形参数混合增强模式下,可以使用以下示例表达式在M/S表示中表示语音增强操作:In waveform parameter hybrid enhancement mode, the speech enhancement operation can be expressed in M/S representation using the following example expressions:
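A sketch of expression (35), under the assumptions that Hd applies g1 to the mid-channel dialogue waveform signal dc,l (its only non-trivial element in the first row, per the description below) and that Hp applies the parametric term 1 + g2·p1 to the mid channel while passing the side channel through:

```latex
\mathbf{M}_e
= \mathbf{H}_d\,\mathbf{D}_c + \mathbf{H}_p\,\mathbf{M}
= \begin{bmatrix} g_1 \\ 0 \end{bmatrix} d_{c,l}
+ \begin{bmatrix} 1 + g_2\,p_1 & 0 \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} m_1 \\ m_2 \end{bmatrix}
\tag{35}
```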

其中,m1和m2在混合内容信号向量M中分别表示中间通道混合内容信号(例如,非M/S通道如左前通道和右前通道等中的混合内容信号之和)和侧通道混合内容信号(例如,非M/S通道如左前通道和右前通道等中的混合内容信号之差)。信号dc,l代表M/S表示的对话信号向量Dc中的中间通道对话波形信号(例如,表示混合内容中的对话的降低版本的编码波形等)。矩阵Hd表示基于M/S表示的中间通道中的对话信号dc,l的M/S表示中的语音增强操作,并且可以包括在第一行第一列(1×1)处的仅一个矩阵元素。矩阵Hp表示基于使用M/S表示的中间通道的预测参数p1重构的对话的、M/S表示中的语音增强操作。在一些实施方式中,例如,如表达式(23)和(24)中所描绘的,增益参数g1和g2共同(例如,在分别被应用于对话波形信号和重构对话等之后)对应于语音增强增益G。具体地,在与M/S表示的中间通道中的对话信号dc,l有关的波形编码语音增强操作中应用参数g1,而在与M/S表示的中间通道和侧通道中的混合内容信号m1和m2有关的参数编码语音增强操作中应用参数g2。参数g1和g2对整体增强增益以及两种语音增强方法之间的平衡进行控制。Wherein, m1 and m2 in the mixed content signal vector M represent the mid-channel mixed content signal (e.g., the sum of the mixed content signals in non-M/S channels such as the left and right front channels) and the side-channel mixed content signal (e.g., the difference between the mixed content signals in non-M/S channels such as the left and right front channels), respectively. Signal dc,l represents the mid-channel dialogue waveform signal in the M/S-represented dialogue signal vector Dc (e.g., an encoded waveform representing a degraded version of the dialogue in the mixed content). Matrix Hd represents the speech enhancement operation in the M/S representation based on the dialogue signal dc,l in the M/S-represented mid-channel and may include only one matrix element at the first row and first column (1×1). Matrix Hp represents the speech enhancement operation in the M/S representation based on the dialogue reconstructed using the prediction parameter p1 of the M/S-represented mid-channel. In some embodiments, for example, as depicted in expressions (23) and (24), gain parameters g1 and g2 together (e.g., after being applied to the dialog waveform signal and the reconstructed dialog, etc., respectively) correspond to a speech enhancement gain G. 
Specifically, parameter g1 is applied in a waveform-coded speech enhancement operation associated with the dialog signal dc ,l in the middle channel of the M/S representation, while parameter g2 is applied in a parametric-coded speech enhancement operation associated with the mixed content signals m1 and m2 in the middle and side channels of the M/ S representation. Parameters g1 and g2 control the overall enhancement gain and the balance between the two speech enhancement methods.

在非M/S表示中,可以使用以下表达式来表示与使用表达式(35)所表示的语音增强操作相对应的语音增强操作:In non-M/S representation, the speech enhancement operation corresponding to the speech enhancement operation expressed using expression (35) can be expressed using the following expression:
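Substituting the forward transform of the non-M/S channel signals Mc1 and Mc2 for m1 and m2, and applying the inverse transform (with its factor 1/2) to return to the non-M/S representation, expression (36) plausibly reads:

```latex
\begin{bmatrix} M_{e,c_1} \\ M_{e,c_2} \end{bmatrix}
= \frac{1}{2}
\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}
\left(
\mathbf{H}_d\,\mathbf{D}_c
+ \mathbf{H}_p
\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}
\begin{bmatrix} M_{c_1} \\ M_{c_2} \end{bmatrix}
\right)
\tag{36}
```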

其中,可以使用与非M/S表示和M/S表示之间的正向转换矩阵左乘的非M/S通道中的混合内容信号Mc1和Mc2来代替如表达式(35)中所示的M/S表示中的混合内容信号m1和m2。表达式(36)中的逆转换矩阵(具有因子1/2)将如表达式(35)所示的M/S表示中的语音增强混合内容信号转换回非M/S表示(例如,左前通道和右前通道等)中的语音增强混合内容信号。Here, the mixed content signals Mc1 and Mc2 in the non-M/S channels, which are left-multiplied by the forward conversion matrix between the non-M/S representation and the M/S representation, can be used instead of the mixed content signals m1 and m2 in the M/S representation as shown in Expression (35). The inverse conversion matrix (with a factor of 1/2) in Expression (36) converts the speech-enhanced mixed content signal in the M/S representation as shown in Expression (35) back to the speech-enhanced mixed content signal in the non-M/S representation (e.g., the left front channel and the right front channel, etc.).

另外,可选地或替选地,在一些实施方式中,如果在语音增强操作之后不再执行另外的基于QMF的处理,则出于效率原因,可以在QMF合成滤波器组之后在时域中执行语音增强操作(例如,如由Hd、Hp转换等所表示)中的一些或全部,这些操作将基于对话信号dc,l的语音增强内容与基于通过预测所重构的对话的语音增强混合内容进行组合。Additionally, optionally or alternatively, in some embodiments in which no further QMF-based processing is performed after the speech enhancement operations, some or all of the speech enhancement operations (e.g., as represented by Hd, Hp, transforms, etc.) that combine the speech-enhanced content based on the dialogue signal dc,l with the speech-enhanced mixed content based on the dialogue reconstructed by prediction may, for efficiency reasons, be performed in the time domain after the QMF synthesis filter bank.

可以基于以下一个或更多个预测参数生成方法中的一个来生成用于根据M/S表示的中间通道和侧通道中的一个或两个中的混合内容信号来构造/预测语音内容的预测参数,所述一个或更多个预测参数生成方法包括但不限于仅以下方法中的任意方法:如图1中所描绘的独立通道对话预测方法、如图2中所描绘的多通道对话预测方法等。在一些实施方式中,预测参数生成方法中的至少之一可以基于MMSE、梯度下降、一个或更多个其他优化方法等。Prediction parameters for constructing/predicting speech content based on mixed content signals in one or both of the mid and side channels of the M/S representation may be generated based on one or more prediction parameter generation methods, including but not limited to any of the following methods: the independent channel dialogue prediction method as depicted in FIG1 , the multi-channel dialogue prediction method as depicted in FIG2 , etc. In some embodiments, at least one of the prediction parameter generation methods may be based on MMSE, gradient descent, one or more other optimization methods, etc.

在一些实施方式中,可以在M/S表示中的音频节目的片段的参数编码增强数据(例如,与基于对话信号dc,l的语音增强内容有关等)与波形编码增强(例如,与基于通过预测所重构的对话的语音增强混合内容有关等)之间使用如先前所讨论的基于“盲”时间SNR的切换方法。In some embodiments, a "blind" temporal SNR-based switching approach as previously discussed can be used between parametrically coded enhancement data (e.g., relating to speech enhancement content based on a dialogue signal d c,l , etc.) and waveform coded enhancement (e.g., relating to speech enhancement mixed content based on dialogue reconstructed by prediction, etc.) for a segment of an audio program in an M/S representation.

在一些实施方式中,M/S表示中的波形数据(例如,与基于对话信号dc,l的语音增强内容有关等)和重构语音数据(例如,与基于通过预测所重构的对话的语音增强混合内容有关等)的组合(例如,由先前讨论的混和指示符指示,表达式(35)中的g1和g2的组合等)随时间变化,其中每个组合状态与携载波形数据和在重构语音数据时所使用的混合内容的比特流的相应片段的语音内容和其他音频内容有关。混和指示符被生成,使得由节目的相应片段中的语音内容与其他音频内容的信号特性(例如,语音内容的功率与其他音频内容的功率之比、SNR等)来确定(波形数据和重构语音数据的)当前组合状态。音频节目的片段的混和指示符可以是在图3的编码器的子系统29中针对片段所生成的混和指示符参数(或参数集合)。可以使用如先前所讨论的听觉掩蔽模型来更准确地预测对话信号向量Dc中的降低品质语音复本中的编码噪声如何被主要节目的音频混合掩蔽并且据此选择混和比率。In some embodiments, the combination of waveform data (e.g., related to speech enhancement content based on the dialogue signal d c,l, etc.) and reconstructed speech data (e.g., related to speech enhancement mixed content based on dialogue reconstructed by prediction, etc.) in the M/S representation (e.g., indicated by the previously discussed mixing indicator, the combination of g 1 and g 2 in expression (35), etc.) changes over time, where each combination state is related to the speech content and other audio content of the corresponding segment of the bitstream carrying the waveform data and the mixed content used in reconstructing the speech data. The mixing indicator is generated so that the current combination state (of the waveform data and the reconstructed speech data) is determined by the signal characteristics of the speech content and other audio content in the corresponding segment of the program (e.g., the ratio of the power of the speech content to the power of the other audio content, the SNR, etc.). The mixing indicator of a segment of the audio program can be a mixing indicator parameter (or parameter set) generated for the segment in the encoder subsystem 29 of Figure 3. An auditory masking model as previously discussed may be used to more accurately predict how the coding noise in the degraded speech replica in the dialogue signal vector Dc is masked by the main program's audio mix and select the mixing ratio accordingly.
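The SNR-driven choice of combination state described above can be sketched as follows; the threshold values and the linear crossfade between the two modes are illustrative assumptions (the text only says the state is determined by signal characteristics such as the ratio of speech power to other-content power):

```python
import numpy as np

def blend_indicator(speech, other, snr_lo_db=0.0, snr_hi_db=10.0):
    """Return a blend value in [0, 1] for one segment: 1.0 means use only
    waveform-coded enhancement, 0.0 only parametric-coded enhancement.

    speech, other: sample arrays for the segment's speech content and the
    remaining (non-speech) audio content.
    snr_lo_db, snr_hi_db: illustrative thresholds; this document only says
    the combination state is driven by the speech-to-other power ratio.
    """
    p_speech = np.mean(np.square(speech))
    p_other = np.mean(np.square(other)) + 1e-12  # avoid divide-by-zero
    snr_db = 10.0 * np.log10(p_speech / p_other + 1e-12)
    # Low SNR: parametric prediction is unreliable, favor the waveform copy.
    # High SNR: the waveform copy's coding noise would be poorly masked,
    # favor parametric reconstruction. Crossfade linearly in between.
    t = (snr_db - snr_lo_db) / (snr_hi_db - snr_lo_db)
    return float(np.clip(1.0 - t, 0.0, 1.0))
```

At very high segment SNR the function returns 0 (purely parametric-coded enhancement) and at very low SNR it returns 1 (purely waveform-coded enhancement), matching the "blind" temporal SNR-based switching discussed earlier.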

图3的编码器20的子系统28可以被配置成将与M/S语音增强操作有关的混和指示符包括在比特流中以作为要从编码器20输出的M/S语音增强元数据的一部分。可以根据与对话信号Dc中的编码伪声有关的缩放因子gmax(t)等来生成(例如,在图7的编码器的子系统13中)与M/S语音增强操作有关的混和指示符。缩放因子gmax(t)可以由图7的编码器的子系统14生成。图7的编码器的子系统13可以被配置成将混和指示符包括在要从图7的编码器输出的比特流中。另外,可选地或替选地,子系统13可以将由子系统14所生成的缩放因子gmax(t)包括在要从图7的编码器输出的比特流中。Subsystem 28 of encoder 20 of FIG. 3 can be configured to include a blending indicator related to the M/S speech enhancement operations in the bitstream as part of the M/S speech enhancement metadata to be output from encoder 20. The blending indicator related to the M/S speech enhancement operations can be generated (e.g., in subsystem 13 of the encoder of FIG. 7) based on, among other things, a scaling factor gmax(t) related to coding artifacts in the dialogue signal Dc. The scaling factor gmax(t) can be generated by subsystem 14 of the encoder of FIG. 7. Subsystem 13 of the encoder of FIG. 7 can be configured to include the blending indicator in the bitstream to be output from the encoder of FIG. 7. Additionally, optionally or alternatively, subsystem 13 can include the scaling factor gmax(t) generated by subsystem 14 in the bitstream to be output from the encoder of FIG. 7.

在一些实施方式中,由图7的操作10所生成的未增强音频混合A(t)表示参考音频通道配置中的混合内容信号向量(例如,其时间片段等)。由图7的元件12所生成的参数编码增强参数p(t)表示用于关于混合内容信号向量的每个片段执行M/S表示中的参数编码语音增强的M/S语音增强元数据中的至少一部分。在一些实施方式中,由图7的编码器15所生成的降低品质语音复本s'(t)表示M/S表示(例如,关于中间通道对话信号、侧通道对话信号等)中的对话信号向量。In some embodiments, the unenhanced audio mix A(t) generated by operation 10 of FIG. 7 represents a mixed content signal vector (e.g., a time segment thereof, etc.) in a reference audio channel configuration. The parametrically coded enhancement parameters p(t) generated by element 12 of FIG. 7 represent at least a portion of the M/S speech enhancement metadata used to perform parametrically coded speech enhancement in the M/S representation with respect to each segment of the mixed content signal vector. In some embodiments, the reduced-quality speech replica s'(t) generated by encoder 15 of FIG. 7 represents a dialog signal vector in the M/S representation (e.g., with respect to a mid-channel dialog signal, a side-channel dialog signal, etc.).

在一些实施方式中,图7的元件14生成缩放因子gmax(t)并且将其提供至编码元件13。在一些实施方式中,元件13针对音频节目的每个片段生成指示参考音频通道配置中的(例如,未增强等)混合内容信号向量、M/S语音增强元数据、如果可应用则有M/S表示中的对话信号向量、以及如果可应用则有缩放因子gmax(t)的编码音频比特流,该编码音频比特流可以被发送至或者以其他方式被递送至接收器。In some embodiments, element 14 of Figure 7 generates the scaling factor gmax (t) and provides it to encoding element 13. In some embodiments, element 13 generates, for each segment of the audio program, an encoded audio bitstream indicating a mixed content signal vector (e.g., unenhanced, etc.) in a reference audio channel configuration, M/S speech enhancement metadata, a dialog signal vector in M/S representation if applicable, and the scaling factor gmax (t) if applicable, which can be sent to or otherwise delivered to a receiver.

当将非M/S表示中的未增强音频信号与M/S语音增强元数据一起递送(例如,发送)至接收器时,接收器可以将未增强音频信号的每个片段转换至M/S表示,并且针对该片段执行由M/S语音增强元数据所指示的M/S语音增强操作。如果要在混合语音增强模式下或在波形编码增强模式下对片段执行语音增强操作,则除了非M/S表示中的未增强混合内容信号向量之外,还可以提供节目的片段的M/S表示中的对话信号向量。如果可应用,则接收并解析比特流的接收器可以被配置成:响应于缩放因子gmax(t)来生成混和指示符并且确定表达式(35)中的增益参数g1和g2。When the unenhanced audio signal in the non-M/S representation is delivered (e.g., transmitted) to a receiver together with the M/S speech enhancement metadata, the receiver may convert each segment of the unenhanced audio signal into the M/S representation and perform, on the segment, the M/S speech enhancement operations indicated by the M/S speech enhancement metadata. If speech enhancement operations are to be performed on a segment in the hybrid speech enhancement mode or in the waveform-coded enhancement mode, a dialogue signal vector in the M/S representation for the segment of the program may be provided in addition to the unenhanced mixed content signal vector in the non-M/S representation. If applicable, a receiver that receives and parses the bitstream may be configured to generate a mixing indicator in response to the scaling factor gmax(t) and determine the gain parameters g1 and g2 in Expression (35).

在一些实施方式中,在元件13的编码输出已经被递送至的接收器中,至少部分地在M/S表示中执行语音增强操作。在一个示例中,可以至少部分地基于根据由接收器接收的比特流所解析的混和指示符对未增强混合内容信号的每个片段应用与增强的预定(例如,所要求的)总量相对应的表达式(35)中的增益参数g1和g2。在另一个示例中,可以至少部分地基于从根据由接收器接收的比特流所解析的片段的缩放因子gmax(t)所确定的混和指示符对未增强的混合内容信号的每个片段应用与增强的预定(例如,所要求的)总量相对应的表达式(35)中的增益参数g1和g2In some embodiments, the speech enhancement operation is performed at least partially in the M/S representation in a receiver to which the encoded output of element 13 is delivered. In one example, the gain parameters g1 and g2 in expression (35) corresponding to a predetermined (e.g., required) amount of enhancement can be applied to each segment of the unenhanced mixed content signal based at least in part on a blend indicator parsed from a bitstream received by the receiver. In another example, the gain parameters g1 and g2 in expression (35) corresponding to a predetermined (e.g., required) amount of enhancement can be applied to each segment of the unenhanced mixed content signal based at least in part on a blend indicator determined from a scaling factor gmax ( t ) of the segment parsed from a bitstream received by the receiver.

在一些实施方式中,图3的编码器20的元件23被配置成响应于从级21和22输出的数据,生成包括M/S语音增强元数据的参数数据(例如,根据中间通道和/或侧通道中的混合内容等重构对话/语音内容的预测参数)。在一些实施方式中,图3的编码器20的混和指示符生成元件29被配置成响应于从级21和22输出的数据来生成确定参数语音增强内容(例如,使用增益参数g2等)和基于波形的语音增强内容(例如,使用增益参数g1等)的组合的混和指示符“BI”。In some embodiments, element 23 of encoder 20 of FIG. 3 is configured to generate parameter data including M/S speech enhancement metadata (e.g., prediction parameters for reconstructing dialogue/speech content based on mixed content in the mid channel and/or side channel, etc.) in response to data output from stages 21 and 22. In some embodiments, blending indicator generation element 29 of encoder 20 of FIG. 3 is configured to generate, in response to data output from stages 21 and 22, a blending indicator ("BI") that determines the combination of parametric speech enhancement content (e.g., using gain parameter g2, etc.) and waveform-based speech enhancement content (e.g., using gain parameter g1, etc.).

在对图3实施方式的变型中,在编码器中没有生成用于M/S混合语音增强的混和指示符(以及该混和指示符没有包括在从编码器输出的比特流中),而是替代地响应于从编码器输出的比特流(该比特流包括M/S通道中的波形数据和M/S语音增强元数据)来(例如,在对接收器40的变型中)生成用于M/S混合语音增强的混和指示符。In a variation of the embodiment of Figure 3, the mixing indicator for M/S hybrid speech enhancement is not generated in the encoder (and the mixing indicator is not included in the bitstream output from the encoder), but instead is generated (for example, in a variation of the receiver 40) in response to the bitstream output from the encoder (the bitstream including waveform data in the M/S channel and M/S speech enhancement metadata).

解码器40被耦接和配置(例如,被编程)为:从子系统30接收编码音频信号(例如,通过从子系统30中的存储装置读取或取回指示编码音频信号的数据,或者接收已经被子系统30发送的编码音频信号);根据编码音频信号对指示参考音频通道配置中的混合(语音与非语音)内容信号向量的数据进行解码;以及至少部分地在M/S表示中对参考音频通道配置中的解码混合内容执行语音增强操作。解码器40可以被配置成生成和输出(例如,至呈现系统等)指示语音增强混合内容的语音增强的解码音频信号。The decoder 40 is coupled and configured (e.g., programmed) to: receive an encoded audio signal from the subsystem 30 (e.g., by reading or retrieving data indicating the encoded audio signal from a storage device in the subsystem 30, or receiving the encoded audio signal having been transmitted by the subsystem 30); decode data indicating a mixed (speech and non-speech) content signal vector in a reference audio channel configuration from the encoded audio signal; and perform a speech enhancement operation on the decoded mixed content in the reference audio channel configuration, at least in part, in an M/S representation. The decoder 40 may be configured to generate and output (e.g., to a rendering system, etc.) a speech-enhanced decoded audio signal indicating the speech-enhanced mixed content.

在一些实施方式中,图4至图6中所描绘的呈现系统中的一些或全部可以被配置成:呈现通过M/S语音增强操作生成的语音增强混合内容,所述M/S语音增强操作中的至少一些是在M/S表示中所执行的操作。图6A示出了被配置成执行如表达式(35)中所表示的语音增强操作的示例呈现系统。In some embodiments, some or all of the rendering systems depicted in Figures 4 to 6 can be configured to render speech-enhanced mixed content generated by M/S speech enhancement operations, at least some of which are operations performed in an M/S representation. Figure 6A shows an example rendering system configured to perform speech enhancement operations as expressed in Expression (35).

图6A的呈现系统可以被配置成:响应于确定在参数语音增强操作中所使用的至少一个增益参数(例如,表达式(35)中的g2等)是非零的(例如,在混合增强模式下、在参数增强模式下等)来执行参数语音增强操作。例如,根据这样的确定,图6A的子系统68A可以被配置成:对非M/S通道上分布的混合内容信号向量(“混合音频(T/F)”)执行转换以生成M/S通道上分布的相应混合内容信号向量。若适当的话,该转换可以使用正向转换矩阵。可以应用用于参数增强操作的预测参数(例如,p1、p2等)、增益参数(例如,表达式(35)中的g2等),以根据M/S通道的混合内容信号向量来预测语音内容并且增强所预测的语音内容。The rendering system of FIG6A can be configured to perform a parametric speech enhancement operation in response to determining that at least one gain parameter used in the parametric speech enhancement operation (e.g., g 2 in expression (35) etc.) is non-zero (e.g., in hybrid enhancement mode, in parametric enhancement mode, etc.). For example, based on such a determination, the subsystem 68A of FIG6A can be configured to perform a transformation on the mixed content signal vector ("mixed audio (T/F)") distributed on the non-M/S channel to generate a corresponding mixed content signal vector distributed on the M/S channel. If appropriate, the transformation can use a forward transformation matrix. Prediction parameters (e.g., p 1 , p 2 , etc.) and gain parameters (e.g., g 2 in expression (35) etc.) for the parametric enhancement operation can be applied to predict speech content based on the mixed content signal vector of the M/S channel and enhance the predicted speech content.

图6A的呈现系统可以被配置成:响应于确定波形编码语音增强操作中所使用的至少一个增益参数(例如,表达式(35)中的g1等)是非零的(例如,在混合增强模式下、在波形编码增强模式下等)来执行波形编码语音增强操作。例如,根据这样的确定,图6A的呈现系统可以被配置成从所接收的编码音频信号接收/提取M/S通道上分布的对话信号向量(例如,关于混合内容信号向量中存在的语音内容的降低版本)。可以应用用于波形编码增强操作的增益参数(例如,表达式(35)中的g1等)以增强由M/S通道的对话信号向量所表示的语音内容。用户可定义的增强增益(G)可以与可能存在或可能不存在于比特流中的混和参数一起用于导出增益参数g1和g2。在一些实施方式中,可以从所接收的编码音频信号中的元数据中提取要与用户可定义的增强增益(G)一起使用以导出增益参数g1和g2的混和参数。在一些其他实施方式中,可以不从所接收的编码音频信号中的元数据提取这样的混和参数,而是可以由接收方解码器基于所接收的编码音频信号中的音频内容来导出这样的混和参数。The rendering system of FIG. 6A can be configured to perform a waveform-coded speech enhancement operation in response to determining that at least one gain parameter used in the waveform-coded speech enhancement operation (e.g., g1 in expression (35), etc.) is non-zero (e.g., in hybrid enhancement mode, in waveform-coded enhancement mode, etc.). For example, based on such a determination, the rendering system of FIG. 6A can be configured to receive/extract a dialogue signal vector distributed on the M/S channels from the received encoded audio signal (e.g., a reduced version of the speech content present in the mixed content signal vector). The gain parameter used for the waveform-coded enhancement operation (e.g., g1 in expression (35), etc.) can be applied to enhance the speech content represented by the dialogue signal vector of the M/S channels. A user-definable enhancement gain (G) can be used to derive gain parameters g1 and g2 using mixing parameters that may or may not be present in the bitstream. In some embodiments, the mixing parameters to be used with the user-definable enhancement gain (G) to derive gain parameters g1 and g2 can be extracted from metadata in the received encoded audio signal. In some other embodiments, such mixing parameters may not be extracted from metadata in the received encoded audio signal, but may be derived by the recipient decoder based on the audio content in the received encoded audio signal.

在一些实施方式中,M/S表示中的参数增强语音内容和波形编码增强语音内容的组合被设定(assert)或被输入至图6A的子系统64A。图6A的子系统64A可以被配置成:对M/S通道上分布的增强语音内容的组合执行转换以生成非M/S通道上分布的增强语音内容信号向量。若适当的话,该转换可以使用逆转换矩阵。可以将非M/S通道的增强语音内容信号向量与分布在非M/S通道上的混合内容信号向量(“混合音频(T/F)”)进行组合以生成语音增强的混合内容信号向量。In some embodiments, a combination of parametric enhanced speech content and waveform-coded enhanced speech content in the M/S representation is asserted or input to subsystem 64A of FIG. 6A. Subsystem 64A of FIG. 6A can be configured to perform a transformation on the combination of enhanced speech content distributed across the M/S channels to generate an enhanced speech content signal vector distributed across the non-M/S channels. If appropriate, the transformation can utilize an inverse transformation matrix. The enhanced speech content signal vector for the non-M/S channels can be combined with a mixed content signal vector ("Mixed Audio (T/F)") distributed across the non-M/S channels to generate a speech-enhanced mixed content signal vector.
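The forward-transform / enhance / inverse-transform flow described for FIG. 6A can be sketched per frame as below; the explicit forms of Hd and Hp (g1 applied to the mid-channel dialogue waveform, 1 + g2·p1 applied to the mid-channel mix) are assumptions consistent with the description of expression (35), and the function and variable names are illustrative:

```python
import numpy as np

FWD = np.array([[1.0, 1.0],
                [1.0, -1.0]])  # non-M/S -> M/S (mid = L+R, side = L-R)
INV = 0.5 * FWD                # its inverse, including the factor 1/2

def enhance_frame(mc1, mc2, d_mid, p1, g1, g2):
    """Hybrid speech enhancement of one two-channel frame.

    mc1, mc2: mixed-content samples in the two non-M/S channels
    d_mid:    mid-channel dialogue waveform signal (reduced-quality copy)
    p1:       mid-channel prediction parameter
    g1, g2:   waveform-coded and parametric-coded enhancement gains
    """
    m = FWD @ np.vstack([mc1, mc2])   # to M/S: rows are (mid, side)
    mid, side = m[0], m[1]
    # Parametric part: boost the speech predicted from the mid channel.
    mid_e = (1.0 + g2 * p1) * mid
    # Waveform part: add the scaled dialogue waveform to the mid channel.
    mid_e = mid_e + g1 * d_mid
    # Back to the non-M/S channels.
    return INV @ np.vstack([mid_e, side])
```

With g1 = g2 = 0 the frame passes through unchanged, which is a useful sanity check on the transform pair.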

在一些实施方式中,(例如,从图3的编码器20等输出的)编码音频信号的语法支持M/S标记从上游音频编码器(例如,图3的编码器20等)到下游音频解码器(例如,图3的解码器40等)的传输。当接收方音频解码器(例如,图3的解码器40等)至少部分地使用与M/S标记一起被传输的M/S控制数据、控制参数等来执行语音增强操作时,M/S标记由音频编码器呈现/设置(例如,图3的编码器20中的元件23等)。例如,当M/S标记被设置时,在根据语音增强算法(例如,独立通道对话预测、多通道对话预测、基于波形的波形参数混合等)中的一个或更多个来使用如与M/S标记一起所接收的M/S控制数据、控制参数等来应用M/S语音增强操作之前,接收方音频解码器(例如,图3的解码器40等)可以首先将非M/S通道中的立体声信号(例如,来自左通道和右通道等)转换成M/S表示的中间通道和侧通道。在接收方音频解码器(例如,图3的解码器40等)中,在执行M/S语音增强操作之后,可以将M/S表示中的语音增强信号转换回非M/S通道。In some embodiments, the syntax of the encoded audio signal (e.g., output from encoder 20 of FIG. 3, etc.) supports transmission of an M/S flag from an upstream audio encoder (e.g., encoder 20 of FIG. 3, etc.) to a downstream audio decoder (e.g., decoder 40 of FIG. 3, etc.). The M/S flag is presented/set by the audio encoder (e.g., element 23 of encoder 20 of FIG. 3, etc.) when the receiving audio decoder (e.g., decoder 40 of FIG. 3, etc.) performs a speech enhancement operation at least in part using M/S control data, control parameters, etc. transmitted along with the M/S flag. For example, when the M/S flag is set, the receiving audio decoder (e.g., decoder 40 of FIG. 3, etc.) may first convert the stereo signals in the non-M/S channels (e.g., from the left and right channels, etc.) into the mid and side channels of the M/S representation before applying the M/S speech enhancement operation according to one or more of the speech enhancement algorithms (e.g., independent channel dialogue prediction, multi-channel dialogue prediction, waveform-based waveform parameter mixing, etc.) using the M/S control data, control parameters, etc. received along with the M/S flag. In the receiving audio decoder (e.g., decoder 40 of FIG. 3, etc.), after performing the M/S speech enhancement operation, the speech enhancement signals in the M/S representation may be converted back into the non-M/S channels.

在一些实施方式中,由如本文中所描述的音频编码器(例如,图3的编码器20、图3的编码器20的元件23等)生成的语音增强元数据可以携载指示针对一个或更多个不同类型的语音增强操作的语音增强控制数据、控制参数等的一个或更多个集合的存在的一个或更多个特定标记。针对一个或更多个不同类型的语音增强操作的语音增强控制数据、控制参数等的一个或更多个集合可以但不限于仅包括作为M/S语音增强元数据的M/S控制数据、控制参数等的集合。语音增强元数据还可以包括指示对于要被语音增强的音频内容而言优选哪种类型的语音增强操作(例如,M/S语音增强操作、非M/S语音增强操作等)的优选标记。可以将语音增强元数据作为在包括针对非M/S参考音频通道配置编码的混合音频内容的编码音频信号中所递送的元数据的一部分递送至下游解码器(例如,图3的解码器40等)。在一些实施方式中,仅M/S语音增强元数据而不是非M/S语音增强元数据被包括在编码音频信号中。In some embodiments, speech enhancement metadata generated by an audio encoder as described herein (e.g., encoder 20 of FIG. 3 , element 23 of encoder 20 of FIG. 3 , etc.) may carry one or more specific flags indicating the presence of one or more sets of speech enhancement control data, control parameters, etc. for one or more different types of speech enhancement operations. The one or more sets of speech enhancement control data, control parameters, etc. for one or more different types of speech enhancement operations may, but are not limited to, only include sets of M/S control data, control parameters, etc. as M/S speech enhancement metadata. The speech enhancement metadata may also include a preference flag indicating which type of speech enhancement operation (e.g., M/S speech enhancement operation, non-M/S speech enhancement operation, etc.) is preferred for the audio content to be speech enhanced. The speech enhancement metadata may be delivered to a downstream decoder (e.g., decoder 40 of FIG. 3 , etc.) as part of the metadata delivered in an encoded audio signal including mixed audio content encoded for a non-M/S reference audio channel configuration. In some embodiments, only the M/S speech enhancement metadata, and not the non-M/S speech enhancement metadata, is included in the encoded audio signal.

Additionally, optionally, or alternatively, an audio decoder (e.g., 40 of FIG. 3, etc.) may be configured to determine and perform a specific type of speech enhancement operation (e.g., M/S speech enhancement, non-M/S speech enhancement, etc.) based on one or more factors. These factors may include, but are not limited to, one or more of: user input specifying a preference for a specific user-selected type of speech enhancement operation; user input specifying a preference for a system-selected type of speech enhancement operation; the capabilities of the specific audio channel configuration operated by the audio decoder; the availability of speech enhancement metadata for the specific type of speech enhancement operation; any encoder-generated preference flag for a type of speech enhancement operation; etc. In some embodiments, the audio decoder may implement one or more precedence rules, and may solicit further user input, etc., to determine the specific type of speech enhancement operation if these factors conflict with one another.
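The factor-driven selection just described can be sketched as a simple precedence routine. The particular ordering below (an explicit user choice first, then an encoder-generated preference flag, then metadata availability) is an illustrative assumption for this sketch, not an ordering mandated by the text; a real decoder would combine these factors according to its own precedence rules.

```python
def select_enhancement_type(user_choice=None,
                            ms_metadata_available=False,
                            non_ms_metadata_available=False,
                            encoder_preference=None):
    """Pick "M/S" or "non-M/S" speech enhancement; the precedence order is assumed."""
    # 1. An explicit user selection wins outright.
    if user_choice in ("M/S", "non-M/S"):
        return user_choice
    # 2. Otherwise honor an encoder-generated preference flag, if its metadata is present.
    if encoder_preference == "M/S" and ms_metadata_available:
        return "M/S"
    if encoder_preference == "non-M/S" and non_ms_metadata_available:
        return "non-M/S"
    # 3. Fall back to whichever metadata set the bitstream actually carries.
    if ms_metadata_available:
        return "M/S"
    if non_ms_metadata_available:
        return "non-M/S"
    return None  # no speech enhancement metadata available at all
```

With this ordering, a bitstream carrying only M/S metadata yields M/S enhancement unless the user explicitly selects otherwise.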

7. Example Process Flows

FIG. 8A and FIG. 8B illustrate example process flows. In some embodiments, one or more computing devices or units in a media processing system may perform these process flows.

FIG. 8A illustrates an example process flow that may be implemented by an audio encoder (e.g., encoder 20 of FIG. 3) as described herein. In block 802 of FIG. 8A, the audio encoder receives mixed audio content, in a reference audio channel representation, having a mix of speech content and non-speech audio content, the mixed audio content being distributed over a plurality of audio channels of the reference audio channel representation.

In block 804, the audio encoder transforms one or more portions of the mixed audio content distributed over one or more non-mid/side (M/S) channels of the plurality of audio channels of the reference audio channel representation into one or more portions of transformed mixed audio content in an M/S audio channel representation, distributed over one or more M/S channels of the M/S audio channel representation.
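As one concrete reading of block 804 (and of the inverse step used later at the decoder), a left/right stereo pair can be converted to a mid/side pair and back as below. The 0.5 weighting of the forward sum and difference is a common normalization choice and is an assumption here; the claims allow either a weighted or an unweighted sum/difference.

```python
def lr_to_ms(left, right, w=0.5):
    """Forward transform: mid = w*(L+R), side = w*(L-R); w is an assumed weight."""
    mid = [w * (l + r) for l, r in zip(left, right)]
    side = [w * (l - r) for l, r in zip(left, right)]
    return mid, side

def ms_to_lr(mid, side):
    """Inverse transform matching w=0.5 above: L = M + S, R = M - S."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right
```

With w=0.5 the round trip is exact: `ms_to_lr(*lr_to_ms(L, R))` reproduces the original left/right samples.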

In block 806, the audio encoder determines M/S speech enhancement metadata for the one or more portions of transformed mixed audio content in the M/S audio channel representation.

In block 808, the audio encoder generates an audio signal that comprises the mixed audio content in the reference audio channel representation and the M/S speech enhancement metadata for the one or more portions of transformed mixed audio content in the M/S audio channel representation.

In an embodiment, the audio encoder is further configured to perform: generating a version of the speech content, in the M/S audio channel representation, separate from the mixed audio content; and outputting the audio signal encoded with the version of the speech content in the M/S audio channel representation.

In an embodiment, the audio encoder is further configured to perform: generating blend indication data that enables a recipient audio decoder to apply speech enhancement to the mixed audio content with a specific quantitative combination of waveform-coded speech enhancement, based on the version of the speech content in the M/S audio channel representation, and parametric speech enhancement, based on a reconstructed version of the speech content in the M/S audio channel representation; and outputting the audio signal encoded with the blend indication data.

In an embodiment, the audio encoder is further configured to prevent the one or more portions of transformed mixed audio content in the M/S audio channel representation from being encoded as a part of the audio signal.

FIG. 8B illustrates an example process flow that may be implemented by an audio decoder (e.g., decoder 40 of FIG. 3) as described herein. In block 822 of FIG. 8B, the audio decoder receives an audio signal that comprises mixed audio content in a reference audio channel representation and mid/side (M/S) speech enhancement metadata.

In block 824 of FIG. 8B, the audio decoder transforms one or more portions of the mixed audio content distributed over one, two, or more non-M/S channels of a plurality of audio channels of the reference audio channel representation into one or more portions of transformed mixed audio content in an M/S audio channel representation, distributed over one or more M/S channels of the M/S audio channel representation.

In block 826 of FIG. 8B, the audio decoder performs one or more M/S speech enhancement operations, based on the M/S speech enhancement metadata, on the one or more portions of transformed mixed audio content in the M/S audio channel representation, to generate one or more portions of enhanced speech content in the M/S representation.

In block 828 of FIG. 8B, the audio decoder combines the one or more portions of transformed mixed audio content in the M/S audio channel representation with the one or more portions of enhanced speech content in the M/S representation, to generate one or more portions of speech-enhanced mixed audio content in the M/S representation.
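Block 828 amounts to mixing the enhanced speech back into the transformed mixed content. A minimal per-sample sketch of the mid-channel case, assuming a simple additive combination with a gain g (the actual combination rule is governed by the speech enhancement metadata), is:

```python
def combine_enhanced_speech(transformed_mixed, enhanced_speech, g=0.5):
    """Block 828 sketch: add enhanced speech into the mid-channel mix with an assumed gain g."""
    return [m + g * s for m, s in zip(transformed_mixed, enhanced_speech)]
```

A larger g corresponds to a stronger boost of the speech content relative to the rest of the mix.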

In an embodiment, the audio decoder is further configured to inversely transform the one or more portions of speech-enhanced mixed audio content in the M/S representation into one or more portions of speech-enhanced mixed audio content in the reference audio channel representation.

In an embodiment, the audio decoder is further configured to perform: extracting, from the audio signal, a version of the speech content, in the M/S audio channel representation, separate from the mixed audio content; and performing, based on the M/S speech enhancement metadata, one or more speech enhancement operations on one or more portions of the version of the speech content in the M/S audio channel representation, to generate one or more second portions of enhanced speech content in the M/S audio channel representation.

In an embodiment, the audio decoder is further configured to perform: determining blend indication data for speech enhancement; and generating, based on the blend indication data for speech enhancement, a specific quantitative combination of waveform-coded speech enhancement, based on the version of the speech content in the M/S audio channel representation, and parametric speech enhancement, based on a reconstructed version of the speech content in the M/S audio channel representation.
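One way to realize this "specific quantitative combination" at the decoder is a linear cross-fade between the two enhancement signals, steered by a blend parameter carried in the blend indication data. The cross-fade form and the convention for the parameter below are illustrative assumptions, not the definitive combination rule of the system described here.

```python
def blend_speech_enhancement(waveform_coded, parametric, alpha_c):
    """Blend waveform-coded and parametric speech enhancement signals per sample.
    Assumed convention: alpha_c = 1.0 -> pure waveform-coded enhancement,
    alpha_c = 0.0 -> pure parametric enhancement."""
    return [alpha_c * w + (1.0 - alpha_c) * p
            for w, p in zip(waveform_coded, parametric)]
```

Intermediate values of alpha_c trade the fidelity of the separately coded speech waveform against the bitrate savings of the parametric reconstruction.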

In an embodiment, the blend indication data is generated based at least in part on one or more SNR values for the one or more portions of transformed mixed audio content in the M/S audio channel representation. The one or more SNR values represent one or more of the following power ratios: a power ratio of the speech content to the non-speech audio content in the one or more portions of transformed mixed audio content in the M/S audio channel representation; or a power ratio of the speech content to the total audio content in the one or more portions of transformed mixed audio content in the M/S audio channel representation.
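Either of the two power ratios can be computed per content portion as sketched below. Measuring power as the mean squared sample value is an assumption about the measurement; the result is expressed in dB.

```python
import math

def mean_power(samples):
    """Mean squared sample value over one content portion."""
    return sum(s * s for s in samples) / len(samples)

def snr_db(speech, reference):
    """10*log10 of the speech-to-reference power ratio; `reference` may be the
    non-speech content or the total mixed content, per the two ratios above."""
    return 10.0 * math.log10(mean_power(speech) / mean_power(reference))
```

For example, a speech portion at twice the amplitude of the non-speech content gives a 4:1 power ratio, i.e., about 6.02 dB.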

In an embodiment, the specific quantitative combination of waveform-coded speech enhancement, based on the version of the speech content in the M/S audio channel representation, and parametric speech enhancement, based on the reconstructed version of the speech content in the M/S audio channel representation, is determined with an auditory masking model, in which the waveform-coded speech enhancement based on the version of the speech content in the M/S audio channel representation represents, among a plurality of combinations of waveform-coded speech enhancement and parametric speech enhancement, a maximum relative amount of speech enhancement that ensures that coding noise in an output speech-enhanced audio program does not become objectionably audible.

In an embodiment, at least a part of the M/S speech enhancement metadata enables a recipient audio decoder to reconstruct a version of the speech content in the M/S representation from the mixed audio content in the reference audio channel representation.

In an embodiment, the M/S speech enhancement metadata comprises metadata relating to one or more of waveform-coded speech enhancement operations in the M/S audio channel representation, or parametric speech enhancement operations in the M/S audio channel representation.

In an embodiment, the reference audio channel representation comprises audio channels relating to surround speakers. In an embodiment, the one or more non-M/S channels of the reference audio channel representation comprise one or more of a center channel, a left channel, or a right channel, whereas the one or more M/S channels of the M/S audio channel representation comprise one or more of a mid channel or a side channel.

In an embodiment, the M/S speech enhancement metadata comprises a single set of speech enhancement metadata relating to the mid channel of the M/S audio channel representation. In an embodiment, the M/S speech enhancement metadata represents a part of overall audio metadata encoded in the audio signal. In an embodiment, the audio metadata encoded in the audio signal comprises a data field that indicates the presence of the M/S speech enhancement metadata. In an embodiment, the audio signal is a part of an audiovisual signal.

In an embodiment, an apparatus comprising a processor is configured to perform any one of the methods as described herein.

In an embodiment, a non-transitory computer-readable storage medium comprises software instructions that, when executed by one or more processors, cause performance of any one of the methods as described herein. Note that, although separate embodiments are discussed herein, any combination of the embodiments and/or partial embodiments discussed herein may be combined to form further embodiments.

8. Implementation Mechanisms - Hardware Overview

According to one embodiment, the techniques described herein are implemented by one or more special-purpose computing devices. The special-purpose computing devices may be hard-wired to perform the techniques, or may include digital electronic devices such as one or more application-specific integrated circuits (ASICs) or field-programmable gate arrays (FPGAs) that are persistently programmed to perform the techniques, or may include one or more general-purpose hardware processors programmed to perform the techniques pursuant to program instructions in firmware, memory, other storage, or a combination thereof. Such special-purpose computing devices may also combine custom hard-wired logic, ASICs, or FPGAs with custom programming to accomplish the techniques. The special-purpose computing devices may be desktop computer systems, portable computer systems, handheld devices, networking devices, or any other device that incorporates hard-wired and/or program logic to implement the techniques.

For example, FIG. 9 is a block diagram that illustrates a computer system 900 upon which an embodiment of the invention may be implemented. Computer system 900 includes a bus 902 or other communication mechanism for communicating information, and a hardware processor 904 coupled with bus 902 for processing information. Hardware processor 904 may be, for example, a general-purpose microprocessor.

Computer system 900 also includes a main memory 906, such as a random-access memory (RAM) or other dynamic storage device, coupled to bus 902 for storing information and instructions to be executed by processor 904. Main memory 906 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 904. Such instructions, when stored in non-transitory storage media accessible to processor 904, render computer system 900 into a special-purpose machine that is customized to perform the operations specified in the instructions.

Computer system 900 further includes a read-only memory (ROM) 908 or other static storage device coupled to bus 902 for storing static information and instructions for processor 904. A storage device 910, such as a magnetic disk or optical disk, is provided and coupled to bus 902 for storing information and instructions.

Computer system 900 may be coupled via bus 902 to a display 912, such as a liquid crystal display (LCD), for displaying information to a computer user. An input device 914, including alphanumeric and other keys, is coupled to bus 902 for communicating information and command selections to processor 904. Another type of user input device is cursor control 916, such as a mouse, a trackball, or cursor direction keys, for communicating direction information and command selections to processor 904 and for controlling cursor movement on display 912. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allow the device to specify positions in a plane.

Computer system 900 may implement the techniques described herein using customized hard-wired logic, one or more ASICs or FPGAs, firmware, and/or program logic which, in combination with the computer system, causes or programs computer system 900 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 900 in response to processor 904 executing one or more sequences of one or more instructions contained in main memory 906. Such instructions may be read into main memory 906 from another storage medium, such as storage device 910. Execution of the sequences of instructions contained in main memory 906 causes processor 904 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of, or in combination with, software instructions.

The term "storage media" as used herein refers to any non-transitory media that store data and/or instructions that cause a machine to operate in a specific fashion. Such storage media may comprise non-volatile media and/or volatile media. Non-volatile media include, for example, optical or magnetic disks, such as storage device 910. Volatile media include dynamic memory, such as main memory 906. Common forms of storage media include, for example, a floppy disk, a flexible disk, a hard disk, a solid-state drive, magnetic tape or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, any other memory chip, or a cartridge.

Storage media are distinct from, but may be used in conjunction with, transmission media. Transmission media participate in transferring information between storage media. For example, transmission media include coaxial cables, copper wire, and fiber optics, including the wires that comprise bus 902. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infrared data communications.

Various forms of media may be involved in carrying one or more sequences of one or more instructions to processor 904 for execution. For example, the instructions may initially be carried on a magnetic disk or solid-state drive of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 900 can receive the data on the telephone line and use an infrared transmitter to convert the data to an infrared signal. An infrared detector can receive the data carried in the infrared signal, and appropriate circuitry can place the data on bus 902. Bus 902 carries the data to main memory 906, from which processor 904 retrieves and executes the instructions. The instructions received by main memory 906 may optionally be stored on storage device 910 either before or after execution by processor 904.

Computer system 900 also includes a communication interface 918 coupled to bus 902. Communication interface 918 provides a two-way data communication coupling to a network link 920 that is connected to a local network 922. For example, communication interface 918 may be an integrated services digital network (ISDN) card, a cable modem, a satellite modem, or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 918 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links may also be implemented. In any such implementation, communication interface 918 sends and receives electrical, electromagnetic, or optical signals that carry digital data streams representing various types of information.

Network link 920 typically provides data communication through one or more networks to other data devices. For example, network link 920 may provide a connection through local network 922 to a host computer 924 or to data equipment operated by an Internet Service Provider (ISP) 926. ISP 926 in turn provides data communication services through the worldwide packet data communication network now commonly referred to as the "Internet" 928. Local network 922 and Internet 928 both use electrical, electromagnetic, or optical signals that carry digital data streams. The signals through the various networks, and the signals on network link 920 and through communication interface 918, which carry the digital data to and from computer system 900, are example forms of transmission media.

Computer system 900 can send messages and receive data, including program code, through the network(s), network link 920, and communication interface 918. In the Internet example, a server 930 might transmit a requested code for an application program through Internet 928, ISP 926, local network 922, and communication interface 918.

The received code may be executed by processor 904 as it is received, and/or stored in storage device 910 or other non-volatile storage for later execution.

9. Equivalents, Extensions, Alternatives, and Miscellaneous

In the foregoing specification, embodiments of the invention have been described with reference to numerous specific details that may vary from implementation to implementation. Thus, the sole and exclusive indicator of what the invention is, and what is intended by the applicants to be the invention, is the set of claims that issue from this application, in the specific form in which such claims issue, including any subsequent correction. Any definitions expressly set forth herein for terms contained in such claims shall govern the meaning of such terms as used in the claims. Hence, no limitation, element, property, feature, advantage, or attribute that is not expressly recited in a claim should limit the scope of such claim in any way. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

Claims (34)

1. A method for audio signal processing, comprising: receiving mixed audio content, in a reference audio channel representation, distributed over a plurality of audio channels of the reference audio channel representation, the mixed audio content having a mix of speech content and non-speech audio content; converting one or more portions of the mixed audio content distributed over two or more non-M/S channels of the plurality of audio channels of the reference audio channel representation into one or more converted mixed audio content portions, in an M/S audio channel representation, distributed over one or more channels of the M/S audio channel representation, wherein the M/S audio channel representation comprises at least a mid channel and a side channel, wherein the mid channel represents a weighted or unweighted sum of two channels of the reference audio channel representation, and wherein the side channel represents a weighted or unweighted difference of the two channels of the reference audio channel representation; determining metadata for speech enhancement for the one or more converted mixed audio content portions in the M/S audio channel representation; and generating an audio signal comprising the mixed audio content and the metadata for speech enhancement for the one or more converted mixed audio content portions in the M/S audio channel representation; wherein the method is performed by one or more computing devices.

2. The method of claim 1, wherein the mixed audio content is in a non-M/S audio channel representation.

3. The method of any one of claims 1 to 2, further comprising: generating a version of the speech content, in the M/S audio channel representation, separate from the mixed audio content; and outputting an audio signal encoded with the version of the speech content in the M/S audio channel representation.

4. The method of claim 3, further comprising: generating blend indication data indicating a specific quantitative combination of a first type of speech enhancement and a second type of speech enhancement to be generated by a recipient audio decoder, wherein the first type of speech enhancement is waveform-coded speech enhancement based on the version of the speech content in the M/S audio channel representation, and wherein the second type of speech enhancement is parametric speech enhancement based on a reconstructed version of the speech content in the M/S audio channel representation; and outputting an audio signal encoded with the blend indication data.

5. The method of claim 4, wherein at least a part of the metadata for speech enhancement enables a recipient audio decoder to reconstruct the reconstructed version of the speech content in the M/S representation from the mixed audio content in the reference audio channel representation.

6. The method of any one of claims 4 to 5, wherein the blend indication data is generated based at least in part on one or more SNR values for the one or more converted mixed audio content portions in the M/S audio channel representation, wherein the one or more SNR values represent one or more of the following power ratios: a power ratio of speech content to non-speech audio content in the one or more converted mixed audio content portions in the M/S audio channel representation, or a power ratio of speech content to total audio content in the one or more converted mixed audio content portions in the M/S audio channel representation.

7. The method of any one of claims 4 to 5, wherein the specific quantitative combination of the first type of speech enhancement and the second type of speech enhancement is determined with an auditory masking model, in which the first type of speech enhancement represents, among a plurality of combinations of the first type of speech enhancement and the second type of speech enhancement, a maximum relative amount of speech enhancement that ensures that coding noise in an output speech-enhanced audio program does not become objectionably audible.

8. The method of any one of claims 1 to 2, wherein at least a part of the metadata for speech enhancement enables a recipient audio decoder to reconstruct a version of the speech content in the M/S representation from the mixed audio content in the reference audio channel representation.

9. The method of any one of claims 1 to 2, wherein the metadata for speech enhancement comprises metadata relating to one or more of a waveform-coded speech enhancement operation based on a version of the speech content in the M/S audio channel representation, or a parametric speech enhancement operation in the M/S audio channel representation.

10. The method of any one of claims 1 to 2, wherein the reference audio channel representation comprises audio channels relating to surround speakers.

11. The method of any one of claims 1 to 2, wherein the two or more non-M/S channels of the reference audio channel representation comprise two or more of a center channel, a left channel, or a right channel; and wherein the one or more M/S channels of the M/S audio channel representation comprise one or more of a mid channel or a side channel.

12. The method of any one of claims 1 to 2, wherein the metadata for speech enhancement comprises a single set of speech enhancement metadata relating to the mid channel of the M/S audio channel representation.

13. The method of any one of claims 1 to 2, further comprising preventing the one or more converted mixed audio content portions in the M/S audio channel representation from being encoded as a part of the audio signal.

14. The method of any one of claims 1 to 2, wherein the metadata for speech enhancement represents a part of overall audio metadata encoded in the audio signal.

15. The method of any one of claims 1 to 2, wherein audio metadata encoded in the audio signal includes a data field indicating the presence of the metadata for speech enhancement.

16. The method of any one of claims 1 to 2, wherein the audio signal is a part of an audiovisual signal.

17. A method for audio signal processing, comprising: receiving an audio signal comprising metadata for speech enhancement and mixed audio content in a reference audio channel representation, the mixed audio content having a mix of speech content and non-speech audio content; converting one or more portions of the mixed audio content distributed over two or more non-M/S channels of a plurality of audio channels of the reference audio channel representation into one or more converted mixed audio content portions, in an M/S audio channel representation, distributed over one or more M/S channels of the M/S audio channel representation, wherein the M/S audio channel representation comprises at least a mid channel and a side channel, wherein the mid channel represents a weighted or unweighted sum of two channels of the reference audio channel representation, and wherein the side channel represents a weighted or unweighted difference of the two channels of the reference audio channel representation; performing one or more speech enhancement operations, based on the metadata for speech enhancement, on the one or more converted mixed audio content portions in the M/S audio channel representation, to generate one or more enhanced speech content portions in an M/S representation; and combining the one or more converted mixed audio content portions in the M/S audio channel representation with the one or more enhanced speech content portions in the M/S representation, to generate one or more speech-enhanced mixed audio content portions in the M/S representation; wherein the method is performed by one or more computing devices.

18. The method of claim 17, wherein the converting, performing, and combining steps are implemented in a single operation performed on the one or more portions of the mixed audio content distributed over the two or more non-M/S channels of the plurality of audio channels of the reference audio channel representation.

19. The method of any one of claims 17 to 18, further comprising inversely converting the one or more speech-enhanced mixed audio content portions in the M/S representation into one or more speech-enhanced mixed audio content portions in the reference audio channel representation.

20. The method of any one of claims 17 to 18, further comprising:
The method according to any one of claims 17 to 18, further comprising: 从所述音频信号中提取所述M/S音频通道表示中的与所述混合音频内容分立的语音内容的版本;以及Extract from the audio signal a version of the speech content separate from the mixed audio content in the M/S audio channel representation; and 基于用于语音增强的所述元数据的至少一部分对所述M/S音频通道表示中的所述语音内容的所述版本中的一个或更多个部分执行一个或更多个语音增强操作,以生成所述M/S音频通道表示中的一个或更多个第二增强语音内容部分。Based on at least a portion of the metadata used for speech enhancement, one or more speech enhancement operations are performed on one or more portions of the versions of the speech content in the M/S audio channel representation to generate one or more second enhanced speech content portions in the M/S audio channel representation. 21.根据权利要求20所述的方法,还包括:21. The method of claim 20, further comprising: 确定用于语音增强的混合指示数据;Determine the mixed indication data for speech enhancement; 基于用于语音增强的所述混合指示数据,生成两种类型的语音增强的特定量组合,其中,第一类型的语音增强是基于所述M/S音频通道表示中的所述语音内容的所述版本的波形编码语音增强,第二类型的语音增强是基于所述M/S音频通道表示中的所述语音内容的重构版本的参数语音增强。Based on the hybrid indication data used for speech enhancement, a specific combination of two types of speech enhancement is generated, wherein the first type of speech enhancement is waveform-coded speech enhancement based on the version of the speech content in the M/S audio channel representation, and the second type of speech enhancement is parametric speech enhancement based on the reconstructed version of the speech content in the M/S audio channel representation. 22.根据权利要求21所述的方法,其中,所述混合指示数据由生成所述音频信号的上游音频编码器或接收所述音频信号的接收方音频解码器中的一个至少部分地基于针对所述M/S音频通道表示中的所述一个或更多个转换混合音频内容部分的一个或更多个SNR值来生成,其中,所述一个或更多个SNR值表示下述功率比中的一个或更多个功率比:所述M/S音频通道表示中的所述一个或更多个转换混合音频内容部分的语音内容与非语音音频内容的功率比,或者所述M/S音频通道表示中的转换混合音频内容或参考音频通道表示中的混合音频内容之一的所述一个或更多个部分的语音内容与总音频内容的功率比。22. 
The method of claim 21, wherein the mixing indication data is generated at least in part by one of the upstream audio encoder that generates the audio signal or the receiver audio decoder that receives the audio signal based on one or more SNR values for one or more converted mixed audio content portions in the M/S audio channel representation, wherein the one or more SNR values represent one or more of the following power ratios: the power ratio of speech content to non-speech audio content in one or more converted mixed audio content portions in the M/S audio channel representation, or the power ratio of speech content to total audio content in one or more portions of the converted mixed audio content in the M/S audio channel representation or the mixed audio content in the reference audio channel representation. 23.根据权利要求21至22中任一项所述的方法,其中,所述两种类型的语音增强的特定量组合是使用由生成所述音频信号的上游音频编码器或者接收所述音频信号的接收方音频解码器中的一个构造的听觉掩蔽模型确定的,在所述听觉掩蔽模型中,所述第一类型的语音增强是代表所述第一类型的语音增强和所述第二类型的语音增强的多个组合中的最大相对语音增强量,其确保输出语音增强音频节目中的编码噪声不听起来令人讨厌。23. The method of any one of claims 21 to 22, wherein the specific combination of the two types of speech enhancement is determined using an auditory masking model constructed by one of the upstream audio encoder that generates the audio signal or the receiver audio decoder that receives the audio signal, wherein the first type of speech enhancement is the maximum relative speech enhancement amount among a plurality of combinations of the first type of speech enhancement and the second type of speech enhancement, which ensures that the encoded noise in the output speech-enhanced audio program does not sound unpleasant. 24.根据权利要求17至18中任一项所述的方法,其中,用于语音增强的所述元数据中的至少一部分使得接收方音频解码器能够根据所述参考音频通道表示中的所述混合音频内容重构M/S表示中的所述语音内容的版本。24. 
The method of any one of claims 17 to 18, wherein at least a portion of the metadata for speech enhancement enables the receiver audio decoder to reconstruct a version of the speech content in the M/S representation based on the mixed audio content in the reference audio channel representation. 25.根据权利要求17至18中任一项所述的方法,其中,用于语音增强的所述元数据包括与基于所述语音内容的版本的所述M/S音频通道表示中的波形编码语音增强操作或者所述M/S音频通道表示中的参数语音增强操作中的一个或更多个有关的元数据。25. The method according to any one of claims 17 to 18, wherein the metadata for speech enhancement includes metadata relating to one or more of waveform-coded speech enhancement operations or parametric speech enhancement operations in the M/S audio channel representation based on the version of the speech content. 26.根据权利要求17至18中任一项所述的方法,其中,所述参考音频通道表示包括与环绕扬声器有关的音频通道。26. The method according to any one of claims 17 to 18, wherein the reference audio channel represents an audio channel relating to a surround speaker. 27.根据权利要求17至18中任一项所述的方法,其中,所述参考音频通道表示的所述两个或更多个非M/S通道包括中央通道、左通道或右通道中的一个或更多个;并且其中,所述M/S音频通道表示的所述一个或更多个M/S通道包括中间通道或侧通道中的一个或更多个。27. The method according to any one of claims 17 to 18, wherein the two or more non-M/S channels represented by the reference audio channel include one or more of a center channel, a left channel, or a right channel; and wherein the one or more M/S channels represented by the M/S audio channel include one or more of a center channel or a side channel. 28.根据权利要求17至18中任一项所述的方法,其中,用于语音增强的所述元数据包括与所述M/S音频通道表示的中间通道有关的单个语音增强元数据集合。28. The method according to any one of claims 17 to 18, wherein the metadata for speech enhancement comprises a single set of speech enhancement metadata relating to an intermediate channel represented by the M/S audio channel. 29.根据权利要求17至18中任一项所述的方法,其中,用于语音增强的所述元数据表示编码在所述音频信号中的全部音频元数据中的一部分。29. The method according to any one of claims 17 to 18, wherein the metadata for speech enhancement represents a portion of all audio metadata encoded in the audio signal. 
30.根据权利要求17至18中任一项所述的方法,其中,编码在所述音频信号中的音频元数据包括指示用于语音增强的所述元数据的存在的数据字段。30. The method according to any one of claims 17 to 18, wherein the audio metadata encoded in the audio signal includes a data field indicating the presence of the metadata for speech enhancement. 31.根据权利要求17至18中任一项所述的方法,其中,所述音频信号是音视频信号的一部分。31. The method according to any one of claims 17 to 18, wherein the audio signal is a part of an audio-visual signal. 32.一种媒体处理系统,所述媒体处理系统被配置成执行权利要求1至31中所述的方法中的任一个。32. A media processing system configured to perform any one of the methods of claims 1 to 31. 33.一种设备,所述设备包括处理器并且所述设备被配置成执行权利要求1至31中所述的方法中的任一个。33. An apparatus comprising a processor and configured to perform any one of the methods of claims 1 to 31. 34.一种非暂态计算机可读存储介质,所述非暂态计算机可读存储介质包括软件指令,所述软件指令当由一个或更多个处理器执行时使得执行权利要求1至31中所述的方法中的任一个。34. A non-transitory computer-readable storage medium comprising software instructions that, when executed by one or more processors, cause to perform any one of the methods of claims 1 to 31.
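As an illustration of the mid/side representation recited in claims 17 and 19 — a mid channel formed as a weighted or unweighted sum of two reference channels, and a side channel formed as the corresponding difference — a minimal sketch is given below. The function names, the 0.5 normalization factor, and the single weight parameter `w` are illustrative assumptions, not part of the claimed method.

```python
import numpy as np

def lr_to_ms(left, right, w=1.0):
    """Convert a left/right channel pair to a mid/side (M/S) pair.

    The mid channel is a (optionally weighted) sum of the two input
    channels; the side channel is the corresponding difference.
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    mid = 0.5 * (left + w * right)   # weighted sum
    side = 0.5 * (left - w * right)  # weighted difference
    return mid, side

def ms_to_lr(mid, side, w=1.0):
    """Inverse conversion back to the left/right representation
    (corresponding to the inverse conversion of claim 19)."""
    left = mid + side
    right = (mid - side) / w
    return left, right
```

With this normalization the conversion is exactly invertible, so enhancement applied in the M/S domain can be converted back to the reference channel representation without loss.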
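Claims 21 to 23 describe blending waveform-coded enhancement (based on a transmitted version of the speech content) with parametric enhancement (based on a reconstructed version of the speech content), steered by SNR-derived mixing indication data. A minimal sketch of one such blending rule follows; the dB thresholds, the linear crossfade, and all function and parameter names are illustrative assumptions — the claims leave the mapping from SNR to the specific quantitative combination (e.g., via an auditory masking model) open.

```python
import numpy as np

def snr_db(speech, non_speech, eps=1e-12):
    """Power ratio of speech content to non-speech content, in dB."""
    ps = np.mean(np.square(np.asarray(speech, dtype=float)))
    pn = np.mean(np.square(np.asarray(non_speech, dtype=float)))
    return 10.0 * np.log10((ps + eps) / (pn + eps))

def blend_weight(snr, lo=-6.0, hi=12.0):
    """Map an SNR value to a waveform-coded enhancement weight in [0, 1].

    Below `lo` dB the transmitted speech copy dominates (weight 1.0,
    pure waveform-coded enhancement); above `hi` dB the parametric
    reconstruction dominates (weight 0.0); a linear crossfade is
    assumed in between.
    """
    return float(np.clip((hi - snr) / (hi - lo), 0.0, 1.0))

def hybrid_enhance(mixed, speech_copy, speech_reconstructed, gain, weight):
    """Combine the two enhancement types with the given blend weight
    and add the result to the mixed audio content."""
    boost = weight * speech_copy + (1.0 - weight) * speech_reconstructed
    return np.asarray(mixed, dtype=float) + gain * boost
```

In this sketch the weight can be computed per portion (e.g., per frame or band) by an upstream encoder and carried as mixing indication data, or derived at the decoder, as the claims permit either placement.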
HK16110573.6A 2013-08-28 2014-08-27 Hybrid waveform-coded and parametric-coded speech enhancement HK1222470B (en)

Applications Claiming Priority (7)

Application Number Priority Date Filing Date Title
US201361870933P 2013-08-28 2013-08-28
US61/870,933 2013-08-28
US201361895959P 2013-10-25 2013-10-25
US61/895,959 2013-10-25
US201361908664P 2013-11-25 2013-11-25
US61/908,664 2013-11-25
PCT/US2014/052962 WO2015031505A1 (en) 2013-08-28 2014-08-27 Hybrid waveform-coded and parametric-coded speech enhancement

Publications (2)

Publication Number Publication Date
HK1222470A1 HK1222470A1 (en) 2017-06-30
HK1222470B true HK1222470B (en) 2021-01-22
