HK1254791B - Companding apparatus and method to reduce quantization noise using advanced spectral extension

Info

Publication number: HK1254791B
Application number: HK18113875.3A
Authority: HK
Inventors: P‧何德林; A‧比斯沃斯; M‧舒格; V‧迈勒考特
Original assignee: 杜比实验室特许公司; 杜比国际公司
Priority date: 2013-04-05
Filing date: 2015-12-08
Publication date: 2022-07-29

Description

Companding device and method for reducing quantization noise using advanced spectrum extension

This application is a divisional of the invention patent application No. 201480008819.0, filed April 1, 2014, entitled “Companding apparatus and method to reduce quantization noise using advanced spectral extension”.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. Provisional Patent Application No. 61/809,028, filed April 5, 2013, and No. 61/877,167, filed September 12, 2013, both of which are hereby incorporated by reference in their entirety.

Technical Field

One or more embodiments relate generally to audio signal processing, and more particularly to reducing coding noise in audio codecs using compression/expansion (companding) techniques.

Background Art

Many popular digital audio formats utilize lossy data compression techniques that discard some of the data to reduce storage or data rate requirements. The application of lossy data compression not only reduces the fidelity of the source content (e.g., audio content), but can also introduce perceptible distortion in the form of compression artifacts. In the context of audio coding systems, these audible artifacts are referred to as coding noise or quantization noise.

Digital audio systems employ codecs (encoder-decoder components) to compress and decompress audio data according to a defined audio file format or streaming audio format. Codecs implement algorithms that attempt to represent the audio signal with a minimum number of bits while retaining as high a fidelity as possible. The lossy compression techniques typically used in audio codecs rely on a psychoacoustic model of human hearing perception. The audio formats usually involve the use of a time/frequency domain transform (e.g., a modified discrete cosine transform, MDCT) and exploit masking effects, such as frequency masking or temporal masking, so that certain sounds, including any apparent quantization noise, are hidden or masked by the actual content.

Most audio coding systems are frame-based. Within a frame, audio codecs normally shape the coding noise in the frequency domain so that it becomes least audible. Several present digital audio formats utilize frames of such long duration that a frame may contain sounds of several different levels or intensities. Because the coding noise is usually stationary in level over the evolution of a frame, coding noise may be most audible during the low-intensity parts of the frame. This effect may manifest as pre-echo distortion, in which silence (or a low-level signal) preceding a high-intensity segment is swamped by noise in the decoded audio signal. The effect may be most noticeable in transient sounds or impulses from percussion instruments, such as castanets or other sharp percussive sound sources. Such distortion is typically caused by quantization noise that is introduced in the frequency domain and spread over the entire transform window of the codec in the time domain.

Present measures to avoid or minimize pre-echo artifacts include the use of filters. Such filters, however, introduce phase distortion and temporal smearing. Another possible solution is to use smaller transform windows, but this approach can significantly reduce frequency resolution.

The subject matter discussed in the Background section should not be assumed to be prior art merely because it is mentioned in the Background section. Similarly, a problem mentioned in the Background section, or associated with the subject matter of the Background section, should not be assumed to have been previously recognized in the prior art. The subject matter in the Background section merely represents different approaches, which may themselves be inventions.

Summary of the Invention

Embodiments are directed to a method of processing a received audio signal by expanding the audio signal to an expanded dynamic range through a process comprising: dividing the received audio signal into a plurality of time segments using a defined window shape, calculating a wideband gain for each time segment in the frequency domain using a non-energy-based average of a frequency domain representation of the audio signal, and applying the gain value to each time segment to obtain the expanded audio signal. The gain values of the wideband gain applied to each time segment are selected to have the effect of amplifying segments of relatively high intensity and attenuating segments of relatively low intensity. For this method, the received audio signal comprises an original audio signal that was compressed from an original dynamic range through a compression process comprising: dividing the original audio signal into a plurality of time segments using a defined window shape, calculating a wideband gain in the frequency domain using a non-energy-based average of frequency domain samples of the original audio signal, and applying the wideband gain to the original audio signal. In the compression process, the gain values of the wideband gain applied to each time segment are selected to have the effect of amplifying segments of relatively low intensity and attenuating segments of relatively high intensity. The expansion process is configured to substantially restore the dynamic range of the original audio signal, and the wideband gain of the expansion process may be substantially the inverse of the wideband gain of the compression process.

In a system that implements the method of processing a received audio signal by an expansion process, a filterbank component may be used to analyze the audio signal to obtain its frequency domain representation, and the defined window shape used for segmentation into the plurality of time segments may be the same as the prototype filter of the filterbank. Likewise, in a system that implements the method of processing an audio signal by a compression process, a filterbank component may be used to analyze the original audio signal to obtain its frequency domain representation, and the defined window shape used for segmentation into the plurality of time segments may be the same as the prototype filter of the filterbank. The filterbank in either case may be one of a QMF bank or a short-time Fourier transform. In this system, the received signal for the expansion process is obtained after the compressed signal has been modified by an audio encoder that generates a bitstream and a decoder that decodes the bitstream. The encoder and decoder may comprise at least part of a transform-based audio codec. The system may further comprise components that process control information that is received through the bitstream and that determines the activation state of the expansion process.

BRIEF DESCRIPTION OF THE DRAWINGS

In the following drawings, like reference numerals are used to refer to like elements. Although the following figures depict various examples, the one or more implementations are not limited to the examples depicted in the figures.

FIG. 1 illustrates a system for compressing and expanding an audio signal in a transform-based audio codec, under one embodiment.

FIG. 2A illustrates an audio signal divided into a plurality of short time segments, under one embodiment.

FIG. 2B illustrates the audio signal of FIG. 2A after a wideband gain is applied to each short time segment, under one embodiment.

FIG. 3A is a flowchart illustrating a method of compressing an audio signal, under one embodiment.

FIG. 3B is a flowchart illustrating a method of expanding an audio signal, under one embodiment.

FIG. 4 is a block diagram illustrating a system for compressing an audio signal, under one embodiment.

FIG. 5 is a block diagram illustrating a system for expanding an audio signal, under one embodiment.

FIG. 6 illustrates the division of an audio signal into a plurality of short time segments, under one embodiment.

DETAILED DESCRIPTION

Systems and methods are described for using companding techniques to achieve temporal noise shaping of quantization noise in an audio codec. The embodiments include the use of a companding algorithm implemented in the QMF domain to achieve temporal shaping of the quantization noise. The processes include encoder control of the desired decoder companding level, and an extension of companding beyond monophonic applications to stereo and multichannel companding.

Aspects of the one or more embodiments described herein may be implemented in an audio system that processes audio signals for transmission across a network that includes one or more computers or processing devices executing software instructions. Any of the described embodiments may be used alone or together with one another in any combination. Although various embodiments may have been motivated by various deficiencies of the prior art, which may be discussed or alluded to in one or more places in the specification, the embodiments do not necessarily address any of these deficiencies. In other words, different embodiments may address different deficiencies that may be discussed in the specification. Some embodiments may only partially address some deficiencies, or just one deficiency, that may be discussed in the specification, and some embodiments may not address any of these deficiencies.

FIG. 1 illustrates a companding system for reducing quantization noise in a codec-based audio processing system, under one embodiment. FIG. 1 illustrates an audio signal processing system that is built around an audio codec comprising an encoder (or "core encoder") 106 and a decoder (or "core decoder") 112. The encoder 106 encodes audio content into a data stream or signal for transmission over a network 110, where it is decoded by the decoder 112 for playback or further processing. In one embodiment, the encoder 106 and decoder 112 of the codec implement a lossy compression method to reduce the storage and/or data rate requirements of the digital audio data, and such a codec may be implemented as an MP3, Vorbis, Dolby Digital (AC-3), AAC, or similar codec. The lossy compression method of the codec creates coding noise that is generally stationary in level over the evolution of a frame defined by the codec. Such coding noise is often most audible during the low-intensity parts of a frame. System 100 includes components that reduce the perceived coding noise in existing coding systems by providing a compression pre-step component 104 prior to the core encoder 106 of the codec and an expansion post-step component 114 that operates on the output of the core decoder 112. The compression component 104 is configured to divide the original audio input signal 102 into a plurality of time segments using a defined window shape, and to calculate and apply a wideband gain in the frequency domain using a non-energy-based average of frequency domain samples of the original audio signal, where the gain values applied to each time segment amplify segments of relatively low intensity and attenuate segments of relatively high intensity. This gain modification has the effect of compressing, or significantly reducing, the original dynamic range of the input audio signal 102. The compressed audio signal is then coded in the encoder 106, transmitted over the network 110, and decoded in the decoder 112. The decoded compressed signal is input to the expansion component 114, which is configured to perform the inverse operation of the compression pre-step 104 by applying inverse gain values to each time segment, thereby expanding the dynamic range of the compressed audio signal back to the dynamic range of the original input audio signal 102. Thus, the audio output signal 116 comprises an audio signal having the original dynamic range, with the coding noise removed through the pre-step and post-step companding process.

As shown in FIG. 1, the compression component or compression pre-step 104 is configured to reduce the dynamic range of the audio signal 102 input to the core encoder 106. The input audio signal is divided into a number of short segments. The size or length of each short segment is a fraction of the frame size used by the core encoder 106. For example, a typical frame size of the core encoder may be on the order of 40 to 80 milliseconds. In this case, each short segment may be on the order of 1 to 3 milliseconds. The compression component 104 calculates an appropriate wideband gain value to compress the input audio signal on a per-segment basis. This is achieved by modifying each short segment of the signal by the appropriate gain value for that segment. Relatively large gain values are selected to amplify segments of relatively low intensity, and small gain values are selected to attenuate segments of high intensity.

FIG. 2A illustrates an audio signal divided into a plurality of short time segments, under one embodiment, and FIG. 2B illustrates the same audio signal after a wideband gain has been applied by the compression component. As shown in FIG. 2A, audio signal 202 represents a transient or sound impulse such as may be produced by a percussive instrument (e.g., castanets). The signal features a spike in amplitude, as shown in the plot of voltage V versus time t. In general, the amplitude of the signal is related to the acoustic energy or intensity of the sound and represents a measure of the sound's power at any point in time. When the audio signal 202 is processed through a frame-based audio codec, portions of the signal are processed within transform (e.g., MDCT) frames 204. Typical present digital audio systems utilize frames of relatively long duration, so that for sharp transients or short impulse sounds, a single frame may include sounds of low intensity as well as high intensity. Thus, as shown in FIG. 2A, the single MDCT frame 204 includes the impulse portion (peak) of the audio signal as well as a relatively large amount of low-intensity signal before and after the peak. In one embodiment, the compression component 104 divides the signal into a number of short time segments 206 and applies a wideband gain to each segment in order to compress the dynamic range of the signal 202. The number and size of the short segments may be selected based on application needs and system constraints. Relative to the size of an individual MDCT frame, the number of short segments may range from 12 to 64 segments and may typically be 32 segments, but embodiments are not so limited.

FIG. 2B illustrates the audio signal of FIG. 2A after a wideband gain has been applied over each of the short time segments, under one embodiment. As shown in FIG. 2B, audio signal 212 has the same relative shape as the original signal 202, but the amplitude of the low-intensity segments has been increased by application of amplifying gain values, and the amplitude of the high-intensity segments has been decreased by application of attenuating gain values.

The output of the core decoder 112 is the input audio signal with reduced dynamic range (e.g., signal 212) plus the quantization noise introduced by the core encoder 106. This quantization noise features an almost uniform level across time within each frame. The expansion component 114 acts on the decoded signal to restore the dynamic range of the original signal. It uses the same short time resolution, based on the short segment size 206, and inverts the gains applied in the compression component 104. Thus, the expansion component 114 applies a small gain (attenuation) to segments that had low intensity in the original signal and were amplified by the compressor, and applies a large gain (amplification) to segments that had high intensity in the original signal and were attenuated by the compressor. The quantization noise added by the core coder, which has a uniform time envelope, is thus concurrently shaped by the post-processor gain to approximately follow the temporal envelope of the original signal. This processing effectively renders the quantization noise less audible during quiet passages. Although the noise may be amplified during high-intensity passages, it remains less audible due to the masking effect of the loud signal of the audio content itself.
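The noise-shaping effect described above can be illustrated with a toy per-segment model. This is only a sketch under stated assumptions: the segment levels, the flat noise level, and the power-law gain with exponent 1/3 are hypothetical illustration values, and the expander is assumed to invert the compressor gain exactly.

```python
GAMMA = 1.0 / 3.0        # assumed companding exponent (illustrative value)
NOISE_LEVEL = 0.0005     # hypothetical flat codec noise level within one frame

# Hypothetical levels of the original signal per short segment:
# two quiet segments, the transient peak, and its decay.
original_levels = [0.001, 0.001, 1.0, 0.125]


def compressor_gain(level):
    """Large gain for quiet segments, small gain for loud segments."""
    return level ** -GAMMA


def noise_after_expansion(level):
    """Flat codec noise scaled by the expander gain (inverse of the compressor gain)."""
    return NOISE_LEVEL / compressor_gain(level)


# Noise envelope after expansion follows the signal envelope.
shaped_noise = [noise_after_expansion(level) for level in original_levels]

# Signal-to-noise ratio per segment without and with companding.
snr_plain = [level / NOISE_LEVEL for level in original_levels]
snr_companded = [level / n for level, n in zip(original_levels, shaped_noise)]
```

In this toy model the quiet segments gain a factor of ten in signal-to-noise ratio, while the loud segment is unchanged because its gain is one.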

As shown in FIG. 2A, the companding process modifies discrete segments of the audio signal individually, each with its own gain value. In certain cases, this can result in discontinuities at the output of the compression component that can cause problems in the core encoder 106. Likewise, discontinuities in gain at the expansion component 114 could result in discontinuities in the envelope of the shaped noise, which could cause audible clicks in the audio output 116. Another issue related to applying individual gain values to short segments of the audio signal is based on the fact that typical audio signals are a mixture of many individual sources. Some of these sources may be stationary across time, and some may be transients. Stationary signals are generally constant over time in their statistical parameters, whereas transient signals are generally not. Given the broadband nature of transients, their fingerprint in such a mixture is usually more visible at the higher frequencies. A gain calculation that is based on the short-term energy (RMS) of the signal tends to be biased towards the stronger low frequencies, and hence is dominated by the stationary sources and exhibits little variation across time. Thus, this energy-based approach is generally ineffective in shaping the noise introduced by the core encoder.

In one embodiment, system 100 calculates and applies the gain at the compression and expansion components in a filter bank with a short prototype filter, in order to resolve the potential issues associated with the application of individual gain values. The signal to be modified (the original signal at the compression component 104, and the output of the core decoder 112 at the expansion component 114) is first analyzed by the filter bank, and the wideband gain is applied directly in the frequency domain. The corresponding effect in the time domain is that the gain application is naturally smoothed according to the shape of the prototype filter. This resolves the discontinuity issues described above. The modified frequency domain signal is then converted back to the time domain via a corresponding synthesis filter bank. Analyzing the signal with a filter bank provides access to its spectral content and allows the calculation of a gain that preferentially boosts the contribution of the high frequencies (or of any spectral content that is weak), providing gain values that are not dominated by the strongest components in the signal. This resolves the problem, described above, associated with audio sources that comprise a mixture of different sources. In one embodiment, the system calculates the gain using a p-norm of the spectral magnitudes, where p is typically less than 2 (p<2). This enables more emphasis on the weak spectral content compared to when the gain is based on energy (p=2).
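The following sketch illustrates why a p-norm with p<2 reacts more strongly to weak spectral content than an energy-based measure. The 1/K normalization and the example magnitudes (one strong low-frequency band plus 63 weak bands) are assumptions chosen for illustration only.

```python
def p_norm_mean(magnitudes, p):
    """Normalized p-norm: ((1/K) * sum |x|^p)^(1/p)."""
    k = len(magnitudes)
    return (sum(abs(m) ** p for m in magnitudes) / k) ** (1.0 / p)


# Hypothetical time slot dominated by one strong stationary low-frequency band.
stationary = [1.0] + [0.01] * 63
# Same slot when a weak broadband transient triples the level of the weak bands.
with_transient = [1.0] + [0.03] * 63


def relative_change(p):
    """Relative increase of the slot measure caused by the transient."""
    before = p_norm_mean(stationary, p)
    after = p_norm_mean(with_transient, p)
    return (after - before) / before


change_energy = relative_change(2.0)  # energy-based (RMS, p = 2)
change_abs = relative_change(1.0)     # mean absolute level (p = 1)
```

With these numbers the RMS measure moves by only a few percent, whereas the mean absolute level moves by roughly 77 percent, so a gain derived from the 1-norm tracks the transient far more closely.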

As stated above, the system uses a prototype filter to smooth the gain application. In general, a prototype filter is the basic window shape in a filter bank, which is modulated by sinusoidal waveforms to obtain the impulse responses of the different subband filters in the filter bank. For example, a short-time Fourier transform (STFT) is a filter bank, and each frequency line of this transform is a subband of the filter bank. The short-time Fourier transform is implemented by multiplying the signal by a window shape (an N-sample window), which may be rectangular, Hann, Kaiser-Bessel derived (KBD), or some other shape. The windowed signal is then subjected to a discrete Fourier transform (DFT) operation to obtain the STFT. The window shape in this case is the prototype filter. The DFT is composed of sinusoidal basis functions, each of a different frequency. The window shape multiplied by a sinusoidal function then provides the filter for the subband corresponding to that frequency. Because the window shape is the same at all frequencies, it is referred to as a "prototype".
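The STFT view of a filter bank can be sketched as a window multiplication followed by a DFT. The block below is a minimal pure-Python illustration, not an efficient implementation; the window length of 32, the periodic Hann window, and the test tone at bin 5 are arbitrary choices made for the example.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n (the prototype filter in this sketch)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def stft_frame(samples, window):
    """One STFT frame: window the samples, then take a direct DFT."""
    n = len(window)
    windowed = [s * w for s, w in zip(samples, window)]
    return [
        sum(windowed[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
        for k in range(n)
    ]


N = 32
window = hann(N)
# Test tone centered exactly on frequency line 5.
tone = [math.cos(2.0 * math.pi * 5 * i / N) for i in range(N)]

spectrum = stft_frame(tone, window)
magnitudes = [abs(x) for x in spectrum[: N // 2 + 1]]
peak_bin = max(range(len(magnitudes)), key=magnitudes.__getitem__)
```

The tone appears in subband 5 and leaks only into the two neighboring subbands, which reflects the frequency response of the Hann prototype shifted to each subband's center frequency.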

In one embodiment, the system uses a QMF (quadrature modulated filter) bank for the filter bank. In a particular implementation, the QMF bank may have a 640-point window, which forms the prototype. This window, modulated by cosine and sine functions (corresponding to 64 equally spaced frequencies), forms the subband filters of the QMF bank. After each application of the QMF function, the window is moved by 64 samples; that is, the overlap between time segments in this case is 640-64=576 samples. However, although the window shape spans ten time segments in this case (640=10*64), the main lobe of the window (where its sample values are most significant) is about 128 samples long. Thus, the effective length of the window is still relatively short.

In one embodiment, the expansion component 114 ideally inverts the gains applied by the compression component 104. Although it is possible to transmit the gains applied by the compression component to the decoder through the bitstream, such an approach would typically consume a significant bit rate. In one embodiment, system 100 instead estimates the gains required by the expansion component 114 directly from the signal available to it, i.e., the output of the decoder 112, which effectively requires no additional bits. The filter banks at the compression and expansion components are selected to be identical, in order to calculate gains that are inverses of each other. In addition, these filter banks are time-synchronized so that any effective delay between the output of the compression component 104 and the input of the expansion component 114 is a multiple of the stride of the filter bank. If the core encoder-decoder were lossless and the filter bank provided perfect reconstruction, the gains at the compression and expansion components would be exact inverses of each other, thus allowing exact reconstruction of the original signal. In practice, however, the gain applied by the expansion component 114 is only a close approximation of the inverse of the gain applied by the compression component 104.

In one embodiment, the filter bank used in the compression and expansion components is a QMF bank. In a typical application, a core audio frame may be 4096 samples long and overlap the neighboring frame by 2048 samples. At 48 kHz, such a frame is 85.3 milliseconds long. In contrast, the QMF bank used may have a stride of 64 samples (1.3 milliseconds long), which provides a fine temporal resolution for the gains. Furthermore, the QMF has a smooth prototype filter of length 640 samples, ensuring that the gain application varies smoothly across time. Analysis with this QMF filter bank provides a time-frequency tiled representation of the signal. Each QMF time slot is equal to a stride, and in each QMF time slot there are 64 uniformly spaced subbands. Alternatively, other filter banks, such as a short-time Fourier transform (STFT), may be used and still yield such a time-frequency tiled representation.
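The figures quoted above can be checked with a few lines of arithmetic. The constants come from the text; the derived subband width assumes that the 64 subbands uniformly cover the range from 0 Hz to half the sampling rate.

```python
SAMPLE_RATE_HZ = 48_000
CORE_FRAME_SAMPLES = 4096
CORE_FRAME_OVERLAP = 2048
QMF_STRIDE = 64
QMF_PROTOTYPE_LENGTH = 640
QMF_SUBBANDS = 64

# Durations in milliseconds.
frame_ms = 1000.0 * CORE_FRAME_SAMPLES / SAMPLE_RATE_HZ
slot_ms = 1000.0 * QMF_STRIDE / SAMPLE_RATE_HZ

# QMF time slots per core frame and per frame advance (hop).
slots_per_frame = CORE_FRAME_SAMPLES // QMF_STRIDE
slots_per_hop = (CORE_FRAME_SAMPLES - CORE_FRAME_OVERLAP) // QMF_STRIDE

# Overlap between successive prototype windows, and strides spanned by one window.
window_overlap = QMF_PROTOTYPE_LENGTH - QMF_STRIDE
strides_per_window = QMF_PROTOTYPE_LENGTH // QMF_STRIDE

# Assumed uniform subband width over 0 Hz to Nyquist.
subband_width_hz = SAMPLE_RATE_HZ / 2 / QMF_SUBBANDS
```

Under these assumptions a core frame spans 64 time slots, and each frame advance of 2048 samples contributes 32 new time slots.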

In one embodiment, the compression component 104 performs a pre-processing step that scales the codec input. For this embodiment, $S_t(k)$ is a complex-valued filter bank sample at time slot t and frequency bin k. FIG. 6 illustrates the division of an audio signal into a number of time slots for a range of frequencies, under one embodiment. For the embodiment of diagram 600, there are 64 frequency bins k and 32 time slots t, which produce a plurality of time-frequency tiles as shown (though not necessarily drawn to scale). The compression pre-step scales the codec input to become $S'_t(k) = S_t(k)/g_t$. In this equation, $g_t = (\bar{S}_t / S_0)^{\gamma}$ is a normalized slot mean.

In the above equation, the expression $\bar{S}_t = \frac{1}{K} \sum_{k=1}^{K} |S_t(k)|$ is the mean absolute level (1-norm), and $S_0$ is a suitable constant. A generic p-norm is defined in this context as follows:

$$\bar{S}_t = \left( \frac{1}{K} \sum_{k=1}^{K} |S_t(k)|^p \right)^{1/p}$$

It has been shown that the 1-norm may give significantly better results than using the energy (RMS, i.e., the 2-norm). The value of the exponent term γ is typically in the range between 0 and 1, and may be chosen to be 1/3. The constant $S_0$ ensures reasonable gain values independent of the implementation platform. For instance, it may be 1 when implemented on a platform where all the $S_t(k)$ values may be limited in absolute value to 1. It could be different on a platform where $S_t(k)$ may have a different maximum absolute value. It could also be used to ensure that the mean gain value across a large set of signals is close to 1. That is, it could be an intermediate signal value between the maximum signal value and the minimum signal value determined from a large corpus of content.
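As a minimal sketch of the compression pre-step, the block below scales one time slot of complex subband samples by a gain computed as a power γ of the slot mean normalized by the constant S_0. The example slots, the value S_0 = 0.125, and the small floor that guards against division by zero are assumptions made for illustration and are not taken from the text.

```python
def slot_mean(slot, p=1.0):
    """Normalized p-norm of the subband magnitudes in one time slot."""
    return (sum(abs(s) ** p for s in slot) / len(slot)) ** (1.0 / p)


def compress_slot(slot, gamma=1.0 / 3.0, s0=0.125, p=1.0, floor=1e-9):
    """Return (scaled samples, gain), where each sample is divided by the gain."""
    mean = max(slot_mean(slot, p), floor)  # floor is an assumed guard for silent slots
    gain = (mean / s0) ** gamma
    return [s / gain for s in slot], gain


# Hypothetical slots: eight subbands each, one quiet and one loud.
quiet_slot = [0.001 + 0j] * 8
loud_slot = [1j] * 8

quiet_out, quiet_gain = compress_slot(quiet_slot)
loud_out, loud_gain = compress_slot(loud_slot)

# Level ratio between the loud and quiet slots before and after compression.
range_before = abs(loud_slot[0]) / abs(quiet_slot[0])
range_after = abs(loud_out[0]) / abs(quiet_out[0])
```

Dividing by a gain below one amplifies the quiet slot and dividing by a gain above one attenuates the loud slot, so the level ratio of 1000 shrinks to about 100 in this example.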

In the post-processing step performed by the expansion component 114, the codec output is expanded by the inverse of the gain applied by the compression component 104. This requires an exact or near-exact replica of the filter bank of the compression component. In this case, $\tilde{S}_t(k)$ represents a complex-valued sample of this second filter bank. The expansion component 114 scales the codec output to become

$$\tilde{S}'_t(k) = \tilde{S}_t(k) \cdot \tilde{g}_t$$

In the above equation, $\tilde{g}_t$ is a normalized slot mean, given by:

$$\tilde{g}_t = \left(\bar{\tilde{S}}_t / S_0\right)^{\gamma/(1-\gamma)}$$

and

$$\bar{\tilde{S}}_t = \frac{1}{K} \sum_{k=1}^{K} |\tilde{S}_t(k)|$$

In general, the expansion component 114 will use the same p-norm as is used in the compression component 104. Thus, if the mean absolute level is used to define $\bar{S}_t$ in the compression component 104, $\bar{\tilde{S}}_t$ in the above equation is also defined using the 1-norm (p = 1).
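The relationship between the two gains can be checked with a short round-trip sketch. It assumes a compressor gain of (mean level / S_0)^γ and an expander gain of (mean level of the compressed slot / S_0)^(γ/(1-γ)); the latter exponent follows algebraically from inverting the former when both sides use the same norm. The function names and the `eps` floor are hypothetical, and the core codec is omitted, so the round trip here is lossless.

```python
def mean_level(slot, p=1, eps=1e-12):
    """p-norm mean of the sample magnitudes in one slot (floored at eps)."""
    k = len(slot)
    return max((sum(abs(s) ** p for s in slot) / k) ** (1.0 / p), eps)


def compress(slot, gamma=1/3, s0=1.0):
    g = (mean_level(slot) / s0) ** gamma
    return [s / g for s in slot]


def expand(slot, gamma=1/3, s0=1.0):
    # The exponent gamma / (1 - gamma) recovers the compressor gain from the
    # mean level of the already compressed slot.
    g = (mean_level(slot) / s0) ** (gamma / (1.0 - gamma))
    return [s * g for s in slot]


original = [0.02 + 0.01j, -0.03j, 0.015 + 0j, 0.005 - 0.02j]
restored = expand(compress(original))
max_error = max(abs(a - b) for a, b in zip(original, restored))
```

In a real system the codec sits between `compress` and `expand`, so the expander sees a noisy version of the compressed slot and the recovered gain is only approximately the inverse.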

When a complex filter bank (comprising both cosine and sine basis functions), such as the STFT or the complex QMF, is used in the compression and expansion components, calculating the magnitude |S_t(k)| of a complex subband sample requires a computationally intensive square-root operation. This can be circumvented by approximating the magnitude of the complex subband sample in a variety of ways, for example by summing its real and imaginary parts.
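As a rough illustration of such an approximation, the sketch below compares the exact magnitude with a square-root-free estimate. It assumes that the absolute values of the real and imaginary parts are summed, since a plain signed sum could cancel to zero; that reading, and the function names, are assumptions.

```python
def magnitude_exact(s):
    """Exact magnitude sqrt(re^2 + im^2) of a complex sample."""
    return abs(s)


def magnitude_approx(s):
    """Square-root-free estimate |re| + |im| (assumed form of the approximation)."""
    return abs(s.real) + abs(s.imag)


samples = [3 + 4j, 1 + 0j, -2 + 2j]
ratios = [magnitude_approx(s) / magnitude_exact(s) for s in samples]
```

The estimate never falls below the true magnitude and exceeds it by at most a factor of sqrt(2). If the compressor and expander use the same approximation, this bias affects both gains in the same way.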

In the above equations, the value K is equal to or less than the number of subbands in the filter bank. In general, the p-norm can be calculated using any subset of the subbands in the filter bank. However, the same subset should be used at both the encoder 106 and the decoder 112. In one embodiment, the high-frequency portion of the audio signal (e.g., audio components above 6 kHz) may be coded with an advanced spectral extension (A-SPX) tool. In addition, it may be desirable to use only the signal above 1 kHz (or a similar frequency) to guide the noise shaping. In such a case, only the subbands in the range of 1 kHz to 6 kHz may be used to calculate the p-norm, and hence the gain value. Furthermore, although the gain is calculated from one subset of subbands, it can still be applied to a different, and possibly larger, subset of subbands.
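The subset selection can be sketched as follows. The sketch assumes a 48 kHz sampling rate and 64 uniformly spaced bands (375 Hz each), and it selects a band when its center frequency lies in the requested range; the sampling rate, the center-frequency rule, and the function names are illustrative assumptions.

```python
def band_subset(f_lo, f_hi, sample_rate=48000.0, num_bands=64):
    """Indices of uniform bands whose center frequency lies in [f_lo, f_hi)."""
    width = (sample_rate / 2.0) / num_bands
    return [k for k in range(num_bands) if f_lo <= (k + 0.5) * width < f_hi]


def gain_from_subset(slot, bands, gamma=1/3, s0=1.0):
    """Gain computed from the selected bands only (1-norm mean)."""
    mean_level = sum(abs(slot[k]) for k in bands) / len(bands)
    return (max(mean_level, 1e-12) / s0) ** gamma


bands = band_subset(1000.0, 6000.0)      # bands used to derive the gain
slot = [0.1 + 0j] * 64                   # one time slot, all 64 bands
g = gain_from_subset(slot, bands)
compressed = [s / g for s in slot]       # gain applied to a larger set: all bands
```

Under these assumptions the gain is derived from bands 3 through 15 and then applied to every band in the slot, which mirrors the point that the measurement subset and the application subset need not coincide.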

As shown in FIG. 1, the companding function for shaping the quantization noise introduced by the core encoder 106 of an audio codec is performed by two separate components 104 and 114, which perform certain pre-encoder compression functions and post-decoder expansion functions, respectively. FIG. 3A is a flowchart illustrating a method of compressing an audio signal in a pre-encoder compression component, under one embodiment, and FIG. 3B is a flowchart illustrating a method of expanding an audio signal in a post-decoder expansion component, under one embodiment.

As shown in FIG. 3A, process 300 begins with the compression component receiving the input audio signal (302). The component then divides the audio signal into short time segments (304) and compresses the audio signal to a reduced dynamic range by applying a broadband gain value to each of the short segments (306). The compression component also implements certain prototype filtering and QMF filter bank components to reduce or eliminate any discontinuities caused by applying different gain values to consecutive segments, as described above (308). In certain cases, such as depending on the type of audio content or certain characteristics of the audio content, compressing and expanding the audio signal before and after the encoding/decoding stages of the audio codec may degrade rather than enhance the output audio quality. In such instances, the companding process may be turned off or modified to apply a different level of companding (compression/expansion). Thus, the compression component determines, among other variables, the appropriateness of the companding function and/or the optimum level of companding required for the specific signal input and audio playback environment (310). This determination step 310 may occur at any practical point of process 300, such as prior to the division of the audio signal 304 or the compression of the audio signal 306. If companding is deemed appropriate, the gains are applied (306), and the encoder then encodes the signal for transmission to the decoder in accordance with the data format of the codec (312). Certain companding control data, such as activation data, synchronization data, companding level data, and other similar control data, may be transmitted as part of the bitstream for processing by the expansion component.

FIG. 3B is a flowchart illustrating a method of expanding an audio signal in a post-decoder expansion component, under one embodiment. As shown in process 350, the decoder stage of the codec receives the bitstream encoding the audio signal from the encoder stage (352). The decoder then decodes the encoded signal in accordance with the codec data format (353). The expansion component then processes the bitstream and applies any encoded control data to switch off expansion or modify the expansion parameters based on the control data (354). The expansion component divides the audio signal into time segments using a suitable window shape (356). In one embodiment, the time segments correspond to the same time segments used by the compression component. The expansion component then calculates the appropriate gain value for each segment in the frequency domain (358) and applies the gain values to each time segment to expand the dynamic range of the audio signal back to the original dynamic range, or to any other appropriate dynamic range (360).

Companding Control

The compression and expansion components that make up the compander of system 100 can be configured to apply the pre-processing and post-processing steps only at certain times during audio signal processing, or only for certain types of audio content. For example, companding may exhibit benefits for speech and musical transient signals. However, for other signals, such as stationary signals, companding may degrade the signal quality. Thus, as shown in FIG. 3A, a companding control mechanism is provided as block 310, and control data is transmitted from the compression component 104 to the expansion component 114 to coordinate the companding operation. The simplest form of such a control mechanism is to switch off the companding function for blocks of audio samples where applying companding would degrade the audio quality. In one embodiment, the companding on/off decision is detected in the encoder and transmitted to the decoder as a bitstream element, so that the compressor and the expander can be switched on or off at the same QMF time slot.

Switching between the two states will usually lead to a discontinuity in the applied gain, resulting in audible switching artifacts or clicks. Embodiments include mechanisms to reduce or eliminate these artifacts. In a first embodiment, the system allows the companding function to be switched off and on only at frames where the gain is close to 1. In this case, there is only a small discontinuity between switching the companding function on and off. In a second embodiment, a third, weak companding mode, intermediate between the on and off modes, is applied in the audio frames between the on and off frames and is signaled in the bitstream. The weak companding mode slowly transitions the exponent γ from its default value during companding to 0, which is equivalent to no companding. As an alternative to the intermediate weak companding mode, the system may implement start frames and stop frames that smoothly fade into the no-companding mode over a block of audio samples, instead of switching off the companding function abruptly. In a further embodiment, the system is configured not to simply switch off the companding, but rather to apply an average gain. In certain cases, the audio quality of tonal-stationary signals can be improved if a constant gain factor is applied to an audio frame that more closely resembles the gain factors of adjacent companding-on frames than the constant gain factor of 1.0 of the companding-off case. Such a gain factor can be calculated by averaging all companding gains over one frame. A frame containing a constant average companding gain is thus signaled in the bitstream.
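Two of the transition mechanisms above lend themselves to a short sketch: the ramp of the exponent γ used by a weak companding mode, and the constant average gain for a frame. The linear shape of the ramp is an assumption (the text only says the exponent transitions slowly to 0), and the function names are hypothetical.

```python
def gamma_ramp(gamma_on, num_slots):
    """Ramp the exponent linearly from just below gamma_on down to 0."""
    return [gamma_on * (1 - (t + 1) / num_slots) for t in range(num_slots)]


def average_gain(slot_gains):
    """Constant gain for a frame: the mean of its per-slot companding gains."""
    return sum(slot_gains) / len(slot_gains)


ramp = gamma_ramp(1/3, 4)                    # exponent per slot in a transition frame
avg = average_gain([0.8, 0.9, 1.0, 0.9])     # one constant gain for the whole frame
```

With γ reaching 0 at the end of the ramp, the per-slot gain becomes 1 regardless of signal level, so the following companding-off frame starts without a gain step.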

Although embodiments are described in the context of a mono audio channel, it should be noted that, in a straightforward extension, multiple channels can be handled by repeating the approach individually on each channel. However, audio signals that comprise two or more channels present certain additional complexities, which are addressed by embodiments of the companding system of FIG. 1. The companding strategy should depend on the similarity between the channels.

For example, in the case of stereo-panned transient signals, it has been observed that independent companding of the individual channels may result in audible image artifacts. In one embodiment, the system determines a single gain value for each time segment from the subband samples of both channels and uses the same gain value to compress/expand the two signals. This approach is generally suitable whenever the two channels have very similar signals, where similarity is defined, for example, using cross-correlation. A detector calculates the similarity between the channels and switches between individual companding of the channels and joint companding of the channels. The extension to more channels can be made by dividing the channels into groups of channels using a similarity criterion and applying joint companding to the groups. This grouping information can then be transmitted through the bitstream.
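A possible shape for such a similarity detector is sketched below. It uses a normalized cross-correlation of the subband samples of two channels and, when the channels are similar, a single gain derived from both. The 0.9 threshold, the function names, and the choice to pool the samples of both channels for the shared gain are illustrative assumptions, not values from the text.

```python
def normalized_correlation(a, b):
    """Magnitude of the normalized cross-correlation of two complex sequences."""
    num = abs(sum(x * y.conjugate() for x, y in zip(a, b)))
    den = (sum(abs(x) ** 2 for x in a) * sum(abs(y) ** 2 for y in b)) ** 0.5
    return num / den if den > 0 else 0.0


def choose_mode(left, right, threshold=0.9):
    """Pick joint companding for similar channels, independent otherwise."""
    similar = normalized_correlation(left, right) >= threshold
    return "joint" if similar else "independent"


def shared_gain(left, right, gamma=1/3, s0=1.0):
    """Single gain for both channels, from the pooled 1-norm mean."""
    samples = list(left) + list(right)
    mean_level = sum(abs(s) for s in samples) / len(samples)
    return (max(mean_level, 1e-12) / s0) ** gamma


left = [1 + 0j, 0.5j, -0.25 + 0j]
panned = [0.5 * s for s in left]            # same signal at a lower level
unrelated = [0.5j, 1 + 0j, 0.25j]
mode_panned = choose_mode(left, panned)
mode_unrelated = choose_mode(left, unrelated)
g_shared = shared_gain(left, panned)
```

A level-scaled copy of a channel correlates perfectly with it, so a panned transient would be companded with one shared gain and its stereo image would be left intact.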

System Implementation

FIG. 4 is a block diagram illustrating a system for compressing an audio signal in conjunction with the encoder stage of a codec, under one embodiment. FIG. 4 illustrates a hardware circuit or system that implements at least a portion of the compression method for use in the codec-based system shown in FIG. 3A. As shown in system 400, an input audio signal 401 in the time domain is input to a QMF filter bank 402. This filter bank performs an analysis operation that separates the input signal into multiple components, in which each bandpass filter carries a frequency subband of the original signal. Reconstruction of the signal is performed in a synthesis operation performed by QMF filter bank 410. In the exemplary embodiment of FIG. 4, both the analysis and synthesis filter banks handle 64 bands. The core encoder 412 receives the audio signal from the synthesis filter bank 410 and generates a bitstream 414 by encoding the audio signal in an appropriate digital format (e.g., MP3, AAC, etc.).

System 400 includes a compressor 406 that applies gain values to each of the short segments into which the audio signal has been divided. This produces an audio signal with a compressed dynamic range, such as that shown in FIG. 2B. A companding control unit 404 analyzes the audio signal to determine whether compression should be applied, or how much, based on the type of signal (e.g., speech), the characteristics of the signal (e.g., stationary versus transient), or other relevant parameters. The control unit 404 may include a detection mechanism to detect the temporal peakness characteristic of the audio signal. Based on the detected characteristic of the audio signal and certain predefined criteria, the control unit 404 sends appropriate control signals to the compressor 406 to either turn off the compression function or modify the gain values applied to the short segments.

In addition to companding, many other coding tools can also operate in the QMF domain. One such tool is A-SPX (advanced spectral extension), which is shown in block 408 of FIG. 4. A-SPX is a technique that allows perceptually less important frequencies to be coded with a coarser coding scheme than perceptually more important frequencies. For example, in A-SPX at the decoder end, the QMF subband samples from the lower frequencies may be replicated at the higher frequencies, and the spectral envelope in the high-frequency band is then shaped using side information transmitted from the encoder to the decoder.

In a system where both companding and A-SPX are performed in the QMF domain, as shown in FIG. 4, at the encoder the A-SPX envelope data for the higher frequencies may be extracted from the subband samples before they are compressed, and compression may be applied only to the lower-frequency QMF samples that correspond to the frequency range of the signal encoded by the core encoder 412. At the decoder 502 of FIG. 5, after QMF analysis 504 of the decoded signal, the expansion process 506 is applied first, and the A-SPX operation 508 subsequently reproduces the higher subband samples from the expanded signal in the lower frequencies.

In this exemplary implementation, the QMF synthesis filter bank 410 at the encoder and the QMF analysis filter bank 504 at the decoder together introduce a delay of 640 - 64 + 1 samples (approximately 9 QMF slots). The core codec delay in this example is 3200 samples (50 QMF slots), so the total delay is 59 slots. This delay is accounted for by embedding control data in the bitstream and using it at the decoder, so that the encoder compressor and the decoder expander operate in sync.
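The delay figures above can be verified with a few lines of arithmetic. The sketch assumes 64 samples per QMF slot (one per band) and reads the term 640 in the expression as a filter length; the constant names are hypothetical.

```python
QMF_BANDS = 64            # samples per QMF time slot (assumed: one per band)
FILTER_LENGTH = 640       # the "640" in the delay expression (assumed meaning)
CORE_CODEC_DELAY = 3200   # core codec delay in samples, from the example

filterbank_delay = FILTER_LENGTH - QMF_BANDS + 1           # in samples
filterbank_slots = round(filterbank_delay / QMF_BANDS)     # approximately 9 slots
core_slots = CORE_CODEC_DELAY // QMF_BANDS                 # exactly 50 slots
total_slots = filterbank_slots + core_slots
```

The filter bank pair contributes 577 samples, which is just over 9 slots, so the control data for a given slot has to be applied 59 slots later at the expander.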

Alternatively, at the encoder, compression may be applied over the entire bandwidth of the original signal. The A-SPX envelope data may subsequently be extracted from the compressed subband samples. In such a case, after QMF analysis, the decoder first runs the A-SPX tool to reconstruct the full-bandwidth compressed signal. The expansion stage is then applied to recover the signal with its original dynamic range.

Another tool that can operate in the QMF domain in FIG. 4 is an advanced coupling (AC) tool (not shown). In an advanced coupling system, two channels are encoded as a mono downmix with additional parametric spatial information that can be applied in the QMF domain at the decoder to reconstruct the stereo output. When AC and companding are used in conjunction with each other, the AC tool can be placed after the compression stage 406 at the encoder, in which case it is applied before the expansion stage 506 at the decoder. Alternatively, the AC side information can be extracted from the uncompressed stereo signal, in which case the AC tool operates after the expansion stage 506 at the decoder. A hybrid AC mode may also be supported, in which AC is used above a certain frequency and discrete stereo is used below that frequency, or alternatively, discrete stereo is used above a certain frequency and AC is used below that frequency.

As shown in FIGS. 3A and 3B, the bitstream transmitted between the encoder stage and the decoder stage of the codec contains certain control data. Such control data constitutes side information that allows the system to switch between different companding modes. The switching control data (for switching companding on or off), possibly together with some intermediate states, may add on the order of 1 or 2 bits per channel. Other control data can include a signal that determines whether all the channels of a discrete stereo or multichannel configuration use common companding gain factors, or whether the gain factors should be calculated independently for each channel. Such data may require only a single extra bit per channel. Other similar control data elements and their appropriate bit weights may be used depending on system requirements and constraints.

Detection Mechanism

In one embodiment, a companding control mechanism is included as part of the compression component 104 to provide control of the companding in the QMF domain. Companding control can be configured based on a number of factors, such as the audio signal type. For example, in most applications, companding should be turned on for speech signals and transient signals, or for any other signals within the class of temporally peaky signals. The system includes a detection mechanism that detects the peakness of a signal in order to help generate an appropriate control signal for the compander function.

In one embodiment, for a given core codec, a measure of temporal peakness TP(k)_frame is computed over frequency bin k, and is calculated using the following formula:

In the above equation, S_t(k) is the subband signal, and T is the number of QMF slots corresponding to one core encoder frame. In an exemplary implementation, the value of T may be 32. The temporal peakness computed per band can be used to classify the sound content into two general categories: stationary music signals, and musical transient signals or speech signals. If the value of TP(k)_frame is less than a defined value (e.g., 1.2), the signal in that subband of the frame is likely to be a stationary music signal. If the value of TP(k)_frame is greater than this value, the signal is likely to be a musical transient signal or a speech signal. If the value is greater than an even higher threshold (e.g., 1.6), the signal is very likely to be a pure musical transient signal, such as castanets. Furthermore, it has been observed that, for naturally occurring signals, the values of temporal peakness obtained in different bands are more or less similar, and this characteristic can be used to reduce the number of subbands for which the temporal peakness value is calculated. Based on this observation, the system may implement one of the following two detectors.
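The formula for TP(k)_frame does not appear in the text above, so the sketch below uses an assumed stand-in: the square root of the mean fourth power of the subband magnitudes divided by their mean square, taken over the T slots of one frame. This particular ratio is an assumption chosen because it equals 1 for a constant-level band and grows as the energy concentrates in fewer slots; it should not be read as the formula of the disclosure. The function name is hypothetical.

```python
def temporal_peakness(band_samples):
    """Assumed peakness measure: sqrt(mean |S|^4) / mean |S|^2 over one frame."""
    t = len(band_samples)
    m2 = sum(abs(s) ** 2 for s in band_samples) / t
    m4 = sum(abs(s) ** 4 for s in band_samples) / t
    # A silent band is treated as stationary (assumption).
    return (m4 ** 0.5) / m2 if m2 > 0 else 1.0


T = 32                                   # QMF slots per core encoder frame
steady = [1 + 0j] * T                    # constant level in this band
click = [0j] * (T - 1) + [1 + 0j]        # all energy in a single slot
tp_steady = temporal_peakness(steady)
tp_click = temporal_peakness(click)
```

Under this assumed measure the steady band sits below the 1.2 threshold and the single-slot click lies far above 1.6, which matches the qualitative classification described above.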

In a first embodiment, the detector executes the following process. As a first step, it counts the number of bands that have a temporal peakness greater than 1.6. As a second step, it then computes the mean of the temporal peakness values of the bands whose temporal peakness is less than 1.6. If the number of bands found in the first step is greater than 51, or if the mean value determined in the second step is greater than 1.45, the signal is determined to be a musical transient signal, and hence companding should be switched on. Otherwise, it is determined to be a signal for which companding should not be switched on. Such a detector will be switched off most of the time for speech signals. In some embodiments, speech signals will usually be coded by a separate speech coder, so this is generally not a problem. However, in certain cases, it may be desirable to switch on the companding function for speech as well. In this case, a second type of detector may be preferable.
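The two-step rule of this first detector can be written compactly. The sketch takes the per-band temporal peakness values as input and uses the thresholds quoted above (1.6, 51 bands, 1.45); the function name and the handling of an empty set of low-peakness bands are assumptions.

```python
def detector_type1(tp_values, high=1.6, count_limit=51, mean_limit=1.45):
    """Return True when companding should be switched on for this frame."""
    # Step 1: number of bands with peakness above the high threshold.
    peaky_count = sum(1 for tp in tp_values if tp > high)
    # Step 2: mean peakness of the bands below the high threshold.
    rest = [tp for tp in tp_values if tp < high]
    mean_rest = sum(rest) / len(rest) if rest else 0.0   # assumed fallback
    return peaky_count > count_limit or mean_rest > mean_limit


transient_frame = [1.8] * 60 + [1.1] * 4    # most bands very peaky
moderate_frame = [1.5] * 64                 # no band above 1.6, but a high mean
stationary_frame = [1.05] * 64
decisions = [detector_type1(f)
             for f in (transient_frame, moderate_frame, stationary_frame)]
```

Either condition alone is enough to switch companding on, so a frame with uniformly elevated peakness is caught even when no single band exceeds 1.6.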

In one embodiment, this second type of detector executes the following process. As a first step, it counts the number of bands that have a temporal peakness greater than 1.2. As a second step, it then computes the mean of the temporal peakness values of the bands whose temporal peakness is less than 1.2. It then applies the following rules: if the result of the first step is greater than 55, companding is turned on; if the result of the first step is less than 15, companding is turned off; if the result of the first step lies between 15 and 55 and the result of the second step is greater than 1.16, companding is turned on; and if the result of the first step lies between 15 and 55 and the result of the second step is less than 1.16, companding is turned off. It should be noted that these two types of detectors are only two examples of the many possible detector algorithms, and other similar algorithms may also or alternatively be used.
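The rule table of the second detector can be sketched in the same style. The thresholds (1.2, 55, 15, 1.16) come from the text; the function name is hypothetical, and the treatment of the boundary cases the text leaves open (a mean exactly equal to 1.16, or no band below 1.2) is an assumption that resolves to "off".

```python
def detector_type2(tp_values, low=1.2, on_count=55, off_count=15, mean_limit=1.16):
    """Return True when companding should be switched on for this frame."""
    # Step 1: number of bands with peakness above the low threshold.
    count = sum(1 for tp in tp_values if tp > low)
    # Step 2: mean peakness of the bands below the low threshold.
    rest = [tp for tp in tp_values if tp < low]
    mean_rest = sum(rest) / len(rest) if rest else 0.0   # assumed fallback
    if count > on_count:
        return True
    if count < off_count:
        return False
    # Intermediate band count: the mean of the remaining bands decides.
    return mean_rest > mean_limit


frames = {
    "many_peaky": [1.3] * 60 + [1.0] * 4,
    "few_peaky": [1.3] * 10 + [1.0] * 54,
    "middle_high_mean": [1.3] * 30 + [1.18] * 34,
    "middle_low_mean": [1.3] * 30 + [1.1] * 34,
}
decisions = {name: detector_type2(tp) for name, tp in frames.items()}
```

Because the counting threshold is lower than in the first detector, moderately peaky content such as speech is more likely to switch companding on.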

The companding control function provided by element 404 of FIG. 4 may be implemented in any appropriate manner that allows companding to be used or not used based on certain operational modes. For example, companding is generally not used on the LFE (low frequency effects) channel of a surround sound system, and it is also not used when no A-SPX (i.e., no QMF) functionality is implemented. In one embodiment, the companding control function may be provided by a circuit, or by a program executed by a processor-based element, such as companding control element 404. The following is some exemplary syntax of a program segment that can implement companding control, under one embodiment:

The sync_flag, b_compand_on[ch], and b_compand_avg flags or program elements may be on the order of 1 bit long, or any other length depending on system constraints and requirements. It should be noted that the program code illustrated above is an example of one way of implementing a companding control function, and other programs or hardware components may be used to implement companding control according to some embodiments.
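To make the role of the three flags concrete, the following sketch writes and reads them as single bits. Only the element names sync_flag, b_compand_on[ch], and b_compand_avg come from the text. The ordering, the conditions under which each bit is present, the one-bit widths, and both function names are hypothetical and serve purely as an illustration of how such control data could be carried.

```python
def write_companding_control(num_channels, sync_flag, compand_on, compand_avg):
    """Serialize the control flags to a list of bits (hypothetical layout)."""
    bits = []
    if num_channels > 1:
        bits.append(int(sync_flag))       # shared decision across channels?
    else:
        sync_flag = False
    count = 1 if sync_flag else num_channels
    for ch in range(count):
        bits.append(int(compand_on[ch]))  # b_compand_on[ch]
    if not all(compand_on[:count]):
        bits.append(int(compand_avg))     # only needed when some channel is off
    return bits


def read_companding_control(bits, num_channels):
    """Parse the bit list produced by write_companding_control."""
    it = iter(bits)
    sync_flag = bool(next(it)) if num_channels > 1 else False
    count = 1 if sync_flag else num_channels
    compand_on = [bool(next(it)) for _ in range(count)]
    compand_avg = bool(next(it)) if not all(compand_on) else False
    return sync_flag, compand_on, compand_avg


bits = write_companding_control(2, True, [False], True)
parsed = read_companding_control(bits, 2)
```

In this layout a synchronized stereo pair with companding off and average gain selected costs three bits, which is in line with the one or two bits per channel mentioned earlier.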

Although the embodiments described so far include the companding process for reducing quantization noise introduced by an encoder in a codec, it should be noted that aspects of such a companding process may also be applied in signal processing systems that do not include encoder and decoder (codec) stages. Furthermore, if the companding process is used in conjunction with a codec, the codec may be transform-based or non-transform-based.

Aspects of the systems described herein may be implemented in an appropriate computer-based sound processing network environment for processing digital or digitized audio files. Portions of the adaptive audio system may include one or more networks that comprise any desired number of individual machines, including one or more routers (not shown) that serve to buffer and route the data transmitted among the computers. Such a network may be built on various different network protocols, and may be the Internet, a wide area network (WAN), a local area network (LAN), or any combination thereof.

One or more of the components, blocks, processes, or other functional components may be implemented through a computer program that controls execution of a processor-based computing device of the system. It should also be noted that the various functions disclosed herein may be described using any number of combinations of hardware, firmware, and/or data and/or instructions embodied in various machine-readable or computer-readable media, in terms of their behavioral, register transfer, logic component, and/or other characteristics. Computer-readable media in which such formatted data and/or instructions may be embodied include, but are not limited to, physical (non-transitory), non-volatile storage media in various forms, such as optical, magnetic, or semiconductor storage media.

Unless the context clearly requires otherwise, throughout the description and the claims, the words "comprise," "comprising," and the like are to be construed in an inclusive sense, as opposed to an exclusive or exhaustive sense; that is to say, in the sense of "including, but not limited to." Words using the singular or plural number also include the plural or singular number, respectively. Additionally, the words "herein," "hereunder," "above," "below," and words of similar import refer to this application as a whole and not to any particular portions of this application. When the word "or" is used in reference to a list of two or more items, that word covers all of the following interpretations of the word: any of the items in the list, all of the items in the list, and any combination of the items in the list.

While one or more implementations have been described by way of example and in terms of specific embodiments, it is to be understood that the one or more implementations are not limited to the disclosed embodiments. To the contrary, they are intended to cover various modifications and similar arrangements as would be apparent to those skilled in the art. Therefore, the scope of the appended claims should be accorded the broadest interpretation so as to encompass all such modifications and similar arrangements.

Claims

1. A method for processing audio signals, comprising:

receiving an audio signal comprising a plurality of time segments;

determining a corresponding broadband gain for each time segment of the audio signal,

wherein the broadband gain is in the frequency domain and is based on a p-norm of the spectral amplitudes of each time segment of a frequency-domain representation of the audio signal, wherein the p-norm value is chosen to emphasize weak spectral content of the audio signal relative to strong spectral content of the audio signal, and wherein the value of p in the p-norm is less than 2; and

applying the corresponding broadband gain to each time segment to obtain an expanded audio signal,

wherein applying the broadband gain amplifies relatively high-intensity time segments and attenuates relatively low-intensity time segments, and

wherein each corresponding broadband gain for each time segment is calculated using subband samples from a subset of the subbands in the corresponding time segment.

2. The method of claim 1, wherein a first filter bank is used to analyze the audio signal to obtain a frequency domain representation.

3. The method of claim 2, wherein the first filter bank is one of a quadrature modulation filter (QMF) bank and a short-time Fourier transform.

4. A non-transitory computer-readable medium containing instructions that, when executed by one or more processors, perform the method of claim 1.

5. An apparatus for processing audio signals, comprising:

a first interface for receiving a compressed audio signal comprising a plurality of time segments; and

an expander that expands the compressed audio signal, the expanding including determining a corresponding broadband gain for each of the plurality of time segments and applying the corresponding broadband gain to each of the plurality of time segments to amplify relatively high-intensity time segments and attenuate relatively low-intensity time segments, wherein the broadband gain is in the frequency domain and is based on a p-norm of the spectral amplitudes of each of the plurality of time segments of a frequency-domain representation of the initial audio signal, wherein the value of p in the p-norm is less than 2, and wherein the p-norm value is selected to emphasize weak spectral content of the audio signal relative to strong spectral content of the audio signal,

wherein each corresponding broadband gain for each time segment is calculated using subband samples from a subset of the subbands in each corresponding time segment.

6. The apparatus of claim 5, further comprising a first filter bank for analyzing the audio signal to obtain a frequency domain representation, wherein the first filter bank is one of a quadrature modulation filter (QMF) bank and a short-time Fourier transform.