HK40103944A - Method and device for unified time-domain / frequency domain coding of a sound signal - Google Patents
- Publication number
- HK40103944A HK62024092251.3A
- Authority
- HK
- Hong Kong
- Prior art keywords
- frequency
- domain
- time
- audio signal
- coding
Description
Technical Field
The present disclosure relates to a unified time-domain/frequency-domain coding device and method that code an input sound signal using a mixed time-domain and frequency-domain coding mode, and to a corresponding decoder device and decoding method.
In the present disclosure and the appended claims:
The term "sound" may relate to speech, to generic audio signals such as music and reverberant speech, and to any other sound.
Background
State-of-the-art conversational codecs can represent clean speech signals with very good quality at a bit rate of around 8 kbps and approach transparency at a bit rate of 16 kbps. However, at bit rates below 16 kbps, low-processing-delay conversational codecs, which most often code the input speech signal in the time domain, are not suitable for generic audio signals such as music and reverberant speech. To overcome this drawback, switched codecs have been introduced; they basically use a time-domain approach to code speech-dominated input sound signals and a frequency-domain approach to code generic audio signals. However, such switched solutions typically require a longer processing delay, which is needed both for the speech-music classification and for computing the transform to the frequency domain.
To overcome the above drawback related to the longer processing delay, a more unified time-domain and frequency-domain coding model has been proposed in U.S. Patent No. 9,015,038 (see Reference [1], of which the full content is incorporated herein by reference). This unified time-domain and frequency-domain coding model is part of the EVS (Enhanced Voice Services) sound codec standardized by 3GPP (3rd Generation Partnership Project), as described in Reference [2], of which the full content is incorporated herein by reference. In recent years, 3GPP started working on a 3D (three-dimensional) sound codec, based on the EVS codec, for immersive services called IVAS (Immersive Voice and Audio Services) (see Reference [3], of which the full content is incorporated herein by reference).
To make the coding model even more efficient for specific kinds of signals, coding modes have been added to distribute the available bits efficiently between the time domain and the frequency domain and between low and high frequencies. The additional coding modes are triggered by a new speech/music classifier whose output allows for an unclear category, for signals that cannot be clearly classified as either music or speech (see Reference [4], of which the full content is incorporated herein by reference).
Summary
The present disclosure relates to a unified time-domain/frequency-domain coding method for coding an input sound signal. The method comprises: classifying the input sound signal into one of a plurality of sound signal categories, wherein the sound signal categories comprise an unclear signal type category indicating that the nature of the input sound signal is unclear; selecting one of a plurality of coding sub-modes for coding the input sound signal when the input sound signal is classified in the unclear signal type category; and mixed time-domain/frequency-domain coding the input sound signal using the selected coding sub-mode.
The present disclosure also relates to a unified time-domain/frequency-domain coding method for coding an input sound signal, comprising: classifying the input sound signal into one of a plurality of sound signal categories, wherein the sound signal categories comprise an unclear signal type category indicating that the nature of the input sound signal is unclear; and mixed time-domain/frequency-domain coding the input sound signal in response to the input sound signal being classified in the unclear signal type category. Mixed time-domain/frequency-domain coding the input sound signal comprises band selection and bit allocation, for selecting the frequency bands to be quantized and for distributing the bit budget available for quantization among the selected frequency bands.
According to the present disclosure, there is also provided a unified time-domain/frequency-domain coding device for coding an input sound signal, comprising: a classifier of the input sound signal into one of a plurality of sound signal categories, wherein the sound signal categories comprise an unclear signal type category indicating that the nature of the input sound signal is unclear; a selector of one of a plurality of coding sub-modes for coding the input sound signal when the input sound signal is classified in the unclear signal type category; and a mixed time-domain/frequency-domain encoder for coding the input sound signal using the selected coding sub-mode.
The present disclosure further relates to a unified time-domain/frequency-domain coding device for coding an input sound signal, comprising: a classifier of the input sound signal into one of a plurality of sound signal categories, wherein the sound signal categories comprise an unclear signal type category indicating that the nature of the input sound signal is unclear; and a mixed time-domain/frequency-domain encoder for coding the input sound signal in response to the input sound signal being classified in the unclear signal type category. The mixed time-domain/frequency-domain encoder comprises a band selector and a bit allocator, for selecting the frequency bands to be quantized and for distributing the bit budget available for quantization among the selected frequency bands.
The present disclosure provides a sound signal decoding method, comprising: receiving a bitstream conveying information usable to reconstruct a mixed time-domain/frequency-domain excitation representing a sound signal classified in an unclear signal type category, the unclear signal type category indicating that the nature of the sound signal is unclear, wherein the information includes one of a plurality of coding sub-modes used for coding the sound signal classified in the unclear signal type category; reconstructing the mixed time-domain/frequency-domain excitation in response to the information conveyed in the bitstream, including the coding sub-mode used for coding the input sound signal; converting the mixed time-domain/frequency-domain excitation to the time domain; and filtering the mixed time-domain/frequency-domain excitation converted to the time domain through a synthesis filter to produce a synthesized version of the sound signal.
The present disclosure also proposes a sound signal decoding method, comprising: receiving a bitstream conveying information usable to reconstruct a mixed time-domain/frequency-domain excitation representing a sound signal that is (a) classified in an unclear signal type category indicating that the nature of the sound signal is unclear, and (b) coded using (i) frequency bands selected for quantization and (ii) a bit budget available for quantization distributed among the frequency bands; reconstructing the mixed time-domain/frequency-domain excitation in response to the information conveyed in the bitstream, wherein reconstructing the mixed time-domain/frequency-domain excitation comprises selecting the frequency bands used for quantization and distributing the bit budget available for quantization among those frequency bands; converting the mixed time-domain/frequency-domain excitation to the time domain; and filtering the mixed time-domain/frequency-domain excitation converted to the time domain through a synthesis filter to produce a synthesized version of the sound signal.
According to the present disclosure, there is provided a sound signal decoder, comprising: a receiver of a bitstream conveying information usable to reconstruct a mixed time-domain/frequency-domain excitation representing a sound signal classified in an unclear signal type category, the unclear signal type category indicating that the nature of the sound signal is unclear, wherein the information includes one of a plurality of coding sub-modes used for coding the sound signal classified in the unclear signal type category; a re-constructor of the mixed time-domain/frequency-domain excitation in response to the information conveyed in the bitstream, including the coding sub-mode used for coding the input sound signal; a converter of the mixed time-domain/frequency-domain excitation to the time domain; and a synthesis filter for filtering the mixed time-domain/frequency-domain excitation converted to the time domain to produce a synthesized version of the sound signal.
The present disclosure further relates to a sound signal decoder, comprising: a receiver of a bitstream conveying information usable to reconstruct a mixed time-domain/frequency-domain excitation representing a sound signal that is (a) classified in an unclear signal type category indicating that the nature of the sound signal is unclear, and (b) coded using (i) frequency bands selected for quantization and (ii) a bit budget available for quantization distributed among the frequency bands; a re-constructor of the mixed time-domain/frequency-domain excitation in response to the information conveyed in the bitstream, wherein the re-constructor selects the frequency bands used for quantization and the distribution of the bit budget available for quantization among those frequency bands; a converter of the mixed time-domain/frequency-domain excitation to the time domain; and a synthesis filter for filtering the mixed time-domain/frequency-domain excitation converted to the time domain to produce a synthesized version of the sound signal.
The foregoing and other features will become more apparent upon reading the following non-restrictive description of illustrative embodiments of the unified time-domain/frequency-domain coding method, the unified time-domain/frequency-domain coding device, the decoding method and the decoder device, given by way of example only with reference to the accompanying drawings.
Brief Description of the Drawings
In the appended drawings:
Figure 1 is a schematic block diagram illustrating concurrently an overview of a unified time-domain/frequency-domain CELP (Code-Excited Linear Prediction) coding method and of a corresponding unified time-domain/frequency-domain CELP coding device, for example an ACELP (Algebraic Code-Excited Linear Prediction) coding method and device;
Figure 2 is a schematic block diagram of a more detailed structure of the unified time-domain/frequency-domain coding method and device of Figure 1, in which a pre-processor performs a first-level analysis to classify the input sound signal;
Figure 3 is a schematic block diagram illustrating concurrently an overview of a calculator of a cut-off frequency of a time-domain excitation contribution and of a corresponding operation of estimating the cut-off frequency;
Figure 4 is a schematic block diagram illustrating a more detailed structure of the cut-off frequency calculator of Figure 3 and of the corresponding operation of estimating the cut-off frequency;
Figure 5 is a schematic block diagram illustrating concurrently an overview of a frequency quantizer and of a corresponding frequency quantizing operation;
Figure 6 is a schematic block diagram of a more detailed structure of the frequency quantizer and frequency quantizing operation of Figure 5;
Figure 7 is a schematic block diagram illustrating concurrently an alternative implementation of the unified time-domain/frequency-domain CELP coding method and of the corresponding unified time-domain/frequency-domain CELP coding device;
Figure 8 is a schematic block diagram illustrating concurrently an operation of selecting a coding sub-mode and a corresponding sub-mode selector;
Figure 9 is a schematic block diagram illustrating concurrently a band selector and bit allocator and the corresponding operations of band selection and bit allocation, used in the alternative implementation of Figures 7 and 8 to distribute the available bit budget to the frequency-domain coding mode when the input sound signal is classified as neither speech nor music;
Figure 10 is a simplified block diagram of an example configuration of hardware components forming the unified time-domain/frequency-domain coding device and method for coding an input sound signal;
Figure 11 is a schematic block diagram illustrating concurrently a decoder device 1100 and a corresponding decoding method 1150 for decoding a bitstream from the unified time-domain/frequency-domain coding device and corresponding unified time-domain/frequency-domain coding method of Figure 7; and
Figure 12 is a schematic block diagram illustrating concurrently a sound signal decoder and a corresponding sound signal decoding method for decoding a bitstream from the unified time-domain/frequency-domain coding device and corresponding unified time-domain/frequency-domain coding method when the sound signal is classified in the unclear signal type category.
Detailed Description
The present disclosure proposes a unified time-domain and frequency-domain coding model that improves the synthesis quality of generic audio signals (for example music and/or reverberant speech) without increasing the processing delay or the bit rate. This unified time-domain and frequency-domain coding model comprises:
- a time-domain coding mode operating in the linear prediction (LP) residual domain, in which the available bits are dynamically allocated among an adaptive codebook, one or more fixed codebooks (for example an algebraic codebook, a Gaussian codebook, etc.) and a variable-length fixed codebook; and
- a frequency-domain coding mode,
depending on the characteristics of the input sound signal.
To achieve a low-processing-delay, low-bit-rate conversational sound codec that improves the synthesis quality of generic audio signals (for example music and/or reverberant speech), the frequency-domain coding mode is integrated as closely as possible with the CELP (Code-Excited Linear Prediction) time-domain coding mode. For that purpose, the frequency-domain coding mode uses a frequency transform performed in the LP (Linear Prediction) residual domain. This allows switching from one frame (for example a 20 ms frame) to the next nearly without artifacts. As is well known in the art of sound codecs, the input sound signal is sampled at a given sampling rate and processed in groups of samples called "frames", which are usually divided into a number of "subframes". Here, the integration of the two (2) time-domain and frequency-domain coding modes is close enough to allow the bit budget to be dynamically reallocated to the other coding mode when the current coding mode is determined not to be efficient enough.
One feature of the proposed unified time-domain and frequency-domain coding model is the variable time support of the time-domain component, which varies from a quarter of a frame (a subframe) to a complete frame on a frame-by-frame basis. As a non-limiting illustrative example, a frame may represent 20 ms of the input sound signal. Such a frame corresponds to 320 samples of the input sound signal if the internal sampling rate of the sound codec is 16 kHz, or to 256 samples if the internal sampling rate is 12.8 kHz. A subframe (a quarter of a frame in this example) then represents 80 or 64 samples, depending on the internal sampling rate of the sound codec. In the present non-limiting illustrative embodiment, the internal sampling rate of the sound codec is 12.8 kHz, which gives a frame length of 256 samples and a subframe length of 64 samples of the input sound signal.
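The frame and subframe lengths quoted above follow from simple arithmetic, which the short sketch below reproduces. It is only an illustration: the function name `frame_layout` and its parameters are hypothetical and do not come from the disclosure.

```python
def frame_layout(internal_rate_hz: int, frame_ms: int = 20, subframes_per_frame: int = 4):
    """Return (samples per frame, samples per subframe).

    Hypothetical helper: the name and signature are illustrative only.
    """
    frame_len = internal_rate_hz * frame_ms // 1000
    return frame_len, frame_len // subframes_per_frame


# The two internal sampling rates mentioned in the text.
for rate in (12_800, 16_000):
    frame_len, subframe_len = frame_layout(rate)
    print(f"{rate} Hz: {frame_len} samples/frame, {subframe_len} samples/subframe")
```

With a 20 ms frame split into four subframes, this gives 256 and 64 samples at 12.8 kHz, and 320 and 80 samples at 16 kHz.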
The variable time support makes it possible to capture the major temporal events with a minimum bit rate in order to create a basic time-domain excitation contribution. At very low bit rates, the time support is usually the entire frame. In that case, the time-domain contribution to the excitation consists only of the adaptive codebook, and the corresponding adaptive codebook (pitch) information and gain are transmitted once per frame. When more bit rate is available, more temporal events can be captured by shortening the time support and increasing the bit rate allocated to the time-domain coding mode. Finally, when the time support is sufficiently short (down to a quarter of a frame, that is, one subframe) and the available bit rate is sufficiently high, the time-domain contribution to the excitation may include, for each subframe, an adaptive codebook contribution with the corresponding adaptive codebook gain, a fixed codebook contribution with the corresponding fixed codebook gain, or both contributions with their corresponding gains. Alternatively, an adaptive codebook contribution with the corresponding adaptive codebook gain and a fixed codebook contribution with the corresponding fixed codebook gain can be transmitted for each half of the frame; this has the advantage of not consuming too much bit rate while still being able to code temporal events. The parameters describing the codebook indices and gains are then transmitted for each subframe.
At low bit rates, conversational sound codecs are unable to code the higher frequencies properly. This causes a significant degradation of the synthesis quality when the input sound signal includes music and/or reverberant speech. To solve this issue, a feature is added that computes the efficiency of the time-domain excitation contribution. In some cases, the time-domain excitation contribution is not valuable whatever the input bit rate and the time frame support. In those cases, all the bits are reallocated to the next step of frequency-domain coding. Most of the time, however, the time-domain excitation contribution is valuable only up to a certain frequency (hereinafter the "cut-off frequency"). In those cases, the time-domain excitation contribution is filtered out above the cut-off frequency. The filtering operation keeps the valuable information coded with the time-domain excitation contribution and removes the non-valuable information above the cut-off frequency. In a non-limiting illustrative embodiment, the filtering is performed in the frequency domain by setting the frequency bins above a certain frequency (the cut-off frequency) to zero.
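The frequency-domain filtering described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function `zero_above_cutoff`, the uniform bin spacing `bin_width_hz`, and the convention that bin `k` sits at `k * bin_width_hz` are all hypothetical.

```python
def zero_above_cutoff(spectrum, cutoff_hz, bin_width_hz):
    """Zero every frequency bin located above the cut-off frequency.

    Assumption (not from the disclosure): bin k is located at k * bin_width_hz.
    """
    return [
        coeff if k * bin_width_hz <= cutoff_hz else 0.0
        for k, coeff in enumerate(spectrum)
    ]


# Toy spectrum of 8 bins spaced 800 Hz apart, with a 2400 Hz cut-off.
filtered = zero_above_cutoff([1.0] * 8, cutoff_hz=2400, bin_width_hz=800)
print(filtered)
```

Bins at or below the cut-off are left untouched, so the valuable low-frequency part of the time-domain excitation contribution is preserved exactly.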
The variable time support combined with the variable cut-off frequency makes the bit allocation within the unified time-domain and frequency-domain coding model very dynamic. The bit rate remaining after quantization of the LP filter can be allocated entirely to the time domain, entirely to the frequency domain, or somewhere in between. The bit rate allocation between the time domain and the frequency domain is made as a function of the number of subframes used for the time-domain excitation contribution, of the available bit budget, and of the computed cut-off frequency. To make the unified time-domain and frequency-domain coding model even more efficient for specific kinds of input sound signals, specific coding sub-modes are added to distribute the available bits efficiently between the time domain and the frequency domain and between low and high frequencies. These added coding sub-modes are selected using a new speech/music audio classifier whose output allows for an unclear signal category (signals that cannot be clearly classified as either music or speech).
To create a total excitation that matches the input LP residual more efficiently, the frequency-domain coding mode is applied. One feature is that the frequency-domain coding is performed on a vector that contains, up to the cut-off frequency, the difference between the frequency representation (frequency transform) of the input LP residual and the frequency representation (frequency transform) of the filtered time-domain excitation contribution and, above the cut-off frequency, the frequency representation (frequency transform) of the input LP residual itself. A smooth spectral transition is inserted between the two segments just above the cut-off frequency. In other words, the high-frequency part of the frequency representation of the time-domain excitation contribution is first zeroed out above the cut-off frequency. A transition region between the unchanged part of the spectrum of the time-domain excitation contribution and the zeroed part is inserted just above the cut-off frequency to ensure a smooth transition between the two parts of the spectrum. This modified spectrum of the time-domain excitation contribution is then subtracted from the frequency representation of the input LP residual. The resulting spectrum thus corresponds to the difference between the two spectra below the cut-off frequency and to the frequency representation of the LP residual above the cut-off frequency, with some transition region in between. As mentioned above, the cut-off frequency can vary from one frame to another.
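The construction of the vector submitted to frequency-domain coding can be sketched as below. The text does not specify the shape or width of the transition region, so the linear taper and the `transition_bins` parameter are assumptions made for illustration only, as are the function and argument names.

```python
def frequency_coding_target(residual_spec, td_spec, cutoff_bin, transition_bins=2):
    """Build the vector to be quantized in the frequency domain.

    Up to cutoff_bin: LP residual spectrum minus time-domain contribution.
    Above the transition region: LP residual spectrum only.
    Assumption (not from the disclosure): the transition is a linear taper
    over `transition_bins` bins just above the cut-off.
    """
    if len(residual_spec) != len(td_spec):
        raise ValueError("both spectra must have the same number of bins")
    target = []
    for k, (res, td) in enumerate(zip(residual_spec, td_spec)):
        if k <= cutoff_bin:
            weight = 1.0  # time-domain contribution kept unchanged
        elif k <= cutoff_bin + transition_bins:
            # Linear fade of the time-domain contribution towards zero.
            weight = 1.0 - (k - cutoff_bin) / (transition_bins + 1)
        else:
            weight = 0.0  # time-domain contribution zeroed out
        target.append(res - weight * td)
    return target


target = frequency_coding_target([4.0] * 6, [3.0] * 6, cutoff_bin=1, transition_bins=1)
print(target)
```

In this toy case the first two bins hold the difference of the two spectra, the third bin falls in the transition region, and the remaining bins hold the LP residual spectrum unchanged.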
Whatever the frequency quantization method (frequency-domain coding mode) chosen, there is always a possibility of pre-echo, especially with long windows. In the technique disclosed herein, the window used is a square window, so that the extra window length compared to the coded input sound signal is zero (0), that is, no overlap-add is used. Although this corresponds to the best window for reducing any potential pre-echo, some pre-echo may still be audible on temporal attacks. Many techniques exist to address such pre-echo problems, but the present disclosure proposes a simple feature to cancel this pre-echo problem. This feature is based on a memory-less time-domain coding mode derived from the "Transition Mode" of ITU-T Recommendation G.718 (Reference [5], sections 6.8.1.4 and 6.8.4.2, of which the full content is incorporated herein by reference). The idea behind this feature is to take advantage of the fact that the proposed unified time-domain and frequency-domain coding model is integrated in the LP residual domain, which allows switching without artifacts almost at any time. When the input sound signal is considered to be generic audio (music and/or reverberant speech) and a temporal attack is detected in a frame, that frame is coded using only the memory-less time-domain coding mode. This memory-less time-domain coding mode takes care of the temporal attack, thereby avoiding the pre-echo that could be introduced by frequency-domain coding of that frame.
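The per-frame decision described above can be summarized by the simplified sketch below. The enumeration `FrameMode`, the function `select_frame_mode` and the two boolean inputs are hypothetical names; the actual classification and attack detection are not reproduced here, and treating every non-generic-audio frame as plain time-domain CELP is a simplifying assumption.

```python
from enum import Enum


class FrameMode(Enum):
    TIME_DOMAIN_CELP = "time-domain CELP"
    MEMORYLESS_TIME_DOMAIN = "memory-less time-domain"
    MIXED_TIME_FREQUENCY = "mixed time-domain/frequency-domain"


def select_frame_mode(is_generic_audio: bool, attack_detected: bool) -> FrameMode:
    """Pick a coding mode for the current frame (illustrative simplification)."""
    if not is_generic_audio:
        # Simplifying assumption: speech-like frames use time-domain coding only.
        return FrameMode.TIME_DOMAIN_CELP
    if attack_detected:
        # Generic audio with a temporal attack: avoid pre-echo by skipping
        # frequency-domain coding for this frame.
        return FrameMode.MEMORYLESS_TIME_DOMAIN
    return FrameMode.MIXED_TIME_FREQUENCY


print(select_frame_mode(is_generic_audio=True, attack_detected=True).value)
```

Because both coding modes operate in the LP residual domain, such a switch can be made on a frame-by-frame basis.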
Non-limiting illustrative embodiment
In the proposed unified time-domain and frequency-domain coding model, the above-mentioned adaptive codebook and one or more fixed codebooks (for example an algebraic codebook, a Gaussian codebook, etc.), that is, the so-called time-domain codebooks, together with the frequency-domain quantization (frequency-domain coding mode), can be seen as a codebook library, and the bits can be distributed among all the available codebooks or a subset thereof. This means, for example, that if the input sound signal is clean speech, all the bits are allocated to the time-domain coding mode, which basically reduces the coding to the legacy CELP scheme. On the other hand, for some music segments, all the bits allocated to code the input LP residual are sometimes best spent in the frequency domain, for example in a transform domain. Furthermore, specific cases can be added in which (a) the time domain uses a larger part of the total available bit rate to code more time-domain events while still keeping bits for coding some frequency information, or (b) low-frequency content is prioritized over high-frequency content, or vice versa.
As indicated in the foregoing description, the time supports of the time-domain and frequency-domain coding modes do not need to be the same. While the bits spent on the different time-domain coding operations (adaptive and algebraic codebook searches) are usually distributed on a subframe basis (typically a quarter of a frame, or 5 ms of time support), the bits allocated to the frequency-domain coding mode are distributed on a frame basis (typically 20 ms of time support) to improve the frequency resolution.
The bit budget allocated to the time-domain CELP coding mode can also be controlled dynamically depending on the input sound signal. In some cases, the bit budget allocated to the time-domain CELP coding mode can be zero, which effectively means that the entire bit budget is allocated to the frequency-domain coding mode. The choice of working in the LP residual domain for both the time-domain and the frequency-domain coding modes has two (2) main benefits. First, it is compatible with the time-domain CELP coding mode, which has proven efficient for coding speech signals. Consequently, no artifact is introduced by switching between the two types of coding modes (time-domain coding mode and frequency-domain coding mode). Second, the lower dynamics of the LP residual with respect to the original input sound signal, and its relative flatness, make it easier to use a square window for the frequency transform, which in turn permits the use of non-overlapping windows.
In a non-limitative example where the internal sampling rate of the codec is 12.8 kHz (meaning 256 samples per frame), similarly as in ITU-T Recommendation G.718 (Reference [5]), the length of the subframes used in the time-domain CELP coding mode can vary from the typical 1/4 of the frame length (5 ms) to a half frame (10 ms) or a complete frame length (20 ms). The subframe length decision is based on the available bit rate and on an analysis of the input sound signal, in particular its spectral dynamics. The decision can be made in a closed-loop manner or, to save complexity, in an open-loop manner. It can also be controlled by the nature of the input sound signal as detected by a signal classifier (for example a speech/music classifier). The subframe length can change from frame to frame.
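The frame and subframe sizes quoted above follow directly from the sampling rate. A minimal sketch of that arithmetic, assuming the 12.8 kHz rate and 20 ms frame of the example; the names `samples_for` and `SUBFRAME_SAMPLES` are hypothetical and do not come from the source:

```python
# Values taken from the example in the text: 12.8 kHz internal rate, 20 ms frames.
INTERNAL_RATE_HZ = 12_800
FRAME_MS = 20


def samples_for(duration_ms: float, rate_hz: int = INTERNAL_RATE_HZ) -> int:
    """Number of samples covering duration_ms at rate_hz."""
    return round(rate_hz * duration_ms / 1000)


FRAME_SAMPLES = samples_for(FRAME_MS)

# Hypothetical lookup: subframes per frame -> subframe length in samples.
SUBFRAME_SAMPLES = {count: FRAME_SAMPLES // count for count in (4, 2, 1)}

if __name__ == "__main__":
    for count, length in SUBFRAME_SAMPLES.items():
        duration_ms = 1000 * length / INTERNAL_RATE_HZ
        print(f"{count} subframe(s) per frame: {length} samples, {duration_ms:g} ms each")
```

With these values a frame holds 256 samples, and the quarter, half and full-frame subframes hold 64, 128 and 256 samples respectively.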
Once the subframe length has been chosen for the current frame, a standard closed-loop pitch analysis is performed and a first contribution to the excitation signal is selected from the adaptive codebook. Then, depending on the available bit budget and the characteristics of the input sound signal (for example in the case of an input speech signal), a second contribution from one or more fixed codebooks can be added before the conversion to the transform domain. The resulting excitation contribution is the time-domain excitation contribution. On the other hand, at very low bit rates and in the case of generic audio signals, it is often better to skip the fixed codebook stage and use all the remaining bits for transform-domain coding. The transform-domain coding can be, for example, a frequency-domain coding mode. As described above, the subframe length can be one quarter of the frame, one half of the frame, or one frame long. The fixed codebook contribution is used only when the subframe length equals one quarter of the frame length. When the subframe length is decided to be half a frame or the entire frame, only the adaptive codebook contribution is used to represent the time-domain excitation contribution, and all the remaining bits are allocated to the frequency-domain coding mode. Alternatively, an additional coding mode is described in which the fixed codebook can be used when the subframe length equals half of the frame length. This addition improves the quality of specific kinds of input sound signals containing temporal events while keeping an acceptable bit budget for encoding the frequency-domain excitation contribution.
Once the computation of the time-domain excitation contribution is completed, its efficiency needs to be assessed and quantified. If the coding gain in the time domain is very low, it is more efficient to remove the time-domain excitation contribution altogether and use all the bits for the frequency-domain coding mode. On the other hand, for example in the case of a clean input speech signal, the frequency-domain coding mode is not needed and all the bits are allocated to the time-domain coding mode. Often, however, coding in the time domain is efficient only up to a certain frequency. This frequency corresponds to the above-mentioned cut-off frequency of the time-domain excitation contribution. Determining this cut-off frequency ensures that the entire time-domain coding helps to obtain a better final synthesis rather than working against the frequency-domain coding.
The cut-off frequency can be estimated in the frequency domain. To compute it, the spectra of the LP residual and of the time-domain excitation contribution are first split into a predefined number of frequency bands, each band containing a number of frequency bins. The number of bands and the number of bins covered by each band can vary from one implementation to another. For each band, a normalized correlation is computed between the frequency representation of the time-domain excitation contribution and the frequency representation of the LP residual, and the correlation is smoothed between adjacent bands. As a non-limitative example, the per-band correlations are lower-limited to 0.5 and normalized between 0 and 1, and the average correlation is then computed as the mean of the correlations over all the bands. For the purpose of a first estimate of the cut-off frequency, the average correlation is then scaled between 0 and half the internal sampling rate (half the internal sampling rate corresponding to a normalized correlation value of 1). At very low bit rates, or for the additional coding sub-modes described below, the average correlation is doubled before the cut-off frequency is found. This is done when it is known that the time-domain excitation contribution is needed even though the correlation is not very high, either because of the low bit rate or because the type of input sound signal does not allow a high correlation. The first estimate of the cut-off frequency is then found as the upper limit of the frequency band closest to the scaled average correlation. In the implemented example, sixteen (16) frequency bands at the 12.8 kHz internal sampling rate are defined for the correlation computation.
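The first cut-off estimate described above can be sketched as follows. This is an illustration under stated assumptions, not the codec's implementation: the band limits are passed in by the caller (the usage below uses sixteen uniform 400 Hz bands purely for illustration, whereas the actual band layout is not given here), the inter-band smoothing is assumed to have been applied already, and the function names are hypothetical.

```python
import math


def band_correlation(residual_bins, excitation_bins):
    """Normalized correlation between two spectra restricted to one band."""
    num = sum(r * e for r, e in zip(residual_bins, excitation_bins))
    den = math.sqrt(sum(r * r for r in residual_bins)
                    * sum(e * e for e in excitation_bins))
    return num / den if den > 0.0 else 0.0


def first_cutoff_estimate(band_corr, band_upper_hz,
                          sample_rate_hz=12800.0, double_average=False):
    """First estimate of the cut-off frequency from per-band correlations.

    band_corr     -- per-band normalized correlations (assumed already smoothed)
    band_upper_hz -- upper limit of each band in Hz (layout is an assumption)
    """
    # Lower-limit each correlation to 0.5, then map [0.5, 1] onto [0, 1].
    normalized = [2.0 * (min(max(c, 0.5), 1.0) - 0.5) for c in band_corr]
    average = sum(normalized) / len(normalized)
    if double_average:
        # Very low bit rates or additional sub-modes: double the average.
        average *= 2.0
    # Scale so that a correlation of 1 maps to half the sampling rate.
    target_hz = average * sample_rate_hz / 2.0
    # Pick the band whose upper limit is closest to the scaled value.
    return min(band_upper_hz, key=lambda f: abs(f - target_hz))


if __name__ == "__main__":
    uniform_bands = [400.0 * (k + 1) for k in range(16)]  # illustrative only
    print(first_cutoff_estimate([0.75] * 16, uniform_bands))
```

A doubled average can exceed 1; in this sketch the nearest-band search then simply returns the highest band limit.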
Taking advantage of the psychoacoustic properties of the human ear, the reliability of the cut-off frequency estimate can be improved by comparing the estimated position of the eighth harmonic frequency of the pitch with the cut-off frequency estimated by the correlation computation. If this position is higher than the cut-off frequency estimated by the correlation computation, the cut-off frequency is modified to correspond to the position of the eighth harmonic frequency of the pitch. If one of the additional coding sub-modes is used, the cut-off frequency has a minimum value higher than or equal to, for example, 2775 Hz (7th band). The final value of the cut-off frequency is then quantized and transmitted to a remote decoder. In the implemented example, 3 or 4 bits are used for this quantization, giving 8 or 16 possible cut-off frequencies depending on the bit rate.
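The harmonic-based refinement can be sketched as below. Several details are assumptions made for illustration: the pitch is supplied as a lag in samples, the eighth-harmonic position is snapped to the upper limit of the band that contains it, and the band layout in the usage is a uniform placeholder. Quantization of the final value is left out because the candidate cut-off tables are not given in this text.

```python
def refine_cutoff(cutoff_hz, pitch_lag_samples, band_upper_hz,
                  sample_rate_hz=12800.0, min_cutoff_hz=None):
    """Raise the cut-off to the eighth pitch harmonic when that lies higher.

    min_cutoff_hz -- optional floor, e.g. 2775.0 for the additional sub-modes
    """
    # Fundamental frequency is sample_rate / lag; the eighth harmonic is 8x that.
    eighth_harmonic_hz = 8.0 * sample_rate_hz / pitch_lag_samples
    # Assumption: map the harmonic onto the band grid (first band covering it).
    harmonic_band_hz = next(
        (f for f in band_upper_hz if f >= eighth_harmonic_hz),
        band_upper_hz[-1],
    )
    refined = max(cutoff_hz, harmonic_band_hz)
    if min_cutoff_hz is not None:
        refined = max(refined, min_cutoff_hz)
    return refined


if __name__ == "__main__":
    uniform_bands = [400.0 * (k + 1) for k in range(16)]  # illustrative only
    # A lag of 64 samples at 12.8 kHz is a 200 Hz pitch, eighth harmonic 1600 Hz.
    print(refine_cutoff(1200.0, 64, uniform_bands))
```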
Once the cut-off frequency is known, frequency quantization of the frequency-domain excitation contribution is performed. First, the difference between the frequency representation (frequency transform) of the input LP residual and the frequency representation (frequency transform) of the time-domain excitation contribution is determined. A new vector is then created, consisting of this difference up to the cut-off frequency and of a smooth transition to the frequency representation of the input LP residual for the remainder of the spectrum. Frequency quantization is then applied to the whole new vector. In the implemented example, the quantization consists in coding the sign and position of the dominant (most energetic) spectral pulses. The number of pulses quantized per frequency band depends on the bit rate available for the frequency-domain coding mode. If not enough bits are available to cover all the frequency bands, the remaining bands are filled with noise only.
Frequency quantization of a band using the quantization method described in the previous paragraph does not guarantee that all the frequency bins within that band are quantized. This is especially the case at low bit rates, where the number of spectral pulses quantized per band is relatively low. To prevent audible artifacts caused by these non-quantized bins, some noise is added to fill the gaps. Because at low bit rates the quantized spectral pulses should dominate the spectrum rather than the inserted noise, the noise spectrum amplitude corresponds to only a fraction of the pulse amplitude. The amplitude of the added noise is higher when the available bit budget is low (allowing more noise) and lower when the available bit budget is high.
In the frequency-domain coding mode, a gain is computed for each frequency band to match the energy of the quantized signal to that of the non-quantized signal. The gains are vector quantized and applied per band to the quantized signal. When, for example, the unified time-domain and frequency-domain coding model changes its bit allocation from the time-domain-only coding mode to the hybrid time-domain/frequency-domain coding mode, the per-band excitation spectral energy of the time-domain-only coding mode does not match the per-band excitation spectral energy of the hybrid time-domain/frequency-domain coding mode. This energy mismatch can create switching artifacts, especially at low bit rates. To reduce any audible degradation created by this bit reallocation, a long-term gain can be computed for each band and applied to correct the energy of each band for a few frames after switching from the time-domain-only coding mode to the hybrid time-domain/frequency-domain coding mode.
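The per-band energy matching can be illustrated with a short sketch. It assumes that the gain of a band is the square root of the ratio between the non-quantized and quantized band energies, which is the simplest choice that makes the energies equal. The vector quantization of the gains and the long-term gain correction are omitted, and the function names and band-edge representation are hypothetical.

```python
import math


def band_gains(unquantized, quantized, band_edges):
    """One gain per band so that the scaled quantized energy matches the reference.

    band_edges -- bin indices delimiting the bands, e.g. [0, 2, 4] for two bands
    """
    gains = []
    for start, stop in zip(band_edges[:-1], band_edges[1:]):
        e_ref = sum(x * x for x in unquantized[start:stop])
        e_q = sum(x * x for x in quantized[start:stop])
        # An empty quantized band cannot be rescaled; leave it at zero.
        gains.append(math.sqrt(e_ref / e_q) if e_q > 0.0 else 0.0)
    return gains


def apply_band_gains(quantized, gains, band_edges):
    """Scale every bin of the quantized spectrum by the gain of its band."""
    out = list(quantized)
    for gain, start, stop in zip(gains, band_edges[:-1], band_edges[1:]):
        for k in range(start, stop):
            out[k] *= gain
    return out


if __name__ == "__main__":
    reference = [3.0, 4.0, 0.0, 2.0]
    pulses = [1.0, 0.0, 0.0, 1.0]
    edges = [0, 2, 4]
    g = band_gains(reference, pulses, edges)
    print(g, apply_band_gains(pulses, g, edges))
```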
After completion of the frequency-domain coding mode, the frequency-domain excitation contribution is added to the frequency representation (frequency transform) of the time-domain excitation contribution, and the sum of these two (2) excitation contributions is transformed back to the time domain to form the total excitation. Finally, the synthesized signal is computed by filtering the total excitation through an LP synthesis filter.
In one embodiment, while the CELP coding memories are updated on a subframe basis using only the time-domain excitation contribution, the total excitation is used to update those memories at frame boundaries.
In another possible implementation, the CELP coding memories are updated on a subframe basis and also at the frame boundaries using only the time-domain excitation contribution. This results in an embedded structure in which the frequency-domain coded signal constitutes an upper quantization layer independent of the core CELP layer. In this particular case, the fixed codebook is always used in order to update the adaptive codebook content. However, the frequency-domain coding mode can apply to the whole frame. This embedded approach works for bit rates of around 12 kbps and higher.
1) Sound signal type classification
Figure 1 is a schematic block diagram illustrating concurrently an overview of a unified time-domain/frequency-domain CELP coding method 150 and of a corresponding unified time-domain/frequency-domain CELP coding device 100, for example an ACELP method and device. Of course, other types of CELP coding methods and devices can be implemented using the same concept.
Figure 2 is a schematic block diagram of a more detailed structure of the unified time-domain/frequency-domain CELP coding method 150 and device 100 of Figure 1.
The unified time-domain/frequency-domain CELP coding device 100 comprises a pre-processor 102 (Figure 1) for performing an operation 152 of analyzing parameters of the input sound signal 101 (Figures 1 and 2). Referring to Figure 2, the pre-processor 102 comprises: an LP analyzer 201 for performing an operation 251 of LP analysis of the input sound signal 101; a spectral analyzer 202 for performing an operation 252 of spectral analysis; an open-loop pitch analyzer 203 for performing an operation 253 of open-loop pitch analysis; and a signal classifier 204 for performing an operation 254 of classification of the input sound signal. The analyzers 201 and 202 and the associated operations 251 and 252 perform the LP and spectral analyses usually carried out in CELP coding, as described for example in ITU-T Recommendation G.718, Reference [5], Sections 6.4 and 6.1.4, and are therefore not further described in the present disclosure.
The pre-processor 102 conducts a first-level analysis to classify the input sound signal 101 between speech and non-speech (generic audio, that is, music or reverberant speech), for example in a manner similar to that described in Reference [6], the full content of which is incorporated herein by reference, or with any other reliable speech/non-speech discrimination method.
After this first-level analysis, the pre-processor 102 performs a second-level analysis of input signal parameters so that time-domain CELP coding (with no frequency-domain coding) can be used on some sound signals that have strong non-speech characteristics but are still better encoded with a time-domain approach. When an important variation of energy occurs, this second-level analysis allows the unified time-domain/frequency-domain CELP coding device 100 to switch to a memoryless time-domain coding mode, generally called transition mode in Reference [7], the full content of which is incorporated herein by reference.
During this second-level analysis, the signal classifier 204 computes and uses a variation $\sigma_C$ of a smoothed version $C_{st}$ of the open-loop pitch correlation from the open-loop pitch analyzer 203, a current total frame energy $E_{tot}$ (the total energy of the input sound signal in the current frame), and a difference $E_{diff}$ between the current total frame energy and the previous total frame energy. First, the signal classifier 204 computes the variation of the smoothed open-loop pitch correlation using, for example, the following relation:
where:
- $C_{st}$ is the smoothed open-loop pitch correlation, defined as $C_{st} = 0.9 \cdot C_{ol} + 0.1 \cdot C_{st}$;
- $C_{ol}$ is the open-loop pitch correlation computed by the analyzer 203 using a method known to those of ordinary skill in the art of CELP coding, for example as described in ITU-T Recommendation G.718, Reference [5], Section 6.6;
- $\overline{C_{st}}$ is the average of the smoothed open-loop pitch correlation $C_{st}$ over the last 10 frames $i$;
- $\sigma_C$ is the variation of the smoothed open-loop pitch correlation.
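One form consistent with the definitions above is the plain variance of $C_{st}$ over the 10 most recent frames. The $1/10$ normalization and the indexing ($i = 0$ for the current frame, $i = 9$ for the oldest) are assumptions made for illustration, not values taken from the source:

```latex
\sigma_C = \frac{1}{10}\sum_{i=0}^{9}\bigl(C_{st}(i)-\overline{C_{st}}\bigr)^{2},
\qquad
\overline{C_{st}} = \frac{1}{10}\sum_{i=0}^{9} C_{st}(i)
```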
When the signal classifier 204 classifies a frame as non-speech during the first-level analysis, it performs the following verifications in the second-level analysis to determine whether it is really safe to use the hybrid time-domain/frequency-domain coding mode. Sometimes it is better to code the current frame with the time-domain coding mode only, using one of the time-domain approaches estimated by the pre-processing function of the time-domain coding mode. In particular, it may be better to use the memoryless time-domain coding mode to minimize any possible pre-echo that the hybrid time-domain/frequency-domain coding mode could introduce.
As a non-limitative implementation of the first verification of whether the hybrid time-domain/frequency-domain coding mode should be used, the signal classifier 204 computes the difference between the current total frame energy and the previous frame total energy. When the difference $E_{diff}$ between the current total frame energy $E_{tot}$ and the previous frame total energy is higher than, for example, 6 dB, this corresponds to a so-called "temporal attack" in the input sound signal 101. In that situation, the speech/non-speech decision and the selected coding mode are overridden and the memoryless time-domain coding mode is forced. More specifically, the unified time-domain/frequency-domain CELP coding device 100 comprises a time/time-frequency coding selector 103 (Figure 1) for performing an operation 153 of selecting between time-domain-only coding and hybrid time-domain/frequency-domain coding. To that end, the time/time-frequency coding selector 103 comprises: a speech/generic audio selector 205 (Figure 2) for performing an operation 255 of selecting between speech and generic audio to classify the input sound signal 101; a temporal attack detector 208 (Figure 2) for performing an operation 258 of detecting a temporal attack in the input sound signal 101; and a selector 206 (Figure 2) for performing an operation 256 of selecting the memoryless time-domain coding mode. In other words:
- In response to the determination of a speech signal by the selector 205, the closed-loop CELP encoder 207 (Figure 2) is used to perform an operation 257 of CELP coding of the speech signal.
- In response to the determination of a non-speech signal (generic audio) by the selector 205 and the detection of a temporal attack in the input sound signal 101 by the detector 208, the selector 206 forces the closed-loop CELP encoder 207 (Figure 2) to use the memoryless time-domain coding mode to code the input sound signal.
The closed-loop CELP encoder 207 forms part of the time-domain-only encoder 104 of Figure 1. Closed-loop CELP encoders are well known to those of ordinary skill in the art and are not further described in the present specification.
As a non-limitative implementation of the second verification of whether the hybrid time-domain/frequency-domain coding mode should be used, when the difference $E_{diff}$ between the current total frame energy $E_{tot}$ and the previous frame total energy is lower than or equal to 6 dB, but:
- the smoothed open-loop pitch correlation $C_{st}$ is higher than 0.96; or
- the smoothed open-loop pitch correlation $C_{st}$ is higher than 0.85 and the difference $E_{diff}$ between the current total frame energy $E_{tot}$ and the previous frame total energy is lower than 0.3 dB; or
- the variation $\sigma_C$ of the smoothed open-loop pitch correlation is lower than 0.1 and the difference $E_{diff}$ between the current total frame energy $E_{tot}$ and the last previous frame total energy is lower than 0.6 dB; or
- the current total frame energy $E_{tot}$ is lower than 20 dB;
and this is at least the second consecutive frame (cnt ≥ 2) for which the decision of the first-level analysis is changed, then the speech/generic audio selector 205 determines that the current frame will be coded in the time-domain-only coding mode using the closed-loop CELP encoder 207 (Figure 2).
Otherwise, the time/time-frequency coding selector 103 selects the hybrid time-domain/frequency-domain coding mode, as disclosed in the following description.
For example, when the non-speech input sound signal is music, the second verification can be summarized by the following pseudo-code:
where $E_{tot}$ is the current total frame energy, expressed as:
where $x(i)$ represents the samples of the input sound signal in the current frame, $N$ is the number of samples of the input sound signal per frame, and $E_{diff}$ is the difference between the current total frame energy $E_{tot}$ and the last previous frame total energy.
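The second verification, with the thresholds listed above, can be sketched in Python as follows. Two points are assumptions rather than statements from the source: the frame energy is taken as the mean-square value of the frame expressed in dB, and the counter `cnt` is reset to zero whenever the conditions are not met. The function names are hypothetical.

```python
import math


def total_frame_energy_db(frame):
    """Frame energy in dB. Assumption: 10*log10 of the mean-square sample value."""
    mean_square = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(max(mean_square, 1e-12))  # floor avoids log10(0)


def second_verification(c_st, sigma_c, e_tot_db, e_diff_db, cnt):
    """Return (use_time_domain_only, updated_cnt) for a non-speech frame.

    c_st      -- smoothed open-loop pitch correlation
    sigma_c   -- variation of the smoothed open-loop pitch correlation
    e_tot_db  -- current total frame energy in dB
    e_diff_db -- current minus previous total frame energy in dB
    cnt       -- consecutive frames that already met the conditions
    """
    favours_time_domain = e_diff_db <= 6.0 and (
        c_st > 0.96
        or (c_st > 0.85 and e_diff_db < 0.3)
        or (sigma_c < 0.1 and e_diff_db < 0.6)
        or e_tot_db < 20.0
    )
    # Assumption: the counter restarts when a frame fails the conditions.
    cnt = cnt + 1 if favours_time_domain else 0
    return favours_time_domain and cnt >= 2, cnt


if __name__ == "__main__":
    cnt = 0
    for _ in range(3):
        decision, cnt = second_verification(0.97, 0.5, 40.0, 1.0, cnt)
        print(decision, cnt)
```

A frame whose energy jump exceeds 6 dB never reaches this test in the text, since the first verification already forces the memoryless time-domain coding mode; the sketch simply returns `False` for it.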
Figure 7 is a schematic block diagram illustrating concurrently an alternative implementation of a unified time-domain/frequency-domain CELP coding method 750 and of a corresponding unified time-domain/frequency-domain CELP coding device 700, in which a pre-processor 702 also performs a first-level analysis to classify the input sound signal 101.
Specifically, the unified time-domain/frequency-domain CELP coding method 750 comprises an operation 752 of pre-processing the input sound signal 101, as described in Reference [4], to obtain the parameters needed to classify the input sound signal. To perform operation 752, the unified time-domain/frequency-domain CELP coding device 700 comprises a pre-processor 702.
The unified time-domain/frequency-domain CELP coding method 750 comprises an operation 751 of classifying the input sound signal 101 into speech, music and unclear signal type categories, using the parameters from the pre-processor 702 in a manner similar to that described in Reference [4], or using any other reliable method for discriminating speech, music and unclear signal types. The unclear signal type category indicates that the nature of the input sound signal 101 is unclear and, in particular, that the input sound signal 101 is classified neither as speech nor as music. To perform operation 751, the unified time-domain/frequency-domain CELP coding device 700 comprises a sound signal classifier 701.
If the sound signal classifier 701 classifies the input sound signal 101 in the music category, a frequency-domain encoder 703 performs an operation 753 of coding the input sound signal 101 using frequency-domain coding as described, for example, in Reference [2]. The frequency-domain coded music signal can then be synthesized in a music synthesis operation 754 performed by a synthesizer 704 to recover the music signal.
In the same manner, if the sound signal classifier 701 classifies the input sound signal 101 in the speech category, a time-domain encoder 705 performs an operation 755 of coding the input sound signal 101 using time-domain coding as described, for example, in Reference [2]. The time-domain coded speech signal can then be synthesized in a synthesis filtering operation 756 performed by a synthesizer 706 to recover the speech signal.
The unified time-domain/frequency-domain coding device 700 and method 750 thus maximize the performance of time-domain-only coding and frequency-domain-only coding by restricting their use to input sound signals with clear speech characteristics and input sound signals with clear music characteristics, respectively. This improves the overall quality for all types of input sound signals at low to medium bit rates.
Coding sub-modes have been designed as part of the unified time-domain and frequency-domain coding model to efficiently encode input sound signals that are classified neither as speech nor as music (unclear signal type category). Two (2) bits are used to signal the three (3) coding sub-modes, each identified by a corresponding sub-mode flag. A fourth sub-mode allows backward interoperability with the legacy unified time-domain and frequency-domain coding model (EVS).
As shown in Figure 8, the operation 751 of classifying the input sound signal 101 comprises an operation 850 of selecting one of the coding sub-modes in response to the bit rate available for coding the input sound signal 101 and to the characteristics of the input sound signal classified in the unclear signal type category. To perform operation 850, the sound signal classifier 701 comprises a sub-mode selector 800.
The coding sub-mode is identified by a sub-mode flag $F_{tfsm}$. In the non-limitative implementation of Figure 8, the sub-mode selector 800 selects the coding sub-mode as follows:
- If (a) the bit rate available for coding the input sound signal 101 is not higher than 9.2 kbps and (b) the input sound signal 101 is classified neither as speech nor as music, the sub-mode selector 800 selects the above-mentioned backward coding sub-mode (see 803). The sub-mode flag $F_{tfsm}$ is then set to "0" (see 802). Selecting the backward coding sub-mode results in the use of the legacy unified time-domain and frequency-domain coding model of Figures 1 and 2 (EVS).
- If (a) the input sound signal 101 is classified neither as speech nor as music by the classifier 701 and the available bit rate is high enough to allow coding of the adaptive and fixed codebooks and their gains, which typically means a bit rate higher than 9.2 kbps (see 803), (b) the probability that the input sound signal 101 is music (the weighted speech/music decision towards music, wdlp(n)) is not greater than "0" (see 804), and (c) no potential temporal attack is detected in the current frame of the input sound signal (the transition counter is not greater than "0", as described in ITU-T Recommendation G.718, Reference [5], Sections 6.8.1.4 and 6.8.4.2) (see 806), the sub-mode selector 800 selects the first coding sub-mode. The sub-mode flag $F_{tfsm}$ is then set to "1" (see 801). Although the input sound signal 101 is classified neither as speech nor as music by the classifier 701, the selector 800 detects "speech"-like characteristics in the input sound signal 101 and selects the first coding sub-mode (sub-mode flag $F_{tfsm}$ = 1), because CELP is not optimal for coding such a sound signal.
- If (a) the input sound signal 101 is classified neither as speech nor as music by the classifier 701 and the available bit rate is high enough to allow coding of the adaptive and fixed codebooks and their gains, which typically means a bit rate higher than 9.2 kbps (see 803), (b) the probability that the input sound signal 101 is music (the weighted speech/music decision towards music, wdlp(n)) is not greater than "0" (see 804), and (c) a potential temporal attack is detected in the current frame of the input sound signal (the transition counter is greater than "0", as described in ITU-T Recommendation G.718, Reference [5], Sections 6.8.1.4 and 6.8.4.2) (see 806), the sub-mode selector 800 selects the second coding sub-mode. The sub-mode flag $F_{tfsm}$ is then set to "2" (see 807). As explained in the following description, the second coding sub-mode (sub-mode flag $F_{tfsm}$ = 2) allocates more bits to the lower part of the spectrum.
- If (a) the input sound signal 101 is classified neither as speech nor as music by the classifier 701 and the available bit rate is high enough to allow coding of at least the adaptive codebook and its gain while still leaving a significant number of bits for frequency coding, which typically means a bit rate higher than 9.2 kbps (see 803), and (b) the probability that the input sound signal 101 is music (the weighted speech/music decision towards music, wdlp(n)) is greater than "0" (see 804), the sub-mode selector 800 selects the third coding sub-mode. The sub-mode flag $F_{tfsm}$ is then set to "3" (see 808). Although the input sound signal 101 is classified neither as speech nor as music by the classifier 701, the selector 800 detects "music"-like characteristics in the input sound signal 101 and selects the third coding sub-mode (sub-mode flag $F_{tfsm}$ = 3). Such a sound signal segment is still considered as non-music, but the sub-mode flag $F_{tfsm}$ set to "3" indicates that the samples include high-frequency or tonal content.
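The four selection rules above reduce to a short decision chain. The sketch below assumes the frame has already been placed in the unclear signal type category by the classifier; the function and parameter names are hypothetical, and only the 9.2 kbps threshold and the comparisons against zero come from the text.

```python
def select_submode(bitrate_bps, music_probability, transition_counter):
    """Return the sub-mode flag F_tfsm for a frame of the unclear category.

    bitrate_bps        -- bit rate available for coding the frame
    music_probability  -- weighted speech/music decision wdlp(n)
    transition_counter -- greater than 0 when a temporal attack may be present
    """
    if bitrate_bps <= 9200:
        return 0  # backward-interoperable legacy model
    if music_probability > 0:
        return 3  # music-like: high-frequency or tonal content
    if transition_counter > 0:
        return 2  # possible temporal attack: favour the lower spectrum
    return 1      # speech-like characteristics


if __name__ == "__main__":
    for args in [(9200, 1.0, 0), (13200, -0.5, 0), (13200, 0.0, 2), (13200, 0.3, 2)]:
        print(args, "->", select_submode(*args))
```

Checking the music probability before the transition counter mirrors the text, where the third sub-mode does not depend on the temporal attack condition.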
Reference [4] describes the probability that the input sound signal 101 is speech, music, or something in between. When the speech/music classification decision is unclear, the signal is considered to have some music characteristics if the probability wdlp(n) is greater than 0. The following table shows the thresholds at which the probability is high enough for the signal to be considered music or speech.
Table 1: Probability thresholds for the unclear category
The selected coding sub-mode (for example, the sub-mode flag $F_{tfsm}$) is transmitted in the bitstream to the remote decoder. The path selected inside the decoder depends on the signaling bits included in the bitstream. Once the decoder detects a frame coded using hybrid time-domain/frequency-domain coding, it decodes the sub-mode flag $F_{tfsm}$ from the bitstream. If the decoded sub-mode flag $F_{tfsm}$ is "0", the EVS backward-interoperable legacy unified time-domain and frequency-domain coding model is used to decode the remainder of the bitstream. If, on the other hand, the sub-mode flag $F_{tfsm}$ differs from "0", sub-mode decoding is followed. The decoder replicates the procedure followed by the encoder, in particular the bit allocation between the time domain and the frequency domain and the bit allocation among the frequency bands, as described later in Section 6.2.
2) Decision on the subframe length
In typical CELP, the input sound signal samples are processed in frames of 10-30 ms, and these frames are divided into subframes for adaptive codebook and fixed codebook analysis. For example, a 20 ms frame (256 samples at an internal sampling rate of 12.8 kHz) can be used and divided into four 5 ms subframes. A variable subframe length is a feature used to integrate the time domain and the frequency domain into a single coding mode. The subframe length can vary from the typical one quarter of the frame length to one half of the frame length or the full frame length. Of course, another number of subframes (subframe length) can be implemented.
As shown in Figure 2, the parameter analysis operation 152 of the unified time-domain/frequency-domain CELP coding method 150 includes an operation 259 of determining the high spectral dynamic of the input sound signal 101 and an operation 260 of calculating the number of subframes on a frame-by-frame basis. To perform operations 259 and 260, the pre-processor 102 of the unified time-domain/frequency-domain CELP coding device 100 includes a high spectral dynamic analyzer 209 and a calculator 210 of the number of subframes, respectively.
The decision on the subframe length (number of subframes), or time support, is made by the calculator 210 based on the available bit rate and on an analysis of the input sound signal, in particular the high spectral dynamic of the input sound signal 101 from the analyzer 209 and the open-loop pitch analysis, including the smoothed open-loop pitch correlation $C_{st}$, from the analyzer 203. The high spectral dynamic analyzer 209 is responsive to information from the spectral analyzer 202 to determine the high spectral dynamic of the input sound signal 101. For example, as described in ITU-T Recommendation G.718, Reference [5], Section 6.7.2.2, the high spectral dynamic is computed as the input spectrum without its noise floor, which gives a representation of the input spectral dynamic. When the average spectral dynamic of the input sound signal 101 in the band between 4.4 kHz and 6.4 kHz, as determined by the analyzer 209, is below, for example, 9.6 dB and the last frame was considered to have a high spectral dynamic, the input sound signal 101 is no longer considered to have a high spectral dynamic. In that case, more bits can be allocated to the frequencies below, for example, 4 kHz, either by adding more subframes to the time-domain coding mode or by forcing more pulses into the lower-frequency part of the frequency-domain coding mode.
On the other hand, if the average spectral dynamic of the input sound signal 101 increases by more than, for example, 4.5 dB relative to the average spectral dynamic of the last frame that was not considered to have a high spectral dynamic, as determined by the analyzer 209, the input sound signal 101 is considered to have high spectral dynamic content above, for example, 4 kHz. In that case, depending on the available bit rate, some additional bits are used for coding the high frequencies of the input sound signal 101, allowing one or more frequency pulses to be coded.
The subframe length determined by the calculator 210 (Figure 2) also depends on the bit budget available for coding the input sound signal 101. At very low bit rates, for example below 9 kbps, only one subframe is available for time-domain coding; otherwise the number of remaining bits would be insufficient for frequency-domain coding. At medium bit rates, for example between 9 kbps and 16 kbps, one subframe is used when the high frequencies contain high spectral dynamic content, and two subframes are used otherwise. At medium-high bit rates, for example around 16 kbps and above, the four (4) subframe case also becomes available if the smoothed open-loop pitch correlation $C_{st}$ defined above is higher than, for example, 0.8.
The cases with one or two subframes limit the time-domain coding to the adaptive codebook contribution only (with coded pitch lag and pitch gain), that is, no fixed codebook is used. The case with four (4) subframes allows both adaptive and fixed codebook contributions if the available bit budget is sufficient. The four (4) subframe case is allowed at bit rates from around 16 kbps upward. Because of bit budget limitations, the time-domain excitation contribution consists only of the adaptive codebook contribution at the lower bit rates. The fixed codebook contribution can be added at higher bit rates, for example starting at 24 kbps. In all cases, the time-domain coding efficiency is evaluated afterward to decide up to which frequency (the cut-off frequency discussed below) such time-domain coding is valuable.
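The bit-rate rules of the two preceding paragraphs can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the thresholds are the example values quoted in the text, and the fallback used at 16 kbps and above when $C_{st}$ does not exceed 0.8 (reusing the medium-rate rule) is an assumption, since the text does not spell it out.

```python
def select_num_subframes(bitrate_bps, high_spectral_dynamic, c_st):
    """Pick the number of subframes for the time-domain contribution.

    bitrate_bps           -- available bit rate in bits per second
    high_spectral_dynamic -- True if the high band has high spectral dynamic
    c_st                  -- smoothed open-loop pitch correlation
    """
    if bitrate_bps < 9000:
        # Very low rate: a single subframe, to leave bits for frequency coding.
        return 1
    if bitrate_bps >= 16000 and c_st > 0.8:
        # Medium-high rate with strong pitch correlation: four subframes.
        return 4
    # Medium rate, and (assumption) fallback for higher rates with weak correlation.
    return 1 if high_spectral_dynamic else 2


def uses_fixed_codebook(num_subframes, bitrate_bps):
    """Fixed codebook only in the four-subframe case, from about 24 kbps."""
    return num_subframes == 4 and bitrate_bps >= 24000


print(select_num_subframes(12000, False, 0.5))
```

With one or two subframes the sketch never enables the fixed codebook, which mirrors the statement that those cases carry the adaptive codebook contribution only.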
When the input sound signal 101 is classified by the classifier 701 in the unclear signal type category and the sub-mode flag $F_{tfsm}$ is greater than "0", the alternative implementation of Figures 7 and 8 uses the first, second, or third coding sub-mode defined above.
The sound signal classifier 701 sets the number of subframes to four (4), unless the sub-mode flag $F_{tfsm}$ is set to "1" or "2" (first or second coding sub-mode selected), meaning that the content of the input sound signal 101 is closer to speech ("speech"-like characteristics or a possible temporal attack detected in the input sound signal 101), and the available bit rate is below 15 kbps. Specifically:
- In the first or second coding sub-mode (sub-mode flag $F_{tfsm}$ set to "1" or "2"), the sound signal classifier 701 sets the number of subframes to four (4), unless the available bit rate for coding the input sound signal 101 is below 15 kbps, in which case a coding mode using two (2) subframes is selected. In both cases, a corresponding number of fixed codebooks is used, that is, two (2) or four (4) fixed codebooks; and
- In the third coding sub-mode (sub-mode flag $F_{tfsm}$ set to "3", meaning that the content of the input sound signal 101 is closer to music, "music"-like characteristics having been detected in the input sound signal 101), the sound signal classifier 701 sets the number of subframes to four (4), but no fixed codebook contribution is used, so that more bits remain available for the frequency-domain excitation contribution, unless the available bit rate for coding the input sound signal 101 is greater than or equal to 22.6 kbps.
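The two rules above reduce to a small lookup, sketched below. The function name and the returned tuple are hypothetical, and the choice of four fixed codebooks (one per subframe) for the third sub-mode at 22.6 kbps and above is an assumption; the text only states that the fixed codebook contribution becomes usable at that rate.

```python
def submode_subframe_config(f_tfsm, bitrate_bps):
    """Return (number of subframes, number of fixed codebooks) for a sub-mode.

    f_tfsm      -- sub-mode flag, expected to be 1, 2 or 3
    bitrate_bps -- available bit rate in bits per second
    """
    if f_tfsm in (1, 2):
        # Speech-like content: two subframes below 15 kbps, four otherwise,
        # each subframe carrying its own fixed codebook.
        num_subframes = 2 if bitrate_bps < 15000 else 4
        return num_subframes, num_subframes
    if f_tfsm == 3:
        # Music-like content: always four subframes; the fixed codebook is
        # dropped below 22.6 kbps to save bits for the frequency-domain part.
        # Assumption: one fixed codebook per subframe once it is enabled.
        return 4, (4 if bitrate_bps >= 22600 else 0)
    raise ValueError("sub-mode flag must be 1, 2 or 3")


print(submode_subframe_config(3, 16400))
```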
3) Closed-loop pitch analysis
In the unified time-domain/frequency-domain CELP coding device 100 and method 150 (Figure 1), the hybrid time-domain/frequency-domain coding method 170 and the corresponding hybrid time-domain/frequency-domain encoder 120 are used when the selector 205 selects generic audio as the classification of the input sound signal 101 and no temporal attack is detected in the detector 208. Alternatively, in the unified time-domain/frequency-domain CELP coding device 700 and method 750 (Figure 7), the hybrid time-domain/frequency-domain coding method 770 and the corresponding hybrid time-domain/frequency-domain encoder 720 are used when the sound signal classifier 701 classifies the input sound signal 101 in the "unclear signal type" category and one of the first, second, and third coding sub-modes defined above is selected (sub-mode flag $F_{tfsm}$ set to "1", "2", or "3").
When the hybrid time-domain/frequency-domain coding mode is used, a closed-loop pitch analysis is performed, followed if needed by a fixed algebraic codebook search. For that purpose, the hybrid time-domain/frequency-domain coding method 170/770 includes an operation 155 of calculating a time-domain excitation contribution. To perform operation 155, the hybrid time-domain/frequency-domain encoder 120/720 includes a calculator 105 of the time-domain excitation contribution. The calculator 105 itself includes an analyzer 211 (Figure 2) that is responsive to the open-loop pitch analysis made in the open-loop pitch analyzer 203 (or the pre-processor 702) and to the subframe length (or number of subframes per frame) determined in the calculator 210 or the sound signal classifier 701, to perform an operation 261 of closed-loop pitch analysis. Closed-loop pitch analysis is well known to those of ordinary skill in the art; an example implementation is described in ITU-T Recommendation G.718, Reference [5], Section 6.8.4.1.4.1. The closed-loop pitch analysis yields the pitch parameters, also called adaptive codebook parameters, which mainly consist of a pitch lag (adaptive codebook index T) and a pitch gain (adaptive codebook gain b). The adaptive codebook contribution is usually the past excitation at delay T or an interpolated version of it. The adaptive codebook index T is coded and transmitted to the remote decoder. The pitch gain b is also quantized and transmitted to the remote decoder.
When the closed-loop pitch analysis has been completed in operation 261 and a fixed codebook contribution is used, the calculator 105 of the time-domain excitation contribution includes a fixed algebraic codebook 212 that is searched during an operation 262 of fixed codebook search to find the best fixed codebook parameters, which usually comprise a fixed codebook index and a fixed codebook gain. The fixed codebook index and gain form the fixed codebook contribution. The fixed codebook index is coded and transmitted to the remote decoder. The fixed codebook gain is also quantized and transmitted to the remote decoder. The fixed algebraic codebook and its search are considered well known to those of ordinary skill in the art of CELP coding and are therefore not described further in this disclosure.
The adaptive codebook index and gain, together with the fixed codebook index and gain (if used), form the time-domain CELP excitation contribution.
4) Frequency transform
During frequency-domain coding in the hybrid time-domain/frequency-domain coding mode, two signals are represented in a transform domain (for example, in the frequency domain). In one embodiment, the time-to-frequency transform can be implemented using a 256-point type II (or type IV) DCT (Discrete Cosine Transform), which gives a resolution of 25 Hz at an internal sampling rate of 12.8 kHz, but any other suitable transform can be used. If another transform is used, the frequency resolution (defined above), the number of frequency bands, and the number of frequency bins per band (defined further below) may need to be revised accordingly.
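The 25 Hz figure follows from spreading the 0 to 6.4 kHz band over the 256 DCT coefficients. The short check below reproduces that arithmetic; the helper names are illustrative only and do not come from the disclosure.

```python
def dct_bin_resolution_hz(sampling_rate_hz, num_points):
    """Frequency spacing of an N-point DCT covering 0 Hz to the Nyquist rate."""
    return (sampling_rate_hz / 2) / num_points


def frame_duration_ms(num_samples, sampling_rate_hz):
    """Duration in milliseconds of a frame of num_samples samples."""
    return 1000.0 * num_samples / sampling_rate_hz


print(dct_bin_resolution_hz(12800, 256))   # bin spacing in Hz
print(frame_duration_ms(256, 12800))       # frame length in ms
```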
As described above, in the unified time-domain/frequency-domain CELP coding device 100 and method 150 (Figures 1 and 2), the hybrid time-domain/frequency-domain coding mode is used when the selector 205 selects generic audio as the classification of the input sound signal 101 and no temporal attack is detected in the detector 208. Alternatively, in the unified time-domain/frequency-domain CELP coding device 700 and method 750 (Figure 7), the hybrid time-domain/frequency-domain coding mode is used when the sound signal classifier 701 classifies the input sound signal 101 in the "unclear signal type" category. The hybrid time-domain/frequency-domain encoder 120/720 includes a calculator 107 of the frequency-domain excitation contribution (Figures 1 and 7), which performs an operation 157 of calculating the frequency-domain excitation contribution in response to the input LP residual $res(n)$ produced by the operation 251 of LP analysis of the input sound signal 101 performed by the analyzer 201 (and the pre-processor 702) (Reference [5]). As shown in Figure 2, the calculator 107 can compute a DCT 213, for example a type II DCT, of the input LP residual $res(n)$. The hybrid time-domain/frequency-domain encoder 120/720 also includes a calculator 106 (Figures 1 and 7) for performing an operation 156 of calculating a frequency transform of the time-domain excitation contribution. As shown in Figure 2, the calculator 106 can compute a DCT 214, for example a type II DCT, of the time-domain excitation contribution. The frequency transform $f_{res}$ of the input LP residual and the frequency transform $f_{exc}$ of the time-domain CELP excitation contribution can be computed using, for example, the following expressions:
and:
where $res(n)$ is the input LP residual, $e_{td}(n)$ is the time-domain excitation contribution, and $N$ is the frame length. In a possible implementation, the frame length is 256 samples for a corresponding internal sampling rate of 12.8 kHz. The time-domain excitation contribution is given by the following relation:
$$e_{td}(n) = b\,v(n) + g\,c(n)$$
where $v(n)$ is the adaptive codebook contribution, $b$ is the adaptive codebook gain, $c(n)$ is the fixed codebook contribution, and $g$ is the fixed codebook gain. Note that the time-domain excitation contribution may consist only of the adaptive codebook contribution, as described in the foregoing description.
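As a rough illustration of this step, the sketch below builds $e_{td}(n)$ from the two codebook contributions and applies a type II DCT. The orthonormal scaling ($\sqrt{1/N}$ for $k = 0$, $\sqrt{2/N}$ otherwise) is an assumption chosen for the example, the function names are hypothetical, and the direct $O(N^2)$ loop is used only for readability.

```python
import math


def time_domain_excitation(v, c, b, g):
    """e_td(n) = b*v(n) + g*c(n); pass g = 0 for adaptive-codebook-only frames."""
    return [b * vn + g * cn for vn, cn in zip(v, c)]


def dct_ii(x):
    """Type II DCT with orthonormal scaling (scaling is an assumption here)."""
    n = len(x)
    out = []
    for k in range(n):
        acc = sum(x[i] * math.cos(math.pi / n * (i + 0.5) * k) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * acc)
    return out


# Toy 4-sample "frame"; a real frame would hold 256 samples.
e_td = time_domain_excitation([1.0, 2.0, 0.0, -1.0], [0.0, 1.0, 0.0, 0.0], b=0.5, g=2.0)
f_exc = dct_ii(e_td)
print(f_exc)
```

The same `dct_ii` routine would be applied to the input LP residual to obtain $f_{res}$, so that both signals share one frequency grid.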
5) Cut-off frequency of the time-domain contribution
When the sound signal samples are classified as generic audio (Figure 1) or in the "unclear signal type" category (Figure 7), the time-domain excitation contribution does not always bring much coding improvement compared to frequency-domain coding. Typically, it improves the coding of the lower part of the spectrum, while the improvement in the higher part of the spectrum is minimal. The hybrid time-domain/frequency-domain encoder 120/720 includes a cut-off frequency finder and filter 108 (Figures 1 and 7) for performing an operation 158 of determining a cut-off frequency above which the coding improvement provided by the time-domain excitation contribution becomes too low to be of value. As shown in Figure 2, the cut-off frequency finder and filter 108 includes a calculator 215 of the cut-off frequency and a filter 216.
The operation 265 of estimating the cut-off frequency of the time-domain excitation contribution is first carried out by the calculator 215 (Figure 2) using a calculator 303 (Figures 3 and 4), which performs an operation 353 of computing, for each frequency band, the normalized cross-correlation between the frequency transform of the input LP residual 301 from the calculator 107 and the frequency transform of the time-domain excitation contribution 302 from the calculator 106 (designated $f_{res}$ and $f_{exc}$ respectively, as defined in Section 4 above). The last frequency $L_f$ included in each of, for example, sixteen (16) frequency bands is defined in Hz as:
For this illustrative example, for a 20 ms frame at an internal sampling rate of 12.8 kHz, the number of frequency bins per band $B_b$, the cumulative frequency bins per band $C_{Bb}$, and the normalized cross-correlation per band $C_c(i)$ are defined, for example, as follows:
where
and
where $B_b$ is the number of frequency bins $j$ per band, $C_{Bb}$ is the cumulative number of frequency bins per band, $C_c(i)$ is the normalized cross-correlation of band $i$, and the two energy terms are the excitation energy of a band and, similarly, the residual energy of a band.
The calculator 215 of the cut-off frequency includes a smoother 304 of the cross-correlation through the frequency bands (Figures 3 and 4), which performs an operation 354 of smoothing the cross-correlation vector between the different frequency bands. More specifically, the smoother 304 computes a new cross-correlation vector using, for example, the following relation:
where, in the illustrative embodiment,
$\alpha = 0.95$; $\delta = (1 - \alpha)$; $N_b = 13$; $\beta = \delta / 2$
The calculator 215 of the cut-off frequency also includes a calculator 305 (Figures 3 and 4), which performs an operation 355 of computing the average of the new cross-correlation vector over the first $N_b$ frequency bands (for example, $N_b = 13$, representing 5575 Hz).
The calculator 215 of the cut-off frequency also includes a cut-off frequency module 306 (Figure 3). As shown in Figure 4, the cut-off frequency module 306 includes a limiter 406 of the cross-correlation, a normalizer 407 of the cross-correlation, and a finder 408 of the frequency band where the cross-correlation is the lowest. More specifically, the limiter 406 performs an operation 456 of limiting the average of the cross-correlation vector to a minimum value of 0.5, and the normalizer 407 performs an operation 457 of normalizing the limited average of the cross-correlation vector between 0 and 1. The finder 408 performs an operation 458 of obtaining a first estimate of the cut-off frequency by finding the last frequency $L_f$ of the frequency band $i$ that minimizes the difference between that last frequency $L_f$ and the normalized average of the cross-correlation vector multiplied by half the internal sampling rate of the input sound signal 101 ($F_s/2$):
and where
$F_s = 12800$ Hz and
In the above relation, $f_{tc1}$ denotes the first estimate of the cut-off frequency.
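The limit, normalize, and search steps can be sketched as follows. Several points are assumptions made only for illustration: the linear mapping of the limited average from [0.5, 1] onto [0, 1], the search over every band in the supplied table, and the evenly spaced `EXAMPLE_BAND_EDGES_HZ` table, which is hypothetical and is not the band table of the disclosure.

```python
# Hypothetical last frequency (Hz) of 16 evenly spaced bands, for illustration only.
EXAMPLE_BAND_EDGES_HZ = [400 * (i + 1) for i in range(16)]


def first_cutoff_estimate(cc_smoothed, band_last_freq_hz, n_b=13, fs_hz=12800):
    """First cut-off estimate from the smoothed per-band cross-correlation."""
    # Average over the first n_b bands (operation 355).
    average = sum(cc_smoothed[:n_b]) / n_b
    # Limit the average to a minimum of 0.5 (operation 456).
    limited = max(average, 0.5)
    # Normalize between 0 and 1 (operation 457); linear mapping is an assumption.
    normalized = min(max(2.0 * (limited - 0.5), 0.0), 1.0)
    # Pick the band edge closest to normalized * Fs/2 (operation 458).
    target_hz = normalized * fs_hz / 2
    return min(band_last_freq_hz, key=lambda lf: abs(lf - target_hz))


print(first_cutoff_estimate([0.75] * 16, EXAMPLE_BAND_EDGES_HZ))
```

A weakly correlated frame (average at or below 0.5) maps to the lowest band edge, while a perfectly correlated one maps to the top of the spectrum, which matches the intent that time-domain coding is retained only where it tracks the residual well.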
At low bit rates, where the normalized average is never very high (in the case of the unified time-domain/frequency-domain coding device 100 and method 150 of Figure 1), or when the sub-mode flag $F_{tfsm}$ is greater than "0", meaning that the input sound signal is classified in the "unclear signal type" category (in the case of the unified time-domain/frequency-domain coding device 700 and method 750 of Figure 7), or to artificially increase its value so as to give more weight to the time-domain excitation contribution, the normalizer 407 can upscale the normalized average by a fixed scaling factor. As a non-limiting example, at bit rates below 8 kbps, the first estimate of the cut-off frequency is multiplied by 2.
The precision of the cut-off frequency can be increased by adding the following component to the computation. For that purpose, the cut-off frequency module 306 includes an extrapolator 410 (Figure 4) of the eighth harmonic, which is computed in a corresponding operation 460 from the minimum (lowest) pitch lag value of the time-domain excitation contribution over all subframes of the frame, using, for example, the following relation:
where $F_s = 12800$ Hz is the internal sampling rate or frequency, $N_{sub}$ is the number of subframes in the frame, and $T(i)$ is the adaptive codebook index or pitch lag of subframe $i$.
The cut-off frequency module 306 includes a finder 409 (Figure 4) of the frequency band in which the eighth harmonic is located. More specifically, for the subframes $i < N_{sub}$, the finder 409 performs an operation 459 of searching for the highest frequency band for which, for example, the following inequality is still verified:
The index of that band is denoted $i_{8th}$, and it indicates the frequency band in which the eighth harmonic is likely to be located.
The cut-off frequency module 306 finally includes a selector 411 of the final cut-off frequency $f_{tc}$ (Figure 4). More specifically, the selector 411 performs an operation 461 of keeping the higher of the first estimate $f_{tc1}$ of the cut-off frequency from the finder 408 and the last frequency of the frequency band in which the eighth harmonic is located, from the finder 409, using the following relation:
$$f_{tc} = \max\big(L_f(i_{8th}),\ f_{tc1}\big)$$
When the coding sub-modes are used, in the case of the unified time-domain/frequency-domain coding device 700 and method 750 of Figure 7, the cut-off frequency $f_{tc}$ is further thresholded using, for example, the following relation:
$$f_{tc} = \max\big(\max(L_f(i_{8th}),\ 2775),\ f_{tc1}\big)$$
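The refinement by the eighth harmonic and the final selection can be sketched as below. The harmonic is taken as eight times the pitch frequency $F_s / \min_i T(i)$, and the band search keeps the highest band whose last frequency does not exceed that harmonic; both are assumptions consistent with the description rather than formulas quoted from it. The band table is hypothetical, as are the function names.

```python
# Hypothetical last frequency (Hz) of 16 evenly spaced bands, for illustration only.
EXAMPLE_BAND_EDGES_HZ = [400 * (i + 1) for i in range(16)]


def eighth_harmonic_hz(pitch_lags, fs_hz=12800):
    """Eighth harmonic of the pitch, from the lowest pitch lag of the frame."""
    return 8.0 * fs_hz / min(pitch_lags)


def band_of_eighth_harmonic(h8_hz, band_last_freq_hz):
    """Index of the highest band whose last frequency is <= h8_hz (assumption)."""
    index = 0
    for i, last_freq in enumerate(band_last_freq_hz):
        if last_freq <= h8_hz:
            index = i
    return index


def final_cutoff_hz(f_tc1, pitch_lags, band_last_freq_hz, use_submodes=False):
    """Keep the higher of the first estimate and the eighth-harmonic band edge."""
    i_8th = band_of_eighth_harmonic(eighth_harmonic_hz(pitch_lags), band_last_freq_hz)
    candidate = band_last_freq_hz[i_8th]
    if use_submodes:
        # Additional 2775 Hz floor applied when coding sub-modes are used.
        candidate = max(candidate, 2775)
    return max(candidate, f_tc1)


lags = [64, 80, 100, 128]          # pitch lags of four subframes, in samples
print(final_cutoff_hz(1200, lags, EXAMPLE_BAND_EDGES_HZ))
print(final_cutoff_hz(1200, lags, EXAMPLE_BAND_EDGES_HZ, use_submodes=True))
```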
As shown in Figures 3 and 4:
- The calculator 215 of the cut-off frequency also includes a decider 307 (Figure 3) for performing an operation 357 of deciding on the number of frequency bins to be zeroed;
- The decider 307 itself includes an analyzer 415 (Figure 4) for performing an operation 465 of parameter analysis, and a selector 416 (Figure 4) for performing an operation 466 of selecting the frequency bins to be zeroed; and
- The filter 216 (Figure 2) operates in the frequency domain and includes a zeroer 308 (Figure 3) for performing the filtering operation 266. The corresponding operation 358 zeroes the frequency bins that the decider 307 has decided to zero. The zeroer 308 can zero (a) all the frequency bins (zeroer 417 and corresponding zeroing operation 467 in Figure 4) or (b) only the higher-frequency bins located above the cut-off frequency $f_{tc}$, supplemented with a smooth transition region (filter 418 and corresponding filtering operation 468 in Figure 4). The transition region lies above the cut-off frequency $f_{tc}$ and below the zeroed bins, and it provides a smooth spectral transition between the unchanged spectrum below the cut-off frequency $f_{tc}$ and the zeroed bins at higher frequencies.
As a non-limiting illustrative example, when the cut-off frequency $f_{tc}$ from the selector 411 is lower than or equal to 775 Hz, the analyzer 415 considers the cost of the time-domain excitation contribution to be too high. The selector 416 then selects all the frequency bins of the frequency representation of the time-domain excitation contribution to be zeroed, and the zeroer 417 forces all the frequency bins to zero and also forces the cut-off frequency $f_{tc}$ to zero. All the bits allocated to the time-domain excitation contribution are then reallocated to the frequency-domain coding mode. Otherwise, the analyzer 415 forces the selector 416 to select the high-frequency bins above the cut-off frequency $f_{tc}$ to be zeroed by the filter (zeroer) 418.
Finally, the calculator 215 of the cut-off frequency includes a quantizer 309 (Figures 3 and 4) for performing an operation 359 of quantizing the cut-off frequency $f_{tc}$ into a quantized version $f_{tcQ}$ of that cut-off frequency, for transmission to the remote decoder. For example, if three (3) bits are associated with the cut-off frequency parameter, a possible set of output values (in Hz) can be defined as follows:
$$f_{tcQ} = \{0,\ 1175,\ 1575,\ 1975,\ 2375,\ 2775,\ 3175,\ 3575\}$$
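A minimal sketch of such a 3-bit quantizer is given below. The rule that forces the output to 0 when $f_{tc}$ is at or below 775 Hz comes from the preceding example; choosing the nearest non-zero table entry otherwise is an assumption, since the text does not state the quantization rule. The function name is hypothetical.

```python
# Output values listed in the text, in Hz; the index fits in three bits.
F_TCQ_VALUES_HZ = (0, 1175, 1575, 1975, 2375, 2775, 3175, 3575)


def quantize_cutoff(f_tc_hz):
    """Return (3-bit index, quantized cut-off frequency in Hz)."""
    if f_tc_hz <= 775:
        # Time-domain contribution judged too costly: frequency-only coding.
        return 0, 0
    # Assumption: pick the nearest non-zero entry of the table.
    index = min(range(1, len(F_TCQ_VALUES_HZ)),
                key=lambda i: abs(F_TCQ_VALUES_HZ[i] - f_tc_hz))
    return index, F_TCQ_VALUES_HZ[index]


print(quantize_cutoff(2000))
```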
Many mechanisms can be used to stabilize the selection of the final cut-off frequency $f_{tc}$ by the selector 411, so as to prevent the quantized version $f_{tcQ}$ from switching between 0 and 1175 in inappropriate signal segments. To achieve this, as a non-limiting example, the analyzer 415 is responsive to the long-term average pitch gain $G_{lt}$ 412 from the closed-loop pitch analyzer 211 (Figure 2), the open-loop pitch correlation $C_{ol}$ 413 from the open-loop pitch analyzer 203, and the smoothed open-loop pitch correlation $C_{st}$ 414. To prevent switching to frequency-only coding, the analyzer 415 does not allow such frequency-only coding, that is, $f_{tcQ}$ cannot be set to 0, when, for example, any of the following conditions is met:
$f_{tc} > 2375$ Hz
or
$f_{tc} > 1175$ Hz and $C_{ol} > 0.7$ and $G_{lt} > 0.6$
or
$f_{tc} \geq 1175$ Hz and $C_{st} > 0.8$ and $G_{lt} \geq 0.4$
or
$f_{tcQ}(t-1) \neq 0$ and $C_{ol} > 0.5$ and $C_{st} > 0.5$ and $G_{lt} \geq 0.6$
where $C_{ol}$ is the open-loop pitch correlation 413, and $C_{st}$ is the smoothed version 414 of the open-loop pitch correlation, defined as $C_{st} = 0.9 \cdot C_{ol} + 0.1 \cdot C_{st}$. Furthermore, $G_{lt}$ (item 412 of Figure 4) is the long-term average of the pitch gain obtained by the closed-loop pitch analyzer 211 within the time-domain excitation contribution. The long-term average 412 of the pitch gain is defined in terms of the average pitch gain over the current frame. To further reduce the rate of switching between frequency-only coding and hybrid time-domain/frequency-domain coding, a delay can be added.
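The four example conditions listed above combine into a single predicate, sketched here. The function and argument names are hypothetical; the thresholds are the example values from the text.

```python
def frequency_only_coding_disallowed(f_tc, c_ol, c_st, g_lt, prev_f_tcq):
    """True when f_tcQ must not be set to 0 for the current frame.

    f_tc       -- final cut-off frequency of the current frame, in Hz
    c_ol       -- open-loop pitch correlation
    c_st       -- smoothed open-loop pitch correlation
    g_lt       -- long-term average pitch gain
    prev_f_tcq -- quantized cut-off frequency of the previous frame, in Hz
    """
    return (
        f_tc > 2375
        or (f_tc > 1175 and c_ol > 0.7 and g_lt > 0.6)
        or (f_tc >= 1175 and c_st > 0.8 and g_lt >= 0.4)
        or (prev_f_tcq != 0 and c_ol > 0.5 and c_st > 0.5 and g_lt >= 0.6)
    )


print(frequency_only_coding_disallowed(900, 0.6, 0.6, 0.7, prev_f_tcq=1575))
```

The last condition acts as a form of hysteresis: a frame that follows a hybrid-coded frame stays in hybrid coding as long as the pitch measures remain moderately high, even when its own cut-off estimate is low.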
6) Frequency-domain coding
6.1) Creating the difference vector
Once the cut-off frequency $f_{tc}$ of the time-domain excitation contribution is determined, the frequency-domain coding is performed. To perform this frequency-domain coding, the hybrid time-domain/frequency-domain coding method 170/770 includes a subtraction operation 159, a frequency quantization operation 160, and an addition operation 161. The hybrid time-domain/frequency-domain encoder 120/720 includes a subtractor or calculator 109, a frequency quantizer 110, and an adder 111 to perform operations 159, 160, and 161, respectively.
图5是同时示出频率量化器110和对应的频率量化操作160的概述的示意性框图。此外,图6是频率量化器110和对应的频率量化操作160的更详细结构的示意性框图。Figure 5 is a schematic block diagram showing an overview of both the frequency quantizer 110 and the corresponding frequency quantization operation 160. Furthermore, Figure 6 is a schematic block diagram showing a more detailed structure of the frequency quantizer 110 and the corresponding frequency quantization operation 160.
The subtractor or calculator 109 (Figures 1, 2, 5 and 6) forms a first portion of a difference vector $f_d$, from zero up to the cut-off frequency $f_{tc}$ of the time-domain excitation contribution, as the difference between the frequency transform $f_{res}$ 502 (Figures 5 and 6) (or other frequency representation) of the input LP residual from DCT 213 (Figure 2) and the frequency transform $f_{exc}$ 501 (Figures 5 and 6) (or other frequency representation) of the time-domain excitation contribution from DCT 214 (Figure 2). For the following transition region of $f_{trans} = 2$ kHz (80 frequency bins in this example implementation), a downscale factor 603 (Figure 6) may be applied to the frequency transform $f_{exc}$ 501 (see multiplier 604 and the corresponding multiplication operation 654) before it is subtracted from the corresponding spectral portion of the frequency transform $f_{res}$ 502. The result of this subtraction forms the second portion of the difference vector $f_d$, covering the frequency range from $f_{tc}$ to $f_{tc} + f_{trans}$. The frequency transform $f_{res}$ 502 of the input LP residual is used for the remaining third portion of the difference vector $f_d$.
The downscaling of this portion of the difference vector $f_d$ by the downscale factor 603 can be performed with any type of fade-out function and can be shortened to only a few frequency bins. It can also be omitted when the available bit budget is judged sufficient to prevent energy oscillation artifacts when the cut-off frequency $f_{tc}$ changes. For example, with a resolution of 25 Hz, corresponding to one frequency bin $f_{bin} = 25$ Hz of a 256-point DCT at an internal sampling rate of 12.8 kHz, the difference vector can be built as:
$$f_d(k) = f_{res}(k) - f_{exc}(k), \quad \text{where } 0 \le k \le f_{tc}/f_{bin}$$
[expression for the transition region missing in the source], where $f_{tc}/f_{bin} < k \le (f_{tc} + f_{trans})/f_{bin}$
$$f_d(k) = f_{res}(k), \quad \text{otherwise}$$
where $f_{res}$, $f_{exc}$ and $f_{tc}$ have been defined in the foregoing description.
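The three portions of the difference vector can be sketched in a few lines. The fade-out used in the transition region is an assumption: the text allows any fade-out function, and a linear ramp is used here purely for illustration. The function name and defaults are hypothetical.

```python
def build_difference_vector(f_res, f_exc, f_tc, f_trans=2000.0, f_bin=25.0):
    """Build f_d from f_res and f_exc (equal-length sequences of bins)."""
    k_tc = int(f_tc / f_bin)                # last bin of the first portion
    k_end = int((f_tc + f_trans) / f_bin)   # last bin of the transition region
    n_trans = k_end - k_tc
    f_d = []
    for k, (res, exc) in enumerate(zip(f_res, f_exc)):
        if k <= k_tc:
            f_d.append(res - exc)           # full subtraction below f_tc
        elif k <= k_end:
            # Assumed linear fade-out of the excitation contribution.
            scale = 1.0 - (k - k_tc) / n_trans
            f_d.append(res - scale * exc)
        else:
            f_d.append(res)                 # residual only above the transition
    return f_d
```

With $f_{tc} = 1000$ Hz the first portion covers bins 0 to 40 and the transition region covers bins 41 to 120, which matches the 80 bins mentioned in the text.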
6.2) Frequency-domain bit allocation for the coding sub-modes
6.2.1) Allocating a portion of the available bits to the lower frequencies
In the unified time-domain/frequency-domain CELP coding method 750 shown in Figure 7, the hybrid time-domain/frequency-domain encoder 720 comprises a band selector and bit allocator 707, and the hybrid time-domain/frequency-domain coding method 770 comprises a corresponding operation 757 of band selection and bit allocation.
Figure 9 is a schematic block diagram showing concurrently the band selector and bit allocator 707 of Figure 7 and the corresponding operation 757 of band selection and bit allocation. In the alternative implementation of the unified time-domain/frequency-domain CELP coding method 150/750 of Figures 7 and 8, they distribute the available bit budget for frequency quantization of the difference vector $f_d$ when the input sound signal 101 is classified neither as speech nor as music.
Specifically, Figure 9 shows an innovative way in which the band selector and bit allocator 707 can distribute the available bits for frequency quantization when the input sound signal 101 is classified neither as speech nor as music but as an "unclear signal type", according to the previously selected coding sub-mode. In Figure 9, frequency quantization is performed per band. For simplicity, in the present illustrative example all bands have the same number of frequency bins, namely sixteen (16), at an internal sampling rate of 12.8 kHz. Band "0" is the lowest part of the spectrum and band "15" is the highest part.
To make the best use of the bits available for frequency quantization, the band selection and bit allocation operation 757 comprises a first operation 951 that pre-fixes a portion of the available bit budget (see 900) for quantizing the lower frequencies of the difference vector $f_d$, as a function of the quantized cut-off frequency $f_{tcQ}$ from the cut-off frequency finder and filter 108. To perform operation 951, an estimator 901 uses, for example, the following relation:
$$P_{Blf} = \max(\min(P_{Blf}, 0.75), 0.5)$$
where $P_{Blf}$ is the portion of the available bits allocated to frequency quantization of the lower frequencies of the difference vector $f_d$. In this example, the lower frequencies are the first five (5) bands, or the first two (2) kHz. The term $L_f(f_{tcQ})$ denotes the number of frequency bins up to the quantized cut-off frequency $f_{tcQ}$.
The estimator 901 then adjusts the portion $P_{Blf}$ of the available bits allocated to frequency quantization of the lower frequencies based on the coding sub-mode flag $F_{tfsm}$. If $F_{tfsm}$ is set to "2" (Figure 8), meaning that a possible temporal attack has been detected in the current frame of the input sound signal 101, the portion $P_{Blf}$ is increased by 10% of the available bits. If music-like characteristics are detected in the content of the current frame, as indicated by $F_{tfsm}$ set to "3", the portion $P_{Blf}$ is decreased by 10% of the available bits.
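The clamping and the sub-mode adjustment of the low-frequency share can be sketched as below. The initial value of the share is taken as an input because the relation that produces it is not available here; the function name is hypothetical, and applying the adjustment after the clamp is an assumption based on the order of the description.

```python
def low_frequency_bit_share(p_blf_initial: float, f_tfsm: int) -> float:
    """Clamp P_Blf to [0.5, 0.75], then adjust it for sub-modes 2 and 3."""
    p_blf = max(min(p_blf_initial, 0.75), 0.5)
    if f_tfsm == 2:        # possible temporal attack: favour low frequencies
        p_blf += 0.10
    elif f_tfsm == 3:      # music-like content: free bits for higher bands
        p_blf -= 0.10
    return p_blf
```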
6.2.2) Estimating the number of bands to quantize
Another parameter that affects the number of bits per band available for frequency quantization of the difference vector $f_d$ is the estimated maximum number $N_{Bmx}$ of bands of that vector to be quantized. In the illustrative example described here, at an internal sampling rate of 12.8 kHz, the maximum total number of bands $N_{tt}$ is sixteen (16).
When the coding sub-modes are used, the band selection and bit allocation operation 757 comprises an operation 952 of estimating the maximum number $N_{Bmx}$ of bands of the difference vector $f_d$ to be quantized. To perform operation 952, an estimator 902 sets $N_{Bmx}$ to "10" if the coding sub-mode flag $F_{tfsm}$ is set to "1" (first coding sub-mode selected), to "9" if $F_{tfsm}$ is set to "2" (second coding sub-mode selected), and to "13" if $F_{tfsm}$ is set to "3" (third coding sub-mode selected). The estimator 902 then readjusts $N_{Bmx}$ as a function of the bit budget available for frequency quantization of the difference vector $f_d$, using for example the following relation:
$$N_{Bmx} = \max(\min(\mathrm{trunc}(N_{Bmx} \cdot N_{Badj} + 0.5), N_{tt}), 5)$$
where $B_F$ is the number of bits available for frequency quantization of the difference vector $f_d$ (see 900), $B_T$ is the total bit rate available for coding the channel being processed (see 900), $F_{tfsm}$ is the sub-mode flag (see 900), and $N_{tt}$ is the maximum total number of bands.
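A minimal sketch of operation 952 follows. The adjustment factor $N_{Badj}$ is passed in as a plain argument because its defining relation is not available here; the function and constant names are hypothetical.

```python
import math

N_TT = 16                                  # maximum total number of bands
INITIAL_MAX_BANDS = {1: 10, 2: 9, 3: 13}   # per coding sub-mode flag F_tfsm


def estimate_max_bands(f_tfsm: int, n_badj: float, n_tt: int = N_TT) -> int:
    """Initial N_Bmx from the sub-mode, then readjusted and bounded to [5, N_tt]."""
    n_bmx = INITIAL_MAX_BANDS[f_tfsm]
    return max(min(math.trunc(n_bmx * n_badj + 0.5), n_tt), 5)
```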
The estimator 902 can further limit the maximum number of bands of the difference vector $f_d$ to be quantized in relation to the number of bits allocated to quantizing its middle and higher bands. For this limitation, it is assumed that the last lower band and the first band after it receive a similar number of bits $m_b$, approximately 17% of the bits $P_{Blf}$ allocated to frequency quantization of the lower frequencies. For the last band to be quantized, a minimum number $m_p$ of 4.5 bits is used to quantize at least one (1) frequency pulse. If the available bit rate $B_T$ is greater than or equal to 15 kbps, the minimum number of bits $m_p$ is nine (9), to allow more pulses to be quantized per band. If, however, the total available bit rate $B_T$ is below 15 kbps but the sub-mode flag $F_{tfsm}$ is set to "3", meaning that the content resembles music, the number of bits $m_p$ for the last band to be quantized is 6.75, to allow a more precise quantization. The estimator 902 then computes a corrected maximum number of bands $N'_{Bmx}$ using, for example, the following relation:
$$N'_{Bmx} = \min\left(N_{Bmx},\ 5 + \frac{B_F - P_{Blf} \cdot B_F}{0.5 \cdot (m_p + m_b)}\right)$$
where $N'_{Bmx}$ is the corrected maximum number of bands to be quantized, $N_{Bmx}$ is the estimated maximum number of bands, the number "5" is the minimum number of bands, $B_F$ is the number of bits available for frequency quantization of the difference vector $f_d$, $P_{Blf}$ is the portion of bits allocated to quantizing the five (5) lower bands, $m_p$ is the minimum number of bits allocated to frequency quantization of a band, and $m_b$ is the number of bits allocated to quantizing the first band after the five (5) lower bands.
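The selection of $m_p$ and the corrected band count can be sketched as follows. The names are hypothetical, and the result is left as a real number because the text does not state whether it is truncated.

```python
def min_bits_last_band(b_t_kbps: float, f_tfsm: int) -> float:
    """Minimum number of bits m_p for the last band to be quantized."""
    if b_t_kbps >= 15.0:
        return 9.0
    if f_tfsm == 3:        # music-like content below 15 kbps
        return 6.75
    return 4.5


def corrected_max_bands(n_bmx: float, b_f: float, p_blf: float,
                        m_p: float, m_b: float) -> float:
    """N'_Bmx = min(N_Bmx, 5 + (B_F - P_Blf * B_F) / (0.5 * (m_p + m_b)))."""
    return min(n_bmx, 5 + (b_f - p_blf * b_f) / (0.5 * (m_p + m_b)))
```

For instance, with 100 bits, a low-frequency share of 0.5, $m_p = 4.5$ and $m_b = 8.5$, the bits left for the upper bands support roughly 7.7 additional bands at 6.5 bits each, so an estimate of 13 bands is cut to about 12.7.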
After computing the maximum number of bands, the estimator 902 may perform an additional verification so that $m_p$ stays lower than or equal to $m_b$. Although this additional verification is optional, at low bit rates it helps distribute the bits more efficiently among the bands of the difference vector $f_d$.
6.2.3) Modifying the number of bits allocated to the lower frequencies
The band selection and bit allocation operation 757 comprises an operation 953 of computing the low-frequency bits. To perform operation 953, a calculator 903 is provided. If the computation of the corrected maximum number of bands $N'_{Bmx}$ results in fewer bands to quantize, the calculator 903 reallocates to the quantization of the lower bands the portion of bits previously allocated to the higher bands that are no longer relevant, using for example the following relation:
$$B_{LF} = P_{Blf} \cdot B_F + 0.5 \cdot (m_p + m_b) \cdot (N_{Bmx} - N'_{Bmx})$$
where $B_{LF}$ is the number of bits allocated to the five (5) lower bands, $B_F$ is the number of bits available for frequency quantization of the difference vector $f_d$, $P_{Blf}$ is the above-mentioned portion of bits from the estimator 901 allocated to frequency quantization of, for example, the five (5) lower bands, $m_p$ is the minimum number of bits allocated to quantizing a band, and $m_b$ is the number of bits allocated to quantizing the first band after the five (5) lower bands.
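As a short worked sketch of the relation for $B_{LF}$ (the function name is hypothetical): each band dropped from the estimate returns the average of $m_p$ and $m_b$ bits to the low-frequency budget.

```python
def low_frequency_bits(b_f: float, p_blf: float, m_p: float, m_b: float,
                       n_bmx: float, n_bmx_corrected: float) -> float:
    """B_LF = P_Blf * B_F + 0.5 * (m_p + m_b) * (N_Bmx - N'_Bmx)."""
    return p_blf * b_f + 0.5 * (m_p + m_b) * (n_bmx - n_bmx_corrected)
```

With 100 bits, a share of 0.5, $m_p = 4.5$, $m_b = 8.5$ and two bands dropped (13 down to 11), the lower bands receive 50 + 2 × 6.5 = 63 bits.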
6.2.4) Dual sorting of the bands
The band selection and bit allocation operation 757 comprises an operation 954 of band characterization. To perform operation 954, the band selector and bit allocator 707 comprises a band characterizer 904 which, once the bits have been split between the lower bands and the remaining bands, performs a dual sorting of the bands to decide the importance of each band. The first sorting consists of finding whether one or more bands have lower energy than their neighbouring bands. When this happens, the characterizer 904 marks these bands so that, even if the available bit budget is high, only the predetermined minimum number of bits $m_p$ can be allocated to frequency quantization of these low-energy bands. The second sorting consists of a position sorting of the medium-energy and higher-energy bands, for example in decreasing order of energy. These first and second sortings (dual sorting) are not performed on the lower bands; they are performed up to the maximum number of bands $N'_{Bmx}$. The operation 954 of band characterization can be summarized as follows:
where $P_{pb}(i)$ is set to "1" for the bands that will use only the minimum number of bits $m_p$, the associated position vector contains the positions of the medium-energy and higher-energy bands in decreasing order of energy, and $E(i)$ is the energy of each band. $C_{Bb}$ and $B_b$ are defined in Section 5 above. The difference vector $f_d$ has been defined in Section 6.1.
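The dual sorting can be sketched as follows. The exact test for a "low-energy" band is not available here, so the code assumes a band is marked when its energy is below that of both neighbours; that test, the function name and the return layout are all hypothetical.

```python
def characterize_bands(energy, n_low=5, n_max=None):
    """Return (min_bits_flag, ranked) for bands n_low .. n_max - 1.

    min_bits_flag[i] == 1 marks a band limited to the minimum bits m_p.
    ranked lists the remaining band indices in decreasing order of energy.
    """
    n_max = len(energy) if n_max is None else n_max
    min_bits_flag = [0] * len(energy)
    for i in range(n_low, n_max):
        left = energy[i - 1]
        # The last band has no right neighbour, so it is never flagged here.
        right = energy[i + 1] if i + 1 < len(energy) else energy[i]
        # Assumed criterion: lower energy than both neighbouring bands.
        if energy[i] < left and energy[i] < right:
            min_bits_flag[i] = 1
    ranked = sorted(
        (i for i in range(n_low, n_max) if not min_bits_flag[i]),
        key=lambda i: energy[i],
        reverse=True,
    )
    return min_bits_flag, ranked
```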
The energy $E(i)$ of each band of the difference vector $f_d$ is computed in the calculator 708 and corresponding operation 758 of Figures 7 and 9. The calculator 708 and operation 758 also compute the gain per band, as described with reference to the calculator 615 and operation 665 of Figure 6. The energy $E(i)$ of each band of the difference vector $f_d$ and the gain per band are quantized, for example as described with reference to the quantizer 616 and operation 666 of Figure 6, and both are transmitted to the distant decoder. In the implementation of Figure 7 of the unified time-domain/frequency-domain coding device 700 and method 750, the calculator 708 and operation 758 replace the calculator 615 and operation 665 as well as the quantizer 616 and operation 666.
6.2.5) Allocating bits to the selected bands
The band selection and bit allocation operation 757 comprises an operation 955 of final allocation of the bits per band. To perform operation 955, the band selector and bit allocator 707 comprises a final per-band bit allocator 905.
Once the bands have been characterized, the allocator 905 distributes among the selected bands the bit rate or number of bits $B_F$ available for frequency quantization of the difference vector $f_d$.
In a non-limiting example, for the first five (5) lower bands, the allocator 905 distributes linearly the bits $B_{LF}$ allocated to frequency quantization of the lower frequencies, the first (lowest) band receiving 23% of the bits $B_{LF}$ and the fifth (5th) lower band receiving the last 17% of the bits $B_{LF}$. In this manner, the lower frequencies of the spectrum of the difference vector $f_d$ can be quantized with sufficient precision to recover a better-quality synthesis of the input sound signal 101.
The allocator 905 distributes the remaining bits of $B_F$ allocated to frequency quantization of the difference vector $f_d$ over the other, middle and higher bands following a linear function, while also taking into account the previous band energy characterization (operation 954). More bits go to bands with higher energy than their neighbouring bands and fewer bits to bands with lower energy, so that the available bits are used more relevantly by quantizing the more important parts of the spectrum of the difference vector $f_d$ with higher precision. As a non-limiting example, the following relation illustrates how the bit allocation (operation 955) can be performed:
where $B_p(i)$ is the number of bits allocated to band $i$, $B_F$ is the number of bits available for frequency quantization of the difference vector $f_d$, $B_{LF}$ is the bit rate or number of bits allocated to the five (5) lower bands, $m_p$ is the minimum number of bits for quantizing the frequency pulses in a band, $P_{pb}(i)$ contains the positions of the bands that will use the minimum number of bits $m_p$, and $N'_{Bmx}$ is the maximum number of bands to be quantized.
If some bits remain unallocated after operation 955, the allocator 905 assigns them to the lower bands. As a non-limiting example, the allocator 905 assigns one remaining bit per band, starting from the fifth (5th) band and going back to the first band, and repeats the procedure as needed until all remaining bits are allocated.
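The linear split over the five lower bands and the handling of leftover bits can be sketched as below. Only the 23% and 17% end points and the fifth-to-first order come from the text; the function names and the use of real-valued bit counts are assumptions.

```python
def split_low_band_bits(b_lf, n_low=5, first=0.23, last=0.17):
    """Share B_LF linearly, from 23% for band 0 down to 17% for band 4."""
    step = (first - last) / (n_low - 1)
    return [b_lf * (first - step * i) for i in range(n_low)]


def distribute_leftover(bits_per_band, leftover, n_low=5):
    """Give one leftover bit per band, from the fifth band back to the first,
    wrapping around until no leftover bit remains."""
    bits = list(bits_per_band)
    band = n_low - 1
    while leftover > 0:
        bits[band] += 1
        leftover -= 1
        band = band - 1 if band > 0 else n_low - 1
    return bits
```

The five shares come out as 23%, 21.5%, 20%, 18.5% and 17%, which sum to 100% of $B_{LF}$.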
Later, the allocator 905 may have to floor, truncate or round the number of bits per band, depending on the algorithm used to quantize the frequency pulses and on a possible fixed-point implementation.
6.3) Searching frequency pulses
The hybrid time-domain/frequency-domain CELP coding method 170/770 comprises an operation 160 of frequency quantization of the difference vector $f_d$ (Figures 1, 2 and 7). To perform operation 160, the hybrid time-domain/frequency-domain CELP encoder 120/720 comprises the frequency quantizer 110 (219 in Figure 2).
Several methods can be used to quantize the difference vector $f_d$. In every case, frequency pulses have to be searched and quantized. In one possible implementation, the frequency quantizer 110 searches across the spectrum for the highest-energy pulses of the difference vector $f_d$. The pulse search can be as simple as splitting the spectrum into bands and allowing a certain number of pulses per band. The number of pulses per band depends on the available bit budget and on the position of the band within the spectrum. Typically, more pulses are allocated to the lower frequencies.
6.4) Quantized difference vector
Depending on the available bit rate, the frequency quantizer 110 can quantize the frequency pulses using different techniques. In one embodiment, at bit rates below 12 kbps, a simple search and quantization scheme can be used to code the position and sign of the pulses. This scheme is described below as a non-limiting example.
For frequencies below 3175 Hz, the simple search and quantization scheme uses an approach based on factorial pulse coding (FPC), which is described in the literature, for example in reference [8], the full content of which is incorporated herein by reference.
More specifically, referring to Figures 5 and 6, the frequency quantizer 110 comprises a selector 504 to perform an operation 554 of determining whether all the spectrum is quantized using FPC. As shown in Figure 5, if the selector 504 determines that not all the spectrum is quantized using FPC, an operation 556 of FPC coding and of pulse position and sign coding is performed in an encoder 506.
As shown in Figure 6, the operation 556 of FPC coding and of pulse position and sign coding comprises a frequency pulse search operation 659, an FPC coding operation 660, an operation 661 of finding the highest-energy pulses, and an operation 662 of quantizing the position and sign of the frequency pulses. To perform operations 659-662, the encoder 506 comprises, respectively, a searcher 609 of frequency pulses, an FPC coder 610, a finder 611 of the highest-energy pulses, and a quantizer 612 of the position and sign of the frequency pulses.
The searcher 609 searches for frequency pulses through all the bands for the frequencies below 3175 Hz. The FPC coder 610 then processes these frequency pulses. The finder 611 determines the highest-energy pulses for frequencies equal to and above 3175 Hz, and the quantizer 612 codes the position and sign of the highest-energy pulses found. If more than one (1) pulse is allowed within a band, the amplitude of the previously found pulse is divided by 2 and the search is conducted again over the entire band. Each time a pulse is found, its position and sign are stored for the quantization and bit-packing stage. As a non-limiting example, the following pseudo-code illustrates this simple search and quantization scheme:
where $N_{BD}$ is the number of bands ($N_{BD} = 16$ in the illustrative example), $N_p$ is the number of pulses $i$ to be coded in band $k$, $B_b$ is the number of frequency bins per band, $C_{Bb}$ is the cumulative number of frequency bins per band as defined previously in Section 5, $p_p$ is the vector containing the positions of the pulses found, $p_s$ is the vector containing the signs of the pulses found, and $p_{max}$ is the energy of the pulse found.
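The per-band search described above can be sketched as follows. This is not the patent's pseudo-code, which is not available here; it is an assumed reading in which the largest-magnitude bin of the band is picked, its position and sign are stored, and its working amplitude is halved before the next search. All names are hypothetical.

```python
def search_pulses(f_d, band_start, band_len, n_pulses):
    """Find n_pulses pulses in one band of f_d.

    Returns (positions, signs); a position may repeat when the halved
    amplitude of an earlier pulse is still the largest in the band.
    """
    work = [abs(x) for x in f_d[band_start:band_start + band_len]]
    positions, signs = [], []
    for _ in range(n_pulses):
        k = max(range(band_len), key=lambda j: work[j])
        positions.append(band_start + k)
        signs.append(1 if f_d[band_start + k] >= 0 else -1)
        work[k] /= 2.0     # halve the found pulse before searching again
    return positions, signs
```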
At bit rates above 12 kbps, the selector 504 determines that all the spectrum is to be quantized using FPC (Figures 5 and 6). As shown in Figure 5, an operation 555 of FPC coding is then performed in an FPC encoder 505. Referring to Figure 6, the encoder 505 comprises a searcher 607 of frequency pulses, and operation 555 comprises a corresponding operation 667 of searching frequency pulses. The search for frequency pulses is conducted through the whole set of bands. Operation 555 comprises an operation 668 of coding the frequency pulses found, and the encoder 505 comprises an FPC processor 608 to perform operation 668.
The FPC processor 608, or the quantizer 612 of pulse position and sign, then obtains the quantized difference vector $f_{dQ}$ by adding the nb_pulses pulses, with their signs $p_s$, at each of the positions $p_p$ found. For each band, the quantized difference vector $f_{dQ}$ can be written using, for example, the following pseudo-code:
for j = 0, ..., j < nb_pulses
    f_dQ(p_p(j)) += p_s(j)
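The same accumulation can be written as a runnable sketch. The function name is hypothetical, and positions are assumed to be absolute bin indices into the whole vector.

```python
def build_quantized_vector(length, positions, signs):
    """Accumulate unit pulses with sign p_s(j) at position p_p(j)."""
    f_dq = [0] * length
    for pos, sign in zip(positions, signs):
        f_dq[pos] += sign  # repeated positions stack into larger pulses
    return f_dq
```

Because pulses accumulate, two pulses of the same sign at one position yield an amplitude of 2 at that bin.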
6.5) Noise filling
All bands are quantized with more or less precision; the quantization method described in the previous section does not guarantee that all frequency bins within the bands are quantized. This is especially the case at low bit rates, where the number of pulses quantized per band is relatively low. To prevent audible artifacts caused by these unquantized bins, the frequency quantizer 110 comprises a noise filler 507 (Figure 5) to perform a corresponding operation 557 of adding some noise in the unquantized bins in order to fill these gaps. For example, this noise addition can be performed over all the spectrum at bit rates below 12 kbps, whereas at higher bit rates it can be applied only above the cut-off frequency $f_{tc}$ of the time-domain excitation contribution. For simplicity, the noise intensity varies only as a function of the available bit rate: the noise level is low at high bit rates and higher at low bit rates.
The noise filler 507 comprises an adder 613 (Figure 6) which performs an operation 663 of adding noise to the quantized difference vector $f_{dQ}$ once the intensity or energy level of the added noise has been determined. To that end, the frequency quantization operation 160 comprises an operation 664 of estimating the intensity or energy level of the added noise, and the frequency quantizer 110 comprises a corresponding estimator 614 of the noise energy level to perform operation 664. Operation 664 is conducted by the estimator 614 prior to the operation 665 of determining the gain per band in the per-band gain calculator 615 of the frequency quantizer 110.
In the illustrative embodiment, the noise level in the estimator 614 is directly related to the coding bit rate. For example, at 6.60 kbps the estimator 614 sets the noise level $N'_L$ to 0.4 times the amplitude of the frequency pulses coded in a specific band, and this value decreases progressively down to 0.2 times the amplitude of the frequency pulses coded in a band at 24 kbps. The adder 613 injects noise only into the sections of the spectrum where a certain number of consecutive frequency bins have very low energy, for example when the accumulated bin energy of half of a band is below 0.5. For a specific band $i$, the noise is injected, for example, as follows:
where, for band $i$, $C_{Bb}$ is the cumulative number of frequency bins per band, $B_b$ is the number of frequency bins in the specific band $i$, $N'_L$ is the level of the added noise, and rand is a random number generator limited between -1 and 1.
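A rough sketch of the noise level and the injection step follows. Several details are assumptions because the injection relation is not available here: the noise factor is interpolated linearly between the two quoted bit rates, each half of a band is tested separately against the 0.5 energy threshold, and noise is added to every bin of a half that fails the test. All names are hypothetical.

```python
import random


def noise_level(bitrate_kbps, pulse_amplitude=1.0):
    """N'_L: 0.4 x pulse amplitude at 6.6 kbps down to 0.2 x at 24 kbps
    (linear interpolation between the two points is an assumption)."""
    lo_rate, hi_rate = 6.6, 24.0
    rate = min(max(bitrate_kbps, lo_rate), hi_rate)
    factor = 0.4 - 0.2 * (rate - lo_rate) / (hi_rate - lo_rate)
    return factor * pulse_amplitude


def fill_noise(f_dq, band_start, band_len, level, rng):
    """Add noise in each half of a band whose accumulated energy is < 0.5."""
    out = list(f_dq)
    half = band_len // 2
    halves = ((band_start, band_start + half),
              (band_start + half, band_start + band_len))
    for start, stop in halves:
        if sum(x * x for x in out[start:stop]) < 0.5:
            for j in range(start, stop):
                out[j] += level * rng.uniform(-1.0, 1.0)
    return out
```

A call such as `fill_noise(f_dq, 0, 16, noise_level(9.6), random.Random(0))` leaves a half-band that already holds a unit pulse untouched and fills an empty half-band with low-level noise.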
6.6) Per-band gain quantization
Referring to Figures 5 and 6, the frequency quantization operation 160 of the unified time-domain/frequency-domain coding device 100 and method 150 comprises an operation 665 of determining the gain per band, followed by an operation 666 of quantizing the gain per band. To perform operations 665 and 666, the frequency quantizer 110 comprises a per-band gain calculator 615 and a per-band gain quantizer 616.
Once the quantized difference vector $f_{dQ}$ has been found (including the noise fill if needed), the calculator 615 computes the gain per band for each band. The gain per band $G_b(i)$ for a specific band is defined as the ratio, in the log domain, between the energy of the unquantized difference vector $f_d$ and the energy of the quantized difference vector $f_{dQ}$, using for example the following relation:
where $C_{Bb}$ and $B_b$ are defined in Section 5 above.
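One possible reading of the per-band gain is sketched below. The exact relation is not available here, so the code assumes the gain is the base-10 logarithm of the ratio of the summed squared bins of $f_d$ and $f_{dQ}$ over the band; the small `eps` guard and the function name are additions for the sketch.

```python
import math


def gain_per_band(f_d, f_dq, c_bb, b_b, eps=1e-12):
    """Log-domain energy ratio per band.

    c_bb[i] is the first bin of band i (cumulative bin count) and
    b_b[i] is the number of bins in band i.
    """
    gains = []
    for start, length in zip(c_bb, b_b):
        s_d = sum(x * x for x in f_d[start:start + length])
        s_dq = sum(x * x for x in f_dq[start:start + length])
        # eps avoids log(0) and division by zero for empty bands.
        gains.append(math.log10((s_d + eps) / (s_dq + eps)))
    return gains
```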
每频带增益量化器616矢量量化每频带频率增益。在矢量量化之前,在低比特率下,最后增益(对应于最后频带)被单独量化,并且剩余的十五(15)个每频带增益(例如,当使用16个频带时)被量化的最后增益除。然后,量化器616对归一化的十五(15)个剩余增益进行矢量量化。在较高的比特率下,首先量化每频带增益的平均值,然后在对这些每频带增益进行矢量量化之前,从例如十六(16)个频带的所有每频带增益中移除该平均值。所使用的矢量量化可以是包含每频带增益的矢量与特定码本的条目之间的距离的对数域中的标准最小化。The per-band gain quantizer 616 vector-quantizes the per-band frequency gain. Prior to vector quantization, at low bit rates, the last gain (corresponding to the last band) is quantized individually, and the remaining fifteen (15) per-band gains (e.g., when using 16 bands) are divided by the quantized last gain. The quantizer 616 then performs vector quantization on the normalized fifteen (15) remaining gains. At higher bit rates, the average per-band gain is first quantized, and then that average is removed from all per-band gains across, for example, sixteen (16) bands before vector quantization of these per-band gains. The vector quantization used can be a standard minimization in the logarithmic domain of the distance between the vector containing the per-band gain and entries in a particular codebook.
In the frequency-domain coding mode, a gain is computed in calculator 615 for each frequency band to match the energy of the unquantized vector f<sub>d</sub> with that of the quantized vector f<sub>dQ</sub>. The gains are vector quantized in quantizer 616 and applied per band (operation 559) to the quantized vector f<sub>dQ</sub> through multiplier 509 (Figures 5 and 6).
Alternatively, the FPC coding scheme can also be used for the whole spectrum at rates below 12 kbps by selecting only some of the frequency bands to be quantized. Before the band selection is performed, the per-band energy E<sub>d</sub> of the unquantized difference vector f<sub>d</sub> is quantized using the quantizer 616. The energy is computed using, for example, the following relation:
E<sub>d</sub>(i) = log<sub>10</sub>(S<sub>d</sub>(i))
where C<sub>Bb</sub> and B<sub>b</sub> are defined in Section 5 above.
To perform the quantization of the band energies E′<sub>d</sub>, the average energy over the first 12 of the sixteen bands used is first quantized and then subtracted from all sixteen (16) band energies. All the bands are then vector quantized in groups of 3 or 4 bands. The vector quantization used can be a standard minimization, in the log domain, of the distance between the vector containing the per-band gains and the entries of a specific codebook. If not enough bits are available, it is possible to quantize only the first 12 bands and to extrapolate the last four (4) bands using the average of the preceding three (3) bands or by any other method.
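The mean removal and the fallback extrapolation can be sketched in Python as below. The sketch assumes that the four missing bands are filled with the average of the three decoded bands immediately preceding them, which is one reading of the text; the function names and the omission of the actual vector quantizer are simplifications for illustration.

```python
def remove_reference_mean(energies, n_reference=12):
    """Subtract the mean of the first n_reference bands from every band.

    In a codec the mean would be quantized before subtraction; the
    quantizer is left out here to keep the sketch short.
    """
    mean = sum(energies[:n_reference]) / n_reference
    return mean, [e - mean for e in energies]

def extrapolate_tail(decoded_head, n_total=16, n_average=3):
    """Fill the bands that were not quantized with the average of the
    last n_average decoded bands (assumed extrapolation rule)."""
    fill = sum(decoded_head[-n_average:]) / n_average
    return list(decoded_head) + [fill] * (n_total - len(decoded_head))

energies = [float(i) for i in range(1, 17)]        # 16 illustrative band energies
mean, residual = remove_reference_mean(energies)    # mean over bands 1..12
head = [float(i) for i in range(12)]                # 12 decoded bands
full = extrapolate_tail(head)                       # 16 bands after extrapolation
```

Any other extrapolation rule could replace `extrapolate_tail`, provided the encoder and decoder apply the same one.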
Once the band energies of the unquantized difference vector are quantized, the energies can be sorted in decreasing order in a way that can be replicated at the decoder side. During the sorting, all the energy bands below 2 kHz are always kept, and only the most energetic of the remaining bands are then passed to the FPC scheme for coding the frequency pulse amplitudes and signs. With this approach, the FPC scheme codes a smaller vector but covers a wider frequency range. In other words, fewer bits are needed to cover the important energy events over the entire spectrum.
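The selection rule described above might be sketched as follows. The sketch assumes that a band counts as "below 2 kHz" when its upper edge is at most 2000 Hz, and that ties between equal energies are broken by band index so that a decoder working from the same quantized energies reaches the same result; both conventions and the function name are assumptions.

```python
def select_bands(quantized_energy, band_upper_hz, n_select, keep_below_hz=2000.0):
    """Return the sorted indices of the bands handed to the pulse quantizer.

    Bands whose upper edge is at or below keep_below_hz are always kept;
    the remaining slots go to the highest-energy bands. Sorting uses the
    quantized energies only, so the decoder can repeat it exactly.
    """
    always = [i for i, hz in enumerate(band_upper_hz) if hz <= keep_below_hz]
    others = [i for i in range(len(quantized_energy)) if i not in always]
    # Descending energy, ascending index as a deterministic tie-break.
    others.sort(key=lambda i: (-quantized_energy[i], i))
    n_extra = max(0, n_select - len(always))
    return sorted(always + others[:n_extra])

energy = [5.0, 1.0, 2.0, 9.0, 3.0, 9.0]
upper_edges = [1000.0, 2000.0, 3000.0, 4000.0, 5000.0, 6000.0]
selected = select_bands(energy, upper_edges, n_select=4)
print(selected)  # [0, 1, 3, 5]
```

Bands 0 and 1 are kept regardless of their energy, and the two remaining slots go to the two 9.0-energy bands.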
In the particular case of the implementation of the unified time-domain/frequency-domain coding device 700 and method 750 of Figure 7, the band selection and bit allocation are instead performed as determined by the per-band energy and per-band gain calculator 708 and calculating operation 758, and by the band selector and bit allocator 707 and band selection and bit allocation operation 757, of Figures 7 and 9 as described above.
After the pulse quantization process, a noise fill similar to the one described earlier is performed. Then a gain adjustment factor G<sub>a</sub> is computed per frequency band to match the energy E<sub>dQ</sub> of the quantized difference vector f<sub>dQ</sub> to the quantized energy E′<sub>d</sub> of the unquantized difference vector f<sub>d</sub>. This per-band gain adjustment factor is then applied to the quantized difference vector f<sub>dQ</sub>. This can be expressed as follows:
where
and
E′<sub>d</sub> is the quantized per-band energy of the unquantized difference vector f<sub>d</sub>, as defined earlier.
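A minimal Python sketch of this per-band energy matching is given below. It assumes that band energies are expressed as log10 of the sum of squared bins, so that the amplitude gain is 10 raised to half the difference of the log energies; this scaling, the `eps` guard and the function name are assumptions made for the illustration and may differ from the exact expressions of the disclosure.

```python
import math

def adjust_band_gains(f_dq, target_log_energy, band_start, band_width, eps=1e-12):
    """Scale each band of f_dq so that its energy matches the target.

    target_log_energy[b] stands for the quantized energy E'_d of band b,
    assumed to be log10 of a sum of squares. eps avoids log10(0).
    """
    adjusted = list(f_dq)
    for band, (start, width) in enumerate(zip(band_start, band_width)):
        log_energy_q = math.log10(sum(x * x for x in f_dq[start:start + width]) + eps)
        # Energies are squared quantities, hence the factor 0.5 for amplitudes.
        gain = 10.0 ** (0.5 * (target_log_energy[band] - log_energy_q))
        for j in range(start, start + width):
            adjusted[j] = gain * f_dq[j]
    return adjusted

f_dq = [1.0, 1.0, 0.0, 2.0]
targets = [math.log10(8.0), 0.0]   # desired energies 8 and 1
adjusted = adjust_band_gains(f_dq, targets, band_start=[0, 2], band_width=[2, 2])
```

With these inputs the first band is scaled up by a factor of 2 and the second band is scaled down by a factor of 0.5.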
After completion of the frequency-domain coding stage, the total time-domain/frequency-domain excitation is found. For that purpose, the hybrid time-domain/frequency-domain CELP coding method 170/770 comprises an operation 161 of summing, using the adder 111 (Figures 1, 2, 5 and 6) of the hybrid time-domain/frequency-domain CELP coder 120/720, the frequency-quantized difference vector f<sub>dQ</sub> from the frequency quantizer 110 and the filtered frequency-transformed time-domain excitation contribution f<sub>excF</sub>. When the unified time-domain/frequency-domain coding device 100/700 changes its bit allocation from a time-domain-only coding mode to a hybrid time-domain/frequency-domain coding mode, the per-band excitation spectrum energy of the time-domain-only coding mode does not match that of the hybrid time-domain/frequency-domain coding mode. This energy mismatch can create switching artifacts that are more audible at low bit rates. To reduce any audible degradation created by this bit reallocation, a long-term gain can be computed for each band and applied to the summed excitation to correct the energy of each band for a few frames after the reallocation.
Then the hybrid time-domain/frequency-domain CELP coding method 170/770 comprises an operation 162 (Figures 1, 5 and 6) of transforming the sum of the frequency-quantized difference vector f<sub>dQ</sub> and the frequency-transformed and filtered time-domain excitation contribution f<sub>excF</sub> back to the time domain using, for example, an IDCT (inverse DCT) 220 (Figure 2).
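The summation followed by the inverse transform can be sketched in pure Python as below. The sketch uses an orthonormal DCT-II and its inverse as a stand-in, since the exact transform type and scaling are not stated in this section, and it leaves out the long-term per-band gain correction; all function names are illustrative.

```python
import math

def dct(x):
    """Orthonormal DCT-II (assumed transform, for illustration only)."""
    n = len(x)
    out = []
    for k in range(n):
        acc = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * acc)
    return out

def idct(c):
    """Inverse of the orthonormal DCT-II above (a scaled DCT-III)."""
    n = len(c)
    out = []
    for t in range(n):
        acc = c[0] * math.sqrt(1.0 / n)
        acc += sum(c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (t + 0.5) * k / n)
                   for k in range(1, n))
        out.append(acc)
    return out

def total_excitation(f_exc_f, f_dq):
    """Add the two frequency-domain contributions, then return to time domain."""
    return idct([a + b for a, b in zip(f_exc_f, f_dq)])

time_contribution = [1.0, -2.0, 0.5, 3.0]
difference_signal = [0.25, 0.0, -0.5, 1.0]
excitation = total_excitation(dct(time_contribution), dct(difference_signal))
```

Because the transform is linear, adding in the frequency domain and inverting gives the same samples as adding the two signals in the time domain, which is what the example checks.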
The unified time-domain/frequency-domain coding method 150/750 comprises an operation 163/756 of producing a synthesized signal by filtering the total time-domain/frequency-domain excitation from the IDCT 220 through the LP synthesis filter 113/706 (Figures 1, 2 and 7) of the coding device 100/700.
The quantized positions and signs of the frequency pulses forming the quantized difference vector f<sub>dQ</sub> are transmitted to a distant decoder (not shown).
In one non-limiting embodiment, while the CELP coding memories are updated on a sub-frame basis using only the time-domain excitation contribution, the total time-domain/frequency-domain excitation is used to update those memories at frame boundaries. In another possible implementation, the CELP coding memories are updated on a sub-frame basis and also at frame boundaries using only the time-domain excitation contribution. This results in an embedded structure in which the frequency-domain quantized signal constitutes an upper quantization layer independent of the core CELP layer, which presents advantages in certain applications. In this particular case, the fixed codebook is always used to maintain good perceptual quality, and the number of sub-frames is always four (4) for the same reason. However, the frequency-domain analysis can apply to the whole frame. This embedded approach works for bit rates of around 12 kbps and higher.
7) Decoder device and method
Figure 11 is a schematic block diagram illustrating concurrently a decoder device 1100 and a corresponding decoding method 1150 for decoding a bitstream 1101 from the above-described unified time-domain/frequency-domain coding device 700 and corresponding unified time-domain/frequency-domain coding method 750.
The decoder device 1100 comprises a receiver (not shown) for receiving the bitstream 1101 from the unified time-domain/frequency-domain coding device 700.
If the sound signal coded by the unified time-domain/frequency-domain coding device 700 has been classified as "music", this is indicated in the bitstream 1101 by corresponding signaling bits and detected by the decoder device 1100 (see 1102). The received bitstream 1101 is then decoded by a "music" decoder 1103, for example a frequency-domain decoder.
If the sound signal coded by the unified time-domain/frequency-domain coding device 700 has been classified as "speech", this is indicated in the bitstream 1101 by corresponding signaling bits and detected by the decoder device 1100 (see 1104). The received bitstream 1101 is then decoded by a "speech" decoder 1105, for example a time-domain decoder using ACELP (Algebraic Code-Excited Linear Prediction) or, more generally, CELP (Code-Excited Linear Prediction).
If the sound signal coded by the unified time-domain/frequency-domain coding device 700 has been classified neither as "music" nor as "speech" (see 1102 and 1104), and the bit rate available for coding the sound signal is equal to or lower than 9.2 kbps (see 1106), this is indicated in the bitstream by the sub-mode flag F<sub>tfsm</sub> set to "0". The received bitstream 1101 is then decoded using the backward coding mode, that is, the legacy unified time-domain and frequency-domain coding model (EVS) of Figures 1 and 2, as indicated at 1107.
Finally, if the sound signal coded by the unified time-domain/frequency-domain coding device 700 has been classified neither as "music" nor as "speech" (see 1102 and 1104), and the bit rate available for coding the sound signal is higher than 9.2 kbps (see 1106), this is indicated in the bitstream 1101 by the sub-mode flag F<sub>tfsm</sub> set to "1", "2" or "3". The received bitstream 1101 is then decoded using the sound signal decoder 1200 and corresponding sound signal decoding method 1250 of Figure 12.
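The four decoding paths above amount to a simple dispatch, sketched below in Python. The string labels for the signal class and the returned decoder names are hypothetical; only the reference numerals and the flag values come from the description.

```python
def select_decoder(signal_class, submode_flag=None):
    """Pick a decoding path from the signaled class and sub-mode flag.

    signal_class is "music", "speech" or "unclear" (hypothetical labels for
    the signaling bits). submode_flag stands for F_tfsm and is only
    consulted when the class is neither music nor speech.
    """
    if signal_class == "music":
        return "music_decoder_1103"
    if signal_class == "speech":
        return "speech_decoder_1105"
    if submode_flag == 0:
        # Signaled when the available bit rate is 9.2 kbps or lower.
        return "backward_mode_decoder_1107"
    if submode_flag in (1, 2, 3):
        # Signaled when the available bit rate is above 9.2 kbps.
        return "sound_signal_decoder_1200"
    raise ValueError(f"unexpected sub-mode flag: {submode_flag!r}")

print(select_decoder("music"))
print(select_decoder("unclear", submode_flag=0))
print(select_decoder("unclear", submode_flag=2))
```

Dispatching on the flag rather than on the bit rate keeps the decoder dependent only on information carried in the bitstream.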
7.1) Sound signal decoder and decoding method
Figure 12 is a schematic block diagram illustrating concurrently the sound signal decoder 1200 and the corresponding sound signal decoding method 1250 for decoding the bitstream from the above-described unified time-domain/frequency-domain coding device 700 and corresponding unified time-domain/frequency-domain coding method 750 in the case where the sound signal has been classified in the unclear signal type category.
As mentioned in the foregoing description, the adaptive codebook index T and the adaptive codebook gain b are quantized and transmitted, and are therefore received in the bitstream by the receiver (not shown). In the same manner, when used, the fixed codebook index and the fixed codebook gain are also quantized and transmitted to the decoder, and are therefore received in the bitstream 1101 by the receiver (not shown). The sound signal decoding method 1250 comprises an operation 1256 of calculating a decoded time-domain excitation contribution using the adaptive codebook index and gain and, if used, the fixed codebook index and gain, as commonly done in the art of CELP coding. To perform operation 1256, the sound signal decoder 1200 comprises a calculator 1206 of the decoded time-domain excitation contribution.
The sound signal decoding method 1250 also comprises an operation 1257 of calculating a frequency transform of the decoded time-domain excitation contribution using the same procedure as in operation 156, which uses a DCT transform. To perform operation 1257, the sound signal decoder 1200 comprises a calculator 1207 of the frequency transform of the decoded time-domain excitation contribution.
As mentioned in the foregoing description, the quantized version f<sub>tcQ</sub> of the cut-off frequency is transmitted to the decoder and is therefore received in the bitstream 1101 by the receiver (not shown). The sound signal decoding method 1250 comprises an operation 1258 of filtering the frequency transform of the time-domain excitation contribution from the calculator 1207 using the decoded cut-off frequency f<sub>tcQ</sub> recovered from the bitstream 1101 and a procedure identical or similar to the filtering operation 266 described earlier. To perform operation 1258, the sound signal decoder 1200 comprises a filter 1208 of the frequency transform of the time-domain excitation contribution using the recovered cut-off frequency f<sub>tcQ</sub>. The filter 1208 has the same structure as, or at least a structure similar to, the filter 216 of Figure 2.
The filtered frequency transform of the time-domain excitation contribution from the filter 1208 is supplied to a positive input of an adder 1209 performing a corresponding adding operation 1259.
The sound signal decoding method 1250 comprises an operation 1260 of calculating the decoded energy and gain of each frequency band of the difference vector f<sub>d</sub>. To perform operation 1260, the sound signal decoder 1200 comprises a calculator 1210. Specifically, the calculator 1210 de-quantizes the quantized per-band energies and quantized per-band gains received in the bitstream 1101 by the receiver (not shown) from the unified time-domain/frequency-domain coding device 700, using the inverse of the quantization procedures described in the present disclosure.
The sound signal decoding method 1250 comprises an operation 1261 of recovering the frequency-quantized difference vector f<sub>dQ</sub>. To perform operation 1261, the sound signal decoder 1200 comprises a calculator 1211. The calculator 1211 extracts from the bitstream 1101 the quantized positions and signs of the frequency pulses, and replicates the selection of the frequency bands for quantization and the allocation of bits among the different frequency bands as determined by operation 757 and allocator 707 and used by the unified time-domain/frequency-domain coding device 700 to code the input sound signal. The calculator 1211 uses this replicated information to recover the frequency-quantized difference vector f<sub>dQ</sub> from the extracted quantized positions and signs of the frequency pulses. Specifically, for that purpose, the sound signal decoder 1200 replicates the procedure used in the unified time-domain/frequency-domain coding device 700, as illustrated in Figure 9, in response to the number of bits (bit rate) available in the decoder 1200 for the frequency-quantized difference vector f<sub>dQ</sub> (see 1220), the total bit rate available for the channel being processed (see 1220), and the sub-mode flag (see 1220).
Specifically:
- The estimator 1201 and operation 1251 of Figure 12 correspond to the estimator 901 and operation 951 of Figure 9, for pre-fixing a portion of the available bit budget for quantizing the lower frequencies of the difference vector f<sub>d</sub> as a function of the quantized cut-off frequency f<sub>tcQ</sub>.
- The estimator 1202 and operation 1252 of Figure 12 correspond to the estimator 902 and operation 952 of Figure 9, for estimating the maximum number N<sub>Bmx</sub> of frequency bands of the quantized difference vector f<sub>dQ</sub>.
- The calculator 1203 and operation 1253 of Figure 12 correspond to the calculator 903 and operation 953 of Figure 9, for calculating the lower-frequency bits.
- The characterizer 1204 and operation 1254 of Figure 12 correspond to the characterizer 904 and operation 954 of Figure 9, for characterizing the frequency bands.
- The allocator 1205 and operation 1255 of Figure 12 correspond to the allocator 905 and operation 955 of Figure 9, for the final allocation of bits per frequency band.
The sound signal decoding method 1250 comprises the operation 1259 of adding the recovered frequency-quantized difference vector f<sub>dQ</sub> from the calculator 1211 and the frequency-transformed and filtered time-domain excitation contribution f<sub>excF</sub> from the filter 1208 to form the mixed time-domain/frequency-domain excitation.
It will be appreciated that the estimators 1201 and 1202, the calculator 1203, the characterizer 1204, the allocator 1205, the calculators 1206 and 1207, the filter 1208, the calculators 1210 and 1211, and the adder 1209 form a re-constructor of the mixed time-domain/frequency-domain excitation using information conveyed in the bitstream 1101, this information including the sub-mode flag identifying the one of the coding sub-modes that was selected and used to code the sound signal classified in the unclear signal type category.
In the same manner, the operations 1251-1261 form a method of reconstructing the mixed time-domain/frequency-domain excitation using the information conveyed in the bitstream 1101.
The sound signal decoder 1200 comprises a converter 1212 for performing an operation 1262 of transforming the mixed time-domain/frequency-domain excitation back to the time domain using, for example, the IDCT (inverse DCT) 220.
Finally, the synthesized sound signal is computed in the decoder 1200 through an operation 1263 of filtering the total excitation from the converter 1212 through an LP (Linear Prediction) synthesis filter 1213. Of course, the LP parameters needed by the decoder 1200 to reconstruct the synthesis filter 1213 are transmitted from the unified time-domain/frequency-domain coding device 700 and extracted from the bitstream 1101, as well known in the art of CELP coding.
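The final synthesis step can be sketched as an all-pole recursion in Python. The sketch assumes the convention A(z) = 1 + a<sub>1</sub>z<sup>-1</sup> + ... + a<sub>M</sub>z<sup>-M</sup> for the LP filter, so that the synthesis filter is 1/A(z); the sign convention, the zero initial memory and the function name are assumptions for illustration and may differ in a given codec.

```python
def lp_synthesis(excitation, lp_coeffs, memory=None):
    """Filter the excitation through 1/A(z), A(z) = 1 + sum_k a_k z^-k.

    lp_coeffs holds a_1..a_M (assumed sign convention). memory holds the
    M most recent output samples, newest first, and defaults to zeros.
    """
    order = len(lp_coeffs)
    past = list(memory) if memory is not None else [0.0] * order
    output = []
    for sample in excitation:
        # s[n] = e[n] - sum_k a_k * s[n-k]
        value = sample - sum(a * p for a, p in zip(lp_coeffs, past))
        output.append(value)
        past = ([value] + past)[:order]
    return output

# First-order example: an impulse excites a decaying exponential.
synthesized = lp_synthesis([1.0, 0.0, 0.0], lp_coeffs=[-0.5])
print(synthesized)  # [1.0, 0.5, 0.25]
```

Carrying `memory` from one frame to the next would keep the filter state continuous across frame boundaries.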
8) Hardware implementation
Figure 10 is a simplified block diagram of an example configuration of hardware components forming the above-described unified time-domain/frequency-domain coding device 100/700 and method 150/750, decoder device 1100 and decoding method 1150.
The unified time-domain/frequency-domain coding device 100/700 and the decoder device 1100 may be implemented as part of a mobile terminal, as part of a portable media player, or in any similar device. The device 100/700 and the decoder device 1100 (identified as 1000 in Figure 10) comprise an input 1002, an output 1003, a processor 1001 and a memory 1004.
The input 1002 is configured to receive the input sound signal 101 of Figures 1 and 7, or the bitstream 1101, in digital or analog form. The output 1003 is configured to supply the output signal. The input 1002 and the output 1003 may be implemented in a common module, for example a serial input/output device.
The processor 1001 is operatively connected to the input 1002, to the output 1003 and to the memory 1004. The processor 1001 is realized as one or more processors for executing code instructions in support of the functions of the various components of the unified time-domain/frequency-domain coding device 100/700 for coding an input sound signal as illustrated in Figures 1-9, or of the decoder device 1100 of Figures 11-12.
The memory 1004 may comprise a non-transient memory for storing code instructions executable by the processor 1001, specifically a processor-readable memory comprising or storing non-transitory instructions that, when executed, cause a processor to implement the operations and components of the unified time-domain/frequency-domain coding device 100/700 and method 150/750 and of the decoder device 1100 and decoding method 1150 as described in the present disclosure. The memory 1004 may also comprise a random access memory or buffers to store intermediate processing data from the various functions performed by the processor 1001.
Those of ordinary skill in the art will realize that the description of the unified time-domain/frequency-domain coding device 100/700 and method 150/750 and of the decoder device 1100 and decoding method 1150 is illustrative only and is not intended to be limiting in any way. Other embodiments will readily suggest themselves to persons of ordinary skill in the art having the benefit of the present disclosure. Furthermore, the disclosed unified time-domain/frequency-domain coding device 100/700 and method 150/750, decoder device 1100 and decoding method 1150 may be customized to offer valuable solutions to existing needs and problems of coding and decoding sound.
In the interest of clarity, not all of the routine features of the implementations of the unified time-domain/frequency-domain coding device 100/700 and method 150/750 and of the decoder device 1100 and decoding method 1150 are shown and described. It will, of course, be appreciated that in the development of any such actual implementation, numerous implementation-specific decisions may need to be made in order to achieve the developer's specific goals, such as compliance with application-, system-, network- and business-related constraints, and that these specific goals will vary from one implementation to another and from one developer to another. Moreover, it will be appreciated that a development effort might be complex and time-consuming, but would nevertheless be a routine engineering undertaking for those of ordinary skill in the field of sound processing having the benefit of the present disclosure.
In accordance with the present disclosure, the components/processors/modules, processing operations and/or data structures described herein may be implemented using various types of operating systems, computing platforms, network devices, computer programs and/or general-purpose machines. In addition, those of ordinary skill in the art will recognize that devices of a less general-purpose nature, such as hardwired devices, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs) or the like, may also be used. Where a method comprising a series of operations and sub-operations is implemented by a processor, computer or machine, and those operations and sub-operations may be stored as a series of non-transitory code instructions readable by the processor, computer or machine, they may be stored on a tangible and/or non-transient medium.
The unified time-domain/frequency-domain coding device 100/700 and method 150/750 and the decoder device 1100 and decoding method 1150 as described herein may use software, firmware, hardware, or any combination of software, firmware or hardware suitable for the purposes described herein.
In the unified time-domain/frequency-domain coding device 100/700 and method 150/750 and the decoder device 1100 and decoding method 1150 as described herein, the various operations and sub-operations may be performed in various orders, and some of the operations and sub-operations may be optional.
Although the present disclosure has been described hereinabove by way of non-restrictive, illustrative embodiments thereof, these embodiments may be modified at will within the scope of the appended claims without departing from the spirit and nature of the present disclosure.
9) References
The present disclosure mentions the following references, the full contents of which are incorporated herein by reference:
[1] U.S. Patent 9,015,038, "Coding generic audio signals at low bit rate and low delay".
[2] 3GPP TS 26.445, v.12.0.0, "Codec for Enhanced Voice Services (EVS); Detailed Algorithmic Description", September 2014.
[3] 3GPP SA4 contribution S4-170749, "New WID on EVS Codec Extension for Immersive Voice and Audio Services", SA4 meeting #94, June 26-30, 2017, http://www.3gpp.org/ftp/tsg_sa/WG4_CODEC/TSGS4_94/Docs/S4-170749.zip
[4] U.S. Provisional Patent Application 63/010,798, "Method and device for speech/music classification and core encoder selection in a sound codec".
[5] ITU-T Recommendation G.718, "Frame error robust narrow-band and wideband embedded variable bit-rate coding of speech and audio from 8-32 kbit/s", June 2008.
[6] T. Vaillancourt et al., "Inter-tone noise reduction in a low bit rate CELP decoder", Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Taipei, Taiwan, April 2009, pp. 4113-4116.
[7] V. Eksler and M. Jelínek, "Transition mode coding for source controlled CELP codecs", Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March-April 2008, pp. 4001-4004.
[8] U. Mittal, J.P. Ashley and E.M. Cruz-Zeno, "Low Complexity Factorial Pulse Coding of MDCT Coefficients using Approximation of Combinatorial Functions", Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Taipei, Taiwan, April 2007, pp. 289-292.
Claims (120)
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US63/135,171 | 2021-01-08 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| HK40103944A | 2024-07-12 |