CN110739000B

CN110739000B - Audio object coding method suitable for personalized interactive system

Info

Publication number: CN110739000B
Application number: CN201910972165.8A
Authority: CN
Inventors: 胡瑞敏; 胡晨昊; 王晓晨; 武庭照; 吴玉林
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2019-10-14
Filing date: 2019-10-14
Publication date: 2022-02-01
Anticipated expiration: 2039-10-14
Also published as: CN110739000A

Abstract

The invention discloses an audio object coding method suitable for a personalized interactive system. In the encoding stage, the audio objects to be encoded are first framed, windowed, and transformed from the time domain to the frequency domain; the objects are sorted by energy to determine the coding order; in each step of a loop, the object to be coded and the corresponding downmix signal are taken, and the parameters and residual of that step are calculated from them; the large residual matrices are decomposed and compressed using singular value decomposition; and the final downmix signal, the parameters, and the residual decomposition matrices are combined into a bitstream. In the decoding stage, the residuals are reconstructed from the decomposition matrices, and the objects are then decoded and reconstructed step by step from the downmix signal according to the residual and parameters of each object. Through sequential multi-step encoding and decoding and residual decomposition, the invention reconstructs every audio object with high quality while keeping the bit rate low.

Description

An Audio Object Coding Method for Personalized Interactive System

Technical field

The invention belongs to the technical field of digital audio signal processing, and in particular relates to an audio object encoding and decoding method based on multi-step progressive downmixing and reconstruction. It is suitable for personalized interactive systems for spatial audio and allows users to adjust audio objects according to their own needs.

Background

Channel-based spatial audio technologies, such as MPEG spatial audio coding and the NHK 22.2 loudspeaker array, can encode and reconstruct three-dimensional audio scenes and provide a more immersive listening experience than mono or stereo audio, and are therefore increasingly popular. However, traditional channel-based spatial audio systems still have limitations: their flexibility is low, and they struggle to meet the needs of audio service systems that support personalized interaction. A new generation of audio coding technology therefore decomposes the audio scene into a series of independent objects and encodes and transmits them with the object as the basic element.

Many researchers and institutions worldwide have studied audio object coding and proposed a variety of methods. The most representative is Spatial Audio Object Coding (SAOC), proposed by the German research institute Fraunhofer [Reference 1]. SAOC encodes and transmits a downmix signal of multiple audio objects together with side information, and the decoder separates and reconstructs the audio objects from the downmix signal according to the side information. SAOC can transmit a large number of audio objects at a low bit rate, which greatly improves coding efficiency and lets users make personalized adjustments and interactions according to their own listening needs [Reference 2].

In the SAOC framework, the same parameter is used as side information across a whole subband in order to keep the coding bit rate low. This causes aliasing distortion in the frequency domain and seriously degrades the listening experience; for example, a played-back audio object may contain components of other objects [Reference 3]. The problem can even affect downstream personalized spatial audio interaction services on the user side. Some studies use residual signals to compensate for this distortion and improve the decoded sound quality [Reference 4][Reference 5]. However, these methods only improve the listening experience for one target object; the other objects still suffer from aliasing distortion, so good decoded quality cannot be guaranteed for every audio object.

Reference 1: Breebaart, J., Engdegård, J., Falch, C., et al.: Spatial audio object coding (SAOC) - the upcoming MPEG standard on parametric object based audio coding. In: Audio Engineering Society Convention 124. Audio Engineering Society (2008).

Reference 2: Coleman, P., Franck, A., Francombe, J., et al.: An audio-visual system for object-based audio: From recording to listening. IEEE Transactions on Multimedia 20(8), 1919-1931 (2018).

Reference 3: Wu, T., Hu, R., Wang, X., Ke, S.: Audio object coding based on optimal parameter frequency resolution. Multimedia Tools and Applications, pp. 1-16 (2019).

Reference 4: Kim, K., Seo, J., Beack, S., Kang, K., Hahn, M.: Spatial audio object coding with two-step coding structure for interactive audio service. IEEE Transactions on Multimedia 13(6), 1208-1216 (2011).

Reference 5: Lee, B., Kim, K., Hahn, M.: Efficient residual coding method of spatial audio object coding with two-step coding structure for interactive audio services. IEICE Transactions on Information and Systems 99(7), 1949-1952 (2016).

Summary of the invention

To solve the above technical problems, the present invention provides an audio object encoding and decoding method based on multi-step progressive downmixing and reconstruction, which performs high-quality audio encoding and decoding at medium and low bit rates and ensures that all audio objects have good decoded sound quality.

The technical scheme adopted by the present invention is an audio object coding method suitable for a personalized interactive system, characterized in that it comprises the following steps:

Step A1: Frame and window the input audio object sequences and convert the time-domain signals into frequency-domain signals to obtain the time-frequency matrix of each audio object;

Step A2: From the time-frequency matrix of each object, calculate the frequency-domain energy of the objects and sort them, determining the object to be encoded in each step of the multi-step progressive encoding;

Step A3: Following the determined coding order, downmix step by step and calculate the corresponding side information. Stepwise downmixing means adding the matrices of the object pair input to the current processing step to obtain a sum matrix; the intermediate stepwise downmix signals are not transmitted in the bitstream. The side information comprises the object residuals and the object gain parameter matrices, where the object gain parameters are calculated from the energy ratio of the two input signals in the object pair;

Step A4: Use singular value decomposition to decompose the object residuals in the side information into left and right singular matrices and singular values;

Step A5: Quantize the singular matrices, singular values, and object gain parameters to obtain the side information bitstream;

Step A6: Encode the final downmix signal from Step A3 to obtain the downmix signal bitstream;

Step A7: Combine the bitstreams obtained in Step A5 and Step A6 into the output bitstream and transmit it to the decoder.

Compared with existing audio object coding techniques, the advantages of the present invention are as follows: multi-step progressive encoding and decoding uses residuals to compensate for decoding distortion as far as possible, ensuring that every audio object has good listening quality; in addition, singular value decomposition is introduced to decompose and compress the residual information and reduce the bit rate. The present invention can therefore decode high-quality audio objects at medium and low bit rates, meeting the requirements of personalized interactive audio systems.

Description of drawings

Fig. 1 is a schematic diagram of the encoding principle according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of the decoding principle according to an embodiment of the present invention.

Detailed description

To help those skilled in the art understand and implement the present invention, its technical solutions are further described below with reference to the accompanying drawings and specific implementation examples. It should be understood that the implementation examples described here are only used to illustrate and explain the present invention and are not intended to limit it.

Building on existing audio object coding methods, the present invention proposes an audio object encoding and decoding method based on multi-step progressive downmixing and reconstruction. First, the optimal coding order is determined from the frequency-domain energy of the objects, which fixes the object to be encoded, and for which side information is calculated, in each step; in this way residual information is obtained for every object, which effectively reduces the signal distortion and aliasing of all reconstructed objects. Then, singular value decomposition splits the residual information into three low-dimensional matrices, which compresses the residual information and reduces the bit rate.

Referring to Fig. 1, the present invention proposes a multi-audio-object encoding method suitable for a personalized interactive system. This implementation example uses four input objects A, B, C, and D for illustration and comprises the following steps:

Step A1: Input audio objects A, B, C, and D (which may include different sources such as vocals, piano, and guitar). Each object is framed and windowed, and its time-domain signal is converted to the frequency domain to obtain the time-frequency matrix of each audio object;

In this embodiment, framing, windowing, and the modified discrete cosine transform (MDCT) turn the original one-dimensional time-domain sound signal into a two-dimensional frequency-domain spectrogram, so the output is object data in matrix form.

The input audio object signals are in WAV format with a sampling rate of 44.1 kHz and a bit depth of 16 bits.

It should be noted that the audio parameters and object types specified here only illustrate the implementation of the present invention and are not intended to limit it.

In framing and windowing, each frame has a length of 1024, the window function is a Hanning window, and adjacent frames overlap by 50% in the time domain; the time-frequency transform is the MDCT with a transform length of 2048 points. The final output is the set of audio object signals in matrix form, where the number of rows equals the number of frames and the number of columns equals the number of frequency bins (or the transposed layout).

It should be noted that the frame length, window function type, and transform specified here only illustrate the specific implementation steps of the present invention and are not intended to limit it.
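The framing, windowing, and transform of Step A1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes that "frame length 1024 with 50% overlap and a 2048-point transform" means a hop of 1024 samples with 2048-sample windows, it uses a direct (slow) MDCT formula, and the function names are hypothetical.

```python
import math

def hann(length):
    """Periodic Hann window of the given length."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / length) for n in range(length)]

def frame_signal(x, hop):
    """Split x into frames of length 2*hop that overlap by 50%."""
    size = 2 * hop
    return [x[s:s + size] for s in range(0, len(x) - size + 1, hop)]

def mdct(frame):
    """Direct O(N^2) MDCT: 2N windowed samples in, N coefficients out."""
    n_half = len(frame) // 2
    return [
        sum(frame[n] * math.cos(math.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5))
            for n in range(len(frame)))
        for k in range(n_half)
    ]

def time_frequency_matrix(x, hop):
    """Rows are frames, columns are frequency bins."""
    window = hann(2 * hop)
    return [mdct([w * s for w, s in zip(window, frame)])
            for frame in frame_signal(x, hop)]

# Tiny demo with hop = 4; the embodiment's values would correspond to hop = 1024.
signal = [math.sin(2 * math.pi * 0.1 * n) for n in range(20)]
tf = time_frequency_matrix(signal, hop=4)
print(len(tf), len(tf[0]))  # 4 frames, 4 bins per frame
```

A practical encoder would use an FFT-based MDCT; the direct formula is used here only because it is short and easy to check.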

Step A2: From the time-frequency matrix of each object, calculate the frequency-domain energy of the objects and sort them to determine the object to be encoded in each step;

In this embodiment, the frequency-domain energy of each object is calculated from its matrix-form object data, the objects are sorted by energy from largest to smallest, and this determines the order of the objects to be encoded in each step; that is, audio objects with larger energy are encoded first.

The frequency-domain energy share of each object is calculated as follows:

$$O_i = \frac{\|S_i\|}{\sum_{k=1}^{N} \|S_k\|}$$

where ||S_i|| denotes the total energy of the i-th audio object, N is the number of objects to be encoded, and O_i denotes the proportion of the i-th object in the total energy of all objects. The objects are sorted by O_i from largest to smallest; in this example the order is D (S₁), B (S₂), A (S₃), C (S₄), and objects with larger O_i are encoded first. It should be noted that i ∈ [1, 4] and the descending sort order specified here only illustrate the specific implementation steps of the present invention and are not intended to limit it.
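The energy-based ordering of Step A2 can be illustrated with a short sketch. It assumes that the total energy of an object is the sum of its squared time-frequency coefficients, which the text does not state explicitly; the toy matrices and function names are hypothetical and chosen so that the resulting order matches the example order D, B, A, C.

```python
def total_energy(tf_matrix):
    """Sum of squared coefficients over all frames and bins (assumed energy measure)."""
    return sum(c * c for row in tf_matrix for c in row)

def coding_order(objects):
    """Return (names sorted by descending energy share, energy share per name)."""
    energies = {name: total_energy(m) for name, m in objects.items()}
    total = sum(energies.values())
    if total == 0:
        raise ValueError("all objects are silent")
    shares = {name: e / total for name, e in energies.items()}
    order = sorted(shares, key=shares.get, reverse=True)
    return order, shares

objects = {
    "A": [[1.0, 1.0], [1.0, 0.0]],   # energy 3
    "B": [[2.0, 0.0], [1.0, 0.0]],   # energy 5
    "C": [[1.0, 0.0], [0.0, 0.0]],   # energy 1
    "D": [[2.0, 2.0], [1.0, 0.0]],   # energy 9
}
order, shares = coding_order(objects)
print(order)  # ['D', 'B', 'A', 'C']
```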

Step A3: Following the coding order, downmix step by step and calculate the corresponding side information (object residuals and object gain parameters);

In this embodiment, stepwise downmixing means adding the matrices of the object pair input to the current processing step to obtain a sum matrix; the intermediate stepwise downmix signals are not transmitted in the bitstream. The side information comprises the object residuals and the object gain parameter matrices, where the object gain parameters are calculated from the energy ratio of the two input signals in the object pair;

The object residual and the object gain parameters are calculated as follows:

where R(i) is the residual signal of the (i+1)-th object, G_o(i) is the gain parameter of the (i+1)-th object, and G_d(i) is the gain parameter of the i-th downmix signal; X_i denotes the downmix signal obtained in step i, P_o(i) is the energy of object i, and P_d(i) is the energy of the downmix signal of step i. In this embodiment N = 4 is the number of objects to be encoded.

It should be noted that the number of objects N = 4 specified here only illustrates the specific implementation steps of the present invention and is not intended to limit it.

In this example, following the coding order determined in Step A2, the multi-step progressive downmixing proceeds as follows. In the first step, objects D and B form the object pair for downmixing and parameter extraction (D is treated as the downmix signal in this step); this gives the downmix signal X₁ of the two objects, together with the gain parameter G_o(1) and residual R(1) of the second object B. In the second step, the downmix signal X₁ and object A form the object pair; this gives the second-step downmix signal X₂, together with the gain parameter G_o(2) and residual R(2) of the third object A. In the third step, the downmix signal X₂ and object C form the object pair; this gives the third-step downmix signal X₃ (the final downmix signal to be transmitted to the decoder), together with the gain parameter G_o(3) and residual R(3) of the fourth object C. With these three steps, downmixing and parameter extraction are complete for all four objects.

It should be noted that the coding order and number of steps specified here only illustrate the specific implementation steps of the present invention and are not intended to limit it.
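The stepwise downmix loop can be sketched as below. The exact gain and residual formulas are not reproduced in this text, so the sketch assumes one plausible form: G_o = P_o / (P_o + P_d), G_d = P_d / (P_o + P_d), and R(i) = S_{i+1} - G_o(i) · X_i. It also uses one scalar gain per step and flat coefficient lists instead of per-subband gain matrices; all names are hypothetical.

```python
def energy(x):
    """Sum of squared coefficients of a flat coefficient list."""
    return sum(v * v for v in x)

def encode_stepwise(ordered_objects):
    """ordered_objects: equal-length coefficient lists, highest energy first.

    Returns (final_downmix, gains, residuals), where gains[i] = (g_o, g_d) and
    residuals[i] belong to the object added in step i + 1.
    Assumes at least one input of every pair has non-zero energy.
    """
    downmix = list(ordered_objects[0])      # step 1 treats the first object as the downmix
    gains, residuals = [], []
    for obj in ordered_objects[1:]:
        p_o, p_d = energy(obj), energy(downmix)
        g_o = p_o / (p_o + p_d)             # assumed form of the energy-ratio gain
        g_d = p_d / (p_o + p_d)
        downmix = [d + o for d, o in zip(downmix, obj)]          # matrix addition
        residuals.append([o - g_o * x for o, x in zip(obj, downmix)])
        gains.append((g_o, g_d))
    return downmix, gains, residuals

# Three toy objects, already sorted by descending energy (16, 4, 2).
d, b, a = [4.0, 0.0], [0.0, 2.0], [1.0, 1.0]
final, gains, residuals = encode_stepwise([d, b, a])
print(final)  # [5.0, 3.0]
```

Only the final downmix, the gains, and the residuals leave this function; the intermediate downmix signals are discarded, as the text requires.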

Step A4: Use singular value decomposition to decompose the object residuals in the side information into left and right singular matrices and singular values;

In this embodiment, singular value decomposition reduces the dimensionality of the residual matrices of the objects, limiting the increase in data volume caused by the residual information. Each residual matrix is decomposed into three smaller matrices: the left singular matrix, the singular value matrix, and the right singular matrix; for the singular value matrix, only the values on its diagonal are transmitted.

Singular value decomposition (SVD) is a matrix factorization that expresses a matrix as a product of component matrices, so that a high-dimensional matrix can be represented by several low-dimensional matrices for data compression. The decomposition is:

$$R(i)_{P \times Q} = U \Lambda V^{T}$$

where R(i)_{P×Q} is the residual signal of the (i+1)-th object, the number of rows P is half the MDCT transform length, and the number of columns Q is the number of frames of the audio object. U is the left singular matrix, Λ is the singular value matrix, and V is the right singular matrix. The singular values on the diagonal of Λ are sorted from largest to smallest.

For dimensionality reduction, the first r singular values (here r = 50) and the corresponding singular vectors can be used to approximate R(i):

$$R(i) \approx U_r \Lambda_r V_r^{T}$$

where Λ_r is the part of the singular value matrix that holds the first r singular values, and U_r and V_r are the first 50 columns of the original left and right singular matrices (equivalently, the first 50 rows of their transposes). These three matrices approximate the residual signal while reducing the matrix dimensions and compressing the amount of side information.

It should be noted that r = 50 specified here only illustrates the specific implementation steps of the present invention and is not intended to limit it.
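To make the compression concrete, the following sketch counts how many values must be stored for one residual matrix before and after truncation. The row count follows the text (half of a 2048-point transform) and r = 50 is the example value; the frame count Q = 430 is a hypothetical figure, roughly ten seconds of audio at 44.1 kHz with a hop of 1024 samples.

```python
def svd_storage(p, q, r):
    """Values stored for a P x Q residual: full matrix vs. rank-r factors."""
    full = p * q
    truncated = r * (p + q + 1)     # r columns of U, r columns of V, r singular values
    return full, truncated

p = 2048 // 2        # rows: half the MDCT transform length
q = 430              # columns: hypothetical number of frames
full, truncated = svd_storage(p, q, r=50)
print(full, truncated)  # 440320 72750
```

Under these assumptions the truncated factors hold roughly one sixth of the values of the full residual matrix, before any quantization is applied.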

Step A5: Quantize the singular values, singular matrices, and object gain parameters to obtain the side information bitstream;

In this embodiment, quantization can be implemented by table lookup. The elements of the residual decomposition matrices and of the gain parameters have different value ranges, so they are normalized before quantization so that a single quantization table can be used. For each element, the closest value in the quantization table is found, and the corresponding quantization index is output as the quantized side information bitstream.
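Table-lookup quantization with normalization can be sketched as follows. The quantization table is not given in the text, so a uniform table on [-1, 1] and peak normalization are assumed here purely for illustration; the function names are hypothetical. The matching de-normalization on the decoder side is included for completeness.

```python
def build_table(levels):
    """Uniform table on [-1, 1] (hypothetical; the actual table is not specified)."""
    return [-1.0 + 2.0 * k / (levels - 1) for k in range(levels)]

def quantize(values, table):
    """Normalize by the peak magnitude, then return (scale, nearest-entry indices)."""
    scale = max(abs(v) for v in values) or 1.0   # avoid dividing by zero for all-zero input
    indices = [min(range(len(table)), key=lambda k: abs(table[k] - v / scale))
               for v in values]
    return scale, indices

def dequantize(scale, indices, table):
    """Look up the table entries and undo the normalization."""
    return [table[k] * scale for k in indices]

table = build_table(5)                           # [-1.0, -0.5, 0.0, 0.5, 1.0]
scale, idx = quantize([4.0, -2.0, 0.3], table)
print(idx)                                       # [4, 1, 2]
print(dequantize(scale, idx, table))             # [4.0, -2.0, 0.0]
```

In this sketch the scale factor would also have to be transmitted so that the decoder can de-normalize; how the method signals it is not described in the text.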

Step A6: Encode the final downmix signal from Step A3 to obtain the downmix signal bitstream;

In this embodiment, the final downmix signal is the basis on which the decoder reconstructs the object signals, and it is encoded with AAC 128k.

It should be noted that encoding the final downmix signal with AAC 128k only illustrates the specific implementation steps of the present invention and is not intended to limit it.

Step A7: Combine the bitstreams obtained in Step A5 and Step A6 into the output bitstream and transmit it to the decoder;

Combining the output bitstream means merging the final downmix signal bitstream with the side information bitstream and adding flag bits that identify the parts for parsing. The final downmix signal bitstream is the AAC-encoded output; the side information bitstream is the stream of quantization indices produced by quantizing the residual decomposition matrices and the gain parameters.

Referring to Fig. 2, the present invention also proposes a multi-audio-object decoding method suitable for a personalized interactive system. This implementation example uses four input objects A, B, C, and D for illustration and comprises the following steps:

Step B1: Parse the received bitstream to obtain the side information bitstream and the final downmix signal bitstream;

In this embodiment, parsing the bitstream means inverting the procedure used to combine the output bitstream, which yields the final downmix signal bitstream and the side information bitstream.
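Combining and parsing the bitstream could look like the sketch below. The actual bitstream syntax and flag bits are not specified in the text; the layout here (a 1-byte flag and a 4-byte big-endian length before each part) and the flag values are purely hypothetical and serve only to show that parsing inverts the combining step.

```python
import struct

TAG_DOWNMIX, TAG_SIDE = 0x01, 0x02          # hypothetical flag values

def pack_stream(downmix_bytes, side_bytes):
    """Concatenate both parts, each prefixed by a 1-byte flag and a 4-byte length."""
    out = b""
    for tag, payload in ((TAG_DOWNMIX, downmix_bytes), (TAG_SIDE, side_bytes)):
        out += struct.pack(">BI", tag, len(payload)) + payload
    return out

def parse_stream(stream):
    """Invert pack_stream: return a dict mapping flag to payload."""
    parts, pos = {}, 0
    while pos < len(stream):
        tag, length = struct.unpack_from(">BI", stream, pos)
        pos += 5                             # size of the flag + length header
        parts[tag] = stream[pos:pos + length]
        pos += length
    return parts

stream = pack_stream(b"AAC-frames", b"\x04\x01\x02")
parts = parse_stream(stream)
print(parts[TAG_DOWNMIX], parts[TAG_SIDE])
```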

Step B2: Decode the downmix signal bitstream with AAC to obtain the downmix signal;

In this embodiment, the final downmix signal bitstream is the data stream produced by AAC encoding and compression; AAC decoding recovers the final downmix signal as it was before transmission.

Step B3: Dequantize the side information bitstream to obtain the left and right singular matrices, the singular values, and the object gain parameters;

In this embodiment, the side information was normalized during quantization, so it is de-normalized correspondingly during dequantization. This recovers the side information as it was before transmission.

Step B4: Combine the left and right singular matrices and the singular values by matrix synthesis to recover the object residuals;

In this embodiment, matrix synthesis multiplies the left singular matrix, the singular value matrix, and the right singular matrix to obtain the approximate object residual:

$$R(i) \approx U_r \Lambda_r V_r^{T}$$

where U_r, Λ_r, and V_r are the dequantized truncated left singular matrix, singular value matrix, and right singular matrix.
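The matrix synthesis of Step B4 amounts to a product of the three transmitted factors. A minimal sketch, assuming the left factor is stored as a P x r matrix, the right factor as a Q x r matrix, and the singular values as a list of length r (the storage layout and the function name are assumptions):

```python
def reconstruct_residual(u_r, sigma, v_r):
    """Compute U_r * diag(sigma) * V_r^T.

    u_r: P x r left singular vectors, sigma: r singular values,
    v_r: Q x r right singular vectors. Returns a P x Q matrix.
    """
    rows, cols, rank = len(u_r), len(v_r), len(sigma)
    return [[sum(u_r[p][k] * sigma[k] * v_r[q][k] for k in range(rank))
             for q in range(cols)]
            for p in range(rows)]

u_r = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]     # 3 x 2
sigma = [2.0, 0.5]                              # only the diagonal is transmitted
v_r = [[1.0, 0.0], [0.0, 1.0]]                  # 2 x 2
print(reconstruct_residual(u_r, sigma, v_r))    # [[2.0, 0.0], [0.0, 0.5], [0.0, 0.0]]
```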

Step B5: Decode in the reverse of the coding order, using the side information to iteratively reconstruct the frequency-domain signals of the audio objects from the transmitted downmix signal;

The object gain parameter separates the object from the corresponding downmix signal, and the residual signal then compensates for the aliasing distortion, giving the reconstructed frequency-domain audio object signal, as shown in the following formulas:

where S′_i is the reconstructed frequency-domain object signal, X′_i is the reconstructed stepwise downmix signal, and G_d(i) is the gain parameter of the downmix signal at each step. The residual used in these formulas is the one obtained at the decoder by matrix synthesis, i.e., the result of Step B4. Objects are decoded in the reverse of the coding order, and each object is reconstructed from the stepwise downmix signal in its corresponding decoding step.

In this example, following the decoding order determined in Step B5 and formulas (8), (9), and (10) above, the multi-step progressive reconstruction proceeds as follows. In the first step, object C (i.e., S′₄) is reconstructed from the final downmix signal X₃ using the gain parameter G_o(3) and its residual, and the stepwise downmix signal X′₂ is reconstructed from X₃ using the gain parameter G_d(3). In the second step, object A (i.e., S′₃) is reconstructed from the stepwise downmix signal X′₂ using the gain parameter G_o(2) and its residual, and the stepwise downmix signal X′₁ is reconstructed from X′₂ using the gain parameter G_d(2). In the third step, object B (i.e., S′₂) is reconstructed from the stepwise downmix signal X′₁ using the gain parameter G_o(1) and its residual, and subtracting the reconstructed object B from X′₁ gives the reconstructed object D (i.e., S′₁). With these three decoding steps, the objects are recovered one after another from the corresponding stepwise downmix signals, and the residual information compensates the reconstructed signals, reducing the loss of sound quality caused by aliasing distortion.

It should be noted that the four objects A, B, C, and D and the number of decoding steps here only illustrate the specific implementation steps of the present invention and are not intended to limit it.
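The reverse-order reconstruction can be sketched as below. Formulas (8) to (10) are not reproduced in this text, so the sketch assumes S′ = G_o · X + R for the object and X′ = G_d · X − R for the previous downmix, which is consistent only if G_o + G_d = 1 and the encoder defined the residual as R = S − G_o · X. Scalar gains, flat coefficient lists, and the function name are simplifying assumptions.

```python
def decode_stepwise(final_downmix, gains, residuals):
    """Recover the objects, in coding order, from the final downmix.

    gains[i] = (g_o, g_d) and residuals[i] belong to the object added in
    encoding step i + 1. Assumes g_o + g_d = 1 (see the lead-in).
    """
    downmix = list(final_downmix)
    recovered = []
    for step in range(len(gains) - 1, -1, -1):           # reverse of the coding order
        g_o, g_d = gains[step]
        res = residuals[step]
        obj = [g_o * x + r for x, r in zip(downmix, res)]
        if step > 0:
            downmix = [g_d * x - r for x, r in zip(downmix, res)]
        else:
            downmix = [x - o for x, o in zip(downmix, obj)]   # what remains is the first object
        recovered.append(obj)
    recovered.append(downmix)
    return recovered[::-1]

# Hand-computed side information for objects [4, 0], [0, 2], [1, 1] in coding order.
final = [5.0, 3.0]
gains = [(0.2, 0.8), (1 / 11, 10 / 11)]
residuals = [[-0.8, 1.6], [6 / 11, 8 / 11]]
objects = decode_stepwise(final, gains, residuals)
print(objects)  # approximately [[4, 0], [0, 2], [1, 1]], up to floating-point rounding
```

With untruncated, unquantized residuals the reconstruction is exact up to rounding; with the rank-r, quantized residuals of the actual method it is only approximate.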

Step B6: Convert the frequency-domain audio object signals to the time domain with the inverse time-frequency transform.

In this embodiment, the progressively reconstructed object signals are still frequency-domain signals and must be converted to the time domain by the inverse time-frequency transform before rendering, personalized interaction, playback, and other subsequent functions. The inverse transform in the decoding method therefore applies de-windowing and the inverse modified discrete cosine transform (IMDCT) to the frequency-domain object signals to obtain continuous time-domain signals.

Compared with existing audio object coding methods, the advantages and features of the present invention are as follows:

Multi-step progressive encoding and decoding uses residuals to compensate for decoding distortion as far as possible, ensuring that every audio object has good listening quality; in addition, singular value decomposition is introduced to decompose and compress the residual information and reduce the bit rate. The present invention can therefore decode high-quality audio objects at medium and low bit rates, meeting the requirements of personalized interactive audio systems.

Claims

1. An audio object coding method adapted to a personalized interactive system, comprising the steps of:

step A1: performing frame windowing on an input audio object sequence, converting a time domain signal into a frequency domain signal, and obtaining a time-frequency matrix of each audio object;

step A2: according to the time-frequency matrix of each object, calculating the frequency domain energy of the objects to sort, and determining the object to be coded in each step in multi-step progressive coding;

step A3: downmixing step by step and calculating the corresponding side information according to the determined coding order; the stepwise downmixing is carried out in multiple steps, each step performing matrix addition on the data of the object pair input to the current processing step to obtain a sum matrix, the sum matrix being used as one of the inputs of the next downmixing step; the object pair refers to the two input signals to be processed: in the first downmixing step the object pair comprises two audio objects, in the second and later steps the object pair comprises one audio object and the intermediate downmix signal obtained in the previous step, and the output of the last step is the final downmix signal; wherein the intermediate downmix signals are not transmitted in the code stream; the side information comprises an object residual and an object gain parameter matrix; the object gain parameter is calculated from the energy ratio of the two input signals in the object pair;

step A4: decomposing the object residual error in the side information into a left singular matrix, a right singular matrix and singular values by singular value decomposition;

step A5: quantizing the singular matrix, the singular value and the object gain parameter to obtain a side information code stream;

step A6: coding the final downmix signal in the step A3 to obtain a downmix signal code stream;

step A7: and synthesizing the code streams obtained in the step A5 and the step A6 into an output code stream, and transmitting the output code stream to a decoding end.
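
The step-by-step downmixing of step A3 can be illustrated with a minimal Python sketch. It assumes the time-frequency matrices are plain nested lists of equal size and already arranged in coding order; the function names `add_matrices` and `progressive_downmix` are hypothetical, and the side information (residuals and gain parameters) is deliberately left out.

```python
def add_matrices(a, b):
    """Element-wise sum of two equally sized time-frequency matrices."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def progressive_downmix(ordered_objects):
    """Return the downmix produced by every step; the last entry is the final downmix.

    ordered_objects: list of time-frequency matrices, already in coding order.
    """
    if len(ordered_objects) < 2:
        raise ValueError("need at least two objects to form the first object pair")
    # Step 1: the object pair consists of two audio objects.
    downmix = add_matrices(ordered_objects[0], ordered_objects[1])
    steps = [downmix]
    # Later steps: the pair is one audio object plus the previous intermediate downmix.
    for obj in ordered_objects[2:]:
        downmix = add_matrices(obj, downmix)
        steps.append(downmix)
    return steps


# Three toy 2 x 2 objects (illustrative values only).
objects = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.5, 0.5], [0.5, 0.5]],
    [[0.0, 1.0], [1.0, 0.0]],
]
steps = progressive_downmix(objects)
final_downmix = steps[-1]  # only this signal would be coded and transmitted
```

With N objects the loop produces N - 1 downmix signals, of which only the last is transmitted, which matches the statement that intermediate downmix signals are not part of the code stream.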

2. The audio object coding method adapted to a personalized interactive system according to claim 1, characterized in that: in step A1, the one-dimensional sound signal in the original time domain is converted into a two-dimensional spectrogram in the frequency domain by framing, windowing and the modified discrete cosine transform (MDCT), and the resulting object data is output in matrix form.

3. The audio object coding method adapted to a personalized interactive system according to claim 1, characterized in that: in step A2, the frequency domain energy of each object is calculated from the object data in matrix form, the objects are sorted by energy from large to small, and the object to be coded in each step is thereby determined; the coding order means that audio objects with larger energy are coded first;

the frequency domain energy of the object is calculated as follows:

$$O_i = \frac{\lVert S_i \rVert}{\sum_{j=1}^{N} \lVert S_j \rVert}$$

wherein $\lVert S_i \rVert$ denotes the total energy of the i-th audio object, and $N$ is the number of coded objects; $O_i$ denotes the proportion of the i-th object in the total energy of all objects; the objects are sorted by their $O_i$ values from large to small, and objects with larger $O_i$ are coded first.
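
The energy-based ordering of step A2 can be sketched as follows. The sketch assumes that the total energy of an object is the sum of its squared time-frequency coefficients, which the claim does not specify; `total_energy` and `coding_order` are hypothetical helper names.

```python
def total_energy(matrix):
    """Sum of squared coefficients (assumed energy measure, not taken from the claim)."""
    return sum(value * value for row in matrix for value in row)


def coding_order(objects):
    """Return (order, shares): object indices sorted by descending energy share O_i."""
    energies = [total_energy(m) for m in objects]
    total = sum(energies)
    if total == 0:
        raise ValueError("all objects are silent; energy shares are undefined")
    shares = [e / total for e in energies]
    order = sorted(range(len(objects)), key=lambda i: shares[i], reverse=True)
    return order, shares


# Toy objects with energies 2, 8 and 6 (illustrative values only).
objects = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 2.0], [0.0, 0.0]],
    [[1.0, 1.0], [2.0, 0.0]],
]
order, shares = coding_order(objects)
```

Here the second object carries half of the total energy and would be coded first.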

4. The audio object coding method adapted to a personalized interactive system according to claim 1, characterized in that: in step A3, downmixing is performed step by step and the side information of the object coded in each step is calculated, the side information of only one object being calculated and coded in each step;

the calculation formula of the object residual and the object gain parameter is as follows:

wherein $R(i)$ is the residual signal of the (i+1)-th object, $G_o(i)$ is the gain parameter of the (i+1)-th object, and $G_d(i)$ is the gain parameter of the i-th downmix signal; $X_i$ denotes the downmix signal obtained in step i, $P_o(i)$ is the energy of object i, and $P_d(i)$ is the energy of the downmix signal in step i; $N$ denotes the number of objects to be encoded.

5. The audio object coding method adapted to a personalized interactive system according to claim 1, characterized in that: in step A4, the residual matrices of the plurality of objects are compressed by dimensionality reduction using singular value decomposition, limiting the increase in data volume caused by the residual information; each residual matrix is decomposed into three smaller matrices, namely a left singular matrix, a singular value matrix and a right singular matrix; for the singular value matrix, only the values on its diagonal are transmitted.

6. The audio object coding method adapted to a personalized interactive system according to claim 1, characterized in that: in step A5, the side information is quantized by table lookup; the element values of the residual decomposition matrices and of the gain parameter matrix are normalized before quantization; the closest quantization value in the quantization table is then found for each element value, and the corresponding quantization index is output as the side information quantization code stream.
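
The table-lookup quantization of step A5 can be illustrated with a short Python sketch. The claim does not define the normalization or the contents of the quantization table, so both are assumptions here: values are scaled by their largest magnitude, and `TABLE` is a hypothetical 9-level uniform table on [-1, 1]. The scale factor would also have to reach the decoder in some form, which the claim does not describe.

```python
def normalize(values):
    """Scale values into [-1, 1] by the largest magnitude; return (scaled, scale)."""
    scale = max((abs(v) for v in values), default=0.0)
    if scale == 0.0:
        return [0.0 for _ in values], 1.0
    return [v / scale for v in values], scale


def quantize(values, table):
    """Map each value to the index of the closest entry in the quantization table."""
    return [min(range(len(table)), key=lambda k: abs(table[k] - v)) for v in values]


def dequantize(indices, table, scale):
    """Look the indices up again and undo the normalization."""
    return [table[k] * scale for k in indices]


# Hypothetical quantization table: -1.0, -0.75, ..., 0.75, 1.0
TABLE = [-1.0 + 0.25 * k for k in range(9)]

side_info = [4.0, -2.1, 0.3, 1.0]  # illustrative side information values
scaled, scale = normalize(side_info)
indices = quantize(scaled, TABLE)
restored = dequantize(indices, TABLE, scale)
```

Only `indices` would go into the side information code stream; `restored` shows what a decoder using the same table would recover.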

7. The audio object coding method adapted to a personalized interactive system according to claim 1, characterized in that: in step A6, the final downmix signal is encoded by an AAC encoder, and the resulting code stream is output.

8. The audio object coding method adapted to a personalized interactive system according to claim 1, characterized in that: in step A7, synthesizing the output code stream refers to merging the final downmix signal code stream and the side information code stream and adding a flag bit used for identification during parsing; the final downmix signal code stream is the code stream output by AAC coding, and the side information code stream is the quantization index code stream output after the residual decomposition matrices and the gain parameters are quantized.

9. An audio object decoding method adapted to a personalized interactive system, characterized in that: the method is used for decoding a code stream generated by the method of any one of claims 1 to 8;

the specific implementation comprises the following substeps:

step B1: analyzing the received code stream to obtain a side information code stream and a down-mixing signal code stream;

step B2: carrying out AAC decoding on the down-mixed signal code stream to obtain a down-mixed signal;

step B3: the side information is dequantized to obtain a left singular matrix, a right singular matrix, a singular value and an object gain parameter;

step B4: performing matrix synthesis on the left singular matrix, the right singular matrix and the singular value to recover an object residual error;

step B5: decoding in the reverse of the coding order, and cyclically reconstructing the frequency domain signal of each audio object from the transmitted downmix signal by using the side information;

step B6: the audio object signals in the frequency domain are converted to the time domain using a time-frequency transform.

10. The audio object decoding method adapted to a personalized interactive system according to claim 9, characterized in that: in step B4, the matrix synthesis is to multiply the left singular matrix, the singular value matrix, and the right singular matrix to obtain an approximate object residual error.
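
The residual decomposition of step A4 and the matrix synthesis of step B4 can be sketched with NumPy. The sketch assumes that the dimensionality reduction is a truncated SVD that keeps only the largest singular values; the truncation rank, the matrix size, and the names `decompose_residual` and `synthesize_residual` are illustrative assumptions, not values from the claims.

```python
import numpy as np


def decompose_residual(residual, rank):
    """Truncated SVD: keep the `rank` largest singular values and their vectors."""
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    return u[:, :rank], s[:rank], vt[:rank, :]


def synthesize_residual(u, s, vt):
    """Matrix synthesis: U * diag(s) * V^T gives the approximate residual."""
    return (u * s) @ vt  # broadcasting scales each column of U by its singular value


rng = np.random.default_rng(0)
# Hypothetical 64 x 48 residual that is exactly rank 4, so rank-4 truncation loses nothing.
residual = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 48))

u, s, vt = decompose_residual(residual, rank=4)
approx = synthesize_residual(u, s, vt)

# Number of values to transmit: both singular matrices plus the diagonal only.
stored = u.size + s.size + vt.size
```

In this toy case 452 values stand in for the 3072 entries of the residual matrix. A real residual is generally not exactly low rank, so the synthesized matrix is only an approximation whose accuracy depends on the chosen rank.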