CN111564163A

CN111564163A - An RNN-based voice detection method for multiple forgery operations

Info

Publication number: CN111564163A
Application number: CN202010382185.2A
Authority: CN
Inventors: 严迪群; 乌婷婷; 王让定
Original assignee: Ningbo University
Current assignee: Ningbo University
Priority date: 2020-05-08
Filing date: 2020-05-08
Publication date: 2020-08-21
Anticipated expiration: 2040-05-08
Also published as: CN111564163B

Abstract

The invention discloses a method for detecting various fake operations based on RNN, which comprises the following steps: 1) obtaining an original voice sample, performing M kinds of counterfeiting processing on the original voice sample to obtain M voices after counterfeiting operation and 1 original voice without processing, performing feature extraction on the voices to obtain an LFCC matrix of a training voice sample, and sending the LFCC matrix into an RNN classifier network for training to obtain a multi-classification training model; 2) obtaining a section of test voice, extracting the characteristics of the test voice to obtain an LFCC matrix of the test voice data, sending the LFCC matrix into the RNN classifier trained in the step 1) for classification, obtaining an output probability for each test voice, and combining all the output probabilities as a final prediction result: if the prediction result is the original voice, the test voice is recognized as the original voice; if the prediction result is a voice subjected to a certain falsification operation, the test voice is recognized as a falsified voice subjected to a corresponding falsification operation.

Description

An RNN-based voice detection method for multiple forgery operations

技术领域technical field

本发明涉及语音检测方法，尤其是一种基于RNN的多种伪造操作语音检测方法。The invention relates to a voice detection method, in particular to a voice detection method for multiple forgery operations based on RNN.

背景技术Background technique

随着语音编辑软件功能的不断增强，非专业人士也可以很容易对语音内容进行修改。如果有不法分子恶意地对语音进行伪造修改，甚至将修改后的语音用于新闻报道、司法取证以及科学研究等领域，这将会带来巨大的威胁，甚至会对社会稳定造成不可估计的影响。数字语音取证方法是对伪造操作的检测，对于鉴别音频材料的原始性与真实性有至关重要的作用，是当前多媒体取证领域的重点研究课题。With the continuous enhancement of the functions of voice editing software, non-professionals can easily modify the voice content. If some criminals maliciously falsify and modify the voice, or even use the modified voice in news reports, judicial forensics, scientific research and other fields, this will bring huge threats and even have an inestimable impact on social stability. . The digital voice forensics method is the detection of forgery operations, and plays a vital role in identifying the originality and authenticity of audio materials. It is a key research topic in the field of multimedia forensics.

现有的数字语音取证检测技术大部分都是检测到单一的伪造操作，即取证人员假设待检测语音会经过某种特定的伪造操作。Mengyu Qiao等人提出了一种基于量化的MDCT系数及其导数的统计特征的检测算法，检测上转换和下转换的MP3音频文件，通过重新压缩和校准音频来生成参考音频信号，然后用支持向量机进行分类，实验结果表明该方法有效地检测了MP3双重压缩并可以检测数字取证的音频处理历史。如王丽华等人提出了一种基于卷积神经网络的变调语音处理历史检测，通过对三种语音库应用四种不同的变调软件进行变调，并使用CNN对语音的变调因子进行语音库内和库间以及变调方法之间的检测，其检测率达到了90％以上。Most of the existing digital voice forensics detection technologies detect a single forgery operation, that is, forensics personnel assume that the voice to be detected will undergo a specific forgery operation. Mengyu Qiao et al. proposed a detection algorithm based on the statistical features of quantized MDCT coefficients and their derivatives to detect up-converted and down-converted MP3 audio files, recompress and calibrate the audio to generate a reference audio signal, and then use the support vector The experimental results show that the method can effectively detect MP3 double compression and can detect the audio processing history of digital forensics. For example, Wang Lihua et al. proposed a tone-shifting speech processing history detection based on convolutional neural network. By applying four different tone-shifting software to three kinds of voice databases, the tone-shifting software was used, and CNN was used to analyze the tone-shifting factor of the voice in the voice library and the library. The detection rate of between and between the modulation methods has reached more than 90%.

现有的数字语音取证检测技术能够检测到单一的伪造操作，且检测率可以达到很高的水平。但在实际应用中，取证者常常无法预测伪造的具体操作，使用某一特定操作分类器进行检测可能会出现误判。Existing digital voice forensics detection technology can detect a single forgery operation, and the detection rate can reach a very high level. However, in practical applications, forensics are often unable to predict the specific operations of forgery, and using a specific operation classifier for detection may result in misjudgment.

目前，大多数现有适用多种伪造操作的数字取证工作都集中在数字图像领域上，针对数字语音取证的研究仍较少。在数字语音领域，骆伟祺团队设计了一个卷积神经网络模型，可以用于检测两个不同音频编辑软件中默认设置的音频处理操作，并提供了较好的结果，该方法可以显着优于现有的基于手工特征的取证方法。然而，该实验虽然开创性地对语音的多种伪造操作检测进行了研究，却存在着一些不容忽视的问题，如计算的复杂度过高、所针对伪造操作的应用场景过于理想等。At present, most of the existing digital forensics work applicable to a variety of forgery operations focus on the field of digital images, and there are still few researches on digital voice forensics. In the field of digital speech, Luo Weiqi's team designed a convolutional neural network model that can be used to detect audio processing operations with default settings in two different audio editing software, and provided better results. This method can significantly outperform existing Some forensics methods are based on handcrafted features. However, although this experiment has pioneered research on the detection of multiple forgery operations in speech, there are some problems that cannot be ignored, such as the high computational complexity and the ideal application scenarios for forgery operations.

发明内容SUMMARY OF THE INVENTION

本发明所要解决的技术问题是针对上述现有技术存在的不足，提供一种基于RNN的多种伪造操作语音检测方法，能够提高检测准确率。The technical problem to be solved by the present invention is to provide an RNN-based voice detection method for multiple forgery operations, which can improve the detection accuracy.

本发明解决上述技术问题所采用的技术方案为：一种基于RNN的多种伪造操作语音检测方法，其特征在于：包括如下步骤：The technical solution adopted by the present invention to solve the above-mentioned technical problems is: an RNN-based multiple forgery operation voice detection method, which is characterized in that: it comprises the following steps:

1)训练网络：获取原始语音样本，对所述原始语音样本进行M种伪造处理，得到M个伪造操作后的语音和1个未经处理的原始语音，对上述M个伪造后的语音和1个原始语音进行特征提取，得到训练语音样本的LFCC矩阵，送入RNN分类器网络中进行训练，得到一个多分类的训练模型；1) Train the network: obtain the original voice samples, perform M kinds of forgery processing on the original voice samples, obtain M voices after forgery operations and 1 unprocessed original voice, and compare the above-mentioned M forged voices and 1 Extract the features of the original voices, obtain the LFCC matrix of the training voice samples, and send them to the RNN classifier network for training to obtain a multi-class training model;

2)语音识别：得到一段测试语音，对该测试语音进行特征提取，得到测试语音数据的LFCC矩阵，送入由步骤1)训练好的RNN分类器中进行分类，每一个测试语音得到一个输出概率，合并所有输出概率作为最后的预测结果：如果预测结果是原始语音，则测试语音被识别为原始语音；如果预测结果是经过某一伪造操作的语音，则测试语音被识别为进行相应伪造操作的伪造语音。2) Speech recognition: obtain a test voice, perform feature extraction on the test voice, obtain the LFCC matrix of the test voice data, and send it to the RNN classifier trained in step 1) for classification, and each test voice obtains an output probability , combine all output probabilities as the final prediction result: if the prediction result is the original speech, the test speech is recognized as the original speech; if the prediction result is the speech that has undergone a certain forgery operation, the test speech is recognized as the corresponding forgery operation. fake voice.

优选的，在步骤1)和2)中，得到LFCC矩阵的步骤为：Preferably, in steps 1) and 2), the step of obtaining the LFCC matrix is:

1)FFT：首先先对语音进行预处理，计算每一个语音帧经过FFT后的频谱能量E(i,k)：1) FFT: First, preprocess the speech, and calculate the spectral energy E(i,k) of each speech frame after FFT:

其中，i为语音帧数，k为频率分量，x_i(m)为第i帧的语音信号数据，N为傅立叶变换的数量；Wherein, i is the number of speech frames, k is the frequency component, x _i (m) is the speech signal data of the ith frame, and N is the number of Fourier transforms;

然后计算每帧的频谱能量E(i,k)经过三角滤波器组后的能量：Then calculate the spectral energy E(i,k) of each frame after passing through the triangular filter bank:

其中，H_i(k)表示三角滤波器的频率响应，f(l)为第l个三角滤波器的滤波函数，S(i,l)为经过三角滤波器组后的谱线能量，l表示三角滤波器的编号，L为三角滤波器总数；Among them, H _i (k) represents the frequency response of the triangular filter, f(l) is the filter function of the l-th triangular filter, S(i,l) is the spectral line energy after the triangular filter bank, and l represents Number of triangular filters, L is the total number of triangular filters;

2)DCT：利用DCT计算每个三角滤波器组的输出数据lfcc(i,n)：2) DCT: Calculate the output data lfcc(i,n) of each triangular filter bank using DCT:

其中，n代表第i帧DCT后的谱线；Among them, n represents the spectral line after the i-th frame DCT;

3)得到LFCC统计矩：将lfcc(i,n)取12阶LFCC系数，计算均值和相关系数，得到一段语音提取出的LFCC矩阵为：3) Obtain the LFCC statistical moment: take lfcc(i,n) as the 12th-order LFCC coefficient, calculate the mean and correlation coefficient, and obtain the LFCC matrix extracted from a piece of speech as:

其中x_s,1…x_1,n为计算得到的第s帧语音数据的n个LFCC。where x _s,1 ...x _1,n are the calculated n LFCCs of the s-th frame of speech data.

优选的，所述RNN分类器包括LSTM网络，依次连接的Dropout层、全连接层和Softmax层，所述Dropout层与最后一个LSTM网络连接。Preferably, the RNN classifier includes an LSTM network, a Dropout layer, a fully connected layer and a Softmax layer connected in sequence, and the Dropout layer is connected to the last LSTM network.

优选的，所述LSTM网络具有两个，参数分别设置为(64,128)和(128,64)。Preferably, the LSTM network has two, and the parameters are set as (64, 128) and (128, 64) respectively.

优选的，所述LSTM网络使用tanh激活函数。Preferably, the LSTM network uses a tanh activation function.

优选的，所述Dropout层的Dropout函数值为0.5。Preferably, the Dropout function value of the Dropout layer is 0.5.

优选的，所述原始语音为WAV格式。Preferably, the original voice is in WAV format.

与现有技术相比，本发明的优点在于：采用语音倒谱特征，通过循环神经网络分类输出结果概率，提高语音检测的准确率，更适用于数字语音载体，能识别不同的伪造痕迹；通过RNN网络中的共享参数较现有的基于深度学习的方法计算复杂度大大降低。Compared with the prior art, the present invention has the advantages that: the speech cepstrum feature is used, the probability of outputting the result is classified by the cyclic neural network, the accuracy of speech detection is improved, and it is more suitable for digital speech carriers and can identify different forgery traces; The shared parameters in the RNN network are significantly less computationally complex than existing deep learning-based methods.

附图说明Description of drawings

图1为本发明实施例的语音检测方法的LFCC统计矩的提取过程图；Fig. 1 is the extraction process diagram of the LFCC statistical moment of the speech detection method of the embodiment of the present invention;

图2为本发明实施例的语音检测方法的总体框架原理图；2 is a schematic diagram of an overall framework of a speech detection method according to an embodiment of the present invention;

图3为本发明实施例的语音检测方法的网络结构图。FIG. 3 is a network structure diagram of a speech detection method according to an embodiment of the present invention.

具体实施方式Detailed ways

下面详细描述本发明的实施例，所述实施例的示例在附图中示出，其中相同或类似的标号表示相同或类似的元件或具有相同或类似功能的元件。The following describes in detail the embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions.

在本发明的描述中，需要理解的是，术语“中心”、“纵向”、“横向”、“长度”、“宽度”、“厚度”、“上”、“下”、“前”、“后”、“左”、“右”、“竖直”、“水平”、“顶”、“底”“内”、“外”、“顺时针”、“逆时针”、“轴向”、“径向”、“周向”等指示的方位或位置关系为基于附图所示的方位或位置关系，仅是为了便于描述本发明和简化描述，而不是指示或暗示所指的装置或元件必须具有特定的方位、以特定的方位构造和操作，由于本发明所公开的实施例可以按照不同的方向设置，所以这些表示方向的术语只是作为说明而不应视作为限制，比如“上”、“下”并不一定被限定为与重力方向相反或一致的方向。此外，限定有“第一”、“第二”的特征可以明示或者隐含地包括一个或者更多个该特征。In the description of the present invention, it should be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", " Rear, Left, Right, Vertical, Horizontal, Top, Bottom, Inner, Outer, Clockwise, Counterclockwise, Axial, The orientations or positional relationships indicated by "radial direction", "circumferential direction", etc. are based on the orientations or positional relationships shown in the accompanying drawings, which are only for the convenience of describing the present invention and simplifying the description, rather than indicating or implying the indicated devices or elements. must have a specific orientation, be constructed and operated in a specific orientation, and since the disclosed embodiments of the present invention may be arranged in different orientations, these directional terms are for illustration only and should not be regarded as limiting, such as "up", "Down" is not necessarily defined as a direction opposite or in line with the direction of gravity. Furthermore, features delimited with "first", "second" may expressly or implicitly include one or more of that feature.

一种基于RNN(循环神经网络)的多种伪造操作语音检测方法，通过构建一个基于倒谱特征的循环神经网络框架而实现。参见图2，框架由两个部分组成：首先先提取语音样本的倒谱特征，再送入设计好的网络框架中做分类，达到多种伪造操作鉴别的任务。An RNN (Recurrent Neural Network)-based speech detection method for multiple forgery operations is implemented by constructing a cepstral feature-based Recurrent Neural Network framework. Referring to Figure 2, the framework consists of two parts: first, the cepstral features of the speech samples are extracted, and then sent to the designed network framework for classification, so as to achieve the task of identifying various forgery operations.

具体的，在本发明，语音的特征提取通过以下方式实现。本发明采用的倒谱特征为线性频率倒谱系数(Linear Frequency Cepstral Coefficients,LFCC)。语音倒谱特征是语音技术中最常用的特征参数之一，它表征了人类的听觉特征，且被广泛应用于说话人识别。Specifically, in the present invention, the feature extraction of speech is realized in the following manner. The cepstral feature adopted in the present invention is Linear Frequency Cepstral Coefficients (LFCC). Speech cepstrum feature is one of the most commonly used feature parameters in speech technology. It represents the auditory characteristics of human beings and is widely used in speaker recognition.

LFCC是从低频到高频带通滤波器的平均分配。本发明的LFCC统计矩的提取过程可以参见图1：LFCC is an even distribution of bandpass filters from low to high frequencies. The extraction process of the LFCC statistical moment of the present invention can refer to Fig. 1:

1)FFT：首先先对语音进行预处理，计算每一个语音帧经过快速傅里叶变换(FastFourier Transform，FFT)后的频谱能量E(i,k)：1) FFT: First, preprocess the speech, and calculate the spectral energy E(i,k) of each speech frame after Fast Fourier Transform (FFT):

其中，i为语音帧数，k为频率分量，x_i(m)为第i帧的语音信号数据，N为傅立叶变换的数量。Among them, i is the number of speech frames, k is the frequency component, x _i (m) is the speech signal data of the ith frame, and N is the number of Fourier transforms.

计算每帧的频谱能量E(i,k)经过三角滤波器组后的能量：Calculate the spectral energy E(i,k) of each frame after passing through the triangular filter bank:

其中，H_i(k)表示三角滤波器的频率响应，f(l)为第l个三角滤波器的滤波函数，S(i,l)为经过三角滤波器组后的谱线能量，l表示三角滤波器的编号，L为三角滤波器总数。Among them, H _i (k) represents the frequency response of the triangular filter, f(l) is the filter function of the l-th triangular filter, S(i,l) is the spectral line energy after the triangular filter bank, and l represents Number of triangular filters, L is the total number of triangular filters.

2)DCT：然后，利用离散余弦变换(Discrete Cosine Transform，DCT)计算每个三角滤波器组的输出数据lfcc(i,n)：2) DCT: Then, the discrete cosine transform (Discrete Cosine Transform, DCT) is used to calculate the output data lfcc(i,n) of each triangular filter bank:

其中，n代表第i帧DCT后的谱线。Among them, n represents the spectral line after the DCT of the ith frame.

3)得到LFCC统计矩：将lfcc(i,n)取12阶LFCC系数，计算均值和相关系数，上述步骤可通过现有的matlab函数实现，假定某段经预处理后的语音一共有s帧，则该段语音提取出的LFCC矩阵为：3) Obtain the LFCC statistical moment: take lfcc(i,n) as the 12th-order LFCC coefficient, and calculate the mean and correlation coefficient. The above steps can be implemented by existing matlab functions. It is assumed that a certain segment of preprocessed speech has a total of s frames , the LFCC matrix extracted from this segment of speech is:

参见图3，网络框架采用RNN分类器，RNN分类器的网络层数的选择对于优化算法来说至关重要，更深的网络可以学到更多知识，但同时训练也需要花费很长时间而且可能会过度拟合。因此，在本发明中，提出RNN分类器的网络结构如图3所示。网络结构中包含2个LSTM网络，参数分别设置为(64,128)和(128,64)，使用tanh激活函数提高模型性能。还包括依次连接的Dropout层、全连接层(dense)和Softmax层，Dropout层与最后一个LSTM网络连接。设置Dropout函数的值为0.5，这有助于减少过拟合，经过全连接层降维后，使用Softmax层(Softmax分类器)输出概率。网络框架的总体迭代训练设置为50圈。在具体训练时可进行一定的调整。See Figure 3, the network framework adopts RNN classifier. The selection of the number of network layers of the RNN classifier is very important for the optimization algorithm. A deeper network can learn more knowledge, but at the same time, the training also takes a long time and may will overfit. Therefore, in the present invention, the network structure of the proposed RNN classifier is shown in FIG. 3 . The network structure contains 2 LSTM networks, the parameters are set to (64, 128) and (128, 64) respectively, and the tanh activation function is used to improve the model performance. It also includes a Dropout layer, a fully connected layer (dense), and a Softmax layer that are connected in sequence, and the Dropout layer is connected to the last LSTM network. Set the value of the Dropout function to 0.5, which helps to reduce overfitting. After dimensionality reduction by the fully connected layer, the Softmax layer (Softmax classifier) is used to output the probability. The overall iterative training of the network framework is set to 50 epochs. Certain adjustments can be made during specific training.

再参见图2，语音检测方法中，包括如下步骤：2 again, the speech detection method includes the following steps:

1)首先需要先对网络框架进行训练。假设有M种伪造操作，对原始语音分别进行M种伪造处理，得到M+1种语音样本，包括M个伪造操作后的语音和1个未经处理的原始语音。本发明中，对于原始语音的输入有一定的约束，需要提供一定量的WAV格式音频样本库作为网络框架的训练数据。对上述M+1种语音样本进行特征提取，得到训练语音样本的LFCC矩阵，送入设计好的RNN分类器网络中进行训练，得到一个多分类的训练模型；可在数据库中存储多个原始语音样本，对每个原始语音样本都进行特征提取并送入RNN分类器中进行训练；1) First, the network framework needs to be trained. Assuming that there are M kinds of forgery operations, M kinds of forgery processing are performed on the original speech respectively, and M+1 kinds of speech samples are obtained, including M pieces of speech after forgery operations and 1 unprocessed original speech. In the present invention, there are certain constraints on the input of the original speech, and a certain amount of WAV format audio sample library needs to be provided as the training data of the network framework. Perform feature extraction on the above M+1 speech samples to obtain the LFCC matrix of the training speech samples, send them to the designed RNN classifier network for training, and obtain a multi-classified training model; multiple original speech can be stored in the database. Samples are extracted for each original speech sample and sent to the RNN classifier for training;

2)然后，再通过训练后的网络框架得到检测识别结果：当得到一段测试语音时，对其进行特征提取，得到测试语音数据的LFCC矩阵，将其送入已经训练好的RNN分类器中进行分类。每一个测试语音会得到一个输出概率，合并所有输出概率作为最后的预测结果。若预测结果是原始语音，则测试语音被识别为原始语音；若预测结果是经过某一伪造操作的语音，则测试语音被识别为该伪造语音。取证者即可根据实验结果从而判断某段语音是否经过伪造操作。2) Then, the detection and recognition results are obtained through the trained network framework: when a test voice is obtained, feature extraction is performed on it, the LFCC matrix of the test voice data is obtained, and it is sent to the trained RNN classifier. Classification. Each test speech will get an output probability, and all output probabilities are combined as the final prediction result. If the predicted result is the original voice, the test voice is identified as the original voice; if the predicted result is the voice that has undergone a certain forgery operation, the test voice is identified as the forged voice. The forensic person can judge whether a certain speech has been forged according to the experimental results.

Claims

1. a multiple forgery operation voice detection method based on RNN, is characterized in that: comprise the steps:

1) Train the network: obtain the original voice samples, perform M kinds of forgery processing on the original voice samples, obtain M voices after forgery operations and 1 unprocessed original voice, and compare the above-mentioned M forged voices and 1 Extract the features of the original voices, obtain the LFCC matrix of the training voice samples, and send them to the RNN classifier network for training to obtain a multi-class training model;

2) Speech recognition: obtain a test voice, perform feature extraction on the test voice, obtain the LFCC matrix of the test voice data, and send it to the RNN classifier trained in step 1) for classification, and each test voice obtains an output probability , combine all output probabilities as the final prediction result: if the prediction result is the original speech, the test speech is recognized as the original speech; if the prediction result is the speech that has undergone a certain forgery operation, the test speech is recognized as the corresponding forgery operation. fake voice.

2. the multiple forgery operation voice detection method based on RNN according to claim 1, is characterized in that: in step 1) and 2) in, the step that obtains LFCC matrix is:

1) FFT: First, preprocess the speech, and calculate the spectral energy E(i,k) of each speech frame after FFT:

Wherein, i is the number of speech frames, k is the frequency component, x _i (m) is the speech signal data of the ith frame, and N is the number of Fourier transforms;

Then calculate the spectral energy E(i,k) of each frame after passing through the triangular filter bank:

Among them, H _i (k) represents the frequency response of the triangular filter, f(l) is the filter function of the l-th triangular filter, S(i,l) is the spectral line energy after the triangular filter bank, and l represents Number of triangular filters, L is the total number of triangular filters;

2) DCT: Calculate the output data lfcc(i,n) of each triangular filter bank using DCT:

Among them, n represents the spectral line after the i-th frame DCT;

3) Obtain the LFCC statistical moment: take lfcc(i,n) as the 12th-order LFCC coefficient, calculate the mean and correlation coefficient, and obtain the LFCC matrix extracted from a piece of speech as:

where x _s,1 ...x _1,n are the calculated n LFCCs of the s-th frame of speech data.

3. the multiple forgery operation voice detection method based on RNN according to claim 1, it is characterized in that: described RNN classifier comprises LSTM network, the Dropout layer, fully connected layer and Softmax layer that are connected successively, described Dropout layer Connect with the last LSTM network.

4. The RNN-based voice detection method for multiple forgery operations according to claim 3, wherein the LSTM network has two, and the parameters are respectively set as (64, 128) and (128, 64).

5. The RNN-based voice detection method for multiple forgery operations according to claim 3, wherein the LSTM network uses a tanh activation function.

6 . The RNN-based voice detection method for multiple forgery operations according to claim 3 , wherein the Dropout function value of the Dropout layer is 0.5. 7 .

7 . The RNN-based voice detection method for multiple forgery operations according to claim 1 , wherein the original voice is in WAV format. 8 .