CN111816203A - A synthetic speech detection method based on phoneme-level analysis to suppress the influence of phonemes

- Publication number: CN111816203A
- Application number: CN202010572748.4A
- Authority: CN (China)
- Prior art keywords: phoneme, speech, ratio, fraudulent, filter
- Prior art date: 2020-06-22
- Legal status: Pending
Classifications

- G10L25/69: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for evaluating synthetic or decoded voice signals
- G10L25/30: Speech or voice analysis techniques characterised by the analysis technique, using neural networks
- G10L25/45: Speech or voice analysis techniques characterised by the type of analysis window
- G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
- H04L63/0861: Network architectures or network communication protocols for network security, for authentication of entities using biometrical features, e.g. fingerprint, retina-scan
Description
Technical Field

The invention relates to the fields of pattern recognition and speech signal processing, and in particular to a method that uses the F-ratio to analyze the phoneme features of real and synthetic speech, so that real and fake speech can be distinguished more efficiently.
Background

The use of personalized biometric traits for personal identification is now widespread in production and daily life. Personalized biometric traits are physiological characteristics, such as fingerprints, irises, and voiceprints, that remain unique to an individual over a period of time and reflect differences between individuals. Among them, voiceprint recognition, also known as speaker recognition, determines the identity of the speaker from a segment of audio. Voiceprint recognition has certain advantages over fingerprint, face, and iris recognition, such as low implementation cost and simple operation: it requires neither dedicated equipment, as fingerprint recognition does, nor specific user actions, as face recognition does; a single spoken sentence suffices for identity verification. Voiceprint recognition therefore enjoys high user acceptance; its market share has reached 15.8% and continues to rise.
Recently, however, as speech synthesis and voice conversion technology have matured, criminals can use them to easily imitate the acoustic characteristics of a target speaker, break through the defenses of a voiceprint recognition system, and steal other people's information and property. To protect voiceprint recognition systems from attacks with synthetic or converted speech, the need for spoofing attack detection has become increasingly pressing. Research on this technology plays a vital role in the adoption and deployment of voiceprint recognition systems.
Interspeech, the leading international conference on speech, holds the Automatic Speaker Verification Spoofing and Countermeasures Challenge every two years. Analyzing the strategies of the participating teams shows that research on this topic, both in China and abroad, falls mainly into two areas: front-end speech feature analysis and back-end classifier design. On the feature side, commonly used features include Constant Q Cepstral Coefficients (CQCC), obtained via the constant-Q transform, and Linear-Frequency Cepstral Coefficients (LFCC), obtained with linear filterbanks. On the classifier side, besides classic machine-learning models such as the Gaussian Mixture Model (GMM), Linear Discriminant Analysis (LDA), and the Support Vector Machine (SVM), popular deep neural network models such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) have also been applied to this task.
A 2019 study by Gajan Suthokumar et al. showed that the different phonemes produced during speech have different discriminative power for spoofing attack detection, with unvoiced phonemes generally being more discriminative than voiced ones.
Summary of the Invention

To overcome the shortcomings of the prior art, the present invention studies the differences between real and fraudulent speech on individual phonemes in order to improve spoofing attack detection for automatic speaker verification systems. To this end, the invention adopts the following technical solution: a synthetic speech detection method that suppresses the influence of phonemes through phoneme-level analysis. The F-ratio, or variance-ratio test, compares within-class and between-class variance to reveal how the differences between classes are distributed; it is used here to analyze each frequency band of each phoneme in real and fraudulent speech and to identify the frequency ranges that are most useful for distinguishing the two. The filter density in those bands is then increased to obtain a new feature, with which a Gaussian mixture model (GMM) is trained for real speech and another for fraudulent speech. Features extracted from the audio to be identified are fed into both models, and the two models' outputs are scored with the likelihood ratio to produce the final recognition result.
The specific steps are as follows.

Step 1, data preparation:

First, the speech data are annotated, i.e., each phoneme in the audio and its start time are obtained. Each phoneme in real speech and in fraudulent speech is then studied separately: every frame of the speech audio is processed with a uniform subband filterbank, yielding per-band data for each frame of each phoneme.
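As an illustration of this step, the sketch below computes per-frame log energies over a uniform subband split of each frame's power spectrum. The frame length, hop size, band count, and function name are illustrative assumptions, not values fixed by the invention.

```python
# Illustrative sketch of Step 1's uniform subband analysis (parameters assumed).
import numpy as np

def uniform_band_energies(signal, sr=16000, frame_len=400, hop=160, n_bands=64):
    """Return per-frame mean log energy of L uniform subbands,
    shape (n_frames, n_bands)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    feats = np.empty((n_frames, n_bands))
    for i in range(n_frames):
        frame = signal[i * hop:i * hop + frame_len] * window
        spec = np.abs(np.fft.rfft(frame)) ** 2            # power spectrum
        bands = np.array_split(spec, n_bands)             # L uniform bands
        feats[i] = np.log(np.array([b.mean() for b in bands]) + 1e-10)
    return feats
```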
Step 2, data analysis:

The data obtained in the previous step are analyzed with the phoneme-level F-ratio method. The F-ratio value of a frequency band characterizes that band's ability to separate real speech from fraudulent speech: the larger the value, the more discriminative information the band carries. The F-ratio values are then normalized over all bands, and for each band a weighted average over phonemes is computed, using each phoneme's frame count as its weight. The result is the per-band discriminative ability after the influence of individual phonemes has been suppressed; larger values indicate stronger discriminative ability.
Step 3, feature extraction:

Based on the analysis of Step 2, the number of filters is increased in the regions with stronger discriminative ability, raising the filter density there. These filters are applied to the speech signal after framing, windowing, and the short-time Fourier transform, and the discrete cosine transform (DCT) finally yields the new phoneme-influence-suppressed feature.
Step 4, model training:

Features extracted from the training-set audio are used as input to train one GMM for real speech and one for fraudulent speech.
Step 5, score confirmation:

The features extracted from the speech under test are fed into the real-speech model and the fraudulent-speech model for scoring, and the final result is obtained with the likelihood-ratio classification rule.
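One standard way to write this decision rule, consistent with the per-frame averaging described later in the detailed embodiment, is sketched below; the symbols $\lambda_{\text{real}}$, $\lambda_{\text{fraud}}$, and the threshold $\theta$ are notation introduced here for illustration:

$$\Lambda(X)=\frac{1}{N}\sum_{n=1}^{N}\left[\log p\left(x_{n}\mid\lambda_{\text{real}}\right)-\log p\left(x_{n}\mid\lambda_{\text{fraud}}\right)\right]\;\underset{\text{fraudulent}}{\overset{\text{real}}{\gtrless}}\;\theta$$

where $X=\{x_{1},\dots,x_{N}\}$ are the feature frames of the utterance under test and $\lambda$ denotes a trained GMM.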
The detailed procedure of Step 2, data analysis, is as follows.
The phoneme-based F-ratio analysis method PF (Phoneme F-ratio) is used to analyze each frequency band of each phoneme. The idea of PF is to compute, for a phoneme k on the l-th filter, the ratio of the variance between classes to the variance within a class; the higher the value, the more the classes differ in this region. PF is computed as

$$\mathrm{PF}_{k}^{l}=\frac{\frac{1}{T}\sum_{t=1}^{T}\left(\bar{u}_{tk}^{l}-u_{k}^{l}\right)^{2}}{\frac{1}{T}\sum_{t=1}^{T}\frac{1}{N_{tk}}\sum_{j=1}^{N_{tk}}\left(x_{tkj}^{l}-\bar{u}_{tk}^{l}\right)^{2}}$$

where T is the number of speech classes and $N_{tk}$ is the number of frames of the k-th phoneme in the t-th class; $x_{tkj}^{l}$ is the value on the l-th filter in the j-th frame of the k-th phoneme of the t-th class; $\bar{u}_{tk}^{l}$ is the per-frame mean on the l-th filter for the k-th phoneme of the t-th class; and $u_{k}^{l}$ is the corresponding mean over all classes:

$$\bar{u}_{tk}^{l}=\frac{1}{N_{tk}}\sum_{j=1}^{N_{tk}}x_{tkj}^{l},\qquad u_{k}^{l}=\frac{1}{T}\sum_{t=1}^{T}\bar{u}_{tk}^{l}$$
The resulting PF values are then normalized to obtain the Phoneme F-ratio Contribution (PFC) on the l-th filter band:

$$\mathrm{PFC}_{k}^{l}=\frac{\mathrm{PF}_{k}^{l}}{\sum_{i=1}^{L}\mathrm{PF}_{k}^{i}}$$

where L is the number of uniform subband filters. The computed PFC reflects, for each phoneme, the frequency distribution of the information useful for identifying fraudulent speech. Next, the PFC values of the phonemes are averaged, weighted by each phoneme's frame count, to obtain the General F-ratio (GF):

$$\mathrm{GF}^{l}=\sum_{k=1}^{P}\frac{N_{k}}{N}\,\mathrm{PFC}_{k}^{l}$$
where P is the total number of phonemes, $N_{k}=\sum_{t=1}^{T}N_{tk}$ is the total number of frames of phoneme k, and N is the total number of frames over all phonemes:

$$N=\sum_{k=1}^{P}N_{k}$$
Normalizing the computed GF yields the Phoneme Effect Suppressed Spoof Detection Information Distribution (PESSDID):

$$\mathrm{PESSDID}^{l}=\frac{\mathrm{GF}^{l}}{\sum_{i=1}^{L}\mathrm{GF}^{i}}$$
The higher the PESSDID value of filter l, the more information useful for identifying spoofing attacks is carried by that filter's frequency band.
In Step 3, feature extraction:

Apart from the different filter distribution, the rest of the extraction process is standard: before the filterbank, the signal undergoes pre-emphasis, framing, and windowing, followed by the short-time Fourier transform to obtain the spectral features of each frame; the filterbank is then applied to the spectral features, and a DCT yields the final feature.
Features and Beneficial Effects of the Invention

The invention uses the F-ratio method to analyze how the fraudulent attack speech faced by voiceprint recognition systems differs from real speech on different phonemes, and locates the frequency distribution of the information that helps identify fraudulent speech. Based on the analysis, a new feature that suppresses the influence of different phonemes on the recognition task is designed by modifying the filterbank. Preliminary experiments on the ASVspoof 2019 evaluation set give an equal error rate (EER) of 4.16%, a 48.58% relative reduction in error rate compared with the commonly used LFCC feature (baseline system EER 8.09%).
Brief Description of the Drawings

Figure 1 is a flowchart of the extraction of the phoneme-influence-suppressed spoofing attack detection feature based on F-ratio analysis.

Figure 2 is a schematic diagram of the phoneme-influence-suppressed filter distribution obtained from the F-ratio analysis.
Detailed Description

The purpose of the invention is to study the differences between real and fraudulent speech on different phonemes. The F-ratio method is used to compare the frequency bands of each phoneme and find, for each phoneme, the bands that are most useful for identifying fraudulent speech; the filter density used in feature extraction is then increased in those bands. The result is a more robust personalized feature that improves spoofing attack detection for automatic speaker verification systems.
The technical solution that achieves this purpose is as follows.

A synthetic speech detection method, based on F-ratio analysis, that suppresses the influence of phonemes. The F-ratio is used to analyze each frequency band of each phoneme in real and fraudulent speech and to find the frequency ranges most useful for distinguishing the two; the filter density in those bands is increased to obtain a new feature. With this feature, a Gaussian mixture model (GMM) is trained for real speech and another for fraudulent speech; features extracted from the audio to be identified are fed into both models, and the two models' outputs are scored with the likelihood ratio to produce the final recognition result.
The implementation of the system comprises the following steps.
Step 1, data preparation:

First, the speech data are annotated, i.e., information such as each phoneme in the audio and its start time is obtained. Each phoneme in real speech and in fraudulent speech is then studied separately. Every frame of the speech audio is processed with a uniform subband filterbank, yielding per-band data for each frame of each phoneme.
Step 2, data analysis:

The data obtained in the previous step are analyzed with the F-ratio method. The F-ratio value of a frequency band characterizes that band's ability to separate real speech from fraudulent speech: the larger the value, the more discriminative information the band carries. The F-ratio values are then normalized over all bands, and for each band a weighted average over phonemes is computed with each phoneme's frame count as its weight, yielding the per-band discriminative ability after the influence of phonemes has been suppressed; larger values indicate stronger discriminative ability.
Step 3, feature extraction:

Based on the results of Step 2, the number of filters is moderately increased in the regions with stronger discriminative ability, raising the filter density there. These filters are then applied to the speech signal after framing, windowing, and the short-time Fourier transform, and the discrete cosine transform (DCT) finally yields the new phoneme-influence-suppressed feature.
Step 4, model training:

Features extracted from the training-set audio are used as input to train separate GMM models for real speech and fraudulent speech.
Step 5, score confirmation:

The features extracted from the speech under test are fed into the real-speech model and the fraudulent-speech model for scoring, and the final result is obtained with the likelihood-ratio classification rule.
The phoneme-influence-suppressed synthetic speech detection method based on F-ratio analysis implemented by the invention is described below with reference to the drawings; it mainly comprises the following steps.
Step 1, data preparation:

To verify the effect of the invention, spoofing attack detection experiments are carried out on the database of the ASVspoof 2019 challenge. The ASVspoof 2019 database consists of a training set, a development set, and an evaluation set; the training and development sets contain 7 spoofing algorithms based on speech synthesis and voice conversion, while the evaluation set contains 12 spoofing algorithms different from those in the training and development sets. All audio in the database is sampled at 16 kHz. Since the organizers did not provide transcripts for the audio, a speech recognition system was used here to transcribe the spoken content of the 25,380 audio files in the training set, and a speech annotation tool was then used to extract the phoneme information from these files. Every frame of the speech audio is then processed with a uniform subband filterbank, yielding per-band data for each frame of each phoneme.
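Continuing the sketch from Step 1, the per-frame band data could be grouped by phoneme with time-aligned labels as below; the (phoneme, start, end) alignment format, the hop size, and the helper names are assumptions for illustration.

```python
# Illustrative grouping of frame-level band data by phoneme (format assumed).
from collections import defaultdict
import numpy as np

def group_frames_by_phoneme(feats, alignment, sr=16000, hop=160):
    """feats: (n_frames, n_bands) array, e.g. from the earlier
    uniform_band_energies sketch.
    alignment: list of (phoneme, start_sec, end_sec) tuples.
    Returns {phoneme: (n_phoneme_frames, n_bands) array}."""
    groups = defaultdict(list)
    for phoneme, start, end in alignment:
        lo = int(start * sr / hop)                 # first frame of the phoneme
        hi = min(int(end * sr / hop), len(feats))  # one past the last frame
        if hi > lo:
            groups[phoneme].append(feats[lo:hi])
    return {p: np.vstack(chunks) for p, chunks in groups.items()}
```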
Step 2, data analysis:

Here the phoneme-level F-ratio method (Phoneme F-ratio, PF) is used to analyze each frequency band of each phoneme. The idea of PF is to compute, for a phoneme k on the l-th filter, the ratio of the variance between classes to the variance within a class; the higher the value, the more the classes differ in this region. PF is computed as

$$\mathrm{PF}_{k}^{l}=\frac{\frac{1}{T}\sum_{t=1}^{T}\left(\bar{u}_{tk}^{l}-u_{k}^{l}\right)^{2}}{\frac{1}{T}\sum_{t=1}^{T}\frac{1}{N_{tk}}\sum_{j=1}^{N_{tk}}\left(x_{tkj}^{l}-\bar{u}_{tk}^{l}\right)^{2}}$$

where T is the number of speech classes (here only real speech and fraudulent speech are distinguished, so T = 2); $N_{tk}$ is the number of frames of the k-th phoneme in the t-th class; $x_{tkj}^{l}$ is the value on the l-th filter in the j-th frame of the k-th phoneme of the t-th class; $\bar{u}_{tk}^{l}$ is the per-frame mean on the l-th filter for the k-th phoneme of the t-th class; and $u_{k}^{l}$ is the corresponding mean over all classes:

$$\bar{u}_{tk}^{l}=\frac{1}{N_{tk}}\sum_{j=1}^{N_{tk}}x_{tkj}^{l},\qquad u_{k}^{l}=\frac{1}{T}\sum_{t=1}^{T}\bar{u}_{tk}^{l}$$
The resulting PF values are then normalized to obtain the F-ratio contribution of the phoneme on the l-th filter band (Phoneme F-ratio Contribution, PFC):

$$\mathrm{PFC}_{k}^{l}=\frac{\mathrm{PF}_{k}^{l}}{\sum_{i=1}^{L}\mathrm{PF}_{k}^{i}}$$

where L is the number of uniform subband filters. The computed PFC reflects, for each phoneme, the frequency distribution of the information useful for identifying fraudulent speech. Next, to suppress the influence of the differences between phonemes on the recognition process, the PFC values of the phonemes are averaged, weighted by each phoneme's frame count, to obtain the General F-ratio (GF):

$$\mathrm{GF}^{l}=\sum_{k=1}^{P}\frac{N_{k}}{N}\,\mathrm{PFC}_{k}^{l}$$
where P is the total number of phonemes, $N_{k}=\sum_{t=1}^{T}N_{tk}$ is the total number of frames of phoneme k, and N is the total number of frames over all phonemes:

$$N=\sum_{k=1}^{P}N_{k}$$
Normalizing the computed GF yields the Phoneme Effect Suppressed Spoof Detection Information Distribution (PESSDID):

$$\mathrm{PESSDID}^{l}=\frac{\mathrm{GF}^{l}}{\sum_{i=1}^{L}\mathrm{GF}^{i}}$$
The higher the PESSDID value of filter l, the more information useful for identifying spoofing attacks is carried by that filter's frequency band.
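A minimal NumPy sketch of the PF, PFC, GF, PESSDID chain as formulated above; the nested `data[t][k]` container, holding an (N_tk, L) array per class t and phoneme k, is an assumed layout.

```python
# Illustrative computation of PESSDID from per-class, per-phoneme frame data.
import numpy as np

def pessdid(data):
    """data: list of length T (t = 0 real, t = 1 fraudulent), each a dict
    mapping phoneme -> (N_tk, L) array of per-frame filter values."""
    T = len(data)
    pfc_rows, weights = [], []
    for k in data[0]:
        means = np.stack([data[t][k].mean(axis=0) for t in range(T)])  # u_bar_tk
        grand = means.mean(axis=0)                                     # u_k
        between = ((means - grand) ** 2).mean(axis=0)
        within = np.stack([data[t][k].var(axis=0) for t in range(T)]).mean(axis=0)
        pf = between / (within + 1e-12)                                # PF_k
        pfc_rows.append(pf / pf.sum())                                 # PFC_k
        weights.append(sum(len(data[t][k]) for t in range(T)))         # N_k
    pfc = np.stack(pfc_rows)
    w = np.asarray(weights, dtype=float)
    gf = (w[:, None] / w.sum() * pfc).sum(axis=0)                      # GF
    return gf / gf.sum()                                               # PESSDID
```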
Step 3, feature extraction:

Based on the distribution obtained in the previous step, the filterbank of the new feature proposed by the invention is designed: the number of filters is moderately increased in the bands carrying more information and moderately reduced in the bands carrying less; the adjusted filterbank is shown in Figure 2. Apart from the filter distribution, the rest of the feature extraction is the same as in traditional methods: before the filterbank, the signal undergoes pre-emphasis, framing, and windowing, followed by the short-time Fourier transform to obtain the spectral features of each frame; the filterbank is then applied to the spectral features, and a DCT yields the final feature.
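The sketch below illustrates the described pipeline with a triangular filterbank whose center frequencies would be placed more densely in high-PESSDID bands. The center-frequency input (which must include the lower and upper boundary points, mel-filterbank style), the triangular filter shape, the FFT size, and the cepstral count are illustrative assumptions.

```python
# Illustrative feature extraction: pre-emphasis, framing, windowing, STFT,
# non-uniform filterbank, log, DCT (parameters and filter shape assumed).
import numpy as np
from scipy.fftpack import dct

def extract_feature(signal, centers_hz, sr=16000, frame_len=400, hop=160,
                    n_fft=512, n_ceps=20, preemph=0.97):
    signal = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2   # (n_frames, n_fft//2+1)

    # Triangular filters on the given centers (denser where PESSDID is high).
    bins = np.round(np.asarray(centers_hz) / sr * n_fft).astype(int)
    fbank = np.zeros((len(bins) - 2, n_fft // 2 + 1))
    for m in range(1, len(bins) - 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge

    log_energies = np.log(power @ fbank.T + 1e-10)
    return dct(log_energies, type=2, axis=1, norm='ortho')[:, :n_ceps]
```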
Step 4, model training:

When training the models, the speech in the training set no longer needs phoneme annotation; the raw audio is processed directly with the new feature extraction method. According to the real/fraudulent labels of the training audio, the resulting features are used to train a GMM model for real speech and a GMM model for fraudulent speech, respectively.
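A sketch of this training step using scikit-learn's GaussianMixture; the library choice and the mixture size (512 components is common in GMM anti-spoofing baselines) are assumptions, as the patent does not specify them.

```python
# Illustrative GMM training for the two speech classes (library assumed).
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmms(real_feats, fraud_feats, n_components=512):
    """Each argument: list of (n_frames, n_ceps) feature arrays.
    Returns (gmm_real, gmm_fraud)."""
    gmm_real = GaussianMixture(n_components, covariance_type='diag',
                               max_iter=100).fit(np.vstack(real_feats))
    gmm_fraud = GaussianMixture(n_components, covariance_type='diag',
                                max_iter=100).fit(np.vstack(fraud_feats))
    return gmm_real, gmm_fraud
```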
Step 5, score confirmation:

After the audio under test has been framed and otherwise processed and the new feature extracted, the feature is fed into the real-speech and fraudulent-speech GMM models for scoring. Specifically, each frame of an utterance is input to a GMM model in turn to obtain a similarity score, and the scores of all frames are averaged to give the utterance's score under that model. Finally, the likelihood-ratio method compares the utterance's scores under the two models to produce the final result.
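Under the same assumptions, the scoring step could look like the sketch below: per-frame log-likelihoods are averaged per model, and the difference of the two averages is the likelihood-ratio score; the decision threshold is an assumption.

```python
# Illustrative utterance scoring and decision (threshold assumed).
def score_utterance(feats, gmm_real, gmm_fraud):
    """feats: (n_frames, n_ceps). Positive scores favor real speech."""
    # score_samples gives per-frame log-likelihoods; averaging them gives
    # the utterance-level score under each model.
    s_real = gmm_real.score_samples(feats).mean()
    s_fraud = gmm_fraud.score_samples(feats).mean()
    return s_real - s_fraud                        # log-likelihood ratio

def decide(feats, gmm_real, gmm_fraud, threshold=0.0):
    score = score_utterance(feats, gmm_real, gmm_fraud)
    return "real" if score > threshold else "fraudulent"
```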
The experimental results are evaluated with the equal error rate (EER), i.e., the error rate at the operating point where the false acceptance rate (FAR) equals the false rejection rate (FRR).
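For completeness, a minimal sketch of EER computation from utterance-level scores; sweeping the threshold over the pooled scores is one common way to locate the point where FAR and FRR cross.

```python
# Illustrative EER computation from two NumPy arrays of scores.
import numpy as np

def compute_eer(real_scores, fraud_scores):
    thresholds = np.sort(np.concatenate([real_scores, fraud_scores]))
    far = np.array([(fraud_scores >= t).mean() for t in thresholds])  # false accepts
    frr = np.array([(real_scores < t).mean() for t in thresholds])    # false rejects
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2
```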
The above are merely preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (4)
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010572748.4A | 2020-06-22 | 2020-06-22 | A synthetic speech detection method based on phoneme-level analysis to suppress the influence of phonemes |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN111816203A | 2020-10-23 |
Patent Citations (11)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US4829572A (en) * | 1987-11-05 | 1989-05-09 | Andrew Ho Chung | Speech recognition system |
| JPH02252000A (en) * | 1989-03-27 | 1990-10-09 | Nippon Telegr & Teleph Corp <Ntt> | Formation of waveform element |
| GB9709696D0 (en) * | 1996-05-15 | 1997-07-02 | Atr Intrepreting Telecommunica | Speech synthesizer apparatus |
| US6366883B1 (en) * | 1996-05-15 | 2002-04-02 | Atr Interpreting Telecommunications | Concatenation of speech segments by use of a speech synthesizer |
| US6173263B1 (en) * | 1998-08-31 | 2001-01-09 | At&T Corp. | Method and system for performing concatenative speech synthesis using half-phonemes |
| AU2001285721A1 (en) * | 2000-09-06 | 2002-03-22 | Pharmexa A/S | Method for down-regulating IgE |
| KR20030081537A (en) * | 2002-04-11 | 2003-10-22 | 주식회사 언어과학 | System and Method for detecting error type by phoneme, and System and method using the same |
| GB0219870D0 (en) * | 2002-08-27 | 2002-10-02 | 20 20 Speech Ltd | Speech synthesis apparatus and method |
| CN101930733A (en) * | 2010-09-03 | 2010-12-29 | 中国科学院声学研究所 | A Speech Emotion Feature Extraction Method for Speech Emotion Recognition |
| CN107680602A (en) * | 2017-08-24 | 2018-02-09 | 平安科技(深圳)有限公司 | Voice fraud recognition methods, device, terminal device and storage medium |
| CN109448759A (en) * | 2018-12-28 | 2019-03-08 | 武汉大学 | A kind of anti-voice authentication spoofing attack detection method based on gas explosion sound |
Non-Patent Citations (4)
| Title |
|---|
| Gajan Suthokumar et al., "Phoneme Specific Modelling and Scoring Techniques for Anti Spoofing System," 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6106-6110 |
| Zhang Jian, Xu Jie, Bao Xiuguo, Zhou Ruohua, Yan Yonghong, "Weighted phoneme log-likelihood-ratio features for language identification," Journal of Tsinghua University (Science and Technology), no. 10 |
| Xuan Chengjun, "Speaker feature extraction suppressing the influence of phonemes based on speech frequency characteristics," China Doctoral Dissertations Full-text Database, Information Science and Technology (monthly), pp. 38-51 |
| Chen Xiaopeng, Peng Yaxiong, He Song, "Research on the time-varying robustness of PLDA-based speaker recognition," Microcomputer & Its Applications, no. 05 |
Cited By (12)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112349267A (en) * | 2020-10-28 | 2021-02-09 | 天津大学 | Synthesized voice detection method based on attention mechanism characteristics |
| CN112349267B (en) * | 2020-10-28 | 2023-03-21 | 天津大学 | Synthesized voice detection method based on attention mechanism characteristics |
| CN114822587A (en) * | 2021-01-19 | 2022-07-29 | 四川大学 | An Audio Feature Compression Method Based on Constant Q Transform |
| CN114822587B (en) * | 2021-01-19 | 2023-07-14 | 四川大学 | An Audio Feature Compression Method Based on Constant Q Transform |
| CN113257255A (en) * | 2021-07-06 | 2021-08-13 | 北京远鉴信息技术有限公司 | Method and device for identifying forged voice, electronic equipment and storage medium |
| CN113362814A (en) * | 2021-08-09 | 2021-09-07 | 中国科学院自动化研究所 | Voice identification model compression method fusing combined model information |
| CN113362814B (en) * | 2021-08-09 | 2021-11-09 | 中国科学院自动化研究所 | Voice identification model compression method fusing combined model information |
| CN113488074A (en) * | 2021-08-20 | 2021-10-08 | 四川大学 | Long-time variable Q time-frequency conversion algorithm of audio signal and application thereof |
| CN113488074B (en) * | 2021-08-20 | 2023-06-23 | 四川大学 | A 2D Time-Frequency Feature Generation Method for Detecting Synthetic Speech |
| CN114550704A (en) * | 2022-01-26 | 2022-05-27 | 浙江大学 | Method and system for training voice confrontation sample recognition model |
| CN114550704B (en) * | 2022-01-26 | 2024-11-19 | 浙江大学 | A speech adversarial sample recognition model training method and system |
| US20250006205A1 (en) * | 2023-06-28 | 2025-01-02 | University Of Florida Research Foundation, Incorporated | Detecting deepfake audio using turbulence |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | Application publication date: 2020-10-23 |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | |