CN111816203A - A synthetic speech detection method based on phoneme-level analysis to suppress the influence of phonemes

- Publication number: CN111816203A
- Application number: CN202010572748.4A
- Authority: CN (China)
- Prior art keywords: phoneme, speech, ratio, fraudulent, filter
- Prior art date: 2020-06-22
- Legal status: Pending
Classifications

- G10L25/69: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for evaluating synthetic or decoded voice signals
- G10L25/30: Speech or voice analysis techniques characterised by the analysis technique, using neural networks
- G10L25/45: Speech or voice analysis techniques characterised by the type of analysis window
- G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
- H04L63/0861: Network architectures or network communication protocols for network security, for authentication of entities using biometrical features, e.g. fingerprint, retina-scan
Description
Technical Field

The invention relates to the fields of pattern recognition and speech signal processing, and in particular to a method that uses the F-ratio to analyze the phoneme features of real and synthetic speech, so that real and fake speech can be distinguished more efficiently.
Background

The use of personalized biometric traits for personal identification is now widespread in production and daily life. Personalized biometric traits are physiological characteristics, such as fingerprints, irises, and voiceprints, that remain unique to an individual over a period of time and reflect differences between individuals. Among them, voiceprint recognition, also known as speaker recognition, determines the identity of the speaker from a segment of audio. Voiceprint recognition has certain advantages over fingerprint, face, and iris recognition, such as low implementation cost and simple operation: it requires neither dedicated equipment, as fingerprint recognition does, nor specific user actions, as face recognition does; a single spoken sentence suffices for identity verification. Voiceprint recognition therefore enjoys high user acceptance; its market share has reached 15.8% and continues to rise.
Recently, however, as speech synthesis and voice conversion technology have matured, criminals can use them to easily imitate the acoustic characteristics of a target speaker, break through the defenses of a voiceprint recognition system, and steal other people's information and property. To protect voiceprint recognition systems from attacks with synthetic or converted speech, the need for spoofing attack detection has become increasingly pressing. Research on this technology plays a vital role in the adoption and deployment of voiceprint recognition systems.
Interspeech, the leading international conference on speech, holds the Automatic Speaker Verification Spoofing and Countermeasures Challenge every two years. Analyzing the strategies of the participating teams shows that research on this topic, both in China and abroad, falls mainly into two areas: front-end speech feature analysis and back-end classifier design. On the feature side, commonly used features include Constant Q Cepstral Coefficients (CQCC), obtained via the constant-Q transform, and Linear-Frequency Cepstral Coefficients (LFCC), obtained with linear filterbanks. On the classifier side, besides classic machine-learning models such as the Gaussian Mixture Model (GMM), Linear Discriminant Analysis (LDA), and the Support Vector Machine (SVM), popular deep neural network models such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) have also been applied to this task.
A 2019 study by Gajan Suthokumar et al. showed that the different phonemes produced during speech have different discriminative power for spoofing attack detection, with unvoiced phonemes generally being more discriminative than voiced ones.
Summary of the Invention

To overcome the shortcomings of the prior art, the present invention studies the differences between real and fraudulent speech on individual phonemes in order to improve spoofing attack detection for automatic speaker verification systems. To this end, the invention adopts the following technical solution: a synthetic speech detection method that suppresses the influence of phonemes through phoneme-level analysis. The F-ratio, or variance-ratio test, compares within-class and between-class variance to reveal how the differences between classes are distributed; it is used here to analyze each frequency band of each phoneme in real and fraudulent speech and to identify the frequency ranges that are most useful for distinguishing the two. The filter density in those bands is then increased to obtain a new feature, with which a Gaussian mixture model (GMM) is trained for real speech and another for fraudulent speech. Features extracted from the audio to be identified are fed into both models, and the two models' outputs are scored with the likelihood ratio to produce the final recognition result.
The specific steps are as follows.

Step 1, data preparation:

First, the speech data are annotated, i.e., each phoneme in the audio and its start time are obtained. Each phoneme in real speech and in fraudulent speech is then studied separately: every frame of the speech audio is processed with a uniform subband filterbank, yielding per-band data for each frame of each phoneme.
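As an illustration of this step, the sketch below computes per-frame log energies over a uniform subband split of each frame's power spectrum. The frame length, hop size, band count, and function name are illustrative assumptions, not values fixed by the invention.

```python
# Illustrative sketch of Step 1's uniform subband analysis (parameters assumed).
import numpy as np

def uniform_band_energies(signal, sr=16000, frame_len=400, hop=160, n_bands=64):
    """Return per-frame mean log energy of L uniform subbands,
    shape (n_frames, n_bands)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    feats = np.empty((n_frames, n_bands))
    for i in range(n_frames):
        frame = signal[i * hop:i * hop + frame_len] * window
        spec = np.abs(np.fft.rfft(frame)) ** 2            # power spectrum
        bands = np.array_split(spec, n_bands)             # L uniform bands
        feats[i] = np.log(np.array([b.mean() for b in bands]) + 1e-10)
    return feats
```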
Step 2, data analysis:

The data obtained in the previous step are analyzed with the phoneme-level F-ratio method. The F-ratio value of a frequency band characterizes that band's ability to separate real speech from fraudulent speech: the larger the value, the more discriminative information the band carries. The F-ratio values are then normalized over all bands, and for each band a weighted average over phonemes is computed, using each phoneme's frame count as its weight. The result is the per-band discriminative ability after the influence of individual phonemes has been suppressed; larger values indicate stronger discriminative ability.
Step 3, feature extraction:

Based on the analysis of Step 2, the number of filters is increased in the regions with stronger discriminative ability, raising the filter density there. These filters are applied to the speech signal after framing, windowing, and the short-time Fourier transform, and the discrete cosine transform (DCT) finally yields the new phoneme-influence-suppressed feature.
Step 4, model training:

Features extracted from the training-set audio are used as input to train one GMM for real speech and one for fraudulent speech.
Step 5, score confirmation:

The features extracted from the speech under test are fed into the real-speech model and the fraudulent-speech model for scoring, and the final result is obtained with the likelihood-ratio classification rule.
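One standard way to write this decision rule, consistent with the per-frame averaging described later in the detailed embodiment, is sketched below; the symbols $\lambda_{\text{real}}$, $\lambda_{\text{fraud}}$, and the threshold $\theta$ are notation introduced here for illustration:

$$\Lambda(X)=\frac{1}{N}\sum_{n=1}^{N}\left[\log p\left(x_{n}\mid\lambda_{\text{real}}\right)-\log p\left(x_{n}\mid\lambda_{\text{fraud}}\right)\right]\;\underset{\text{fraudulent}}{\overset{\text{real}}{\gtrless}}\;\theta$$

where $X=\{x_{1},\dots,x_{N}\}$ are the feature frames of the utterance under test and $\lambda$ denotes a trained GMM.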
The detailed procedure of Step 2, data analysis, is as follows.
The phoneme-based F-ratio analysis method PF (Phoneme F-ratio) is used to analyze each frequency band of each phoneme. The idea of PF is to compute, for a phoneme k on the l-th filter, the ratio of the variance between classes to the variance within a class; the higher the value, the more the classes differ in this region. PF is computed as

$$\mathrm{PF}_{k}^{l}=\frac{\frac{1}{T}\sum_{t=1}^{T}\left(\bar{u}_{tk}^{l}-u_{k}^{l}\right)^{2}}{\frac{1}{T}\sum_{t=1}^{T}\frac{1}{N_{tk}}\sum_{j=1}^{N_{tk}}\left(x_{tkj}^{l}-\bar{u}_{tk}^{l}\right)^{2}}$$

where T is the number of speech classes and $N_{tk}$ is the number of frames of the k-th phoneme in the t-th class; $x_{tkj}^{l}$ is the value on the l-th filter in the j-th frame of the k-th phoneme of the t-th class; $\bar{u}_{tk}^{l}$ is the per-frame mean on the l-th filter for the k-th phoneme of the t-th class; and $u_{k}^{l}$ is the corresponding mean over all classes:

$$\bar{u}_{tk}^{l}=\frac{1}{N_{tk}}\sum_{j=1}^{N_{tk}}x_{tkj}^{l},\qquad u_{k}^{l}=\frac{1}{T}\sum_{t=1}^{T}\bar{u}_{tk}^{l}$$
The resulting PF values are then normalized to obtain the Phoneme F-ratio Contribution (PFC) on the l-th filter band:

$$\mathrm{PFC}_{k}^{l}=\frac{\mathrm{PF}_{k}^{l}}{\sum_{i=1}^{L}\mathrm{PF}_{k}^{i}}$$

where L is the number of uniform subband filters. The computed PFC reflects, for each phoneme, the frequency distribution of the information useful for identifying fraudulent speech. Next, the PFC values of the phonemes are averaged, weighted by each phoneme's frame count, to obtain the General F-ratio (GF):

$$\mathrm{GF}^{l}=\sum_{k=1}^{P}\frac{N_{k}}{N}\,\mathrm{PFC}_{k}^{l}$$
where P is the total number of phonemes, $N_{k}=\sum_{t=1}^{T}N_{tk}$ is the total number of frames of phoneme k, and N is the total number of frames over all phonemes:

$$N=\sum_{k=1}^{P}N_{k}$$
Normalizing the computed GF yields the Phoneme Effect Suppressed Spoof Detection Information Distribution (PESSDID):

$$\mathrm{PESSDID}^{l}=\frac{\mathrm{GF}^{l}}{\sum_{i=1}^{L}\mathrm{GF}^{i}}$$
The higher the PESSDID value of filter l, the more information useful for identifying spoofing attacks is carried by that filter's frequency band.
In Step 3, feature extraction:

Apart from the different filter distribution, the rest of the extraction process is standard: before the filterbank, the signal undergoes pre-emphasis, framing, and windowing, followed by the short-time Fourier transform to obtain the spectral features of each frame; the filterbank is then applied to the spectral features, and a DCT yields the final feature.
Features and Beneficial Effects of the Invention

The invention uses the F-ratio method to analyze how the fraudulent attack speech faced by voiceprint recognition systems differs from real speech on different phonemes, and locates the frequency distribution of the information that helps identify fraudulent speech. Based on the analysis, a new feature that suppresses the influence of different phonemes on the recognition task is designed by modifying the filterbank. Preliminary experiments on the ASVspoof 2019 evaluation set give an equal error rate (EER) of 4.16%, a 48.58% relative reduction in error rate compared with the commonly used LFCC feature (baseline system EER 8.09%).
Brief Description of the Drawings

Figure 1 is a flowchart of the extraction of the phoneme-influence-suppressed spoofing attack detection feature based on F-ratio analysis.

Figure 2 is a schematic diagram of the phoneme-influence-suppressed filter distribution obtained from the F-ratio analysis.
Detailed Description

The purpose of the invention is to study the differences between real and fraudulent speech on different phonemes. The F-ratio method is used to compare the frequency bands of each phoneme and find, for each phoneme, the bands that are most useful for identifying fraudulent speech; the filter density used in feature extraction is then increased in those bands. The result is a more robust personalized feature that improves spoofing attack detection for automatic speaker verification systems.
The technical solution that achieves this purpose is as follows.

A synthetic speech detection method, based on F-ratio analysis, that suppresses the influence of phonemes. The F-ratio is used to analyze each frequency band of each phoneme in real and fraudulent speech and to find the frequency ranges most useful for distinguishing the two; the filter density in those bands is increased to obtain a new feature. With this feature, a Gaussian mixture model (GMM) is trained for real speech and another for fraudulent speech; features extracted from the audio to be identified are fed into both models, and the two models' outputs are scored with the likelihood ratio to produce the final recognition result.
The implementation of the system comprises the following steps.
Step 1, data preparation:

First, the speech data are annotated, i.e., information such as each phoneme in the audio and its start time is obtained. Each phoneme in real speech and in fraudulent speech is then studied separately. Every frame of the speech audio is processed with a uniform subband filterbank, yielding per-band data for each frame of each phoneme.
Step 2, data analysis:

The data obtained in the previous step are analyzed with the F-ratio method. The F-ratio value of a frequency band characterizes that band's ability to separate real speech from fraudulent speech: the larger the value, the more discriminative information the band carries. The F-ratio values are then normalized over all bands, and for each band a weighted average over phonemes is computed with each phoneme's frame count as its weight, yielding the per-band discriminative ability after the influence of phonemes has been suppressed; larger values indicate stronger discriminative ability.
Step 3, feature extraction:

Based on the results of Step 2, the number of filters is moderately increased in the regions with stronger discriminative ability, raising the filter density there. These filters are then applied to the speech signal after framing, windowing, and the short-time Fourier transform, and the discrete cosine transform (DCT) finally yields the new phoneme-influence-suppressed feature.
Step 4, model training:

Features extracted from the training-set audio are used as input to train separate GMM models for real speech and fraudulent speech.
Step 5, score confirmation:

The features extracted from the speech under test are fed into the real-speech model and the fraudulent-speech model for scoring, and the final result is obtained with the likelihood-ratio classification rule.
The phoneme-influence-suppressed synthetic speech detection method based on F-ratio analysis implemented by the invention is described below with reference to the drawings; it mainly comprises the following steps.
Step 1, data preparation:

To verify the effect of the invention, spoofing attack detection experiments are carried out on the database of the ASVspoof 2019 challenge. The ASVspoof 2019 database consists of a training set, a development set, and an evaluation set; the training and development sets contain 7 spoofing algorithms based on speech synthesis and voice conversion, while the evaluation set contains 12 spoofing algorithms different from those in the training and development sets. All audio in the database is sampled at 16 kHz. Since the organizers did not provide transcripts for the audio, a speech recognition system was used here to transcribe the spoken content of the 25,380 audio files in the training set, and a speech annotation tool was then used to extract the phoneme information from these files. Every frame of the speech audio is then processed with a uniform subband filterbank, yielding per-band data for each frame of each phoneme.
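Continuing the sketch from Step 1, the per-frame band data could be grouped by phoneme with time-aligned labels as below; the (phoneme, start, end) alignment format, the hop size, and the helper names are assumptions for illustration.

```python
# Illustrative grouping of frame-level band data by phoneme (format assumed).
from collections import defaultdict
import numpy as np

def group_frames_by_phoneme(feats, alignment, sr=16000, hop=160):
    """feats: (n_frames, n_bands) array, e.g. from the earlier
    uniform_band_energies sketch.
    alignment: list of (phoneme, start_sec, end_sec) tuples.
    Returns {phoneme: (n_phoneme_frames, n_bands) array}."""
    groups = defaultdict(list)
    for phoneme, start, end in alignment:
        lo = int(start * sr / hop)                 # first frame of the phoneme
        hi = min(int(end * sr / hop), len(feats))  # one past the last frame
        if hi > lo:
            groups[phoneme].append(feats[lo:hi])
    return {p: np.vstack(chunks) for p, chunks in groups.items()}
```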
Step 2, data analysis:

Here the phoneme-level F-ratio method (Phoneme F-ratio, PF) is used to analyze each frequency band of each phoneme. The idea of PF is to compute, for a phoneme k on the l-th filter, the ratio of the variance between classes to the variance within a class; the higher the value, the more the classes differ in this region. PF is computed as

$$\mathrm{PF}_{k}^{l}=\frac{\frac{1}{T}\sum_{t=1}^{T}\left(\bar{u}_{tk}^{l}-u_{k}^{l}\right)^{2}}{\frac{1}{T}\sum_{t=1}^{T}\frac{1}{N_{tk}}\sum_{j=1}^{N_{tk}}\left(x_{tkj}^{l}-\bar{u}_{tk}^{l}\right)^{2}}$$

where T is the number of speech classes (here only real speech and fraudulent speech are distinguished, so T = 2); $N_{tk}$ is the number of frames of the k-th phoneme in the t-th class; $x_{tkj}^{l}$ is the value on the l-th filter in the j-th frame of the k-th phoneme of the t-th class; $\bar{u}_{tk}^{l}$ is the per-frame mean on the l-th filter for the k-th phoneme of the t-th class; and $u_{k}^{l}$ is the corresponding mean over all classes:

$$\bar{u}_{tk}^{l}=\frac{1}{N_{tk}}\sum_{j=1}^{N_{tk}}x_{tkj}^{l},\qquad u_{k}^{l}=\frac{1}{T}\sum_{t=1}^{T}\bar{u}_{tk}^{l}$$
The resulting PF values are then normalized to obtain the F-ratio contribution of the phoneme on the l-th filter band (Phoneme F-ratio Contribution, PFC):

$$\mathrm{PFC}_{k}^{l}=\frac{\mathrm{PF}_{k}^{l}}{\sum_{i=1}^{L}\mathrm{PF}_{k}^{i}}$$

where L is the number of uniform subband filters. The computed PFC reflects, for each phoneme, the frequency distribution of the information useful for identifying fraudulent speech. Next, to suppress the influence of the differences between phonemes on the recognition process, the PFC values of the phonemes are averaged, weighted by each phoneme's frame count, to obtain the General F-ratio (GF):

$$\mathrm{GF}^{l}=\sum_{k=1}^{P}\frac{N_{k}}{N}\,\mathrm{PFC}_{k}^{l}$$
where P is the total number of phonemes, $N_{k}=\sum_{t=1}^{T}N_{tk}$ is the total number of frames of phoneme k, and N is the total number of frames over all phonemes:

$$N=\sum_{k=1}^{P}N_{k}$$
Normalizing the computed GF yields the Phoneme Effect Suppressed Spoof Detection Information Distribution (PESSDID):

$$\mathrm{PESSDID}^{l}=\frac{\mathrm{GF}^{l}}{\sum_{i=1}^{L}\mathrm{GF}^{i}}$$
The higher the PESSDID value of filter l, the more information useful for identifying spoofing attacks is carried by that filter's frequency band.
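A minimal NumPy sketch of the PF, PFC, GF, PESSDID chain as formulated above; the nested `data[t][k]` container, holding an (N_tk, L) array per class t and phoneme k, is an assumed layout.

```python
# Illustrative computation of PESSDID from per-class, per-phoneme frame data.
import numpy as np

def pessdid(data):
    """data: list of length T (t = 0 real, t = 1 fraudulent), each a dict
    mapping phoneme -> (N_tk, L) array of per-frame filter values."""
    T = len(data)
    pfc_rows, weights = [], []
    for k in data[0]:
        means = np.stack([data[t][k].mean(axis=0) for t in range(T)])  # u_bar_tk
        grand = means.mean(axis=0)                                     # u_k
        between = ((means - grand) ** 2).mean(axis=0)
        within = np.stack([data[t][k].var(axis=0) for t in range(T)]).mean(axis=0)
        pf = between / (within + 1e-12)                                # PF_k
        pfc_rows.append(pf / pf.sum())                                 # PFC_k
        weights.append(sum(len(data[t][k]) for t in range(T)))         # N_k
    pfc = np.stack(pfc_rows)
    w = np.asarray(weights, dtype=float)
    gf = (w[:, None] / w.sum() * pfc).sum(axis=0)                      # GF
    return gf / gf.sum()                                               # PESSDID
```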
Step 3, feature extraction:

Based on the distribution obtained in the previous step, the filterbank of the new feature proposed by the invention is designed: the number of filters is moderately increased in the bands carrying more information and moderately reduced in the bands carrying less; the adjusted filterbank is shown in Figure 2. Apart from the filter distribution, the rest of the feature extraction is the same as in traditional methods: before the filterbank, the signal undergoes pre-emphasis, framing, and windowing, followed by the short-time Fourier transform to obtain the spectral features of each frame; the filterbank is then applied to the spectral features, and a DCT yields the final feature.
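The sketch below illustrates the described pipeline with a triangular filterbank whose center frequencies would be placed more densely in high-PESSDID bands. The center-frequency input (which must include the lower and upper boundary points, mel-filterbank style), the triangular filter shape, the FFT size, and the cepstral count are illustrative assumptions.

```python
# Illustrative feature extraction: pre-emphasis, framing, windowing, STFT,
# non-uniform filterbank, log, DCT (parameters and filter shape assumed).
import numpy as np
from scipy.fftpack import dct

def extract_feature(signal, centers_hz, sr=16000, frame_len=400, hop=160,
                    n_fft=512, n_ceps=20, preemph=0.97):
    signal = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2   # (n_frames, n_fft//2+1)

    # Triangular filters on the given centers (denser where PESSDID is high).
    bins = np.round(np.asarray(centers_hz) / sr * n_fft).astype(int)
    fbank = np.zeros((len(bins) - 2, n_fft // 2 + 1))
    for m in range(1, len(bins) - 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge

    log_energies = np.log(power @ fbank.T + 1e-10)
    return dct(log_energies, type=2, axis=1, norm='ortho')[:, :n_ceps]
```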
Step 4, model training:

When training the models, the speech in the training set no longer needs phoneme annotation; the raw audio is processed directly with the new feature extraction method. According to the real/fraudulent labels of the training audio, the resulting features are used to train a GMM model for real speech and a GMM model for fraudulent speech, respectively.
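A sketch of this training step using scikit-learn's GaussianMixture; the library choice and the mixture size (512 components is common in GMM anti-spoofing baselines) are assumptions, as the patent does not specify them.

```python
# Illustrative GMM training for the two speech classes (library assumed).
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmms(real_feats, fraud_feats, n_components=512):
    """Each argument: list of (n_frames, n_ceps) feature arrays.
    Returns (gmm_real, gmm_fraud)."""
    gmm_real = GaussianMixture(n_components, covariance_type='diag',
                               max_iter=100).fit(np.vstack(real_feats))
    gmm_fraud = GaussianMixture(n_components, covariance_type='diag',
                                max_iter=100).fit(np.vstack(fraud_feats))
    return gmm_real, gmm_fraud
```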
Step 5, score confirmation:

After the audio under test has been framed and otherwise processed and the new feature extracted, the feature is fed into the real-speech and fraudulent-speech GMM models for scoring. Specifically, each frame of an utterance is input to a GMM model in turn to obtain a similarity score, and the scores of all frames are averaged to give the utterance's score under that model. Finally, the likelihood-ratio method compares the utterance's scores under the two models to produce the final result.
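Under the same assumptions, the scoring step could look like the sketch below: per-frame log-likelihoods are averaged per model, and the difference of the two averages is the likelihood-ratio score; the decision threshold is an assumption.

```python
# Illustrative utterance scoring and decision (threshold assumed).
def score_utterance(feats, gmm_real, gmm_fraud):
    """feats: (n_frames, n_ceps). Positive scores favor real speech."""
    # score_samples gives per-frame log-likelihoods; averaging them gives
    # the utterance-level score under each model.
    s_real = gmm_real.score_samples(feats).mean()
    s_fraud = gmm_fraud.score_samples(feats).mean()
    return s_real - s_fraud                        # log-likelihood ratio

def decide(feats, gmm_real, gmm_fraud, threshold=0.0):
    score = score_utterance(feats, gmm_real, gmm_fraud)
    return "real" if score > threshold else "fraudulent"
```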
The experimental results are evaluated with the equal error rate (EER), i.e., the error rate at the operating point where the false acceptance rate (FAR) equals the false rejection rate (FRR).
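For completeness, a minimal sketch of EER computation from utterance-level scores; sweeping the threshold over the pooled scores is one common way to locate the point where FAR and FRR cross.

```python
# Illustrative EER computation from two NumPy arrays of scores.
import numpy as np

def compute_eer(real_scores, fraud_scores):
    thresholds = np.sort(np.concatenate([real_scores, fraud_scores]))
    far = np.array([(fraud_scores >= t).mean() for t in thresholds])  # false accepts
    frr = np.array([(real_scores < t).mean() for t in thresholds])    # false rejects
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2
```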
The above are merely preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (4)
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010572748.4A | 2020-06-22 | 2020-06-22 | A synthetic speech detection method based on phoneme-level analysis to suppress the influence of phonemes |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN111816203A | 2020-10-23 |
Patent Citations (11)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US4829572A (en) * | 1987-11-05 | 1989-05-09 | Andrew Ho Chung | Speech recognition system |
| JPH02252000A (en) * | 1989-03-27 | 1990-10-09 | Nippon Telegr & Teleph Corp <Ntt> | Formation of waveform element |
| GB9709696D0 (en) * | 1996-05-15 | 1997-07-02 | Atr Intrepreting Telecommunica | Speech synthesizer apparatus |
| US6366883B1 (en) * | 1996-05-15 | 2002-04-02 | Atr Interpreting Telecommunications | Concatenation of speech segments by use of a speech synthesizer |
| US6173263B1 (en) * | 1998-08-31 | 2001-01-09 | At&T Corp. | Method and system for performing concatenative speech synthesis using half-phonemes |
| AU2001285721A1 (en) * | 2000-09-06 | 2002-03-22 | Pharmexa A/S | Method for down-regulating IgE |
| KR20030081537A (en) * | 2002-04-11 | 2003-10-22 | 주식회사 언어과학 | System and Method for detecting error type by phoneme, and System and method using the same |
| GB0219870D0 (en) * | 2002-08-27 | 2002-10-02 | 20 20 Speech Ltd | Speech synthesis apparatus and method |
| CN101930733A (en) * | 2010-09-03 | 2010-12-29 | 中国科学院声学研究所 | A Speech Emotion Feature Extraction Method for Speech Emotion Recognition |
| CN107680602A (en) * | 2017-08-24 | 2018-02-09 | 平安科技(深圳)有限公司 | Voice fraud recognition methods, device, terminal device and storage medium |
| CN109448759A (en) * | 2018-12-28 | 2019-03-08 | 武汉大学 | A kind of anti-voice authentication spoofing attack detection method based on gas explosion sound |
Non-Patent Citations (4)
| Title |
|---|
| Gajan Suthokumar et al., "Phoneme Specific Modelling and Scoring Techniques for Anti Spoofing System," 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6106-6110 |
| Zhang Jian, Xu Jie, Bao Xiuguo, Zhou Ruohua, Yan Yonghong, "Weighted phoneme log-likelihood-ratio features for language identification," Journal of Tsinghua University (Science and Technology), no. 10 |
| Xuan Chengjun, "Speaker feature extraction suppressing the influence of phonemes based on speech frequency characteristics," China Doctoral Dissertations Full-text Database, Information Science and Technology (monthly), pp. 38-51 |
| Chen Xiaopeng, Peng Yaxiong, He Song, "Research on the time-varying robustness of PLDA-based speaker recognition," Microcomputer & Its Applications, no. 05 |
Cited By (12)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112349267A (en) * | 2020-10-28 | 2021-02-09 | 天津大学 | Synthesized voice detection method based on attention mechanism characteristics |
| CN112349267B (en) * | 2020-10-28 | 2023-03-21 | 天津大学 | Synthesized voice detection method based on attention mechanism characteristics |
| CN114822587A (en) * | 2021-01-19 | 2022-07-29 | 四川大学 | An Audio Feature Compression Method Based on Constant Q Transform |
| CN114822587B (en) * | 2021-01-19 | 2023-07-14 | 四川大学 | An Audio Feature Compression Method Based on Constant Q Transform |
| CN113257255A (en) * | 2021-07-06 | 2021-08-13 | 北京远鉴信息技术有限公司 | Method and device for identifying forged voice, electronic equipment and storage medium |
| CN113362814A (en) * | 2021-08-09 | 2021-09-07 | 中国科学院自动化研究所 | Voice identification model compression method fusing combined model information |
| CN113362814B (en) * | 2021-08-09 | 2021-11-09 | 中国科学院自动化研究所 | Voice identification model compression method fusing combined model information |
| CN113488074A (en) * | 2021-08-20 | 2021-10-08 | 四川大学 | Long-time variable Q time-frequency conversion algorithm of audio signal and application thereof |
| CN113488074B (en) * | 2021-08-20 | 2023-06-23 | 四川大学 | A 2D Time-Frequency Feature Generation Method for Detecting Synthetic Speech |
| CN114550704A (en) * | 2022-01-26 | 2022-05-27 | 浙江大学 | Method and system for training voice confrontation sample recognition model |
| CN114550704B (en) * | 2022-01-26 | 2024-11-19 | 浙江大学 | A speech adversarial sample recognition model training method and system |
| US20250006205A1 (en) * | 2023-06-28 | 2025-01-02 | University Of Florida Research Foundation, Incorporated | Detecting deepfake audio using turbulence |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | Application publication date: 2020-10-23 |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | |