CN106448684A

CN106448684A - Deep-belief-network-characteristic-vector-based channel-robust voiceprint recognition system

Info

Publication number: CN106448684A
Application number: CN201611006202.2A
Authority: CN
Inventors: 邹月娴; 王迪松; 黄艺驰; 柳俊宏
Original assignee: Peking University Shenzhen Graduate School
Current assignee: Peking University Shenzhen Graduate School
Priority date: 2016-11-16
Filing date: 2016-11-16
Publication date: 2017-02-22

Abstract

The invention belongs to the field of speech signal processing and machine learning, and relates to a channel robust voiceprint recognition system based on a deep belief network feature vector, which consists of a speech collection and preprocessing module, an original spectrum feature extraction module, a deep belief network training module, and a speaker. It consists of a voiceprint feature vector extraction module, a speaker acoustic model generation module and a speaker identification module. Supervisedly train a deep belief network with speech data from different channels and the corresponding speaker ID numbers, and propose a discriminant ratio to select the hidden layer output of the deep belief network with the best class discrimination to construct the speech Voiceprint feature vector, which is channel robust. Compared with the traditional speaker confirmation system based on i‑vector, this system has a higher accuracy of voiceprint recognition in the case of channel mismatch.

Description

Channel Robust Voiceprint Recognition System Based on Deep Belief Network Feature Vector

技术领域technical field

本发明涉及一个基于深度置信网络特征矢量的信道鲁棒声纹识别系统，属于人机语音交互技术领域。The invention relates to a channel robust voiceprint recognition system based on deep belief network feature vectors, and belongs to the technical field of human-computer voice interaction.

背景技术Background technique

声纹识别技术属于生物验证技术的一种，采用语音对说话人身份进行验证，即确认某段语音是否是指定的某个人说的。这种技术具有较好的便捷性和安全性，在银行、社保、公安、智能家居、移动支付等领域都有巨大应用前景。但在实际应用中，传统的声纹识别系统面临着信道失配的问题，即说话人注册和测试时使用不同的移动设备，导致声纹识别系统的性能下降，识别准确率下降。因此，为解决移动设备环境下的信道失配问题，本发明提出基于深度置信网络特征矢量的信道鲁棒声纹识别系统。Voiceprint recognition technology is a kind of biological verification technology, which uses voice to verify the identity of the speaker, that is, to confirm whether a certain voice is spoken by a designated person. This technology has good convenience and security, and has great application prospects in banking, social security, public security, smart home, mobile payment and other fields. However, in practical applications, the traditional voiceprint recognition system faces the problem of channel mismatch, that is, different mobile devices are used for speaker registration and testing, resulting in a decline in the performance of the voiceprint recognition system and a decline in recognition accuracy. Therefore, in order to solve the problem of channel mismatch in the mobile device environment, the present invention proposes a channel robust voiceprint recognition system based on deep belief network feature vectors.

本发明采用深度置信网络(DBN)提取说话人特征。现存的很多声纹识别系统仍然采用着语音识别中的特征如MFCC特征、PLP特征等，这些底层声学特征中主要的信息是发音文本特征，说话人信息很容易受到文本信息、信道和噪声信息的干扰，这些特征不能很好的体现说话人的特点，同时在信道失配条件下，系统的识别性能下降，从而制约了声纹识别技术的应用。信道失配指的是训练与测试时采集语音的信道不同，围绕这一问题，Kenny提出了联合因子分析(Joint Factor Analysis，JFA)技术为信道失配环境下的声纹识别研究开辟了新思路，其主要思想是将说话人高斯均值超矢量所在空间划分为三个组成部分：本征信道(Eigenchannel)空间、本征音(Eigenvoice)空间和残差(Diagonal Residual)空间，通过移除说话人均值超矢量在本征信道空间的影响，来达到抗信道失配的目的。然而，在各种信道下训练数据不均衡时，JFA技术存在明显不足。之后Dehak提出基于i-vector技术，这一建模方法的动机来源于JFA建模后的信道因子不仅包含了信道效应也夹杂了说话人信息。I-vector方法采用一个全局差异空间(Total Variability Space)来代替这两个空间，它既包含了说话人之间的差异也包含了信道之间的差异。The present invention uses a deep belief network (DBN) to extract speaker features. Many existing voiceprint recognition systems still use features in speech recognition such as MFCC features, PLP features, etc. The main information in these underlying acoustic features is pronunciation text features, and speaker information is easily affected by text information, channel and noise information. Interference, these features cannot reflect the characteristics of the speaker well, and at the same time, under the condition of channel mismatch, the recognition performance of the system decreases, which restricts the application of voiceprint recognition technology. Channel mismatch refers to the fact that the channels used to collect speech during training and testing are different. Around this problem, Kenny proposed the Joint Factor Analysis (JFA) technology, which opened up new ideas for the research of voiceprint recognition under the environment of channel mismatch. , the main idea is to divide the space where the Gaussian mean supervector of the speaker is located into three components: Eigenchannel (Eigenchannel) space, Eigenvoice (Eigenvoice) space and Diagonal Residual (Diagonal Residual) space, by removing the speaker The influence of the per capita supervector in the eigenchannel space is used to achieve the purpose of resisting channel mismatch. However, when the training data is unbalanced under various channels, the JFA technique has obvious deficiencies. Later, Dehak proposed based on i-vector technology. The motivation of this modeling method comes from the channel factor after JFA modeling not only includes channel effects but also speaker information. The I-vector method uses a global difference space (Total Variability Space) to replace these two spaces, which includes both the difference between speakers and the difference between channels.

基于i-vector技术的声纹识别系统能较好的反映说话人特性，是声纹识别的主流技术之一，但其在信道失配条件下性能一般。深度学习作为近几年新兴的机器学习技术，在多种特定的模式识别任务上取得了显著的效果。深度神经网络的一个常见应用是特征提取，相比传统手工提取的特征，深度神经网络提取的特征能更好地表征高层次抽象信息。深度置信网络(Deep Belief Network,DBN)由Geoffrey Hinton在2006年提出，是一种生成模型，通过训练神经元之间的权重，可让整个神经网络按照最大概率来生成训练数据。深度置信网络(DBN)由多层受限玻尔兹曼机(RBMs)堆叠而成。通常，深度置信网络(DBN)主要用于对一维数据的建模比较有效，例如语音。The voiceprint recognition system based on i-vector technology can better reflect the characteristics of the speaker and is one of the mainstream voiceprint recognition technologies, but its performance is average under the condition of channel mismatch. As an emerging machine learning technology in recent years, deep learning has achieved remarkable results in a variety of specific pattern recognition tasks. A common application of deep neural networks is feature extraction. Compared with traditional hand-extracted features, the features extracted by deep neural networks can better represent high-level abstract information. Deep Belief Network (DBN) was proposed by Geoffrey Hinton in 2006. It is a generative model that allows the entire neural network to generate training data with maximum probability by training the weights between neurons. Deep Belief Networks (DBNs) are stacked by multiple layers of Restricted Boltzmann Machines (RBMs). Usually, Deep Belief Network (DBN) is mainly used to model one-dimensional data more effectively, such as speech.

受深度学习中深度置信网络在语音识别成功应用的启发，本发明通过利用大量不同信道的语音数据和相对应的说话人身份编号对深度置信网络(DBN)进行有监督的训练，通过训练好的深度置信网络(DBN)对说话人语音特征进行提取。为了测量神经网络不同隐含层输出的区分度，提出了一个判别比值来选择区分度最好的输出用于构成信道鲁棒的说话人特征矢量。同时采用3个中文语音数据库验证了相比传统的i-vector系统，基于深度置信信念网络特征矢量的信道鲁棒声纹识别系统具有更强的信道鲁棒特性。Inspired by the successful application of deep belief network in speech recognition in deep learning, the present invention carries out supervised training to deep belief network (DBN) by using a large number of speech data of different channels and corresponding speaker ID numbers, and through the trained Deep Belief Network (DBN) extracts the speaker's speech features. In order to measure the discriminative degree of output of different hidden layers of neural network, a discriminant ratio is proposed to select the output with the best discriminative degree to form a channel-robust speaker feature vector. At the same time, three Chinese speech databases are used to verify that the channel robust voiceprint recognition system based on the deep belief network feature vector has stronger channel robustness than the traditional i-vector system.

发明内容Contents of the invention

基于对上述现有技术的分析，本发明的目的是基于中文的面向移动设备的声纹识别，构造在实际应用中基于深度学习的、对信道鲁棒的面向移动设备的声纹识别系统，本系统采用深度置信网络(DBN)提取说话人语音特征，并提出了一个判别比值R_p用于测量神经网络不同隐含层输出的区分度并选择区分度最好的特征，从而提高了声纹识别系统的信道鲁棒性。系统包括如下模块：Based on the analysis of the above-mentioned prior art, the purpose of the present invention is based on voiceprint recognition for mobile devices based on Chinese, and to construct a voiceprint recognition system for mobile devices that is based on deep learning and is robust to channels in practical applications. The system uses the Deep Belief Network (DBN) to extract the speaker's speech features, and proposes a discriminant ratio R _p to measure the discrimination of the output of different hidden layers of the neural network and select the feature with the best discrimination, thereby improving the voiceprint recognition. The channel robustness of the system. The system includes the following modules:

语音采集及预处理模块，用于采集所述说话人的语音信号，并对语音信号进行预处理；Voice collection and preprocessing module, for collecting the voice signal of the speaker, and preprocessing the voice signal;

原始谱特征提取模块，用于对预处理后的语音进行原始谱特征MFCC提取；The original spectral feature extraction module is used to extract the original spectral feature MFCC to the preprocessed speech;

深度置信网络训练模块，用于有监督训练一个信道鲁棒的特征矢量提取器；Deep belief network training module for supervised training of a channel-robust feature vector extractor;

说话人声纹特征矢量提取模块，利用所述训练好的深度置信网络进行信道鲁棒的说话人声纹特征矢量提取；The speaker's voiceprint feature vector extraction module uses the trained deep belief network to extract the channel robust speaker's voiceprint feature vector;

说话人声学模型生成模块，根据提取的所述说话人声纹特征矢量，对所述说话人进行声学建模；The speaker acoustic model generating module performs acoustic modeling on the speaker according to the extracted speaker voiceprint feature vector;

说话人身份鉴定模块，将待测试说话人的所述声学模型与注册说话人的所述声学模型进行比较评分，确定待测试说话人的身份。The speaker identity identification module compares and scores the acoustic model of the speaker to be tested with the acoustic model of the registered speaker to determine the identity of the speaker to be tested.

进一步地，所述语音采集及预处理模块用于对采集的语音信号进行放大、增益控制、滤波及采样等预处理。Further, the voice collection and preprocessing module is used to perform preprocessing such as amplification, gain control, filtering and sampling on the collected voice signal.

进一步地，所述原始谱特征提取模块包括：对预处理后的语音进行分帧、预加重、加窗、快速傅里叶变换，最后进行梅尔倒谱系数MFCC的提取。Further, the original spectral feature extraction module includes: performing framing, pre-emphasis, windowing, and fast Fourier transform on the preprocessed speech, and finally extracting Mel cepstral coefficients MFCC.

进一步地，所述深度置信网络训练模块，以通过大量不同信道下的语料提取出的MFCC特征作为输入，以相应的说话人身份编号(ID)作为输出，对深度置信网络进行有监督的训练，并保存训练好的深度置信网络各层参数。Further, the deep belief network training module takes the MFCC features extracted from the corpus under a large number of different channels as input, and uses the corresponding speaker ID number (ID) as an output to carry out supervised training to the deep belief network, And save the parameters of each layer of the trained deep belief network.

进一步地，所述说话人声纹特征矢量提取模块，将深度置信网络看做一个特征矢量提取器，以MFCC作为深度置信网络的输入，深度置信网络的隐含层输出可以看成是对原始MFCC特征的高层表示(深度特征)，这些特征矢量具有信道鲁棒的特点。Further, the speaker voiceprint feature vector extraction module regards the depth belief network as a feature vector extractor, uses MFCC as the input of the depth belief network, and the hidden layer output of the depth belief network can be regarded as the original MFCC High-level representations of features (deep features), these feature vectors are channel-robust.

进一步地，提出了一种神经网络不同隐含层所提取深度特征的区分度测量方法，定义判别比值R_p＝det(S_bp)/det(S_wp)，作为深度特征区分度的度量，其中S_bp是训练数据类间散度矩阵，S_wp是训练数据类内散度矩阵，其定义如下：Furthermore, a method for measuring the discrimination of deep features extracted by different hidden layers of the neural network is proposed, and the discriminant ratio R _p =det(S _bp )/det(S _wp ) is defined as the measure of the discrimination of deep features, where S _bp is the inter-class scatter matrix of training data, and S _wp is the intra-class scatter matrix of training data, which is defined as follows:

其中s_mj是MFCC特征，f_p(·)是深度置信网络对MFCC输入到第p个隐含层输出的映射，G_pm是训练数据类均值向量，G_p是所有训练数据的均值向量，数学表示如下：where s _mj is the MFCC feature, f _p ( ) is the mapping of the deep belief network from the input of the MFCC to the output of the p-th hidden layer, G _pm is the mean vector of the training data class, G _p is the mean vector of all training data, and the mathematical Expressed as follows:

类间距离大和类内距离最小有利于所提取的特征矢量的可区分性。因此，判别比值R_p最大的隐含层特征矢量最具区分性，即确定满足k＝argmax_p R_p的隐含层的输出作为最佳深度特征。利用所述说话人的深度置信网络第k层深度特征f_k(s_mj)，则可以得到特征矢量k^th-DBN-vector，其定义为The large inter-class distance and the smallest intra-class distance are beneficial to the distinguishability of the extracted feature vectors. Therefore, the hidden layer feature vector with the largest discrimination ratio R _p is the most discriminative, that is, the output of the hidden layer satisfying k=argmax _p R _p is determined as the best deep feature. Using the k-th layer depth feature f _k (s _mj ) of the speaker's deep belief network, the feature vector k ^th -DBN-vector can be obtained, which is defined as

其中m是说话人身份编号，c_m是每句话提取出MFCC的帧长，N_p是深度置信网络第p个隐含层的维数。Among them, m is the speaker's ID number, c _m is the frame length of MFCC extracted from each sentence, and N _p is the dimension of the pth hidden layer of the deep belief network.

进一步地，所述说话人声学建模模块，利用所述说话人的特征矢量k^th-DBN-vector进行概率线性判别分析(PLDA)建模，并保存PLDA模型参数。Further, the speaker acoustic modeling module uses the speaker's feature vector k ^th -DBN-vector to perform probabilistic linear discriminant analysis (PLDA) modeling, and saves PLDA model parameters.

进一步地，所述说话人身份鉴定模块，根据训练好的深度置信网络，可以首先提取出注册人和测试人的k^th-DBN-vector。然后基于训练好的PLDA模型，得到对数似然比值得分s，最后将得分与给定的阈值s₀进行比较，若s≥s₀，则认为测试人是注册人，否则不是。Further, the speaker identification module can first extract the k ^th -DBN-vector of the registrant and the tester according to the trained deep belief network. Then based on the trained PLDA model, the log likelihood ratio score s is obtained, and finally the score is compared with the given threshold s ₀ , if s≥s ₀ , the tester is considered to be the registrant, otherwise not.

本发明的有益效果在于：随着移动设备的普及，用户会在不同的移动设备间利用声纹识别进行身份验证，这就带来了注册语音和测试语音的信道失配问题，而传统的基于i-vector技术的声纹识别系统在信道失配情况下系统性能一般。深度置信网络作为一种深度网络，具有很强的学习能力，在语音识别等领域具有广泛的应用。通过利用大量不同信道的语音数据对深度置信网络进行训练，训练好的深度置信网络可以提取出对信道鲁棒的说话人特征，从而减小信道失配的影响。因此，基于深度置信网络(DBN)的声纹识别系统能对信道失配有较好的鲁棒性，能够跨设备、跨平台部署，在保证系统验证准确性的同时，为用户在不同移动设备使用声纹识别服务提供便利。The beneficial effect of the present invention is that: with the popularization of mobile devices, users will use voiceprint recognition for identity verification between different mobile devices, which brings the problem of channel mismatch between registration voice and test voice, while the traditional The voiceprint recognition system of i-vector technology has average system performance in the case of channel mismatch. As a deep network, deep belief network has strong learning ability and has a wide range of applications in fields such as speech recognition. By using a large number of speech data of different channels to train the deep belief network, the trained deep belief network can extract speaker features that are robust to the channel, thereby reducing the impact of channel mismatch. Therefore, the voiceprint recognition system based on Deep Belief Network (DBN) can have better robustness to channel mismatch, and can be deployed across devices and platforms. It is convenient to use the voiceprint recognition service.

附图说明Description of drawings

图1是本发明实施例所述的基于深度置信网络特征矢量的信道鲁棒声纹识别系统的结构示意图；Fig. 1 is a schematic structural diagram of a channel robust voiceprint recognition system based on deep belief network feature vectors according to an embodiment of the present invention;

图2.是本发明实施例所述的深度置信网络(DBN)结构示意图。Fig. 2 is a schematic diagram of the structure of the deep belief network (DBN) described in the embodiment of the present invention.

具体实施方式detailed description

以下结合附图对本发明的优选实施例进行说明，应当理解，此处所描述的优选实施例仅用于说明和解释本发明，并不用于限定本发明。The preferred embodiments of the present invention will be described below in conjunction with the accompanying drawings. It should be understood that the preferred embodiments described here are only used to illustrate and explain the present invention, and are not intended to limit the present invention.

本发明所述的基于深度置信网络特征矢量的信道鲁棒声纹识别系统，考虑利用大量不同信道的语料和对应的说话人身份编号(ID)，对深度置信网络进行有监督的训练，因此利用训练好的深度置信网络提取出的特征矢量具有对信道鲁棒的特点，从而提高声纹识别系统在信道失配情况下的准确率。具体步骤如下，并且结合附图1的本发明系统的结构示意图：The channel robust voiceprint recognition system based on the deep belief network feature vector of the present invention considers the use of a large number of corpus of different channels and the corresponding speaker ID number (ID) to carry out supervised training on the deep belief network, so using The feature vector extracted by the trained deep belief network is robust to the channel, thus improving the accuracy of the voiceprint recognition system in the case of channel mismatch. Concrete steps are as follows, and in conjunction with the structural representation of the system of the present invention of accompanying drawing 1:

S01：语音采集及预处理模块；S01: Speech collection and preprocessing module;

首先获取语音数据，并对语音信号进行放大、增益控制、滤波及采样等预处理。Firstly, the voice data is obtained, and the voice signal is preprocessed such as amplification, gain control, filtering and sampling.

S02：原始谱特征提取模块；S02: Original spectrum feature extraction module;

其中包括对预处理后的语音进行分帧、预加重、加窗、快速傅里叶变换，最后进行梅尔倒谱系数MFCC的提取。It includes framing, pre-emphasis, windowing, fast Fourier transform of the preprocessed speech, and finally the extraction of Mel cepstral coefficient MFCC.

S03：深度置信网络训练模块；S03: Deep belief network training module;

假设说话人每句话提取出的MFCC特征表示为m(1≤m≤M)是说话人身份编号，L是每帧MFCC的长度，c_m是帧数。以MFCC作为输入，对应的说话人身份编号作为输出，可以利用训练数据{s_mj,m,j＝1,2,…,c_m,m＝1,2,…,M}对深度置信网络进行有监督的训练，并保存训练好的深度置信网络各层参数。其中深度置信网络的结构示意图如附图2所示。Assume that the MFCC feature extracted from each sentence of the speaker is expressed as m (1≤m≤M) is the ID number of the speaker, L is the length of each frame of MFCC, and c _m is the number of frames. With MFCC as input and the corresponding speaker ID as output, the training data {s _mj ,m,j=1,2,…,c _m ,m=1,2,…,M} can be used to conduct deep belief network Supervised training, and save the parameters of each layer of the trained deep belief network. The structural diagram of the deep belief network is shown in Figure 2.

S04：说话人声纹特征矢量提取模块；S04: Speaker voiceprint feature vector extraction module;

将深度置信网络看做一个特征矢量提取器，以MFCC作为深度置信网络的输入，深度置信网络的隐含层输出可以看成是对原始MFCC特征的高层表示(特征矢量)。定义函数为深度置信网络从输入到第p个隐含层输出的映射，则可以得到特深度特征：{f_p(s_mj),p＝1,2,…,P}。为了测量神经网络不同隐含层所提取特征矢量的区分度，定义判别比值R_p＝det(S_bp)/det(S_wp)，作为深度特征区分度的度量，其中S_bp是训练数据类间散度矩阵，S_wp是训练数据类内散度矩阵，其定义如下：The deep belief network is regarded as a feature vector extractor, and MFCC is used as the input of the deep belief network. The hidden layer output of the deep belief network can be regarded as a high-level representation (feature vector) of the original MFCC features. define function is the mapping from the input of the deep belief network to the output of the p-th hidden layer, then the extra-depth features can be obtained: {f _p (s _mj ),p=1,2,…,P}. In order to measure the discrimination of the feature vectors extracted by different hidden layers of the neural network, the discriminant ratio R _p =det(S _bp )/det(S _wp ) is defined as the measure of the depth feature discrimination, where S _bp is the training data between classes The scatter matrix, S _wp is the scatter matrix within the training data class, which is defined as follows:

类间距离大和类内距离最小有利于所提取的特征矢量的可区分性。因此，判别比值R_p最大的隐含层特征矢量最具区分性，即确定满足k＝argmax_p R_p的隐含层的输出作为最佳深度特征。The large inter-class distance and the smallest intra-class distance are beneficial to the distinguishability of the extracted feature vectors. Therefore, the hidden layer feature vector with the largest discrimination ratio R _p is the most discriminative, that is, the output of the hidden layer satisfying k=argmax _p R _p is determined as the best deep feature.

S05：说话人声学模型生成模块；S05: speaker acoustic model generation module;

利用所述说话人的深度置信网络第k层深度特征f_k(s_mj)，则可以得到特征矢量k^th-DBN-vector，其定义为Using the k-th layer depth feature f _k (s _mj ) of the speaker's deep belief network, the feature vector k ^th -DBN-vector can be obtained, which is defined as

其中m是说话人身份编号，c_m是每句话提取出MFCC的帧长，N_p是深度置信网络第p个隐含层的维数。最后利用特征矢量k^th-DBN-vector进行概率线性判别分析(PLDA)建模，并保存PLDA模型参数。Among them, m is the speaker's ID number, c _m is the frame length of MFCC extracted from each sentence, and N _p is the dimension of the pth hidden layer of the deep belief network. Finally, the feature vector k ^th -DBN-vector is used for probabilistic linear discriminant analysis (PLDA) modeling, and the PLDA model parameters are saved.

S06：说话人身份鉴定模块；S06: speaker identification module;

具体步骤为：(1)首先对注册人的语音进行采集及预处理，并提取原始谱MFCC特征，再利用训练好的深度置信网络提取出注册人的特征矢量k^th-DBN-vector；(2)对测试人的语音进行采集及预处理，并提取原始谱MFCC特征，再利用训练好的深度置信网络提取出注册人的特征矢量k^th-DBN-vector；(3)利用注册人和说话人的k^th-DBN-vector，基于训练好的PLDA模型可以得到对数似然比值得分s，最后将得分与给定的阈值s₀进行比较，若s≥s₀，则认为测试人是注册人，否则不是。The specific steps are: (1) first collect and preprocess the voice of the registrant, and extract the original spectrum MFCC features, and then use the trained deep belief network to extract the registrant’s feature vector k ^th -DBN-vector; (2 ) Collect and preprocess the voice of the tester, and extract the original spectrum MFCC features, and then use the trained deep belief network to extract the feature vector k ^th -DBN-vector of the registrant; (3) use the registrant and the speaker k ^th -DBN-vector, based on the trained PLDA model, the log likelihood ratio score s can be obtained, and finally the score is compared with the given threshold s ₀ , if s≥s ₀ , the tester is considered to be registered people, otherwise not.

表一选用数据库详细信息Table 1 Selected database details

表二数据库分配Table 2 Database allocation

表三实验参数设置Table 3 Experimental parameter settings

在实际实验过程中，首先选用实验数据库，数据库均为中文语料，其详细信息如表一所示。其中MTDSR2015数据库由北京大学现代信号与数据处理实验室录制，THCHS-30数据库由清华大学录制，King-ASR-L-018数据库由海天瑞声公司发布。In the actual experiment process, the experimental database is selected first, and the database is all Chinese corpus, and its detailed information is shown in Table 1. Among them, the MTDSR2015 database was recorded by the Modern Signal and Data Processing Laboratory of Peking University, the THCHS-30 database was recorded by Tsinghua University, and the King-ASR-L-018 database was released by Haitian AAC.

实验中对上述数据库的分配如表二所示，其中bkg数据用于全局背景模型(UBM)、全局差异矩阵T，PLDA模型的训练，bkg数据和dev中的Part I数据用于深度置信网络的训练，dev中的Part II数据用于注册，eva数据用于测试。The distribution of the above database in the experiment is shown in Table 2, where the bkg data is used for the global background model (UBM), the global difference matrix T, and the training of the PLDA model, and the bkg data and the Part I data in dev are used for the deep belief network. For training, Part II data in dev is used for registration, and eva data is used for testing.

然后设置实验参数，如表三所示，本发明的基于深度置信网络特征矢量的信道鲁棒声纹识别系统为DBN-vector，基线算法选用的是i-vector，算法性能评价指标是等错误率(EER)和最小检测代价函数(minDCF)。Then set the experimental parameters, as shown in Table 3, the channel robust voiceprint recognition system based on the deep belief network feature vector of the present invention is DBN-vector, the baseline algorithm is i-vector, and the algorithm performance evaluation index is equal error rate (EER) and the minimum detection cost function (minDCF).

最终的实验结果和分析：Final experimental results and analysis:

利用bkg数据和dev中Part I数据，可以训练得到一个深度置信网络。通过对深度置信网络各个隐含层输出进行分析，可以得到各个隐含层的判别比值，如表四所示。从表四种可以发现，深度置信网络第4个隐含层的判别比值最大，说明第四个隐含层的深度特征f₄(s_mj)具有最好的区分性，从而选取f₄(s_mj)最为最佳深度特征。Using bkg data and Part I data in dev, a deep belief network can be trained. By analyzing the output of each hidden layer of the deep belief network, the discriminant ratio of each hidden layer can be obtained, as shown in Table 4. From Table 4, it can be found that the discriminative ratio of the fourth hidden layer of the deep belief network is the largest, indicating that the deep feature f ₄ (s _mj ) of the fourth hidden layer has the best discrimination, so f ₄ (s mj ) is selected _mj ) is the best deep feature.

隐含层索引hidden layer index p＝1p=1 p＝2p=2 p＝3p=3 p＝4p=4 判别比值R_p Discriminant ratio R _p 0.470.47 0.720.72 1.031.03 1.34 1.34

表四深度置信网络不同隐含层的判别比值Table 4 Discriminant ratios of different hidden layers of deep belief network

表五不同信道失配情况下i-vector系统和4^th-DBN-vector系统性能比较Table 5 Performance comparison between i-vector system and 4 ^th -DBN-vector system under different channel mismatch conditions

考虑信道失配情况下的系统性能，表五给出了不同信道失配情况下我们发明的系统以及i-vector系统的性能表现，其中a代表HUAWEI mate7，b代表XM4，c代表SamsungNote3，d代表iPhone 5C。以a-b为例，a-b表示注册阶段使用HUAWEI mate7信道进行语音信号采集，测试阶段使用XM4信号进行语音信号采集。根据表四，我们选择深度置信网络第四个隐含层的特征矢量4^th-DBN-vector。从表五中可以看出，在每种信道失配情况下，基于深度置信网络特征矢量的信道鲁棒声纹识别系统(4^th-DBN-vector)不管从等错误率EER或者最小检测代价函数minDCF都要远小于传统的i-vector系统，且4^th-DBN-vector系统的等错误率EER均小于0.9％，最小检测代价函数均小于0.8，说明了基于深度置信网络特征矢量的信道鲁棒声纹识别系统在信道失配情况下对说话人身份鉴定的准确率要好于i-vector系统。Considering the system performance under the condition of channel mismatch, Table 5 shows the performance of the system we invented and the i-vector system under different channel mismatch conditions, where a represents HUAWEI mate7, b represents XM4, c represents Samsung Note3, and d represents iPhone 5C. Take ab as an example, ab indicates that the HUAWEI mate7 channel is used for voice signal collection during the registration phase, and the XM4 signal is used for voice signal collection during the test phase. According to Table IV, we choose the feature vector 4th- ^DBN -vector of the fourth hidden layer of the deep belief network. It can be seen from Table 5 that in each case of channel mismatch, the channel robust voiceprint recognition system based on the deep belief network feature vector (4 ^th -DBN-vector) does not matter from the equal error rate EER or the minimum detection cost function The minDCF is much smaller than the traditional i-vector system, and the equal error rate EER of the 4 ^th -DBN-vector system is less than 0.9%, and the minimum detection cost function is less than 0.8, which shows that the channel robustness based on the feature vector of the deep belief network In the case of channel mismatch, the voiceprint recognition system is more accurate than the i-vector system in identifying the speaker's identity.

以上所述，仅是本发明的较佳实施例而已，并非对本发明作任何形式上的限制，故凡是未脱离本发明技术方案内容，依据本发明的技术实质对以上实施例所作的任何修改、等同变化与修饰,均仍属于本发明技术方案的范围内。The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any form. Therefore, any modification, Equivalent changes and modifications all still belong to the scope of the technical solutions of the present invention.

Claims

1. A channel robust voiceprint recognition system based on deep belief network feature vectors to identify the identity of the speaker, characterized in that the system includes:

Voice collection and preprocessing module, for collecting the voice signal of the speaker, and preprocessing the voice signal;

The original spectral feature extraction module is used to extract the original spectral feature MFCC to the preprocessed speech;

Deep belief network training module for supervised training of a channel-robust feature vector extractor;

The speaker's voiceprint feature vector extraction module uses the trained deep belief network to extract the channel robust speaker's voiceprint feature vector;

The speaker acoustic model generating module performs acoustic modeling on the speaker according to the extracted speaker voiceprint feature vector;

The speaker identity identification module compares and scores the acoustic model of the speaker to be tested with the acoustic model of the registered speaker to determine the identity of the speaker to be tested.

2. The channel robust voiceprint recognition system based on the deep belief network feature vector according to claim 1, wherein the voice collection and preprocessing module is used to amplify, gain control, and filter the voice signal collected and sampling preprocessing.

3. The channel robust voiceprint recognition system based on deep belief network feature vectors according to claim 1, wherein the original spectrum feature extraction module comprises: framing, pre-emphasizing, Windowing, fast Fourier transform, and finally the extraction of Mel cepstral coefficient MFCC.

4. the channel robust voiceprint recognition system based on depth belief network feature vector according to claim 1, is characterized in that, described depth belief network training module, with the MFCC feature extracted by the corpus under a large amount of different channels as Input, with the corresponding speaker ID number (ID) as the output, conduct supervised training on the deep belief network, and save the parameters of each layer of the trained deep belief network.

5. The channel robust voiceprint recognition system based on the deep belief network feature vector according to claim 1, wherein the speaker voiceprint feature vector extraction module regards the deep belief network as a feature vector extractor , taking MFCC as the input of the deep belief network, the hidden layer output of the deep belief network can be regarded as a high-level representation of the original MFCC features (deep features), and these deep features have the characteristics of channel robustness.

6. The channel robust voiceprint recognition system based on deep belief network feature vectors according to claim 5, characterized in that, a method for measuring the degree of discrimination of the extracted depth features of different hidden layers of neural networks is proposed, and the definition of discrimination Ratio R _p =det(S _bp )/det(S _wp ), as a measure of depth feature discrimination, where S _bp is the inter-class scatter matrix of training data, and S _wp is the intra-class scatter matrix of training data, which is defined as follows :

where s _mj is the MFCC feature, f _p ( ) is the mapping of the deep belief network from the input of the MFCC to the output of the p-th hidden layer, G _pm is the mean vector of the training data class, G _p is the mean vector of all training data, and the mathematical Expressed as follows:

The large inter-class distance and the smallest intra-class distance are conducive to the distinguishability of the extracted feature vectors. Therefore, the hidden layer feature vector with the largest discriminant ratio R _p is the most discriminative, that is, the hidden layer feature vector that satisfies k=argmax p R p is determined to satisfy k=argmax _p R _p The output of the layer is used as the best depth feature, and using the depth feature f _k (s _mj ) of the kth layer of the speaker's deep belief network, the feature vector k ^th -DBN-vector can be obtained, which is defined as

Among them, m is the speaker's ID number, c _m is the frame length of MFCC extracted from each sentence, and N _p is the dimension of the pth hidden layer of the deep belief network.

7. The channel robust voiceprint recognition system based on deep belief network feature vectors according to claim 1, wherein the speaker acoustic model generation module utilizes the speaker's feature vector k ^th -DBN- vector performs probabilistic linear discriminant analysis (PLDA) modeling and saves the PLDA model parameters.

8. The channel robust voiceprint recognition system based on deep belief network feature vectors according to claim 1, wherein the speaker identification module can first extract the registrant according to the trained deep belief network and tester's k ^th -DBN-vector. Then based on the trained PLDA model, the log likelihood ratio score s is obtained, and finally the score is compared with the given threshold s ₀ , if s≥s ₀ , the tester is considered to be the registrant, otherwise not.