METHOD, APPARATUS, AND SYSTEM FOR SPEAKER VERIFICATION BASED ON ORTHOGONAL GAUSSIAN MIXTURE MODEL (GMM)
FIELD OF THE INVENTION
The present invention relates to the field of speaker recognition. More specifically, the present invention relates to a method, apparatus, and system for speaker verification based upon orthogonal Gaussian mixture model (GMM).
BACKGROUND OF THE INVENTION
The speech signal can convey various types of information at different levels. In particular, the speech signal not only conveys a message as a sequence of words, it also conveys speaker-specific information, for example, information about the identity of the speaker who produces the speech signal. In general, the field of speech recognition is concerned with extracting the underlying message conveyed in the speech signal, while the field of speaker recognition deals with extracting and verifying the identity of the speaker who generates the speech signal. Speaker recognition can be divided into two areas: speaker identification and speaker verification. In speaker identification, the task is to determine the identity of a speaker based upon a speech sample provided by that speaker. In speaker verification, the task is to verify whether a speaker is whom he or she claims to be, based upon a speech sample provided by that speaker. Accordingly, speaker verification involves a two-way classification or a binary test to determine whether the speaker's claim is correct or not. In addition, either speaker identification or speaker verification can be constrained to specific phrases or text, which is referred to as text-dependent, or unconstrained to any specific text, which is referred to as text-independent.

Operating a speaker recognition system typically involves two stages. In the first stage, a user enrolls in the system by providing the system with one or more samples of his speech. These training samples are used by the system to build a model for that user. In the second stage, the user provides a test sample to be used by the system to test the similarity between the test sample and the model(s) of the user(s) in order to perform its corresponding function (e.g., speaker identification or speaker verification).
Various techniques have been developed over the years to train or construct speaker models for use in speaker recognition systems. Some of the earlier approaches use long-term averages of acoustic features, for example spectrum representations or pitch information. With respect to spectral features, the long-term average represents a speaker's average vocal tract shape. Another approach is to model the speaker- dependent acoustic features within the individual phonetic units that make up the speech. This technique is concerned with measuring speaker differences rather than textual differences by comparing acoustic features from the phonetic units in a test sample with previously obtained speaker-dependent acoustic features from similar phonetic units. This approach can be accomplished using either explicit or implicit segmentation of the speech into phonetic unit classes before the speaker model training or recognition. Explicit segmentation generally uses a hidden Markov model (HMM) based continuous speech recognizer as a front-end segmenter. Implicit segmentation uses some form of unsupervised clustering to provide the segmentation during training and recognition.
The Gaussian mixture speaker model has been successfully and widely used for text-independent speaker verification. This modeling technique basically uses a Gaussian mixture density (a weighted sum of several multivariate Gaussian functions) to represent or model the distribution of training feature vectors. In theory each Gaussian function may have a full covariance matrix. However, the diagonal covariance matrix has been mostly used in practice because of its computational advantages. Generally, the elements of feature vectors extracted from a speech signal are correlated. A linear combination of diagonal covariance Gaussian functions is capable of modeling the correlation. However, a large number of mixtures needs to be used in order to provide a good approximation for the distribution of the feature vectors extracted from a person's speech. Consequently, a large GMM needs a large amount of training data for a good estimation of the GMM parameters, takes a long time to train, and is slow in response due to its large size. Thus there exists a need to improve the system performance of GMM-based speaker recognition systems.
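To make the weighted-sum formulation concrete, the following is a minimal sketch (in Python with NumPy; the function name and parameter layout are illustrative choices, not details from this disclosure) of evaluating a diagonal-covariance Gaussian mixture density at a single feature vector:

```python
import numpy as np

def diag_gmm_density(x, weights, means, variances):
    """Evaluate a diagonal-covariance Gaussian mixture density at x.

    weights:   (M,)   mixture weights summing to 1
    means:     (M, d) component mean vectors
    variances: (M, d) per-dimension variances (the diagonal covariances)
    """
    d = x.shape[0]
    diff = x - means                                   # (M, d) deviations
    # Normalization term (2*pi)^(-d/2) * |Sigma|^(-1/2) for each component;
    # for a diagonal covariance, |Sigma| is the product of the variances.
    norm = (2 * np.pi) ** (-d / 2) * np.prod(variances, axis=1) ** -0.5
    expo = np.exp(-0.5 * np.sum(diff ** 2 / variances, axis=1))
    return float(np.sum(weights * norm * expo))
```

A single-component mixture evaluated at its own mean reduces to the Gaussian normalization constant, which offers a quick sanity check of the formula.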
BRIEF DESCRIPTION OF THE DRAWINGS
The features and advantages of the present invention will be more fully understood by reference to the accompanying drawings, in which:
Figure 1 is a block diagram of one embodiment of a speaker recognition system according to the teachings of the present invention;
Figure 2 is a flow diagram of one embodiment of a method according to the teachings of the present invention; and
Figure 3 shows a flow diagram of one embodiment of a method according to the teachings of the present invention.
DETAILED DESCRIPTION
In the following detailed description numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be appreciated by one skilled in the art that the present invention may be understood and practiced without these specific details.
In the discussion below, the teachings of the present invention are utilized to implement a method, apparatus, system, and machine-readable medium for performing speaker verification based on orthogonal Gaussian mixture model (GMM). In one embodiment, a test signal representing a test speech is converted or transformed into a set of test feature vectors that represent the identity of a test speaker who claims a particular identity. The test feature vectors are then transformed using corresponding linear transform matrices associated with a speaker independent Gaussian mixture model (SIGMM) that was previously trained. The system then determines whether to accept or reject the claimed identity of the test speaker based upon the transformed test feature vectors representing the identity of the test speaker and the models including the speaker dependent model representing the claimed identity and the anti-models corresponding to the claimed identity (cohort models or background model). The speaker dependent model, in one embodiment, is represented by a speaker dependent GMM (SDGMM) which was constructed using linear transform matrices associated with corresponding mixtures in the speaker independent model. In one embodiment, the training feature vectors used to construct the SDGMM are first transformed by the corresponding linear transform matrices associated with the speaker independent model and the parameters of each mixture of the respective SDGMM are trained based upon the transformed training feature vectors. In one embodiment, the speaker independent GMM is constructed using
speech samples provided by a large set of speakers in a training corpus. After the speaker independent GMM is trained, a linear transform matrix is computed for each mixture of the speaker independent model and the resultant linear transform matrices are utilized for the training of the speaker dependent models. In one embodiment, the speaker independent model is trained using the expectation-maximization (EM) method. The speaker dependent models are then trained using the maximum a posteriori (MAP) adaptation method. The linear transform matrices computed for the speaker independent model are shared by the speaker dependent model mixtures that are adapted from the same mixtures of the speaker independent model. The teachings of the present invention are applicable to any scheme, method and system for speaker recognition that employs GMMs as the probabilistic model of the underlying sounds of a speaker's voice. However, the present invention is not limited to speaker recognition systems and can be applied to other types of probabilistic and data modeling in speech recognition and in other fields or disciplines including, but not limited to, image processing, signal processing, geometric modeling, computer-aided design (CAD), computer-aided manufacturing (CAM), etc.
The present invention provides a method and a system that combines orthogonal GMM with maximum a posteriori (MAP) adaptation. Using this method, the correlation of feature vectors can be modeled much better than in diagonal GMM-based speaker verification systems. As described above, in GMM-based speaker verification systems, the distribution of feature vectors extracted from a speaker's speech is modeled by a Gaussian mixture density, which is a weighted sum of several multivariate Gaussian functions. A Gaussian function has the form shown below:
P(X) = (2π)^(-d/2) |Σ|^(-1/2) exp[-(1/2)(X − μ)ᵀ Σ⁻¹ (X − μ)]
where X is a d-dimensional feature vector, μ is the mean vector, and Σ is the covariance matrix. Based upon linear algebra theory, it is known that a covariance matrix can be diagonalized if the vectors are linearly transformed to the space spanned by the eigenvectors of the original covariance matrix. If the covariance matrix of a speaker is Σx and the transform matrix Ω is composed of the eigenvectors of Σx, then after the linear transformation y = Ωᵀx, the covariance matrix in the Y space, Σy, is diagonal. Σy and μy are related to Σx and μx according to the following equations:
Σy = Ωᵀ Σx Ω,    μy = Ωᵀ μx
From linear algebra, it can be seen that the diagonal Gaussian function in Y space is equivalent to the Gaussian function with full covariance matrix in X space. Accordingly, a diagonal GMM in Y space would provide a better approximation to the distribution of feature vectors compared to a diagonal GMM in X space. The GMM with orthogonal transform is referred to as orthogonal GMM herein. Based upon the observations described above, the present invention provides a method to model the correlation of feature vectors more accurately than diagonal GMM-based speaker verification systems. First, a speaker independent model is trained using the EM algorithm. The eigenvectors of the covariance matrix for each mixture are calculated. The linear transform matrix for each mixture in the speaker independent model is composed of these eigenvectors. Second, a speaker dependent model for each new speaker enrolled in the system is trained using the MAP method. In speaker dependent model training, the linear transform matrices computed for the speaker independent model are shared by the mixtures of the speaker dependent model that are adapted from the same mixture of the speaker independent model. Using this shared transformation, the covariance matrices in the transformed spaces are more nearly diagonal. Accordingly, the diagonal Gaussian functions in the transformed spaces will provide a better approximation to the distribution of feature vectors. Furthermore, in MAP adaptation, the covariance matrices are usually adapted much less than the means; some systems therefore use only mean adaptation. Sharing the transformation matrices makes MAP adaptation more effective because each transformation matrix is computed from the corresponding covariance matrix.
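The diagonalization property described above can be checked numerically. The sketch below (the covariance values are purely illustrative) builds the transform matrix Ω from the eigenvectors of a full covariance matrix and verifies that ΩᵀΣxΩ is diagonal:

```python
import numpy as np

# A full covariance matrix of correlated features in X space
# (illustrative values only)
sigma_x = np.array([[2.0, 0.8],
                    [0.8, 1.0]])

# Columns of omega are the eigenvectors of sigma_x; eigh is used
# because a covariance matrix is symmetric
eigvals, omega = np.linalg.eigh(sigma_x)

# After the linear transform y = omega^T x, the covariance in Y space
# is sigma_y = omega^T sigma_x omega, which is diagonal with the
# eigenvalues of sigma_x on its diagonal
sigma_y = omega.T @ sigma_x @ omega
```

Because the off-diagonal terms of `sigma_y` vanish (up to floating-point error), a diagonal Gaussian in the Y space captures the correlation that the full covariance expressed in the X space.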
In the decoding phase, the feature vectors are first transformed to the spaces spanned by the eigenvectors of the covariance matrix of each corresponding speaker-independent mixture. The probability or similarity measure between the input speech and the speaker dependent model is then computed using the transformed vectors for each speaker in the set.
Figure 1 illustrates a block diagram of one embodiment of a speaker verification system 100 according to the teachings of the present invention. The system 100, as shown in Figure 1, includes an analog to digital converter (A/D) 110, a feature extractor or spectral analysis unit 120, a similarity measurement unit 130, a speaker dependent
model or reference database 140, and a decision-making unit 150. An input signal 101 representing a sample speech of a speaker whose claimed identity is to be verified by the system (also referred to as the test speaker herein) is first digitized using the A/D 110. The digital signal is sliced into frames at a suitable rate (e.g., 10, 15, or 20 ms). The digital signal is then converted or transformed into a set of feature vectors containing acoustic parameters that convey the identity characteristics of the test speaker. The feature vectors are then input to the similarity measurement unit 130, which computes a similarity measure between the identity of the test speaker as represented by the feature vectors and the claimed identity that is represented by a previously constructed model stored in the speaker dependent model database 140. The decision-making unit 150 then compares the similarity measure computed by the similarity measurement unit 130 to a predetermined value or threshold and decides whether to accept or reject the claimed identity of the test speaker. As described above, in the present embodiment, the test feature vectors are transformed using the corresponding linear transform matrices computed from a previously trained speaker independent GMM. The similarity measurement unit 130 computes the similarity measure based upon the transformed test feature vectors and a set of speaker dependent models stored in the database 140. The parameters of the speaker dependent models are trained using transformed training feature vectors that are obtained by transforming the training feature vectors extracted from training speech samples using the corresponding linear transform matrices associated with the previously trained speaker independent GMM.
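One plausible way to realize the similarity measurement with per-mixture transforms is sketched below. The function name, parameter shapes, and the use of a per-frame average log-likelihood are illustrative assumptions, not details taken from this disclosure:

```python
import numpy as np

def orthogonal_gmm_log_likelihood(frames, weights, means_y, vars_y, transforms):
    """Average per-frame log-likelihood of frames under an orthogonal GMM.

    Each mixture m has its own transform matrix transforms[m] (eigenvectors of
    the corresponding speaker-independent mixture covariance); means_y (M, d)
    and vars_y (M, d) are diagonal-Gaussian parameters in mixture m's Y space.
    """
    M, d = means_y.shape
    total = 0.0
    for x in frames:
        dens = 0.0
        for m in range(M):
            y = transforms[m].T @ x            # transform into mixture m's space
            diff = y - means_y[m]
            norm = (2 * np.pi) ** (-d / 2) / np.sqrt(np.prod(vars_y[m]))
            dens += weights[m] * norm * np.exp(-0.5 * np.sum(diff ** 2 / vars_y[m]))
        total += np.log(dens)
    return total / len(frames)
```

A score like this, computed for the claimed-identity model and for the anti-models, would then feed the decision-making step.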
Figure 2 shows a flow diagram of a method 200 for performing speaker verification according to the teachings of the present invention. The method starts at block 201 and proceeds to block 210. At block 210, a test signal representing a test speech is converted into a set of test feature vectors. The test speech is provided by a test speaker who claims a particular identity. At block 220, the test feature vectors are transformed to the spaces spanned by the eigenvectors of the covariance matrix of each corresponding speaker-independent mixture, using the corresponding linear transform matrices that were previously computed for the speaker independent GMM. At block 230, a determination is made whether to accept or reject the claimed identity of the test speaker based upon the transformed test feature vectors that represent the identity of the test speaker and a previously trained speaker dependent model that
represents the claimed identity. As described above, the speaker dependent model is represented by a speaker dependent GMM that was constructed using the corresponding linear transform matrices associated with the speaker independent model.
In general, a likelihood ratio test is used to determine whether to accept or reject the claimed speaker. The likelihood ratio test is well known in the art. Assuming that an utterance or sample speech X is given by a speaker who claims to be a particular speaker Y with a corresponding model My, the likelihood ratio is:
R(X) = P(X is from the claimed speaker) / P(X is not from the claimed speaker) = P(My | X) / P(M̄y | X)
The likelihood ratio in the log domain is shown below:

R(X) = log p(X | My) − log p(X | M̄y)

where p(X | My) is the probability or likelihood of the speech being generated by the claimed speaker and p(X | M̄y) is the probability or likelihood of the speech given that it is not from the claimed speaker. The likelihood ratio is then compared to a threshold value θ. The claimed speaker is accepted if the likelihood ratio exceeds the threshold value and is rejected if it is less than the threshold value. The decision threshold value can be set to adjust the tradeoff between false rejection errors and false acceptance errors. In the present embodiment, a cohort normalization method is used to compute the likelihood ratio. Instead of computing the likelihood ratio using the average score of the entire set of speaker dependent models, only the average score of the subset having the highest scores (excluding the score of the claimed identity model) is used. For example, assuming that there are 100 speaker dependent models that represent a set of 100 speakers that have been enrolled in the system (one model for each speaker in the set), during the verification phase, the probability of the input speech (as represented by the transformed feature vectors described above) being generated from each model (including the model that represents the claimed identity) is computed. Then the average score for a predetermined number of the top scores (e.g., the top 10 scores), excluding the score of the claimed identity, is computed. This average score is then compared with the score of the claimed identity to generate the likelihood ratio. The likelihood ratio obtained is then compared with a predetermined threshold value to determine whether to accept or reject the claimed identity.
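The cohort normalization described above can be sketched as follows; the speaker identifiers, dictionary layout, and default cohort size are illustrative assumptions:

```python
import numpy as np

def cohort_log_likelihood_ratio(scores, claimed_id, top_n=10):
    """Cohort-normalized log-likelihood ratio.

    scores: dict mapping speaker id -> log-likelihood of the test utterance
    under that speaker's model. The anti-speaker term is the average of the
    top_n highest scores, excluding the claimed speaker's own score.
    """
    claimed = scores[claimed_id]
    cohort = sorted((s for sid, s in scores.items() if sid != claimed_id),
                    reverse=True)[:top_n]
    return claimed - float(np.mean(cohort))

def accept(scores, claimed_id, threshold, top_n=10):
    """Accept the claimed identity if the ratio exceeds the threshold."""
    return cohort_log_likelihood_ratio(scores, claimed_id, top_n) >= threshold
```

Raising the threshold trades more false rejections for fewer false acceptances, and vice versa, matching the tradeoff noted above.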
Figure 3 illustrates a flow diagram of one embodiment of a method 300 according to the teachings of the present invention. The method 300 starts at block 301 and proceeds to block 305 to perform speaker independent model training. At block 309, a speaker independent GMM having M mixtures is trained using the expectation-maximization (EM) technique. At block 311, the linear transform matrix for each mixture of the speaker independent GMM is computed. In one embodiment, this is done by calculating the eigenvectors of the covariance matrix for each mixture. The linear transform matrix for each mixture is composed of the corresponding eigenvectors. The method 300 proceeds to block 313 to perform speaker dependent model training (enrolling new speakers). At block 317, the feature vectors of the training speech provided by a speaker being enrolled are transformed to the spaces spanned by the corresponding linear transform matrices that were computed previously based on the mixtures of the speaker independent model. As described above, according to the teachings of the present invention, the linear transform matrices computed from the speaker independent model are shared by the mixtures that are adapted from the same mixture of the speaker independent model. By this shared transformation, the covariance matrices in the transformed spaces are more diagonal. At block 321, the parameters of each mixture of the speaker dependent GMM are trained in the transformed spaces using the MAP algorithm. The process of speaker dependent model training is performed for each speaker enrolled in the system. The method 300 proceeds to block 331 to perform the speaker verification task.
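The MAP training of mixture means in the transformed spaces can be illustrated with mean-only adaptation (consistent with the observation that covariances are adapted much less than means). The relevance factor tau, the array shapes, and the function name are illustrative assumptions:

```python
import numpy as np

def map_adapt_means(frames_y, posteriors, si_means_y, tau=16.0):
    """MAP adaptation of mixture means in the transformed (Y) spaces.

    frames_y:   (T, M, d) training frames, each already transformed by the
                matrix shared from the corresponding speaker-independent mixture
    posteriors: (T, M) mixture occupation probabilities for each frame
    si_means_y: (M, d) speaker-independent means in the transformed spaces
    tau:        relevance factor; larger values trust the prior means more
    """
    n = posteriors.sum(axis=0)                  # (M,) soft frame counts
    # Posterior-weighted average of the transformed frames per mixture
    ex = (np.einsum('tm,tmd->md', posteriors, frames_y)
          / np.maximum(n, 1e-10)[:, None])
    alpha = (n / (n + tau))[:, None]            # data-vs-prior interpolation
    return alpha * ex + (1.0 - alpha) * si_means_y
```

With few frames assigned to a mixture, alpha stays near zero and the adapted mean stays close to the speaker-independent prior; with many frames, it moves toward the speaker's own data.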
At block 335, the feature vectors extracted from a test speech of a speaker who claims a particular identity are transformed to the corresponding spaces by the corresponding transform matrices. In other words, these feature vectors are transformed to the spaces spanned by the eigenvectors of the covariance matrix of each corresponding speaker independent mixture. At block 339, the probabilities of the feature vectors with respect to the speaker dependent models are calculated in the corresponding spaces to obtain verification results (i.e., whether to accept or reject the claimed identity).
The invention has been described in conjunction with the preferred embodiment. It is evident that numerous alternatives, modifications, variations and uses will be apparent to those skilled in the art in light of the foregoing description.