METHOD, APPARATUS, AND SYSTEM FOR SPEAKER VERIFICATION BASED ON ORTHOGONAL GAUSSIAN MIXTURE MODEL (GMM)
FIELD OF THE INVENTION
The present invention relates to the field of speaker recognition. More specifically, the present invention relates to a method, apparatus, and system for speaker verification based upon orthogonal Gaussian mixture model (GMM).
BACKGROUND OF THE INVENTION
The speech signal can convey various types of information at different levels. In particular, the speech signal not only conveys a message as a sequence of words, it also conveys speaker-specific information, for example, information about the identity of the speaker who produces the speech signal. In general, the field of speech recognition is concerned with extracting the underlying message conveyed in the speech signal, while the field of speaker recognition deals with extracting and verifying the identity of the speaker who generates the speech signal. Speaker recognition can be divided into two areas: speaker identification and speaker verification. In speaker identification, the task is to determine the identity of a speaker based upon a speech sample provided by that speaker. In speaker verification, the task is to verify whether a speaker is whom he or she claims to be, based upon a speech sample provided by that speaker. Accordingly, speaker verification involves a two-way classification or a binary test to determine whether the speaker's claim is correct or not. In addition, either speaker identification or speaker verification can be constrained to specific phrases or text, which is referred to as text-dependent, or unconstrained to any specific text, which is referred to as text-independent.

Operating a speaker recognition system typically involves two stages. In the first stage, a user enrolls in the system by providing the system with one or more samples of his speech. These training samples are used by the system to build a model for that user. In the second stage, the user provides a test sample to be used by the system to test the similarity between the test sample and the model(s) of the user(s) in order to perform its corresponding function (e.g., speaker identification or speaker verification).
Various techniques have been developed over the years to train or construct speaker models for use in speaker recognition systems. Some of the earlier approaches use long-term averages of acoustic features, for example spectrum representations or pitch information. With respect to spectral features, the long-term average represents a speaker's average vocal tract shape. Another approach is to model the speaker- dependent acoustic features within the individual phonetic units that make up the speech. This technique is concerned with measuring speaker differences rather than textual differences by comparing acoustic features from the phonetic units in a test sample with previously obtained speaker-dependent acoustic features from similar phonetic units. This approach can be accomplished using either explicit or implicit segmentation of the speech into phonetic unit classes before the speaker model training or recognition. Explicit segmentation generally uses a hidden Markov model (HMM) based continuous speech recognizer as a front-end segmenter. Implicit segmentation uses some form of unsupervised clustering to provide the segmentation during training and recognition.
The Gaussian mixture speaker model has been successfully and widely used for text-independent speaker verification. This modeling technique basically uses a Gaussian mixture density (a weighted sum of several multivariate Gaussian functions) to represent or model the distribution of training feature vectors. In theory each Gaussian function may have a full covariance matrix. However, the diagonal covariance matrix has been mostly used in practice because of its computational advantages. Generally, the elements of feature vectors extracted from a speech signal are correlated. A linear combination of diagonal covariance Gaussian functions is capable of modeling the correlation. However, a large number of mixtures needs to be used in order to provide a good approximation for the distribution of the feature vectors extracted from a person's speech. Consequently, a large GMM needs a large amount of training data for a good estimation of the GMM parameters, takes a long time to train, and is slow in response due to its large size. Thus there exists a need to improve the system performance of GMM-based speaker recognition systems.
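To make the weighted-sum formulation concrete, the following is a minimal sketch (in Python with NumPy; the function name and parameter layout are illustrative choices, not details from this disclosure) of evaluating a diagonal-covariance Gaussian mixture density at a single feature vector:

```python
import numpy as np

def diag_gmm_density(x, weights, means, variances):
    """Evaluate a diagonal-covariance Gaussian mixture density at x.

    weights:   (M,)   mixture weights summing to 1
    means:     (M, d) component mean vectors
    variances: (M, d) per-dimension variances (the diagonal covariances)
    """
    d = x.shape[0]
    diff = x - means                                   # (M, d) deviations
    # Normalization term (2*pi)^(-d/2) * |Sigma|^(-1/2) for each component;
    # for a diagonal covariance, |Sigma| is the product of the variances.
    norm = (2 * np.pi) ** (-d / 2) * np.prod(variances, axis=1) ** -0.5
    expo = np.exp(-0.5 * np.sum(diff ** 2 / variances, axis=1))
    return float(np.sum(weights * norm * expo))
```

A single-component mixture evaluated at its own mean reduces to the Gaussian normalization constant, which offers a quick sanity check of the formula.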
BRIEF DESCRIPTION OF THE DRAWINGS
The features and advantages of the present invention will be more fully understood by reference to the accompanying drawings, in which:
Figure 1 is a block diagram of one embodiment of a speaker recognition system according to the teachings of the present invention;
Figure 2 is a flow diagram of one embodiment of a method according to the teachings of the present invention; and
Figure 3 shows a flow diagram of one embodiment of a method according to the teachings of the present invention.
DETAILED DESCRIPTION
In the following detailed description numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be appreciated by one skilled in the art that the present invention may be understood and practiced without these specific details.
In the discussion below, the teachings of the present invention are utilized to implement a method, apparatus, system, and machine-readable medium for performing speaker verification based on orthogonal Gaussian mixture model (GMM). In one embodiment, a test signal representing a test speech is converted or transformed into a set of test feature vectors that represent the identity of a test speaker who claims a particular identity. The test feature vectors are then transformed using corresponding linear transform matrices associated with a speaker independent Gaussian mixture model (SIGMM) that was previously trained. The system then determines whether to accept or reject the claimed identity of the test speaker based upon the transformed test feature vectors representing the identity of the test speaker and the models including the speaker dependent model representing the claimed identity and the anti-models corresponding to the claimed identity (cohort models or background model). The speaker dependent model, in one embodiment, is represented by a speaker dependent GMM (SDGMM) which was constructed using linear transform matrices associated with corresponding mixtures in the speaker independent model. In one embodiment, the training feature vectors used to construct the SDGMM are first transformed by the corresponding linear transform matrices associated with the speaker independent model and the parameters of each mixture of the respective SDGMM are trained based upon the transformed training feature vectors. In one embodiment, the speaker independent GMM is constructed using
speech samples provided by a large set of speakers in a training corpus. After the speaker independent GMM is trained, a linear transform matrix is computed for each mixture of the speaker independent model and the resultant linear transform matrices are utilized for the training of the speaker dependent models. In one embodiment, the speaker independent model is trained using the expectation-maximization (EM) method. The speaker dependent models are then trained using the maximum a posteriori (MAP) adaptation method. The linear transform matrices computed for the speaker independent model are shared by the speaker dependent model mixtures that are adapted from the same mixtures of the speaker independent model. The teachings of the present invention are applicable to any scheme, method and system for speaker recognition that employs GMMs as the probabilistic model of the underlying sounds of a speaker's voice. However, the present invention is not limited to speaker recognition systems and can be applied to other types of probabilistic and data modeling in speech recognition and in other fields or disciplines including, but not limited to, image processing, signal processing, geometric modeling, computer-aided design (CAD), computer-aided manufacturing (CAM), etc.
The present invention provides a method and a system that combines orthogonal GMM with maximum a posteriori (MAP) adaptation. Using this method, the correlation of feature vectors can be modeled much better than in diagonal GMM-based speaker verification systems. As described above, in GMM-based speaker verification systems, the distribution of feature vectors extracted from a speaker's speech is modeled by a Gaussian mixture density, which is a weighted sum of several multivariate Gaussian functions. A Gaussian function has the form shown below:
P(X) = (2π)^(-d/2) |Σ|^(-1/2) exp[-(1/2)(X − μ)ᵀ Σ⁻¹ (X − μ)]
where X is a d-dimensional feature vector, μ is the mean vector, and Σ is the covariance matrix. Based upon linear algebra theory, it is known that a covariance matrix can be diagonalized if the vectors are linearly transformed to the space spanned by the eigenvectors of the original covariance matrix. If the covariance matrix of a speaker is Σx and the transform matrix Ω is composed of the eigenvectors of Σx, then after the linear transformation y = Ωᵀx, the covariance matrix in the Y space, Σy, is diagonal. Σy and μy are related to Σx and μx according to the following equations:
Σy = Ωᵀ Σx Ω,    μy = Ωᵀ μx
From linear algebra, it can be seen that the diagonal Gaussian function in Y space is equivalent to the Gaussian function with full covariance matrix in X space. Accordingly, a diagonal GMM in Y space would provide a better approximation to the distribution of feature vectors compared to a diagonal GMM in X space. The GMM with orthogonal transform is referred to as orthogonal GMM herein. Based upon the observations described above, the present invention provides a method to model the correlation of feature vectors more accurately than diagonal GMM-based speaker verification systems. First, a speaker independent model is trained using the EM algorithm. The eigenvectors of the covariance matrix for each mixture are calculated. The linear transform matrix for each mixture in the speaker independent model is composed of these eigenvectors. Second, a speaker dependent model for each new speaker enrolled in the system is trained using the MAP method. In speaker dependent model training, the linear transform matrices computed for the speaker independent model are shared by the mixtures of the speaker dependent model that are adapted from the same mixture of the speaker independent model. Using this shared transformation, the covariance matrices in the transformed spaces are more nearly diagonal. Accordingly, the diagonal Gaussian functions in the transformed spaces will provide a better approximation to the distribution of feature vectors. Furthermore, in MAP adaptation, the covariance matrices are usually adapted much less than the means; some systems therefore use only mean adaptation. Sharing the transformation matrices makes MAP adaptation more effective because each transformation matrix is computed from the corresponding covariance matrix.
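The diagonalization property described above can be checked numerically. The sketch below (the covariance values are purely illustrative) builds the transform matrix Ω from the eigenvectors of a full covariance matrix and verifies that ΩᵀΣxΩ is diagonal:

```python
import numpy as np

# A full covariance matrix of correlated features in X space
# (illustrative values only)
sigma_x = np.array([[2.0, 0.8],
                    [0.8, 1.0]])

# Columns of omega are the eigenvectors of sigma_x; eigh is used
# because a covariance matrix is symmetric
eigvals, omega = np.linalg.eigh(sigma_x)

# After the linear transform y = omega^T x, the covariance in Y space
# is sigma_y = omega^T sigma_x omega, which is diagonal with the
# eigenvalues of sigma_x on its diagonal
sigma_y = omega.T @ sigma_x @ omega
```

Because the off-diagonal terms of `sigma_y` vanish (up to floating-point error), a diagonal Gaussian in the Y space captures the correlation that the full covariance expressed in the X space.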
In the decoding phase, the feature vectors are first transformed to the spaces spanned by the eigenvectors of the covariance matrix of each corresponding speaker-independent mixture. The probability or similarity measure between the input speech and the speaker dependent model is then computed using the transformed vectors for each speaker in the set.
Figure 1 illustrates a block diagram of one embodiment of a speaker verification system 100 according to the teachings of the present invention. The system 100, as shown in Figure 1, includes an analog to digital converter (A/D) 110, a feature extractor or spectral analysis unit 120, a similarity measurement unit 130, a speaker dependent
model or reference database 140, and a decision-making unit 150. An input signal 101 representing a sample speech of a speaker whose claimed identity is to be verified by the system (also referred to as the test speaker herein) is first digitized using the A/D 110. The digital signal is sliced into frames at a suitable rate (e.g., 10, 15, or 20 ms). The digital signal is then converted or transformed into a set of feature vectors containing acoustic parameters that convey the identity characteristics of the test speaker. The feature vectors are then input to the similarity measurement unit 130, which computes a similarity measure between the identity of the test speaker as represented by the feature vectors and the claimed identity that is represented by a previously constructed model stored in the speaker dependent model database 140. The decision-making unit 150 then compares the similarity measure computed by the similarity measurement unit 130 to a predetermined value or threshold and decides whether to accept or reject the claimed identity of the test speaker. As described above, in the present embodiment, the test feature vectors are transformed using the corresponding linear transform matrices computed from a previously trained speaker independent GMM. The similarity measurement unit 130 computes the similarity measure based upon the transformed test feature vectors and a set of speaker dependent models stored in the database 140. The parameters of the speaker dependent models are trained using transformed training feature vectors that are obtained by transforming the training feature vectors extracted from training speech samples using the corresponding linear transform matrices associated with the previously trained speaker independent GMM.
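One plausible way to realize the similarity measurement with per-mixture transforms is sketched below. The function name, parameter shapes, and the use of a per-frame average log-likelihood are illustrative assumptions, not details taken from this disclosure:

```python
import numpy as np

def orthogonal_gmm_log_likelihood(frames, weights, means_y, vars_y, transforms):
    """Average per-frame log-likelihood of frames under an orthogonal GMM.

    Each mixture m has its own transform matrix transforms[m] (eigenvectors of
    the corresponding speaker-independent mixture covariance); means_y (M, d)
    and vars_y (M, d) are diagonal-Gaussian parameters in mixture m's Y space.
    """
    M, d = means_y.shape
    total = 0.0
    for x in frames:
        dens = 0.0
        for m in range(M):
            y = transforms[m].T @ x            # transform into mixture m's space
            diff = y - means_y[m]
            norm = (2 * np.pi) ** (-d / 2) / np.sqrt(np.prod(vars_y[m]))
            dens += weights[m] * norm * np.exp(-0.5 * np.sum(diff ** 2 / vars_y[m]))
        total += np.log(dens)
    return total / len(frames)
```

A score like this, computed for the claimed-identity model and for the anti-models, would then feed the decision-making step.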
Figure 2 shows a flow diagram of a method 200 for performing speaker verification according to the teachings of the present invention. The method starts at block 201 and proceeds to block 210. At block 210, a test signal representing a test speech is converted into a set of test feature vectors. The test speech is provided by a test speaker who claims a particular identity. At block 220, the test feature vectors are transformed to the spaces spanned by the eigenvectors of the covariance matrix of each corresponding speaker-independent mixture, using the corresponding linear transform matrices that were previously computed for the speaker independent GMM. At block 230, a determination is made whether to accept or reject the claimed identity of the test speaker based upon the transformed test feature vectors that represent the identity of the test speaker and a previously trained speaker dependent model that
represents the claimed identity. As described above, the speaker dependent model is represented by a speaker dependent GMM that was constructed using the corresponding linear transform matrices associated with the speaker independent model.
In general, a likelihood ratio test is used to determine whether to accept or reject the claimed speaker. The likelihood ratio test is well known in the art. Assuming that an utterance or sample speech X is given by a speaker who claims to be a particular speaker Y with a corresponding model My, the likelihood ratio is:
R(X) = P(X is from the claimed speaker) / P(X is not from the claimed speaker) = P(My | X) / P(M̄y | X)
The likelihood ratio in the log domain is shown below:

R(X) = log p(X | My) − log p(X | M̄y)

where p(X | My) is the probability or likelihood of the speech being generated by the claimed speaker and p(X | M̄y) is the probability or likelihood of the speech given that it is not from the claimed speaker. The likelihood ratio is then compared to a threshold value θ. The claimed speaker is accepted if the likelihood ratio exceeds the threshold value and is rejected if it is less than the threshold value. The decision threshold value can be set to adjust the tradeoff between false rejection errors and false acceptance errors. In the present embodiment, a cohort normalization method is used to compute the likelihood ratio. Instead of computing the likelihood ratio using the average score of the entire set of speaker dependent models, only the average score of the subset having the highest scores (excluding the score of the claimed identity model) is used. For example, assuming that there are 100 speaker dependent models that represent a set of 100 speakers that have been enrolled in the system (one model for each speaker in the set), during the verification phase, the probability of the input speech (as represented by the transformed feature vectors described above) being generated from each model (including the model that represents the claimed identity) is computed. Then the average score for a predetermined number of the top scores (e.g., the top 10 scores), excluding the score of the claimed identity, is computed. This average score is then compared with the score of the claimed identity to generate the likelihood ratio. The likelihood ratio obtained is then compared with a predetermined threshold value to determine whether to accept or reject the claimed identity.
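The cohort normalization described above can be sketched as follows; the speaker identifiers, dictionary layout, and default cohort size are illustrative assumptions:

```python
import numpy as np

def cohort_log_likelihood_ratio(scores, claimed_id, top_n=10):
    """Cohort-normalized log-likelihood ratio.

    scores: dict mapping speaker id -> log-likelihood of the test utterance
    under that speaker's model. The anti-speaker term is the average of the
    top_n highest scores, excluding the claimed speaker's own score.
    """
    claimed = scores[claimed_id]
    cohort = sorted((s for sid, s in scores.items() if sid != claimed_id),
                    reverse=True)[:top_n]
    return claimed - float(np.mean(cohort))

def accept(scores, claimed_id, threshold, top_n=10):
    """Accept the claimed identity if the ratio exceeds the threshold."""
    return cohort_log_likelihood_ratio(scores, claimed_id, top_n) >= threshold
```

Raising the threshold trades more false rejections for fewer false acceptances, and vice versa, matching the tradeoff noted above.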
Figure 3 illustrates a flow diagram of one embodiment of a method 300 according to the teachings of the present invention. The method 300 starts at block 301 and proceeds to block 305 to perform speaker independent model training. At block 309, a speaker independent GMM having M mixtures is trained using the expectation-maximization (EM) technique. At block 311, the linear transform matrix for each mixture of the speaker independent GMM is computed. In one embodiment, this is done by calculating the eigenvectors of the covariance matrix for each mixture. The linear transform matrix for each mixture is composed of the corresponding eigenvectors. The method 300 proceeds to block 313 to perform speaker dependent model training (enrolling new speakers). At block 317, the feature vectors of the training speech provided by a speaker being enrolled are transformed to the spaces spanned by the corresponding linear transform matrices that were computed previously based on the mixtures of the speaker independent model. As described above, according to the teachings of the present invention, the linear transform matrices computed from the speaker independent model are shared by the mixtures that are adapted from the same mixture of the speaker independent model. By this shared transformation, the covariance matrices in the transformed spaces are more diagonal. At block 321, the parameters of each mixture of the speaker dependent GMM are trained in the transformed spaces using the MAP algorithm. The process of speaker dependent model training is performed for each speaker enrolled in the system. The method 300 proceeds to block 331 to perform the speaker verification task.
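The MAP training of mixture means in the transformed spaces can be illustrated with mean-only adaptation (consistent with the observation that covariances are adapted much less than means). The relevance factor tau, the array shapes, and the function name are illustrative assumptions:

```python
import numpy as np

def map_adapt_means(frames_y, posteriors, si_means_y, tau=16.0):
    """MAP adaptation of mixture means in the transformed (Y) spaces.

    frames_y:   (T, M, d) training frames, each already transformed by the
                matrix shared from the corresponding speaker-independent mixture
    posteriors: (T, M) mixture occupation probabilities for each frame
    si_means_y: (M, d) speaker-independent means in the transformed spaces
    tau:        relevance factor; larger values trust the prior means more
    """
    n = posteriors.sum(axis=0)                  # (M,) soft frame counts
    # Posterior-weighted average of the transformed frames per mixture
    ex = (np.einsum('tm,tmd->md', posteriors, frames_y)
          / np.maximum(n, 1e-10)[:, None])
    alpha = (n / (n + tau))[:, None]            # data-vs-prior interpolation
    return alpha * ex + (1.0 - alpha) * si_means_y
```

With few frames assigned to a mixture, alpha stays near zero and the adapted mean stays close to the speaker-independent prior; with many frames, it moves toward the speaker's own data.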
At block 335, the feature vectors extracted from a test speech of a speaker who claims a particular identity are transformed to the corresponding spaces by the corresponding transform matrices. In other words, these feature vectors are transformed to the spaces spanned by the eigenvectors of the covariance matrix of each corresponding speaker independent mixture. At block 339, the probabilities of the feature vectors with respect to the speaker dependent models are calculated in the corresponding spaces to obtain verification results (i.e., whether to accept or reject the claimed identity).
The invention has been described in conjunction with the preferred embodiment. It is evident that numerous alternatives, modifications, variations and uses will be apparent to those skilled in the art in light of the foregoing description.