
WO2002029785A1 - Method, apparatus, and system for speaker verification based on orthogonal Gaussian mixture model (GMM) - Google Patents


Info

Publication number
WO2002029785A1
WO2002029785A1 (PCT/CN2000/000303)
Authority
WO
WIPO (PCT)
Prior art keywords
speaker
model
test
feature vectors
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2000/000303
Other languages
French (fr)
Inventor
Xiaoxing Liu
Yonghong Yan
Baosheng Yuan
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Intel Corp
Original Assignee
Intel Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Intel Corp filed Critical Intel Corp
Priority to PCT/CN2000/000303 (WO2002029785A1)
Priority to AU2000276401A (AU2000276401A1)
Publication of WO2002029785A1
Anticipated expiration
Legal status: Ceased

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/06: Decision making techniques; Pattern matching strategies

Definitions

  • FIG. 3 illustrates a flow diagram of one embodiment of a method 300 according to the teachings of the present invention.
  • the method 300 starts at block 301 and proceeds to block 305 to perform speaker independent model training.
  • a speaker independent GMM having M mixtures is trained using the expectation-maximization (EM) technique.
  • the linear transform matrix for each mixture of the speaker independent GMM is computed. In one embodiment, this is done by calculating the eigenvectors of the covariance matrix for each mixture.
  • the linear transform matrix for each mixture is composed of the corresponding eigenvectors.
  • the method 300 proceeds to block 313 to perform speaker dependent model training (enrolling new speakers).
  • the feature vectors of the training speech provided by a speaker being enrolled are transformed to the spaces spanned by the corresponding linear transform matrices that were computed previously based on the mixtures of the speaker independent model.
  • the linear transform matrices computed from the speaker independent model are shared by the mixtures that are adapted from the same mixture of the speaker independent model. By this shared transformation, the covariance matrices in the transformed spaces are more nearly diagonal.
  • the parameters of each mixture of the speaker dependent GMM are trained in the transformed spaces using the MAP algorithm. The process of speaker dependent model training is performed for each speaker enrolled in the system. The method 300 then proceeds to block 331 to perform the speaker verification task.
  • the feature vectors extracted from a test speech of a speaker who claims a particular identity are transformed to the corresponding spaces by the corresponding transform matrices. In other words, these feature vectors are transformed to the spaces spanned by the eigenvectors of the covariance matrix of each corresponding speaker independent mixture.
  • the probabilities of the feature vectors with respect to the speaker dependent models are calculated in the corresponding spaces to obtain verification results (i.e., whether to accept or reject the claimed identity).


Abstract

According to one aspect of the invention, a method is provided in which a signal representing a speech provided by a speaker who claims a particular identity is converted into a set of feature vectors. The feature vectors are transformed into corresponding spaces using the corresponding linear transform matrices that were previously computed based upon a speaker independent Gaussian mixture model. A determination is made to determine whether to accept or reject the claimed identity of the speaker using the transformed feature vectors and a previously constructed speaker dependent model that represents the claimed identity. The speaker dependent model is represented by a speaker dependent Gaussian mixture model which is constructed using the linear transform matrices associated with the corresponding mixtures in the speaker independent model.

Description

METHOD, APPARATUS, AND SYSTEM FOR SPEAKER VERIFICATION BASED ON ORTHOGONAL GAUSSIAN MIXTURE MODEL (GMM)
FIELD OF THE INVENTION
The present invention relates to the field of speaker recognition. More specifically, the present invention relates to a method, apparatus, and system for speaker verification based upon an orthogonal Gaussian mixture model (GMM).
BACKGROUND OF THE INVENTION
The speech signal can convey various types of information at different levels. In particular, the speech signal not only conveys a message as a sequence of words, it also conveys speaker-specific information, for example, information about the identity of the speaker who produced it. In general, the field of speech recognition is concerned with extracting the underlying message conveyed in the speech signal, while the field of speaker recognition deals with extracting and verifying the identity of the speaker who generated it. Speaker recognition can be divided into two areas: speaker identification and speaker verification. In speaker identification, the task is to determine the identity of a speaker based upon a speech sample provided by that speaker. In speaker verification, the task is to verify whether a speaker is who he or she claims to be, based upon a speech sample provided by that speaker. Accordingly, speaker verification involves a two-way classification, or binary test, to determine whether the speaker's claim is correct. In addition, either task can be constrained to specific phrases or text, which is referred to as text-dependent, or unconstrained to any specific text, which is referred to as text-independent. Operating a speaker recognition system typically involves two stages. In the first stage, a user enrolls in the system by providing one or more samples of his or her speech. These training samples are used by the system to build a model for that user. In the second stage, the user provides a test sample that the system compares against the model(s) of the enrolled user(s) in order to perform its corresponding function (e.g., speaker identification or speaker verification). Various techniques have been developed over the years to train or construct speaker models for use in speaker recognition systems.
Some of the earlier approaches use long-term averages of acoustic features, for example spectrum representations or pitch information. With respect to spectral features, the long-term average represents a speaker's average vocal tract shape. Another approach is to model the speaker-dependent acoustic features within the individual phonetic units that make up the speech. This technique is concerned with measuring speaker differences rather than textual differences by comparing acoustic features from the phonetic units in a test sample with previously obtained speaker-dependent acoustic features from similar phonetic units. It can be accomplished using either explicit or implicit segmentation of the speech into phonetic unit classes before speaker model training or recognition. Explicit segmentation generally uses a hidden Markov model (HMM) based continuous speech recognizer as a front-end segmenter; implicit segmentation uses some form of unsupervised clustering to provide the segmentation during training and recognition.
The Gaussian mixture speaker model has been successfully and widely used for text-independent speaker verification. This modeling technique uses a Gaussian mixture density (a weighted sum of several multivariate Gaussian functions) to represent or model the distribution of training feature vectors. In theory, each Gaussian function may have a full covariance matrix. However, the diagonal covariance matrix has mostly been used in practice because of its computational advantages. Generally, the elements of feature vectors extracted from a speech signal are correlated. A linear combination of diagonal covariance Gaussian functions is capable of modeling the correlation. However, a large number of mixtures needs to be used in order to provide a good approximation for the distribution of the feature vectors extracted from a person's speech. Consequently, a large GMM needs a large amount of training data for a good estimation of the GMM parameters, takes a long time to train, and is slow in response due to its large size. Thus, there exists a need to improve the performance of GMM-based speaker recognition systems.
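As background for the modeling discussion above, evaluating a diagonal-covariance Gaussian mixture density can be sketched in a few lines of Python. This is an illustrative sketch only; the function names and the toy parameters are not from the patent.

```python
import math

def diag_gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    d = len(x)
    log_det = sum(math.log(v) for v in var)
    maha = sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mean, var))
    return -0.5 * (d * math.log(2 * math.pi) + log_det + maha)

def gmm_logpdf(x, weights, means, variances):
    """Log density of a diagonal GMM: log sum_k w_k * N(x; mu_k, var_k)."""
    logs = [math.log(w) + diag_gaussian_logpdf(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(l - top) for l in logs))

# a two-mixture toy model in two dimensions
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [2.0, 0.5]]
print(gmm_logpdf([0.0, 0.0], weights, means, variances))
```

The log-sum-exp trick matters in practice: per-frame likelihoods are tiny, and summing raw probabilities over many mixtures underflows quickly.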
BRIEF DESCRIPTION OF THE DRAWINGS
The features and advantages of the present invention will be more fully understood by reference to the accompanying drawings, in which: Figure 1 is a block diagram of one embodiment of a speaker recognition system according to the teachings of the present invention;
Figure 2 is a flow diagram of one embodiment of a method according to the teachings of the present invention; and
Figure 3 shows a flow diagram of one embodiment of a method according to the teachings of the present invention.
DETAILED DESCRIPTION
In the following detailed description numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be appreciated by one skilled in the art that the present invention may be understood and practiced without these specific details.
In the discussion below, the teachings of the present invention are utilized to implement a method, apparatus, system, and machine-readable medium for performing speaker verification based on an orthogonal Gaussian mixture model (GMM). In one embodiment, a test signal representing a test speech is converted or transformed into a set of test feature vectors that represent the identity of a test speaker who claims a particular identity. The test feature vectors are then transformed using the corresponding linear transform matrices associated with a speaker independent Gaussian mixture model (SIGMM) that was previously trained. The system then determines whether to accept or reject the claimed identity of the test speaker based upon the transformed test feature vectors representing the identity of the test speaker and the models, including the speaker dependent model representing the claimed identity and the anti-models corresponding to the claimed identity (cohort models or a background model). The speaker dependent model, in one embodiment, is represented by a speaker dependent GMM (SDGMM) which was constructed using the linear transform matrices associated with the corresponding mixtures in the speaker independent model. In one embodiment, the training feature vectors used to construct the SDGMM are first transformed by the corresponding linear transform matrices associated with the speaker independent model, and the parameters of each mixture of the respective SDGMM are trained based upon the transformed training feature vectors. In one embodiment, the speaker independent GMM is constructed using speech samples provided by a large set of speakers in a training corpus. After the speaker independent GMM is trained, a linear transform matrix is computed for each mixture of the speaker independent model, and the resultant linear transform matrices are utilized for the training of the speaker dependent models.
In one embodiment, the speaker independent model is trained using the expectation-maximization (EM) method. The speaker dependent models are then trained using the maximum a posteriori (MAP) adaptation method. The linear transform matrices computed for the speaker independent model are shared by the speaker dependent model mixtures that are adapted from the same mixtures of the speaker independent model. The teachings of the present invention are applicable to any scheme, method, or system for speaker recognition that employs GMMs as the probabilistic model of the underlying sounds of a speaker's voice. However, the present invention is not limited to speaker recognition systems and can be applied to other types of probabilistic and data modeling in speech recognition and in other fields or disciplines including, but not limited to, image processing, signal processing, geometric modeling, computer-aided design (CAD), computer-aided manufacturing (CAM), etc.
The present invention provides a method and a system that combine the orthogonal GMM with maximum a posteriori (MAP) adaptation. Using this method, the correlation of feature vectors can be modeled much better than in diagonal GMM-based speaker verification systems. As described above, in GMM-based speaker verification systems the distribution of feature vectors extracted from a speaker's speech is modeled by a Gaussian mixture density, which is a weighted sum of several multivariate Gaussian functions. A Gaussian function has the form shown below:
P(X) = (2π)^{-d/2} |Σ|^{-1/2} exp[ -½ (X − μ)ᵀ Σ⁻¹ (X − μ) ]
where X is a d-dimensional feature vector, μ is the mean vector, and Σ is the covariance matrix. Based upon linear algebra theory, it is known that a covariance matrix can be diagonalized if the vectors are linearly transformed to the space spanned by the eigenvectors of the original covariance matrix. If the covariance matrix of a speaker is Σ_x and the transform matrix Ω is composed of the eigenvectors of Σ_x, then after the linear transformation Y = Ωᵀ X, the covariance matrix in the Y space, Σ_y, is diagonal. Σ_y and μ_y are related to Σ_x and μ_x according to the following equations:

Σ_y = Ωᵀ Σ_x Ω,    μ_y = Ωᵀ μ_x
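The diagonalizing effect of this eigenvector transform can be checked numerically. The following sketch (illustrative only, using a hand-rolled eigendecomposition for a symmetric 2x2 matrix rather than the patent's implementation) builds Ω from the eigenvectors of a full covariance matrix and verifies that ΩᵀΣΩ is diagonal:

```python
import math

def eigvecs_sym2x2(a, b, c):
    """Return the orthogonal matrix Omega whose columns are unit
    eigenvectors of the symmetric 2x2 matrix [[a, b], [b, c]]
    (a Jacobi rotation by angle theta, with tan(2*theta) = 2b/(a-c))."""
    theta = 0.5 * math.atan2(2 * b, a - c)
    ct, st = math.cos(theta), math.sin(theta)
    return [[ct, -st], [st, ct]]

def transform_cov(omega, cov):
    """Compute Omega^T * cov * Omega."""
    m = [[sum(cov[i][k] * omega[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]
    return [[sum(omega[k][i] * m[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

# a full (correlated) 2x2 covariance matrix
cov = [[2.0, 0.8], [0.8, 1.0]]
omega = eigvecs_sym2x2(cov[0][0], cov[0][1], cov[1][1])
cov_y = transform_cov(omega, cov)
print(cov_y)  # off-diagonal entries vanish (up to rounding)
```

Since Ω is orthogonal, the transform preserves the trace (total variance); it only rotates the axes so the variance decouples per dimension.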
According to linear algebra theory, it can be seen that the diagonal Gaussian function in Y space is equivalent to the Gaussian function with a full covariance matrix in X space. Accordingly, a diagonal GMM in Y space provides a better approximation to the distribution of feature vectors than a diagonal GMM in X space. The GMM with this orthogonal transform is referred to herein as the orthogonal GMM. Based upon the observations described above, the present invention provides a method to model the correlation of feature vectors more accurately than diagonal GMM-based speaker verification systems do. First, a speaker independent model is trained using the EM algorithm. The eigenvectors of the covariance matrix for each mixture are calculated, and the linear transform matrix for each mixture in the speaker independent model is composed of these eigenvectors. Second, a speaker dependent model for each new speaker enrolled in the system is trained using the MAP method. In speaker dependent model training, the linear transform matrices computed for the speaker independent model are shared by the mixtures of the speaker dependent model that are adapted from the same mixture of the speaker independent model. Using this shared transformation, the covariance matrices in the transformed spaces are more nearly diagonal. Accordingly, the diagonal Gaussian functions in the transformed spaces provide a better approximation to the distribution of feature vectors. Furthermore, in MAP adaptation the covariance is usually adapted much less than the mean; some systems therefore use mean-only adaptation. Sharing the transformation matrices makes MAP adaptation more effective because each transformation matrix is computed from the corresponding covariance matrix.
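The mean-only MAP adaptation mentioned above can be sketched as a per-mixture interpolation between the speaker independent prior mean and the new speaker's training-data mean. This is a simplified sketch under assumed notation, not the patent's implementation; `tau` is a hypothetical relevance factor, and the inputs are assumed to already lie in the transformed spaces:

```python
def map_adapt_means(prior_means, counts, data_means, tau=10.0):
    """Mean-only MAP adaptation.

    For each mixture k, interpolate the speaker independent prior mean mu_k
    with the mean ybar_k of the enrollee's (transformed) training vectors,
    weighted by the soft frame count n_k assigned to that mixture:
        mu_hat_k = alpha_k * ybar_k + (1 - alpha_k) * mu_k,
        alpha_k = n_k / (n_k + tau).
    Mixtures that saw little enrollment data stay close to the prior."""
    adapted = []
    for mu, n, ybar in zip(prior_means, counts, data_means):
        alpha = n / (n + tau)
        adapted.append([alpha * y + (1 - alpha) * m for y, m in zip(ybar, mu)])
    return adapted

prior = [[0.0, 0.0], [5.0, 5.0]]
counts = [90.0, 10.0]            # soft frame counts per mixture
data = [[1.0, 1.0], [6.0, 4.0]]  # per-mixture means of enrollment vectors
print(map_adapt_means(prior, counts, data))
```

Because the shared transform matrices come from the speaker independent covariances, leaving the covariances unadapted (as here) is consistent with the sharing scheme described above.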
In the decoding phase, the feature vectors are first transformed to the spaces spanned by the eigenvectors of the covariance matrix of each corresponding speaker-independent mixture. The probability or similarity measure between the input speech and the speaker dependent model is then computed using the transformed vectors for each speaker in the set.
Figure 1 illustrates a block diagram of one embodiment of a speaker verification system 100 according to the teachings of the present invention. The system 100, as shown in Figure 1, includes an analog-to-digital converter (A/D) 110, a feature extractor or spectral analysis unit 120, a similarity measurement unit 130, a speaker dependent model or reference database 140, and a decision making unit 150. An input signal 101 representing a sample speech of a speaker whose claimed identity is to be verified by the system (also referred to as the test speaker herein) is first digitized using the A/D 110. The digital signal is sliced into frames of a suitable duration (e.g., 10, 15, or 20 ms). The digital signal is then converted or transformed into a set of feature vectors containing acoustic parameters that convey the identity characteristics of the test speaker. The feature vectors are then input to the similarity measurement unit 130, which computes a similarity measure between the identity of the test speaker, as represented by the feature vectors, and the claimed identity, which is represented by a previously constructed model stored in the speaker dependent model database 140. The decision-making unit 150 then compares the similarity measure computed by the similarity measurement unit 130 to a predetermined value or threshold and decides whether to accept or reject the claimed identity of the test speaker. As described above, in the present embodiment the test feature vectors are transformed using the corresponding linear transform matrices computed from a previously trained speaker independent GMM. The similarity measurement unit 130 computes the similarity measure based upon the transformed test feature vectors and a set of speaker dependent models stored in the database 140.
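The framing step performed after the A/D 110 can be sketched as follows. This is a simplified, non-overlapping version for illustration only; real front ends typically use overlapping windows and follow framing with spectral analysis:

```python
def frame_signal(samples, sample_rate, frame_ms=20):
    """Slice a digitized signal into fixed-length frames.

    frame_ms is the frame duration in milliseconds; any trailing samples
    that do not fill a whole frame are dropped in this sketch."""
    frame_len = int(sample_rate * frame_ms / 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

# 160 samples at 8 kHz with 10 ms frames -> two frames of 80 samples each
frames = frame_signal(list(range(160)), sample_rate=8000, frame_ms=10)
print(len(frames), len(frames[0]))  # prints "2 80"
```

Each resulting frame would then be passed to the spectral analysis unit 120 to produce one feature vector per frame.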
The parameters of the speaker dependent models are trained using transformed training feature vectors that are obtained by transforming the training feature vectors extracted from training speech samples using the corresponding linear transform matrices associated with the previously trained speaker independent GMM.
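As an illustrative sketch (not part of the claimed subject matter), the per-mixture linear transform matrices described above might be computed as follows: for each speaker-independent mixture, the transform's rows are the eigenvectors of that mixture's covariance matrix. The function name and array layout are assumptions made for this sketch.

```python
import numpy as np

def compute_transform_matrices(covariances):
    """Build one linear transform per speaker-independent mixture.

    Each transform's rows are the eigenvectors of that mixture's
    covariance matrix, so y = A @ x projects a feature vector onto the
    eigenvector directions and the covariance of y becomes diagonal.
    """
    transforms = []
    for cov in covariances:
        # eigh returns eigenvectors as columns (ascending eigenvalues);
        # transpose so each row of A is one eigenvector
        _, eigvecs = np.linalg.eigh(cov)
        transforms.append(eigvecs.T)
    return transforms
```

Because `eigh` yields an orthonormal basis, each transform is orthogonal (A @ A.T = I), which is what makes the resulting speaker model an "orthogonal" GMM in the transformed spaces.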
Figure 2 shows a flow diagram of a method 200 for performing speaker verification according to the teachings of the present invention. The method starts at block 201 and proceeds to block 210. At block 210, a test signal representing a test speech is converted into a set of test feature vectors. The test speech is provided by a test speaker who claims a particular identity. At block 220, the test feature vectors are transformed to the spaces spanned by the eigenvectors of the covariance matrix of each corresponding speaker-independent mixture by using the corresponding linear transform matrices that were previously computed for the speaker independent GMM. At block 230, a determination is made as to whether to accept or reject the claimed identity of the test speaker based upon the transformed test feature vectors that represent the identity of the test speaker and a previously trained speaker dependent model that represents the claimed identity. As described above, the speaker dependent model is represented by a speaker dependent GMM that was constructed using the corresponding linear transform matrices associated with the speaker independent model.
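The similarity computation at block 230 can be sketched as the average per-frame log-likelihood of the test vectors under a speaker dependent GMM whose covariances are diagonal in the transformed spaces. This is a hedged sketch rather than the patent's exact implementation; the function name and the (T, d)/(M, d) array conventions are assumptions.

```python
import numpy as np

def avg_log_likelihood(frames, weights, means, variances, transforms):
    """Average per-frame log-likelihood of an utterance under a GMM whose
    m-th mixture is parameterized in the space spanned by the m-th
    speaker-independent transform (diagonal covariance there).

    frames: (T, d) raw feature vectors; weights: (M,);
    means, variances: (M, d) given in the transformed spaces.
    """
    T, d = frames.shape
    M = len(weights)
    per_mix = np.empty((T, M))
    for m in range(M):
        y = frames @ transforms[m].T              # project into m-th eigenspace
        diff2 = (y - means[m]) ** 2 / variances[m]
        log_norm = -0.5 * (d * np.log(2.0 * np.pi) + np.log(variances[m]).sum())
        per_mix[:, m] = np.log(weights[m]) + log_norm - 0.5 * diff2.sum(axis=1)
    # stable log-sum-exp over mixtures, then average over frames
    mx = per_mix.max(axis=1, keepdims=True)
    ll = mx[:, 0] + np.log(np.exp(per_mix - mx).sum(axis=1))
    return ll.mean()
```

Note that each frame is projected separately per mixture, since the mixtures adapted from different speaker-independent mixtures live in different eigenspaces.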
In general, a likelihood ratio test is used to determine whether to accept or reject the claimed speaker. The likelihood ratio test is well known in the art. Given an utterance or sample speech X from a speaker who claims to be a particular speaker Y having a corresponding model M_y, the likelihood ratio is:

R = P(X is from the claimed speaker) / P(X is not from the claimed speaker) = P(M_y | X) / P(M_ȳ | X)

where M_ȳ denotes the alternative model representing speakers other than the claimed speaker.
The likelihood ratio in the log domain is shown below:

R(X) = log p(X | M_y) − log p(X | M_ȳ)

where p(X | M_y) is the probability or likelihood of the speech being generated by the claimed speaker and p(X | M_ȳ) is the probability or likelihood of the speech given that it is not from the claimed speaker. The likelihood ratio is then compared to a threshold value θ. The claimed speaker is accepted if the likelihood ratio exceeds the threshold value and is rejected if it is less than the threshold value. The decision threshold value can be set to adjust the tradeoff between false rejection errors and false acceptance errors. In the present embodiment, a cohort normalization method is used to compute the likelihood ratio. Instead of computing the likelihood ratio using the average score of the entire set of speaker dependent models, only the average score of the subset having the highest scores (excluding the score of the claimed identity model) is used. For example, assuming that there are 100 speaker dependent models representing a set of 100 speakers enrolled in the system (one model for each speaker in the set), during the verification phase the probability of the input speech (as represented by the transformed feature vectors described above) being generated by each model (including the model that represents the claimed identity) is computed. Then the average score of a predetermined number of the top scores (e.g., the top 10 scores), excluding the score of the claimed identity, is computed. This average score is then compared with the score of the claimed identity to generate the likelihood ratio. The likelihood ratio obtained is then compared with a predetermined threshold value to determine whether to accept or reject the claimed identity. Figure 3 illustrates a flow diagram of one embodiment of a method 300 according to the teachings of the present invention.
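The cohort normalization described above can be sketched as follows. The function names and the dictionary-based score interface are assumptions for the sketch; scores would come from evaluating each enrolled speaker's model on the test utterance.

```python
def cohort_likelihood_ratio(scores, claimed_id, cohort_size=10):
    """Log-likelihood ratio with cohort normalization.

    scores maps speaker id -> average per-frame log-likelihood of the
    test utterance under that speaker's model.  The ratio is the claimed
    speaker's score minus the mean of the top-N scores among the other
    enrolled speakers (the cohort).
    """
    claimed = scores[claimed_id]
    others = sorted((s for sid, s in scores.items() if sid != claimed_id),
                    reverse=True)
    cohort = others[:cohort_size]
    return claimed - sum(cohort) / len(cohort)

def verify(scores, claimed_id, threshold, cohort_size=10):
    """Accept the claimed identity when the normalized ratio meets the threshold."""
    return cohort_likelihood_ratio(scores, claimed_id, cohort_size) >= threshold
```

Using only the top-scoring cohort, rather than all enrolled speakers, approximates the alternative model with the impostors most likely to be confused with the claimant.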
The method 300 starts at block 301 and proceeds to block 305 to perform speaker independent model training. At block 309, a speaker independent GMM having M mixtures is trained using the expectation-maximization (EM) technique. At block 311, the linear transform matrix for each mixture of the speaker independent GMM is computed. In one embodiment, this is done by calculating the eigenvectors of the covariance matrix for each mixture. The linear transform matrix for each mixture is composed of the corresponding eigenvectors. The method 300 proceeds to block 313 to perform speaker dependent model training (enrolling new speakers). At block 317, the feature vectors of the training speech provided by a speaker being enrolled are transformed to the spaces spanned by the corresponding linear transform matrices that were computed previously based on the mixtures of the speaker independent model. As described above, according to the teachings of the present invention, the linear transform matrices computed from the speaker independent model are shared by the mixtures that are adapted from the same mixture of the speaker independent model. Through this shared transformation, the covariance matrices in the transformed spaces are made diagonal. At block 321, the parameters of each mixture of the speaker dependent GMM are trained in the transformed spaces using the MAP algorithm. The process of speaker dependent model training is performed for each speaker enrolled in the system. The method 300 proceeds to block 331 to perform the speaker verification task. At block 335, the feature vectors extracted from a test speech of a speaker who claims a particular identity are transformed to the corresponding spaces by the corresponding transform matrices. In other words, these feature vectors are transformed to the spaces spanned by the eigenvectors of the covariance matrix of each corresponding speaker independent mixture.
At block 339, the probabilities of the feature vectors with respect to the speaker dependent models are calculated in the corresponding spaces to obtain verification results (i.e., whether to accept or reject the claimed identity).
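The enrollment step at block 321, MAP adaptation of the mixture parameters in the transformed spaces, might look roughly like the following mean-only sketch. The relevance factor r and the simplification of keeping weights and variances fixed are assumptions of this sketch, not requirements of the method described above.

```python
import numpy as np

def map_adapt_means(frames, weights, means, variances, transforms, r=16.0):
    """One pass of MAP adaptation of the mixture means, carried out in the
    transformed (decorrelated) spaces.

    Starting from the speaker-independent parameters, each mean is pulled
    toward the enrollment data in proportion to how much data the mixture
    attracts; r is the MAP relevance factor.
    """
    T = frames.shape[0]
    M = means.shape[0]
    proj = [frames @ transforms[m].T for m in range(M)]
    # mixture posteriors per frame (constant terms cancel after normalization)
    log_post = np.empty((T, M))
    for m in range(M):
        diff2 = (proj[m] - means[m]) ** 2 / variances[m]
        log_post[:, m] = (np.log(weights[m])
                          - 0.5 * (np.log(variances[m]).sum() + diff2.sum(axis=1)))
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    new_means = means.copy()
    for m in range(M):
        n_m = post[:, m].sum()
        if n_m > 0:
            e_m = (post[:, m][:, None] * proj[m]).sum(axis=0) / n_m
            alpha = n_m / (n_m + r)                # data-dependent interpolation
            new_means[m] = alpha * e_m + (1.0 - alpha) * means[m]
    return new_means
```

Mixtures that see little enrollment data stay close to the speaker-independent prior, while well-observed mixtures move toward the speaker's own statistics.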
The invention has been described in conjunction with the preferred embodiment. It is evident that numerous alternatives, modifications, variations and uses will be apparent to those skilled in the art in light of the foregoing description.

Claims

What is claimed is:
1. A method comprising: converting a test signal representing a test speech into a set of test feature vectors representing identity of a test speaker who claims a particular identity; transforming the test feature vectors using corresponding linear transform matrices associated with a speaker independent Gaussian mixture model (SIGMM); and determining whether to accept or reject the claimed identity of the test speaker based upon the transformed test feature vectors representing the identity of the test speaker and a speaker dependent model representing the claimed identity, the speaker dependent model being represented by a speaker dependent Gaussian mixture model (SDGMM) which is constructed using linear transform matrices associated with corresponding mixtures in the speaker independent model.
2. The method of claim 1 wherein training feature vectors used to construct the SDGMM are first transformed by the corresponding linear transform matrices associated with the speaker independent model and wherein parameters of each mixture of the respective SDGMM are trained based upon the transformed training feature vectors.
3. The method of claim 2 wherein the speaker independent GMM is constructed using speech samples from a training corpus.
4. The method of claim 3 wherein a linear transform matrix is computed for each mixture of the speaker independent GMM, said linear transform matrices are utilized for the training of the speaker dependent model.
5. The method of claim 4 wherein the speaker independent model is trained using the expectation-maximization (EM) method.
6. The method of claim 5 wherein the speaker dependent model is trained using the maximum a posteriori (MAP) adaptation method.
7. The method of claim 6 wherein linear transform matrices computed for the speaker independent model are shared by the speaker dependent model mixtures that are adapted from the same mixtures of the speaker independent model.
8. A method comprising: training a speaker independent model for a speaker verification system using speech samples from a speech training corpus, the speaker independent model being represented by a Gaussian mixture model (GMM), the respective GMM comprising a plurality of Gaussian mixtures, each Gaussian mixture being parameterized by a corresponding mean vector and a covariance matrix; computing a linear transform matrix for each mixture based on the eigenvectors of the corresponding covariance matrix; and training a speaker dependent model for each speaker in a set of speakers using the linear transform matrices computed previously for the speaker independent model.
9. The method of claim 8 wherein the speaker independent model is trained using the Expectation-Maximization (EM) method.
10. The method of claim 8 wherein the speaker dependent model is trained using the maximum a posteriori (MAP) adaptation method.
11. The method of claim 10 wherein the linear transform matrices computed for the speaker independent model are shared by mixtures adapted from the same mixture of the speaker independent model.
12. The method of claim 11 wherein training the speaker dependent model comprises: transforming feature vectors extracted from speaker dependent training speech to corresponding spaces by the corresponding linear transform matrices; and training the parameters of each mixture in the corresponding spaces using the MAP method.
13. The method of claim 12 further comprising: performing speaker verification comprising: receiving a test speech from a current speaker who claims to be a particular speaker; converting the test speech to a set of test feature vectors; calculating a similarity measure based upon the current speaker's identity information represented by the test feature vectors and the particular speaker's identity information represented by a speaker dependent training model associated with the particular speaker; and deciding whether to accept or reject the current speaker as the particular speaker based upon the calculated similarity measure and a threshold.
14. The method of claim 13 wherein calculating the similarity measure comprises: transforming the test feature vectors to the spaces spanned by the eigenvectors of the covariance matrix of each corresponding speaker independent mixture using the corresponding linear transform matrices; and computing the probability of the test feature vectors in the corresponding spaces for each speaker based on a speaker dependent model associated with each respective speaker.
15. A system comprising: a speaker dependent model database comprising a set of speaker dependent models each representing the identity of a particular speaker, each speaker dependent model being constructed from a set of training feature vectors associated with the respective speaker and being represented as an orthogonal Gaussian mixture model (OGMM); a feature extraction unit to convert a digital signal representing a test speech into a set of test feature vectors, the test speech being spoken by a test speaker who claims to be a particular speaker whose identity is represented by a corresponding OGMM in the speaker dependent model database; and a verification unit coupled to the speaker dependent model database and the feature extraction unit, the verification unit to determine whether to accept or reject the claimed identity of the test speaker based on the test feature vectors representing the identity information of the test speaker and the OGMMs in the speaker dependent model database.
16. The system of claim 15 wherein a speaker independent model is utilized to train the speaker dependent model, the speaker independent model being constructed based upon a training corpus, the speaker independent model being represented by a Gaussian mixture model (GMM) which corresponds to the distribution of training feature vectors extracted from training speech samples from the training corpus.
17. The system of claim 16 wherein the speaker independent model is trained using the expectation-maximization (EM) method.
18. The system of claim 17 wherein the speaker dependent model is trained using the maximum a posteriori (MAP) adaptation method.
19. The system of claim 18 wherein a linear transform matrix is computed for each mixture in the speaker independent model, the linear transform matrices computed for the mixtures in the speaker independent model are shared by corresponding mixtures in the speaker dependent model.
20. The system of claim 19 wherein the linear transform matrices are used to transform training feature vectors associated with the speaker dependent model to provide a better approximation of the distribution of the training feature vectors.
21. The system of claim 20 wherein the training feature vectors for the speaker dependent model are transformed to corresponding spaces by the corresponding linear transform matrices and the parameters of each mixture of the speaker dependent model are trained using the transformed training feature vectors.
22. The system of claim 20 wherein test feature vectors representing the identity information of the test speaker are transformed using the corresponding linear transform matrices, the transformed test feature vectors being used by the verification unit to determine whether to accept or reject the claimed identity of the test speaker.
23. A machine-readable medium comprising instructions which, when executed by a machine, cause the machine to perform operations comprising: converting a test signal representing a test speech into a set of test feature vectors representing identity of a test speaker who claims a particular identity; transforming the test feature vectors using corresponding linear transform matrices associated with a speaker independent Gaussian mixture model (SIGMM); and determining whether to accept or reject the claimed identity of the test speaker based upon the transformed test feature vectors representing the identity of the test speaker and a speaker dependent model representing the claimed identity, the speaker dependent model being represented by a speaker dependent Gaussian mixture model (SDGMM) which is constructed using linear transform matrices associated with corresponding mixtures in the speaker independent model.
24. The machine-readable medium of claim 23 wherein training feature vectors used to construct the SDGMM are first transformed by the corresponding linear transform matrices associated with the speaker independent model and wherein parameters of each mixture of the respective SDGMM are trained based upon the transformed training feature vectors.
25. The machine-readable medium of claim 24 wherein a linear transform matrix is computed for each mixture of the speaker independent GMM, said linear transform matrices are utilized for the training of the speaker dependent model.
26. The machine-readable medium of claim 25 wherein linear transform matrices computed for the speaker independent model are shared by the speaker dependent model mixtures that are adapted from the same mixtures of the speaker independent model.
27. A system comprising: means for storing a set of speaker dependent models each representing the identity of a particular speaker, each speaker dependent model being constructed from a set of training feature vectors associated with the respective speaker and being represented as an orthogonal Gaussian mixture model (OGMM); means for converting a digital signal representing a test speech into a set of test feature vectors, the test speech being spoken by a test speaker who claims to be a particular speaker whose identity is represented by a corresponding OGMM in the speaker dependent model database; and means for determining whether to accept or reject the claimed identity of the test speaker based on the test feature vectors representing the identity information of the test speaker and the OGMMs in the speaker dependent model database.
28. The system of claim 27 wherein a speaker independent model is utilized to train the speaker dependent model, the speaker independent model being constructed based upon a training corpus, the speaker independent model being represented by a Gaussian mixture model (GMM) which corresponds to the distribution of training feature vectors extracted from training speech samples from the training corpus.
29. The system of claim 28 wherein a linear transform matrix is computed for each mixture in the speaker independent model, the linear transform matrices computed for the mixtures in the speaker independent model are shared by corresponding mixtures in the speaker dependent model.
30. The system of claim 29 wherein the training feature vectors for the speaker dependent model are transformed to corresponding spaces by the corresponding linear transform matrices and the parameters of each mixture of the speaker dependent model are trained using the transformed training feature vectors.
PCT/CN2000/000303 2000-09-30 2000-09-30 Method, apparatus, and system for speaker verification based on orthogonal gaussian mixture model (gmm) Ceased WO2002029785A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2000/000303 WO2002029785A1 (en) 2000-09-30 2000-09-30 Method, apparatus, and system for speaker verification based on orthogonal gaussian mixture model (gmm)
AU2000276401A AU2000276401A1 (en) 2000-09-30 2000-09-30 Method, apparatus, and system for speaker verification based on orthogonal gaussian mixture model (gmm)


Publications (1)

Publication Number Publication Date
WO2002029785A1 true WO2002029785A1 (en) 2002-04-11

Family

ID=4574716

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2000/000303 Ceased WO2002029785A1 (en) 2000-09-30 2000-09-30 Method, apparatus, and system for speaker verification based on orthogonal gaussian mixture model (gmm)

Country Status (2)

Country Link
AU (1) AU2000276401A1 (en)
WO (1) WO2002029785A1 (en)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5555320A (en) * 1992-11-27 1996-09-10 Kabushiki Kaisha Toshiba Pattern recognition system with improved recognition rate using nonlinear transformation
WO1999023643A1 (en) * 1997-11-03 1999-05-14 T-Netix, Inc. Model adaptation system and method for speaker verification


Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005055200A1 (en) * 2003-12-05 2005-06-16 Queensland University Of Technology Model adaptation system and method for speaker recognition
US10257191B2 (en) 2008-11-28 2019-04-09 Nottingham Trent University Biometric identity verification
GB2465782A (en) * 2008-11-28 2010-06-02 Univ Nottingham Trent Biometric identity verification utilising a trained statistical classifier, e.g. a neural network
US9311546B2 (en) 2008-11-28 2016-04-12 Nottingham Trent University Biometric identity verification for access control using a trained statistical classifier
GB2465782B (en) * 2008-11-28 2016-04-13 Univ Nottingham Trent Biometric identity verification
US9177557B2 (en) * 2009-07-07 2015-11-03 General Motors Llc. Singular value decomposition for improved voice recognition in presence of multi-talker background noise
US20110010171A1 (en) * 2009-07-07 2011-01-13 General Motors Corporation Singular Value Decomposition for Improved Voice Recognition in Presence of Multi-Talker Background Noise
US8433567B2 (en) 2010-04-08 2013-04-30 International Business Machines Corporation Compensation of intra-speaker variability in speaker diarization
CN102237089A (en) * 2011-08-15 2011-11-09 哈尔滨工业大学 Method for reducing error identification rate of text irrelevant speaker identification system
WO2017045429A1 (en) * 2015-09-18 2017-03-23 广州酷狗计算机科技有限公司 Audio data detection method and system and storage medium
CN111027453A (en) * 2019-12-06 2020-04-17 西北工业大学 Automatic non-cooperative underwater target identification method based on Gaussian mixture model
US11611581B2 (en) 2020-08-26 2023-03-21 ID R&D, Inc. Methods and devices for detecting a spoofing attack
CN116665680A (en) * 2023-04-28 2023-08-29 王力安防科技股份有限公司 Voiceprint recognition method, device, terminal and storage medium
CN119441902A (en) * 2024-10-18 2025-02-14 华中科技大学 An unsupervised dimension reduction method for online monitoring of voltage transformer secondary circuit anomalies
CN119441902B (en) * 2024-10-18 2025-10-17 华中科技大学 Unsupervised dimension-reduction method for online monitoring of voltage transformer secondary circuit anomalies

Also Published As

Publication number Publication date
AU2000276401A1 (en) 2002-04-15

Similar Documents

Publication Publication Date Title
EP0870300B1 (en) Speaker verification system
US6539352B1 (en) Subword-based speaker verification with multiple-classifier score fusion weight and threshold adaptation
EP0744734B1 (en) Speaker verification method and apparatus using mixture decomposition discrimination
US9646614B2 (en) Fast, language-independent method for user authentication by voice
US6519561B1 (en) Model adaptation of neural tree networks and other fused models for speaker verification
US6697778B1 (en) Speaker verification and speaker identification based on a priori knowledge
US8099288B2 (en) Text-dependent speaker verification
US6401063B1 (en) Method and apparatus for use in speaker verification
CN101465123B (en) Verification method and device for speaker authentication and speaker authentication system
US6233555B1 (en) Method and apparatus for speaker identification using mixture discriminant analysis to develop speaker models
JPH09127972A (en) Vocalization discrimination and verification for recognition of linked numeral
Angkititrakul et al. Discriminative in-set/out-of-set speaker recognition
Woodward et al. Confidence Measures in Encoder-Decoder Models for Speech Recognition.
WO2002029785A1 (en) Method, apparatus, and system for speaker verification based on orthogonal gaussian mixture model (gmm)
Ozaydin Design of a text independent speaker recognition system
Ilyas et al. Speaker verification using vector quantization and hidden Markov model
EP1178467B1 (en) Speaker verification and identification
Kadhim et al. Enhancement and modification of automatic speaker verification by utilizing hidden Markov model
Olsson Text dependent speaker verification with a hybrid HMM/ANN system
Fierrez-Aguilar et al. Speaker verification using adapted user-dependent multilevel fusion
Dustor Voice verification based on nonlinear Ho-Kashyap classifier
Li et al. Evaluation of the i-vector system for text-dependent speaker verification
Zhou et al. Novel discriminative vector quantization approach for speaker identification
Suh et al. Filling acoustic holes through leveraged uncorellated GMMs for in-set/out-of-set speaker recognition.
Chao Verbal Information Verification for High-performance Speaker Authentication

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP