
CN102238190B - Identity authentication method and system - Google Patents


Info

Publication number: CN102238190B
Application number: CN2011102180452A
Authority: CN (China)
Prior art keywords: likelihood, model, voiceprint feature, gaussian, voiceprint
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Other versions: CN102238190A (en)
Inventors: 潘逸倩, 胡国平, 何婷婷, 魏思, 胡郁, 王智国, 刘庆峰
Current and original assignee: iFlytek Co Ltd
Application filed by iFlytek Co Ltd; priority to CN2011102180452A (CN102238190B/en)
Publication of CN102238190A; application granted; publication of CN102238190B

Landscapes

  • Collating Specific Patterns (AREA)

Abstract

The invention discloses an identity authentication method and an identity authentication system. The method comprises the following steps: when a user logs in, receiving a continuous voice signal recorded by the currently logged-in user; extracting a voiceprint feature sequence from the continuous voice signal; calculating the likelihood between the voiceprint feature sequence and a background model; calculating the likelihood between the voiceprint feature sequence and a speaker model of the current login user, wherein the speaker model is a multiple-mixture Gaussian model constructed according to the repetition times and frame counts of the registration voice signals recorded when the current login user registered; calculating a likelihood ratio from the likelihood between the voiceprint feature sequence and the speaker model and the likelihood between the voiceprint feature sequence and the background model; and if the likelihood ratio is greater than a preset threshold, determining that the current login user is a validly authenticated user, otherwise determining that the current login user is a non-authenticated user. The method and system improve the accuracy of voiceprint-password-based identity authentication.

Description

Identity authentication method and system
Technical Field
The invention relates to the technical field of identity recognition, in particular to an identity authentication method and an identity authentication system.
Background
Voiceprint recognition (VPR), also known as speaker recognition, falls into two categories: speaker identification and speaker verification. The former determines which of several people spoke a given segment of speech, a "one-of-many" selection problem; the latter confirms whether a given utterance was spoken by a specific person, a "one-to-one" decision problem. Different tasks and applications use different voiceprint recognition techniques.
Voiceprint authentication identifies a speaker from collected voice signals and belongs to the one-to-one decision problem. Mainstream voiceprint authentication systems adopt a hypothesis-testing framework: the likelihoods of the voiceprint signal with respect to a speaker model and a background model are computed separately, and their likelihood ratio is compared against a threshold set empirically in advance. Clearly, the accuracy of the background model and the speaker model directly affects the authentication result, and in general the larger the training data volume, the better the models.
Voiceprint password authentication is a text-dependent method of verifying a speaker's identity. The user speaks a fixed password text, and the speaker's identity is confirmed accordingly. Because both user registration and identity authentication use speech of the same fixed password text, the voiceprints at registration and at authentication remain consistent, so the method achieves better authentication performance than text-independent speaker verification.
The most popular technical route for voiceprint password authentication is the GMM-UBM algorithm: Gaussian Mixture Models (GMMs) are used to model both the Universal Background Model (UBM) and the speaker model. The UBM describes the commonality of speakers' voiceprints. Because each speaker's voiceprint also has its own specificity, a UBM trained on data from many speakers needs a complex model structure to fit such scattered data; in practice, a GMM with 1024 or more Gaussians is usually chosen.
The speaker model is trained online by the system from the enrollment speech when the user registers. Because enrollment samples are often limited, directly training a complex model on them yields an inaccurate model due to data sparsity. The prior art therefore usually takes the background model as the initial model and adjusts some of its parameters from the small amount of speaker data through adaptive methods, such as the widely used adaptation based on Maximum A Posteriori (MAP) probability, so as to adapt the common voiceprint model to the personality of the current speaker.
Under such adaptive updating, the Gaussians of the speaker's mixture model correspond one-to-one with those of the universal background model, so the speaker model carries too many parameters. In a voiceprint password authentication system with little enrollment data, this easily causes the following problems:
1. Model redundancy: the speaker model in a voiceprint password authentication system is trained from sample data obtained by repeating the voice password several times at registration. With so little sample data, the adaptive algorithm updates only part of the Gaussians in the initial background model, and many Gaussian components remain nearly identical to the background model. These redundant parameters increase storage and computation pressure and in turn degrade decoding efficiency.
2. Large training cost: the adaptive algorithm must compute sample statistics for each of the 1024 or more Gaussians of the initial background model and update their parameters.
3. Inaccurate variance: because the variance of the speaker model is difficult to re-estimate, the adaptive algorithm often reuses the variance of the background model directly. The background model simulates common voiceprints from the training data of many speakers, so its probability distributions tend to have large variances, whereas the variance of a speaker model should capture that speaker's specific voiceprint characteristics. Directly reusing the background variance fails to reflect the speaker model's characteristics and reduces the discriminability between different speaker models, which harms recognition accuracy.
Disclosure of Invention
The embodiment of the invention provides an identity authentication method and system, which aim to improve the accuracy of identity authentication based on a voiceprint password.
An embodiment of the present invention provides an identity authentication method, including:
when a user logs in, receiving a continuous voice signal recorded by the currently logged-in user;
extracting a voiceprint feature sequence from the continuous voice signal, wherein the voiceprint feature sequence comprises a group of voiceprint features;
calculating the likelihood of the voiceprint feature sequence and a background model;
calculating the likelihood of the voiceprint feature sequence and a speaker model of the current login user, wherein the speaker model is a multiple-mixture Gaussian model constructed according to the repetition times and frame counts of the registration voice signals recorded when the current login user registered;
calculating a likelihood ratio according to the likelihood of the voiceprint feature sequence and the speaker model and the likelihood of the voiceprint feature sequence and the background model;
and if the likelihood ratio is greater than a set threshold, determining that the current login user is a validly authenticated user, otherwise determining that the current login user is a non-authenticated user.
Another aspect of the embodiments of the present invention provides an identity authentication system, including:
the voice signal receiving unit is used for receiving continuous voice signals recorded by the currently logged-in user when the user logs in;
the extracting unit is used for extracting a voiceprint feature sequence from the continuous voice signal, the voiceprint feature sequence comprising a group of voiceprint features;
the first calculation unit is used for calculating the likelihood of the voiceprint feature sequence and a background model;
the second calculation unit is used for calculating the likelihood of the voiceprint feature sequence and a speaker model of the current login user, wherein the speaker model is a multiple-mixture Gaussian model constructed according to the repetition times and frame counts of the registration voice signals entered when the current login user registered;
the third calculating unit is used for calculating a likelihood ratio according to the likelihood of the voiceprint feature sequence and the speaker model and the likelihood of the voiceprint feature sequence and the background model;
and the judging unit is used for determining that the current login user is a validly authenticated user when the likelihood ratio calculated by the third calculating unit is greater than a set threshold, and otherwise determining that the current login user is a non-authenticated user.
According to the identity authentication method and system provided by the embodiments of the invention, the likelihoods of the voiceprint feature sequence extracted from the continuous voice signal recorded by the current login user with respect to that user's speaker model and to the background model are calculated respectively; a likelihood ratio is then computed, and whether the current login user is a validly authenticated user is determined from the resulting ratio. In this scheme, the speaker model is a multiple-mixture Gaussian model constructed from the voice signals entered when the current login user registered, so it can simulate the different pronunciation variations of the user speaking the same voice signal (i.e., the password), improving the accuracy of voiceprint-password-based identity authentication.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart of a method of identity authentication according to an embodiment of the present invention;
FIG. 2 is a flow chart of a background model parameter training process according to an embodiment of the present invention;
FIG. 3 is a flow chart of a conventional speaker model constructed using an adaptive algorithm;
FIG. 4 is a flow chart of constructing a speaker model in an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an identity authentication system according to an embodiment of the present invention;
fig. 6 is another schematic structural diagram of the identity authentication system according to the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, it is a flowchart of an identity authentication method according to an embodiment of the present invention, including the following steps:
step 101, when a user logs in, receiving a continuous voice signal recorded by a currently logged-in user.
Step 102, extracting a voiceprint feature sequence from the continuous voice signal.
The voiceprint feature sequence comprises a group of voiceprint features chosen so that different speakers can be effectively distinguished while the variation of the same speaker remains relatively stable.
Commonly used voiceprint features include spectral envelope parameters, pitch contour, formant frequency and bandwidth, linear prediction coefficients, cepstral coefficients, and the like. Considering feature quantization, the number of training samples, and system performance evaluation, Mel Frequency Cepstral Coefficient (MFCC) features may be selected: each frame of voice data, with a window length of 25 ms and a frame shift of 10 ms, undergoes short-time analysis to obtain the MFCC parameters and their first- and second-order differences, 39 dimensions in total. Each spoken sentence can thus be quantized into a sequence X of 39-dimensional voiceprint feature vectors.
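For illustration only (this is not part of the patent text), the 39-dimensional feature described above can be computed with an off-the-shelf library. The sketch below assumes 16 kHz audio and the librosa package, so the 25 ms window and 10 ms shift become 400 and 160 samples:

```python
import librosa
import numpy as np

def extract_voiceprint_features(wav_path):
    """39-dim MFCC features: 13 MFCCs plus first- and second-order differences.

    Window length 25 ms, frame shift 10 ms, as described above.
    The 16 kHz sampling rate is an assumption for illustration.
    """
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=400, hop_length=160)   # 25 ms window / 10 ms shift
    delta1 = librosa.feature.delta(mfcc)                     # first-order difference
    delta2 = librosa.feature.delta(mfcc, order=2)            # second-order difference
    return np.vstack([mfcc, delta1, delta2]).T               # sequence X, shape (T, 39)
```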
Step 103, calculating the likelihood of the voiceprint feature sequence and the background model.
The likelihood of the voiceprint feature vector sequence X with frame count T with respect to the background model (UBM) is:

$$p(X\,|\,\mathrm{UBM}) = \frac{1}{T}\sum_{t=1}^{T}\sum_{m=1}^{M} c_m\, N(X_t;\,\mu_m,\Sigma_m) \qquad (1)$$

where $c_m$ is the weighting coefficient of the m-th Gaussian, satisfying $\sum_{m=1}^{M} c_m = 1$, and $\mu_m$ and $\Sigma_m$ are respectively the mean and variance of the m-th Gaussian. $N(\cdot)$ is the normal density used to compute the likelihood of the voiceprint feature vector $X_t$ at time t on a single Gaussian component:

$$N(X_t;\,\mu_m,\Sigma_m) = \frac{1}{\sqrt{(2\pi)^n\,|\Sigma_m|}}\; e^{-\frac{1}{2}(X_t-\mu_m)^{\top}\Sigma_m^{-1}(X_t-\mu_m)} \qquad (2)$$
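As a hedged illustration, formulas (1) and (2) transcribe directly into NumPy; the sketch below assumes diagonal covariance matrices (common for UBMs) and is not code from the patent. In practice log-likelihoods would be used to avoid numerical underflow.

```python
import numpy as np

def gaussian_likelihood(X, mu, var):
    """Formula (2) with diagonal covariance: N(X_t; mu_m, Sigma_m) per frame.

    X: (T, n) frames; mu: (n,) mean; var: (n,) diagonal of Sigma_m.
    """
    n = X.shape[1]
    norm = np.sqrt((2 * np.pi) ** n * np.prod(var))
    diff = X - mu
    return np.exp(-0.5 * np.sum(diff * diff / var, axis=1)) / norm

def ubm_likelihood(X, weights, means, variances):
    """Formula (1): time average over frames of the mixture likelihood.

    weights: (M,); means: (M, n); variances: (M, n).
    """
    per_frame = np.zeros(X.shape[0])
    for c, mu, var in zip(weights, means, variances):
        per_frame += c * gaussian_likelihood(X, mu, var)
    return per_frame.mean()
```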
and 104, calculating the likelihood of the voiceprint characteristic sequence and a speaker model of the current login user, wherein the speaker model is a multi-mixing Gaussian model constructed according to the repetition times and the frame number of the registered voice signals recorded when the current login user registers.
Because the speaker model is a multi-mixture gaussian model constructed according to the voice signal input during the registration of the current login user, in this step, when the likelihood of the voiceprint feature sequence and the speaker model of the current login user is calculated, the likelihood of each voiceprint feature in the voiceprint feature sequence and each mixture gaussian model needs to be calculated respectively; and then determining the likelihood of the voiceprint feature and the speaker model of the current login user according to all the calculated likelihoods. Specifically, there may be various implementations, such as:
1. and firstly, respectively calculating the likelihood of the voiceprint feature sequence and each Gaussian mixture model, and then determining the likelihood of the voiceprint feature sequence and the speaker model of the current login user according to the calculation result.
In this way, the likelihood of each voiceprint feature in the sequence of voiceprint features and each mixture gaussian model in the multiple mixture gaussian models can be calculated separately; and selecting the time average value of the sum of the likelihood degrees calculated by a mixed Gaussian model corresponding to a group of voiceprint features in the voiceprint feature sequence as the likelihood degree of the voiceprint feature sequence and the mixed Gaussian model.
After the likelihood of the voiceprint feature sequence and each Gaussian mixture model is obtained, one maximum value or the average value can be selected as the likelihood of the voiceprint feature sequence and the speaker model of the current login user.
2. And firstly, respectively calculating the likelihood of each voiceprint feature in the voiceprint feature sequence relative to the multi-Gaussian mixture model, and then determining the likelihood of the voiceprint feature sequence and the speaker model of the current login user according to the calculation result.
In this way, the likelihood of each voiceprint feature in the sequence of voiceprint features and each mixture gaussian model in the multiple mixture gaussian models can be calculated separately; selecting the maximum value of the likelihood degrees calculated by the voiceprint feature in the voiceprint feature sequence corresponding to each mixed Gaussian model in the multiple mixed Gaussian models as the likelihood degree of the voiceprint feature and the multiple mixed Gaussian models; or selecting the average value of all likelihood degrees calculated by a voiceprint feature in the voiceprint feature sequence corresponding to each mixed Gaussian model in the multiple mixed Gaussian models as the likelihood degree of the voiceprint feature and the multiple mixed Gaussian models.
And after the likelihood of each voiceprint feature in the voiceprint features and the multi-Gaussian mixture model is obtained, selecting the sum time average of the likelihood of all the voiceprint features of the voiceprint feature sequence as the likelihood of the voiceprint feature sequence and the speaker model of the current login user.
Of course, there may be other selection manners, such as performing a weighted average on all the calculated likelihoods, and the embodiment of the present invention is not limited thereto.
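The sketch below makes the two strategies concrete; every mixture Gaussian model of the speaker model has the same functional form as formula (1), so the ubm_likelihood() and gaussian_likelihood() helpers from the earlier sketch are assumed to be available. Function and parameter names are illustrative.

```python
import numpy as np

def speaker_likelihood(X, speaker_gmms, combine="max"):
    """Strategy 1: score the whole sequence X against each mixture Gaussian
    model, then take the max (or average) over models.

    speaker_gmms: list of (weights, means, variances), one per mixture model.
    """
    scores = [ubm_likelihood(X, w, m, v) for (w, m, v) in speaker_gmms]
    return max(scores) if combine == "max" else sum(scores) / len(scores)

def speaker_likelihood_framewise(X, speaker_gmms):
    """Strategy 2: for each frame take the max likelihood over the mixture
    models, then average those per-frame maxima over time."""
    per_model = np.stack([
        sum(c * gaussian_likelihood(X, mu, var)
            for c, mu, var in zip(w, m, v))
        for (w, m, v) in speaker_gmms
    ])                                    # shape (num_models, T)
    return per_model.max(axis=0).mean()
```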
Step 105, calculating a likelihood ratio according to the likelihood of the voiceprint feature sequence and the speaker model and the likelihood of the voiceprint feature sequence and the background model.
The likelihood ratio is:

$$p = \frac{p(X\,|\,U)}{p(X\,|\,\mathrm{UBM})} \qquad (3)$$

where p(X|U) is the likelihood of the voiceprint feature sequence and the speaker model, and p(X|UBM) is the likelihood of the voiceprint feature sequence and the background model.
Step 106, judging whether the likelihood ratio is greater than a set threshold; if so, executing step 107, otherwise executing step 108.
The threshold may be preset by the system. Generally, the larger the threshold, the higher the sensitivity of the system, requiring the user at login to match as closely as possible the pronunciation of the voice signal (i.e., the password) entered at registration; conversely, a smaller threshold lowers the sensitivity and allows some variation between the pronunciation at login and at registration.
Step 107, determining that the current login user is a validly authenticated user.
Step 108, determining that the current login user is a non-authenticated user.
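Putting steps 103 through 108 together, and reusing the speaker_likelihood() and ubm_likelihood() sketches above, the decision rule of formula (3) reduces to a few lines; the threshold value below is an arbitrary placeholder, since real systems tune it empirically:

```python
def authenticate(X, ubm, speaker_gmms, threshold=1.5):
    """Steps 103-108: accept the login if the likelihood ratio of
    formula (3) exceeds the preset threshold.

    ubm: (weights, means, variances) of the background model.
    """
    p_spk = speaker_likelihood(X, speaker_gmms, combine="max")
    p_ubm = ubm_likelihood(X, *ubm)
    return (p_spk / p_ubm) > threshold
```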
It should be noted that, to improve the robustness of the system, the continuous voice signal may also be subjected to noise reduction before step 102 above. For example, the continuous voice signal is first divided into independent speech segments and non-speech segments by analyzing its short-time energy and short-time zero-crossing rate; front-end noise reduction then suppresses the interference of channel noise and background noise, improving the signal-to-noise ratio and providing a clean signal for subsequent processing.
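The front end mentioned above can be approximated by a simple voice activity detector based on short-time energy and zero-crossing rate. The following is a rough sketch with assumed frame sizes and thresholds, not the patent's actual front end:

```python
import numpy as np

def simple_vad(signal, frame_len=400, hop=160,
               energy_thresh=1e-4, zcr_thresh=0.25):
    """Flag each frame as speech or non-speech.

    Speech frames tend to have high short-time energy and (for voiced
    sounds) a low zero-crossing rate. Thresholds are illustrative.
    """
    flags = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energy = np.mean(frame ** 2)                         # short-time energy
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2   # zero-crossing rate
        flags.append(energy > energy_thresh and zcr < zcr_thresh)
    return np.array(flags)
```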
User voiceprint features are both relatively stable and variable: on the one hand they are affected by physical condition, age, emotion, and the like; on the other hand they are subject to interference from external environmental noise and the voice acquisition channel. The speaker model therefore needs to distinguish well among the different voiceprint variations of the same speaker. In the embodiment of the invention, the speaker model is a multiple-mixture Gaussian model constructed from the voice signals recorded when the current login user registered; the number of mixture Gaussian models and the number of Gaussians in each are tied to the repetition times of the registration voice signals and their frame counts. The several mixture Gaussian models can thus simulate the different pronunciation variations of the user speaking the same password (i.e., voice signal), improving the accuracy of voiceprint-password-based identity authentication.
In the embodiment of the present invention, the background model describes the commonalities of speakers' voiceprints and needs to be constructed in advance. Prior-art methods may be used for this; for example, a mixture Gaussian model with 1024 or more Gaussians simulates the background model, and its parameters are trained by the process shown in fig. 2.
Step 201, extracting voiceprint features from the training speech signals of multiple speakers, each voiceprint feature serving as a feature vector.
Step 202, clustering the feature vectors with a clustering algorithm to obtain initial means for K Gaussians, where K is the preset number of Gaussians in the mixture model.
For example, the classical LBG (Linde, Buzo, Gray) clustering algorithm may be used to approximate an optimal codebook from the set of training vectors through an iterative procedure.
Step 203, iteratively updating the mean, variance, and weighting coefficient of each Gaussian with the EM (Expectation Maximization) algorithm to obtain the background model.
The specific iterative update process is the same as in the prior art and is not detailed here.
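For illustration only, steps 201 to 203 map naturally onto scikit-learn's GaussianMixture, which initializes the means by k-means clustering (in the same spirit as the LBG algorithm above) and then refines weights, means, and variances with EM. This is an assumed stand-in, not the patent's implementation:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(feature_list, n_gauss=1024):
    """Steps 201-203: pool voiceprint features from many speakers,
    cluster to initialize K Gaussian means, then run EM.

    feature_list: per-speaker feature matrices of shape (T_i, 39).
    """
    X = np.vstack(feature_list)
    ubm = GaussianMixture(n_components=n_gauss,
                          covariance_type="diag",   # diagonal covariances
                          init_params="kmeans",     # clustering-based initialization
                          max_iter=100)
    ubm.fit(X)
    return ubm.weights_, ubm.means_, ubm.covariances_
```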
Of course, the background model may also be constructed in other manners, and the embodiment of the present invention is not limited thereto.
In the embodiment of the present invention, it is necessary to distinguish whether the user is in login mode or registration mode. In login mode, identity authentication based on the voiceprint password is performed according to the flow shown in fig. 1; in registration mode, the registration voice signal entered by the user is received and the user's speaker model is constructed from it.
In the embodiment of the present invention, the construction process of the speaker model is completely different from the construction process of the conventional speaker model, and in order to better explain this point, the construction process of the conventional speaker model is first briefly explained below.
In the traditional construction of a speaker model, the background model is used as the initial model, and some of its parameters are adjusted by an adaptive method, most commonly the adaptive algorithm based on maximum a posteriori probability. The adaptive algorithm adapts the common voiceprint model to the current speaker's personality from a small amount of speaker data; the training process, shown in fig. 3, comprises the following steps:
step 301, extracting voiceprint features from a registered voice signal entered by a user.
Step 302, adaptively updating the mixture-Gaussian means $\mu_m$ of the background model using the voiceprint features.
Specifically, the new Gaussian mean $\hat{\mu}_m$ is calculated as a weighted average of the sample statistics and the original Gaussian mean, i.e.:

$$\hat{\mu}_m = \frac{\sum_{t=1}^{T}\gamma_m(x_t)\,x_t + \tau\,\mu_m}{\sum_{t=1}^{T}\gamma_m(x_t) + \tau} \qquad (4)$$

where $x_t$ represents the voiceprint feature of the t-th frame, $\gamma_m(x_t)$ represents the probability that the voiceprint feature of the t-th frame falls into the m-th Gaussian, and $\tau$ is a forgetting factor that balances the historical mean against the update strength of the samples on the new mean. In general, the larger the value of $\tau$, the more the new mean is constrained toward the original mean; the smaller the value of $\tau$, the more the new mean is determined by the sample statistics and reflects the distribution of the new samples.
Step 303, copying a background model variance as the speaker model variance of the user.
Step 304, generating a speaker model of the user.
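Before turning to the invention's approach, the conventional MAP mean update of formula (4) can be sketched as follows. This is an illustrative NumPy transcription that assumes diagonal covariances, reuses the gaussian_likelihood() helper from the earlier sketch, and picks τ = 16 as a typical but assumed value:

```python
import numpy as np

def map_adapt_means(X, weights, means, variances, tau=16.0):
    """Conventional MAP mean update, formula (4).

    gamma[t, m] is the posterior probability that frame x_t falls into
    the m-th Gaussian; tau balances the original means against the
    sample statistics.
    """
    lik = np.stack([c * gaussian_likelihood(X, mu, var)
                    for c, mu, var in zip(weights, means, variances)], axis=1)
    gamma = lik / lik.sum(axis=1, keepdims=True)   # (T, M) posteriors
    occ = gamma.sum(axis=0)                        # sum_t gamma_m(x_t)
    first = gamma.T @ X                            # sum_t gamma_m(x_t) * x_t
    return (first + tau * means) / (occ + tau)[:, None]
```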
In the embodiment of the invention, when a user registers, the registration voice signal entered by the user is received and the user's speaker model is constructed from it. The speaker model is composed of multiple mixture Gaussian models, so as to simulate the different pronunciation variations of the speaker saying the same password; and the variance of each mixture Gaussian model in the speaker model is trained independently, avoiding the problem of the traditional method, where directly copying the background model variance yields a variance too large for practical application.
As shown in fig. 4, it is a flowchart of constructing a speaker model in the embodiment of the present invention, which includes the following steps:
step 401, storing the registered voice signal recorded by the user as a discrete energy sequence.
Assuming the user enters the same password content N times at registration (for example, N = 2 or 3), N independent discrete energy sequences are obtained.
Step 402, extracting voiceprint features from the obtained discrete energy sequence.
The specific process is similar to the previous step 102 and will not be described in detail here.
Step 403, determining all Gaussian mixture models of the speaker model of the user according to the repetition times and the frame number of the registered voice signals.
In voiceprint password applications, the user enters uniform text content to be used as the password. For example, the number of mixture Gaussian models in the user's speaker model may be set equal to the number of repetitions of the registration voice signal, and the number of Gaussians in each mixture Gaussian model set equal to the number of frames of the registration voice signal corresponding to that model. This may be expressed as:

$$p(O\,|\,M_k) = \sum_{m=1}^{T(k)} c_m^k\, N(O;\,\mu_m^k,\Sigma_m^k) \qquad (5)$$

where T(k) is the number of Gaussians of mixture model $M_k$, equal to the number of frames of the k-th speech sample corresponding to that model, and $c_m^k$, $\mu_m^k$, $\Sigma_m^k$ are respectively the weighting coefficient, mean, and variance of the m-th Gaussian component of mixture model $M_k$.
Of course, the embodiment of the present invention does not limit the topology of the speaker model: the number of mixture Gaussian models and the number of Gaussians in each need not equal the repetition count and frame count of the voice signal. For instance, the number of mixture Gaussian models selected by a clustering algorithm may be smaller than the number of repetitions of the registration voice signal, and similarly each mixture Gaussian model may have fewer Gaussians than the number of frames of the registration voice signal (a sketch of this variant follows).
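To make the smaller-topology variant concrete: a clustering algorithm can compress the frames of one repetition into fewer Gaussian means. The sketch below uses k-means as an assumed stand-in for the unspecified clustering algorithm:

```python
from sklearn.cluster import KMeans

def compressed_means(O_k, n_gauss):
    """Cluster the T_k frames of one enrollment repetition into
    n_gauss < T_k centers, yielding a smaller mixture Gaussian model."""
    km = KMeans(n_clusters=n_gauss, n_init=10).fit(O_k)
    return km.cluster_centers_          # shape (n_gauss, 39)
```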
Step 404, estimating the Gaussian mean parameters of all the mixture Gaussian models from the extracted voiceprint features.
In the embodiment of the invention, the Gaussian mean parameters of each mixture Gaussian model are determined from the single corresponding training sample. Specifically, each Gaussian mean vector of the mixture Gaussian model may be set to the feature vector of the corresponding frame of that sample, i.e.

$$\mu_m^k = O_m^k$$

where $\mu_m^k$ represents the mean of the m-th Gaussian of the k-th mixture model and $O_m^k$ the voiceprint feature vector of the m-th frame of speech of the k-th speech signal.
Step 405, estimating the Gaussian variance parameters of all the mixture Gaussian models from the extracted voiceprint features.
To make variance re-estimation feasible on little data, the Gaussians of each mixture model in the speaker model may be assumed to share a globally uniform covariance matrix, i.e.

$$\Sigma_m^k = \Sigma^k, \quad m = 1,\dots,T(k)$$

(the covariance matrices of all Gaussian components of the k-th mixture Gaussian model take the same value). Specifically, for a given sample voiceprint feature sequence $O^k$, the variance of mixture Gaussian model $M_k$ is re-estimated from the statistics of all the remaining sample voiceprint feature sequences $O^n$ ($n \neq k$), as follows:

$$\Sigma^k = \frac{\sum_{n\neq k}\sum_{i=1}^{T(n)}\sum_{m=1}^{T(k)} \gamma_m^k(O_i^n)\,(O_i^n-\mu_m^k)(O_i^n-\mu_m^k)^{\top}}{\sum_{n\neq k}\sum_{i=1}^{T(n)}\sum_{m=1}^{T(k)} \gamma_m^k(O_i^n)} \qquad (6)$$

where $O_i^n$ denotes the i-th speech frame (i.e., sample) of the n-th registered password (i.e., registration voice signal), $\mu_m^k$ denotes the m-th Gaussian mean of the k-th mixture Gaussian model, and $\gamma_m^k(O_i^n)$ denotes the probability that the sample $O_i^n$ falls on the Gaussian with mean $\mu_m^k$.
Thus, for each individual mixture Gaussian model $M_k$ of the speaker model, the corresponding variance parameter is obtained from the sample data other than $O^k$. If the registration voice signal consists of N sentences, N different variance matrices are obtained.
In particular, the variance matrix may be assumed to be diagonal to further alleviate the data sparsity problem, i.e.

$$\Sigma^k = \mathrm{diag}\!\left(\sigma_1^2,\ \dots,\ \sigma_n^2\right)$$

Furthermore, the Gaussians of all the mixture Gaussian models of the speaker model may be assumed to share one globally uniform diagonal matrix, which better addresses the re-estimation of the model variance under sparse data.
in step 406, the gaussian weighting coefficient parameters of all the gaussian mixture models are estimated.
Considering that the gaussian mean of the mixture gaussian model in this embodiment is directly determined by the sample vector, each gaussian exists at a probability of 1 on the sample, i.e. the probability of occurrence is the same. For this reason, in this embodiment, the weighting coefficient of each gaussian in the mixture model may be set to be equal, that is:
c m k = c k = 1 T ( k ) - - - ( 7 )
by using the flow shown in fig. 4, the number of gaussian mixture models in the speaker model can be set according to the sentence number and the sentence length of the registered voice, the topological structure of the model can be determined, and the training problem of sparse data in the traditional system based on voiceprint password authentication can be effectively solved by reasonably setting the gaussian mean, variance and weighting coefficient of all the gaussian mixture models, so that the distinguishability among the gaussian mixture models can be improved, and the accuracy of identity authentication can be further improved. And the used Gaussian mixture model is smaller and more effective, and compared with the prior art, the operation rate and the memory pressure required by data storage are greatly improved.
Correspondingly, an embodiment of the present invention further provides an identity authentication system, as shown in fig. 5, which is a schematic structural diagram of the identity authentication system according to the embodiment of the present invention.
In this embodiment, the system includes:
a voice signal receiving unit 501, configured to receive a continuous voice signal recorded by a currently logged-in user when the user logs in;
an extracting unit 502, configured to extract a voiceprint feature sequence in the continuous speech signal;
a first calculating unit 503, configured to calculate a likelihood between the voiceprint feature sequence and a background model;
a second calculating unit 504, configured to calculate a likelihood between the voiceprint feature sequence and a speaker model of the current logged-in user, where the speaker model is a multiple-mixture gaussian model constructed according to the number of repetitions and the number of frames of a registered voice signal that is input when the current logged-in user registers;
a third calculating unit 505, configured to calculate a likelihood ratio according to the likelihood of the voiceprint feature sequence and the speaker model and the likelihood of the voiceprint feature sequence and the background model;
a determining unit 506, configured to determine that the current login user is a valid authenticated user when the likelihood ratio calculated by the third calculating unit 505 is greater than a set threshold, and otherwise determine that the current login user is a non-authenticated user.
The voiceprint feature sequence comprises a group of voiceprint features chosen so that different speakers can be effectively distinguished while the variation of the same speaker remains relatively stable.
The voiceprint features extracted by the extracting unit 502 mainly include spectral envelope parameters, pitch contour, formant frequency and bandwidth, linear prediction coefficients, cepstral coefficients, and the like. Considering feature quantization, the number of training samples, and system performance evaluation, MFCC (Mel Frequency Cepstral Coefficient) features may be selected: each frame of voice data, with a window length of 25 ms and a frame shift of 10 ms, undergoes short-time analysis to obtain the MFCC parameters and their first- and second-order differences, 39 dimensions in total. Each spoken sentence can thus be quantized into a 39-dimensional voiceprint feature sequence X.
The background model may be pre-constructed by the system and loaded during initialization, and the specific construction process of the background model is not limited in the embodiments of the present invention.
The speaker model is a multiple-mixture Gaussian model constructed according to the voice signals entered when the currently logged-in user registered; accordingly, in the embodiment of the present invention, the second calculating unit 504 may be implemented in multiple ways, for example:
in one implementation, the second computing unit 504 includes: a first calculating subunit and a first determining subunit. Wherein:
the first calculating subunit is configured to calculate a likelihood of the voiceprint feature sequence and each gaussian mixture model respectively;
the first determining subunit is configured to determine, according to the calculation result of the first calculating subunit, a likelihood between the voiceprint feature sequence and the speaker model of the currently logged-in user.
The first calculating subunit may include: a first calculation module and a first selection module, wherein:
the first calculation module is configured to calculate a likelihood of each voiceprint feature in the voiceprint feature sequence and each gaussian mixture model in the multiple gaussian mixture models respectively;
the first selection module is configured to select a time average of a sum of likelihood values calculated by a hybrid gaussian model corresponding to a group of voiceprint features in the voiceprint feature sequence as a likelihood value of the voiceprint feature sequence and the hybrid gaussian model.
Accordingly, the first determining subunit may also have multiple implementation manners, for example, after the first calculating subunit obtains the likelihood of the voiceprint feature sequence and each gaussian mixture model, the first determining subunit may select one maximum value or one average value thereof as the likelihood of the voiceprint feature sequence and the speaker model of the currently logged-in user.
In another implementation manner, the second computing unit 504 includes: a second calculation subunit and a second determination subunit. Wherein:
the second calculating subunit is configured to calculate a likelihood of each voiceprint feature in the voiceprint feature sequence with respect to the multiple gaussian mixture model respectively;
and the second determining subunit is used for determining the likelihood of the voiceprint feature sequence and the speaker model of the current login user according to the calculation result of the second calculating subunit.
The second calculating subunit may include: a second calculation module and a second selection module, wherein:
the second calculation module is configured to calculate a likelihood of each voiceprint feature in the voiceprint feature sequence and each gaussian mixture model in the multiple gaussian mixture models respectively;
the second selection module is configured to select a maximum value of likelihood degrees calculated by a voiceprint feature in the voiceprint feature sequence corresponding to each gaussian mixture model in the multiple gaussian mixture models, as a likelihood degree of the voiceprint feature and the multiple gaussian mixture models; or selecting the average value of all likelihood degrees calculated by a voiceprint feature in the voiceprint feature sequence corresponding to each mixed Gaussian model in the multiple mixed Gaussian models as the likelihood degree of the voiceprint feature and the multiple mixed Gaussian models.
Accordingly, the second determining subunit may also have multiple implementation manners, for example, after the second calculating subunit obtains the likelihood of each voiceprint feature in the voiceprint feature sequence relative to the multiple-mixture gaussian model, the second determining subunit may select the time average of the likelihood of each voiceprint feature in the voiceprint feature sequence relative to the multiple-mixture gaussian model as the likelihood of the voiceprint feature sequence and the speaker model of the currently logged-in user.
Of course, the second calculating unit 504 may also be implemented in other ways, and the embodiment of the present invention is not limited thereto.
The specific calculation processes of the first calculating unit 503, the second calculating unit 504, and the third calculating unit 505 may refer to the descriptions in the identity authentication method in the foregoing embodiments of the present invention, and are not described herein again.
In the embodiment of the invention, the speaker model is a multiple-mixture Gaussian model constructed from the voice signals recorded when the current login user registered; the number of mixture Gaussian models and the number of Gaussians in each are tied to the repetition times and frame counts of the registration voice signals. The several mixture Gaussian models can thus simulate the different pronunciation variations of the user speaking the same password (i.e., voice signal), improving the accuracy of voiceprint-password-based identity authentication.
Fig. 6 is a schematic diagram of another structure of an identity authentication system according to an embodiment of the present invention.
Unlike the embodiment shown in fig. 5, in this embodiment, the voice signal receiving unit 501 is further configured to receive a registration voice signal entered by the user when the user registers.
In addition, the system further comprises: a model building unit 601, configured to build a speaker model of the user according to the registered speech signal, where the model building unit 601 includes:
a feature extraction subunit 611 configured to extract a voiceprint feature from the registration voice signal;
a topology determining subunit 612, configured to determine all gaussian mixture models of the speaker model of the user according to the number of repetitions and the number of frames of the registered speech signal;
for example, the number of gaussian mixture models of the speaker model of the user may be set to be less than or equal to the number of repetitions of the registered speech signal; setting the number of gaussians corresponding to each Gaussian mixture model to be less than or equal to the number of frames of the registration voice signal;
a first estimating sub-unit 613, configured to estimate, by using the voiceprint features extracted by the feature extracting sub-unit 611, gaussian mean parameters of all the gaussian mixture models determined by the topology determining sub-unit 612;
a second estimating sub-unit 614, configured to estimate, by using the voiceprint features extracted by the feature extracting sub-unit 611, the gaussian variance parameters of all the gaussian mixtures determined by the topology determining sub-unit 612.
The above estimation methods of the estimation subunits for the corresponding parameters in the gaussian mixture model can refer to the foregoing description, and are not described herein again.
The identity authentication system of the embodiment of the invention can set the number of mixture Gaussian models in the speaker model according to the sentence count and sentence length of the registration voice, thereby determining the topology of the model. By reasonably setting the Gaussian means, variances, and weighting coefficients of all the mixture Gaussian models, it effectively solves the sparse-data training problem of traditional voiceprint-password authentication systems and improves the discriminability among mixture Gaussian models, which further improves authentication accuracy. Moreover, the mixture Gaussian models used are smaller and more effective; compared with the prior art, the computation rate and the memory pressure of data storage are greatly improved.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, they are described in a relatively simple manner, and reference may be made to some descriptions of method embodiments for relevant points. The above-described system embodiments are merely illustrative, and the units and modules described as separate components may or may not be physically separate. In addition, some or all of the units and modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The above disclosure describes only preferred embodiments of the present invention, which is not limited thereto. Non-inventive changes that those skilled in the art can make, as well as modifications and amendments that do not depart from the principle of the present invention, shall fall within the protection scope of the present invention.

Claims (18)

1. An identity authentication method, comprising:
when a user logs in, receiving a continuous voice signal recorded by the currently logged-in user;
extracting a voiceprint feature sequence in the continuous voice signal, wherein the voiceprint feature sequence comprises a group of voiceprint features;
calculating the likelihood of the voiceprint feature sequence and a background model;
calculating the likelihood of the voiceprint feature sequence and a speaker model of the current login user, wherein the speaker model is a multiple-mixture Gaussian model constructed according to the repetition times and the frame number of the registered voice signals recorded when the current login user registers;
calculating a likelihood ratio according to the likelihood p(X|U) of the voiceprint feature sequence and the speaker model and the likelihood p(X|UBM) of the voiceprint feature sequence and the background model:

$$p = \frac{p(X\,|\,U)}{p(X\,|\,\mathrm{UBM})}$$
And if the likelihood ratio is larger than a set threshold value, determining that the current login user is a valid authentication user, otherwise, determining that the current login user is a non-authentication user.
2. The method of claim 1, wherein said calculating a likelihood of said voiceprint feature sequence to a speaker model of said current logged in user comprises:
respectively calculating the likelihood of the voiceprint characteristic sequence and each Gaussian mixture model;
and determining the likelihood of the voiceprint feature sequence and the speaker model of the current login user according to the calculation result.
3. The method of claim 2, wherein said separately calculating the likelihood of the voiceprint feature sequence with each gaussian mixture model comprises:
respectively calculating the likelihood of each voiceprint feature in the voiceprint feature sequence and each mixed Gaussian model in the multiple mixed Gaussian models;
and selecting the time average value of the sum of the likelihood degrees calculated by a mixed Gaussian model corresponding to a group of voiceprint features in the voiceprint feature sequence as the likelihood degree of the voiceprint feature sequence and the mixed Gaussian model.
4. The method of claim 2, wherein said determining a likelihood of said voiceprint feature sequence from said speaker model of said currently logged in user based on said calculation comprises:
selecting the average value of likelihood degrees calculated by all mixed Gaussian models corresponding to the voiceprint feature sequence as the likelihood degree of the voiceprint feature sequence and the speaker model of the current login user; or
And selecting the maximum value of the likelihood calculated by all the Gaussian mixture models corresponding to the voiceprint feature sequence as the likelihood of the voiceprint feature sequence and the speaker model of the current login user.
5. The method of claim 1, wherein said calculating a likelihood of said voiceprint feature sequence to a speaker model of said current logged in user comprises:
respectively calculating the likelihood of each voiceprint feature in the voiceprint feature sequence relative to the multi-Gaussian mixture model;
and determining the likelihood of the voiceprint feature sequence and the speaker model of the current login user according to the calculation result.
6. The method of claim 5, wherein the separately calculating the likelihood of each voiceprint feature in the sequence of voiceprint features relative to the multi-mixture Gaussian model comprises:
respectively calculating the likelihood of each voiceprint feature in the voiceprint feature sequence and each mixed Gaussian model in the multiple mixed Gaussian models;
selecting the maximum value of the likelihood degrees calculated by the voiceprint feature in the voiceprint feature sequence corresponding to each mixed Gaussian model in the multiple mixed Gaussian models as the likelihood degree of the voiceprint feature and the multiple mixed Gaussian models; or selecting the average value of all likelihood degrees calculated by a voiceprint feature in the voiceprint feature sequence corresponding to each mixed Gaussian model in the multiple mixed Gaussian models as the likelihood degree of the voiceprint feature and the multiple mixed Gaussian models.
7. The method of claim 5, wherein said determining a likelihood of said voiceprint feature sequence from said speaker model of said currently logged in user based on said calculation comprises:
and selecting the time average value of the likelihood calculated by the multi-Gaussian mixture model corresponding to all the voiceprint features in the voiceprint feature sequence as the likelihood of the voiceprint feature sequence and the speaker model of the current login user.
8. The method of any of claims 1 to 7, further comprising:
when a user registers, receiving a registration voice signal input by the user;
constructing a speaker model of the user according to the registered voice signal;
the process of constructing the speaker model of the user according to the registered voice signal includes:
extracting voiceprint features from the enrollment voice signal;
determining all Gaussian mixture models of the speaker model of the user according to the repetition times and the frame number of the registered voice signals;
estimating Gaussian mean parameters of all Gaussian mixture models of the speaker model of the user according to voiceprint features extracted from the registered voice signals;
and estimating Gaussian variance parameters of all Gaussian mixture models of the speaker model of the user according to the voiceprint features extracted from the registered voice signals.
9. The method of claim 8, wherein said determining all gaussian mixtures of models of the speaker of the user based on the number of repetitions and the number of frames of the registered speech signal comprises:
setting the number of Gaussian mixture models of the speaker model of the user to be less than or equal to the repetition times of the registered voice signals;
the number of gaussians corresponding to each gaussian mixture model is set to be less than or equal to the number of frames of the registered voice signal corresponding to the gaussian mixture model.
10. An identity authentication system, comprising:
the voice signal receiving unit is used for receiving continuous voice signals recorded by a currently logged user when the user logs in;
the extracting unit is used for extracting a voiceprint feature sequence in the continuous voice signal, and the voiceprint feature sequence comprises a group of voiceprint features;
the first calculation unit is used for calculating the likelihood of the voiceprint feature sequence and a background model;
the second calculation unit is used for calculating the likelihood of the voiceprint feature sequence and a speaker model of the current login user, wherein the speaker model is a multiple-mixture Gaussian model constructed according to the repetition times and the frame number of the registered voice signals input when the current login user registers;
a third calculating unit for calculating a likelihood ratio $p = \frac{p(X\,|\,U)}{p(X\,|\,\mathrm{UBM})}$ according to the likelihood p(X|U) of the voiceprint feature sequence and the speaker model and the likelihood p(X|UBM) of the voiceprint feature sequence and the background model;
And the judging unit is used for determining the current login user as a valid authentication user when the likelihood ratio calculated by the third calculating unit is greater than a set threshold, and otherwise, determining the current login user as a non-authentication user.
11. The system of claim 10, wherein the second computing unit comprises:
the first calculating subunit is used for calculating the likelihood of the voiceprint feature sequence and each Gaussian mixture model respectively;
and the first determining subunit is used for determining the likelihood of the voiceprint feature sequence and the speaker model of the current login user according to the calculation result of the first calculating subunit.
12. The system of claim 11, wherein the first computing subunit comprises:
the first calculation module is used for calculating the likelihood of each voiceprint feature in the voiceprint feature sequence and each mixed Gaussian model in the multiple mixed Gaussian models respectively;
and the first selection module is used for selecting the time average value of the likelihood sum obtained by calculating a group of voiceprint features in the voiceprint feature sequence corresponding to a mixed Gaussian model as the likelihood of the voiceprint feature sequence and the mixed Gaussian model.
13. The system of claim 11,
the first determining subunit is specifically configured to select an average value of likelihood degrees calculated by all gaussian mixture models corresponding to the voiceprint feature sequence as a likelihood degree of the voiceprint feature sequence and the speaker model of the currently logged-in user; or selecting the maximum value of the likelihood calculated by all the Gaussian mixture models corresponding to the voiceprint feature sequence as the likelihood of the voiceprint feature sequence and the speaker model of the current login user.
14. The system of claim 10, wherein the second computing unit comprises:
the second calculating subunit is used for respectively calculating the likelihood of each voiceprint feature in the voiceprint feature sequence relative to the multi-Gaussian mixture model;
and the second determining subunit is used for determining the likelihood of the voiceprint feature sequence and the speaker model of the current login user according to the calculation result of the second calculating subunit.
15. The system of claim 14, wherein the second computing subunit comprises:
the second calculation module is used for calculating the likelihood of each voiceprint feature in the voiceprint feature sequence and each mixed Gaussian model in the multiple mixed Gaussian models respectively;
the second selection module is used for selecting the maximum value of the likelihood degrees obtained by calculating one voiceprint feature in the voiceprint feature sequence corresponding to each mixed Gaussian model in the multiple mixed Gaussian models as the likelihood degree of the voiceprint feature and the multiple mixed Gaussian models; or selecting the average value of all likelihood degrees calculated by a voiceprint feature in the voiceprint feature sequence corresponding to each mixed Gaussian model in the multiple mixed Gaussian models as the likelihood degree of the voiceprint feature and the multiple mixed Gaussian models.
16. The system of claim 14,
the second determining subunit is specifically configured to select a time average of the likelihood of each voiceprint feature in the voiceprint feature sequence with respect to the multiple gaussian mixture model as the likelihood of the voiceprint feature sequence and the speaker model of the currently logged-in user.
17. The system of any one of claims 10 to 16, wherein
the voice signal receiving unit is further configured to receive a registration voice signal entered by the user at registration;
and the system further comprises a model construction unit, configured to construct the speaker model of the user from the registration voice signal, the model construction unit comprising:
a feature extraction subunit, configured to extract voiceprint features from the registration voice signal;
a topology determining subunit, configured to determine all the Gaussian mixture models of the speaker model of the user according to the number of repetitions and the number of frames of the registration voice signal;
a first estimation subunit, configured to estimate the Gaussian mean parameters of all the Gaussian mixture models determined by the topology determining subunit, using the voiceprint features extracted by the feature extraction subunit;
and a second estimation subunit, configured to estimate the Gaussian variance parameters of all the Gaussian mixture models determined by the topology determining subunit, using the voiceprint features extracted by the feature extraction subunit.
18. The system of claim 17, wherein
the topology determining subunit is specifically configured to set the number of Gaussian mixture models in the speaker model of the user to be less than or equal to the number of repetitions of the registration voice signal, and to set the number of Gaussian components in each Gaussian mixture model to be less than or equal to the number of frames of the registration voice signal corresponding to that Gaussian mixture model.
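For illustration only, a minimal sketch of the topology rule of claim 18 together with the mean and variance estimation of claims 17, using scikit-learn's GaussianMixture as a stand-in estimator; the cap of eight components per model and the choice of EM with diagonal covariances are assumptions, not values or methods stated in the claims:

```python
from sklearn.mixture import GaussianMixture

def build_speaker_model(repetitions, max_components=8):
    # repetitions: list of 2-D arrays (n_frames x feature_dim), one per
    # recording of the registration voice signal.
    gmms = []
    for frames in repetitions:  # one GMM per repetition satisfies claim 18's bound
        # The Gaussian count of each GMM must not exceed the number of frames
        # of its registration signal; max_components is an illustrative cap.
        n_components = min(max_components, len(frames))
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag").fit(frames)
        # gmm.means_ and gmm.covariances_ hold the estimated Gaussian mean
        # and variance parameters of claim 17.
        gmms.append(gmm)
    return gmms
```

Building exactly one GMM per repetition and taking the min() against the frame count keeps the sketch within the bounds that claim 18 places on the model topology.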
CN2011102180452A 2011-08-01 2011-08-01 Identity authentication method and system Active CN102238190B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011102180452A CN102238190B (en) 2011-08-01 2011-08-01 Identity authentication method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011102180452A CN102238190B (en) 2011-08-01 2011-08-01 Identity authentication method and system

Publications (2)

Publication Number Publication Date
CN102238190A CN102238190A (en) 2011-11-09
CN102238190B 2013-12-11

Family

ID=44888395

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011102180452A Active CN102238190B (en) 2011-08-01 2011-08-01 Identity authentication method and system

Country Status (1)

Country Link
CN (1) CN102238190B (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102510426A (en) * 2011-11-29 2012-06-20 Anhui USTC iFlytek Co., Ltd. Personal assistant application access method and system
CN102496365A (en) * 2011-11-30 2012-06-13 Shanghai Pateo Yuezhen Electronic Equipment Manufacturing Co., Ltd. User verification method and device
US8818810B2 (en) * 2011-12-29 2014-08-26 Robert Bosch Gmbh Speaker verification in a health monitoring system
CN102710602B (en) * 2012-04-28 2016-04-13 Shenzhen Skyworth-RGB Electronic Co., Ltd. Voice login method and system for electronic equipment, and television set
CN102968990B (en) * 2012-11-15 2015-04-15 Zhu Donglai Speaker identification method and system
CN103226951B (en) * 2013-04-19 2015-05-06 Tsinghua University Speaker verification system creation method based on model sequence adaptation technique
CN105096954A (en) * 2014-05-06 2015-11-25 ZTE Corporation Identity identification method and device
CN104239471B (en) * 2014-09-03 2017-12-19 Chen Fei Device and method for performing data query/exchange by way of behavior simulation
CN104361891A (en) * 2014-11-17 2015-02-18 iFlytek Co., Ltd. Method and system for automatically checking customized polyphonic ringtones of a specific population
CN104766607A (en) * 2015-03-05 2015-07-08 Guangzhou Shiyuan Electronic Technology Co., Ltd. Television program recommendation method and system
CN106057206B (en) * 2016-06-01 2019-05-03 Tencent Technology (Shenzhen) Co., Ltd. Voiceprint model training method, voiceprint recognition method and device
CN106157135A (en) * 2016-07-14 2016-11-23 微额速达(上海)金融信息服务有限公司 Anti-fraud system and method based on voiceprint recognition of gender and age
CN106228990A (en) * 2016-07-15 2016-12-14 Beijing Guangnian Wuxian Technology Co., Ltd. Login method and operating system for intelligent robots
CN107705791B (en) * 2016-08-08 2021-06-04 China Telecom Corporation Limited Incoming call identity confirmation method and device based on voiceprint recognition, and voiceprint recognition system
CN107767863B (en) * 2016-08-22 2021-05-04 iFlytek Co., Ltd. Voice wake-up method and system, and intelligent terminal
CN107068154A (en) * 2017-03-13 2017-08-18 Ping An Technology (Shenzhen) Co., Ltd. Identity authentication method and system based on voiceprint recognition
CN109102810B (en) * 2017-06-21 2021-10-15 Beijing Sogou Technology Development Co., Ltd. Voiceprint recognition method and device
CN110223078A (en) * 2019-06-17 2019-09-10 State Grid E-Commerce Co., Ltd. Identity authentication method, device, electronic equipment and storage medium
CN111023470A (en) * 2019-12-06 2020-04-17 Xiamen Kuaishangtong Technology Co., Ltd. Air conditioner temperature adjusting method, medium, equipment and device
CN115171727A (en) * 2022-09-08 2022-10-11 Beijing LLVision Technology Co., Ltd. Method and device for quantifying communication efficiency

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101833951A * 2010-03-04 2010-09-15 Tsinghua University Multi-Context Model Building Method for Speaker Recognition
CN102024455A * 2009-09-10 2011-04-20 Sony Corporation Speaker recognition system and method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7447633B2 (en) * 2004-11-22 2008-11-04 International Business Machines Corporation Method and apparatus for training a text independent speaker recognition system using speech data with text labels
CN101136199B * 2006-08-30 2011-09-07 Nuance Communications, Inc. Voice data processing method and equipment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102024455A * 2009-09-10 2011-04-20 Sony Corporation Speaker recognition system and method
CN101833951A * 2010-03-04 2010-09-15 Tsinghua University Multi-Context Model Building Method for Speaker Recognition

Also Published As

Publication number Publication date
CN102238190A (en) 2011-11-09

Similar Documents

Publication Publication Date Title
CN102238190B (en) Identity authentication method and system
EP1989701B1 (en) Speaker authentication
CN102238189B (en) Voiceprint password authentication method and system
Yu et al. Spoofing detection in automatic speaker verification systems using DNN classifiers and dynamic acoustic features
US10553218B2 (en) Dimensionality reduction of baum-welch statistics for speaker recognition
CN110706692B (en) Training method and system for children's speech recognition model
CN104835498B (en) Method for recognizing sound-groove based on polymorphic type assemblage characteristic parameter
CN108417224B (en) Method and system for training and recognition of bidirectional neural network model
Markov et al. Robust speech recognition using generalized distillation framework.
CN102968990B (en) Speaker identifying method and system
WO2014114116A1 (en) Method and system for voiceprint recognition
CN104900235A (en) Voiceprint recognition method based on pitch period mixed characteristic parameters
Yu et al. Adversarial network bottleneck features for noise robust speaker verification
CN110390948B (en) Method and system for rapid speech recognition
Sturim et al. Classification methods for speaker recognition
CN105895104B (en) Speaker adaptation recognition methods and system
CN104901807A (en) Vocal print password method available for low-end chip
CN111243621A (en) Construction method of GRU-SVM deep learning model for synthetic speech detection
Beritelli et al. The role of voice activity detection in forensic speaker verification
WO2002029785A1 (en) Method, apparatus, and system for speaker verification based on orthogonal gaussian mixture model (gmm)
Herrera-Camacho et al. Design and testing of a corpus for forensic speaker recognition using MFCC, GMM and MLE
Borgström et al. HMM-based reconstruction of unreliable spectrographic data for noise robust speech recognition
Gemmeke Noise robust ASR: missing data techniques and beyond
Ondusko et al. Blind signal-to-noise ratio estimation of speech based on vector quantizer classifiers and decision level fusion
Kumar et al. Confidence-features and confidence-scores for ASR applications in arbitration and DNN speaker adaptation.

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C56 Change in the name or address of the patentee

Owner name: IFLYTEK CO., LTD.

Free format text: FORMER NAME: ANHUI USTC IFLYTEK CO., LTD.

CP03 Change of name, title or address

Address after: No. 666 Wangjiang West Road, Hefei High-tech Development Zone, Anhui, China (230088)

Patentee after: iFlytek Co., Ltd.

Address before: No. 616 Huangshan Road, High-tech Development Zone, Hefei, Anhui (230088)

Patentee before: Anhui USTC iFLYTEK Co., Ltd.