WO2002067245A1 - Speaker verification - Google Patents
Speaker verification
- Publication number
- WO2002067245A1 (PCT/GB2002/000665)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- features
- scoring
- speaker
- speech
- stored model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/06—Decision making techniques; Pattern matching strategies
Definitions
- This invention relates to a speaker verification system and in particular to a speaker verification system based on the principles proposed in our British patent application serial no. GB-A-2248513.
- Speaker verification is important in applications such as financial transactions which are carried out automatically by telephone. Some of the problems of speaker verification are reduced by forming what are known as Gaussian mixture models (GMMs) for a number of utterances, using features of these utterances from a large number of speakers. These models are known as world models. In addition, a GMM is formed for every person whose speech is to be recognised. These models are known as personal or speaker models and comprise mixture components with which input utterances will be processed.
- GMM: Gaussian mixture model.
- A person says isolated or connected utterances, and features from each of these utterances are extracted and formed into feature vectors. After this, the probabilities that these feature vectors could have been generated for these words by the world model and by the personal model of that person are calculated, and these probabilities are compared for the utterances. A decision on verification of the speaker is then based on a poll of these comparisons.
- A system such as this operates by cutting an incoming stream of speech data into short sections, or frames, to allow feature extraction.
- A front end process extracts a set of features from each frame, these features being a function of the input speech signal. These features are then stored as a vector. The feature vectors are then used for comparison with a world model and a speaker model.
- The present invention seeks to reduce the processing of the mixture components from the world model, thereby considerably reducing the computational overhead of the whole process.
- FIGURE 1 shows a block diagram of a system embodying the invention
- FIGURE 2 shows a block diagram of a basic world model scoring system
- FIGURE 3 shows a component predicted world model scoring system in accordance with an embodiment of the invention.
- FIGURE 4 shows schematically how component prediction is performed.
- The diagram of Figure 1 shows an input speech signal supplied to a front end processor 2, which produces a feature vector as its output. This is achieved, as described above, by cutting the speech signal into frames and extracting from each frame a set of features, which are then combined into a feature vector for that frame.
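The framing step can be illustrated with a minimal sketch. The frame length, the hop size and the two toy features (mean energy and zero-crossing rate) are assumptions chosen for illustration; the text does not say which features the front end processor 2 extracts.

```python
from typing import List

def split_into_frames(samples: List[float], frame_len: int, hop: int) -> List[List[float]]:
    """Cut a sample stream into fixed-length frames (sizes are hypothetical)."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames

def toy_features(frame: List[float]) -> List[float]:
    """Stand-in feature extractor: mean energy and zero-crossing rate."""
    energy = sum(x * x for x in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return [energy, crossings / (len(frame) - 1)]

signal = [1.0, -1.0, 1.0, -1.0, 0.5, 0.5, 0.5, 0.5]
# One feature vector per frame.
feature_vectors = [toy_features(f) for f in split_into_frames(signal, frame_len=4, hop=4)]
```

Each frame yields one feature vector, which is the unit that the later scoring stages operate on.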
- The next stage in the process is that the feature vectors are provided to a world model scoring unit 4. This also receives, as an input, mixture components from a world GMM which comprises mixture components for all possible speakers.
- The scoring process with the world model leads to a ranking of the mixture components according to the likelihood score for the given feature vector. It will be appreciated that the score for the comparison of each feature vector with each mixture component can only give a likelihood, as there will be small variations in input speech every time a speaker provides an input signal. These variations will also occur because the frames from which the feature vectors are extracted have cut points at different times in the speaker's speech each time speech is analysed.
- Each feature vector is assigned to the best scoring mixture component of the world model, and these assignments are output from the world model scoring unit 4.
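The ranking and assignment step might look like the following sketch. It assumes diagonal-covariance Gaussian components and log-likelihood scores; neither is stated in the text, and the three-component world model is made up for the example.

```python
import math
from typing import List, Tuple

# (weight, means, variances): an assumed diagonal-covariance component layout.
Component = Tuple[float, List[float], List[float]]

def log_likelihood(x: List[float], comp: Component) -> float:
    """Weighted log-likelihood of feature vector x under one mixture component."""
    weight, means, variances = comp
    ll = math.log(weight)
    for xi, mu, var in zip(x, means, variances):
        ll += -0.5 * (math.log(2.0 * math.pi * var) + (xi - mu) ** 2 / var)
    return ll

def rank_components(x: List[float], world: List[Component]) -> List[int]:
    """Return component indices sorted from best to worst score."""
    scores = [(log_likelihood(x, c), idx) for idx, c in enumerate(world)]
    scores.sort(reverse=True)
    return [idx for _, idx in scores]

world = [
    (0.5, [0.0, 0.0], [1.0, 1.0]),
    (0.3, [3.0, 3.0], [1.0, 1.0]),
    (0.2, [-3.0, 0.0], [1.0, 1.0]),
]
ranking = rank_components([2.8, 3.1], world)
best = ranking[0]  # the component this feature vector is assigned to
```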
- The assigned feature vectors are accumulated in a temporary GMM model 8.
- This temporary model is used for a speaker adaptation process in conjunction with the world model to create a speaker model 10.
- The intention is to produce a speaker model which is a statistical representation of the speaker's speech, where each of the speaker's mixture components has exactly one corresponding component in the world model.
- This speaker model can then be used in a speaker model scoring unit 12.
- A fast convergence of the speaker model parameters is one of the most important tasks in the system and is performed by the speaker adaptation unit 14.
- The speaker model parameters should adapt almost immediately to the input characteristics during the initialisation period. After a certain time, when enough speaker data have been collected, the system should change from speaker adaptation to speaker tracking.
- The tracking allows the system to follow changes in the speaker's voice pattern over a longer period of time.
- The tracking should be slow, so that the voice is captured over a long time span and not only over the last few utterances. This should lead to a more robust estimation of the speaker's model parameters.
- The speaker adaptation unit 14 performs operations based on the standard equation for on-line model re-adaptation.
- The equation is:
  $$\hat{\mu}^{S} = (1-\gamma)\,\mu^{S} + \gamma\,\mu^{T}, \qquad \gamma = \frac{n}{n+m}$$
- $\mu^{S}$ is an already estimated speaker model parameter from the speaker model 10, which represents $m$ seen frames.
- $\mu^{T}$ is the new mean accumulated over the last segment in the temporary model 8, with the representative weight of $n$.
- The new re-adapted model parameter $\hat{\mu}^{S}$ represents $n+m$ frames and is calculated with the weights $1-\gamma$ and $\gamma$, set according to $m$ and $n$, applied to $\mu^{S}$ and $\mu^{T}$ respectively.
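A minimal sketch of this update, assuming the re-adapted parameter is the frame-count-weighted average of the stored speaker parameter (representing m frames) and the newly accumulated mean (representing n frames). The function name and the example values are hypothetical.

```python
from typing import List, Tuple

def readapt_mean(speaker_mean: List[float], m: int,
                 temp_mean: List[float], n: int) -> Tuple[List[float], int]:
    """Blend the stored speaker mean (m frames) with the new segment mean (n frames)."""
    gamma = n / (n + m)  # assumed weight given to the new data
    new_mean = [(1.0 - gamma) * s + gamma * t
                for s, t in zip(speaker_mean, temp_mean)]
    return new_mean, n + m  # the result now represents n + m frames

mean, frames = readapt_mean([1.0, 2.0], 30, [2.0, 4.0], 10)
```

With 30 old frames and 10 new ones, the new data receives a quarter of the weight, so the mean moves a quarter of the way towards the segment mean.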
- The numeral 8 is only a preferred value, and other values may be appropriate in other circumstances.
- A problem with the accumulation is the memory usage for each parameter.
- The accumulated mean in the temporary model 8 can be stored with 16-bit resolution, whereas the mean in the speaker model 10 is stored with only 8 bits.
- The averaging is not very accurate if the resolution of the accumulated mean in the temporary model is reduced to 8 bits. Therefore the storage of the temporary model 8 is twice as large as that of the speaker model 10.
- A simple way of reducing the memory size of the temporary model is to store only a sub-set of all mixture components in the temporary model. This enables the memory to be reduced to half of the original size.
- The components are chosen by the frequency of their occurrence. Components with a high frequency are kept in memory in the temporary model. If a frame is seen for a component which is not in the memory, the component with the lowest frequency in the memory is checked and possibly exchanged with the new component from the world model 6.
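The sub-set storage can be sketched as a capacity-limited accumulator. The exchange rule used here (swap when the new component has been seen more often than the least frequent stored one) is an assumption, since the text only says the component is "checked and possibly exchanged". The class and attribute names are hypothetical.

```python
from typing import Dict, List

class TemporaryModel:
    """Accumulates frames for at most `capacity` mixture components."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.seen: Dict[int, int] = {}    # occurrence count for every component
        self.kept: Dict[int, list] = {}   # component -> [frames accumulated, sums]

    def add_frame(self, comp: int, x: List[float]) -> None:
        self.seen[comp] = self.seen.get(comp, 0) + 1
        if comp not in self.kept:
            if len(self.kept) >= self.capacity:
                weakest = min(self.kept, key=lambda c: self.seen[c])
                if self.seen[comp] <= self.seen[weakest]:
                    return  # assumed rule: not frequent enough, frame is dropped
                del self.kept[weakest]
            self.kept[comp] = [0, [0.0] * len(x)]
        entry = self.kept[comp]
        entry[0] += 1
        for d, xd in enumerate(x):
            entry[1][d] += xd

tm = TemporaryModel(capacity=2)
for comp in [0, 0, 1, 2, 2]:
    tm.add_frame(comp, [1.0])
```

In this run component 2 displaces component 1 on its second occurrence, because by then it has been seen more often.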
- Speaker tracking uses the same equation as in adaptation.
- The weighting factor γ is set to a certain value instead of being calculated individually according to the number of seen frames.
- The adaptation will change to a tracking approach once enough frames have been received to train the mixture component parameters for the target speaker. In this system, this is the case when 255 frames have been processed for a certain mixture component, but other values could be selected.
- γ should not be too large, to avoid fast tracking which weights the newer data very highly and therefore quickly loses information from the past. If γ is chosen too low, changes in the target speaker's voice might not be captured and the speaker is locked out by the system.
- The value for γ also depends on the model size and the length of the test segment.
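The switch from adaptation to tracking might be sketched as below. The 255-frame threshold comes from the text; the fixed tracking weight of 0.05 is a hypothetical value, and the frame-count-based weight during adaptation is an assumption about the re-adaptation equation.

```python
from typing import List

TRACKING_THRESHOLD = 255  # frames per mixture component, the value given in the text
FIXED_GAMMA = 0.05        # hypothetical tracking weight; the text gives no value

def update_weight(m: int, n: int) -> float:
    """Weight for the new data: count-based while adapting, fixed while tracking."""
    if m >= TRACKING_THRESHOLD:
        return FIXED_GAMMA
    return n / (n + m)

def update_mean(speaker_mean: List[float], m: int,
                temp_mean: List[float], n: int) -> List[float]:
    """Apply the same blending equation in both modes; only the weight differs."""
    gamma = update_weight(m, n)
    return [(1.0 - gamma) * s + gamma * t
            for s, t in zip(speaker_mean, temp_mean)]

adapting = update_weight(30, 10)    # count-based weight
tracking = update_weight(300, 10)   # fixed weight
tracked_mean = update_mean([0.0], 300, [1.0], 20)
```

A small fixed weight makes the model follow slow changes in the voice without being dominated by the last few utterances.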
- A test with standard adaptation is performed. This reveals the baseline performance for further testing of memory reductions on the temporary model and of the speaker tracking settings. Tests are performed with a model size of 64 mixture components. Larger model sizes might not show any differences between the tracking strategies because of the small number of training sessions available. A comparison is also made between gender dependent background models and a combined gender model for the adaptation process.
- The feature vectors and mixture components output by the world model scoring unit are provided to the speaker model scoring unit 12.
- Producing a likelihood score for the speaker model involves only the processing of the most likely mixture components with the feature vectors. These components also obtain high scores from the speaker model because of the component correspondence between the speaker and world models. Therefore, only a small number of components are processed for scoring the speaker model.
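A sketch of scoring the speaker model with only the components that ranked highest in the world model. It assumes one-dimensional Gaussian components for brevity and combines the per-component log scores with a log-sum-exp; both choices are illustrative assumptions, not taken from the text.

```python
import math
from typing import Iterable, List, Tuple

Component = Tuple[float, float, float]  # (weight, mean, variance), 1-D for brevity

def log_gauss(x: float, comp: Component) -> float:
    weight, mean, var = comp
    return math.log(weight) - 0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def score_subset(x: float, model: List[Component], indices: Iterable[int]) -> float:
    """Log-likelihood of x using only the listed components of the model."""
    logs = [log_gauss(x, model[i]) for i in indices]
    peak = max(logs)
    return peak + math.log(sum(math.exp(v - peak) for v in logs))

speaker_model = [(0.5, 0.0, 1.0), (0.5, 10.0, 1.0)]
top_from_world = [0]  # indices passed on by the world model scoring stage
fast = score_subset(0.2, speaker_model, top_from_world)
full = score_subset(0.2, speaker_model, range(len(speaker_model)))
```

Because distant components contribute almost nothing to the sum, the subset score stays very close to the full score while evaluating fewer components.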
- The output scores of the world and speaker model scoring units are likelihood scores, which are input to a normalisation and decision unit 16.
- The speaker model score is normalised by subtracting the world model score. The resulting difference is compared to a threshold, and the speaker is accepted or rejected by the system depending on whether the difference falls above or below the threshold. An accept or reject signal is then output by the normalisation and decision unit 16.
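The normalisation and decision step reduces to a subtraction and a comparison. The sketch below assumes log-likelihood scores and that a difference above the threshold means acceptance; the threshold value is hypothetical.

```python
from typing import Tuple

def verify(speaker_score: float, world_score: float, threshold: float) -> Tuple[float, bool]:
    """Normalise the speaker score by the world score and compare with a threshold."""
    normalised = speaker_score - world_score
    return normalised, normalised > threshold

accepted = verify(-41.0, -45.5, threshold=2.0)
rejected = verify(-50.0, -45.5, threshold=2.0)
```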
- A process known as component prediction can be used to speed up the world model scoring. It will be appreciated that, in the system described above, each of the input feature vectors from the front end processor 2 has to be compared with each of the mixture components from the world model 6 in the world model scoring unit 4. This task is therefore computationally very expensive, both in the initial training of the unit for a speaker and in subsequent testing.
- The world model, which is a GMM, consists of a number of mixture components 20.
- Each of the mixture components 20 in the world model is processed with an input feature vector in a scoring unit 22.
- The result of this scoring is a likelihood score for each of the mixture components.
- The likelihood scores for all of the components are sorted according to their values to produce a likelihood ranking of the mixture components.
- The best scoring components, recognised by their indices, are used for further processing and are stored for each feature vector in a best scoring component store 24.
- The other output from the scoring unit 22 is the world score. For each feature vector, this is calculated by combining the likelihood scores of the best scoring components.
- In Figure 3, a component predicted world model scoring system is shown, which illustrates the extension of the world model scoring using component prediction.
- The indices of the best scoring components stored at 24 are used to choose indices from a look-up table index 26. These indices point to information about which mixture components are most likely to obtain high likelihood scores for the following feature vector. Thus, they contain data about which mixture components should be selected for comparison with the next input feature vector. Only a subset of all the components is selected, e.g. 5 components, according to the data from the look-up table. This component selection is performed by the component selection unit 28, and the selected components are then provided to the scoring unit 22 for scoring with the next feature vector. The scoring of this next feature vector again leads to best scoring components, and the indices of these components are again used for a prediction for the following feature vector.
- The idea is to predict certain mixture components from the world model which are most likely to achieve high scores when the immediately succeeding vector is processed. This prediction is based on a data driven estimation of the most likely component indices. For example, a total of 25 components might be predicted from a total of 1124 mixture components in a world model, thereby reducing the processing time by over 95%.
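The prediction loop can be sketched as follows. The four-component world model, the one-dimensional features and the contents of the look-up table are all made up for the example. The first vector is scored against every component, and each later vector only against the components predicted from the previous best one.

```python
import math
from typing import Dict, List, Tuple

Component = Tuple[float, float, float]  # (weight, mean, variance), 1-D for brevity

def log_score(x: float, comp: Component) -> float:
    weight, mean, var = comp
    return math.log(weight) - 0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def predicted_scoring(vectors: List[float], world: List[Component],
                      lookup: Dict[int, List[int]]) -> Tuple[List[int], int]:
    """Score each vector against the predicted subset; count the evaluations made."""
    candidates = list(range(len(world)))  # first frame: all components
    best_per_frame, evaluations = [], 0
    for x in vectors:
        scored = [(log_score(x, world[i]), i) for i in candidates]
        evaluations += len(scored)
        best = max(scored)[1]
        best_per_frame.append(best)
        candidates = lookup[best]  # prediction for the next vector
    return best_per_frame, evaluations

world = [(0.25, 0.0, 1.0), (0.25, 2.0, 1.0), (0.25, 4.0, 1.0), (0.25, 6.0, 1.0)]
lookup = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2, 3], 3: [2, 3]}
best, evals = predicted_scoring([0.1, 1.9, 3.8, 4.2], world, lookup)

# Arithmetic behind the figure quoted in the text: 25 of 1124 components.
saving = 1 - 25 / 1124  # about 0.978
```

In this toy run 12 component evaluations are made instead of 16; with the sizes quoted in the text the fraction of components skipped per frame is about 97.8%.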
- The prediction is derived from transition probabilities for traversing from a mixture component J to a mixture component I in acoustic space.
- The Gaussian components of a world GMM can be trained using the EM algorithm published by Dempster, A., Laird, N. and Rubin, D. in the Journal of the Royal Statistical Society, Series B, 39:1-38, 1977, under the heading "Maximum likelihood from incomplete data via the EM algorithm". This algorithm assumes equal probabilities for all state transitions.
- Transition probabilities can be calculated after training of the mixture components. These transition probabilities allow prediction of certain mixture components for the processing of the succeeding vector.
- Figure 4 shows an overview of the prediction scheme.
- A feature vector is processed using the world GMM.
- A likelihood score of the feature vector is calculated for each mixture component (stage 1). These scores are sorted.
- The look-up tables of the most likely mixture components are used for prediction (stage 2).
- The component indices of these tables are copied into a component prediction array for the processing of the next feature vector.
- Another aspect is the processing of the first feature vector in a vector sequence.
- The prediction may not be used for the first frame, which is similar to a calculation of all components in the GMM.
- The prediction produces substantially the same result as full processing.
- The prediction deteriorates with the frame likelihood score, but it does not degenerate into a random component calculation.
- The initial component prediction obtains good results when used for the first frame of a speech segment. In general, the differences between the two initialisation methods are minor at the start of a speech segment.
- Global transition estimates can be averaged over all consecutive vector pairs.
- An initial transition probability for the start of a vector sequence can be defined for a number of extracted speech segments.
- The transition probabilities P(I:J) are sorted for each component J, and only the most likely indices of I are stored in the look-up table 26.
- The table for each mixture component can vary in size. When a feature vector is processed with the world GMM, it leads to a most likely mixture component. This is stored in the indices of a posteriori components 24 and is used to select the look-up table to be used by the component selection unit 28.
- The table contains the indices of the mixture components which will be processed for the next feature vector, these mixture components being the most likely mixture components to follow the current feature vector.
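One way to build such look-up tables is sketched below, assuming the transition probabilities are estimated by counting which best-scoring component follows which over consecutive vector pairs. The training sequences and the `top_k` cut-off are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List

def build_lookup(best_sequences: List[List[int]], top_k: int) -> Dict[int, List[int]]:
    """For each component J, keep the top_k most likely following components I."""
    counts: Dict[int, Counter] = defaultdict(Counter)
    for seq in best_sequences:
        for j, i in zip(seq, seq[1:]):  # consecutive vector pairs
            counts[j][i] += 1
    lookup = {}
    for j, followers in counts.items():
        total = sum(followers.values())
        probs = {i: c / total for i, c in followers.items()}  # estimate of P(I:J)
        ranked = sorted(probs, key=lambda i: (-probs[i], i))
        lookup[j] = ranked[:top_k]
    return lookup

# Best-scoring component index per frame, for two made-up training segments.
sequences = [[0, 0, 1, 2, 2, 2], [0, 1, 1, 2]]
table = build_lookup(sequences, top_k=2)
```

The table for component 2 holds a single entry here, which mirrors the remark that the table for each mixture component can vary in size.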
- When a new speaker model is to be created for a new user, the speaker says a known sequence of words a certain number of times, and each utterance of a word is used to generate the most likely components from the world model which correspond to that speaker. This is done by first using the world model scoring unit 4 with extracted feature vectors. Initially, the first feature vector is scored with all the components of the world model 6, and an index of the best scoring components is stored at 24. The look-up table store 26 then provides data corresponding to the most likely set of components which should be compared with the next feature vector. This most likely set is the most likely set of the world components, not of any particular speaker's components. These are then scored with the next input vector and the process repeats.
- The temporary model 8 receives the feature vectors output by the world model scoring unit 4, and a speaker adaptation unit 14 uses this to produce a speaker model 10 for that particular speaker.
- A speaker model is thus created and stored for future reference.
- A speech input is tested against a speaker model. This can be done by a user who is to be identified first inputting, for example, an identification number or some other identifier. This causes the system to load what it believes to be the speaker model for that speaker into the speaker model store 10.
- An utterance of speech from the speaker is then processed by the front end processor and the world model scoring unit, which operates according to the system of Figure 3. That is to say, it first scores the first input vector with all the components from the world model 6, before using the index of best scoring components 24 and the look-up table store 26 to select, by means of the component selection unit 28, the most likely set of components for scoring against the next vector.
- Feature vector mixture indices are supplied to the speaker model scoring unit, which operates only with the mixture components in the speaker model 10.
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Business, Economics & Management (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Game Theory and Decision Science (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to a speaker verification system for identifying whether a portion of input speech comes from a particular speaker. A set of features is extracted from a portion of input speech provided by the speaker. A first scoring means (4) scores the set of features using a first stored model of mixture components derived from sets of features extracted from portions of input speech provided by a plurality of speakers. A second scoring means (12) scores the set of features using a second stored model of mixture components derived from sets of features extracted from portions of input speech provided by the speaker to be identified. The results are compared to determine whether the portion of input speech did come from the particular speaker. The system provides that the first scoring means (4) scores the set of features with only the part of the first stored model that is most likely to provide a good match with the supplied set of features.
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB0103875A GB2372366A (en) | 2001-02-16 | 2001-02-16 | Speaker verification |
| GB0103875.1 | 2001-02-16 | ||
| GB0108473.0 | 2001-04-04 | ||
| GB0108473A GB0108473D0 (en) | 2001-02-16 | 2001-04-04 | Mixed composition in speaker verification |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2002067245A1 (fr) | 2002-08-29 |
Family
ID=26245721
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/GB2002/000665 Ceased WO2002067245A1 (fr) | 2001-02-16 | 2002-02-15 | Speaker verification |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2002067245A1 (fr) |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| GB2248513A (en) * | 1990-10-03 | 1992-04-08 | Ensigma Ltd | Speaker verification |
| EP0822539A2 (fr) * | 1996-07-31 | 1998-02-04 | Digital Equipment Corporation | Two-stage population selection for a speaker verification system |
Non-Patent Citations (2)
| Title |
|---|
| NAKAGAWA S ET AL: "Speaker verification using frame and utterance level likelihood normalization", ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 1997. ICASSP-97., 1997 IEEE INTERNATIONAL CONFERENCE ON MUNICH, GERMANY 21-24 APRIL 1997, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, 21 April 1997 (1997-04-21), pages 1087 - 1090, XP010225987, ISBN: 0-8186-7919-0 * |
| REYNOLDS D A: "COMPARISON OF BACKGROUND NORMALIZATION METHODS FOR TEXT-INDEPENDENT SPEAKER VERIFICATION", 5TH EUROPEAN CONFERENCE ON SPEECH COMMUNICATION AND TECHNOLOGY. EUROSPEECH '97. RHODES, GREECE, SEPT. 22 - 25, 1997, EUROPEAN CONFERENCE ON SPEECH COMMUNICATION AND TECHNOLOGY. (EUROSPEECH), GRENOBLE: ESCA, FR, vol. 2 OF 5, 22 September 1997 (1997-09-22), pages 963 - 966, XP001004029 * |
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2005055200A1 (fr) * | 2003-12-05 | 2005-06-16 | Queensland University Of Technology | Model adaptation system and method for speaker recognition |
| FR2868587A1 (fr) * | 2004-03-31 | 2005-10-07 | France Telecom | Method and system for the quick conversion of a voice signal |
| WO2005106853A1 (fr) | 2004-03-31 | 2005-11-10 | France Telecom | Method and system for the quick conversion of a voice signal |
| US7792672B2 (en) | 2004-03-31 | 2010-09-07 | France Telecom | Method and system for the quick conversion of a voice signal |
| CN110299150A (zh) * | 2019-06-24 | 2019-10-01 | 中国科学院计算技术研究所 | Real-time speech speaker separation method and system |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | AK | Designated states | Kind code of ref document: A1. Designated state(s): JP US |
| | AL | Designated countries for regional patents | Kind code of ref document: A1. Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR |
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
| | DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | |
| | 122 | Ep: pct application non-entry in european phase | |
| | NENP | Non-entry into the national phase | Ref country code: JP |
| | WWW | Wipo information: withdrawn in national office | Country of ref document: JP |