
WO2002067245A1 - Speaker verification - Google Patents

Speaker verification

Info

Publication number
WO2002067245A1
WO2002067245A1
Authority
WO
WIPO (PCT)
Prior art keywords
features
scoring
speaker
speech
stored model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/GB2002/000665
Other languages
English (en)
Inventor
Michael John Carey
Roland Auckenthaler
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Imagination Technologies Ltd
Original Assignee
Imagination Technologies Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from GB0103875A (published as GB2372366A)
Application filed by Imagination Technologies Ltd filed Critical Imagination Technologies Ltd
Publication of WO2002067245A1
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/06 Decision making techniques; Pattern matching strategies

Definitions

  • This invention relates to a speaker verification system and in particular to a speaker verification system based on the principles proposed in our British patent application serial no. GB-A-2248513.
  • Speaker verification is important in applications such as financial transactions which are carried out automatically by telephone. Some of the problems of speaker verification are reduced by forming what are known as Gaussian Mixture Models (GMMs) for a number of utterances, using features of these utterances from a large number of speakers. These models are known as world models. In addition, for every person whose speech is to be recognised, a GMM is formed. These models are known as personal or speaker models and comprise mixture components with which input utterances will be processed.
  • GMM: Gaussian Mixture Model
  • A person says isolated or connected utterances, and features from each of these utterances are extracted and formed into feature vectors. After this, the probabilities that these feature vectors could have been generated for these words by the world model and by the personal model of that person are calculated, and these probabilities are compared for the utterances. A decision on verification of the speaker is then based on a poll of these comparisons.
  • A system such as this operates by cutting an incoming stream of speech data into short sections or frames to allow feature extraction.
  • A front end process extracts a set of features from each frame, these features being a function of the input speech signal. These features are then stored as a vector. The feature vectors are then used for comparison with the world and speaker models. (A sketch of such a front end appears after this list.)
  • The present invention seeks to reduce the processing of the mixture components from the world model, thereby considerably reducing the computational overhead of the whole process.
  • FIGURE 1 shows a block diagram of a system embodying the invention.
  • FIGURE 2 shows a block diagram of a basic world model scoring system.
  • FIGURE 3 shows a component predicted world model scoring system in accordance with an embodiment of the invention.
  • FIGURE 4 shows schematically how component prediction is performed.
  • The diagram of Figure 1 shows an input speech signal applied to a front end processor 2 which produces as its output a feature vector. This is achieved as described above by cutting the speech signal into frames and from each frame extracting a set of features which are then combined into a feature vector for that frame.
  • The next stage in the process is that the feature vectors are provided to a world model scoring unit 4. This also receives, as an input, mixture components from a world GMM which comprises mixture components for all possible speakers.
  • The scoring process with the world model leads to a ranking of the mixture components according to the likelihood score for the given feature vector. It will be appreciated that the score for the comparison of each feature vector with each mixture component can only give a likelihood, as there will be small variations in input speech every time a speaker provides an input signal. These variations will also occur as a result of the frames from which the feature vectors are extracted having cut points at different times in the speaker's speech each time speech is analysed.
  • Each feature vector is assigned to the best scoring mixture component of the world model and these are output from the world model scoring unit 4. (A sketch of this scoring and ranking appears after this list.)
  • The assigned feature vectors are accumulated in a temporary GMM model 8.
  • This temporary model is used for a speaker adaptation process in conjunction with the world model to create a speaker model 10.
  • The intention is to produce a speaker model which is a statistical representation of the speaker's speech, where each of the speaker's mixture components has exactly one corresponding component in the world model.
  • This speaker model can then be used in a speaker model scoring unit 12.
  • A fast convergence of the speaker model parameters is one of the most important tasks in the system and is performed by the speaker adaptation unit 14.
  • The speaker model parameters should adapt almost immediately to the input characteristics during the initialisation period. After a certain time, when enough speaker data have been collected, the system should change from speaker adaptation to speaker tracking.
  • The tracking allows the system to follow changes in the speaker's voice pattern over a longer period of time.
  • The tracking should be slow, to capture the voice over a long time span and not only over the last few utterances. This should lead to a more robust estimation of the speaker's model parameters.
  • The speaker adaptation unit 14 performs operations based on the standard equation for on-line model re-adaptation. (A sketch of this update, and of the switch to tracking, appears after this list.)
  • The equation is:

    $$\hat{\mu}_{S,t} = \frac{m\,\mu_{S,t} + n\,\mu_{T,t}}{m + n}$$

  • $\mu_{S,t}$ is an already estimated speaker model parameter from the speaker model 10 which represents $m$ seen frames.
  • $\mu_{T,t}$ is the new mean accumulated over the last segment in the temporary model 8 with the representative weight of $n$.
  • The new re-adapted model parameter $\hat{\mu}_{S,t}$ represents $n+m$ frames and is calculated with weights according to $m$ and $n$ applied to $\mu_{S,t}$ and $\mu_{T,t}$ respectively.
  • These are preferred values only, and other values may be appropriate in other circumstances.
  • A problem of the accumulation is the memory usage for each parameter.
  • $\mu_{T,t}$ can be stored with 16-bit resolution whereas $\mu_{S,t}$ is stored with only 8 bits.
  • The averaging is not very accurate if the resolution for $\mu_{T,t}$ is reduced to 8 bits. Therefore the storage of the temporary model 8 is twice as large as that of the speaker model 10.
  • A simple way of reducing the memory size of the temporary model is to store only a sub-set of all mixture components in the temporary model. This enables the memory to be reduced to half of the original size.
  • The components are chosen by the frequency of their occurrence. Components with a high frequency are kept in memory in the temporary model. If a frame is seen for a component which is not in the memory, the component with the lowest frequency in the memory is checked and possibly exchanged with the new component from the world model 6. (A sketch of this scheme appears after this list.)
  • Speaker tracking uses the same equation as in adaptation.
  • The weighting factor $\gamma$ is set to a fixed value instead of being calculated individually according to the number of seen frames.
  • The adaptation will change to a tracking approach once enough frames have been received to train the mixture component parameters for the target speaker. In this system, this is the case when 255 frames have been processed for a certain mixture component, but other values could be selected.
  • $\gamma$ should not be too large, to avoid a fast tracking which weights the newer data very highly and therefore quickly loses information from the past. If $\gamma$ is chosen too low, changes in the target speaker's voice might not be captured and the speaker may be locked out by the system.
  • The value for $\gamma$ also depends on the model size and the length of the test segment.
  • A test with standard adaptation is performed first. This reveals the baseline performance for further testing of memory reductions on the temporary model and of the speaker tracking settings. Tests are performed with a model size of 64 mixture components. Larger model sizes might not show any differences between the different tracking strategies, due to the small number of training sessions available. A comparison is also made between gender dependent background models and a combined gender model for the adaptation process.
  • The feature vectors and mixture components output by the world model scoring unit are provided to the speaker model scoring unit 12.
  • Producing a likelihood score for the speaker model involves only the processing of the most likely mixture components with the feature vectors. These components also retain high scores from the speaker model due to the component correspondence between speaker and world model. Therefore, only a small number of components are processed for scoring the speaker model.
  • The output scores of the world and speaker model scoring units are likelihood scores which are input to a normalisation and decision unit 16.
  • The world and speaker model scores are normalised by subtracting the world model score. The result is compared to a threshold, and the speaker is accepted or rejected by the system depending on whether the difference falls above or below the threshold. An accept or reject signal is then output by the normalisation and decision unit 16. (A sketch of this decision step appears after this list.)
  • A process known as component prediction can be used to speed up the processing of the world model scoring. It will be appreciated that in the diagram described above, each of the input feature vectors from the front end processor 2 has to be compared with each of the mixture components from the world model 6 in the world model scoring unit 4. This task is therefore computationally very expensive, both in initial training of the unit for a speaker and for subsequent testing.
  • The world model, which is a GMM, consists of a number of mixture components 20.
  • Each of the mixture components 20 in the world model is processed with an input feature vector in a scoring unit 22.
  • The result of this scoring is a likelihood score for each of the mixture components.
  • The likelihood scores for all of the components are sorted according to their values to produce a likelihood ranking of the mixture components.
  • The best scoring components, recognised by their indices, are used for further processing and are stored for each feature vector in a best scoring component store 24.
  • The other output from the scoring unit 22 is the world score. For each feature vector, this is calculated by combining the likelihood scores of the best scoring components.
  • In Figure 3 a component predicted world model scoring system is shown, illustrating the extension of the world model scoring using component prediction.
  • The indices of the best scoring components stored at 24 are used to choose indices from a look-up table index 26. These indices point to information about which mixture components are most likely to obtain high likelihood scores for the following feature vector. Thus, these contain data about which mixture components should be selected for comparison with the next input feature vector. Only a subset of all the components is selected, e.g. 5 components, according to the data from the look-up table. This component selection is performed by a component selection unit 28 before being provided to the scoring unit 22 for scoring with the next feature vector. The scoring of this next feature vector again leads to best scoring components, and the indices of these components are again used for a prediction for the following feature vector. (A sketch of this prediction scheme appears after this list.)
  • The idea is to predict certain mixture components from the world model which are most likely to achieve high scores for processing of the immediately succeeding vector. This prediction is based on a data driven estimation of the most likely component indices. For example, a total of 25 components might be predicted from a total of 1124 mixture components in a world model, thereby reducing the processing time by over 95%.
  • The prediction is derived from transition probabilities for traversing from a mixture component J to a mixture component I in acoustic space.
  • The Gaussian components of a world GMM can be trained using the EM algorithm published in the Journal of the Royal Statistical Society, 39:1-38, 1977 by Dempster, A., Laird, N. and Rubin, D. under the heading "Maximum likelihood from incomplete data via the EM algorithm". This algorithm assumes equal probabilities for all state transitions.
  • Transition probabilities can be calculated after training of the mixture components. These transition probabilities allow prediction of certain mixture components for the processing of the succeeding vector.
  • Figure 4 shows an overview of the prediction scheme.
  • A feature vector is processed using the world GMM.
  • A likelihood score of the feature vector is calculated for each mixture component (Stage 1). These scores are sorted.
  • The look-up tables of the most likely mixture components are used for prediction (Stage 2).
  • The component indices of these tables are copied into a component prediction array for the processing of the next feature vector.
  • Another aspect is the processing of the first feature vector in a vector sequence.
  • The prediction may not be used for the first frame, in which case a calculation over all components in the GMM is performed.
  • The prediction produces substantially the same result as full processing.
  • The prediction deteriorates with the frame likelihood score, but it does not degenerate into a random component calculation.
  • The initial component prediction obtains good results when used for the first frame of a speech segment. In general, however, the differences between the two initialisation methods are minor at the start of a speech segment.
  • Global transition estimates can be averaged over all consecutive vector pairs.
  • An initial transition probability for the start of a vector sequence can be defined for a number of extracted speech segments.
  • The transition probabilities P(I|J) are sorted for each component J and only the most likely indices of I are stored in the look-up table 26.
  • The table for each mixture component can vary in size. When a feature vector is processed with the world GMM it leads to a most likely mixture component. This is stored in the index of a posteriori components 24 and is used to select the look-up table to be used by the component selection unit 28.
  • The table contains the indices of the mixture components which will be processed for the next feature vector, these mixture components being the most likely to follow the current feature vector.
  • When a new speaker model is to be created for a new user, this is done by the speaker saying a known sequence of words a certain number of times, each utterance of the words being used to generate the most likely components from the world model which correspond to that speaker. This is done by first using the world model scoring unit 4 with extracted feature vectors. Initially, the first feature vector is scored with all the components of the world model 6 and an index of the best scoring components is stored at 24. The look-up table store 26 then provides data corresponding to the most likely set of components which should be compared with the next feature vector. This most likely set is the most likely set of the world components, not of any particular speaker's components. These are then scored with the next input vector and the process repeats.
  • The temporary model 8 receives the feature vectors output by the world model scoring unit 4, and a speaker adaptation unit 14 uses these to produce a speaker model 10 for that particular speaker.
  • A speaker model is thus created and stored for future reference.
  • In use, a speech input is tested against a speaker model. This can be done by a user who is to be identified first inputting, for example, an identification number or some other identifier. This causes the system to load what it believes to be the speaker model for that speaker into the speaker model store 10.
  • An utterance of speech from the speaker is then processed by the front end processor and the world model scoring unit, which operates according to the system of Figure 3. That is to say, it first scores the first input vector with all the components from the world model 6 before using the index of best scoring components 24 and the look-up table store 26 to select, via the component selection unit 28, the most likely set of components for scoring against the next vector.
  • Feature vector mixture indices are supplied to the speaker model scoring unit 12, which operates only with the mixture components in the speaker model 10. (An end-to-end sketch combining these steps appears after this list.)
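The following sketches illustrate the processing steps described in the list above. They are minimal Python illustrations under stated assumptions, not the patented implementation. First, the front end: the frame length, hop size and band-energy features below are assumed values, since the description does not specify the exact features used.

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):
    """Split a 1-D speech signal into overlapping frames (assumed sizes)."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop:i * hop + frame_len]
                     for i in range(n_frames)])

def extract_features(frames, n_bands=12):
    """One feature vector per frame: log band energies of the spectrum
    (a stand-in for the real features, which the text does not specify)."""
    window = np.hamming(frames.shape[1])
    spectrum = np.abs(np.fft.rfft(frames * window, axis=1)) ** 2
    bands = np.array_split(spectrum, n_bands, axis=1)
    return np.log(np.stack([b.sum(axis=1) for b in bands], axis=1) + 1e-10)

# Usage: one second of 16 kHz audio becomes a sequence of 12-dim vectors.
speech = np.random.randn(16000)
features = extract_features(frame_signal(speech))   # shape (98, 12)
```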
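Next, the world model scoring and ranking performed by unit 4. A diagonal-covariance GMM is assumed, and the choice of five best-scoring components and the use of logsumexp to combine them are illustrative.

```python
import numpy as np
from scipy.special import logsumexp

def component_log_likelihoods(x, means, variances, log_weights):
    """Log-likelihood of feature vector x under each diagonal-covariance
    Gaussian mixture component (one row per component)."""
    return (log_weights
            - 0.5 * np.sum(np.log(2 * np.pi * variances)
                           + (x - means) ** 2 / variances, axis=1))

def score_frame(x, means, variances, log_weights, top_n=5):
    """Score one feature vector against every component, rank the
    components, and combine the best-scoring ones into the world score."""
    ll = component_log_likelihoods(x, means, variances, log_weights)
    ranking = np.argsort(ll)[::-1]        # likelihood ranking, best first
    best = ranking[:top_n]                # indices of best components
    return logsumexp(ll[best]), best      # world score, best components
```

Here `means` and `variances` would be arrays of shape (n_components, dim) and `log_weights` of shape (n_components,).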
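The re-adaptation equation and the switch from adaptation to tracking might be sketched as follows. The switch point of 255 frames is taken from the description; the tracking weight value is an assumption.

```python
def readapt_mean(mu_s, m, mu_t, n, gamma=0.05, switch_at=255):
    """One mixture-component mean update.

    Adaptation phase (fewer than switch_at seen frames): the weights follow
    the frame counts m and n, as in the equation above. Tracking phase: a
    fixed small weight gamma is used instead (gamma = 0.05 is an assumed
    value; the text does not give one). The frame count is capped at
    switch_at, consistent with an 8-bit counter.
    """
    if m < switch_at:
        mu_new = (m * mu_s + n * mu_t) / (m + n)     # adaptation
    else:
        mu_new = (1 - gamma) * mu_s + gamma * mu_t   # tracking
    return mu_new, min(m + n, switch_at)
```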
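The frequency-based memory reduction of the temporary model 8 could look like this. The precise exchange rule for the lowest-frequency component is an assumption, since the description says only that it is "checked and possibly exchanged".

```python
class TemporaryModel:
    """Fixed-capacity temporary model: keeps per-component frequency and a
    running mean of the assigned feature vectors (numpy arrays)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.stats = {}   # component index -> [frequency, running mean]

    def accumulate(self, component, x):
        if component in self.stats:
            entry = self.stats[component]
            entry[0] += 1
            entry[1] += (x - entry[1]) / entry[0]   # update running mean
        elif len(self.stats) < self.capacity:
            self.stats[component] = [1, x.copy()]
        else:
            # Check the lowest-frequency component and possibly exchange it
            # (this exact exchange rule is an assumption).
            weakest = min(self.stats, key=lambda c: self.stats[c][0])
            if self.stats[weakest][0] <= 1:
                del self.stats[weakest]
                self.stats[component] = [1, x.copy()]
```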
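The normalisation and decision unit 16 then reduces to a score difference compared against a threshold; the threshold value here is illustrative.

```python
def decide(speaker_score, world_score, threshold=0.0):
    """Normalise by subtracting the world score and compare with a
    threshold (0.0 is an assumed, illustrative value)."""
    return "accept" if speaker_score - world_score > threshold else "reject"
```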
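Component prediction, as performed with the look-up table store 26 and component selection unit 28, might be sketched as below. It reuses component_log_likelihoods from the scoring sketch above; building the tables from counted transitions between consecutive best-scoring components, and the table size of five successors, are assumptions consistent with the text.

```python
import numpy as np
from scipy.special import logsumexp

def build_lookup_tables(best_component_sequences, n_components, table_size=5):
    """Count transitions j -> i between consecutive best-scoring components
    and keep, for each j, the indices of its most likely successors."""
    counts = np.zeros((n_components, n_components))
    for seq in best_component_sequences:    # one index sequence per utterance
        for j, i in zip(seq[:-1], seq[1:]):
            counts[j, i] += 1
    return [np.argsort(counts[j])[::-1][:table_size]
            for j in range(n_components)]

def predicted_scoring(x, means, variances, log_weights, tables, prev_best):
    """Score a frame against the predicted subset only; return the frame's
    world score and the new best component index."""
    candidates = tables[prev_best]
    ll = component_log_likelihoods(x, means[candidates],
                                   variances[candidates],
                                   log_weights[candidates])
    return logsumexp(ll), candidates[np.argmax(ll)]
```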
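Finally, an end-to-end verification pass combining the sketches above: the first frame is scored against all world components, later frames only against the predicted subset, with the speaker model scored in parallel and the normalised score compared with a threshold.

```python
def verify(signal, world, speaker, tables, threshold=0.0):
    """world and speaker are (means, variances, log_weights) tuples; the
    first frame is scored in full, later frames via prediction."""
    feats = extract_features(frame_signal(signal))
    world_total = speaker_total = 0.0
    prev_best = None
    for x in feats:
        if prev_best is None:                    # first frame: full scoring
            w, best = score_frame(x, *world)
            prev_best = best[0]
        else:                                    # predicted subset only
            w, prev_best = predicted_scoring(x, *world, tables, prev_best)
        s, _ = score_frame(x, *speaker)          # small speaker model
        world_total += w
        speaker_total += s
    return decide(speaker_total, world_total, threshold)
```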

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention concerns a speaker verification system for identifying whether an input portion of speech comes from a particular speaker. A set of features is extracted from an input portion of speech provided by the speaker. A first scoring means (4) scores the set of features using a first stored model of mixture components derived from sets of features extracted from input portions of speech provided by a plurality of speakers. A second scoring means (12) scores the set of features using a second stored model of mixture components derived from sets of features extracted from input portions of speech provided by the speaker to be identified. The results are compared in order to determine whether the input portion of speech did indeed come from the particular speaker. The system provides that the first scoring means (4) scores the set of features with only the portion of the first stored model most likely to provide a good match with the supplied set of features.
PCT/GB2002/000665 2001-02-16 2002-02-15 Speaker verification Ceased WO2002067245A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
GB0103875A GB2372366A (en) 2001-02-16 2001-02-16 Speaker verification
GB0103875.1 2001-02-16
GB0108473.0 2001-04-04
GB0108473A GB0108473D0 (en) 2001-02-16 2001-04-04 Mixed composition in speaker verification

Publications (1)

Publication Number Publication Date
WO2002067245A1 true WO2002067245A1 (fr) 2002-08-29

Family

ID=26245721

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2002/000665 Ceased WO2002067245A1 (fr) Speaker verification

Country Status (1)

Country Link
WO (1) WO2002067245A1 (fr)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2248513A (en) * 1990-10-03 1992-04-08 Ensigma Ltd Speaker verification
EP0822539A2 (fr) * 1996-07-31 1998-02-04 Digital Equipment Corporation Sélection en deux étapes d'une population pour un système de vérification de locuteur

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
NAKAGAWA S ET AL: "Speaker verification using frame and utterance level likelihood normalization", ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 1997. ICASSP-97., 1997 IEEE INTERNATIONAL CONFERENCE ON MUNICH, GERMANY 21-24 APRIL 1997, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, 21 April 1997 (1997-04-21), pages 1087 - 1090, XP010225987, ISBN: 0-8186-7919-0 *
REYNOLDS D A: "COMPARISON OF BACKGROUND NORMALIZATION METHODS FOR TEXT-INDEPENDENT SPEAKER VERIFICATION", 5TH EUROPEAN CONFERENCE ON SPEECH COMMUNICATION AND TECHNOLOGY. EUROSPEECH '97. RHODES, GREECE, SEPT. 22 - 25, 1997, EUROPEAN CONFERENCE ON SPEECH COMMUNICATION AND TECHNOLOGY. (EUROSPEECH), GRENOBLE: ESCA, FR, vol. 2 OF 5, 22 September 1997 (1997-09-22), pages 963 - 966, XP001004029 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005055200A1 (fr) * 2003-12-05 2005-06-16 Queensland University Of Technology System and method of model adaptation for speaker recognition
FR2868587A1 (fr) * 2004-03-31 2005-10-07 France Telecom Method and system for the quick conversion of a voice signal
WO2005106853A1 (fr) 2004-03-31 2005-11-10 France Telecom Method and system for the quick conversion of a voice signal
US7792672B2 (en) 2004-03-31 2010-09-07 France Telecom Method and system for the quick conversion of a voice signal
CN110299150A (zh) * 2019-06-24 2019-10-01 Institute of Computing Technology, Chinese Academy of Sciences Real-time speech speaker separation method and system

Similar Documents

Publication Publication Date Title
CN111916111B (zh) Intelligent voice outbound call method and apparatus with emotion, server, and storage medium
CA2163017C (fr) Speech recognition method using a two-pass search
EP0813735B1 (fr) Speech recognition
JP3114975B2 (ja) Speech recognition circuit using phoneme estimation
US6195634B1 (en) Selection of decoys for non-vocabulary utterances rejection
US6134527A (en) Method of testing a vocabulary word being enrolled in a speech recognition system
CN111524527A (zh) Speaker separation method and apparatus, electronic device, and storage medium
WO2017162053A1 (fr) Identity authentication method and device
JP2002533789A (ja) Knowledge-based strategies used on an N-best list in automatic speech recognition systems
EP1269464A2 (fr) Discriminatively trained mixture models in continuous speech recognition
CN113744742A (zh) Role recognition method, device and system for dialogue scenarios
US20030200087A1 (en) Speaker recognition using dynamic time warp template spotting
JP2002123286A (ja) Speech recognition method
WO2002067245A1 (fr) Speaker verification
Cohen et al. On feature selection for speaker verification
US20030110032A1 (en) Fast search in speech recognition
GB2372366A (en) Speaker verification
EP1488410B1 (fr) Determination of a distortion measure for speech recognition
JP2001312293A (ja) Speech recognition method and apparatus, and computer-readable storage medium
Korkmazskiy et al. Discriminative adaptation for speaker verification
CN113178205B (zh) Speech separation method and apparatus, computer device, and storage medium
JP2000259198A (ja) Pattern recognition apparatus and method, and providing medium
JP3322536B2 (ja) Neural network learning method and speech recognition apparatus
CN111681671A (zh) Abnormal sound recognition method and apparatus, and computer storage medium
JP4424023B2 (ja) Unit-concatenation speech synthesis apparatus

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): JP US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP