GB2388947A - Method of voice authentication - Google Patents
Method of voice authentication
- Publication number
- GB2388947A (application GB0211842A)
- Authority
- GB
- United Kingdom
- Prior art keywords
- feature vectors
- recorded signal
- user
- value
- spoken
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/22—Interactive procedures; Man-machine interfaces
- G10L17/24—Interactive procedures; Man-machine interfaces the user being prompted to utter a password or a predefined phrase
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
Abstract
A method of voice authentication comprises enrolment and authentication stages. During enrolment, a user is prompted to provide a spoken response, which is recorded using a microphone (5). The recorded signal is divided into frames and converted into feature vectors. Feature vectors are concatenated to form featuregrams. Endpoints of speech may be determined either by explicit endpointing based on analysis of the energy of the timeslices or by dynamic time warping methods. Featuregrams corresponding to speech are generated and averaged together to produce a speech featuregram archetype. During the authentication stage, the user is prompted to provide a spoken response to the prompt, from which a speech featuregram is obtained. The speech featuregram obtained during authentication is compared with the speech featuregram archetype and is scored. The score is evaluated to determine whether the user is a valid user or an impostor. Also provided is a method of gain control for optimising the input voice level: if the amplified input signal is higher (lower) than a predetermined limit, the gain is lowered (raised) and otherwise maintained, so that the gain moves in one direction only and cannot oscillate. To stop replay attacks, a method comprises prompting the user for a plurality of responses and checking whether the responses are substantially identical. A smart card for performing the above voice authentication is also disclosed.
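The two-stage flow summarised in this abstract can be sketched as follows. This is a minimal illustration only: `extract_featuregram` uses toy per-frame features (mean and peak amplitude) and a plain frame-by-frame Euclidean distance stands in for the dynamic time warping scorer described in the patent; all names and thresholds are assumptions.

```python
# Sketch of the enrol/authenticate flow from the abstract (illustrative only).
# extract_featuregram and the averaging/scoring details are simplified stand-ins.

def extract_featuregram(signal, frame_len=4):
    """Split a signal into frames and map each frame to a tiny feature vector
    (here: mean and peak absolute amplitude per frame)."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    return [(sum(abs(s) for s in f) / frame_len, max(abs(s) for s in f)) for f in frames]

def average_featuregrams(featuregrams):
    """Element-wise average of equal-length featuregrams -> archetype."""
    n = len(featuregrams)
    return [tuple(sum(fg[i][d] for fg in featuregrams) / n for d in range(2))
            for i in range(len(featuregrams[0]))]

def score(featuregram, archetype):
    """Total Euclidean distance between corresponding frames (lower = better match)."""
    return sum(((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
               for a, b in zip(featuregram, archetype))

def authenticate(signal, archetype, threshold):
    return score(extract_featuregram(signal), archetype) <= threshold

# Enrolment: several repetitions of the same response are averaged.
enrol = [[1, 2, 1, 0, 5, 6, 5, 4], [1, 2, 1, 1, 5, 5, 5, 4], [0, 2, 1, 0, 5, 6, 6, 4]]
archetype = average_featuregrams([extract_featuregram(s) for s in enrol])

print(authenticate([1, 2, 1, 0, 5, 6, 5, 4], archetype, threshold=1.0))   # True (valid user)
print(authenticate([9, 9, 9, 9, 0, 0, 0, 0], archetype, threshold=1.0))   # False (impostor)
```

In the patent itself the comparison step is dynamic time warping rather than a fixed frame-to-frame distance, which tolerates responses spoken at different speeds.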
Description
Method of voice authentication
The present invention relates to a method of voice authentication.
Voice authentication may be defined as a process in which a user's identity is validated by analysing the user's speech patterns. Such a process may be used for controlling access to a system, such as a personal computer, cellular telephone handset or telephone banking account.
Aspects of voice authentication are known in voice recognition systems. Examples of voice recognition systems are described in US-A-4956865, US-A-507939, US-A-5845092 and WO-A-0221513.
The present invention seeks to provide a method of voice authentication.
According to the present invention there is provided a method of voice authentication comprising: enrolling a user, including requesting said enrolling user to provide a spoken response to a prompt, obtaining a recorded signal including a recorded signal portion corresponding to said spoken response, determining endpoints of said recorded signal portion, deriving a set of feature vectors for characterising said recorded signal portion, averaging a plurality of sets of feature vectors, each set of feature vectors relating to one or more different spoken responses to the prompt by said enrolling user, so as to provide an archetype set of feature vectors for said response, and storing said archetype set of feature vectors together with data relating to said prompt; and authenticating a user, including retrieving said data relating to said prompt and said archetype set of feature vectors, requesting said authenticating user to provide another spoken response to said prompt, obtaining another recorded signal including another recorded signal portion corresponding to said other spoken response, determining endpoints of said other recorded signal portion, deriving another set of feature vectors characterising said other recorded signal portion, comparing said another set of feature vectors with said archetype set of feature vectors so as to produce a score dependent upon a degree of matching
and comparing said score with a predefined threshold so as to determine whether said enrolling user and said authenticating user are the same.
The recorded signal may include a plurality of frames and the determining of said endpoints of said recorded signal may further comprise determining whether a value of energy for a first frame exceeds a first predetermined value and determining whether a second frame immediately preceding the first frame represents a spoken utterance portion. The method may comprise receiving said spoken response via a microphone, generating an electrical signal, amplifying said electrical signal using an amplifier to produce an amplified signal, determining whether said amplified signal level is above a first predetermined limit and either decreasing gain if the amplified signal level is above the first predetermined limit or maintaining gain otherwise, thereby permitting no increase in gain.
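The frame-energy endpoint check described above can be sketched as a simple state machine: a frame starts an utterance when its energy exceeds a threshold and the immediately preceding frame was not speech, and ends one when the energy drops back below the threshold. Frame length and the threshold are illustrative assumptions, not values from the patent.

```python
# Illustrative energy-based endpoint detection, following the check described
# above: a frame starts an utterance when its energy exceeds a threshold and
# the immediately preceding frame was not speech; it ends one when the energy
# drops back below the threshold. Threshold and frame size are assumptions.

def frame_energy(frame):
    return sum(s * s for s in frame)

def find_endpoints(signal, frame_len, threshold):
    """Return (start_frame, stop_frame) index pairs for detected utterances."""
    frames = [signal[i:i + frame_len] for i in range(0, len(signal), frame_len)]
    endpoints, in_speech, start = [], False, None
    for idx, frame in enumerate(frames):
        loud = frame_energy(frame) > threshold
        if loud and not in_speech:            # start point: quiet -> loud
            start, in_speech = idx, True
        elif not loud and in_speech:          # stop point: loud -> quiet
            endpoints.append((start, idx - 1))
            in_speech = False
    if in_speech:                             # utterance ran to end of signal
        endpoints.append((start, len(frames) - 1))
    return endpoints

# Quiet - loud - quiet - loud: two utterances expected.
sig = [0, 0, 0, 0, 5, 5, 5, 5, 0, 0, 0, 0, 4, 4, 4, 4]
print(find_endpoints(sig, frame_len=4, threshold=1.0))  # [(1, 1), (3, 3)]
```

The patent refines this basic check with an adaptive background-noise estimate, rate-of-change thresholds, and rules for pairing stop points with start points, described later in the text.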
The method may further comprise requesting said enrolling user to provide another spoken response to another prompt, generating another electrical signal, amplifying said another electrical signal with said amplifier to produce another amplified signal, determining whether said another amplified signal level is above the first predetermined limit, and either decreasing gain if the amplified signal level is above the first predetermined limit or maintaining gain otherwise, thereby permitting no increase in gain.
The method may comprise requesting said authenticating user to provide first and second spoken responses to said prompt, obtaining a recorded signal including first and second recorded signal portions corresponding to said first and second spoken responses, isolating said first and second recorded signal portions, deriving first and second sets of feature vectors for characterising said first and second isolated recorded signal portions respectively, comparing said first set of feature vectors with said second set of feature vectors so as to produce another score dependent upon the degree of matching and comparing the other score with another
predefined threshold so as to determine whether the first set of feature vectors is substantially identical to the second set of feature vectors.
The method may comprise requesting an authenticating user to provide a plurality of spoken responses to a prompt, obtaining a plurality of corresponding recorded signals, each recorded signal including a recorded signal portion corresponding to a respective spoken response, deriving a plurality of sets of feature vectors, each set of feature vectors for characterising a respective recorded signal portion, comparing said sets of feature vectors with an archetype set of feature vectors so as to produce a plurality of scores dependent upon a degree of matching and determining whether authentication is successful in dependence upon said plurality of scores.
The method may further comprise requesting a first set of users to provide respective spoken responses to a prompt; for each user, obtaining a recorded signal which includes a recorded signal portion corresponding to the user's spoken response; for each user, deriving a set of feature vectors for characterising the recorded signal portion; for each user, comparing said set of feature vectors with an archetype set of feature vectors for said user so as to produce a score dependent upon a degree of matching; fitting a first probability density function to the frequency of scores for said first set of users; requesting a second set of users to provide respective spoken responses to a prompt; for each user, obtaining a recorded signal which includes a recorded signal portion corresponding to the user's spoken response; for each user, deriving a set of feature vectors for characterising the recorded signal portion; for each user, comparing said set of feature vectors with an archetype set of feature vectors for a different user so as to produce a score dependent upon a degree of matching; and fitting a second probability density function to the frequency of scores for said second set of users.
The comparing of sets of feature vectors may comprise dynamic time warping said sets of feature vectors, and the score produced by the comparing of said sets of feature vectors may be a dynamic time warping winning path distance. The deriving of the set of feature vectors for characterising said recorded signal portions may comprise determining feature vectors representative of acoustic features within
said recorded signal portions. The deriving of said set of feature vectors for characterising said recorded signal portions may comprise determining feature vectors using a feature transform. The feature transform may be a mel-cepstral transform. The method may further comprise dividing said recorded signal into frames. The deriving of said set of feature vectors for characterising said recorded signal portions may comprise deriving a feature vector for each respective frame of said recorded signal portion.
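The per-frame feature transform can be sketched as below. This is a much-simplified, pure-Python stand-in for a mel-cepstral front end: a naive DFT per frame, log-compressed magnitudes and a DCT, with no windowing, pre-emphasis or mel-spaced filterbank; frame length and coefficient count are assumptions.

```python
import math

# A much-simplified stand-in for the mel-cepstral feature transform described
# above: per frame, take a naive DFT, log-compress the magnitude spectrum and
# apply a DCT to get a small feature vector. A production front end would add
# windowing, a mel-spaced filterbank and pre-emphasis.

def frame_features(frame, n_coeffs=4):
    n = len(frame)
    # Magnitude spectrum via a naive DFT (fine for short illustrative frames).
    mags = []
    for k in range(n // 2):
        re = sum(frame[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(frame[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        mags.append(math.hypot(re, im))
    logm = [math.log(m + 1e-9) for m in mags]
    # DCT-II of the log spectrum -> cepstral-style coefficients.
    return [sum(lm * math.cos(math.pi * c * (i + 0.5) / len(logm))
                for i, lm in enumerate(logm))
            for c in range(n_coeffs)]

def featuregram(signal, frame_len=8):
    """One feature vector per frame -> a featuregram."""
    return [frame_features(signal[i:i + frame_len])
            for i in range(0, len(signal) - frame_len + 1, frame_len)]

sig = [math.sin(2 * math.pi * 3 * t / 16) for t in range(16)]
fg = featuregram(sig)
print(len(fg), len(fg[0]))  # 2 4  (two frames, four coefficients each)
```

Concatenating the per-frame vectors in time order gives the featuregram that the patent's averaging and matching steps operate on.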
The determining of endpoints may comprise dynamic time warping said another set of feature vectors onto said archetype set of feature vectors, including determining a first sub-set of feature vectors within said another set of feature vectors from which a dynamic time warping winning path may start and determining a second sub-set of feature vectors within said another set of feature vectors at which the dynamic time warping winning path may finish.
The method may further comprise performing a plurality of checks on said recorded signal for determining whether said recorded signal is suitable for use in enrolling.
The method may further comprise performing a plurality of checks on said another recorded signal for determining whether said another recorded signal is suitable for use in authentication. The performing of the plurality of checks may include determining whether a length of spoken utterance included in a recorded signal exceeds a minimum, determining whether a length of silence included in a recorded signal exceeds a minimum, determining whether a signal-to-noise ratio of a recorded signal exceeds a minimum, determining whether the energy of a recorded signal exceeds a minimum or determining whether a degree of clipping of a recorded signal exceeds a maximum. The performing of the plurality of checks may include calculating a mean feature vector by averaging a set of feature vectors, calculating a distance between said mean feature vector and each feature vector, calculating an average of said distances and determining whether said average distance exceeds a minimum. The performing of the plurality of checks may include calculating a mean feature vector by averaging a set of feature vectors, deriving a set of feature vectors for characterising a recorded signal portion corresponding to a silence interval, calculating a distance between said mean feature vector and each feature vector corresponding to said
silence interval, calculating an average of said distances and determining whether said average distance exceeds a minimum.
The averaging of the plurality of sets of feature vectors may comprise comparing each set of feature vectors with each other set of feature vectors so as to produce a respective set of scores dependent upon a degree of matching, searching for a minimum score and determining whether at least one score is below a predetermined threshold.
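The archetype-forming step with its consistency check can be sketched as follows. The distance used here is a simple Euclidean sum rather than the patent's dynamic time warping, and the threshold is an illustrative assumption.

```python
# Sketch of forming an archetype from several enrolment featuregrams, with the
# cross-check described above: every pair of featuregrams is scored against
# each other, and enrolment is accepted only if at least one pairwise score
# falls below a consistency threshold. Distances here are simple Euclidean
# sums; the patent uses dynamic time warping.

def pair_score(fg_a, fg_b):
    return sum(sum((a - b) ** 2 for a, b in zip(va, vb)) ** 0.5
               for va, vb in zip(fg_a, fg_b))

def build_archetype(featuregrams, consistency_threshold):
    scores = [pair_score(featuregrams[i], featuregrams[j])
              for i in range(len(featuregrams))
              for j in range(i + 1, len(featuregrams))]
    if min(scores) >= consistency_threshold:
        return None  # repetitions too dissimilar; re-enrol
    n = len(featuregrams)
    length, dim = len(featuregrams[0]), len(featuregrams[0][0])
    # Element-wise average across repetitions -> archetype featuregram.
    return [[sum(fg[t][d] for fg in featuregrams) / n for d in range(dim)]
            for t in range(length)]

fgs = [[[1.0, 2.0], [5.0, 6.0]], [[1.2, 2.1], [4.9, 6.2]], [[0.9, 1.8], [5.1, 5.9]]]
archetype = build_archetype(fgs, consistency_threshold=2.0)
print(archetype is not None)  # True: the repetitions agree closely
```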
According to another aspect of the present invention there is provided apparatus for voice authentication comprising means for enrolling a user, including means for requesting said enrolling user to provide a spoken response to a prompt, means for obtaining a recorded signal including a recorded signal portion corresponding to said spoken response, means for determining endpoints of said recorded signal portion, means for deriving a set of feature vectors for characterising said recorded signal portion, means for averaging one or more sets of feature vectors, each set relating to one or more different spoken responses to the prompt by said enrolling user, so as to provide an archetype set of feature vectors for said response, and means for storing said archetype set of feature vectors together with data relating to said prompt; and means for authenticating a user, including means for retrieving said data relating to said prompt and said archetype set of feature vectors, means for requesting said authenticating user to provide another spoken response to said prompt, means for obtaining another recorded signal including another recorded signal portion corresponding to said other spoken response, means for determining endpoints of said other recorded signal portion, means for deriving another set of feature vectors for characterising said other recorded signal portion, means for comparing said another set of feature vectors with said archetype set of feature vectors so as to produce a score dependent upon a degree of matching and means for comparing said score with a predefined threshold so as to determine whether said enrolling user and said authenticating user are the same.
Endpointing seeks to locate a start and stop point of a spoken utterance.
The present invention seeks to provide an improved method of endpointing.
According to the present invention there is provided a method of determining an endpoint of a recorded signal portion in a recorded signal including a plurality of frames, the method comprising determining whether a value of energy for a first frame exceeds a first predetermined value and determining whether a second frame immediately preceding the first frame represents a spoken utterance portion.
The first predetermined value may represent a value of energy of a frame comprised of background noise. The method may comprise defining a start point if the value
of energy of the first frame exceeds the first predetermined value and the second frame does not represent a spoken utterance portion. The method may further comprise indicating that the first frame represents a spoken utterance portion. The method may comprise defining a stop point if the value of energy of the first frame does not exceed the first predetermined value and the second frame represents a spoken utterance portion. The method may comprise defining the first frame as not representing a spoken utterance portion. The method may comprise counting a number of frames preceding a start point of the spoken utterance portion. The method may further comprise pairing the stop point with said start point of the spoken utterance portion if the number of frames exceeds a predetermined number.
The method may further comprise pairing the stop point with the start point of a preceding spoken utterance portion if the number of frames does not exceed a predetermined number. The method may comprise determining whether the value of energy for a first frame exceeds a third predetermined value and counting a number of frames preceding a start point of the spoken utterance portion. The method may further comprise defining a start point if the value of energy of the first frame exceeds the third predetermined value, the second frame does not represent a spoken utterance portion and the number of frames does not exceed a predetermined number. The method may further comprise determining whether a value of energy for a third frame following said first frame exceeds the third predetermined value. The method may further comprise defining a stop point if the value of energy of the third frame does not exceed the third predetermined value.
The method may further comprise pairing the stop point with the start point of the
spoken utterance portion. The method may further comprise pairing the stop point with a start point of a preceding spoken utterance portion. The method may comprise defining the first frame as representing background noise if the value of
energy of the first frame does not exceed the third predetermined value. The method may further comprise calculating an updated value of background energy
using said value of energy of the first frame. The method may further comprise counting a number of frames preceding a start point of the spoken utterance portion and determining whether said number of frames exceeds another, larger number. The method may comprise determining whether a value of rate of change of energy of the first frame exceeds a second predetermined value. The second predetermined value may represent a value of rate of change of energy of a frame comprised of background noise. The method may further comprise defining a start
point if the value of energy of the first frame exceeds the first predetermined value, the value of rate of change of energy exceeds the second predetermined value and the second frame does not represent a spoken utterance portion. The method may comprise defining a stop point if the value of energy of the first frame does not exceed the first predetermined value, the value of rate of change of energy does not exceed the second predetermined value and the second frame represents a spoken utterance portion. The method may comprise determining whether the value of rate of change of energy for the first frame exceeds a fourth predetermined value.

Voice authentication systems typically include an amplifier. If a user provides a spoken response which is too quiet, then amplifier gain may be increased.
Conversely, if a spoken response is too loud, then amplifier gain may be reduced.
Usually, a succession of samples is taken and amplifier gain is increased or reduced accordingly until a settled value of amplifier gain is obtained. However, there is a danger that the amplifier gain rises and falls and never settles.
The present invention seeks to ameliorate this problem.
According to the present invention there is provided a method of gain control comprising a plurality of times determining whether an amplified signal level is
above a predetermined limit, and either decreasing gain if the amplified signal level is above the predetermined limit or maintaining gain otherwise, thereby permitting no increase in gain.
The method may further comprise receiving a spoken response from a user via a microphone, generating an electrical signal and amplifying said electrical signal using an amplifier to produce the amplified signal. The method may comprise determining whether amplifier gain is at a lowest value. The method may comprise informing the user that the spoken response is loud if the amplified signal level is above the predetermined limit and the gain is at the lowest value. The determining whether the amplified signal is above the predetermined level may comprise determining whether any portion of the amplified signal is above the predetermined level. The determining whether the amplified signal is above the predetermined level may comprise determining whether an average of the amplified signal is above the predetermined level. The method may further comprise counting a number of spoken responses given by the user. The counting of the number of spoken responses may comprise determining a number of times that gain has been decreased or maintained. The method may comprise stopping if the number of spoken responses exceeds a predetermined number. The method may comprise determining a number of times gain has been consecutively maintained. The method may comprise storing a value of gain if the number of times gain has been consecutively maintained reaches a predetermined number. The method may further comprise determining a value of signal-to-noise ratio. The method may further comprise determining whether the amplified signal level is above another, lower predetermined level. The method may further comprise not storing a value of gain if the amplified signal level is below said another, lower predetermined level.
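The non-increasing gain control described above can be sketched as follows: because gain only ever decreases or is held, it cannot oscillate and must settle. The starting gain, limits, halving step and the "settled after N consecutive holds" rule are all illustrative assumptions.

```python
# Sketch of the non-increasing gain control described above: gain only ever
# decreases (or is held), so it cannot oscillate and must settle. Limits,
# step size and the "settled after N holds" rule are illustrative choices.

def settle_gain(sample_peaks, gain=8.0, limit=1.0, min_gain=1.0,
                settle_count=2, max_attempts=10):
    """Feed successive response peak levels through the amplifier model;
    return the settled gain, or None if it never settles."""
    consecutive_holds = 0
    for attempt, peak in enumerate(sample_peaks):
        if attempt >= max_attempts:
            break
        amplified = peak * gain
        if amplified > limit:
            if gain <= min_gain:
                # Gain cannot go lower: ask the user to speak more quietly.
                consecutive_holds = 0
                continue
            gain = max(min_gain, gain / 2.0)   # decrease only, never increase
            consecutive_holds = 0
        else:
            consecutive_holds += 1              # maintain
            if consecutive_holds >= settle_count:
                return gain
    return None

# Same speaking level each time: gain halves until within the limit, then holds.
print(settle_gain([0.4, 0.4, 0.4, 0.4, 0.4]))  # 2.0
```

The complementary aspect claimed below is the mirror image: gain may only be increased or maintained, never decreased, which settles for a response that is too quiet.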
According to another aspect of the present invention there is provided a method of gain control comprising a plurality of times determining whether an amplified signal level is below a predetermined limit, and either increasing gain if the amplified signal level is below the predetermined limit or maintaining gain otherwise, thereby permitting no decrease in gain.
According to still another aspect of the present invention there is provided a gain controller configured to repeatedly determine whether an amplified signal level is above a predetermined limit and to permit gain to be decreased or maintained but not increased.
According to still yet another aspect of the present invention there is provided a gain controller configured to repeatedly determine whether an amplified signal level is below a predetermined limit and to permit gain to be increased or maintained but not decreased.
A potential threat to the security offered by any voice authentication system is the possibility of an impostor secretly recording a spoken response of a valid user and subsequently replaying the recording to gain access to the system. This is known as a 'replay attack'.

The present invention seeks to help detect a replay attack.
According to the present invention there is provided a method of voice authentication comprising requesting a user to provide first and second spoken responses to a prompt, obtaining a recorded signal including first and second recorded signal portions corresponding to said first and second spoken responses, isolating said first and second recorded signal portions, deriving first and second sets of feature vectors for characterising said first and second isolated recorded signal portions respectively, comparing said first set of feature vectors with said second set of feature vectors so as to produce a second score dependent upon the degree of matching and comparing the second score with another predefined threshold so as to determine whether the first set of feature vectors is substantially identical to the second set of feature vectors.
The method may comprise providing a third set of feature vectors, comparing said first set of feature vectors with said third set of feature vectors and obtaining a third score dependent upon a degree of matching. The third set of feature vectors may be an archetype set of feature vectors. The method may further comprise comparing
said third score with a third predetermined threshold, thereby determining whether the first set of feature vectors is substantially identical to the archetype set of feature vectors. The method may further comprise comparing said third score with a fourth score, the fourth score being obtained in a previous authentication by comparing a fourth set of feature vectors with said archetype set of feature vectors, and determining whether the third and fourth scores are substantially the same. The method may comprise measuring an interval between said first and second recorded signal portions. The method may further comprise comparing said interval with another interval obtained in a previous enrolment or authentication.
The method may further comprise storing data relating to said interval.
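The core replay check described above can be sketched as follows: two live utterances of the same phrase always differ slightly, so two responses that match *too* well are treated as a replayed recording. The distance measure and the "substantially identical" threshold are illustrative assumptions.

```python
# Sketch of the replay-attack check described above: the user gives the same
# response twice, and if the two featuregrams match *too* well the attempt is
# treated as a replayed recording, since live repetitions always differ a
# little. The distance measure and threshold are illustrative.

def featuregram_distance(fg_a, fg_b):
    return sum(sum((a - b) ** 2 for a, b in zip(va, vb)) ** 0.5
               for va, vb in zip(fg_a, fg_b))

def is_replay(fg_first, fg_second, identical_threshold=0.05):
    """Two responses substantially identical -> likely replay of one recording."""
    return featuregram_distance(fg_first, fg_second) < identical_threshold

live_a = [[1.00, 2.00], [5.00, 6.00]]
live_b = [[1.10, 1.95], [4.90, 6.10]]    # natural variation between utterances
replayed = [[1.00, 2.00], [5.00, 6.00]]  # bit-for-bit repeat of live_a

print(is_replay(live_a, live_b))     # False: normal human variation
print(is_replay(live_a, replayed))   # True: suspiciously identical
```

The further refinements claimed above (comparing against the archetype score from a previous authentication, or checking the interval between the two responses) work the same way: an impostor replaying one recording reproduces those quantities too exactly.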
During authentication, users may occasionally provide an uncharacteristic spoken response.

The present invention seeks to provide an improved method of dealing with uncharacteristic responses.
According to the present invention there is provided a method of voice authentication including requesting an authenticating user to provide a plurality of spoken responses to a prompt, obtaining a plurality of corresponding recorded signals, each recorded signal including a recorded signal portion corresponding to a respective spoken response, deriving a plurality of sets of feature vectors, each set of feature vectors for characterising a respective recorded signal portion, comparing said sets of feature vectors with an archetype set of feature vectors so as to produce a plurality of scores dependent upon a degree of matching and determining whether authentication is successful in dependence upon said plurality of scores.
The method may further include computing an average score from said plurality of scores and comparing said average score with a predefined threshold. The method may include permitting authentication if said average score exceeds the predefined threshold. The method may further include comparing each of said scores with a predefined threshold and determining a number of scores which exceed the predefined threshold. The method may further include permitting authentication if
the number of scores which exceed the predefined threshold exceeds another predefined threshold.
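The two decision rules described above can be sketched as follows. Note one assumption that differs from the claim wording: here a *lower* score means a *closer* match (a distance), so "passing" means falling at or below the threshold; all thresholds are illustrative.

```python
# Sketch of deciding authentication from several responses, as described
# above: either the average score must beat the threshold, or enough of the
# individual scores must. Lower scores mean closer matches here (distances);
# thresholds are illustrative.

def authenticate_by_average(scores, threshold):
    """Accept if the mean score over all responses is within the threshold."""
    return sum(scores) / len(scores) <= threshold

def authenticate_by_count(scores, threshold, required_passes):
    """Accept if enough individual responses score within the threshold,
    so one uncharacteristic response cannot sink the attempt."""
    passes = sum(1 for s in scores if s <= threshold)
    return passes >= required_passes

# Three attempts; one uncharacteristic response (score 9.0) among two good ones.
scores = [1.2, 9.0, 1.5]
print(authenticate_by_average(scores, threshold=4.0))                   # True
print(authenticate_by_count(scores, threshold=2.0, required_passes=2))  # True
```

The count-based rule is the more forgiving of the two for a single outlier, since the uncharacteristic response is simply outvoted rather than dragging the mean.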
If a user provides a spoken response during authentication, a threshold score is usually generated.
The present invention seeks to provide an improved method of determining an authentication threshold score.
According to the present invention there is provided a method of determining an authentication threshold score, the method including requesting a first set of users to provide respective spoken responses to a prompt; for each user, obtaining a recorded signal which includes a recorded signal portion corresponding to the user's spoken response; for each user, deriving a set of feature vectors for characterising the recorded signal portion; for each user, comparing said set of feature vectors with an archetype set of feature vectors for said user so as to produce a score dependent upon a degree of matching; fitting a first probability density function to the frequency of scores for said first set of users; requesting a second set of users to provide respective spoken responses to a prompt; for each user, obtaining a recorded signal which includes a recorded signal portion corresponding to the user's spoken response; for each user, deriving a set of feature vectors for characterising the recorded signal portion; for each user, comparing said set of feature vectors with an archetype set of feature vectors for a different user so as to produce a score dependent upon a degree of matching; and fitting a second probability density function to the frequency of scores for said second set of users.
The method may further comprise integrating said first and second probability density functions to produce first and second respective continuous distribution functions. The method may further comprise determining where said first and second continuous distribution functions cross and, in dependence thereon, determining a threshold score. The method may further comprise storing data relating to said score.
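The threshold-setting idea above can be sketched as follows: fit one density to genuine-user (same-user) scores and another to impostor (different-user) scores, then place the threshold where the two densities cross. Fitting Gaussians and locating the crossing by a coarse grid search are illustrative simplifications, not the patent's exact procedure.

```python
import math

# Sketch of setting the authentication threshold as described above: fit a
# probability density to genuine-user scores and another to impostor scores,
# then place the threshold where the two densities cross. Gaussian fits and a
# coarse grid search are illustrative simplifications.

def fit_gaussian(scores):
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    return mean, math.sqrt(var)

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def crossing_threshold(genuine_scores, impostor_scores, steps=1000):
    g_mean, g_std = fit_gaussian(genuine_scores)
    i_mean, i_std = fit_gaussian(impostor_scores)
    lo, hi = g_mean, i_mean  # genuine scores are lower (closer matches)
    best_x, best_gap = lo, float("inf")
    for k in range(steps + 1):
        x = lo + (hi - lo) * k / steps
        gap = abs(gaussian_pdf(x, g_mean, g_std) - gaussian_pdf(x, i_mean, i_std))
        if gap < best_gap:            # densities closest -> crossing point
            best_x, best_gap = x, gap
    return best_x

genuine = [1.0, 1.2, 0.8, 1.1, 0.9]    # same-user comparison scores
impostor = [5.0, 4.6, 5.4, 5.2, 4.8]   # different-user comparison scores
t = crossing_threshold(genuine, impostor)
print(1.2 < t < 4.6)  # True: threshold lies between the two score groups
```

Scores below the resulting threshold are then accepted as genuine; moving the threshold away from the crossing trades false rejections against false acceptances.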
Many voice recognition and authentication systems use dynamic time warping to match a recording to a template. However, a user may pause, cough, sigh or generate other sounds before or after providing a response to a prompt. These silences or sounds may be included in the recording. Thus, only a portion of the recording is relevant.
The present invention seeks to provide a solution to this problem.
According to the present invention there is provided a method of dynamic time warping for warping a first speech pattern characterised by a first set of feature vectors onto a second speech pattern characterised by a second set of feature vectors, the method comprising identifying a first sub-set of feature vectors within said first set of feature vectors from which a dynamic time warping winning path starts and identifying a second sub-set of feature vectors within said first set of feature vectors at which the dynamic time warping winning path finishes.
The first speech pattern may include speech, background noise and/or silence.
The identifying of the first sub-set of feature vectors may comprise calculating a distance between a first feature vector at the beginning of said second set of feature vectors and each feature vector of said first sub-set of feature vectors within said first set of feature vectors. The method may further comprise entering each distance into an array. The method may further comprise determining whether to calculate a distance between a feature vector of said first set of feature vectors and a feature vector of said second set of feature vectors. The determining whether to calculate the distance may comprise determining whether to calculate a distance between a jth feature vector of said first set of feature vectors which comprises J feature vectors and an ith feature vector of said second set of feature vectors which comprises I feature vectors, where I and J are positive integers, by finding whether a value of j falls between a maximum value and a minimum boundary value for a value i. The method may further comprise calculating a maximum value using:

j(k) >= max[2i(k) - 2I + J, (i(k) - 1)/2 + 1]
wherein k is an index along the dynamic time warping path and k = i. The method may further comprise calculating a minimum boundary value using:

j(k) <= min[2i(k) - 1, (i(k) - I)/2 + J]

wherein k is an index along the dynamic time warping path and k = i. The method may comprise calculating a distance g(i, j) if j falls between the maximum value and the minimum boundary value. The calculating of the distance g(i, j) may comprise using:

g(i, j) = min[ g(i-1, j).h(k) + d(i, j),
               g(i-1, j-1) + d(i, j),
               g(i-1, j-2) + d(i, j) ]

wherein h(k) = 1 if j(k-1) is not equal to j(k-2) and h(k) is infinite otherwise, so that two consecutive horizontal steps are disallowed, and k is an index along the dynamic time warping path and k = i. The method may comprise determining a global distance for a dynamic time warping winning path. The determining of the distance for the dynamic time warping winning path comprises examining distances associated with the second sub-set of feature vectors and searching for a lowest value of distance.
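A minimal, illustrative sketch of the recursion above, assuming Euclidean frame distances and reading the slope constraint as "each step advances i by 1 and j by 0, 1 or 2, with no two consecutive horizontal steps" (this interpretation, and the end-to-end anchoring rather than the patent's sub-set anchoring, are assumptions):

```python
import math

def dtw_distance(a, b):
    """DTW sketch: frame distance d(i, j) is Euclidean; allowed
    predecessors of (i, j) are (i-1, j), (i-1, j-1) and (i-1, j-2),
    and two consecutive horizontal (j unchanged) steps are disallowed,
    bounding the path slope between 1/2 and 2.
    a, b are sequences of feature vectors (sequences of floats)."""
    I, J = len(a), len(b)

    def d(i, j):
        return math.dist(a[i], b[j])

    INF = float("inf")
    # g[j][h] = best cost reaching column j at the current row i,
    # where h = 1 if the step into (i, j) was horizontal, else 0.
    g = [[INF, INF] for _ in range(J)]
    g[0][0] = d(0, 0)
    for i in range(1, I):
        ng = [[INF, INF] for _ in range(J)]
        for j in range(J):
            c = d(i, j)
            # Horizontal step: allowed only if the step into (i-1, j)
            # was itself not horizontal.
            ng[j][1] = g[j][0] + c
            if j >= 1:
                ng[j][0] = min(g[j - 1]) + c
            if j >= 2:
                ng[j][0] = min(ng[j][0], min(g[j - 2]) + c)
        g = ng
    return min(g[J - 1])
```

Identical patterns warp onto each other with zero cost, and a pattern with a repeated frame still reaches zero cost via one horizontal step.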
The present invention seeks to provide a method of averaging a plurality of feature vectors.

According to the present invention there is provided a method of averaging said plurality of feature vectors, the method comprising providing a plurality of sets of feature vectors, comparing each set of feature vectors with each other set of feature vectors so as to produce a respective set of scores dependent upon a degree of matching, searching for a minimum score, and determining whether at least one score is below a predetermined threshold.
The method may further comprise generating a plurality of archetype sets of feature vectors by dynamic time warping each set of feature vectors with each other set of feature vectors, comparing each set of feature vectors with each archetype set of feature vectors so as to produce a respective set of scores dependent upon a degree of matching and arranging said scores in a first array according to set of feature vectors and archetype set of feature vectors. The method may further comprise generating another array by averaging elements within said first array corresponding to said sets of feature vectors. The method may further comprise searching for a maximum value within said another array. The method may further comprise excluding one of said sets of feature vectors if said maximum value within said another array exceeds a predetermined threshold. The excluding of one of said sets of feature vectors may include calculating a variance for each archetype set of feature vectors and excluding one of said sets of feature vectors whose corresponding archetype set of feature vectors has the lowest variance.
According to the present invention there is also provided a computer program for performing the method.
According to the present invention there is also provided apparatus configured to perform the method.
An embodiment of the present invention will now be described, by way of example, with reference to the accompanying drawings in which: Figure 1 shows a voice authentication system 1 for performing a method of voice authentication; Figure 2 is a process flow diagram of a method of voice authentication; Figure 3 is a process flow diagram of a method of enrolment; Figure 4 is a process flow diagram of a method of calibration; Figure 5 is an analog representation of a recorded signal; Figure 6 is a generic representation of a recorded signal; Figure 7 is a digital representation of a recorded signal; Figure 8 illustrates dividing a recorded signal into timeslices; Figure 9 is a process flow diagram of a method of generating a featuregram; Figure 10 illustrates generation of a feature vector; Figure 11 illustrates generation of a featuregram from a plurality of feature vectors; Figure 12 shows first and second endpointing processes; Figure 13 illustrates explicit endpointing; Figure 14 is a process flow diagram of a method of explicit endpointing;
Figure 15 illustrates determination of energy and delta energy values of a timeslice; Figure 16 shows pairing of a stop point with two start points; Figure 17 shows pairing of a stop point with a start point of a preceding section; Figure 18 is a process flow diagram of a method of detecting lip smack; Figure 19 shows pairing of a stop point with an updated start point for removing lip smack; Figure 20 illustrates a dynamic time warping process for word spotting; Figure 21 shows a warping function from a start point to an end point; Figure 22 illustrates a local slope constraint on a warping function; Figure 23 illustrates a global condition imposed on a warping function; Figure 24 is a process flow diagram of a method of finding a minimum distance for an optimised path from a start point to an end point representing matched speech patterns; Figure 25a shows an array following initialization for holding a cumulative distance associated with a path from a start point to an end point; Figure 25b shows an array for holding a cumulative distance associated with a path from a start point to an end point; Figure 25c shows a completed array for holding a cumulative distance associated with a path from a start point to an end point including a winning path; Figure 26 shows a process flow diagram of a method of performing a plurality of sanity checks; Figure 27 illustrates creation of a speech featuregram; Figure 28 illustrates generation of a speech featuregram archetype; Figure 29 is a process flow diagram of a method of generating a speech featuregram archetype; Figure 30 illustrates generation of a featuregram cost matrix; Figure 31 shows a featuregram cost matrix; Figure 32 is a process flow diagram of a method of finding a minimum distance for an optimised path from a start point to an end point representing matched speech patterns; Figure 33 illustrates creation of featuregram archetypes using featuregrams; Figure 34 illustrates generation of a featuregram archetype cost matrix; Figure 35 shows a featuregram archetype cost matrix; Figure 36 shows a probability distribution function; Figure 37 shows a continuous distribution function; Figure 38 shows a voice authentication biometric; Figure 39 is a process flow diagram of a method of authentication; Figure 40 is an analog representation of an authentication recorded signal; Figure 41 illustrates dividing an authentication recorded signal into timeslices; Figure 42 illustrates generation of an authentication feature vector; Figure 43 illustrates generation of an authentication featuregram from a plurality of feature vectors; Figure 44 illustrates generation of endpoints; Figure 45 illustrates comparison of a featuregram archetype with an authentication featuregram; Figure 46 illustrates a featuregram including first and second spoken responses to the same prompt for detecting replay attack; and Figure 47 is a process flow diagram of a method of detecting replay attack.
Voice authentication system 1
Referring to Figure 1, a voice authentication system 1 for performing a method of voice authentication is shown. The voice authentication system 1 limits access by a user 2 to a secure system 3. The secure system 3 may be physical, such as a room or building, or logical, such as a computer system, cellular telephone handset or bank account. The voice authentication system 1 is managed by a system administrator 4.
The voice authentication system 1 includes a microphone 5 into which a user may provide a spoken response and which converts a sound signal into an electrical signal, an amplifier 6 for amplifying the electrical signal, an analog-to-digital (A/D) converter 7 for sampling the amplified signal and generating a digital signal, a filter 8, a processor 9 for performing signal processing on the digital signal and controlling the voice authentication system 1, volatile memory 10 and non-volatile memory 11. In this example, the A/D converter 7 samples the amplified signal at 11025 Hz and provides a mono-linear 16-bit pulse code modulation (PCM) representation of the signal. The digital signal is filtered using a 4th order 100 Hz high-pass filter to remove any d.c. offset.
The system 1 further includes a digital-to-analog (D/A) converter 12, another amplifier 13 and a speaker 14 for providing audio prompts to the user 2 and a display 15 for providing text prompts to the user 2. The system 1 also includes an interface 16, such as a keyboard and/or mouse, and a display 17 for allowing access by the system administrator 4. The system 1 also includes an interface 18 to the secure system 3.
In this embodiment, the voice authentication system 1 is provided by a personal computer which operates software performing the voice authentication process.
Referring to Figure 2, the voice authentication process comprises two stages, namely enrolment (step S1) and authentication (step S2).
The purpose of the enrolment is to obtain a plurality of specimens of speech from a person who is authorised to enrol with the system 1, referred to herein as a "valid user". The specimens of speech are used to generate a reliable and distinctive voice authentication biometric, which is subsequently used in authentication.
A voice authentication biometric is a compact data structure comprising acoustic information-bearing attributes that characterise the way a valid user speaks. These attributes take the form of templates, herein referred to as "featuregram archetypes" (FGAs), which are described in more detail later.
The valid user's voice authentication biometric may also include further information relating to enrolment and authentication. The further information may include data relating to prompts to which a valid user has responded during enrolment, which may take the form of text prompts or equivalent identifiers, the number of prompts to be used during authentication and whether prompts should be presented in a random order during authentication, and other data relating to authentication such as scoring strategy, pass/fail/retry thresholds, the number of acceptable failed attempts and amplifier gain.
Enrolment
Referring to Figure 3, the enrolment process corresponding to step S1 in Figure 2 is shown:

The voice authentication system 1 is calibrated, for example to ensure that a proper amplifier gain is set (step S1.1). Once the system is calibrated, a plurality of spoken responses are recorded (step S1.2). The recordings are characterised by generating so-called "featuregrams", which comprise a set of feature vectors (step S1.3). The recordings are also examined so as to isolate speech from background noise and periods of silence (step S1.4, step S1.5). Checks are performed to ensure that the recorded responses, isolated specimens of speech and featuregrams are suitable for processing (step S1.6). A plurality of speech featuregrams are then generated (step S1.7). Thereafter, an average of some or all of the featuregrams is taken thereby forming a more representative featuregram, namely a featuregram archetype (step S1.8). A pass level is set (step S1.9) and a voice authentication biometric is generated and stored (step S1.10).

Calibration
Referring to Figure 4, the calibration process of step S1.1 (Figure 3) is described in more detail:

One of the purposes of calibration is to set the gain of the amplifier 6 (Figure 1) such that the amplitude of a captured speech utterance is of a predetermined standard. The predetermined standard may specify that the amplitude of the speech utterance peaks at a predetermined value, such as 70% of a full-scale deflection of a recording range. In this example, the A/D converter 7 (Figure 1) is 16 bits wide and so 70% of full-scale deflection corresponds to a signal of 87 dB. The predetermined standard may also specify that the signal has a minimum signal-to-noise ratio, for instance 20 dB, which corresponds to a signal ten times stronger than the background noise.
The gain of the amplifier 6 (Figure 1) is set to the highest value (step S1.1.1) and first and second counters are set to zero (steps S1.1.2 & S1.1.3). The first counter keeps a tally of the number of specimens provided by the valid user. The second counter is used to determine the number of consecutive specimens which meet the predetermined standard.
A prompt is issued (step S1.1.4). In this example, the prompt is randomly selected. This has the advantage that it prevents the valid user from anticipating the spoken response and thereby providing an uncharacteristic or unnatural response, for example one which is unnaturally loud or quiet. The prompt may be a text prompt or an audio prompt. The valid user may be prompted to say a single word, such as "thirty-four", or a phrase, such as "My voice is my pass phrase". In this example, the prompts comprise numbers. Preferably, the numbers are chosen from a range between 21 and 99. This has the advantage that the spoken utterance is sufficiently long and complex so as to include a plurality of features.
A speech utterance is recorded (step S1.1.5). This comprises the user providing a spoken response which is picked up by the microphone 5 (Figure 1), amplified by the amplifier 6 (Figure 1), sampled by the analog-to-digital (A/D) converter 7 (Figure 1), filtered and stored in volatile memory 10 (Figure 1) as the recorded signal. The processor 9 (Figure 1) calculates the power of the speech utterance included in the recorded signal and analyses the result.
The signal-to-noise ratio is determined (step S1.1.6). If the signal-to-noise ratio is too low, for example less than 20 dB, then the spoken response is too quiet and the corresponding signal generated is too weak, even at the highest gain. The user is informed of this fact (step S1.1.7) and the calibration stage ends. Otherwise, the process continues.
The signal level is determined (step S1.1.8). If the signal level is too high, for example greater than 87 dB, which corresponds to the 95th percentile of the speech utterance energy being greater than 70% of the full-scale deflection of the A/D converter 7 (Figure 1), then the spoken response is too loud and the corresponding signal generated is too strong. If the gain has already been reduced to its lowest value, then the signal is too strong, even at the lowest gain (step S1.1.9). The user is informed of this fact and calibration ends (step S1.1.10). Otherwise, the gain of the amplifier 6 is reduced (step S1.1.11). The gain may be reduced by a fixed amount regardless of signal strength. Alternatively, the gain may be reduced by an amount dependent on signal strength. This has the advantage of obtaining an appropriate gain more quickly. The fact that a specimen spoken response has been taken is noted by incrementing the first counter by one (step S1.1.12). The second counter is reset (step S1.1.13).
If too many specimens have been taken, for example 15, then calibration ends (step S1.1.14). Otherwise, the process returns to step S1.1.4, wherein the user is prompted, and the process of recording, calculating and analysing is repeated.
If, at step S1.1.8, the signal level is not too high, then the spoken response is considered to be satisfactory, i.e. neither too loud nor too quiet. Thus, the recorded signal falls within an appropriate range of values of signal energy. The fact that a specimen spoken response has been taken is recorded by incrementing the first counter by one (step S1.1.16). The fact that the specimen is satisfactory is also recorded by incrementing the second counter by one (step S1.1.17). The gain remains unchanged.
If a predetermined number of consecutive specimens are taken without a change in gain, then calibration is successfully terminated and the gain setting of the amplifier 6 (Figure 1) is stored (step S1.1.18 & S1.1.19). In this example, the gain setting is stored in the voice authentication biometric.
Additional steps may be included. For example, once a settled value of gain is achieved at step S1.1.18, then the signal level is measured a final time. If the signal level is too low, then calibration ends without the gain setting being stored.
The calibration process allows decreases, but not increases, in gain. This has the advantage of preventing infinite loops in which the gain fluctuates without reaching a stable setting.
Alternatively, the calibration process may be modified to start at the lowest gain and allow increases, but not decreases, in gain. Thus, if the signal strength is too low, for example below a predetermined limit, then the gain is increased. Once a settled value of gain has been achieved, then the signal level may be measured a final time to determine whether it is too high.
The calibration process may include a further check of signal-to-noise ratio. For example, once a settled value of gain has been determined, then the peak signal-to-noise ratio of the signal is measured. If the signal-to-noise ratio exceeds a predetermined level, such as 20 dB, then the gain setting is stored. Otherwise, the user is instructed to repeat calibration in a quieter environment, move closer to the microphone or speak with a louder voice.
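As a rough illustration of the signal-to-noise check: the patent does not specify the exact estimator, so an RMS amplitude ratio is assumed here, under which 20 dB corresponds to a signal ten times stronger than the noise.

```python
import numpy as np

def snr_db(signal, noise):
    """Rough SNR estimate in dB between a recorded utterance and a
    background-noise reference (RMS amplitude ratio assumed).
    20 dB corresponds to a signal ten times stronger than the noise."""
    rms_s = np.sqrt(np.mean(np.square(signal, dtype=np.float64)))
    rms_n = np.sqrt(np.mean(np.square(noise, dtype=np.float64)))
    return 20.0 * np.log10(rms_s / rms_n)
```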
Referring again to Figure 3, during enrolment, the voice authentication system 1 records one or more spoken responses (step S1.2). This may occur during calibration at step S1.1. Additionally or alternatively, a separate recording stage may be used.
During enrolment, the voice authentication system 1 asks the user to provide a spoken response. Preferably, the system prompts the user a plurality of times. Four types of prompt may be used:

In a first type, the prompt comprises a request for a single word, for example "Say 81". Preferably, the user is asked to repeat the word. The user may be asked to repeat the word once so as to obtain two specimens of the spoken response. The user may be asked to repeat the word more than once so as to obtain multiple examples.
In a second type, the prompt comprises a request for a single phrase, for example "Say 'My voice is my pass phrase'". Preferably, the user is asked to repeat the phrase.
The user may be asked to repeat the phrase once or more than once.
In a third type, the prompt may comprise a challenge requesting personal information, such as "What is your home telephone number?". The valid user provides a spoken response which includes the personal information. This type of prompt is referred to as a "challenge-response". This type of prompt has the advantage of increasing security. During subsequent authentication, an impostor must know or guess what to say as well as attempt to say the spoken response in the correct manner. For example, a valid user may pronounce digits in different ways, such as pronouncing "10" as "ten", "one, zero", "one, nought" or "one-oh", and/or pause while saying a string of numbers, such as reciting "12345678" as "12-34-56-78" or "1234-5678".
In a fourth type, the prompt may comprise a cryptic challenge-response, such as "NOD?". For example, "NOD" may signify "Name of dog?". Preferably, the cryptic challenge is specified by the user. This type of prompt has the advantage of increasing security since the prompt is meaningful only to the valid user. It offers few clues as to what the spoken response should be.
A set of prompts may be common to all users. Alternatively, a set of prompts may be randomly selected on an individual basis. If the prompts are chosen randomly, then a record of the prompts issued to each user is stored in the voice authentication biometric, together with corresponding data generated from the spoken response. Preferably, this information is used during the authentication stage to ensure that only prompts responded to by the valid user are issued and that appropriate comparisons are made with corresponding featuregram archetypes.
Preferably, the administrator 4 (Figure 1) determines the type and number of prompts used during enrolment and authentication.
Recording
Referring again to Figure 1, a spoken response is recorded by the microphone 5, amplified by the amplifier 6 and sampled using the A/D converter 7 at 11025 Hz to provide a 16-bit PCM digital signal. The duration of the recording may be fixed. Preferably, the recording lasts between 2 and 3 seconds. The signal is then filtered to remove any d.c. component. The signal may be stored in volatile memory 10.
Referring to Figures 5, 6 and 7, an example of a recorded signal 19 is shown in analog, generic and digital representations.
Referring particularly to Figure 5, the recorded signal 19 may comprise one or more speech utterances 20, one or more background noises 21 and/or one or more silence intervals 22. A speech utterance 20 is defined as a period in a recorded signal 19 which is derived solely from the spoken response of the user. A background noise 21 is defined as a period in a recorded signal arising from audible sounds, but not originating from the speech utterance. A silence interval 22 is defined as a period in a recorded signal which is free from background noise and speech utterance.
As explained earlier, the purpose of the enrolment is to obtain a plurality of specimens of speech so as to generate a voice authentication biometric. To help achieve this, recorded responses are characterised by generating "featuregrams" which comprise sets of feature vectors. The recordings are also examined so as to isolate speech from background noise and silences.
If the recordings are known to contain specific words, then they are searched for those words. This is known as "word spotting". If there is no prior knowledge of the content of the recordings, then the recordings are inspected to identify spoken utterances. This is known as "endpointing". By identifying speech utterances using one or both of these processes, a speech featuregram may be generated which corresponds to portions of the recorded signal comprising speech utterances.
Referring to Figure 8, a portion 19' of the recorded signal 19 is shown. The recorded signal 19 is divided into frames, referred to herein as timeslices 23. The recorded signal 19 is divided into partially-overlapping timeslices 23 having a predetermined period. In this example, the timeslices 23 have a period of 50 ms and overlap by 50%, i.e. successive timeslices are offset by 25 ms.
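The timeslicing described above can be sketched as follows. The 551-sample default is simply 50 ms at the 11025 Hz sampling rate and is an assumption for illustration, not a figure from the patent:

```python
def timeslices(samples, period=551, overlap=0.5):
    """Split a sampled signal into partially-overlapping frames.
    At 11025 Hz, a 50 ms timeslice is about 551 samples; a 50%
    overlap advances the window by half a slice (~25 ms)."""
    step = int(period * (1 - overlap))
    return [samples[i:i + period]
            for i in range(0, len(samples) - period + 1, step)]
```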
Featuregram generation
Referring to Figures 9, 10 and 11, a process by which a featuregram is generated will be described in more detail:

The recorded signal 19 is divided into frames, herein referred to as timeslices 23 (step S1.3.1). Each timeslice 23 is converted into a feature vector 24 using a feature transform 25 (step S1.3.2).
The content of the feature vector 24 depends on the transform 25 used. In general, a feature vector 24 is a one-dimensional data structure comprising data related to acoustic information-bearing attributes of the timeslice 23. Typically, a feature vector 24 comprises a string of numbers, for example 10 to 50 numbers, which represent the acoustic features of the signal comprised in the timeslice 23.
In this example, a so-called mel-cepstral transform 25 is used. This transform 25 is suitable for use with a 32-bit fixed-point microprocessor. A mel-cepstral transform 25 is a cosine transform of the real part of a logarithmic-scale energy spectrum. A mel is a measure of perceived pitch or frequency of a tone by a human auditory system. Thus, in this example, for a sampling rate of 11025 Hz, each feature vector 24 comprises twelve signed 8-bit integers, typically representing the second to thirteenth calculated mel-cepstral coefficients. Data relating to energy (in dB) may be included as a 13th feature. This has the advantage of helping to improve the performance of a word spotting routine that would otherwise operate on the feature vector coefficients alone.
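A textbook-style mel-cepstral computation for one timeslice is sketched below for illustration. It is not the patent's fixed-point implementation: the filterbank size, Hamming windowing and floating-point arithmetic are all assumptions.

```python
import numpy as np

def mel_cepstrum(frame, rate=11025, n_filters=20, n_coeffs=12):
    """Illustrative mel-cepstral transform for one timeslice:
    power spectrum -> triangular mel filterbank -> log -> DCT,
    keeping the 2nd..13th coefficients as described above."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2

    # Mel-spaced triangular filterbank edges.
    def mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def imel(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = imel(np.linspace(0.0, mel(rate / 2), n_filters + 2))
    bins = np.floor((len(frame) + 1) * edges / rate).astype(int)

    fbank = np.zeros(n_filters)
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, hi):
            if k < len(spec):
                w = ((k - lo) / max(mid - lo, 1) if k < mid
                     else (hi - k) / max(hi - mid, 1))
                fbank[i] += w * spec[k]
    logfb = np.log(fbank + 1e-10)

    # DCT-II of the log filterbank energies.
    n = np.arange(n_filters)
    cep = np.array([np.sum(logfb * np.cos(np.pi * q * (2 * n + 1)
                    / (2 * n_filters))) for q in range(14)])
    return cep[1:1 + n_coeffs]  # 2nd..13th coefficients
```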
The transform 25 may also calculate first and second differentials, referred to as "delta" and "delta-delta" values.
Further details regarding mel-cepstral transforms may be found in "Fundamentals of Speech Recognition" by Rabiner & Juang (Prentice Hall, 1993).

Other transforms may be used. For example, a linear predictive coefficient (LPC) transform may be used in conjunction with a regression algorithm so as to produce LPC cepstral coefficients. This transform is suitable for use with a 16-bit microprocessor. Alternatively, a TESPAR transform may be used.
The linear predictive coefficient (LPC) transform is described by B.S. Atal, "Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification", Journal of the Acoustical Society of America, Vol. 55, pp. 1304-1312, June 1974. Further details regarding the TESPAR transform may be found in GB-B-2162025.
Referring to Figure 11, a featuregram 25 comprises a set or concatenation of feature vectors 24. The featuregram 25 includes speech utterances, background noise and silence intervals.
Endpointing
Endpointing seeks to identify portions of a recorded signal which contain spoken utterances. This allows generation of speech featuregrams which characterise the spoken utterances.
Referring to Figure 12, two methods of endpointing may be used, namely explicit endpointing (step S1.4) and dynamic time warping (DTW) word spotting (step S1.5).
Explicit Endpointing
Explicit endpointing seeks to locate approximate endpoints of a speech utterance in a particular domain without using any a priori knowledge of the words that might have been spoken. Explicit endpointing tracks changes in signal energy profile over time and frequency and makes boundary decisions based on general assumptions regarding the nature of profiles that are indicative of speech and those that are representative of noise or silence. Explicit endpointing cannot easily distinguish between speech spoken by the enrolling user and speech prominently forming part of background noise. Therefore, it is desirable that no-one else speaks in close proximity to the valid user when enrolment takes place.
Referring to Figure 13, an explicit endpointing process 27 generates a plurality of pairs 28 of possible start and stop points for a stream of timeslices 23. The advantage of generating a plurality of endpoints is that the true endpoints are likely to be identified. However, a drawback is that if too many endpoint combinations are identified, then the system response time is adversely affected. Therefore, a trade-off is sought between the number of potential endpoint combinations identified and the response time required.
Explicit endpointing is suitable for both fixed and continuous recording environments, although it is mainly intended for use with isolated word or isolated phrase recognition systems.
Referring to Figure 14, an explicit endpointing process is shown in more detail:

A check is made whether initialization is needed, whereby background noise energy is measured (step S1.4.A). If so, a background noise signal is recorded (step S1.4.B), divided into timeslices (step S1.4.C) and a background energy value is calculated (step S1.4.D).
After initialization, or if no initialization is needed, a signal is recorded and divided into a plurality of timeslices 23 (Figure 8) (step S1.4.1). A first counter, i, for keeping track of which timeslice 23i is currently being processed is set to one (step S1.4.2). A second counter, for counting the number of consecutive timeslices 23 which represent background noise, is set to zero (step S1.4.3). A "word" flag is set to zero to represent that the current timeslice 23i does not represent a spoken utterance portion, such as a word portion (step S1.4.4).
Referring also to Figure 15, the energy of the current timeslice 23i is calculated (step S1.4.5).
Preferably, a plurality of timeslices 23 are used to calculate a value of energy for the current timeslice 23i. The timeslices 23 are comprised in a window 29. In this example, five timeslices 23i-2, 23i-1, 23i, 23i+1, 23i+2 are used to calculate a value of energy of the ith timeslice 23i.
A time encoded speech processing and recognition (TESPAR) coding process 30 is used to calculate an energy value for each timeslice 23i-2, 23i-1, 23i, 23i+1, 23i+2. This comprises taking each timeslice 23i-2, 23i-1, 23i, 23i+1, 23i+2 and dividing it into a plurality of so-called "epochs" according to where the signal magnitude changes from positive to negative and vice versa. These points are known as "zero crossings". An absolute value of a peak magnitude for each epoch is used to calculate an average energy for each timeslice 23i-2, 23i-1, 23i, 23i+1, 23i+2. Thus, five energy values are obtained from which a mean energy value 31 is calculated.
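The epoch-based energy estimate described above can be sketched as follows (a simple reading of the TESPAR-style scheme, not the patented coder itself):

```python
def epoch_energy(timeslice):
    """Split the timeslice into epochs at zero crossings, take the
    absolute peak magnitude of each epoch, and average the peaks."""
    peaks, peak = [], 0.0
    prev_sign = timeslice[0] >= 0
    for x in timeslice:
        sign = x >= 0
        if sign != prev_sign:          # zero crossing: close the epoch
            peaks.append(peak)
            peak = 0.0
            prev_sign = sign
        peak = max(peak, abs(x))
    peaks.append(peak)                 # final (unterminated) epoch
    return sum(peaks) / len(peaks)
```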
A description of the TESPAR coding process is given in GB-A-2020517.
A delta energy value 32 indicative of changes in the five energy values is also calculated (step S1.4.6). In this example, the delta energy value 32 is calculated by performing a smoothed linear regression calculation using the energy values for the timeslices 23i-2, 23i-1, 23i, 23i+1, 23i+2. The delta energy value 32 represents the gradient of a straight line fitted to the values of energy. Thus, large changes in the energy values result in a large value of the delta energy value 32.
The values 31, 32 of energy Ei and delta energy ΔEi are used to determine whether the ith timeslice 23 represents a spoken utterance.
Referring again to Figure 14, if the energy 31 of an ith timeslice 23 is equal to or greater than a first threshold, which is a first predetermined multiple of the background noise energy, i.e. Ei >= k1 x E0 (step S1.4.7), and the delta energy 32 is equal to or greater than a second threshold, which is a second predetermined multiple of the background delta energy, i.e. ΔEi >= k2 x ΔE0 (step S1.4.8), then the ith timeslice 23 is considered to form part of a word. The timeslice 23i is said to form part of a voiced or energetic section. In this example, k1 = 2 and k2 = 3.
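The voiced-section test of steps S1.4.7 and S1.4.8, together with the word-flag bookkeeping that follows, can be sketched as follows (simplified: gap merging, unvoiced sections and lip-smack removal are omitted):

```python
def find_words(energies, deltas, bg_e, bg_d, k1=2.0, k2=3.0):
    """Scan per-timeslice energy/delta-energy values and emit
    (start, stop) index pairs for voiced sections. A timeslice is
    voiced when its energy is at least k1 times the background
    energy AND its delta energy is at least k2 times the background
    delta energy (k1 = 2, k2 = 3 in the example above)."""
    words, start, in_word = [], None, False
    for i, (e, de) in enumerate(zip(energies, deltas)):
        voiced = e >= k1 * bg_e and de >= k2 * bg_d
        if voiced and not in_word:
            start, in_word = i, True      # word start point
        elif not voiced and in_word:
            words.append((start, i))      # word stop point
            in_word = False
    if in_word:                           # word runs to end of signal
        words.append((start, len(energies)))
    return words
```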
If the word flag is not set to one, representing that the previous timeslice 23 was background noise (step S1.4.9), then the current timeslice 23i is considered to be the beginning of a word (step S1.4.10). Thus, the word flag is set to one (step S1.4.11).
If, at step S1.4.9, the word flag is set to one, then the beginning of the word has already been detected and so the current timeslice 23i is located within an existing word (step S1.4.12).
The first counter, i, is incremented by one (step S1.4.13) and the process returns to step S1.4.5 where the energy of the new ith timeslice 23i is calculated.
If the energy value 31 falls below the first threshold at step S1.4.7 or the delta energy value 32 falls below the second threshold at step S1.4.8, then it is determined whether there is a stop point, and if so with which start point or start points it could be paired.
If the word flag is set to one (step S1.4.14), then the current timeslice 23i is considered to be a stop point.
The stop point may be paired with one or more start points, as will now be explained. Referring to Figure 16, first and second sections 33, 34 are separated by a gap 35.
The first section 33 includes a first start point 36 and a first stop point 37. The second section 34 has a second start point 362. According to step S1.4.7 or S1.4.8 and step S1.4.14, a second stop point 372 is found. The second stop point 372 may be paired with the second start point 362, so identifying the second section 34 as a word. However, the second stop point 372 may also be paired with the first start point 36. Thus, the first start point 36 and the second stop point 372 may define a larger word 38 which includes both the first and second sections 33, 34. Therefore, it is desirable to determine the duration of the gap 35 between the first stop point 37 and the second start point 362. If the gap 35 is sufficiently short, then an additional pairing is made and the additional word 38 is identified. This has the advantage of identifying a greater number of candidates and thus increasing the chances of correctly identifying a word.
Referring again to Figure 14, a check is made as to whether the start point preceding the current stop point occurs within ten timeslices 23 of the stop point of the previous word (step S1.4.15). If it does not, then the current stop point is paired with only the preceding start point, thereby identifying a single word (step S1.4.16).
If the start point is within ten timeslices 23 of the stop point, then the current stop point is paired with both the start point of the current section (step S1.4.17) and the start point of the preceding word (step S1.4.18), thereby identifying two potential words.
The second counter is reset to zero (step S1.4.19), the word flag is set to zero (step S1.4.20) and the first counter is incremented by one (step S1.4.21) before returning to step S1.4.5.
If, at step S1.4.14, the word flag is not set to one, then a further check is made as to whether the current timeslice 23j may be considered to be the start point of an unvoiced or unenergetic section, hereinafter referred to simply as an unvoiced section. If the energy 31 of the jth timeslice 23j is equal to or greater than a third threshold, which is lower than the first and is a third predetermined multiple of background noise energy, i.e. Ej ≥ k3 × E0 (step S1.4.22), and the delta energy 32 is equal to or greater than a fourth threshold, which is lower than the second and is a fourth predetermined multiple of background delta energy, i.e. ΔEj ≥ k4 × ΔE0 (step S1.4.23), and provided that the timeslice 23j is found within ten timeslices of the previous stop point (step S1.4.24), then the jth timeslice 23j is considered to be the start point of an unvoiced section (step S1.4.25). In this example, k3 = 1.25 and k4 = 2.
The extent of the unvoiced section is determined by incrementing the first counter j (step S1.4.26), calculating values 31, 32 of energy and delta energy (steps S1.4.27 & S1.4.28) and determining whether the energy 31 of the current timeslice 23j exceeds a fifth threshold corresponding to a fifth predetermined multiple of background noise energy, i.e. Ej ≥ k5 × E0 (step S1.4.29). In this case, k5 = k3 = 1.25. Provided that the energy 31 of the current timeslice 23j exceeds the fifth threshold, the timeslice 23j is identified as being part of the unvoiced section (step S1.4.30).
If the energy value 31 falls below the third threshold at step S1.4.22, or the delta energy value 32 falls below the fourth threshold at step S1.4.23, then the current timeslice 23j is deemed to represent background noise (step S1.4.31).
The values of background noise energy and delta background noise energy are updated using the current timeslice 23j. In this case, a weighted average is taken using 95% of the background noise energy E0 and 5% of the timeslice energy Ej (step S1.4.32). Similarly, a weighted average is taken using 95% of the delta background noise energy ΔE0 and 5% of the delta energy ΔEj (step S1.4.33). The second counter is incremented by one (step S1.4.34).
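The 95%/5% update described above amounts to exponential smoothing of the noise estimates; the function name and parameterisation below are illustrative:

```python
def update_noise(e0, de0, e_j, de_j, alpha=0.95):
    """Exponentially weighted update of the background-noise estimates:
    alpha (95%) of the old estimate plus (1 - alpha) (5%) of the
    current timeslice's energy and delta energy."""
    new_e0 = alpha * e0 + (1 - alpha) * e_j
    new_de0 = alpha * de0 + (1 - alpha) * de_j
    return new_e0, new_de0

# A louder-than-expected noise timeslice nudges the estimates upward.
e0, de0 = update_noise(100.0, 10.0, 200.0, 30.0)
print(e0, de0)  # approximately 105.0 and 11.0
```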
A check is made to see whether an isolated word has been found (step S1.4.35). If a sufficiently long period of background noise is identified, for example by counting twenty timeslices after the end of a word, which corresponds to 0.5 seconds of silence (step S1.4.36), then it is reasonable to assume that the last stop point represents the end of an isolated word. If an isolated word is found, then pairing of possible start and stop points may be terminated. Otherwise, searching continues by returning to step S1.4.5.
If, at step S1.4.30, the energy 31 of the timeslice 23j falls below the fifth threshold, then a stop point of an unvoiced section is identified (step S1.4.36). The stop point is associated with the start point of the preceding word (step S1.4.37) and the first counter, j, is incremented by one (step S1.4.38).

Referring to Figure 17, a first section 39 precedes a second section 40 and has a first start point 411 and a first stop point 421. A second stop point 422 is found in the second section 40 according to step S1.4.36. The second stop point 422 is paired with the first start point 411. Thus, the first start point 411 and the second stop point 422 may define a word 43 which includes both the first and second sections 39, 40.
Thus, two types of stop points may be identified. The stop point may be an end point of a voiced section, such as the "le" in "left", or a stop point of an unvoiced section, such as the "t" in "left".
Referring to Figure 18, a process for finding and removing extraneous noises, such as lip smacks, and generating an additional pair of endpoints is shown. When a stop point is located at step S1.4.16 or S1.4.17 in a voiced section, the current start point is located (step S1.4.39). First and second pointers p, q are set to the start point (steps S1.4.40 & S1.4.41). The first index p points to an updated start point. The second index q keeps track of which timeslice is currently being examined.
The delta energy of a current timeslice 23q is compared with the delta energy of a succeeding timeslice 23q+1 (step S1.4.42). If the delta energy of the current timeslice 23q is greater than the delta energy of the succeeding timeslice 23q+1, then the delta energy of the succeeding timeslice 23q+1 is compared with the delta energy of a second succeeding timeslice 23q+2 (step S1.4.43). If the delta energy of the succeeding timeslice 23q+1 is greater than the delta energy of the second succeeding timeslice 23q+2, then the start point is updated by incrementing the first index p by one (step S1.4.44). A check is made to see whether the updated start point and the stop point are separated by at least three timeslices (step S1.4.45). If not, then the process terminates without generating an additional pair of endpoints including an updated start point.
If, at either step S1.4.42 or S1.4.43, the delta energy of the current timeslice 23q is less than the delta energy of the succeeding timeslice 23q+1, or the delta energy of the succeeding timeslice 23q+1 is less than the delta energy of the second succeeding timeslice 23q+2, then the process terminates and generates an additional pair of endpoints including an updated start point.
Referring to Figures 19a and 19b, the effect of the process for finding and removing extraneous noises is illustrated.

Figure 19a shows a voiced section 44 having a pair of start and stop points 45, 46.
Figure 19b shows the voiced section 44 after the process has identified a section portion 47 comprising a lip smack. Another pair of start and stop points 48, 49 are generated.

Preferably, explicit endpointing is performed in real-time. This has the advantage that it may be determined whether or not a timeslice 23 corresponds to a spoken utterance, i.e. whether a portion of the recorded signal currently being processed corresponds to part of a word. If so, a featuregram is generated. If not, a featuregram need not be generated. Processing resources may be better put to use, for example by generating a template (if in the training mode) or performing a comparison (if in the real-time live interrogation mode).
-Word spotting-
Word spotting seeks to locate endpoints of a speech utterance in a particular domain using a priori knowledge of the words that should have been spoken as a guide. The a priori knowledge is typically presented as a speaker-independent featuregram archetype (FGA) generated from speech utterances of the word or phrase being sought that have previously been supplied by a wide range of representative speakers. The featuregram archetype may include an energy term.
Referring to Figure 20, a dynamic time warping process 50, herein referred to as DTWFlex, is used. The process 50 compares a featuregram 51 derived from the recorded signal 19 (Figure 5) with a speaker-independent featuregram archetype 52 representing a word or phrase being sought. This is achieved by compressing and/or expanding different sections of the featuregram 51 until a region inside the featuregram 51 matches the speaker-independent featuregram archetype 52. The best fit is known as the winning path and the endpoints 28' of the winning path are output.
One advantage of word spotting is that it delivers more accurate endpoints than those produced by explicit endpointing, particularly when heavy non-stationary background noise is present. If word spotting is used during enrolment, users are asked to respond to fixed-word or fixed-phrase prompts for which speaker-independent featuregram archetypes have been prepared in advance. It is difficult to use word spotting in conjunction with challenge-response prompts, particularly if spoken responses cannot be easily anticipated. Thus, it is preferable to use explicit endpointing when using challenge-response prompts.
An outline of a word spotting process will now be described. First and second speech patterns A, B may be expressed as sequences of first and second respective sets of feature vectors ai, bj, wherein:

A = a1, a2, ..., ai, ..., aI (1a)

B = b1, b2, ..., bj, ..., bJ (1b)

Each respective vector ai, bj represents a fixed period of time.
Referring to Figure 21, a dynamic time warping process seeks to eliminate timing differences between the first and second speech patterns A, B. The timing differences may be illustrated using an i-j plot, wherein the first speech pattern A is developed along an i-axis 53 and the second speech pattern B is developed along a j-axis 54.
The timing differences between the first and second speech patterns A, B may be represented by a sequence F, wherein:

F = c(1), c(2), ..., c(k), ..., c(K) (2)

where c(k) = (i(k), j(k)). The sequence F may be considered to represent a function which approximately maps the time axis of the first speech pattern A onto that of the second speech pattern B. The sequence F is referred to as a warping function 55. When there is no timing difference between the first and second speech patterns A, B, the warping function 55 coincides with a diagonal line j = i, indicated by reference number 56. As the timing differences grow, the warping function 55 increasingly deviates from the diagonal line 56.
A Euclidean distance, d, is used to measure the timing difference between a pair of time points in the form of feature vectors ai, bj, wherein:

d(c) = d(i, j) = ||ai − bj|| (3)

However, other distances may be used to measure the timing difference, such as the Manhattan distance. A weighted sum of distances along the warping function 55 is calculated using:

E(F) = Σk d(c(k))·w(k) (4)

where w(k) is a positive weighting coefficient. E(F) reaches a minimum value when the warping function 55 optimally adjusts the timing differences between the first and second speech patterns A, B. The minimum value may be considered to be a distance between the first and second speech patterns A, B once the timing differences between them have been eliminated, and is expected to be stable against time-axis fluctuation. Based on these considerations, a time-normalised distance D between the first and second speech patterns A, B is defined as:

D(A, B) = minF [ Σk=1..K d(c(k))·w(k) / Σk=1..K w(k) ] (5)

where the denominator Σk w(k) compensates for the number of points on the warping function 55.
Two conditions are imposed on the speech patterns A, B. Firstly, the speech patterns A, B are time-sampled with a common and constant sampling period. Secondly, there is no a priori knowledge about which parts of the speech patterns contain linguistically important information. In this case, each part of the speech pattern is considered to have an equal amount of linguistic information.
As explained earlier, the warping function 55 is a model of time-axis fluctuations in a speech pattern. Thus, the warping function 55, when viewed as a mapping function from the time axis of the second speech pattern B onto that of the first speech pattern A, preserves linguistically important structures in the second speech pattern B time axis, and vice versa. In this example, important speech pattern time axis structures include continuity, monotonicity and limitation on acoustic parameter transition speed in speech.
In this example, asymmetric time warping is used, wherein the weighting function w(k) is dependent on i but not j. This condition is realised using the following restrictions on the warping function 55.

Firstly, a monotonic condition is applied, wherein:

i(k−1) ≤ i(k) and j(k−1) ≤ j(k) (6)

The monotonic condition specifies that the warping function 55 does not turn back on itself. Secondly, a continuity condition is imposed, wherein:

i(k) − i(k−1) = 1 and j(k) − j(k−1) ≤ 2 (7)

The continuity condition specifies that the warping function 55 advances a predetermined number of steps at a time. As a result of these two conditions, the following relation holds between two consecutive points, namely:

c(k−1) = (i(k)−1, j(k)), or (i(k)−1, j(k)−1), or (i(k)−1, j(k)−2) (8)

Boundary conditions are set such that the warping function 55 starts at (1, 1) and ends at (I, J), i.e.:

i(1) = 1, j(1) = 1, and i(K) = I, j(K) = J (9)

A local slope constraint condition is also imposed. This defines a relation between consecutive points on the warping function 55 and places limitations on possible configurations. In this example, the Itakura condition is used.
Referring to Figure 22, if a point 571 moves forward in the i-direction but not in the j-direction, then the next point 572 cannot move again in the i-direction without also moving in the j-direction. This condition, combined with the monotonicity and continuity conditions, imposes a maximum slope of 2 and a minimum slope of 0.5 on the warping function F. In other words, the second speech pattern B may be maximally compressed or expanded by a factor of 2 in order to time-align it with the first speech pattern A.

Referring to Figure 23, the above conditions effectively constrain the possible warping function 55 to a region in the time axis bounded by a parallelogram 58, which is referred to as the "legal" search region. The legal search region conforms to the following conditions:

j(k) ≥ max[2i(k) − 2I + J, (i(k) + 1)/2] (10)

and

j(k) ≤ min[2i(k) − 1, (i(k) − I)/2 + J] (11)
Thus, j may take a maximum value 58max and a minimum value 58min for a particular value of i.
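The parallelogram bounds of equations 10 and 11 can be sketched as follows; the function name and the rounding to integer indices are illustrative assumptions:

```python
import math

def legal_bounds(i, I, J):
    """Bounds 58min/58max of the 'legal' parallelogram at column i
    (equations 10 and 11), for a path from (1, 1) to (I, J) with
    slopes between 0.5 and 2. Indices are 1-based as in the text."""
    j_min = max(2 * i - 2 * I + J, math.ceil((i + 1) / 2))
    j_max = min(2 * i - 1, math.floor(i / 2 - I / 2 + J))
    return j_min, j_max

# At the endpoints the region collapses onto (1, 1) and (I, J).
print(legal_bounds(1, 10, 10))   # (1, 1)
print(legal_bounds(10, 10, 10))  # (10, 10)
```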
The weighting coefficient is also restricted. If the denominator in equation (5) is independent of the warping function, then:

N = Σk w(k) (12)

where N is the normalisation coefficient. Equation 5 may then be simplified and re-written as:

D(A, B) = (1/N) minF [ Σk d(c(k))·w(k) ] (13)

The time-normalised distance D may be solved using standard dynamic programming techniques. The aim is to find the cost of the shortest path.
In this example, an asymmetric weighting function w(k) is used, namely:

w(k) = i(k) − i(k−1) (14)

The use of an asymmetric weighting function simplifies the normalisation coefficient N of equation 12, such that:

N = I (15)

where I is the length of speech pattern A.

An algorithm for solving equation 13 comprises defining an array g for holding the lowest-cost path to each point and initialising it such that:
g1(c(1)) = d(c(1))·w(1) (16)

In other words, the lowest cost to the first point is the distance between the first two elements multiplied by the weighting factor. For a symmetric weighting factor w(1) = 2, while for an asymmetric weighting factor w(1) = 1.
The algorithm comprises calculating gk(i, j) for each row i and column j, wherein:

gk(c(k)) = min over c(k−1) [ gk−1(c(k−1)) + d(c(k))·w(k) ] (17)

The solution for the time-normalised distance D(A, B) is given by:

D(A, B) = (1/N) gK(c(K)) (18)

The asymmetric weighting coefficient w(k) of equation 14 may be substituted into equation 17, wherein w(1) = 1.
Thus, the algorithm defined by equations 16 to 18 is simplified and comprises defining an array g for holding the lowest-cost path to each point and initialising it such that:

g(1, 1) = d(1, 1) (19)

In other words, the lowest cost to the first point is the distance between the first two elements.
The algorithm comprises calculating g(i, j) for each row i and column j, wherein:

g(i, j) = min[ g(i−1, j)·h(k) + d(i, j), g(i−1, j−1) + d(i, j), g(i−1, j−2) + d(i, j) ] (20)

where

h(k) = ∞ if j(k−1) = j(k−2), and h(k) = 1 otherwise (21)

so that two consecutive horizontal moves are forbidden. The algorithm further comprises applying the following global conditions, namely:

j ≥ max[2i − 2I + J, (i + 1)/2] (22)

and

j ≤ min[2i − 1, (i − I)/2 + J] (23)

Thus, the solution for the time-normalised distance D(A, B) is given by:

D(A, B) = (1/I) g(I, J) (24)

An algorithm based on equations 19 to 24 may be used to obtain a score when comparing speech utterances of substantially the same length. For example, the algorithm is used when comparing featuregram archetypes, which is described in more detail later.
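The recursion of equations 19 to 21 can be sketched as follows. This is an illustrative implementation, not the patent's exact routine: it uses a Manhattan distance, records whether each cell was reached by a horizontal move in order to enforce the Itakura condition, and omits the parallelogram pruning of equations 22 and 23:

```python
import math

def dtw_distance(A, B):
    """Time-normalised distance between two sequences of feature
    vectors using the asymmetric recursion: each step advances one
    column in A, j may advance by 0, 1 or 2, and a horizontal move
    (same j) may not follow another horizontal move."""
    I, J = len(A), len(B)
    d = lambda i, j: sum(abs(a - b) for a, b in zip(A[i], B[j]))
    INF = math.inf
    # g[i][j] holds the lowest path cost to (i, j); horiz[i][j] flags
    # a horizontal arrival, which bars a further horizontal move.
    g = [[INF] * J for _ in range(I)]
    horiz = [[False] * J for _ in range(I)]
    g[0][0] = d(0, 0)
    for i in range(1, I):
        for j in range(J):
            best, from_horiz = INF, False
            # horizontal move, barred if the previous move was horizontal
            if g[i - 1][j] < INF and not horiz[i - 1][j]:
                best, from_horiz = g[i - 1][j], True
            if j >= 1 and g[i - 1][j - 1] < best:
                best, from_horiz = g[i - 1][j - 1], False
            if j >= 2 and g[i - 1][j - 2] < best:
                best, from_horiz = g[i - 1][j - 2], False
            if best < INF:
                g[i][j] = best + d(i, j)
                horiz[i][j] = from_horiz
    return g[I - 1][J - 1] / I   # normalise by len(A), as in equation 24

# Identical patterns give a distance of zero.
A = [[1.0], [2.0], [3.0], [4.0]]
print(dtw_distance(A, A))  # 0.0
```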
However, the algorithm based on equations 19 to 24 may be adapted for word spotting applications. In word spotting, it is assumed that the start and stop points of the first speech pattern A are known. However, the start and stop points of the relevant speech in the second pattern B are unknown. Therefore, the conditions of equation 9 no longer hold and can be re-defined such that:

i(1) = 1, j(1) = start, and i(K) = I, j(K) = stop (25)
Based on the fact that the maximum expansion/compression in the speech pattern is 2, the start point can assume any value from 1 to J − I/2 and the stop point may assume any value from I/2 to J. Consequently, the global conditions specify:

j(k) ≥ max[2i(k) − 2I + I/2, (i(k) + 1)/2] (26)

and

j(k) ≤ min[2i(k) − 1 + J − I/2, (i(k) − I)/2 + J] (27)

The time-normalised distance D(A, B) is now defined as:

D(A, B) = (1/I) min[ g(I, K) ] where K = I/2, ..., J (28)

Referring to Figures 21, 24 and 25, a process for calculating the time-normalised distance D is shown.
The featuregram 51 derived from the recorded signal 19 (Figure 5) is compared with the speaker-independent featuregram archetype 52. As explained earlier, the featuregram comprises a speech utterance, such as "twenty-one", silence intervals and background noise. The speaker-independent featuregram archetype 52 comprises a word or phrase being sought, which in this example is "twenty-one".
The featuregram 51 is warped onto the speaker-independent featuregram archetype 52. The aim is to locate a region within the featuregram 51 (speech pattern B) which best matches the speaker-independent featuregram archetype 52 (speech pattern A).
An array g for holding the lowest-cost path to each point is defined (step S1.5.1). The array may be considered as a net of points or nodes. As explained earlier, the start point can assume any value from 1 to J − I/2; therefore, the elements g(1, 1) to g(1, J − I/2) are set to values d(1, 1) to d(1, J − I/2) respectively (step S1.5.2). Elements g(1, J − I/2 + 1) to g(1, J) may be set to a large number. A corresponding array 59 is shown in Figure 25a.
Equation 20 is then calculated for some, but not all, elements (i, j) of array g.
The process comprises incrementing index i (step S1.5.3), checking whether the algorithm has come to an end (step S1.5.4), determining the bounds 58max, 58min (Figure 23) of the legal search region (steps S1.5.5 to S1.5.8) and determining whether an index value j falls outside the bounds 58max, 58min (step S1.5.9). If so, then a large distance is entered, i.e. g(i, j) = ∞, which in practice is a large number (step S1.5.10). Otherwise, equation 20 is calculated and a corresponding distance, herein labelled d'i,j, is entered, i.e. g(i, j) = d'i,j (step S1.5.11). The process continues by incrementing index j at step S1.5.7 and continuing until j exceeds J (step S1.5.8).
A corresponding array 59', partially filled, is shown in Figure 25b.
The algorithm continues until the array is completed, i.e. (i, j) = (I, J) (step S1.5.4). A corresponding completed array 59'' is shown in Figure 25c.
The winning score with the lowest value is found (step S1.5.12). As explained earlier, the stop point may assume any value from I/2 to J. Therefore, elements g(I, I/2) to g(I, J) are searched. Once a stop point 60 has been found, a start point 61 may be estimated by tracing back the winning path 62. Thus, endpoints 28' are found by reading the j-values corresponding to the start and stop points 60, 61.
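The free-endpoint search of equations 25 to 28 together with the traceback might be sketched as follows. This is an illustrative simplification, not the patent's exact routine: Manhattan distance, no slope constraint beyond steps of 0, 1 or 2 in j, no parallelogram pruning, and 0-based indices:

```python
import math

def spot_word(A, B):
    """Word-spotting sketch: the template A is known; the start and
    stop columns within the longer recording B are free. First-row
    costs are seeded for every allowed start column, the last row is
    searched for the cheapest stop column, and a back-pointer trace
    recovers the matching start column."""
    I, J = len(A), len(B)
    d = lambda i, j: sum(abs(a - b) for a, b in zip(A[i], B[j]))
    INF = math.inf
    g = [[INF] * J for _ in range(I)]
    back = [[-1] * J for _ in range(I)]
    for j in range(J - I // 2):           # start may lie in 1 .. J - I/2
        g[0][j] = d(0, j)
        back[0][j] = j                     # remember the start column
    for i in range(1, I):
        for j in range(J):
            for step in (1, 2, 0):         # j-1, j-2, or horizontal move
                pj = j - step
                if 0 <= pj < J and g[i - 1][pj] + d(i, j) < g[i][j]:
                    g[i][j] = g[i - 1][pj] + d(i, j)
                    back[i][j] = back[i - 1][pj]
    stop = min(range(I // 2, J), key=lambda j: g[I - 1][j])
    return back[I - 1][stop], stop         # 0-based start and stop columns

# The template [5, 6, 7] is buried in silence at columns 2..4.
B = [[0.0], [0.0], [5.0], [6.0], [7.0], [0.0], [0.0]]
A = [[5.0], [6.0], [7.0]]
print(spot_word(A, B))  # (2, 4)
```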
Performing sanity checks
The ability of a voice authentication system to consistently accept valid users and reject impostors is dependent on the generation of featuregrams that in some way represent the user's key speech characteristics. A plurality of sanity checks may be applied during enrolment and authentication, preferably on the recorded signal or a recorded signal portion, to ensure that they are suitable for enrolment and authentication, i.e. that the speech utterances carry sufficient information for featuregrams to be generated. Preferably, all of the following sanity checks are performed.

-Speech Length-
A first sanity check comprises confirming that the length of speech exceeds a minimum length. The minimum length of speech is a function not only of time but also of the number of feature vector timeslices. In this example, the minimum length of speech is 0.5 seconds of speech and 30 feature vector timeslices, and the timeslice duration and overlap are defined accordingly.
-Noise Length-
A second sanity check comprises checking that each speech utterance includes a silence interval which exceeds a minimum length. The silence interval is used to determine noise threshold levels for explicit endpointing, signal-to-noise measurements and speech/noise entropy. In this example, the minimum length of silence is 0.5 seconds and 30 feature vector timeslices.
-Signal-to-Noise Ratio (SNR)-
A third sanity check includes examining whether the signal-to-noise ratio exceeds a minimum. In this example, the minimum signal-to-noise ratio is 20 dB. The purpose of setting a minimum signal-to-noise ratio is to obtain an accurate speaker biometric template uncorrupted by background noise.

An estimate of the SNR can be determined using:

SNR = 10 × log10(Is/In) (29)

where Is is the speech energy and In is the noise energy. The speech and noise energies Is, In can be calculated using:

I = (1/n) Σj=1..n pcmj² (30)

where pcmj is the value of the digital signal and n is the number of samples. Other values of minimum signal-to-noise ratio may be used, for example 25 dB.
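Equations 29 and 30 translate directly into code; the function names are illustrative:

```python
import math

def energy(pcm):
    """Mean squared sample value over a frame (equation 30)."""
    return sum(s * s for s in pcm) / len(pcm)

def snr_db(speech_pcm, noise_pcm):
    """Signal-to-noise ratio in decibels (equation 29)."""
    return 10.0 * math.log10(energy(speech_pcm) / energy(noise_pcm))

# A speech frame with 100x the noise power gives 20 dB,
# the example minimum in the text.
print(snr_db([1000] * 160, [100] * 160))  # 20.0
```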
-Speech Intensity-
A fourth sanity check comprises checking whether the speech energy exceeds a minimum. The purpose of setting a minimum speech intensity is not only to provide an adequate signal-to-noise ratio, but also to avoid excessive quantisation in the digital signal. In this example, the minimum speech intensity is 47 dB.
-Clipping-
A fifth sanity check comprises determining whether the degree of clipping exceeds a maximum value. The degree of clipping is defined as the average number of samples which exceed an absolute value in each speech frame. In this case, the absolute value is 32000, which represents about 98% of the full-scale deflection of a 16-bit analog-to-digital converter.
-Speech Entropy-
A sixth sanity check includes checking whether a so-called "speech entropy" exceeds a minimum. In this example, the minimum speech entropy is 40.

Speech entropy is defined as the average distance between a speech featuregram and the mean feature vector of the speech featuregram. The mean feature vector is calculated by taking an average of the feature vectors in the featuregram. A distance between each feature vector and the mean feature vector is determined. Preferably, a Manhattan distance is calculated, although a Euclidean distance may be used. An average distance is calculated by taking an average of the values of distance.
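The speech entropy measure described above might be sketched as follows; the function name is illustrative:

```python
def speech_entropy(featuregram):
    """Average Manhattan distance between each feature vector in a
    featuregram and the featuregram's mean feature vector."""
    n = len(featuregram)
    dim = len(featuregram[0])
    mean = [sum(v[k] for v in featuregram) / n for k in range(dim)]
    dist = lambda v: sum(abs(v[k] - mean[k]) for k in range(dim))
    return sum(dist(v) for v in featuregram) / n

# Vectors spread far from their mean give a high entropy value;
# identical vectors would give zero.
print(speech_entropy([[0.0, 0.0], [2.0, 2.0]]))  # 2.0
```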
-Speech/Noise Entropy-
A seventh sanity check comprises testing whether a so-called "speech-to-noise entropy" exceeds a minimum. Speech-to-noise entropy is defined as the average distance between the mean feature vector of the speech featuregram and the feature vectors of the background noise. In this example, the minimum speech-to-noise entropy is 40.
Referring to Figure 26, a process of performing the sanity checks is shown. A plurality of sanity checks are performed and a tally is kept of the number of failures (steps S1.6.1 to S1.6.6). If the number of failures exceeds a threshold, for example 3, then the signal is deemed to be inadequate and the user is asked to check their set-up (steps S1.6.7 and S1.6.8). Otherwise, the recorded signal 19 (Figure 5) is considered to be satisfactory (step S1.6.9).
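The tally-and-threshold logic can be sketched as follows; the callable-based interface is an illustrative assumption:

```python
def signal_ok(checks, max_failures=3):
    """Run the sanity checks, tally the failures, and accept the
    signal only if the failure count does not exceed the threshold.
    Each check is a zero-argument callable returning True on a pass."""
    failures = sum(0 if check() else 1 for check in checks)
    return failures <= max_failures

# Two failures out of four checks stays within the example threshold of 3.
checks = [lambda: True, lambda: False, lambda: True, lambda: False]
print(signal_ok(checks))  # True
```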
Creating speech featuregram
Once the endpoints of the recorded signal 19 (Figure 5) have been identified and the recorded signal 19 (Figure 5) passes the plurality of sanity checks, a speech featuregram may be created.

Referring to Figure 27, a speech featuregram 63 is created using a process 64 by concatenating feature vectors 24 extracted from the section of the featuregram 25 that originates from the speech utterance. The speech section of the featuregram is located via the speech endpoints 28, 28'.
Creating speech featuregram archetype
The aim of enrolment is to provide a characteristic voiceprint for one or more words or phrases. However, specimens of the same word or phrase provided by the same user usually differ from one another. Therefore, it is desirable to obtain a plurality of specimens and derive a model or archetypal specimen. This may involve discarding one or more specimens that differ significantly from the other specimens.
Referring to Figure 28, a speech featuregram archetype 65 is calculated using an averaging process 66 applied to w featuregrams 631, 632, ..., 63w. Typically, an average of three featuregrams 63 is taken.
Referring to Figures 29, 30, 31 and 32, the featuregram archetype 65 is computed by determining a winning score D for each featuregram 631, 632, ..., 63w warped, using a modified version of process 50 which is shown in Figure 32, against each other featuregram 631, 632, ..., 63w to create a w-by-w featuregram cost matrix 67, whose diagonal elements are zero (steps S1.8.1 to S1.8.9).

Excluding the diagonal elements, a minimum value Dmin in the featuregram cost matrix 67 is determined (step S1.8.10). If the minimum value Dmin is greater than a predetermined threshold distance D0, then all the featuregrams 631, 632, ..., 63w are considered to be so dissimilar that a featuregram archetype 65 cannot be created (step S1.8.11).

Referring to Figures 29 and 31, if one or more values in the featuregram cost matrix 67 is less than the threshold D0, then w featuregram archetypes 681, 682, ..., 68w are computed using each featuregram 631, 632, ..., 63w as a reference and warping each of the remaining (w−1) featuregrams onto it (steps S1.8.12 to S1.8.21).
Referring to Figures 29, 33, 34 and 35, once the w featuregram archetypes 681, 682, ..., 68w have been created, a w-by-w featuregram archetype cost matrix 69 is computed whose elements consist of the winning scores E' from warping each featuregram 631, 632, ..., 63w onto each featuregram archetype 681, 682, ..., 68w (steps S1.8.22 to S1.8.28).
An average featuregram archetype cost matrix 70 is computed by averaging the elements within each column 71 corresponding to a featuregram 631, 632, ..., 63w (steps S1.8.29 to S1.8.37).
A maximum value E'max in the featuregram archetype cost matrix 69 is also determined (step S1.8.38). If the maximum value E'max in the featuregram archetype cost matrix 69 is less than the threshold D0, then the featuregram archetype 681, 682, ..., 68w which provides the lowest mean featuregram archetype cost <E'1>, <E'2>, ..., <E'w> is chosen to be included in the voice authentication biometric (steps S1.8.37 to S1.8.50). The mean featuregram archetype costs <E'1>, <E'2>, ..., <E'w> are calculated by averaging the elements within each row 72.
If the maximum value E'max in the featuregram archetype cost matrix 69 is greater than the threshold D0, then a featuregram 631, 632, ..., 63w is excluded, thus reducing the number of featuregrams to (w−1), and steps S1.8.1 to S1.8.50 are repeated (step S1.8.54).
A featuregram 631, 632, ..., 63w is chosen for exclusion by calculating a variance σ1, σ2, ..., σw for each featuregram archetype 681, 682, ..., 68w and excluding the featuregram 631, 632, ..., 63w corresponding to the featuregram archetype 681, 682, ..., 68w having the lowest value of variance σ1, σ2, ..., σw (steps S1.8.51 to S1.8.53). For example, for an ith featuregram archetype 68i, a variance σi is calculated from the average featuregram archetype cost matrix 70 using:

σi = Σj Σk |E'(i, j) − E'(i, k)| (31)

Thus, the mean featuregram archetype cost <E'1>, <E'2>, ..., <E'w> which produced the lowest average distance results in the reference featuregram 631, 632, ..., 63w from which it was created being discarded.
Steps S1.8.1 to S1.8.50 are repeated until a featuregram archetype 65 (Figure 28) is obtained or only one featuregram 631, 632, ..., 63w is left.
Setting an appropriate pass level
A featuregram archetype 65 is obtained for each prompt. Thus, during subsequent authentication, a user is asked to provide a response to a prompt. A featuregram is obtained and compared with the featuregram archetype 65 using a dynamic time warping process which produces a score. The score is compared with a preset pass level. A score which falls below the pass level indicates a good match and so the user is accepted as being a valid user.
Conversely, an impostor may be expected to provide poor responses which are usually rejected. Nevertheless, they may occasionally provide a sufficiently close-
matching response which is accepted. Thus, the pass level affects the proportion of 10 valid users being incorrectly rejected, i.e. the "false reject rate" (ERR) and the proportion of impostors which are accepted, i. e "false accept rate" (FAR).
In this example, a neutral strategy is adopted which shows no bias towards preventing unauthorised access or allowing authorised access.
A pass level for a fxed-word or taxed-phrase prompt is determined using previously acquired captured recordings taken from a wide range of representative speakers.
A featuregram archetype is obtained for each of a first set of users for the same 20 prompt in a manner hereinbefore described. Thereafter, each user provides a spoken response to the prompt from which a featuregram is obtained and compared with the user's featuregram archetype using a dynamic time warping process so as produce a score. This produces a first set of scores corresponding to valid users.
25 The process is repeated for a second set of users, again using the same prompt.
Once more, each user provides a spoken response to the prompt from which a featuregram is obtained. However, the featuregeam is compared with a different user's featuregram archetype. Another set of scores are produced, this time corresponding to impostors.
Referring to Figure 36, the frequencies of scores for valid users and impostors are fitted to first and second probability density functions 73₁, 73₂ respectively using:
p(x) = (1 / (xσ√(2π))) exp(−(ln(x) − μ)² / (2σ²))     (32)

where p is probability, x is score, μ is the mean score and σ is the standard deviation.
Other probability density functions may be used.
The mean score μ₁ for valid users is expected to be lower than the mean score μ₂ for the impostors. Furthermore, the standard deviation σ₁ for the valid users is usually smaller than the standard deviation σ₂ of the second density function.

Referring to Figure 37, the first and second probability density functions 73₁, 73₂ are numerically integrated to produce first and second continuous density functions 74₁, 74₂. The point of intersection 75 of the first and second continuous density functions 74₁, 74₂ is the equal error rate (EER), wherein FRR = FAR. The score at the point of intersection 75 is used as the pass score for the prompt.
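A minimal sketch of this threshold-setting procedure is given below, assuming the log-normal density of equation (32). All names and the numeric search range are illustrative assumptions; the patent describes the equal error rate as the crossing point of the two integrated curves, which is expressed here, equivalently, as the score at which the false reject and false accept rates are closest.

```python
import math

def lognorm_pdf(x, mu, sigma):
    """Log-normal density in the form of equation (32)."""
    return (math.exp(-(math.log(x) - mu) ** 2 / (2 * sigma ** 2))
            / (x * sigma * math.sqrt(2 * math.pi)))

def eer_pass_score(mu1, sigma1, mu2, sigma2, lo=0.01, hi=50.0, steps=5000):
    """Numerically integrate both densities and return the score where
    FRR (valid users scoring above t) and FAR (impostors scoring below t)
    are closest, i.e. the equal error rate point."""
    dx = (hi - lo) / steps
    f1 = f2 = 0.0                      # running integrals (continuous density functions)
    best_t, best_gap = lo, float("inf")
    for i in range(steps):
        t = lo + (i + 0.5) * dx
        f1 += lognorm_pdf(t, mu1, sigma1) * dx   # fraction of valid users accepted at t
        f2 += lognorm_pdf(t, mu2, sigma2) * dx   # fraction of impostors accepted at t
        gap = abs((1.0 - f1) - f2)               # |FRR - FAR|
        if gap < best_gap:
            best_gap, best_t = gap, t
    return best_t
```

With well-separated score distributions the returned pass score sits between the two means on a logarithmic scale.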
Creating a voice authentication biometric

Referring to Figure 38, a voice authentication biometric 76 is shown. The voice authentication biometric 76 comprises sets of data 77₁, 77₂, 77₃ corresponding to featuregram archetypes 65 and associated prompts 78. Statistical information 79 regarding each featuregram archetype 65 and an associated prompt 78 may also be stored and will be described in more detail later. The voice authentication biometric 76 further comprises ancillary information including the number of prompts to be issued during authentication 80, the scoring strategy 81, and pass level and gain settings 82. The biometric 76 may include further information, for example related to high-level logic for analysing scores.
The voice authentication biometric 76 is stored in non-volatile memory 11 (Figure 1).
Authentication

Referring again to Figures 1 and 2, once enrolment has been successfully completed, the user is registered as a valid user. Access to the secure system 3 is conditional on successful authentication.
Referring to Figure 39, the authentication process, corresponding to step S2 in Figure 2, is shown. The voice authentication system 1 is initialised, for example by setting the amplifier gain to a value stored in the voice authentication biometric, or calibrated, for example to ensure that an appropriate amplifier gain is set (step S2.1). The user is then prompted (step S2.2) and the user's responses are recorded (step S2.3). Featuregrams are generated from the recordings (step S2.4). The recordings are examined so as to isolate speech from background noise and periods of silence (steps S2.5 and S2.6).
Checks are performed to ensure that the recordings, isolated speech utterances and featuregrams are suitable for processing (step S2.7). The featuregrams are then matched with the featuregram archetype (step S2.8). The response is also checked for replay attack (step S2.9). The user's response is then scored (step S2.10).
Initialisation/Calibration

The gain of the amplifier 6 (Figure 1) is set according to the value 82 (Figure 38) stored in the voice authentication biometric 76 (Figure 38), which is stored in non-volatile memory 11 (Figure 1).
Alternatively, the system may be calibrated in a way similar to that used in enrolment. However, the process may differ. For example, prompts used in authentication may differ from those used in enrolment. A value of gain determined during this calibration need not be recorded but may be compared with a value stored in the voice authentication biometric and used to determine whether the user is a valid user.
Authentication prompts

Authentication prompts are chosen from those stored in the voice authentication biometric 76 (Figure 38). Preferably, prompts are randomly chosen from a sub-set.
This has the advantage that it becomes more difficult for a user to guess what prompt will be used and so give an unnatural response. Moreover, this improves security.

Recording

Referring to Figure 40, following the or each prompt, a signal 83 is recorded using the microphone 5 (Figure 1) in a manner hereinbefore described.
Creating authentication featuregrams

Referring to Figures 41, 42 and 43, the or each recorded signal 83 is divided into timeslices 84. The timeslices 84 use the same window size and the same overlap as used for enrolment. Feature vectors 85 are created. Again, the same process is used in authentication as in enrolment. The feature vectors 85 are concatenated to produce featuregrams 86. The featuregrams 86 generated during authentication are usually referred to as authentication featuregrams.
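The framing step can be sketched as below. The function and parameter names are assumptions; the essential point from the description is that the window size and overlap must match those used during enrolment so that featuregrams are comparable.

```python
def timeslices(signal, window, overlap):
    """Split a sampled signal into overlapping frames (timeslices).

    `window` and `overlap` are sample counts; the same values must be
    used in authentication as in enrolment.
    """
    step = window - overlap
    return [signal[i:i + window]
            for i in range(0, len(signal) - window + 1, step)]
```

Each timeslice would then be mapped to a feature vector, and the vectors concatenated to form the featuregram.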
Referring to Figure 44, explicit endpointing may be performed using the process 27 described earlier so as to generate endpoints 87. Explicit endpointing may be used to support sanity checks.
Sanity checks

Sanity checks are conducted on the recorded signal 83 as described earlier.
Matching authentication featuregrams with the voice authentication biometric

Referring to Figure 45, the process 50 and the featuregram archetype 65 are used to word spot the authentication featuregram 86 and provide a dynamic time warping winning score 87. The process 50 may be used to provide endpoints 28'.
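The scoring at the heart of this matching step can be illustrated with a minimal, unconstrained dynamic time warping distance. This is a generic textbook sketch, not the patent's process 50, which additionally performs word spotting and constrains the winning path; the Euclidean local distance is also an assumption.

```python
import math

def dtw_distance(a, b):
    """Winning-path distance between two sequences of feature vectors;
    a lower distance indicates a closer match."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(a[i - 1], b[j - 1])  # local Euclidean distance
            d[i][j] = cost + min(d[i - 1][j],      # step in a only
                                 d[i][j - 1],      # step in b only
                                 d[i - 1][j - 1])  # diagonal step
    return d[n][m]
```

Identical sequences score zero; time-stretched but otherwise similar utterances score close to zero, which is what makes the distance usable as a match score.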
Rejecting a replay attack

A potential threat to the security offered by any voice authentication system is the possibility of an impostor secretly recording a spoken response of the valid user and subsequently replaying the recording to gain access to the system. This is known as a "replay attack".
One solution to this problem is to issue, during each separate authentication, a randomly chosen subset of prompts from a full set of prompts responded to during enrolment. This means that several different authentication sessions will need to be secretly recorded before an impostor can collect a complete set of the responses.
However, this does not combat the threat from recordings made during enrolment.
Another solution is to store copies of the featuregrams generated during recent authentications and track them to see if they vary sufficiently over time. However, this has several drawbacks. Firstly, additional storage is needed. Secondly, replaying the same recording on several occasions under different levels and types of background noise may in itself provide sufficient variability for the system to be
fooled into thinking that it is observing legitimate live spoken responses provided by the valid user.
Referring to Figures 46 and 47, a process for rejecting a replay attack is shown. A fixed-phrase prompt is randomly selected (step S2.9.1). An example of a fixed-phrase prompt is "This is my voiceprint". A recording is started (step S2.9.2). The user is then prompted a first time (step S2.9.3). After a predetermined period of time, for example 1 or 2 seconds, the user is prompted a second time with the same prompt (steps S2.9.4 & S2.9.5). Thus, the user supplies two different examples 89₁, 89₂ separated by a 1-2 second interval 90. A featuregram 86 is generated as described earlier (step S2.9.6). The interval may comprise silence and/or noise.
The word spotting process 50 is used to isolate the two spoken responses 89₁, 89₂ to the fixed-phrase prompt and the interval 90 (steps S2.9.7 & S2.9.8). The isolated responses 89₁, 89₂, in the form of truncated featuregrams, are fed to process 88.
Each truncated featuregram provides a representation of the spoken response. The duration of the interval 90 is determined.
If the featuregrams 89₁, 89₂ are too similar, either to each other or to the featuregram archetype 65 stored in the voice authentication biometric, then authentication is rejected on suspicion of a replay attack (steps S2.9.9 to S2.9.13). A corresponding reject flag 91 is set.
A record 92 is kept of the degree of match between the two featuregrams 89₁, 89₂ and the length of the intermediate silence 90 (step S2.9.11). This record 92 is known as a "Replay Attack Statistic" (RAS). The record 92 comprises two integers.
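Because the record 92 is just two small integers, it can be packed and compared very cheaply. The sketch below is illustrative: the quantisation steps (a one-byte match score and a two-byte interval in 10 ms units) are assumptions chosen to fit the roughly 3-bytes-per-prompt figure given later, and the comparison tolerance is likewise assumed.

```python
def replay_attack_statistic(match_score, interval_ms):
    """Pack the degree of match and the interval length into two integers."""
    return (min(int(match_score), 255),           # 1 byte: quantised match score
            min(int(interval_ms // 10), 65535))   # 2 bytes: interval in 10 ms units

def looks_like_replay(new_ras, stored_ras_list, tol=1):
    """Flag a suspected replay if the new statistic closely matches any
    statistic stored for the same prompt in an earlier authentication."""
    return any(abs(new_ras[0] - s[0]) <= tol and abs(new_ras[1] - s[1]) <= tol
               for s in stored_ras_list)
```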
Therefore, it is possible to store a plurality of replay attack statistics 92 for each fixed-phrase prompt in the voice authentication biometric 76 (Figure 38) without consuming a significant amount of memory. The record 92 is stored in the statistical information 79 (Figure 38).
If, during a subsequent authentication, a close match is detected between the latest replay attack statistic 92 and any replay attack statistic 92 previously stored in the voice authentication biometric 76 (Figure 38) (steps S2.9.15 to S2.9.16), then the authentication may be rejected on suspicion of a replay attack. Additionally or alternatively, the process may be repeated using a different prompt to check for the replay attack based on another set of replay attack statistics 92.
If, during a subsequent authentication, the duration of the interval 90 is found to be the same as the duration of the interval 90 for the same prompt arising from an earlier authentication, then the authentication may also be rejected on suspicion of a replay attack (steps S2.9.17 and S2.9.18).
The advantage of using this approach is that it is possible to monitor and detect suspicious similarities between featuregrams even if the acoustic environment has changed since the time the recording was originally made.
Furthermore, the approach helps to guard against replay attacks based on recordings made during enrolment and authentication. Additionally, the cost of storing the
replay attack statistics is low, typically 3 bytes per prompt. Thus, to monitor the last 5 authentication attempts across 5 fixed prompts typically requires 75 bytes of memory.

Higher-level decision logic

A decision on whether to accept or reject the user is based on the degree of match between featuregram archetypes 65 stored in the voice authentication biometric 76 (Figure 38) and the featuregrams 86 derived from the authentication recordings.
Higher-level decision logic is subsequently applied.
Higher-level decision logic may include calculating an average score for a plurality of featuregrams 86 and determining whether the average score falls below a first predetermined scoring threshold, i.e. D_av < D_thresh1. If the average score falls below the first predetermined scoring threshold, then authentication is considered successful.

Higher-level decision logic may include determining the number, n, of featuregrams 86 whose scores fall below a second predetermined scoring threshold, i.e. D_i < D_thresh2, for all 0 ≤ i < p. The decision logic subsequently comprises checking a pass condition. For example, the pass condition may be that the scores for n out of p featuregrams 86 fall below the second predetermined scoring threshold, where 0 < n < p. Allowing one or more of the featuregram scores to be ignored is useful because it allows the valid user to provide an uncharacteristic response to at least one of the prompts without being unduly penalised.
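The two decision rules just described can be sketched as follows; the function names are assumptions, and lower scores indicate better matches throughout.

```python
def average_rule(scores, thresh):
    """Accept if the average score falls below the first scoring threshold."""
    return sum(scores) / len(scores) < thresh

def n_of_p_rule(scores, thresh, n):
    """Accept if at least n of the p scores fall below the second scoring
    threshold, tolerating one or more uncharacteristic responses."""
    return sum(s < thresh for s in scores) >= n
```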
For fixed prompts, with a priori knowledge of a response, the scoring thresholds may be set based upon the statistical method described earlier.
For challenge-response prompts, a threshold may be determined during enrolment.
A plurality of specimens, preferably two or three, of the same response are taken. A featuregram archetype is determined. Additionally, a variance is determined.
Thus, a fixed number of prompts are issued and spoken responses are recorded.
The spoken responses are analysed to determine whether a valid user is addressing the system.
However, an alternative strategy may be used, which adaptively determines the number of prompts to be issued.
Initially, a user is prompted a predetermined number of times, for example two or three times. Spoken responses are recorded, corresponding featuregrams are obtained and compared with the featuregram archetype so as to produce a number of scores.
Depending on the scores, further prompts may be issued. For example, if all or substantially all the scores fall below a threshold score, indicating a good number of matches, then no further prompts are issued and authentication is successful.
Conversely, if all or substantially all the scores exceed the threshold score, indicating a poor number of matches, then authentication is unsuccessful.
However, if some scores fall below the threshold and other scores exceed the threshold, then further prompts are issued and further scores obtained.
This process continues until either the proportion of successful scores exceeds a first predetermined proportion, for example 70%, in which case authentication is successful, or falls below a second predetermined proportion, such as 30%, in which case authentication is considered unsuccessful.
This has the advantage that valid users who provide consistently good examples of speech when prompted need only provide a small number of spoken responses, thus saving time.
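The adaptive strategy above can be sketched as a loop. The 70% and 30% proportions come from the text; the minimum and maximum prompt counts and the callback-style interface are assumptions.

```python
def adaptive_authenticate(get_score, thresh, min_prompts=3, max_prompts=10,
                          pass_prop=0.7, fail_prop=0.3):
    """Issue prompts until the proportion of passing scores exceeds
    pass_prop (accept) or falls below fail_prop (reject).

    get_score() records one spoken response and returns its score
    against the featuregram archetype (lower is better).
    """
    scores = []
    while len(scores) < max_prompts:
        scores.append(get_score())
        if len(scores) < min_prompts:
            continue                      # always issue the initial prompts
        prop = sum(s < thresh for s in scores) / len(scores)
        if prop > pass_prop:
            return True                   # consistently good: accept early
        if prop < fail_prop:
            return False                  # consistently poor: reject
    return False                          # still undecided: treat as unsuccessful
```

A user who scores well on every initial prompt is accepted after only min_prompts responses, which is the time saving the text describes.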
In the above embodiment, the voice authentication system is comprised in a single unit, such as a personal computer. However, the voice authentication system may be distributed.
For example, the processor performing the matching process and the non-volatile memory holding the voice authentication biometric may be held on a so-called "smart card" which is carried by the valid user. This is particularly convenient for controlling access to a room or building via an electronically-controlled lockable door. The door is provided with a microphone and a smart card reader. The door is also provided with a speaker for providing audio prompts and/or a display for providing text prompts. When the smart card is inserted into the smart card reader, the voice authentication system is connected and permits authentication and, optionally, enrolment. Enrolment may be performed elsewhere, preferably under supervision of the system administrator, using another microphone and smart card reader together with a speaker and/or display. This has the advantage that access is conditional not only on successful authentication, but also on possession of the smart card. Furthermore, the voice authentication biometric and the matching process may be encrypted. The smart card may also be used in personal electronic devices, such as cellular telephones and personal data assistants.
Many modifications may be made to the embodiment hereinbefore described. For example, the recorded signal may comprise a stereo recording.
Measurements of background noise may be made in different ways. For example, a
recorded signal, or part thereof, may be divided into a plurality of frames. A value of background noise may be determined by selecting one or more of the lowest
energy frames and either using one of the selected frames as a representative frame or obtaining an average of all the selected frames. To select the one or more lowest energy frames, the frames may be arranged in order of signal energy. Thereafter, the ordered frames may be examined to determine a boundary where signal energy jumps from a relatively low level to a relatively high level. Alternatively, a predetermined number of frames at the lower energy end may be selected.
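The frame-ordering approach can be sketched as below; averaging the k quietest frames is one of the two alternatives described, and the default value of k is an assumption.

```python
def background_noise_energy(frame_energies, k=5):
    """Estimate background noise energy from the lowest-energy frames.

    Orders the frames by energy and averages the k quietest; an
    alternative is to look for the jump from low to high energy in
    the ordered list and use only the frames below that boundary.
    """
    quietest = sorted(frame_energies)[:k]
    return sum(quietest) / len(quietest)
```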
Claims

1. A method of voice authentication comprising:
enrolling a user including:
requesting said enrolling user to provide a spoken response to a prompt;
obtaining a recorded signal including a recorded signal portion corresponding to said spoken response;
determining endpoints of said recorded signal portion;
deriving a set of feature vectors for characterising said recorded signal portions;
averaging a plurality of sets of feature vectors, each set of feature vectors relating to one or more different spoken responses to the prompt by said enrolling user, so as to provide an archetype set of feature vectors for said response;
storing said archetype set of feature vectors together with data relating to said prompt;
authenticating a user including:
retrieving said data relating to said prompt and said archetype set of feature vectors;
requesting said authenticating user to provide another spoken response to said prompt;
obtaining another recorded signal including another recorded signal portion corresponding to said other spoken response;
determining endpoints of said other recorded signal portion;
deriving another set of feature vectors for characterising said other recorded signal portions;
comparing said another set of feature vectors with said archetype set of feature vectors so as to produce a score dependent upon a degree of matching; and
comparing said score with a predefined threshold so as to determine whether said enrolling user and said authenticating user are the same.

2. A method according to claim 1, wherein the recorded signal includes a plurality of frames and the determining of said endpoints of said recorded signal further comprises:
determining whether a value of energy for a first frame exceeds a first predetermined value; and
determining whether a second frame immediately preceding the first frame represents a spoken utterance portion.

3.
A method of determining an endpoint of a recorded signal portion in a recorded signal including a plurality of frames, the method comprising:
determining whether a value of energy for a first frame exceeds a first predetermined value; and
determining whether a second frame immediately preceding the first frame represents a spoken utterance portion.

4. A method according to claim 2 or 3, wherein the first predetermined value represents a value of energy of a frame comprised of background noise.

5. A method according to any one of claims 2, 3 or 4 comprising:
defining a start point if the value of energy of the first frame exceeds the first predetermined value and the second frame does not represent a spoken utterance portion.

6. A method according to claim 5, further comprising:
indicating that the first frame represents a spoken utterance portion.

7. A method according to any one of claims 2 to 5 comprising:
defining a stop point if the value of energy of the first frame does not exceed the first predetermined value and the second frame represents a spoken utterance portion.

8. A method according to claim 7, further comprising:
defining the first frame as not representing a spoken utterance portion.

9. A method according to claim 7 or 8, further comprising:
counting a number of frames preceding a start point of the spoken utterance portion.

10. A method according to claim 9 further comprising:
pairing the stop point with said start point of the spoken utterance portion if the number of frames exceeds a predetermined number.

11. A method according to claim 9 further comprising:
pairing the stop point with the start point of a preceding spoken utterance portion if the number of frames does not exceed a predetermined number.

12. A method according to claim 2 or 3 comprising:
determining whether the value of energy for a first frame exceeds a third predetermined value; and
counting a number of frames preceding a start point of the spoken utterance portion.
13. A method according to claim 12 further comprising:
defining a start point if the value of energy of the first frame exceeds the third predetermined value, the second frame does not represent a spoken utterance portion and the number of frames does not exceed a predetermined number.

14. A method according to claim 13 further comprising:
determining whether a value of energy for a third frame following said first frame exceeds the second predetermined value.

15. A method according to claim 14 further comprising:
defining a stop point if the value of energy of the third frame does not exceed the third predetermined value.

16. A method according to claim 15 further comprising:
pairing the stop point with the start point of the spoken utterance portion.

17. A method according to claim 16 further comprising:
pairing the stop point with a start point of a preceding spoken utterance portion.

18. A method according to claim 12 comprising:
deeming the first frame as representing background noise if the value of energy of the first frame does not exceed the third predetermined value.

19. A method according to claim 18 further comprising:
calculating an updated value of background energy using said value of energy of the first frame.

20. A method according to claim 19, further comprising:
counting a number of frames preceding a start point of the spoken utterance portion and determining whether said number of frames exceeds another, larger number.

21. A method according to any one of claims 2 to 20 comprising:
determining whether a value of rate of change of energy of the first frame exceeds a second predetermined value.

22. A method according to claim 21, wherein the second predetermined value represents a value of rate of change of energy of a frame comprised of background noise.

23.
A method according to claim 21 or 22, comprising:
defining a start point if the value of energy of the first frame exceeds the first predetermined value, the value of rate of change of energy exceeds the second predetermined value and the second frame does not represent a spoken utterance portion.

24. A method according to claim 21 or 22, comprising:
defining a stop point if the value of energy of the first frame does not exceed the first predetermined value, the value of rate of change of energy does not exceed the second predetermined value and the second frame represents a spoken utterance portion.

25. A method according to any one of claims 22 to 24 comprising:
determining whether the value of rate of change of energy for the first frame exceeds a fourth predetermined value.

26. A method according to any preceding claim, comprising:
receiving said spoken response via a microphone;
generating an electrical signal;
amplifying said electrical signal using an amplifier to produce an amplified signal;
determining whether said amplified signal level is above a first predetermined limit; and
either decreasing gain if the amplified signal level is above the first predetermined limit or maintaining gain otherwise;
thereby permitting no increase in gain.

27. A method according to claim 26, further comprising:
requesting said enrolling user to provide another spoken response to another prompt;
generating another electrical signal;
amplifying said another electrical signal with said amplifier to produce another amplified signal;
determining whether said another amplified signal level is above the first predetermined limit; and
either decreasing gain if the amplified signal level is above the first predetermined limit or maintaining gain otherwise;
thereby permitting no increase in gain.

28.
A method of gain control comprising, a plurality of times:
determining whether an amplified signal level is above a predetermined limit; and
either decreasing gain if the amplified signal level is above the predetermined limit or maintaining gain otherwise;
thereby permitting no increase in gain.

29. A method according to claim 28, further comprising:
receiving a spoken response from a user via a microphone;
generating an electrical signal; and
amplifying said electrical signal using an amplifier to produce the amplified signal.

30. A method according to claim 29, comprising:
determining whether amplifier gain is at a lowest value.

31. A method according to claim 30, comprising:
informing the user that the spoken response is loud if the amplified signal level is above the predetermined limit and the gain is at the lowest value.

32. A method according to any one of claims 26 to 31, wherein the determining whether the amplified signal is above the predetermined level comprises:
determining whether any portion of the amplified signal is above the predetermined level.

33. A method according to any one of claims 26 to 31, wherein the determining whether the amplified signal is above the predetermined level comprises:
determining whether an average of the amplified signal is above the predetermined level.

34. A method according to any one of claims 26 to 33, comprising:
counting a number of spoken responses given by the user.

35. A method according to claim 34, wherein the counting of the number of spoken responses comprises:
determining a number of times that gain has been decreased or maintained.

36. A method according to claim 34 or 35, comprising:
stopping if the number of spoken responses exceeds a predetermined number.

37. A method according to any one of claims 26 to 36, comprising:
determining a number of times gain has been consecutively maintained.

38.
A method according to claim 37, comprising:
storing a value of gain if the number of times gain has been consecutively maintained reaches a predetermined number.

39. A method according to any one of claims 26 to 38, further comprising:
determining a value of signal-to-noise ratio.

40. A method of gain control comprising, a plurality of times:
determining whether an amplified signal level is below a predetermined limit; and
either increasing gain if the amplified signal level is below the predetermined limit or maintaining gain otherwise;
thereby permitting no decrease in gain.

41. A method according to any preceding claim, comprising:
requesting said authenticating user to provide first and second spoken responses to said prompt;
obtaining a recorded signal including first and second recorded signal portions corresponding to said first and second spoken responses;
isolating said first and second recorded signal portions;
deriving first and second sets of feature vectors for characterising said first and second isolated recorded signal portions respectively;
comparing said first set of feature vectors with said second set of feature vectors so as to produce another score dependent upon the degree of matching; and
comparing the another score with another predefined threshold so as to determine whether the first set of feature vectors is substantially identical to the second set of feature vectors.

42.
A method of voice authentication comprising:
requesting a user to provide first and second spoken responses to a prompt;
obtaining a recorded signal including first and second recorded signal portions corresponding to said first and second spoken responses;
isolating said first and second recorded signal portions;
deriving first and second sets of feature vectors for characterising said first and second isolated recorded signal portions respectively;
comparing said first set of feature vectors with said second set of feature vectors so as to produce a second score dependent upon the degree of matching; and
comparing the second score with another predefined threshold so as to determine whether the first set of feature vectors is substantially identical to the second set of feature vectors.

43. A method according to claim 41 or 42, comprising:
providing a third set of feature vectors; and
comparing said first set of feature vectors with said third set of feature vectors and obtaining a third score dependent upon a degree of match.

44. A method according to claim 43, wherein the third set of feature vectors is an archetype set of feature vectors.

45. A method according to claim 43 or 44 further comprising:
comparing said third score with a third predetermined threshold thereby determining whether the first set of feature vectors is substantially identical to the archetype set of feature vectors.

46. A method according to any one of claims 43 to 45 further comprising:
comparing said third score with a fourth score, the fourth score being obtained in a previous authentication by comparing a fourth set of feature vectors with said archetype set of feature vectors, and determining whether the third and fourth scores are substantially the same.

47. A method according to any one of claims 41 to 46 comprising:
measuring an interval between said first and second recorded signal portions.

48.
A method according to claim 47 further comprising:
comparing said interval with another interval obtained in a previous enrolment or authentication.

49. A method according to claim 47 or 48 further comprising:
storing data relating to said interval.

50. A method according to any preceding claim, comprising:
requesting an authenticating user to provide a plurality of spoken responses to a prompt;
obtaining a plurality of corresponding recorded signals, each recorded signal including a recorded signal portion corresponding to a respective spoken response;
deriving a plurality of sets of feature vectors, each set of feature vectors for characterising a respective recorded signal portion;
comparing said sets of feature vectors with an archetype set of feature vectors so as to produce a plurality of scores dependent upon a degree of matching; and
determining whether authentication is successful in dependence upon said plurality of scores.

51. A method of voice authentication including:
requesting an authenticating user to provide a plurality of spoken responses to a prompt;
obtaining a plurality of corresponding recorded signals, each recorded signal including a recorded signal portion corresponding to a respective spoken response;
deriving a plurality of sets of feature vectors, each set of feature vectors for characterising a respective recorded signal portion;
comparing said sets of feature vectors with an archetype set of feature vectors so as to produce a plurality of scores dependent upon a degree of matching; and
determining whether authentication is successful in dependence upon said plurality of scores.

52. A method according to claim 50 or 51, further including:
computing an average score from said plurality of scores; and
comparing said average score with a predefined threshold.

53. A method according to claim 52, including:
permitting authentication if said average score exceeds the predefined threshold.

54.
A method according to claim 50 or 51, further including:
comparing each of said scores with a predefined threshold and determining a number of scores which exceed a predefined threshold.

55. A method of authentication according to claim 54, including:
permitting authentication if the number of scores which exceed the predefined threshold exceeds another predefined threshold.

56. A method according to any preceding claim, further comprising:
requesting a first set of users to provide respective spoken responses to a prompt;
for each user, obtaining a recorded signal which includes a recorded signal portion corresponding to the user's spoken response;
for each user, deriving a set of feature vectors for characterising the recorded signal portion;
for each user, comparing said set of feature vectors with an archetype set of feature vectors for said user so as to produce a score dependent upon a degree of matching;
fitting a first probability density function to the frequency of scores for said first set of users;
requesting a second set of users to provide respective spoken responses to a prompt;
for each user, obtaining a recorded signal which includes a recorded signal portion corresponding to the user's spoken response;
for each user, deriving a set of feature vectors for characterising the recorded signal portion;
for each user, comparing said set of feature vectors with an archetype set of feature vectors for a different user so as to produce a score dependent upon a degree of matching; and
fitting a second probability density function to the frequency of scores for said second set of users.

57.
A method of determining an authentication threshold score, the method including:
requesting a first set of users to provide respective spoken responses to a prompt;
for each user, obtaining a recorded signal which includes a recorded signal portion corresponding to the user's spoken response;
for each user, deriving a set of feature vectors for characterising the recorded signal portion;
for each user, comparing said set of feature vectors with an archetype set of feature vectors for said user so as to produce a score dependent upon a degree of matching;
fitting a first probability density function to the frequency of scores for said first set of users;
requesting a second set of users to provide respective spoken responses to a prompt;
for each user, obtaining a recorded signal which includes a recorded signal portion corresponding to the user's spoken response;
for each user, deriving a set of feature vectors for characterising the recorded signal portion;
for each user, comparing said set of feature vectors with an archetype set of feature vectors for a different user so as to produce a score dependent upon a degree of matching; and
fitting a second probability density function to the frequency of scores for said second set of users.

58. A method according to claim 57, further comprising:
integrating said first and second density distribution functions to produce first and second respective continuous density functions.

59. A method according to claim 58, further comprising:
determining where said first and second continuous density functions cross and in dependence thereon determining a threshold score.

60. A method according to any one of claims 56 to 59 further comprising:
storing data relating to said score.

61.
61. A method according to any preceding claim, wherein the comparing of sets of feature vectors comprises: dynamic time warping said sets of feature vectors, and wherein the score produced by the comparing of said sets of feature vectors is a dynamic time warping winning path distance.

62. A method according to any preceding claim, wherein the deriving of a set of feature vectors for characterising said recorded signal portions comprises: determining feature vectors representative of acoustic features within said recorded signal portions.

63. A method according to any preceding claim, wherein the deriving of said set of feature vectors for characterising said recorded signal portions comprises: determining feature vectors using a feature transform.

64. A method according to claim 63, wherein said feature transform is a mel-cepstral transform.

65. A method according to any preceding claim, further comprising: dividing said recorded signal into frames.

66. A method according to claim 65, wherein the deriving of said set of feature vectors for characterising said recorded signal portions comprises: deriving a feature vector for each respective frame of said recorded signal portion.

67. A method according to any preceding claim, wherein the determining of endpoints comprises: dynamic time warping said another set of feature vectors onto said archetype set of feature vectors, including:
determining a first sub-set of feature vectors within said another set of feature vectors from which a dynamic time warping winning path may start; and
determining a second sub-set of feature vectors within said another set of feature vectors at which the dynamic time warping winning path may finish.
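Claims 65 and 66 divide the recorded signal into frames and derive one feature vector per frame, and claim 64 names a mel-cepstral transform. A minimal numpy sketch of the framing step plus a plain (non-mel) cepstral feature per frame; the mel filterbank stage is omitted for brevity, and the frame length, hop, and coefficient count are illustrative:

```python
import numpy as np

def frame_signal(signal, frame_len, hop):
    """Split a 1-D signal into overlapping frames (claim 65)."""
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[k * hop:k * hop + frame_len] for k in range(n)])

def cepstral_features(frames, n_coeffs=12):
    """One feature vector per frame (claim 66): the real cepstrum of the
    windowed frame.  A full mel-cepstral transform (claim 64) would apply
    a mel-warped filterbank to the spectrum before the log step."""
    window = np.hamming(frames.shape[1])
    spectrum = np.abs(np.fft.rfft(frames * window, axis=1))
    log_spec = np.log(spectrum + 1e-10)       # avoid log(0)
    cepstrum = np.fft.irfft(log_spec, axis=1)
    return cepstrum[:, 1:n_coeffs + 1]        # drop the 0th coefficient
```
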
68. A method of dynamic time warping for warping a first speech pattern (A) characterised by a first set of feature vectors onto a second speech pattern (B) characterised by a second set of feature vectors, the method comprising:
identifying a first sub-set of feature vectors within said first set of feature vectors from which a dynamic time warping winning path starts; and
identifying a second sub-set of feature vectors within said first set of feature vectors at which the dynamic time warping winning path finishes.

69. A method according to claim 67 or 68, wherein the identifying of the first sub-set of feature vectors comprises: calculating a distance between a first feature vector (i) at the beginning of said second set of feature vectors and each feature vector (j) of said first sub-set of feature vectors within said first set of feature vectors.

70. A method according to claim 69, further comprising: entering each distance into an array.

71. A method according to any one of claims 68 to 70, further comprising: determining whether to calculate a distance between a feature vector (j) of said first set of feature vectors and a feature vector (i) of said second set of feature vectors.

72. A method according to claim 71, wherein the determining of whether to calculate the distance comprises: determining whether to calculate a distance between a jth feature vector of said first set of feature vectors, which comprises J feature vectors, and an ith feature vector of said second set of feature vectors, which comprises I feature vectors, where I and J are positive integers, by finding whether a value of j falls between a maximum value and a minimum boundary value for a value i.

73. A method according to claim 72, further comprising: calculating a minimum boundary value using:

j(k) ≥ max[2i(k) − 2I + J, ½(i(k) + 1)]

wherein k is a record of the dynamic time warping path and k = i.
74. A method according to claim 72 or 73, further comprising: calculating a maximum value using:

j(k) ≤ min[2i(k) − 1, ½(i(k) − I) + J]

wherein k is a record of the dynamic time warping path and k = i.

75. A method according to any one of claims 72 to 74, comprising: calculating a distance g(i, j) if j falls between the maximum value and the minimum boundary value.

76. A method according to claim 75, wherein the calculating of the distance g(i, j) comprises using:

g(i, j) = min[ g(i − 1, j) · h(k) + d(i, j),
               g(i − 1, j − 1) + d(i, j),
               g(i − 1, j − 2) + d(i, j) ]

wherein h(k) = 1 if c(k − 2) ≠ (i − 2, j) and h(k) = ∞ otherwise, and k is a record of the dynamic time warping path and k = i.

77. A method according to any one of claims 67 to 76, comprising: determining a global distance for a dynamic time warping winning path.

78. A method according to claim 77, wherein the determining of the distance for the dynamic time warping winning path comprises: examining distances associated with the second sub-set of feature vectors; and searching for a lowest value of distance.

79. A method according to any preceding claim, further comprising: performing a plurality of checks on said recorded signal for determining whether said recorded signal is suitable for use in enrolling.

80. A method according to any preceding claim, further comprising: performing a plurality of checks on said another recorded signal for determining whether said another recorded signal is suitable for use in authentication.

81. A method according to claim 79 or 80, wherein the performing of the plurality of checks includes: determining whether a length of spoken utterance included in a recorded signal exceeds a minimum.

82. A method according to any one of claims 79 to 81, wherein the performing of the plurality of checks includes: determining whether a length of silence included in a recorded signal exceeds a minimum.
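The dynamic time warping of claims 68 to 78 can be sketched as below. This is a reconstruction, not the patent's code: it uses the three-predecessor recursion of claim 76, applies the rule against repeated horizontal steps greedily via a per-cell flag rather than by exact two-state dynamic programming, and relaxes the start and end points per claims 67, 68 and 78. The `end_width` parameter is an assumed name for the size of the start/finish sub-sets:

```python
import numpy as np

INF = float("inf")

def dtw_distance(B, A, end_width=2):
    """Warp feature-vector sequence B (length I) onto A (length J) and
    return the winning-path distance.  The path may start within the
    first `end_width` vectors of A and finish within the last `end_width`
    (relaxed endpoints, claims 67-68); the lowest accumulated distance
    among the allowed finishing cells is the global distance (claim 78)."""
    I, J = len(B), len(A)
    d = np.linalg.norm(B[:, None, :] - A[None, :, :], axis=2)  # d[i, j]
    g = np.full((I, J), INF)
    horiz = np.zeros((I, J), dtype=bool)   # last move kept j fixed
    g[0, :end_width] = d[0, :end_width]    # relaxed start (claim 69)
    for i in range(1, I):
        for j in range(J):
            cands = []
            # Horizontal predecessor is barred after a horizontal move
            # (the h(k) rule of claim 76, applied greedily).
            if g[i - 1, j] < INF and not horiz[i - 1, j]:
                cands.append((g[i - 1, j], True))
            if j >= 1 and g[i - 1, j - 1] < INF:
                cands.append((g[i - 1, j - 1], False))
            if j >= 2 and g[i - 1, j - 2] < INF:
                cands.append((g[i - 1, j - 2], False))
            if cands:
                best, was_horiz = min(cands)
                g[i, j] = best + d[i, j]
                horiz[i, j] = was_horiz
    return float(np.min(g[-1, J - end_width:]))
```

The slope-limited band of claims 72 to 74 could additionally be used to skip (i, j) cells outside the allowed region; the sketch above instead lets unreachable cells stay at infinity.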
83. A method according to any one of claims 79 to 82, wherein the performing of the plurality of checks includes: determining whether a signal-to-noise ratio of a recorded signal exceeds a minimum.

84. A method according to any one of claims 79 to 83, wherein the performing of the plurality of checks includes: determining whether energy of a recorded signal exceeds a minimum.

85. A method according to any one of claims 79 to 84, wherein the performing of the plurality of checks includes: determining whether a degree of clipping of a recorded signal exceeds a maximum.

86. A method according to any one of claims 79 to 85, wherein the performing of the plurality of checks includes:
calculating a mean feature vector by averaging a set of feature vectors;
calculating a distance between said mean feature vector and each feature vector;
calculating an average of said distances; and
determining whether said average distance exceeds a minimum.

87. A method according to any one of claims 79 to 86, wherein the performing of the plurality of checks includes:
calculating a mean feature vector by averaging a set of feature vectors;
deriving a set of feature vectors for characterising a recorded signal portion corresponding to a silence interval;
calculating a distance between said mean feature vector and each feature vector corresponding to said silence interval;
calculating an average of said distances; and
determining whether said average distance exceeds a minimum.

88. A method according to any preceding claim, wherein the averaging of said plurality of feature vectors comprises:
comparing each said set of feature vectors with each other set of feature vectors so as to produce a respective set of scores dependent upon a degree of matching;
searching for a minimum score; and
determining whether at least one score is below a predetermined threshold.
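Claims 81 to 85 list suitability checks on a recording: utterance length, silence length, signal-to-noise ratio, overall energy, and clipping. A sketch of such checks for a signal normalised to [−1, 1]; the speech/silence split and every threshold value are illustrative, not taken from the patent:

```python
import numpy as np

def check_recording(signal, sample_rate, *,
                    min_speech_s=0.3, min_snr_db=10.0,
                    min_energy=1e-4, max_clip_fraction=0.01):
    """Return named pass/fail results for checks in the spirit of
    claims 81-85, on a recording normalised to [-1, 1]."""
    frame = max(1, sample_rate // 100)                 # 10 ms frames
    n = len(signal) // frame
    energies = np.array([np.mean(signal[k * frame:(k + 1) * frame] ** 2)
                         for k in range(n)])
    threshold = 0.1 * np.max(energies)                 # crude speech/silence split
    speech = energies > threshold
    if speech.any() and (~speech).any():
        snr_db = 10 * np.log10((np.mean(energies[speech]) + 1e-12) /
                               (np.mean(energies[~speech]) + 1e-12))
    else:
        snr_db = np.inf
    return {
        "speech_long_enough": speech.sum() * frame / sample_rate >= min_speech_s,
        "snr_ok": snr_db >= min_snr_db,
        "energy_ok": np.mean(signal ** 2) >= min_energy,
        "clipping_ok": np.mean(np.abs(signal) > 0.99) <= max_clip_fraction,
    }
```
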
89. A method of averaging a plurality of feature vectors, the method comprising:
providing a plurality of sets of feature vectors;
comparing each said set of feature vectors with each other set of feature vectors so as to produce a respective set of scores dependent upon a degree of matching;
searching for a minimum score; and
determining whether at least one score is below a predetermined threshold.

90. A method according to claim 88 or 89, further comprising:
generating a plurality of archetype sets of feature vectors by dynamic time warping each set of feature vectors with each other set of feature vectors;
comparing each said set of feature vectors with each archetype set of feature vectors so as to produce a respective set of scores dependent upon a degree of matching; and
arranging said scores in a first array according to set of feature vectors and archetype set of feature vectors.

91. A method according to claim 90, further comprising: generating another array by averaging elements within said array corresponding to said sets of feature vectors.

92. A method according to claim 91, further comprising: searching for a maximum value within said another array.

93. A method according to claim 92, further comprising: excluding one of said sets of feature vectors if said maximum value within said another array exceeds a predetermined threshold.

94. A method according to claim 93, wherein said excluding of one of said sets of feature vectors includes: calculating a variance for each archetype set of feature vectors and excluding one of said sets of feature vectors whose corresponding archetype set of feature vectors has the lowest variance.

95. A computer program for performing the method according to any preceding claim.

96. Apparatus configured to perform the method according to any one of claims 1 to 94.

97. Apparatus according to claim 96 including a processor.

98. Apparatus according to claim 97 further including memory.
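Claims 88 to 93 cross-compare each enrolment utterance with the others and exclude an utterance whose scores are consistently poor before building the archetype. The selection logic can be sketched as below, with a caller-supplied `score_fn` standing in for the patent's dynamic time warping comparison and an illustrative threshold:

```python
import numpy as np

def find_outlier_set(feature_sets, score_fn, threshold):
    """Score every set of feature vectors against every other set
    (claims 88-90), average each set's scores (claim 91), and return the
    index of the set to exclude if its average exceeds the threshold
    (claims 92-93); return None if every set is acceptable."""
    n = len(feature_sets)
    scores = np.zeros((n, n))
    for a in range(n):
        for b in range(n):
            if a != b:
                scores[a, b] = score_fn(feature_sets[a], feature_sets[b])
    averages = scores.sum(axis=1) / (n - 1)   # mean score per set
    worst = int(np.argmax(averages))
    return worst if averages[worst] > threshold else None
```
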
99. Apparatus for voice authentication comprising:
means for enrolling a user, including:
means for requesting said enrolling user to provide a spoken response to a prompt;
means for obtaining a recorded signal including a recorded signal portion corresponding to said spoken response;
means for determining endpoints of said recorded signal portion;
means for deriving a set of feature vectors for characterising said recorded signal portions;
means for averaging one or more sets of feature vectors, each set of data relating to one or more different spoken responses to the prompt by said enrolling user, so as to provide an archetype set of feature vectors for said response; and
means for storing said archetype set of feature vectors together with data relating to said prompt;
and means for authenticating a user, including:
means for retrieving said data relating to said prompt and said archetype set of feature vectors;
means for requesting said authenticating user to provide another spoken response to said prompt;
means for obtaining another recorded signal including another recorded signal portion corresponding to said other spoken response;
means for determining endpoints of said other recorded signal portion;
means for deriving another set of feature vectors for characterising said other recorded signal portions;
means for comparing said another set of feature vectors with said archetype set of feature vectors so as to produce a score dependent upon a degree of matching; and
means for comparing said score with a predefined threshold so as to determine whether said enrolling user and said authenticating user are the same.
100. A smart card for voice authentication comprising:
means for storing a first set of feature vectors and data relating to a prompt;
means for providing said data to an external circuit;
means for receiving a second set of feature vectors relating to said prompt;
means for comparing said first and second sets of feature vectors so as to determine a score; and
means for comparing said score with a predetermined threshold.

101. A gain controller configured to repeatedly determine whether an amplified signal level is above a predetermined limit and to permit gain to be decreased or maintained but not increased.

102. A gain controller configured to repeatedly determine whether an amplified signal level is below a predetermined limit and to permit gain to be increased or maintained but not decreased.
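The gain controllers of claims 101 and 102 are each one-directional: the first may only decrease or hold the gain, the second only increase or hold it. A sketch of both behaviours, with an assumed multiplicative `step` factor (the claims do not specify how the gain changes, only in which direction it may change):

```python
def ramp_down_gain(levels, limit, gain=1.0, step=0.9):
    """Claim 101: repeatedly check whether the amplified level exceeds
    the limit and only ever decrease (or hold) the gain."""
    for level in levels:
        if level * gain > limit:
            gain *= step          # decrease
        # otherwise hold; increasing is not permitted
    return gain

def ramp_up_gain(levels, limit, gain=1.0, step=1.1):
    """Claim 102: the counterpart controller, which only ever increases
    (or holds) the gain while the amplified level stays below the limit."""
    for level in levels:
        if level * gain < limit:
            gain *= step          # increase
        # otherwise hold; decreasing is not permitted
    return gain
```
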
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB0211842A GB2388947A (en) | 2002-05-22 | 2002-05-22 | Method of voice authentication |
| AU2003230039A AU2003230039A1 (en) | 2002-05-22 | 2003-05-22 | Voice authentication |
| PCT/GB2003/002246 WO2003098373A2 (en) | 2002-05-22 | 2003-05-22 | Voice authentication |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB0211842A GB2388947A (en) | 2002-05-22 | 2002-05-22 | Method of voice authentication |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| GB0211842D0 GB0211842D0 (en) | 2002-07-03 |
| GB2388947A true GB2388947A (en) | 2003-11-26 |
Family
ID=9937239
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| GB0211842A Withdrawn GB2388947A (en) | 2002-05-22 | 2002-05-22 | Method of voice authentication |
Country Status (3)
| Country | Link |
|---|---|
| AU (1) | AU2003230039A1 (en) |
| GB (1) | GB2388947A (en) |
| WO (1) | WO2003098373A2 (en) |
Cited By (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| GB2407681A (en) * | 2003-10-29 | 2005-05-04 | Vecommerce Ltd | Determining the likelihood of voice identity fraud |
| WO2005119653A1 (en) * | 2004-06-04 | 2005-12-15 | Philips Intellectual Property & Standards Gmbh | Method and dialog system for user authentication |
| WO2008095768A1 (en) * | 2007-02-08 | 2008-08-14 | Nuance Communications, Inc. | System and method for telephonic user authentication |
| WO2010066269A1 (en) * | 2008-12-10 | 2010-06-17 | Agnitio, S.L. | Method for verifying the identify of a speaker and related computer readable medium and computer |
| US8817964B2 (en) | 2008-02-11 | 2014-08-26 | International Business Machines Corporation | Telephonic voice authentication and display |
| GB2541466A (en) * | 2015-08-21 | 2017-02-22 | Validsoft Uk Ltd | Replay attack detection |
| GB2545534A (en) * | 2016-08-03 | 2017-06-21 | Cirrus Logic Int Semiconductor Ltd | Methods and apparatus for authentication in an electronic device |
| US10552595B2 (en) | 2016-11-07 | 2020-02-04 | Cirrus Logic, Inc. | Methods and apparatus for authentication in an electronic device |
| US10691780B2 (en) | 2016-08-03 | 2020-06-23 | Cirrus Logic, Inc. | Methods and apparatus for authentication in an electronic device |
| EP3751561A3 (en) * | 2015-10-16 | 2020-12-30 | Google LLC | Hotword recognition |
Families Citing this family (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| AU2012200605B2 (en) * | 2008-09-05 | 2014-01-23 | Auraya Pty Ltd | Voice authentication system and methods |
| CA2736133C (en) | 2008-09-05 | 2016-11-08 | Auraya Pty Ltd | Voice authentication system and methods |
| GB2566215B (en) * | 2016-06-06 | 2022-04-06 | Cirrus Logic Int Semiconductor Ltd | Voice user interface |
| WO2019173304A1 (en) * | 2018-03-05 | 2019-09-12 | The Trustees Of Indiana University | Method and system for enhancing security in a voice-controlled system |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP0424071A2 (en) * | 1989-10-16 | 1991-04-24 | Logica Uk Limited | Speaker recognition |
| US5794195A (en) * | 1994-06-28 | 1998-08-11 | Alcatel N.V. | Start/end point detection for word recognition |
| WO2000054257A1 (en) * | 1999-03-11 | 2000-09-14 | British Telecommunications Public Limited Company | Speaker recognition |
| US6195638B1 (en) * | 1995-03-30 | 2001-02-27 | Art-Advanced Recognition Technologies Inc. | Pattern recognition system |
| US6249760B1 (en) * | 1997-05-27 | 2001-06-19 | Ameritech Corporation | Apparatus for gain adjustment during speech reference enrollment |
Family Cites Families (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPH07121630B2 (en) * | 1987-05-30 | 1995-12-25 | 株式会社東芝 | IC card |
| US4827518A (en) * | 1987-08-06 | 1989-05-02 | Bell Communications Research, Inc. | Speaker verification system using integrated circuit cards |
| GB9617426D0 (en) * | 1996-08-20 | 1996-10-02 | Domain Dynamics Ltd | Signal processing arrangements |
| US8266451B2 (en) * | 2001-08-31 | 2012-09-11 | Gemalto Sa | Voice activated smart card |
-
2002
- 2002-05-22 GB GB0211842A patent/GB2388947A/en not_active Withdrawn
-
2003
- 2003-05-22 AU AU2003230039A patent/AU2003230039A1/en not_active Abandoned
- 2003-05-22 WO PCT/GB2003/002246 patent/WO2003098373A2/en not_active Ceased
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP0424071A2 (en) * | 1989-10-16 | 1991-04-24 | Logica Uk Limited | Speaker recognition |
| US5794195A (en) * | 1994-06-28 | 1998-08-11 | Alcatel N.V. | Start/end point detection for word recognition |
| US6195638B1 (en) * | 1995-03-30 | 2001-02-27 | Art-Advanced Recognition Technologies Inc. | Pattern recognition system |
| US6249760B1 (en) * | 1997-05-27 | 2001-06-19 | Ameritech Corporation | Apparatus for gain adjustment during speech reference enrollment |
| WO2000054257A1 (en) * | 1999-03-11 | 2000-09-14 | British Telecommunications Public Limited Company | Speaker recognition |
Cited By (18)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| GB2407681B (en) * | 2003-10-29 | 2007-02-28 | Vecommerce Ltd | Voice recognition system and method |
| GB2407681A (en) * | 2003-10-29 | 2005-05-04 | Vecommerce Ltd | Determining the likelihood of voice identity fraud |
| WO2005119653A1 (en) * | 2004-06-04 | 2005-12-15 | Philips Intellectual Property & Standards Gmbh | Method and dialog system for user authentication |
| WO2008095768A1 (en) * | 2007-02-08 | 2008-08-14 | Nuance Communications, Inc. | System and method for telephonic user authentication |
| US8817964B2 (en) | 2008-02-11 | 2014-08-26 | International Business Machines Corporation | Telephonic voice authentication and display |
| WO2010066269A1 (en) * | 2008-12-10 | 2010-06-17 | Agnitio, S.L. | Method for verifying the identify of a speaker and related computer readable medium and computer |
| US20110246198A1 (en) * | 2008-12-10 | 2011-10-06 | Asenjo Marta Sanchez | Method for veryfying the identity of a speaker and related computer readable medium and computer |
| US8762149B2 (en) | 2008-12-10 | 2014-06-24 | Marta Sánchez Asenjo | Method for verifying the identity of a speaker and related computer readable medium and computer |
| WO2010066310A1 (en) | 2008-12-10 | 2010-06-17 | Agnitio, S.L. | Method for verifying the identity of a speaker, system therefore and computer readable medium |
| US9792912B2 (en) | 2008-12-10 | 2017-10-17 | Agnitio Sl | Method for verifying the identity of a speaker, system therefore and computer readable medium |
| GB2541466A (en) * | 2015-08-21 | 2017-02-22 | Validsoft Uk Ltd | Replay attack detection |
| GB2541466B (en) * | 2015-08-21 | 2020-01-01 | Validsoft Ltd | Replay attack detection |
| EP3751561A3 (en) * | 2015-10-16 | 2020-12-30 | Google LLC | Hotword recognition |
| GB2545534A (en) * | 2016-08-03 | 2017-06-21 | Cirrus Logic Int Semiconductor Ltd | Methods and apparatus for authentication in an electronic device |
| GB2545534B (en) * | 2016-08-03 | 2019-11-06 | Cirrus Logic Int Semiconductor Ltd | Methods and apparatus for authentication in an electronic device |
| US10691780B2 (en) | 2016-08-03 | 2020-06-23 | Cirrus Logic, Inc. | Methods and apparatus for authentication in an electronic device |
| US10878068B2 (en) | 2016-08-03 | 2020-12-29 | Cirrus Logic, Inc. | Methods and apparatus for authentication in an electronic device |
| US10552595B2 (en) | 2016-11-07 | 2020-02-04 | Cirrus Logic, Inc. | Methods and apparatus for authentication in an electronic device |
Also Published As
| Publication number | Publication date |
|---|---|
| WO2003098373A3 (en) | 2004-04-29 |
| GB0211842D0 (en) | 2002-07-03 |
| WO2003098373A2 (en) | 2003-11-27 |
| AU2003230039A8 (en) | 2003-12-02 |
| AU2003230039A1 (en) | 2003-12-02 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US10950245B2 (en) | Generating prompts for user vocalisation for biometric speaker recognition | |
| US6480825B1 (en) | System and method for detecting a recorded voice | |
| US5167004A (en) | Temporal decorrelation method for robust speaker verification | |
| JP4802135B2 (en) | Speaker authentication registration and confirmation method and apparatus | |
| US8428945B2 (en) | Acoustic signal classification system | |
| US20100017209A1 (en) | Random voiceprint certification system, random voiceprint cipher lock and creating method therefor | |
| US8160877B1 (en) | Hierarchical real-time speaker recognition for biometric VoIP verification and targeting | |
| US7447632B2 (en) | Voice authentication system | |
| EP1159737B9 (en) | Speaker recognition | |
| Sambur | Speaker recognition using orthogonal linear prediction | |
| ES2275700T3 (en) | PROCEDURE AND APPARATUS FOR CREATING VOICE TEMPLATES FOR AN INDEPENDENT VOICE RECOGNITION SYSTEM. | |
| US20150112682A1 (en) | Method for verifying the identity of a speaker and related computer readable medium and computer | |
| US20120173239A1 (en) | Method for verifying the identityof a speaker, system therefore and computer readable medium | |
| US7603275B2 (en) | System, method and computer program product for verifying an identity using voiced to unvoiced classifiers | |
| GB2552722A (en) | Speaker recognition | |
| GB2388947A (en) | Method of voice authentication | |
| KR101888058B1 (en) | The method and apparatus for identifying speaker based on spoken word | |
| US20100063817A1 (en) | Acoustic model registration apparatus, talker recognition apparatus, acoustic model registration method and acoustic model registration processing program | |
| EP0424071A2 (en) | Speaker recognition | |
| JP4440414B2 (en) | Speaker verification apparatus and method | |
| Cristea et al. | New cepstrum frequency scale for neural network speaker verification | |
| JPH0449952B2 (en) | ||
| Hsieh et al. | A robust speaker identification system based on wavelet transform | |
| JP2001350494A (en) | Verification device and verification method | |
| Genoud et al. | Deliberate imposture: a challenge for automatic speaker verification systems. |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| WAP | Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1) |