CHANNEL ESTIMATION SYSTEM AND METHOD FOR USE IN AUTOMATIC
SPEAKER VERIFICATION SYSTEMS
SPECIFICATION
TO ALL WHOM IT MAY CONCERN:
BE IT KNOWN, that we: Richard J. Mammone, a resident of Bridgewater, New Jersey and a citizen of the United States, Rajesh Balchandran, a resident of Somerset,
New Jersey and a citizen of India, Alvin Garcia, a resident of New Brunswick, New Jersey
and a citizen of the United States, and Vidhya Ramanujam, a resident of Somerset, New Jersey
and a citizen of India, have invented certain new and useful improvements in CHANNEL
ESTIMATION SYSTEM AND METHOD FOR USE IN AUTOMATIC SPEAKER
VERIFICATION SYSTEMS, of which the following is a specification.
CROSS REFERENCE TO RELATED APPLICATIONS
This application incorporates by reference the application entitled VOICE PRINT
SYSTEM AND METHOD, serial number 08/976,280, filed November 21, 1997, which
claims priority from Provisional Application 60/031,639, filed November 22, 1996, entitled
VOICE PRINT SYSTEM.
BACKGROUND OF THE INVENTION
The invention is directed to improved systems and methods of channel estimation
useful in automatic speaker verification (ASV) systems and methods.
1. Field of The Invention.
The invention relates to the fields of digital speech processing and speaker recognition.
2. Description of Related Art
In many situations it is desired to verify the identity of a person, such as a consumer.
Voice verification systems, sometimes known as automatic speaker verification (ASV)
systems, have recently been developed to attempt to cure the deficiencies of manual systems
and methods, such as credit card signature and/or photograph verification by merchants. The
automatic speaker verification (ASV) systems attempt to match the voice of the person whose
identity is undergoing verification with a known voice.
One type of voice recognition system is a text-dependent automatic speaker
verification system. The text-dependent ASV system requires that the user speak a specific
password or phrase (the "password"). This password is determined by the system or by the user during enrollment. However, in most text-dependent ASV systems, the password is constrained to be within a fixed vocabulary, such as a limited number of numerical digits. The
limited number of password phrases gives an imposter a higher probability of discovering a
person's password, reducing the reliability of the system.
Other text-independent ASV systems of the prior art utilize a user-selectable
password. In such systems, the user enjoys the freedom to make up his/her own password
with no constraints on vocabulary words or language. The disadvantage of these types of
systems is that they increase the processing requirement of the system because it is much more
technically challenging to model and verify a voice pattern of an unknown transcript (i.e. a
highly variable context).
Modeling of speech has been done at the phrase, word, and subword level. In recent
years, several subword-based speaker verification systems have been proposed using either
Hidden Markov Models ("HMM") or Artificial Neural Network ("ANN") references.
Modeling at the subword level expands the versatility of the system. Moreover, it is also
conjectured that the variations in speaking styles among different speakers can be better
captured by modeling at the subword level.
One of the major problems in ASV systems is that channel distortion occurs in the transmission channels. Performance of speech and speaker recognition systems degrades
when there is a mismatch between training and testing conditions. A significant part of the
mismatch is caused by differences in transmission channels. Channel estimation and
normalization (hereinafter, both referred to as "channel estimation" unless separately
identified) are used to combat the problems of channel distortion and differences in
transmission channels used in the ASV systems.
The conventional method to combat this problem is to use cepstral analysis and
homomorphic deconvolution. Cepstral Mean Subtraction (CMS) is an implementation of
homomorphic deconvolution using cepstral coefficients. In removing the effects of channel
distortion, however, CMS also undesirably extracts substantial amounts of the desired speech
information. This causes a significant loss of performance in the present ASV systems. One method that attempts to overcome this is pole filtering approximation. While pole filtering
improves performance by decoupling some of the speech information from the channel information, improved techniques are needed to substantially limit the extraction of the desired
speech information.
What is needed is a user-selectable ASV system in which accuracy is improved over
prior ASV systems.
What is needed is improved channel adaptation to adapt a system in response to signals
received over different channels.
What is needed is an improved and more accurate method and system of channel estimation.
SUMMARY OF THE INVENTION
The voice print system of the present invention builds and improves upon existing ASV
systems. The voice print system of the present invention is a subword-based, text-dependent
automatic speaker verification system that embodies the capability of user-selectable
passwords with no constraints on the choice of vocabulary words or the language.
One component of the preferred ASV system is a channel estimation and normalization component. Channel estimation and normalization removes the nonuniform effects of different
data transmission environments which lead to varying channel characteristics, such as
distortion. Channel normalization is able to remove the characteristics of the test channel
and/or enrollment channel to increase accuracy. The preferred methods and systems of the present invention, termed Curve-Fitting and Clean Speech, used separately, together, or in combination with Pole filtering, significantly improve the existing methods of channel
estimation and normalization. Unlike Cepstral Mean Subtraction, both the Curve-Fitting and
Clean Speech methods and systems extract only the channel related information from the cepstral mean and not any speech information. In this manner, channel distortion is more
accurately eliminated.
The improved voice print system using the inventive channel estimation methods can
be employed for user validation for telephone services such as cellular phone services and bill-
to-third-party phone services. It can also be used for account validation for information
system access.
All ASV systems include at least two components, an enrollment component and a
testing component. The enrollment component is used to store information concerning a
user's voice. This information is then compared to the voice undergoing verification (testing)
by the test component. The system of the present invention includes inventive enrollment and
testing components, as well as a third, "bootstrap" component. The bootstrap component is
used to generate data which assists the enrollment component to model the user's voice. Each of these components comprises the channel estimation and normalization techniques of the
present invention.
1. Enrollment Summary.
An enrollment component is used to characterize a known user's voice and store the characteristics in a database, so that this information is available for future comparisons. The
system of the present invention utilizes an improved enrollment process. During enrollment,
the user speaks the password, which is sampled by the system. Analog-to-digital conversion (if necessary) is conducted to obtain digital speech samples. Preprocessing is performed to
remove unwanted silence and noise from the voice sample, and to indicate portions of the
voice sample which correspond to the user's voice.
Next, the transmission channel carrying the user's enrollment voice signal is examined.
The characteristics of the enrollment channel are estimated and stored in a database. The
database may be indexed by identification information, such as by the user's name, credit card
number, account identifier, etc.
The Curve-Fitting and Clean Speech methods and systems of the present invention
improve on the estimation of the enrollment channel. Specifically, both of these methods and
systems account for deficiencies inherent in the Cepstral Mean Subtraction method of channel
estimation. The Curve-Fitting method performs an approximation to the ideal general
characteristics of the transmission channel. The Clean Speech method separately estimates the
speech information contained in the cepstral mean using speech (i.e., "clean speech") which is
not perturbed by the channel. These methods and systems can be used alone, together, or in combination with Pole filtering to reduce the amount of speech information contained in the
channel estimate. Therefore, the enrollment channel can be more accurately recalled from the
stored characteristics.
Feature extraction is then performed to extract features of the user's voice, such as pitch, spectral frequencies, intonations, etc., and/or desired segments of the voice sample. A reference template may be generated from the extracted features. Next, segmentation of the
voice segment occurs. Segmentation divides the voice sample into a number of subwords.
The present invention uses subword modeling and may use any of the known techniques, but preferably uses a discriminant training based hierarchical classifier called a Neural Tree
Network (NTN). The NTN is a hierarchical classifier that combines the properties of decision
trees and feed-forward Neural Networks.
The system also utilizes the principles of multiple classifier fusion and data resampling.
The additional classifier used herein is the Gaussian Mixture Model (GMM) classifier. If only
a small amount of data is available, data resampling is used to artificially expand the size of the
sample pool and improve the generalizations of the classifiers.
A fusion function, which is set and then stored in the database, is used to weight the individual classifier scores, and to set a threshold value. The threshold value is stored in the
database for use in the verification process. Thus, enrollment produces a voice print database
containing an index (such as the user's name or credit card number), along with enrollment
channel data, classifier models, feature vectors, segmentation information, multiple trained
classifier data, fusion constant, and a recognition threshold.
2. Test Component Summary.
The test component is the component which performs the verification. During testing or verification, the system first accepts "test speech" and index information from a user
claiming to be the person identified by the index information. Voice data indexed in the
database is retrieved and used to process the test speech sample.
During verification, the user speaks the password into the system. This "test speech"
password undergoes preprocessing, as previously described, with respect to the enrollment
component. The next step is to perform channel normalization or channel adaptation.
In channel normalization, the transmission channel carrying the user's test voice signal
is examined. Channel normalization is performed if the enrollment channel was also normalized. The characteristics of the test channel are normalized to remove the effects of the
test channel from the test voice signal. The channel normalization may be performed with the
Curve Fitting or Clean Speech methods of the present invention, either alone or in
combination.
Alternatively, channel adaptation is performed by removing from the test sample the
characteristics of the channel from which the test sample was received. Next, the
characteristics of the enrollment channel which were stored by the enrollment component are recalled. The test sample is filtered through the recalled enrollment channel. This type of
channel adaptation removes the characteristics from the test channel and supplies the
characteristics from the enrollment channel to the test speech so that the test speech matches
the transmission channel of the originally enrolled speech.
The present invention also improves on the channel adaptation in the testing component.
In removing the characteristics of the test sample, the Curve Fitting and Clean Speech
methods and systems, whether used alone, together, or in combination with Pole filtering,
reduce the amount of speech information removed with the test channel.
After channel adaptation, feature extraction is performed on the test sample. This occurs
as previously described with respect to the enrollment component. After feature extraction, it
is desired to locate, or "spot" the phrases in the test speech and simultaneously avoid areas of background noise.
The performance of ASV systems can be significantly degraded by background noise and sounds. To combat the effects of background noise, the invention uses a key word/key phrase spotter to identify the password phrase. After key word/key phrase spotting,
automatic speech segmentation occurs. Preferably "force" alignment segmentation, and not
"blind" segmentation, is used. The "force" segmentation results in the identification of
subword borders. The multiple classifiers of the enrollment component are used to "score" the subword
data, and the scores are then fused, or combined, into a "final score". If the final score exceeds
verification, as well as other related details, may be stored in the database as well.
3. "Bootstrapping" Component Summary.
Bootstrapping is used to generate a pool of speech data representative of the speech of
nonspeakers, or "antispeakers." This data is used during enrollment to train the discriminant
training-based classifiers. Bootstrapping involves obtaining voice samples from antispeakers,
preprocessing the voice samples (as in the enrollment phase), and inverse channel filtering the
preprocessed voice samples. Inverse channel filtering removes the characteristics of the
channel on which the antispeaker voice sample is obtained. The Curve Fitting and Clean Speech methods and systems, alone, together, or in combination with Pole Filtering, also
improve on the inverse channel filtering since they produce a more accurate inverse channel filter.
After inverse channel filtering, feature generation and automatic blind voice segmentation occur, as in the enrollment component. The segments and feature vectors are
stored in an antispeaker database for use by the enrollment component.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1A is a diagram of an enrollment component of the present invention.
Figure 1B shows pseudo-code for creating a filter to perform the channel estimation shown in Figure 1A.
Figure 1C shows pseudo-code for inverting the filter of Figure 1B.
Figure 1D shows a flow diagram for performing the Curve-Fit channel estimation.
Figure 1E shows a chart of an actual channel and a channel obtained from a cepstral mean.
Figure 1F shows a chart of an actual channel and a channel obtained from Curve-Fitting on a cepstral mean.
Figure 1G shows a chart of an inverse channel and an inverse channel obtained from Curve-Fitting on a cepstral mean.
Figure 1H shows a flow diagram for performing Clean Speech channel normalization.
Figure 2 is a diagram of a testing component of the present invention.
Figures 3A and 3B are flow diagrams of a channel adaptation module, shown in Figure 2, of the present invention.
Figure 4 is a diagram of a bootstrapping component, used to generate antispeaker data in the system of Figure 1A.
DESCRIPTION OF THE PREFERRED EMBODIMENT(S)
The preferred system used with the present invention includes an enrollment
component, a testing component, and a bootstrap component. The enrollment component
uses antispeaker data to generate and store information concerning a user's voice. The information concerning the user's voice is compared to the voice undergoing verification
(testing) by the test component. The bootstrap component is used to provide initial antispeaker data for use by the enrollment component, such that the enrollment component
may properly perform its function of generating data concerning the user's voice. The
performance of the above components can be significantly improved through using the
preferred embodiments of the channel estimation modules including the Curve Fitting and
Clean Speech methods and systems, and combinations thereof.
1. Enrollment Component- Detailed Description.
The enrollment component is used to store information (using supervised learning)
about a known user's voice into a voice print database, so that this information is available for
future comparisons. In the preferred embodiment, the enrollment component also stores
information concerning the channel on which the user provides the speech, the "enrollment
channel" into the voice print database.
Figure 1A shows the enrollment component 10. As shown, the first step 20 is to obtain enrollment speech (the password) and to obtain 26 an index, such as the user's name or credit card number. The enrollment speech may be obtained via a receiver, telephone or other sources, and be received from any transmission media, digital or analog, including terrestrial links, land lines, satellite, microwave, etc. More than one sample of enrollment
speech should be supplied, each of which is used to generate multiple data sets. Preferably,
four enrollment samples are supplied and processed.
The enrollment speech is then analog-to-digital converted 25, if necessary. Analog-to-digital conversion can be performed with standard telephony boards such as those manufactured by Dialogic. A speech encoding method such as the ITU G.711 standard μ-law and A-law can be used to encode the speech samples. Preferably, a sampling rate of 8000 Hz is used.
Alternatively, the speech may be obtained in digital format, such as from an ISDN
transmission. In such a case, a telephony board is used to handle Telco signaling protocol.
In the preferred embodiment, the computer processing unit for the speaker verification
system is an Intel Pentium platform general purpose computer processing unit (CPU) of at
least 100 MHz having about 10 MB of associated RAM memory and a hard or fixed drive as
storage. Alternatively, an additional embodiment could be the Dialogic Antares card.
The digital enrollment speech is then pre-processed 30. Preprocessing 30 may include one or more of the following techniques:
Digital filtering using pre-emphasis. In this case, a digital filter H(z) = 1 - αz^(-1) is used, where α is set between 0.9 and 1.0.
Silence removal using energy and zero-crossing statistics. The success of this technique is primarily based on finding a short interval which is guaranteed to be background silence (generally found in the first few milliseconds of the utterance, before the speaker actually starts speaking). Thresholds are set using the silence region statistics, in order to discriminate between speech and silence frames.
Silence removal based on an energy histogram. In this method, a histogram of frame energies is generated. A threshold energy value is determined based on the assumption that the biggest peak in the histogram at the lower energy region shall correspond to the background silence frame energies. This threshold energy value is used to perform speech versus silence discrimination.
DC Bias removal to remove DC bias introduced by analog-to-digital hardware or other components. The mean value of the signal is computed over the entire voice sample and then is subtracted from the voice samples.
In the preferred embodiment, the following preprocessing is conducted: silence removal using the energy histogram technique (20 bins in the histogram), signal mean removal to remove DC bias, and signal pre-emphasis using a filter coefficient α = 0.95. The preprocessing is preferably conducted using Hamming windowed analysis frames, with 30 millisecond analysis frames and
10 millisecond shifts between adjacent frames.
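By way of illustration only, the preferred preprocessing chain can be sketched in a few lines of Python. This is a minimal sketch assuming NumPy; the function name, the use of the largest histogram bin as the background-silence peak, and the default arguments are illustrative choices, not part of the specification.

```python
import numpy as np

def preprocess(signal, fs=8000, alpha=0.95, n_bins=20):
    """Sketch of the preferred preprocessing: DC bias removal, pre-emphasis
    H(z) = 1 - alpha*z^-1, and energy-histogram silence removal over
    Hamming-windowed 30 ms frames with 10 ms shifts."""
    signal = signal - np.mean(signal)                  # DC bias removal
    signal = np.append(signal[0], signal[1:] - alpha * signal[:-1])  # pre-emphasis

    frame_len, shift = int(0.030 * fs), int(0.010 * fs)
    window = np.hamming(frame_len)
    frames = [window * signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, shift)]

    # Energy histogram: assume the biggest low-energy peak corresponds to
    # background silence, and keep only frames above that threshold.
    energies = np.array([np.sum(f ** 2) for f in frames])
    hist, edges = np.histogram(energies, bins=n_bins)
    threshold = edges[np.argmax(hist) + 1]
    return [f for f, e in zip(frames, energies) if e > threshold]
```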
Following preprocessing 30, channel estimation 40 is performed. This procedure
stores characteristics of the enrollment channel in the voice print database 115. The voice
print database 115 may be RAM, ROM, EPROM, EEPROM, hard disk, CD ROM, writeable
CD ROM, minidisk, a file server, or other storage device. In order to estimate the channel,
the distortion present on the channel is considered by the following embodiments. Each of the
following preferred embodiments can be implemented with software and the computer
processing unit defined above as an Intel Pentium platform general purpose computer
processing unit (CPU) of at least 100 MHz having about 10 MB of associated RAM memory and
a hard or fixed drive as storage. Alternatively, an additional embodiment could be the
Dialogic Antares card. The present invention, however, is not limited to these preferred
embodiments and could be implemented using other digital signal processors, computers, or
neural networks.
A. Cepstral Mean Subtraction and Pole Filtering
A speech signal with frequency spectrum S(ω) is distorted by a transmission channel with frequency response H(ω). The frequency spectrum of the distorted speech Y(ω) is given as:

Y(ω) = H(ω) · S(ω)

If the logarithm and inverse Fourier transform (F⁻¹) of the magnitude of both sides of the equation are taken, the following equation results:

F⁻¹{log(|Y(ω)|)} = F⁻¹{log(|H(ω)|)} + F⁻¹{log(|S(ω)|)}

then, in the cepstral domain:

C_Y(n) = C_H(n) + C_S(n)     (1)

Cepstrum is defined as the inverse Fourier transform of the logarithm of the short-time spectral magnitude. Time-invariant convolutional distortion H(ω) can be eliminated by Cepstral Mean Subtraction (CMS) or Cepstral Mean Normalization (CMN), which is averaging in the cepstral domain and subtracting the average component. For example, taking the mean of the above cepstral domain equation:
E[C_Y(n)] = E[C_H(n)] + E[C_S(n)],

where E[·] represents the expected value. Assuming the channel to be time invariant, the channel cepstrum is a constant additive component in the above equation. For sufficiently long utterances the average speech cepstrum E[C_S(n)] is assumed to tend to zero. Therefore, E[C_Y(n)] can be assumed to represent the channel cepstrum only, that is:

E[C_Y(n)] = C_H(n)     (2)

The above equation is subtracted from cepstrum domain equation (1), that is,

C_Y(n) - E[C_Y(n)] = C_H(n) - C_H(n) + C_S(n)

and,

C_S(n) = C_Y(n) - E[C_Y(n)],
which is equivalent to inverse filtering the speech with the inverse channel.
Thus, CMS may be conducted on the cepstral features obtained for the voice signal to remove the distortion of the channel.
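In code, CMS in the terms of equation (1) reduces to subtracting the time average of the cepstral vectors. The sketch below assumes NumPy and that each row of cepstra holds one frame's cepstral coefficient vector C_Y(n); the names are illustrative.

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """cepstra: (num_frames, num_coeffs) array of C_Y(n) vectors.
    Returns C_Y(n) - E[C_Y(n)], which approximates C_S(n) under the
    assumption that E[C_S(n)] tends to zero for long utterances."""
    return cepstra - np.mean(cepstra, axis=0)
```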
While CMS may be used alone to remove the effects of the channel distortion, the
cepstral mean may include information other than the estimate of the time-invariant
convolutional distortion, such as coarse spectral distribution of the speech itself.
Pole filtering attempts to decouple the speech information from the channel
information in the cepstral mean. Since cepstrum is the weighted combination of LP poles or
spectral components, the effect of individual components on the cepstral mean was examined.
It was found that broad band-width components exhibited smoother frequency characteristics
corresponding to the "roll-off" of channel distortion, whereas narrow band-width components in the inverse filter were influenced more by speech characteristics. Thus, the
narrow band-width LP poles were selectively deflated by broadening their bandwidth and
keeping their frequency the same.
Therefore, for every frame of speech, the pole filtered cepstral coefficients (PFCC) are computed along with LP-derived cepstral coefficients (LPCC). To achieve cepstral mean
subtraction, the mean of the PFCC is subtracted from the LPCC, instead of the regular LPCC mean. This procedure is called pole filtered cepstral mean subtraction (PF-CMS).
To perform PF-CMS, the procedure outlined in the flow chart of Figure 1B is followed. With reference to Figure 1B, the first block of pseudo-code 42 sets the pole bandwidth threshold. Next the LP poles z_i and their pole-filtered counterparts ẑ_i are obtained, and the LPCC and PFCC are evaluated 44. This allows the mean of the PFCC vectors to be computed 46, which may be saved 48 as a channel estimate in the voice print database 115. The PFCC mean may be used to create an LPC filter.
An inverse of this filter may be generated as shown in Figure 1C. First, the PFCC mean is converted from the cepstral to the LPC filter coefficient domain 52. Next, the LPC filter may be inverted 54, and speech passed through the inverted filter 56.
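A rough sketch of the PFCC computation follows. It assumes the LP poles of each frame are available as complex numbers, and expresses the pole bandwidth threshold of Figure 1B as a cap on pole radius, since a pole's bandwidth grows as its radius shrinks; the function name and the r_max value are illustrative.

```python
import numpy as np

def pole_filtered_cepstrum(lp_poles, n_coeffs, r_max=0.9):
    """Deflate narrow band-width LP poles (radius near 1) to radius r_max,
    broadening their bandwidth while keeping their frequency, then compute
    the cepstrum of the resulting all-pole model."""
    radii, angles = np.abs(lp_poles), np.angle(lp_poles)
    z_hat = np.minimum(radii, r_max) * np.exp(1j * angles)
    # Cepstrum of an all-pole model: c(n) = sum_i z_i^n / n, for n >= 1.
    return np.real(np.array([np.sum(z_hat ** n) / n
                             for n in range(1, n_coeffs + 1)]))
```

The mean of these PFCC vectors over all frames then serves as the channel estimate that is saved to the voice print database and subtracted from the LPCC in PF-CMS.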
Although not preferred, the preprocessed speech during enrollment may be inverse-filtered by inverting the filter as shown in Figure 1C (as described below with respect to Figure 3B). While inverse filtering will theoretically remove the enrollment channel distortion, it is preferred to inverse filter the test speech (on the testing channel) and then feed the test speech through the saved enrollment filter, as described below with reference to Figure 3A.
B. Curve-Fitting Method
The preceding Cepstral Mean Subtraction method used for channel estimation 40
assumes that the average speech cepstrum E[C_S(n)] tends to zero. This assumption is not valid, however, when the number of frames of speech available is limited. In such a situation, the average cepstrum E[C_S(n)] will contain a significant amount of speech information. Consequently, subtracting the cepstral mean removes a good part of the speech from the signal.
This removal of speech causes a significant loss in performance, especially when the channel is
mild.
Pole filtering attempts to account for this, but Pole filtering still leaves a lot of speech
information in the channel estimation. Furthermore, selectively modifying some poles during
Pole filtering may produce some unwanted effects in the overall channel characteristics. The
Curve-Fitting method overcomes the limitations of Cepstral Mean Subtraction and the Pole
filtering. The Curve-Fitting method extracts channel related information from the cepstral mean, but not any speech information. Using this channel information, the Curve-Fitting
method improves upon the channel estimation 40 performed with the Cepstral Mean
Subtraction. Likewise, the Curve-Fitting method can use this channel information to create an
improved inverse filter to remove the enrollment channel distortion.
Figure 1E illustrates a comparison between an actual channel and a channel derived from the cepstral mean subtraction method. As seen in Figure 1E, the channel obtained from
the cepstral mean contains a substantial amount of unwanted speech information, especially in
the pass band of the channel. The cepstral mean channel, however, models the roll-off at the
ends of the channel spectrum effectively. The Curve-Fitting method takes advantage of this
effective modeling of the roll-off to get a better estimate of the channel.
The Curve-Fitting method is an approximation of the ideal general characteristics of
the channel. The following is the preferred method of this approximation and is a piecewise linear fit to the ideal general characteristics. The Curve-Fitting method extracts the slope
information from the cepstral mean. In order to do this, the pass band of the channel is assumed to be flat at 0 dB. Then, as illustrated in Figure 1D, the lowest frequency spectral peak in the log magnitude LP spectrum of the cepstral mean is detected 62. All points below
this detected lowest frequency spectral peak are scanned 64 to find the point at which the
magnitude spectrum is closest to zero dB. Once this point is found, a straight line fit 66 is done between the lowest frequency spectral peak and this point. The slope of this line is set to
be the roll-off of the Curve-Fitting method channel estimate at the low frequency end.
A similar procedure is carried out at the high frequency end. The difference at the high
frequency end is that the high frequency spectral peak is detected 68 and then a straight line fit
70 is performed considering all points between the spectral peak and the last point of the
spectrum. A passband is then modeled 72 as a straight line at zero dB between the
intersection of the straight lines obtained above with the zero dB line. The resulting Curve-
Fitting method channel estimate is illustrated in Figure 1F. It is noted that other approximations to the ideal general characteristics of the channel can be used.
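The flow of Figure 1D might be sketched as follows, assuming the log magnitude LP spectrum of the cepstral mean has already been sampled onto a frequency grid of floats; for simplicity, the two roll-off lines are obtained here by least-squares fits over the scanned ranges, which is one possible reading of the straight-line fits described above, and all names are illustrative.

```python
import numpy as np

def curve_fit_channel(spec_db, freqs):
    """Piecewise-linear Curve-Fitting sketch: flat 0 dB passband with
    straight-line roll-offs at the low and high frequency ends.
    spec_db: log magnitude LP spectrum (dB) of the cepstral mean."""
    # Local maxima; the first and last are the band-edge spectral peaks.
    peaks = [i for i in range(1, len(spec_db) - 1)
             if spec_db[i - 1] < spec_db[i] >= spec_db[i + 1]]
    lo, hi = peaks[0], peaks[-1]

    # Points nearest 0 dB below the lowest peak / above the highest peak.
    p_lo = int(np.argmin(np.abs(spec_db[:lo])))
    p_hi = hi + int(np.argmin(np.abs(spec_db[hi:])))

    # Straight-line fits give the low- and high-end roll-off slopes.
    low_line = np.polyfit(freqs[p_lo:lo + 1], spec_db[p_lo:lo + 1], 1)
    high_line = np.polyfit(freqs[hi:], spec_db[hi:], 1)

    est = np.zeros_like(spec_db)           # passband modeled as flat 0 dB
    est[:p_lo] = np.polyval(low_line, freqs[:p_lo])    # low-end roll-off
    est[p_hi:] = np.polyval(high_line, freqs[p_hi:])   # high-end roll-off
    return np.minimum(est, 0.0)            # estimate never exceeds 0 dB
```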
The Curve-Fitting method enrollment channel estimate can be stored in the voice print database 115, as described above, for recall and use as a filter in the testing component, as
described below with reference to Figure 3A. Likewise, the Curve-Fitting method enrollment
channel estimate may be converted to an inverse filter to inverse filter the channel corrupted
speech, as described below with reference to Figure 3B. This removes the enrollment channel
distortion. Alternatively, the Curve-Fitting enrollment channel estimate can be converted to the cepstrum domain and subtracted from the cepstral features. An inverse channel obtained
by the Curve-Fitting method is shown in Figure 1G. As shown in Figure 1G, this method and module create a close approximation of the actual inverse channel.
The Curve-Fitting channel estimation method and module can be further improved by
fusing it with the channel estimate obtained using Pole filtering, as described above with
reference to Figure 1B. Fused together, the Pole filtering channel estimate and the Curve-Fitting channel estimate balance each other and produce a more accurate channel estimation. The fusing of the Curve-Fitting method with the Pole filtering can be represented as follows:

C_new(n) = λ·C_PF(n) + (1-λ)·C_CF(n)

where C_PF(n) is the channel estimate obtained using Pole filtering and C_CF(n) is the channel
estimate obtained with the Curve-Fitting method. It is also noted that the Curve-Fitting
method channel estimate can be improved by combining or fusing it with channel estimation
techniques other than Pole filtering.
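In cepstral terms this fusion is a single weighted sum; the sketch below assumes both estimates are cepstral vectors of equal length, with λ an empirically chosen mixing weight rather than a value fixed by the specification.

```python
import numpy as np

def fuse_channel_estimates(c_pf, c_cf, lam=0.5):
    """C_new(n) = lam * C_PF(n) + (1 - lam) * C_CF(n), fusing the
    Pole filtering and Curve-Fitting cepstral channel estimates."""
    return lam * np.asarray(c_pf) + (1.0 - lam) * np.asarray(c_cf)
```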
C. Clean Speech Method
The Clean Speech method and module is an alternative preferred embodiment that also
overcomes the above-described limitation of the Cepstral Mean Subtraction method used for
channel estimation 40. This limitation was that the Cepstral Mean Subtraction method
assumed that the average speech cepstrum E[C_S(n)] tends to zero. The Clean Speech
module extracts channel related information by separately estimating the speech information in
the cepstral mean. Using this channel information, the Clean Speech method improves the
channel estimation 40. Likewise, the Clean Speech method can use this channel information to
create an improved inverse filter to remove the enrollment channel distortion.
The Clean Speech method does not assume that the average speech cepstrum
E[C_S(n)] tends to zero. Rather, the Clean Speech method separately estimates the average speech cepstrum E[C_S(n)]. Referring to Figure 1H, the Clean Speech method accomplishes
this by performing 82 a simultaneous "clean" recording of the same speech from the same
person who is recording speech samples over the actual enrollment equipment (typically a
telephone set). Alternatively, the "clean" recording can be made during the same session in
which the enrollment speech samples are collected from the enrollment equipment. In this
alternative embodiment, the recording of the clean speech is done before or after the corrupted
channel enrollment. The simultaneous method is preferred because of its inherent increased
security guarantees.
The simultaneous "clean" recording is done on a high quality, close talking, wide
bandwidth microphone. A high quality microphone will have minimal channel distortion. A
close talking microphone limits the amount of background noise that is captured. A wide bandwidth microphone has a flat frequency response between 20 Hz and 20 kHz. A preferred
high quality, close talking, wide bandwidth microphone is a Sennheiser® microphone from
Sennheiser, Inc., which would be connected to a preferred Pentium® based computer with a
16- or 32-bit SoundBlaster® audio hardware built and sold by Creative Labs, Inc. However,
any microphones and computers with the above characteristics can be employed with the
present invention. Under the presumption that a "clean" recording from a high quality
microphone will be free of any transmission channel characteristics, the Clean Speech method
assumes that the cepstral mean of the "clean" recording will be representative of the speech information in the recording.
The cepstral mean for the "clean" recording can be represented as the following:

E[C_Y,CLEAN(n)] = E[C_S,CLEAN(n)]

since it is assumed that there is no channel corruption on the "clean" recording, i.e., E[C_H,CLEAN(n)] = 0.
As shown in Figure 1H, the cepstral mean E[C_S,CLEAN(n)] of the "clean" recording is calculated 84 and then used to perform channel estimation 40 of a channel speech recording. The cepstral coefficients of channel corrupted speech can be represented, as shown above with Cepstral Mean Subtraction, as:

C_Y(n) = C_H(n) + C_S(n).

Taking the mean of the above equation:

E[C_Y(n)] = E[C_H(n)] + E[C_S(n)].
The estimation from the "clean" recording is now used to eliminate E[C_S(n)] during channel normalization. To achieve channel normalization of a corrupted speech cepstrum coefficient C_Y(n), its mean E[C_Y(n)] is subtracted 86 from the above cepstral coefficient equation, while the "clean" speech cepstral mean E[C_S,CLEAN(n)] is added 88. This channel normalization can be represented as:

C_Y(n) - E[C_Y(n)] + E[C_S,CLEAN(n)]
= C_H(n) + C_S(n) - E[C_H(n)] - E[C_S(n)] + E[C_S,CLEAN(n)]
= C_S(n)

assuming that the transmission channel is stationary during the entire recording, i.e., C_H(n) = E[C_H(n)], and that the same person is speaking the same text in both the "clean" and channel corrupted transmissions, i.e., E[C_S(n)] = E[C_S,CLEAN(n)].
The channel normalization process shown above is done in the cepstrum domain. Alternatively, the same processing can be done in the time waveform domain. For example, the subtraction of the speech cepstrum coefficient mean E[C_Y(n)] is equivalent to inverse filtering in the time domain. Likewise, the addition of the "clean" speech cepstral mean E[C_S,CLEAN(n)] is equivalent to filtering in the time domain. The transmission channel and "clean" speech estimates in the cepstral domain can be converted to filter coefficients and applied to the speech waveform in the time domain.
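A minimal cepstral-domain sketch of this normalization is given below, assuming NumPy arrays whose rows are per-frame cepstral vectors; the function names are illustrative.

```python
import numpy as np

def clean_speech_normalize(cepstra, clean_cepstra):
    """Clean Speech channel normalization sketch.
    cepstra: frames of channel corrupted C_Y(n); clean_cepstra: frames
    from the simultaneous "clean" microphone recording of the same text.
    Returns C_Y(n) - E[C_Y(n)] + E[C_S,CLEAN(n)], approximately C_S(n)."""
    return cepstra - np.mean(cepstra, axis=0) + np.mean(clean_cepstra, axis=0)

def clean_speech_channel_estimate(cepstra, clean_cepstra):
    """Corresponding channel estimate: E[C_H(n)] = E[C_Y(n)] - E[C_S,CLEAN(n)]."""
    return np.mean(cepstra, axis=0) - np.mean(clean_cepstra, axis=0)
```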
The channel estimate E[C_H(n)] calculated under the Clean Speech method can be improved with the Pole filtering or Curve-Fitting methods described above. Likewise, other channel estimation techniques may be used. The Clean Speech method provides excellent performance, although its practical use is limited because enrollment must be done locally.
After preprocessing 30, feature extraction 50 is performed on the processed speech. Feature extraction may occur after (as shown) or simultaneously with the step of channel estimation 40 (in parallel computing embodiments). The detail of the feature extraction is
herein incorporated by reference from VOICE PRINT SYSTEM AND METHOD, serial
number 08/976,280, filed November 21, 1997. The result of feature extraction 50 is that
vectors representing a template of the password are generated. This template is stored 60 in
the voice print database 115. Following storage of the template 60, the speech is segmented
into sub-words for further processing.
The preferred technique for subword generation 70 is automatic blind speech
segmentation, the details of which, and of alternative methods, are herein incorporated by
reference from VOICE PRINT SYSTEM AND METHOD, serial number 08/976,280, filed
November 21, 1997.
After subwords are obtained, each sub-word is then modeled 80, 90, preferably with
multiple classifier modules. Preferably a first neural tree network (NTN) 80 and a second
Gaussian mixture model (GMM) 90 are used. The NTN 80 provides discriminative-based
model and the GMM 90 provides one that is based on a statistical measure. In a preferred embodiment, a leave-one-out data resampling scheme 100 is used. Data resampling
100 is performed by creating multiple subsets of the training data, each of which is created by
leaving one data sample out at a time. The subsets of the training data are then used to train
multiple models of each of the classifiers, which are stored in the voice print database 115.
Thus, Figure 1A shows N models for the NTN classifier 80 and N models for the GMM classifier 90. For model #1 of the NTN classifier, an enrollment sample, such as the 1st sample, is left out of the training.
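A sketch of the leave-one-out resampling is shown below; the helper name and sample labels are illustrative.

```python
def leave_one_out_subsets(samples):
    """For N enrollment samples, build N training subsets, each omitting
    one sample; subset i is used to train model #i of each classifier."""
    return [[s for j, s in enumerate(samples) if j != i]
            for i in range(len(samples))]

# With the preferred four enrollment samples, this yields four subsets
# of three samples each.
subsets = leave_one_out_subsets(["sample1", "sample2", "sample3", "sample4"])
```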
In order to train an NTN model 80 for a given speaker, it is necessary to appropriately
label the subword data available in the antispeaker database 110. The antispeaker database
110 may be RAM, ROM, EPROM, EEPROM, hard disk, CD ROM, a file server, or other
storage device.
The subword data from the speaker being trained is labeled as enrollment speaker data. Because there is no linguistic labelling information in the antispeaker database 110, the entire database 110 is searched for the closest subword data from other speakers. This data is labeled the anti-speaker data. The mean vector and covariance matrix of the subword segments obtained from subword generation are used to find the "close" subwords. An anti-speaker module 120 searches the antispeaker database 110 to find the "close" subwords of antispeaker data, which are used in the NTN model 80. Preferably, 20 "close" subwords are identified. The anti-speaker data in the antispeaker database 110 is either manually created, or created
using a "bootstrapping" component, described below with reference to Figure 4. Because a "leave-one-out" system 100 is employed with multiple (N) samples, the
classifier models 80, 90 are trained by comparing antispeaker data with N-l samples of
enrollment speech. Both modules 80, 90 can determine a score for each spectral feature
vector of a subword segment. The individual scores of the NTN 80 and GMM 90 modules
can be combined, or "fused" by a classifier fusion module 130 to obtain a composite score for
the subword. Since these two modeling approaches tend to have errors that are uncorrelated,
it has been found that performance improvements can be obtained by fusing the model outputs
130. In the preferred embodiment, the results of the neural tree network 80 and the Gaussian
mixture model 90 are fused 130 using a linear opinion pool, as described below. However,
other ways of combining the data can be used with the present invention including a log
opinion pool or a "voting" mechanism, wherein hard decisions from both the NTN and GMM
are considered in the voting process.
Further details of the subword modeling with the NTN and GMM modules are herein
incorporated by reference from VOICE PRINT SYSTEM AND METHOD, serial number
08/976,280, filed November 21, 1997.
A scoring algorithm 145, 150 is used for each of the NTN and GMM models. The
output score (estimated a-posteriori probabilities) of the subword models is combined across
all the subwords of the password phrase, so as to yield a composite score for the test utterance.
The scoring algorithm 145, 150 for combining the scores of the subword models 80, 90 can be based on any of the following schemes:
(a) PHRASE-AVERAGE: Averaging the output scores for the vectors over the entire phrase,
(b) SUBWORD-AVERAGE: Averaging the scores of vectors within a subword, before averaging the (averaged) subword scores, and
(c) SUBWORD-WEIGHTING: Same as (b) subword-average scoring, but the (averaged) subword scores are weighted in the final averaging process.
Transitional (or durational) probabilities between the subwords can also be used while
computing the composite score for the password phrase. The preferred embodiment is (b) subword-average scoring. The result of scoring provides a GMM score and an NTN score,
which must then be combined.
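As an illustration of the preferred scheme (b), the composite score can be computed as a two-stage average; this sketch assumes each subword's frame-level classifier scores are collected in a list of arrays, and the names are illustrative.

```python
import numpy as np

def subword_average_score(frame_scores_per_subword):
    """Scheme (b): average the per-vector scores within each subword,
    then average the (averaged) subword scores to get the composite
    score for the password phrase."""
    subword_scores = [np.mean(s) for s in frame_scores_per_subword]
    return float(np.mean(subword_scores))
```

Scheme (c) would simply apply per-subword weights in the final average.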
In the preferred embodiment, a classifier fusion module 130 using the linear opinion
pool method combines the NTN score and the GMM score. Use of the linear opinion pool is
referred to as a data fusion function, because the data from each classifier is "fused," or
combined.
The data fusion function for n classifiers, S(a), is governed by the following linear opinion pool equation:

S(a) = Σ_{i=1..n} a_i · s_i(a)

In this equation S(a) is the probability of the combined system, the a_i are weights, s_i(a) is the probability output by the ith classifier, and n is the number of classifiers; each a_i is between zero and one and the sum of all a_i is equal to one. If two classifiers are used (n=2), s_1 is the score of the first classifier and s_2 is the score of the second classifier. In this instance the equation becomes:

S = a·s_1 + (1-a)·s_2
The variable a is set as a constant (although it may be dynamically adapted as discussed below), and functions to provide more influence to one classifier method as opposed to the other. For example, if the NTN method 80 was found to be more accurate, the first classifier score s_1 would be more important, and a would be made greater than 0.5, or its previous value. Preferably, a is only incremented or decremented by a small amount, ε.
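The two-classifier opinion pool and the incremental adjustment of a can be sketched as follows; the step size ε = 0.01 is an illustrative value only, not one fixed by the specification.

```python
def fuse_scores(ntn_score, gmm_score, a=0.5):
    """Linear opinion pool for two classifiers: S = a*s1 + (1-a)*s2."""
    return a * ntn_score + (1.0 - a) * gmm_score

def adapt_weight(a, ntn_was_better, eps=0.01):
    """Nudge a by a small amount toward the classifier found more
    indicative of a true verification, keeping a within [0, 1]."""
    a = a + eps if ntn_was_better else a - eps
    return min(1.0, max(0.0, a))
```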
Once the variables in the fusion equation are known, a threshold value 140 is output
and stored in the voice print database 115. The threshold value output 140 is compared to a
"final score" in the testing component to determine whether a test user's voice has so closely
matched the model that it can be said that the two voices are from the same person.
2. Testing Component- Detailed Description.
Figure 2 shows a general outline of the testing component 150 which has many
features similar to those described with respect to the enrollment component 10 of Figure 1A. The testing component 150 is used to determine whether test speech received from a user
sufficiently matches identified stored speech characteristics so as to validate that the user is in
fact the person whose speech was stored.
First, the test speech and index information 160 is supplied to the test component. The index information is used to recall subword/segmentation information and the threshold value
140 from the voice print database 115. The index information may be any nonvoice data which identifies the user, such as the user's name, credit card number, etc.
After obtaining the test speech and index information, the test speech is preprocessed
170. Preprocessing 170 may be performed as previously described in the enrollment component 10 (Figure 1A). Preferably, the same preprocessing 30, 170 is conducted on the
test speech as was performed during enrollment.
The fact that a speaker's model is conventionally built using enrollment speech that is
recorded under a specific, controlled environment implies that the model carries not only the
voice print but also the channel print. Therefore, following preprocessing, channel adaptation
180 is performed. Channel adaptation 180 adapts the system to the particular enrollment
channel and test channel. Channel adaptation 180 includes processing under both the
enrollment component 10 and the test component 150. Figures 3A and 3B show alternatives of channel adaptation 180.
As previously mentioned with respect to Figure 1A, the enrollment channel is estimated 40 during the enrollment component 10, also shown in Figures 3A and 3B at 300. As shown in Figure 3A, the enrollment channel estimate is also stored 310 in the voice print database 115 during the enrollment component. The enrollment channel may be estimated and stored using the preferred embodiments of the present invention previously discussed with respect to Figure 1A. These include the Curve-Fitting method discussed above with respect to Figure 1D, the Clean Speech method discussed above with respect to Figure 1H, the Pole filtering discussed above with respect to Figure 1B, and combinations of these methods.
As shown in Figure 3A, the test channel is estimated 320 during the testing component. The test channel may be estimated by generating a filter using the procedures previously discussed with respect to Figures 1B, 1D, or 1H. After generating the filter, the test speech is inverse filtered through the test channel 330. To achieve this, the test speech is passed through the inverse filter of the test channel using the procedure of Figure 1C. This
process removes the distortion of the test channel from the test speech. Now, the distortion of
the enrollment channel is added to the test speech by filtering the test speech through the
enrollment channel. To perform this, the saved enrollment filter is recalled 340 and the test
speech is filtered through the enrollment filter 350.
The procedure of Figure 3A stores the enrollment data with the enrollment channel
distortion during the enrollment component, and then removes the distortion of the test
channel and adds the distortion of the original enrollment channel during testing. As an
alternative, shown in Figure 3B, it may be desired to remove the enrollment channel distortion
during enrollment, and then remove the test channel distortion during testing.
Channel normalization is shown in Figure 3B. The enrollment channel is estimated 300 during the enrollment component. Next, the enrollment speech is filtered through an inverse of the enrollment channel filter 360. In other words, the enrollment speech is inverse filtered using the techniques previously discussed. The enrollment speech may be normalized using the Curve-Fitting or Clean Speech method described above with respect to Figure 1H. During the testing phase the test channel is estimated 370, and an inverse filter is constructed using the
techniques previously described. The test speech is then filtered through the inverse filter 380.
Using either channel adaptation technique, the system adapts to account for the channel distortion on the enrollment channel and on the test channel. It has been found that the technique shown in Figure 3A is preferred.
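The preferred adaptation of Figure 3A can be sketched as two filtering passes, assuming the stored channel estimates have been converted (as in Figure 1C) to all-pole LPC coefficient vectors of the form [1, a_1, ..., a_p]; SciPy's lfilter is used here purely for illustration, and the names are illustrative.

```python
from scipy.signal import lfilter

def adapt_test_speech(test_speech, test_channel_lpc, enroll_channel_lpc):
    """Channel adaptation sketch (Figure 3A). For an all-pole channel
    H(z) = 1/A(z), inverse filtering is FIR filtering by A(z)."""
    # Remove the test channel print: inverse filter the test speech.
    deconvolved = lfilter(test_channel_lpc, [1.0], test_speech)
    # Add the enrollment channel print recalled from the database.
    return lfilter([1.0], enroll_channel_lpc, deconvolved)
```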
In the scenario of cellular fraud control, the concept of channel adaptation 180 can be
used to validate the user since the channel print carries the characteristics of the particular cellular handset of which the speaker is an authorized user, and therefore creates an
association between the voice print and the phone channel print. The combination of voice
print and phone print ensures that a particular cellular handset can only be used by its
registered subscriber. However, in applications such as bill-to-third-party phone services where
the users are allowed to have access to the service from various locations, an authorized user's request for service may be denied due to the phone print mismatch.
Channel adaptation 180 provides a solution to this problem. It first removes the phone
and channel print of the test environment from the test speech by performing an inverse
filtering of the channel. Thereafter, channel adaptation can add the phone and channel print of
the training environment to the speech so that it looks as if the verification speech is recorded
through the training channel.
Channel adaptation 180 in this manner can still be advantageous in cellular fraud
control when the channel mismatch is primarily due to variations in the cellular network rather than the phone set. The channels can be estimated using techniques such as the pole-filtered cepstrum, as described in Figure 1B, the LP-derived cepstrum mean, the fast Fourier transform (FFT)-derived cepstrum mean, as well as the FFT-based periodogram of the speech signal. The pole-filtered cepstrum, as shown in Figure 1B, is the preferred method.
Referring to Figure 2, feature extraction 190 is performed after preprocessing. Feature extraction 190 may occur immediately after channel adaptation 180, or may occur simultaneously with channel adaptation 180 (in a multiple processor embodiment).
Following feature extraction 190, key word/key phrase spotting 200 is performed. Further
detail of feature extraction 190 and key word/key phrase spotting 200 is herein incorporated
by reference from VOICE PRINT SYSTEM AND METHOD, serial number 08/976,280,
filed November 21, 1997.
Referring back to Figure 2, after key word/key phrase spotting, automatic subword
generation 210 occurs. Because segmentation was already performed during the enrollment
component, subword generation 210 in the testing component is performed based on the subwords/segment model computed in the enrollment phase 10.
As previously described with respect to Figure 1A, during the enrollment component
10 GMM modeling 90 was performed. The GMM modeling 90 is used in the test component
subword generation 210 to "force align" the test phrase into segments corresponding to the
previously formed subwords. Using the subword GMMs as reference models, Viterbi or
Dynamic programming (DP) based algorithms are used to locate the optimal boundaries for
the subword segments. Additionally, the normalized subword duration (stored during enrollment) is used as a constraint for force alignment since it provides stability to the
algorithm. Speech segmentation using force alignment is disclosed in U.S. patent application
no. 08/827,562 entitled "Blind Clustering of Data With Application to Speech Processing
Systems", filed on April 1, 1997, and its corresponding U.S. provisional application no. 60/014,537 entitled "Blind Speech Segmentation", filed on April 2, 1996, both of which are
herein incorporated by reference in their entirety. After subword generation 210 is performed, scoring 240, 250 using the techniques
previously described with respect to Figure 1A (i.e., multiple classifiers such as GMM 230 and
NTN 220) is performed on the subwords. Scoring using the NTN and GMM classifiers 220,
230 is disclosed in U.S. patent application Ser. No. 60/064,069, entitled "Model Adaption
System And Method For Speaker Verification," filed on November 3, 1997 by Kevin Farrell
and William Mistretta, U.S. patent application no. 08/827,562 entitled "Blind Clustering of
Data With Application to Speech Processing Systems", filed on April 1, 1997, and its
corresponding U.S. provisional application no. 60/014,537 entitled "Blind Speech
Segmentation", filed on April 2, 1996, each of which is herein incorporated by reference in its
entirety.
An NTN scoring algorithm 240 and a GMM scoring algorithm 250 are used, as
previously described with respect to Figure 1A, to provide a GMM score and an NTN score to the classifier fusion module 260.
With continued reference to Figure 2, the classifier fusion module 260 outputs a "final score" 270. The "final score" 270 is then compared 280 to the threshold value 140. If the
"final score" 270 is equal to or greater than the threshold value 140 obtained during
enrollment, the user is verified. If the "final score" 270 is less than the threshold value 140 then the user is not verified or permitted to complete the transaction requiring verification.
The present invention also employs a number of additional adaptations, in addition to channel adaptation 180.
As previously described, the multiple classifier system uses a classifier fusion module
130, 260 incorporating a fusion function to advantageously combine the strength of the
individual classifiers and avoid their weakness. However, the fusion function that is set during
the enrollment may not be optimal for the testing in that every single classifier may have its
own preferred operating conditions. Therefore, as the operating environment changes, the fusion function changes accordingly in order to achieve the optimal results for fusion. Also,
for each user, one classifier may perform better than the other. An adaptable fusion function
provides more weight to the better classifier. Fusion adaptation uses predetermined
knowledge of the performance of the classifiers to update the fusion function so that the amount of emphasis being put on a particular classifier varies from time to time based on its
performance.
As shown in Figure 2, a fusion adaptation module 290 is connected to the classifier fusion module 260. The fusion adaptation module 290 changes the constant, a, in the linear opinion pool data fusion function described previously, which is:

S(a) = Σ_{i=1..n} a_i · s_i

In the present invention two classifiers are used (NTN 80, 220 and GMM 90, 230), and s_1 is the score of the first classifier and s_2 is the score of the second classifier. In this instance the equation becomes:

S = a·s_1 + (1-a)·s_2
The fusion adaptation module 290 dynamically changes a to weight either the NTN (s_1) or GMM (s_2) classifier more than the other, depending on which classifier turns out to be
more indicative of a true verification. Further detail of the fusion adaptation module 290 is
herein incorporated by reference from VOICE PRINT SYSTEM AND METHOD, serial
number 08/976,280, filed November 21, 1997.
Threshold adaptation adapts the threshold value in response to prior final scores.
Threshold adaptation module 295 is shown in Figure 2. The detail of the threshold adaptation
module 295 is herein incorporated by reference from VOICE PRINT SYSTEM AND
METHOD, serial number 08/976,280, filed November 21, 1997.
Model adaptation adapts the classifier models to subsequent successful verifications.
The detail of the model adaptation module is herein incorporated by reference from VOICE
PRINT SYSTEM AND METHOD, serial number 08/976,280, filed November 21, 1997.
Fusion adaptation 290, model adaptation, and threshold adaptation 600 all may affect
the number and probability of obtaining false-negative and false-positive results, so should be
used with caution. These adaptive techniques may be used in combination with channel
adaptation 180, or each other, either simultaneously or at different authorization occurrences.
Model adaptation is more dramatic than threshold adaptation or fusion adaptation, which both
provide incremental changes to the system.
The voiceprint database 115 may or may not be coresident with the antispeaker
database 110. Voice print data stored in the voice print database may include: enrollment
channel estimate, classifier models, list of antispeakers selected for training, fusion constant,
threshold value, normalized segment durations, and/or other intermediate scores or authorization results used for adaptation.
3. "Bootstrapping" Component.
Because the enrollment component 10 uses the "closest" antispeaker data to generate
the threshold value 140, the antispeaker database 110 must initially be filled with antispeaker data. The initial antispeaker data may be generated via artificial simulation
techniques, or can be obtained from a pre-existing database, or the database may be
"bootstrapped" with data by the bootstrapping component.
Figure 4 shows a bootstrapping component 700. The bootstrapping component 700 first obtains antispeaker speech 710, and then preprocesses the speech 720 as previously described with respect to Figure 1A. The antispeaker speech may be phrases from any number of speakers who will not be registered in the database as users. Next, the antispeaker speech is inverse-channel filtered 730 to remove the effects of the antispeaker channel, as described with respect to Figures 1 and 2. As shown in Figure 4, the processed and filtered antispeaker speech then undergoes feature extraction 770. The feature extraction may occur as previously described with respect to Figure 1A. Next, the antispeaker speech undergoes sub-word generation 750, using the techniques previously described with respect to Figure 1A. The preferable method of sub-word generation is automatic blind speech segmentation, discussed previously with respect to Figure 1A. The sub-words are then registered as antispeaker data
760 in the database.
Thus, the bootstrapping component initializes the database with antispeaker data which
then may be compared to enrollment data in the enrollment component.
The present invention provides for an accurate and reliable automatic speaker
verification, which uses adaptive techniques to improve performance. A key word/key phrase spotter 200 and automatic blind speech segmentation improve the usefulness of the system. Adaptation schemes adapt the ASV to changes in successes/failures and to changes in the user by using channel adaptation 180, model adaptation 540, fusion adaptation 290, and threshold adaptation 600.
The foregoing description of the present invention has been presented for purposes of
illustration and description, and is not intended to limit the invention to the specific
embodiments described. Consequently, variations and modifications commensurate with the
above teachings, and within the skill and knowledge of the relevant art, are part of the scope
of the present invention. It is intended that the appended claims be construed to include alternative embodiments to the extent permitted by law.