CHANNEL ESTIMATION SYSTEM AND METHOD FOR USE IN AUTOMATIC
SPEAKER VERIFICATION SYSTEMS
SPECIFICATION
TO ALL WHOM IT MAY CONCERN:
BE IT KNOWN, that we: Richard J. Mammone, a resident of Bridgewater, New Jersey and a citizen of the United States, Rajesh Balchandran, a resident of Somerset,
New Jersey and a citizen of India, Alvin Garcia, a resident of New Brunswick, New Jersey
and a citizen of the United States, and Vidhya Ramanujam, a resident of Somerset, New Jersey
and a citizen of India, have invented certain new and useful improvements in CHANNEL
ESTIMATION SYSTEM AND METHOD FOR USE IN AUTOMATIC SPEAKER
VERIFICATION SYSTEMS, of which the following is a specification.
CROSS REFERENCE TO RELATED APPLICATIONS
This application incorporates by reference the application entitled VOICE PRINT
SYSTEM AND METHOD, serial number 08/976,280, filed November 21, 1997, which
claims priority from Provisional Application 60/031,639, filed November 22, 1996, entitled
VOICE PRINT SYSTEM.
BACKGROUND OF THE INVENTION
The invention is directed to improved systems and methods of channel estimation
useful in automatic speaker verification (ASV) systems and methods.
1. Field of The Invention.
The invention relates to the fields of digital speech processing and speaker recognition.
2. Description of Related Art
In many situations it is desired to verify the identity of a person, such as a consumer.
Voice verification systems, sometimes known as automatic speaker verification (ASV)
systems, have recently been developed to attempt to cure the deficiencies of manual systems
and methods, such as credit card signature and/or photograph verification by merchants. The
automatic speaker verification (ASV) systems attempt to match the voice of the person whose
identity is undergoing verification with a known voice.
One type of voice recognition system is a text-dependent automatic speaker
verification system. The text-dependent ASV system requires that the user speak a specific
password or phrase (the "password"). This password is determined by the system or by the user during enrollment. However, in most text-dependent ASV systems, the password is constrained to be within a fixed vocabulary, such as a limited number of numerical digits. The
limited number of password phrases gives an imposter a higher probability of discovering a
person's password, reducing the reliability of the system.
Other text-independent ASV systems of the prior art utilize a user-selectable
password. In such systems, the user enjoys the freedom to make up his/her own password
with no constraints on vocabulary words or language. The disadvantage of these types of
systems is that they increase the processing requirement of the system because it is much more
technically challenging to model and verify a voice pattern of an unknown transcript (i.e. a
highly variable context).
Modeling of speech has been done at the phrase, word, and subword level. In recent
years, several subword-based speaker verification systems have been proposed using either
Hidden Markov Models ("HMM") or Artificial Neural Network ("ANN") references.
Modeling at the subword level expands the versatility of the system. Moreover, it is also
conjectured that the variations in speaking styles among different speakers can be better
captured by modeling at the subword level.
One of the major problems in ASV systems is that channel distortion occurs in the transmission channels. Performance of speech and speaker recognition systems degrades
when there is a mismatch between training and testing conditions. A significant part of the
mismatch is caused by differences in transmission channels. Channel estimation and
normalization (hereinafter, both referred to as "channel estimation" unless separately
identified) are used to combat the problems of channel distortion and differences in
transmission channels used in the ASV systems.
The conventional method to combat this problem is to use cepstral analysis and
homomorphic deconvolution. Cepstral Mean Subtraction (CMS) is an implementation of
homomorphic deconvolution using cepstral coefficients. In removing the effects of channel
distortion, however, CMS also undesirably extracts substantial amounts of the desired speech
information. This causes a significant loss of performance in the present ASV systems. One method that attempts to overcome this is pole filtering approximation. While pole filtering
improves performance by decoupling some of the speech information from the channel information, improved techniques are needed to substantially limit the extraction of the desired
speech information.
What is needed is a user-selectable ASV system in which accuracy is improved over
prior ASV systems.
What is needed is improved channel adaptation to adapt a system in response to signals
received over different channels.
What is needed is an improved and more accurate method and system of channel estimation.
SUMMARY OF THE INVENTION
The voice print system of the present invention builds and improves upon existing ASV
systems. The voice print system of the present invention is a subword-based, text-dependent
automatic speaker verification system that embodies the capability of user-selectable
passwords with no constraints on the choice of vocabulary words or the language.
One component of the preferred ASV system is a channel estimation and normalization component. Channel estimation and normalization removes the nonuniform effects of different
data transmission environments which lead to varying channel characteristics, such as
distortion. Channel normalization is able to remove the characteristics of the test channel
and/or enrollment channel to increase accuracy. The preferred methods and systems of the present invention, termed Curve-Fitting and Clean Speech, used separately, together, or in combination with Pole filtering, significantly improve the existing methods of channel
estimation and normalization. Unlike Cepstral Mean Subtraction, both the Curve-Fitting and
Clean Speech methods and systems extract only the channel related information from the cepstral mean and not any speech information. In this manner, channel distortion is more
accurately eliminated.
The improved voice print system using the inventive channel estimation methods can
be employed for user validation for telephone services such as cellular phone services and bill-
to-third-party phone services. It can also be used for account validation for information
system access.
All ASV systems include at least two components, an enrollment component and a
testing component. The enrollment component is used to store information concerning a
user's voice. This information is then compared to the voice undergoing verification (testing)
by the test component. The system of the present invention includes inventive enrollment and
testing components, as well as a third, "bootstrap" component. The bootstrap component is
used to generate data which assists the enrollment component to model the user's voice. Each of these components comprises the channel estimation and normalization techniques of the
present invention.
1. Enrollment Summary.
An enrollment component is used to characterize a known user's voice and store the characteristics in a database, so that this information is available for future comparisons. The
system of the present invention utilizes an improved enrollment process. During enrollment,
the user speaks the password, which is sampled by the system. Analog-to-digital conversion (if necessary) is conducted to obtain digital speech samples. Preprocessing is performed to
remove unwanted silence and noise from the voice sample, and to indicate portions of the
voice sample which correspond to the user's voice.
Next, the transmission channel carrying the user's enrollment voice signal is examined.
The characteristics of the enrollment channel are estimated and stored in a database. The
database may be indexed by identification information, such as by the user's name, credit card
number, account identifier, etc.
The Curve-Fitting and Clean Speech methods and systems of the present invention
improve on the estimation of the enrollment channel. Specifically, both of these methods and
systems account for deficiencies inherent in the Cepstral Mean Subtraction method of channel
estimation. The Curve-Fitting method performs an approximation to the ideal general
characteristics of the transmission channel. The Clean Speech method separately estimates the
speech information contained in the cepstral mean using speech (i.e., "clean speech") which is
not perturbed by the channel. These methods and systems can be used alone, together, or in combination with Pole filtering to reduce the amount of speech information contained in the
channel estimate. Therefore, the enrollment channel can be more accurately recalled from the
stored characteristics.
Feature extraction is then performed to extract features of the user's voice, such as pitch, spectral frequencies, intonations, etc., and/or desired segments of the voice sample. A reference template may be generated from the extracted features. Next, segmentation of the
voice segment occurs. Segmentation divides the voice sample into a number of subwords.
The present invention uses subword modeling and may use any of the known techniques, but preferably uses a discriminant training based hierarchical classifier called a Neural Tree
Network (NTN). The NTN is a hierarchical classifier that combines the properties of decision
trees and feed-forward Neural Networks.
The system also utilizes the principles of multiple classifier fusion and data resampling.
The additional classifier used herein is the Gaussian Mixture Model (GMM) classifier. If only
a small amount of data is available, data resampling is used to artificially expand the size of the
sample pool and improve the generalizations of the classifiers.
A fusion function, which is set and then stored in the database, is used to weight the individual classifier scores, and to set a threshold value. The threshold value is stored in the
database for use in the verification process. Thus, enrollment produces a voice print database
containing an index (such as the user's name or credit card number), along with enrollment
channel data, classifier models, feature vectors, segmentation information, multiple trained
classifier data, fusion constant, and a recognition threshold.
2. Test Component Summary.
The test component is the component which performs the verification. During testing or verification, the system first accepts "test speech" and index information from a user
claiming to be the person identified by the index information. Voice data indexed in the
database is retrieved and used to process the test speech sample.
During verification, the user speaks the password into the system. This "test speech"
password undergoes preprocessing, as previously described, with respect to the enrollment
component. The next step is to perform channel normalization or channel adaptation.
In channel normalization, the transmission channel carrying the user's test voice signal
is examined. Channel normalization is performed if the enrollment channel was also normalized. The characteristics of the test channel are normalized to remove the effects of the
test channel from the test voice signal. The channel normalization may be performed with the
Curve Fitting or Clean Speech methods of the present invention, either alone or in
combination.
Alternatively, channel adaptation is performed by removing from the test sample the
characteristics of the channel from which the test sample was received. Next, the
characteristics of the enrollment channel which were stored by the enrollment component are recalled. The test sample is filtered through the recalled enrollment channel. This type of
channel adaptation removes the characteristics from the test channel and supplies the
characteristics from the enrollment channel to the test speech so that the test speech matches
the transmission channel of the originally enrolled speech.
The present invention also improves on the channel adaptation in the testing component.
In removing the characteristics of the test sample, the Curve Fitting and Clean Speech
methods and systems, whether used alone, together, or in combination with Pole filtering,
reduce the amount of speech information removed with the test channel.
After channel adaptation, feature extraction is performed on the test sample. This occurs
as previously described with respect to the enrollment component. After feature extraction, it
is desired to locate, or "spot" the phrases in the test speech and simultaneously avoid areas of background noise.
The performance of ASV systems can be significantly degraded by background noise and sounds. To combat the effects of background noise, the invention uses a key word/key phrase spotter to identify the password phrase. After key word/key phrase spotting,
automatic speech segmentation occurs. Preferably "force" alignment segmentation, and not
"blind" segmentation, is used. The "force" segmentation results in the identification of
subword borders. The multiple classifiers of the enrollment component are used to "score" the subword
data, and the scores are then fused, or combined, into a "final score". If the final score exceeds
verification, as well as other related details, may be stored in the database as well.
3. "Bootstrapping" Component Summary.
Bootstrapping is used to generate a pool of speech data representative of the speech of
nonspeakers, or "antispeakers." This data is used during enrollment to train the discriminant
training-based classifiers. Bootstrapping involves obtaining voice samples from antispeakers,
preprocessing the voice samples (as in the enrollment phase), and inverse channel filtering the
preprocessed voice samples. Inverse channel filtering removes the characteristics of the
channel on which the antispeaker voice sample is obtained. The Curve Fitting and Clean Speech methods and systems, alone, together, or in combination with Pole Filtering, also
improve on the inverse channel filtering since they produce a more accurate inverse channel filter.
After inverse channel filtering, feature generation and automatic blind voice segmentation occur, as in the enrollment component. The segments and feature vectors are
stored in an antispeaker database for use by the enrollment component.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1A is a diagram of an enrollment component of the present invention.
Figure 1B shows pseudo-code for creating a filter to perform the channel estimation shown in Figure 1A.
Figure 1C shows pseudo-code for inverting the filter of Figure 1B.
Figure 1D shows a flow diagram for performing the Curve-Fit channel estimation.
Figure 1E shows a chart of an actual channel and a channel obtained from a cepstral mean.
Figure 1F shows a chart of an actual channel and a channel obtained from Curve-Fitting on a cepstral mean.
Figure 1G shows a chart of an inverse channel and an inverse channel obtained from Curve-Fitting on a cepstral mean.
Figure 1H shows a flow diagram for performing Clean Speech channel normalization.
Figure 2 is a diagram of a testing component of the present invention.
Figures 3A and 3B are flow diagrams of a channel adaptation module, shown in Figure 2, of the present invention.
Figure 4 is a diagram of a bootstrapping component, used to generate antispeaker data in the system of Figure 1A.
DESCRIPTION OF THE PREFERRED EMBODIMENT(S)
The preferred system used with the present invention includes an enrollment
component, a testing component, and a bootstrap component. The enrollment component
uses antispeaker data to generate and store information concerning a user's voice. The information concerning the user's voice is compared to the voice undergoing verification
(testing) by the test component. The bootstrap component is used to provide initial antispeaker data for use by the enrollment component, such that the enrollment component
may properly perform its function of generating data concerning the user's voice. The
performance of the above components can be significantly improved through using the
preferred embodiments of the channel estimation modules including the Curve Fitting and
Clean Speech methods and systems, and combinations thereof.
1. Enrollment Component- Detailed Description.
The enrollment component is used to store information (using supervised learning)
about a known user's voice into a voice print database, so that this information is available for
future comparisons. In the preferred embodiment, the enrollment component also stores
information concerning the channel on which the user provides the speech, the "enrollment
channel" into the voice print database.
Figure 1A shows the enrollment component 10. As shown, the first step 20 is to obtain enrollment speech (the password) and to obtain 26 an index, such as the user's name or credit card number. The enrollment speech may be obtained via a receiver, telephone or other sources, and be received from any transmission media, digital or analog, including terrestrial links, land lines, satellite, microwave, etc. More than one sample of enrollment
speech should be supplied, each of which is used to generate multiple data sets. Preferably,
four enrollment samples are supplied and processed.
The enrollment speech is then analog-to-digital converted 25, if necessary. Analog-to-digital conversion can be performed with standard telephony boards such as those manufactured by Dialogic. A speech encoding method such as the ITU G.711 standard μ-law and A-law can be used to encode the speech samples. Preferably, a sampling rate of 8000 Hz is used.
Alternatively, the speech may be obtained in digital format, such as from an ISDN
transmission. In such a case, a telephony board is used to handle Telco signaling protocol.
In the preferred embodiment, the computer processing unit for the speaker verification
system is an Intel Pentium platform general purpose computer processing unit (CPU) of at
least 100 MHz having about 10 MB of associated RAM memory and a hard or fixed drive as
storage. Alternatively, an additional embodiment could be the Dialogic Antares card.
The digital enrollment speech is then pre-processed 30. Preprocessing 30 may include one or more of the following techniques:
Digital filtering using pre-emphasis. In this case, a digital filter H(z) = 1 - αz^(-1) is used, where α is set between 0.9 and 1.0.
Silence removal using energy and zero-crossing statistics. The success of this technique is primarily based on finding a short interval which is guaranteed to be background silence (generally found in the first few milliseconds of the utterance, before the speaker actually starts speaking). Thresholds are set using the silence region statistics, in order to discriminate between speech and silence frames.
Silence removal based on an energy histogram. In this method, a histogram of frame energies is generated. A threshold energy value is determined based on the assumption that the biggest peak in the histogram at the lower energy region shall correspond to the background silence frame energies. This threshold energy value is used to perform speech versus silence discrimination.
DC Bias removal to remove DC bias introduced by analog-to-digital hardware or other components. The mean value of the signal is computed over the entire voice sample and then is subtracted from the voice samples.
In the preferred embodiment, the following preprocessing is conducted: silence removal using the energy histogram technique (20 bins in the histogram), signal mean removal to remove DC bias, and signal pre-emphasis using a filter coefficient α = 0.95. The preprocessing is preferably conducted using Hamming windowed analysis frames, with 30 millisecond analysis frames and
10 millisecond shifts between adjacent frames.
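By way of illustration only, the preferred preprocessing chain can be sketched in a few lines of Python. This is a minimal sketch assuming NumPy; the function name, the use of the largest histogram bin as the background-silence peak, and the default arguments are illustrative choices, not part of the specification.

```python
import numpy as np

def preprocess(signal, fs=8000, alpha=0.95, n_bins=20):
    """Sketch of the preferred preprocessing: DC bias removal, pre-emphasis
    H(z) = 1 - alpha*z^-1, and energy-histogram silence removal over
    Hamming-windowed 30 ms frames with 10 ms shifts."""
    signal = signal - np.mean(signal)                  # DC bias removal
    signal = np.append(signal[0], signal[1:] - alpha * signal[:-1])  # pre-emphasis

    frame_len, shift = int(0.030 * fs), int(0.010 * fs)
    window = np.hamming(frame_len)
    frames = [window * signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, shift)]

    # Energy histogram: assume the biggest low-energy peak corresponds to
    # background silence, and keep only frames above that threshold.
    energies = np.array([np.sum(f ** 2) for f in frames])
    hist, edges = np.histogram(energies, bins=n_bins)
    threshold = edges[np.argmax(hist) + 1]
    return [f for f, e in zip(frames, energies) if e > threshold]
```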
Following preprocessing 30, channel estimation 40 is performed. This procedure
stores characteristics of the enrollment channel in the voice print database 115. The voice
print database 115 may be RAM, ROM, EPROM, EEPROM, hard disk, CD ROM, writeable
CD ROM, minidisk, a file server, or other storage device. In order to estimate the channel,
the distortion present on the channel is considered by the following embodiments. Each of the
following preferred embodiments can be implemented with software and the computer
processing unit defined above as an Intel Pentium platform general purpose computer
processing unit (CPU) of at least 100 MHz having about 10 MB of associated RAM memory and
a hard or fixed drive as storage. Alternatively, an additional embodiment could be the
Dialogic Antares card. The present invention, however, is not limited to these preferred
embodiments and could be implemented using other digital signal processors, computers, or
neural networks.
A. Cepstral Mean Subtraction and Pole Filtering
A speech signal with frequency spectrum S(ω) is distorted by a transmission channel with frequency response H(ω). The frequency spectrum of the distorted speech Y(ω) is given as:

Y(ω) = H(ω) · S(ω)

If the logarithm and inverse Fourier transform (F⁻¹) of the magnitude of both sides of the equation are taken, the following equation results:

F⁻¹{log(|Y(ω)|)} = F⁻¹{log(|H(ω)|)} + F⁻¹{log(|S(ω)|)}

then, in the cepstral domain:

C_Y(n) = C_H(n) + C_S(n)     (1)

Cepstrum is defined as the inverse Fourier transform of the logarithm of the short-time spectral magnitude. Time-invariant convolutional distortion H(ω) can be eliminated by Cepstral Mean Subtraction (CMS) or Cepstral Mean Normalization (CMN), which is averaging in the cepstral domain and subtracting the average component. For example, taking the mean of the above cepstral domain equation:
E[C_Y(n)] = E[C_H(n)] + E[C_S(n)],

where E[·] represents the expected value. Assuming the channel to be time invariant, the channel cepstrum is a constant additive component in the above equation. For sufficiently long utterances the average speech cepstrum E[C_S(n)] is assumed to tend to zero. Therefore, E[C_Y(n)] can be assumed to represent the channel cepstrum only, that is:

E[C_Y(n)] = C_H(n)     (2)

The above equation is subtracted from cepstrum domain equation (1), that is,

C_Y(n) - E[C_Y(n)] = C_H(n) - C_H(n) + C_S(n)

and,

C_S(n) = C_Y(n) - E[C_Y(n)],
which is equivalent to inverse filtering the speech with the inverse channel.
Thus, CMS may be conducted on the cepstral features obtained for the voice signal to remove the distortion of the channel.
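In code, CMS in the terms of equation (1) reduces to subtracting the time average of the cepstral vectors. The sketch below assumes NumPy and that each row of cepstra holds one frame's cepstral coefficient vector C_Y(n); the names are illustrative.

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """cepstra: (num_frames, num_coeffs) array of C_Y(n) vectors.
    Returns C_Y(n) - E[C_Y(n)], which approximates C_S(n) under the
    assumption that E[C_S(n)] tends to zero for long utterances."""
    return cepstra - np.mean(cepstra, axis=0)
```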
While CMS may be used alone to remove the effects of the channel distortion, the
cepstral mean may include information other than the estimate of the time-invariant
convolutional distortion, such as coarse spectral distribution of the speech itself.
Pole filtering attempts to decouple the speech information from the channel
information in the cepstral mean. Since cepstrum is the weighted combination of LP poles or
spectral components, the effect of individual components on the cepstral mean was examined.
It was found that broad band-width components exhibited smoother frequency characteristics
corresponding to the "roll-off" of channel distortion, whereas narrow band-width components in the inverse filter were influenced more by speech characteristics. Thus, the
narrow band-width LP poles were selectively deflated by broadening their bandwidth and
keeping their frequency the same.
Therefore, for every frame of speech, the pole filtered cepstral coefficients (PFCC) are computed along with LP-derived cepstral coefficients (LPCC). To achieve cepstral mean
subtraction, the mean of the PFCC is subtracted from the LPCC, instead of the regular LPCC mean. This procedure is called pole filtered cepstral mean subtraction (PF-CMS).
To perform PF-CMS, the procedure outlined in the flow chart of Figure 1B is followed. With reference to Figure 1B, the first block of pseudo-code 42 sets the pole bandwidth threshold. Next the LP poles z_i and their pole-filtered counterparts ẑ_i are obtained, and the LPCC and PFCC are evaluated 44. This allows the mean of the PFCC vectors to be computed 46, which may be saved 48 as a channel estimate in the voice print database 115. The PFCC mean may be used to create an LPC filter.
An inverse of this filter may be generated as shown in Figure 1C. First, the PFCC mean is converted from the cepstral to the LPC filter coefficient domain 52. Next, the LPC filter may be inverted 54, and speech passed through the inverted filter 56.
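A rough sketch of the PFCC computation follows. It assumes the LP poles of each frame are available as complex numbers, and expresses the pole bandwidth threshold of Figure 1B as a cap on pole radius, since a pole's bandwidth grows as its radius shrinks; the function name and the r_max value are illustrative.

```python
import numpy as np

def pole_filtered_cepstrum(lp_poles, n_coeffs, r_max=0.9):
    """Deflate narrow band-width LP poles (radius near 1) to radius r_max,
    broadening their bandwidth while keeping their frequency, then compute
    the cepstrum of the resulting all-pole model."""
    radii, angles = np.abs(lp_poles), np.angle(lp_poles)
    z_hat = np.minimum(radii, r_max) * np.exp(1j * angles)
    # Cepstrum of an all-pole model: c(n) = sum_i z_i^n / n, for n >= 1.
    return np.real(np.array([np.sum(z_hat ** n) / n
                             for n in range(1, n_coeffs + 1)]))
```

The mean of these PFCC vectors over all frames then serves as the channel estimate that is saved to the voice print database and subtracted from the LPCC in PF-CMS.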
Although not preferred, the preprocessed speech during enrollment may be inverse-filtered by inverting the filter as shown in Figure 1C (as described below with respect to Figure 3B). While inverse filtering will theoretically remove the enrollment channel distortion, it is preferred to inverse filter the test speech (on the testing channel) and then feed the test speech through the saved enrollment filter, as described below with reference to Figure 3A.
B. Curve-Fitting Method
The preceding Cepstral Mean Subtraction method used for channel estimation 40
assumes that the average speech cepstrum E[C_S(n)] tends to zero. This assumption is not valid, however, when the number of frames of speech available is limited. In such a situation, the average cepstrum E[C_S(n)] will contain a significant amount of speech information. Consequently, subtracting the cepstral mean removes a good part of the speech from the signal.
This removal of speech causes a significant loss in performance, especially when the channel is
mild.
Pole filtering attempts to account for this, but Pole filtering still leaves a lot of speech
information in the channel estimation. Furthermore, selectively modifying some poles during
Pole filtering may produce some unwanted effects in the overall channel characteristics. The
Curve-Fitting method overcomes the limitations of Cepstral Mean Subtraction and the Pole
filtering. The Curve-Fitting method extracts channel related information from the cepstral mean, but not any speech information. Using this channel information, the Curve-Fitting
method improves upon the channel estimation 40 performed with the Cepstral Mean
Subtraction. Likewise, the Curve-Fitting method can use this channel information to create an
improved inverse filter to remove the enrollment channel distortion.
Figure 1E illustrates a comparison between an actual channel and a channel derived from the cepstral mean subtraction method. As seen in Figure 1E, the channel obtained from
the cepstral mean contains a substantial amount of unwanted speech information, especially in
the pass band of the channel. The cepstral mean channel, however, models the roll-off at the
ends of the channel spectrum effectively. The Curve-Fitting method takes advantage of this
effective modeling of the roll-off to get a better estimate of the channel.
The Curve-Fitting method is an approximation of the ideal general characteristics of
the channel. The following is the preferred method of this approximation and is a piecewise linear fit to the ideal general characteristics. The Curve-Fitting method extracts the slope
information from the cepstral mean. In order to do this, the pass band of the channel is assumed to be flat at 0 dB. Then, as illustrated in Figure 1D, the lowest frequency spectral peak in the log magnitude LP spectrum of the cepstral mean is detected 62. All points below
this detected lowest frequency spectral peak are scanned 64 to find the point at which the
magnitude spectrum is closest to zero dB. Once this point is found, a straight line fit 66 is done between the lowest frequency spectral peak and this point. The slope of this line is set to
be the roll-off of the Curve-Fitting method channel estimate at the low frequency end.
A similar procedure is carried out at the high frequency end. The difference at the high
frequency end is that the high frequency spectral peak is detected 68 and then a straight line fit
70 is performed considering all points between the spectral peak and the last point of the
spectrum. A passband is then modeled 72 as a straight line at zero dB between the
intersection of the straight lines obtained above with the zero dB line. The resulting Curve-
Fitting method channel estimate is illustrated in Figure 1F. It is noted that other approximations to the ideal general characteristics of the channel can be used.
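The flow of Figure 1D might be sketched as follows, assuming the log magnitude LP spectrum of the cepstral mean has already been sampled onto a frequency grid of floats; for simplicity, the two roll-off lines are obtained here by least-squares fits over the scanned ranges, which is one possible reading of the straight-line fits described above, and all names are illustrative.

```python
import numpy as np

def curve_fit_channel(spec_db, freqs):
    """Piecewise-linear Curve-Fitting sketch: flat 0 dB passband with
    straight-line roll-offs at the low and high frequency ends.
    spec_db: log magnitude LP spectrum (dB) of the cepstral mean."""
    # Local maxima; the first and last are the band-edge spectral peaks.
    peaks = [i for i in range(1, len(spec_db) - 1)
             if spec_db[i - 1] < spec_db[i] >= spec_db[i + 1]]
    lo, hi = peaks[0], peaks[-1]

    # Points nearest 0 dB below the lowest peak / above the highest peak.
    p_lo = int(np.argmin(np.abs(spec_db[:lo])))
    p_hi = hi + int(np.argmin(np.abs(spec_db[hi:])))

    # Straight-line fits give the low- and high-end roll-off slopes.
    low_line = np.polyfit(freqs[p_lo:lo + 1], spec_db[p_lo:lo + 1], 1)
    high_line = np.polyfit(freqs[hi:], spec_db[hi:], 1)

    est = np.zeros_like(spec_db)           # passband modeled as flat 0 dB
    est[:p_lo] = np.polyval(low_line, freqs[:p_lo])    # low-end roll-off
    est[p_hi:] = np.polyval(high_line, freqs[p_hi:])   # high-end roll-off
    return np.minimum(est, 0.0)            # estimate never exceeds 0 dB
```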
The Curve-Fitting method enrollment channel estimate can be stored in the voice print database 115, as described above, for recall and use as a filter in the testing component, as
described below with reference to Figure 3A. Likewise, the Curve-Fitting method enrollment
channel estimate may be converted to an inverse filter to inverse filter the channel corrupted
speech, as described below with reference to Figure 3B. This removes the enrollment channel
distortion. Alternatively, the Curve-Fitting enrollment channel estimate can be converted to the cepstrum domain and subtracted from the cepstral features. An inverse channel obtained
by the Curve-Fitting method is shown in Figure 1G. As shown in Figure 1G, this method and module create a close approximation of the actual inverse channel.
The Curve-Fitting channel estimation method and module can be further improved by
fusing it with the channel estimate obtained using Pole filtering, as described above with
reference to Figure 1B. Fused together, the Pole filtering channel estimate and the Curve-Fitting channel estimate balance each other and produce a more accurate channel estimation. The fusing of the Curve-Fitting method with the Pole filtering can be represented as follows:

C_new(n) = λ·C_PF(n) + (1-λ)·C_CF(n)

where C_PF(n) is the channel estimate obtained using Pole filtering and C_CF(n) is the channel
estimate obtained with the Curve-Fitting method. It is also noted that the Curve-Fitting
method channel estimate can be improved by combining or fusing it with channel estimation
techniques other than Pole filtering.
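In cepstral terms this fusion is a single weighted sum; the sketch below assumes both estimates are cepstral vectors of equal length, with λ an empirically chosen mixing weight rather than a value fixed by the specification.

```python
import numpy as np

def fuse_channel_estimates(c_pf, c_cf, lam=0.5):
    """C_new(n) = lam * C_PF(n) + (1 - lam) * C_CF(n), fusing the
    Pole filtering and Curve-Fitting cepstral channel estimates."""
    return lam * np.asarray(c_pf) + (1.0 - lam) * np.asarray(c_cf)
```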
C. Clean Speech Method
The Clean Speech method and module is an alternative preferred embodiment that also
overcomes the above-described limitation of the Cepstral Mean Subtraction method used for
channel estimation 40. This limitation was that the Cepstral Mean Subtraction method
assumed that the average speech cepstrum E[C_S(n)] tends to zero. The Clean Speech
module extracts channel related information by separately estimating the speech information in
the cepstral mean. Using this channel information, the Clean Speech method improves the
channel estimation 40. Likewise, the Clean Speech method can use this channel information to
create an improved inverse filter to remove the enrollment channel distortion.
The Clean Speech method does not assume that the average speech cepstrum
E[C_S(n)] tends to zero. Rather, the Clean Speech method separately estimates the average speech cepstrum E[C_S(n)]. Referring to Figure 1H, the Clean Speech method accomplishes
this by performing 82 a simultaneous "clean" recording of the same speech from the same
person who is recording speech samples over the actual enrollment equipment (typically a
telephone set). Alternatively, the "clean" recording can be made during the same session in
which the enrollment speech samples are collected from the enrollment equipment. In this
alternative embodiment, the recording of the clean speech is done before or after the corrupted
channel enrollment. The simultaneous method is preferred because of its inherent increased
security guarantees.
The simultaneous "clean" recording is done on a high quality, close talking, wide
bandwidth microphone. A high quality microphone will have minimal channel distortion. A
close talking microphone limits the amount of background noise that is captured. A wide bandwidth microphone has a flat frequency response between 20 Hz and 20 kHz. A preferred
high quality, close talking, wide bandwidth microphone is a Sennheiser® microphone from
Sennheiser, Inc., which would be connected to a preferred Pentium® based computer with a
16- or 32-bit SoundBlaster® audio hardware built and sold by Creative Labs, Inc. However,
any microphones and computers with the above characteristics can be employed with the
present invention. Under the presumption that a "clean" recording from a high quality
microphone will be free of any transmission channel characteristics, the Clean Speech method
assumes that the cepstral mean of the "clean" recording will be representative of the speech information in the recording.
The cepstral mean for the "clean" recording can be represented as the following:

E[C_Y,CLEAN(n)] = E[C_S,CLEAN(n)]

since it is assumed that there is no channel corruption on the "clean" recording, i.e., E[C_H,CLEAN(n)] = 0.
As shown in Figure 1H, the cepstral mean E[C_S,CLEAN(n)] of the "clean" recording is calculated 84 and then used to perform channel estimation 40 of a channel speech recording. The cepstral coefficients of channel corrupted speech can be represented, as shown above with Cepstral Mean Subtraction, as:

C_Y(n) = C_H(n) + C_S(n).

Taking the mean of the above equation:

E[C_Y(n)] = E[C_H(n)] + E[C_S(n)].
The estimation from the "clean" recording is now used to eliminate E[C_S(n)] during channel normalization. To achieve channel normalization of a corrupted speech cepstrum coefficient C_Y(n), its mean E[C_Y(n)] is subtracted 86 from the above cepstral coefficient equation, while the "clean" speech cepstral mean E[C_S,CLEAN(n)] is added 88. This channel normalization can be represented as:

C_Y(n) - E[C_Y(n)] + E[C_S,CLEAN(n)]
= C_H(n) + C_S(n) - E[C_H(n)] - E[C_S(n)] + E[C_S,CLEAN(n)]
= C_S(n)

assuming that the transmission channel is stationary during the entire recording, i.e., C_H(n) = E[C_H(n)], and that the same person is speaking the same text in both the "clean" and channel corrupted transmissions, i.e., E[C_S(n)] = E[C_S,CLEAN(n)].
The channel normalization process shown above is done in the cepstrum domain. Alternatively, the same processing can be done in the time waveform domain. For example, the subtraction of the speech cepstrum coefficient mean E[C_Y(n)] is equivalent to inverse filtering in the time domain. Likewise, the addition of the "clean" speech cepstral mean E[C_S,CLEAN(n)] is equivalent to filtering in the time domain. The transmission channel and "clean" speech estimates in the cepstral domain can be converted to filter coefficients and applied to the speech waveform in the time domain.
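A minimal cepstral-domain sketch of this normalization is given below, assuming NumPy arrays whose rows are per-frame cepstral vectors; the function names are illustrative.

```python
import numpy as np

def clean_speech_normalize(cepstra, clean_cepstra):
    """Clean Speech channel normalization sketch.
    cepstra: frames of channel corrupted C_Y(n); clean_cepstra: frames
    from the simultaneous "clean" microphone recording of the same text.
    Returns C_Y(n) - E[C_Y(n)] + E[C_S,CLEAN(n)], approximately C_S(n)."""
    return cepstra - np.mean(cepstra, axis=0) + np.mean(clean_cepstra, axis=0)

def clean_speech_channel_estimate(cepstra, clean_cepstra):
    """Corresponding channel estimate: E[C_H(n)] = E[C_Y(n)] - E[C_S,CLEAN(n)]."""
    return np.mean(cepstra, axis=0) - np.mean(clean_cepstra, axis=0)
```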
The channel estimate E[C_H(n)] calculated under the Clean Speech method can be improved with the Pole filtering or Curve-Fitting methods described above. Likewise, other channel estimation techniques may be used. The Clean Speech method provides excellent performance, although its practical use is limited because enrollment must be done locally.
After preprocessing 30, feature extraction 50 is performed on the processed speech. Feature extraction may occur after (as shown) or simultaneously with the step of channel estimation 40 (in parallel computing embodiments). The detail of the feature extraction is
herein incorporated by reference from VOICE PRINT SYSTEM AND METHOD, serial
number 08/976,280, filed November 21, 1997. The result of feature extraction 50 is that
vectors representing a template of the password are generated. This template is stored 60 in
the voice print database 115. Following storage of the template 60, the speech is segmented
into sub-words for further processing.
The preferred technique for subword generation 70 is automatic blind speech
segmentation, the details of which, and of alternative methods, are herein incorporated by
reference from VOICE PRINT SYSTEM AND METHOD, serial number 08/976,280, filed
November 21, 1997.
After subwords are obtained, each sub-word is then modeled 80, 90, preferably with
multiple classifier modules. Preferably a first neural tree network (NTN) 80 and a second
Gaussian mixture model (GMM) 90 are used. The NTN 80 provides discriminative-based
model and the GMM 90 provides one that is based on a statistical measure. In a preferred embodiment, a leave-one-out data resampling scheme 100 is used. Data resampling
100 is performed by creating multiple subsets of the training data, each of which is created by
leaving one data sample out at a time. The subsets of the training data are then used to train
multiple models of each of the classifiers, which are stored in the voice print database 115.
Thus, Figure 1A shows N models for the NTN classifier 80 and N models for the GMM classifier 90. For model #1 of the NTN classifier, an enrollment sample, such as the 1st sample, is left out of the training.
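A sketch of the leave-one-out resampling is shown below; the helper name and sample labels are illustrative.

```python
def leave_one_out_subsets(samples):
    """For N enrollment samples, build N training subsets, each omitting
    one sample; subset i is used to train model #i of each classifier."""
    return [[s for j, s in enumerate(samples) if j != i]
            for i in range(len(samples))]

# With the preferred four enrollment samples, this yields four subsets
# of three samples each.
subsets = leave_one_out_subsets(["sample1", "sample2", "sample3", "sample4"])
```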
In order to train an NTN model 80 for a given speaker, it is necessary to appropriately
label the subword data available in the antispeaker database 110. The antispeaker database
110 may be RAM, ROM, EPROM, EEPROM, hard disk, CD ROM, a file server, or other
storage device.
The subword data from the speaker being trained is labeled as enrollment speaker data. Because there is no linguistic labelling information in the antispeaker database 110, the entire database 110 is searched for the closest subword data from other speakers. This data is labeled the anti-speaker data. The mean vector and covariance matrix of the subword segments obtained from subword generation are used to find the "close" subwords. An anti-speaker module 120 searches the antispeaker database 110 to find the "close" subwords of antispeaker data, which are used in the NTN model 80. Preferably, 20 "close" subwords are identified. The anti-speaker data in the antispeaker database 110 is either manually created, or created
using a "bootstrapping" component, described below with reference to Figure 4. Because a "leave-one-out" system 100 is employed with multiple (N) samples, the
classifier models 80, 90 are trained by comparing antispeaker data with N-l samples of
enrollment speech. Both modules 80, 90 can determine a score for each spectral feature
vector of a subword segment. The individual scores of the NTN 80 and GMM 90 modules
can be combined, or "fused" by a classifier fusion module 130 to obtain a composite score for
the subword. Since these two modeling approaches tend to have errors that are uncorrelated,
it has been found that performance improvements can be obtained by fusing the model outputs
130. In the preferred embodiment, the results of the neural tree network 80 and the Gaussian
mixture model 90 are fused 130 using a linear opinion pool, as described below. However,
other ways of combining the data can be used with the present invention including a log
opinion pool or a "voting" mechanism, wherein hard decisions from both the NTN and GMM
are considered in the voting process.
Further details of the subword modeling with the NTN and GMM modules are herein
incorporated by reference from VOICE PRINT SYSTEM AND METHOD, serial number
08/976,280, filed November 21, 1997.
A scoring algorithm 145, 150 is used for each of the NTN and GMM models. The
output score (estimated a-posteriori probabilities) of the subword models is combined across
all the subwords of the password phrase, so as to yield a composite score for the test utterance.
The scoring algorithm 145, 150 for combining the scores of the subword models 80, 90 can be based on any of the following schemes:
(a) PHRASE-AVERAGE: Averaging the output scores for the vectors over the entire phrase,
(b) SUBWORD-AVERAGE: Averaging the scores of vectors within a subword, before averaging the (averaged) subword scores, and
(c) SUBWORD-WEIGHTING: Same as (b) subword-average scoring, but the (averaged) subword scores are weighted in the final averaging process.
Transitional (or durational) probabilities between the subwords can also be used while
computing the composite score for the password phrase. The preferred embodiment is (b) subword-average scoring. The result of scoring provides a GMM score and an NTN score,
which must then be combined.
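As an illustration of the preferred scheme (b), the composite score can be computed as a two-stage average; this sketch assumes each subword's frame-level classifier scores are collected in a list of arrays, and the names are illustrative.

```python
import numpy as np

def subword_average_score(frame_scores_per_subword):
    """Scheme (b): average the per-vector scores within each subword,
    then average the (averaged) subword scores to get the composite
    score for the password phrase."""
    subword_scores = [np.mean(s) for s in frame_scores_per_subword]
    return float(np.mean(subword_scores))
```

Scheme (c) would simply apply per-subword weights in the final average.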
In the preferred embodiment, a classifier fusion module 130 using the linear opinion
pool method combines the NTN score and the GMM score. Use of the linear opinion pool is
referred to as a data fusion function, because the data from each classifier is "fused," or
combined.
The data fusion function for n classifiers, S(a), is governed by the following linear opinion pool equation:

S(a) = Σ_{i=1..n} a_i · s_i(a)

In this equation S(a) is the probability of the combined system, the a_i are weights, s_i(a) is the probability output by the ith classifier, and n is the number of classifiers; each a_i is between zero and one and the sum of all a_i is equal to one. If two classifiers are used (n=2), s_1 is the score of the first classifier and s_2 is the score of the second classifier. In this instance the equation becomes:

S = a·s_1 + (1-a)·s_2
The variable a is set as a constant (although it may be dynamically adapted as discussed below), and functions to provide more influence to one classifier method as opposed to the other. For example, if the NTN method 80 was found to be more accurate, the first classifier score s_1 would be more important, and a would be made greater than 0.5, or its previous value. Preferably, a is only incremented or decremented by a small amount, ε.
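The two-classifier opinion pool and the incremental adjustment of a can be sketched as follows; the step size ε = 0.01 is an illustrative value only, not one fixed by the specification.

```python
def fuse_scores(ntn_score, gmm_score, a=0.5):
    """Linear opinion pool for two classifiers: S = a*s1 + (1-a)*s2."""
    return a * ntn_score + (1.0 - a) * gmm_score

def adapt_weight(a, ntn_was_better, eps=0.01):
    """Nudge a by a small amount toward the classifier found more
    indicative of a true verification, keeping a within [0, 1]."""
    a = a + eps if ntn_was_better else a - eps
    return min(1.0, max(0.0, a))
```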
Once the variables in the fusion equation are known, a threshold value 140 is output
and stored in the voice print database 115. The threshold value output 140 is compared to a
"final score" in the testing component to determine whether a test user's voice has so closely
matched the model that it can be said that the two voices are from the same person.
2. Testing Component- Detailed Description.
Figure 2 shows a general outline of the testing component 150 which has many
features similar to those described with respect to the enrollment component 10 of Figure 1A. The testing component 150 is used to determine whether test speech received from a user
sufficiently matches identified stored speech characteristics so as to validate that the user is in
fact the person whose speech was stored.
First, the test speech and index information 160 is supplied to the test component. The index information is used to recall subword/segmentation information and the threshold value
140 from the voice print database 115. The index information may be any nonvoice data which identifies the user, such as the user's name, credit card number, etc.
After obtaining the test speech and index information, the test speech is preprocessed
170. Preprocessing 170 may be performed as previously described in the enrollment component 10 (Figure 1A). Preferably, the same preprocessing 30, 170 is conducted on the
test speech as was performed during enrollment.
The fact that a speaker's model is conventionally built using enrollment speech that is
recorded under a specific, controlled environment implies that the model carries not only the
voice print but also the channel print. Therefore, following preprocessing, channel adaptation
180 is performed. Channel adaptation 180 adapts the system to the particular enrollment
channel and test channel. Channel adaptation 180 includes processing under both the
enrollment component 10 and the test component 150. Figures 3A and 3B show alternatives of channel adaptation 180.
As previously mentioned with respect to Figure 1A, the enrollment channel is estimated 40 during the enrollment component 10, also shown in Figures 3A and 3B at 300. As shown in Figure 3A, the enrollment channel estimate is also stored 310 in the voice print database 115 during the enrollment component. The enrollment channel may be estimated and stored using the preferred embodiments of the present invention previously discussed with respect to Figure 1A. These include the Curve-Fitting method discussed above with respect to Figure 1D, the Clean Speech method discussed above with respect to Figure 1H, the Pole filtering discussed above with respect to Figure 1B, and combinations of these methods.
As shown in Figure 3A, the test channel is estimated 320 during the testing component. The test channel may be estimated by generating a filter using the procedures previously discussed with respect to Figures 1B, 1D, or 1H. After generating the filter, the test speech is inverse filtered through the test channel 330. To achieve this, the test speech is passed through the inverse filter of the test channel using the procedure of Figure 1C. This
process removes the distortion of the test channel from the test speech. Now, the distortion of
the enrollment channel is added to the test speech by filtering the test speech through the
enrollment channel. To perform this, the saved enrollment filter is recalled 340 and the test
speech is filtered through the enrollment filter 350.
The procedure of Figure 3A stores the enrollment data with the enrollment channel
distortion during the enrollment component, and then removes the distortion of the test
channel and adds the distortion of the original enrollment channel during testing. As an
alternative, shown in Figure 3B, it may be desired to remove the enrollment channel distortion
during enrollment, and then remove the test channel distortion during testing.
Channel normalization is shown in Figure 3B. The enrollment channel is estimated 300 during the enrollment component. Next, the enrollment speech is filtered through an inverse of the enrollment channel filter 360. In other words, the enrollment speech is inverse filtered using the techniques previously discussed. The enrollment speech may be normalized using the Curve-Fitting or Clean Speech method described above with respect to Figure 1H. During the testing phase the test channel is estimated 370, and an inverse filter is constructed using the
techniques previously described. The test speech is then filtered through the inverse filter 380.
Using either channel adaptation technique, the system adapts to account for the channel distortion on the enrollment channel and on the test channel. It has been found that the technique shown in Figure 3A is preferred.
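The preferred adaptation of Figure 3A can be sketched as two filtering passes, assuming the stored channel estimates have been converted (as in Figure 1C) to all-pole LPC coefficient vectors of the form [1, a_1, ..., a_p]; SciPy's lfilter is used here purely for illustration, and the names are illustrative.

```python
from scipy.signal import lfilter

def adapt_test_speech(test_speech, test_channel_lpc, enroll_channel_lpc):
    """Channel adaptation sketch (Figure 3A). For an all-pole channel
    H(z) = 1/A(z), inverse filtering is FIR filtering by A(z)."""
    # Remove the test channel print: inverse filter the test speech.
    deconvolved = lfilter(test_channel_lpc, [1.0], test_speech)
    # Add the enrollment channel print recalled from the database.
    return lfilter([1.0], enroll_channel_lpc, deconvolved)
```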
In the scenario of cellular fraud control, the concept of channel adaptation 180 can be
used to validate the user since the channel print carries the characteristics of the particular cellular handset of which the speaker is an authorized user, and therefore creates an
association between the voice print and the phone channel print. The combination of voice
print and phone print ensures that a particular cellular handset can only be used by its
registered subscriber. However, in applications such as bill-to-third-party phone services where
the users are allowed to have access to the service from various locations, an authorized user's request for service may be denied due to the phone print mismatch.
Channel adaptation 180 provides a solution to this problem. It first removes the phone
and channel print of the test environment from the test speech by performing an inverse
filtering of the channel. Thereafter, channel adaptation can add the phone and channel print of
the training environment to the speech so that it looks as if the verification speech is recorded
through the training channel.
Channel adaptation 180 in this manner can still be advantageous in cellular fraud
control when the channel mismatch is primarily due to variations in the cellular network rather than the phone set. The channels can be estimated using techniques such as the pole-filtered cepstrum, as described in Figure 1B, the LP-derived cepstrum mean, the fast Fourier transform (FFT)-derived cepstrum mean, as well as the FFT-based periodogram of the speech signal. The pole-filtered cepstrum, as shown in Figure 1B, is the preferred method.
Referring to Figure 2, feature extraction 190 is performed after preprocessing. Feature extraction 190 may occur immediately after channel adaptation 180, or may occur simultaneously with channel adaptation 180 (in a multiple processor embodiment).
Following feature extraction 190, key word/key phrase spotting 200 is performed. Further
detail of feature extraction 190 and key word/key phrase spotting 200 is herein incorporated
by reference from VOICE PRINT SYSTEM AND METHOD, serial number 08/976,280,
filed November 21, 1997.
Referring back to Figure 2, after key word/key phrase spotting, automatic subword
generation 210 occurs. Because segmentation was already performed during the enrollment
component, subword generation 210 in the testing component is performed based on the subwords/segment model computed in the enrollment phase 10.
As previously described with respect to Figure 1A, during the enrollment component
10 GMM modeling 90 was performed. The GMM modeling 90 is used in the test component
subword generation 210 to "force align" the test phrase into segments corresponding to the
previously formed subwords. Using the subword GMMs as reference models, Viterbi or
Dynamic programming (DP) based algorithms are used to locate the optimal boundaries for
the subword segments. Additionally, the normalized subword duration (stored during enrollment) is used as a constraint for force alignment since it provides stability to the
algorithm. Speech segmentation using force alignment is disclosed in U.S. patent application
no. 08/827,562 entitled "Blind Clustering of Data With Application to Speech Processing
Systems", filed on April 1, 1997, and its corresponding U.S. provisional application no. 60/014,537 entitled "Blind Speech Segmentation", filed on April 2, 1996, both of which are
herein incorporated by reference in their entirety. After subword generation 210 is performed, scoring 240, 250 using the techniques
previously described with respect to Figure 1A (i.e., multiple classifiers such as GMM 230 and
NTN 220) is performed on the subwords. Scoring using the NTN and GMM classifiers 220,
230 is disclosed in U.S. patent application Ser. No. 60/064,069, entitled "Model Adaption
System And Method For Speaker Verification," filed on November 3, 1997 by Kevin Farrell
and William Mistretta, U.S. patent application no. 08/827,562 entitled "Blind Clustering of
Data With Application to Speech Processing Systems", filed on April 1, 1997, and its
corresponding U.S. provisional application no. 60/014,537 entitled "Blind Speech
Segmentation", filed on April 2, 1996, each of which is herein incorporated by reference in its
entirety.
An NTN scoring algorithm 240 and a GMM scoring algorithm 250 are used, as
previously described with respect to Figure 1A, to provide a GMM score and an NTN score to the classifier fusion module 260.
With continued reference to Figure 2, the classifier fusion module 260 outputs a "final score" 270. The "final score" 270 is then compared 280 to the threshold value 140. If the
"final score" 270 is equal to or greater than the threshold value 140 obtained during
enrollment, the user is verified. If the "final score" 270 is less than the threshold value 140 then the user is not verified or permitted to complete the transaction requiring verification.
The present invention also employs a number of additional adaptations, in addition to channel adaptation 180.
As previously described, the multiple classifier system uses a classifier fusion module
130, 260 incorporating a fusion function to advantageously combine the strength of the
individual classifiers and avoid their weakness. However, the fusion function that is set during
the enrollment may not be optimal for the testing in that every single classifier may have its
own preferred operating conditions. Therefore, as the operating environment changes, the fusion function changes accordingly in order to achieve the optimal results for fusion. Also,
for each user, one classifier may perform better than the other. An adaptable fusion function
provides more weight to the better classifier. Fusion adaptation uses predetermined
knowledge of the performance of the classifiers to update the fusion function so that the amount of emphasis being put on a particular classifier varies from time to time based on its
performance.
As shown in Figure 2, a fusion adaptation module 290 is connected to the classifier fusion module 260. The fusion adaptation module 290 changes the constant, a, in the linear opinion pool data fusion function described previously, which is:

S(a) = Σ_{i=1..n} a_i · s_i

In the present invention two classifiers are used (NTN 80, 220 and GMM 90, 230), and s_1 is the score of the first classifier and s_2 is the score of the second classifier. In this instance the equation becomes:

S = a·s_1 + (1-a)·s_2
The fusion adaptation module 290 dynamically changes a to weight either the NTN (s_1) or GMM (s_2) classifier more than the other, depending on which classifier turns out to be
more indicative of a true verification. Further detail of the fusion adaptation module 290 is
herein incorporated by reference from VOICE PRINT SYSTEM AND METHOD, serial
number 08/976,280, filed November 21, 1997.
Threshold adaptation adapts the threshold value in response to prior final scores.
Threshold adaptation module 295 is shown in Figure 2. The detail of the threshold adaptation
module 295 is herein incorporated by reference from VOICE PRINT SYSTEM AND
METHOD, serial number 08/976,280, filed November 21, 1997.
Model adaptation adapts the classifier models to subsequent successful verifications.
The detail of the model adaptation module is herein incorporated by reference from VOICE
PRINT SYSTEM AND METHOD, serial number 08/976,280, filed November 21, 1997.
Fusion adaptation 290, model adaptation, and threshold adaptation 600 all may affect
the number and probability of obtaining false-negative and false-positive results, so should be
used with caution. These adaptive techniques may be used in combination with channel
adaptation 180, or each other, either simultaneously or at different authorization occurrences.
Model adaptation is more dramatic than threshold adaptation or fusion adaptation, which both
provide incremental changes to the system.
The voiceprint database 115 may or may not be coresident with the antispeaker
database 110. Voice print data stored in the voice print database may include: enrollment
channel estimate, classifier models, list of antispeakers selected for training, fusion constant,
threshold value, normalized segment durations, and/or other intermediate scores or authorization results used for adaptation.
3. "Bootstrapping" Component.
Because the enrollment component 10 uses the "closest" antispeaker data to generate
the threshold value 140, the antispeaker database 110 must initially be filled with antispeaker data. The initial antispeaker data may be generated via artificial simulation
techniques, or can be obtained from a pre-existing database, or the database may be
"bootstrapped" with data by the bootstrapping component.
Figure 4 shows a bootstrapping component 700. The bootstrapping component 700 first obtains antispeaker speech 710, and then preprocesses the speech 720 as previously described with respect to Figure 1A. The antispeaker speech may be phrases from any number of speakers who will not be registered in the database as users. Next, the antispeaker speech is inverse-channel filtered 730 to remove the effects of the antispeaker channel, as described with respect to Figures 1 and 2. As shown in Figure 4, the processed and filtered antispeaker speech then undergoes feature extraction 770. The feature extraction may occur as previously described with respect to Figure 1A. Next, the antispeaker speech undergoes sub-word generation 750, using the techniques previously described with respect to Figure 1A. The preferable method of sub-word generation is automatic blind speech segmentation, discussed previously with respect to Figure 1A. The sub-words are then registered as antispeaker data
760 in the database.
Thus, the bootstrapping component initializes the database with antispeaker data which
then may be compared to enrollment data in the enrollment component.
The present invention provides for an accurate and reliable automatic speaker
verification, which uses adaptive techniques to improve performance. A key word/key phrase spotter 200 and automatic blind speech segmentation improve the usefulness of the system. Adaptation schemes adapt the ASV to changes in successes/failures and to changes in the user by using channel adaptation 180, model adaptation 540, fusion adaptation 290, and threshold adaptation 600.
The foregoing description of the present invention has been presented for purposes of
illustration and description, and is not intended to limit the invention to the specific
embodiments described. Consequently, variations and modifications commensurate with the
above teachings, and within the skill and knowledge of the relevant art, are part of the scope
of the present invention. It is intended that the appended claims be construed to include alternative embodiments to the extent permitted by law.