WO1993003480A1 - Speech pattern matching in non-white noise - Google Patents
Speech pattern matching in non-white noise
- Publication number
- WO1993003480A1 (PCT/US1992/006351)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- noise
- reference signals
- speech
- filter
- white noise
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
Definitions
- the present invention relates generally to
- speech segments can each be represented by a set of
- VQ Vector Quantization
- test and reference segments are usually arbitrarily
- Rabiner in the article, "A Linear Predictive Front-end Processor for Speech Recognition in Noisy Environments", International Conference on Acoustics, Speech and Signal Processing, ICASSP-87, pp. 1324-1327, Dallas TX, 1987, present a method for speech recognition suitable for colored noise.
- the power spectrum of the noise is used, in an iterative algorithm, to estimate the Linear Prediction Coefficients (LPC) of clean speech from its noisy version.
- LPC Linear Prediction Coefficients
- This algorithm requires extensive computations.
- This last method and the SMC method were applied to speech recognition in car noise by I. Lecomte, M. Lever, J. Boudy and A. Tassy. Their results are discussed in the article, "Car Noise Processing for Speech Input", International Conference on Acoustics, Speech and Signal Processing, ICASSP-89, pp. 512-515, Glasgow UK, 1989.
- the present invention may be used, for example, for speech recognition, speaker identification and verification, or vector quantization (VQ) for speech coding.
- the system includes the following three operations: 1) Noise modeling: Noise is collected from a noisy input test signal in the intervals containing no speech. The features of the noise are extracted and are used to construct a noise whitening filter for whitening the noise. 2) Pre-Processing: The noisy input test signal and a plurality of reference template signals, each containing a previously stored reference speech signal which can be accompanied by noise, are filtered through the noise whitening filter.
- a pattern matching algorithm which is operative in white noise is applied to the modified test and reference template signals. This operation involves scoring the similarity between the modified test signal and each of the modified reference templates, followed by deciding which reference template is most similar to the test signal.
- the present invention can be applied, among others, to the following problems of speech processing in colored noise: a) Speech recognition using Dynamic Time Warping (DTW) in the presence of colored noise, such as is found in the environment of a car. b) Vector quantization (VQ) of noisy speech.
- DTW Dynamic Time Warping
- VQ Vector quantization
- HMM Hidden Markov Models
- a system for matching an input signal, including non-white noise and a patterned signal corrupted by the non-white noise, to a plurality of reference signals including means for estimating noise features of the non-white noise and for producing from the features at least one noise whitening filter, filter means for filtering the input signal and the plurality of reference signals using the at least one noise whitening filter and producing a filtered input signal, having a white noise component, and a plurality of filtered reference signals and pattern matching means generally robust to white noise for matching the filtered input signal to one of the filtered reference signals.
- the at least one noise whitening filter is two noise whitening filters respectively for filtering the input signal and the reference signals which system also includes means for extracting features of the input signal and the reference signals and wherein the filter means operate in a feature domain.
- a feature domain of the input signal is different than a feature domain of
- the pattern matching means perform a pattern matching technique selected from the group of DTW, HMM, or DTW-VQ.
- the input signal is a speech signal.
- the feature domains are selected from the group of data samples, Linear Prediction Coefficients, cepstral coefficients, power spectrum samples, and filter bank energies and the means for estimating estimate a filter in accordance with the selected feature domain.
- a speech recognition system for recognizing words found in a speech signal corrupted by non-white noise including means for estimating noise features of the non-white noise and for producing from the features at least one noise whitening filter, filter means for filtering the speech signal and a plurality of reference signals of selected spoken words, the filter means using the at least one noise whitening filter and producing a filtered speech signal and a plurality of filtered reference signals and pattern matching means generally robust to white noise for matching the filtered speech signal to one of the filtered reference signals thereby recognizing the word in the speech signal.
- a speech recognition system for recognizing a word found in a speech signal corrupted by non-white noise
- a vector quantization system according to claim 10 producing vector quantized speech and word matching means receiving a plurality of reference sequences of symbols relating to the reference signals and a test sequence of symbols relating to the speech signal for matching the test sequence to the reference sequences thereby to recognize the word in the speech signal.
- the word matching means performs Dynamic Time Warping (DTW) on the vector quantized speech and Hidden Markov Modeling (HMM).
- HMM Hidden Markov Modeling
- a speaker recognition system using any of the systems described above wherein the reference signals include one word spoken by a plurality of different speakers.
- a speaker verification system using any of the systems described above wherein the reference signals include at least one word spoken by one speaker.
- the non-white noise is the noise from the environment of a movable vehicle or, alternatively, the noise from the environment of a moving airplane cockpit or a vibrating machine.
- Fig. 1 is a schematic block diagram illustration of a pattern matching system constructed and operated in accordance with a preferred embodiment of the present invention.
- Fig. 2 is a schematic block diagram illustration of the hardware implementing the pattern matching system of Fig. 1.
- FIG. 1 illustrates a schematic block diagram of a pattern matching system constructed and operative in accordance with the principles of the present invention.
- the pattern matching system of the present invention typically includes an input device 10, such as a microphone or similar device, for providing an analog signal, and a sampling device 12 for converting the analog signal to a digital signal.
- the samples of the digital signal are typically grouped into frames, typically of 128 or 256 samples each.
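The grouping of samples into frames can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `split_into_frames` and the `hop` parameter are assumptions introduced here. A hop smaller than the frame length produces overlapping frames.

```python
from typing import List, Sequence


def split_into_frames(
    samples: Sequence[float], frame_len: int = 256, hop: int = 256
) -> List[List[float]]:
    """Group samples into frames of frame_len samples, advancing by hop samples.

    Trailing samples that do not fill a whole frame are dropped.
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(list(samples[start:start + frame_len]))
        start += hop
    return frames
```

With the default arguments the frames do not overlap; `frame_len=128` gives the other frame size mentioned in the text.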
- the digital and analog signals typically include portions containing only background noise which is typically non-white, such as colored or quasi-stationary noise, and some portions containing a signal whose pattern is to be detected, known herein as a "patterned signal".
- the patterned signal is the speech signal.
- the patterned signal is typically corrupted by the colored noise.
- the present invention seeks to match the patterned signal to a plurality of previously stored reference signals wherein the patterned signal is received in the presence of the colored noise.
- the reference signals are stored as reference templates including feature sets of the reference signals extracted via feature extraction devices not shown.
- the reference templates are typically stored in a reference template storage device 14, such as any suitable memory device, during a process called training (not shown).
- These templates are representative of various patterned signals to which it is desired to match the input patterned signal.
- the reference templates might be feature sets of uttered words (for speech recognition) or of utterances of various speakers (for speaker recognition or verification).
- the reference templates might also be centroids of speech segments (for speech coding using VQ analysis).
- the digital signal is supplied to a patterned signal activated detection device 16 which generally detects the presence or absence of the patterned signal.
- the device 16 typically is a voice activated switch (VOX) such as described in U.S. Patent 4,959,865 to Stettiner et al. U.S. Patent 4,959,865 is incorporated herein by reference.
- the outputs of the device 16 are two signals, a noise signal and a "test utterance" including the patterned signal corrupted by colored noise.
- the VOX typically does not have to be precise. The remainder of the present invention will be described for speech signals, as an example only. It will be appreciated that the present invention is operative for other types of patterned signals also.
- the noise signal is provided to a noise filter estimator 18, described in more detail hereinbelow, for estimating parameters of a noise whitening filter.
- the noise whitening filter can convert the colored noise signal into a white noise signal.
- the noise whitening filter thus estimated is used to filter both the test utterance and the reference templates, as described in more detail hereinbelow.
- the test utterance is provided to a first stage feature extraction device 24 which transforms the test utterance into a sequence of test feature or parameter vectors which can be any of several types of desired features, such as power spectrum samples, autocorrelation coefficients, LPC, cepstral coefficients, filter bank energies or other features characteristic of the power spectrum of the test utterance.
- Suitable feature extraction devices 24 are described in the book, Speech Communication: Human and Machine by Douglas O'Shaughnessy, published by Addison-Wesley of Reading, Massachusetts in 1987, which book is incorporated herein by reference.
- each feature vector preferably contains the features of one speech frame of approximately 30 msec. An overlap of typically 0# may be applied between adjacent speech frames.
- the test vector is provided to a noise whitening filter 26, whose parameters are estimated by filter estimator 18, for filtering the test vector so as to provide a filtered test vector in the presence of approximately white, rather than colored, noise.
- the output of the noise whitening filter is a test vector in the presence of white noise whose speech component is different than that of the original test vector. Therefore, in accordance with a preferred embodiment of the present invention and in order to preserve the matching between the test vector and the reference templates, the entirety of reference templates from the reference template storage device 14 are filtered by a noise whitening filter 28 which is generally identical to noise whitening filter 26.
- the reference templates to which the test vector is to be matched are adjusted in the same manner as the test vector.
- the reference templates are typically defined in the same feature set as the test vector. If so, noise whitening filter 28 is identical to noise whitening filter 26. If not, filter 28 is defined differently than the filter 26 although both filters have an equivalent effect.
- the noise whitening filters 26 and 28 are calculated as follows. The parameters of each filter are such that the power spectrum of its impulse response is approximately the inverse of the power spectrum of the colored noise, as estimated from the most recent noise portions received from the patterned signal detection device 16.
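The design condition stated above can be written compactly. The symbols are notational assumptions introduced here: $H$ is the frequency response of the whitening filter and $P_n(\omega)$ is the estimated power spectrum of the colored noise. Because a linear filter scales the input power spectrum by $|H|^2$, the filtered noise has an approximately flat (white) spectrum.

```latex
|H(e^{j\omega})|^{2} \approx \frac{1}{P_n(\omega)}
\quad\Longrightarrow\quad
P_n(\omega)\,|H(e^{j\omega})|^{2} \approx 1
```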
- the noise whitening filters 26 and 28 are defined with respect to the same feature sets that respectively describe the test utterance and the reference signals. Various ways to estimate and operate the filters 26 and 28 exist and depend on the type of feature set used.
- Filter estimator 18 estimates an Infinite Impulse Response (IIR) or a Finite Impulse Response (FIR) filter. The latter is typically a moving average filter whose coefficients are estimated by LPC analysis of the noise signal. The IIR or FIR is then applied to the samples of the test utterance.
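For the sample domain, a whitening FIR filter derived from LPC analysis of the noise might look like the sketch below. It assumes the autocorrelation method with the Levinson-Durbin recursion, which is one common way to do LPC analysis; the source does not say which method is used. All function names are hypothetical.

```python
from typing import List, Sequence


def autocorrelation(x: Sequence[float], max_lag: int) -> List[float]:
    """Biased-sum autocorrelation r[0..max_lag] of the sequence x."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def levinson_durbin(r: Sequence[float], order: int) -> List[float]:
    """Return the prediction-error filter [1, a1, ..., ap] for autocorrelation r."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # degenerate input: stop early and keep lower-order filter
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a


def estimate_whitening_fir(noise: Sequence[float], order: int = 10) -> List[float]:
    """Estimate FIR whitening coefficients from noise-only samples."""
    return levinson_durbin(autocorrelation(noise, order), order)


def fir_filter(b: Sequence[float], x: Sequence[float]) -> List[float]:
    """Apply the FIR filter b to x (zero initial conditions)."""
    return [
        sum(b[j] * x[n - j] for j in range(len(b)) if n - j >= 0)
        for n in range(len(x))
    ]
```

In use, `estimate_whitening_fir` would be run on the noise-only portions and `fir_filter` on the samples of the test utterance. The filter order of 10 is an arbitrary default chosen for illustration.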
- Filter estimator 18 estimates the inverse of the average power spectrum of the noise signal. The filter operates by multiplying the test utterance power spectrum or filter bank energy samples by the corresponding filter power spectrum values.
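In the power spectrum (or filter bank energy) domain, the operation reduces to a per-bin multiplication. The following sketch assumes that the noise frames have already been converted to power spectra. The function names and the `floor` guard against division by zero are assumptions added for illustration.

```python
from typing import List, Sequence


def inverse_average_spectrum(
    noise_spectra: Sequence[Sequence[float]], floor: float = 1e-12
) -> List[float]:
    """Average the noise power spectra per bin and return the per-bin inverse."""
    n_frames = len(noise_spectra)
    n_bins = len(noise_spectra[0])
    average = [sum(s[k] for s in noise_spectra) / n_frames for k in range(n_bins)]
    return [1.0 / max(p, floor) for p in average]


def whiten_spectrum(spectrum: Sequence[float], inverse: Sequence[float]) -> List[float]:
    """Multiply each power spectrum (or filter bank) sample by the filter value."""
    return [p * w for p, w in zip(spectrum, inverse)]
```

A frame that contains only the average noise maps to a flat spectrum of ones, which is the whitening effect the text describes.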
- Filter estimator 18 estimates the inverse of the average power spectrum of the noise signal and converts it to the correlation domain.
- the autocorrelation coefficients of the test utterance are then convolved with the filter coefficients.
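Multiplying power spectra corresponds to convolving the matching autocorrelation sequences, which is why the filter is applied by convolution in this domain. A rough sketch follows, assuming the inverse power spectrum is sampled on N uniform DFT bins and that lags beyond the stored range can be treated as zero. Both are simplifications, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def spectrum_to_correlation(power: Sequence[float], max_lag: int) -> List[float]:
    """Inverse DFT of a real, even power spectrum sampled at N uniform bins."""
    n = len(power)
    return [
        sum(p * math.cos(2.0 * math.pi * k * m / n) for k, p in enumerate(power)) / n
        for m in range(max_lag + 1)
    ]


def convolve_autocorrelation(
    r_test: Sequence[float], r_filter: Sequence[float]
) -> List[float]:
    """Convolve two symmetric sequences stored one-sided (lags 0, 1, 2, ...).

    r_out[k] = sum over m of r_filter[|m|] * r_test[|k - m|].
    """

    def at(seq: Sequence[float], lag: int) -> float:
        lag = abs(lag)
        return seq[lag] if lag < len(seq) else 0.0

    m_max = len(r_filter) - 1
    return [
        sum(at(r_filter, m) * at(r_test, k - m) for m in range(-m_max, m_max + 1))
        for k in range(len(r_test))
    ]
```

A filter whose correlation sequence is a single unit value at lag zero (a flat spectrum) leaves the test autocorrelation unchanged.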
- the filter estimator 18 estimates the cepstral coefficients of the noise signal.
- the cepstral coefficients of the noise are then subtracted from the corresponding cepstral coefficients of the test utterance. No subtraction is performed on the zeroth coefficient of the test utterance.
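Because the cepstrum is derived from the logarithm of the spectrum, subtracting cepstral coefficients plays the same role as dividing by the noise spectrum. A minimal sketch of the subtraction, with a hypothetical function name and the zeroth coefficient left untouched as stated above:

```python
from typing import List, Sequence


def whiten_cepstrum(
    test_cepstrum: Sequence[float], noise_cepstrum: Sequence[float]
) -> List[float]:
    """Subtract noise cepstral coefficients from the test ones, keeping c[0]."""
    if len(test_cepstrum) != len(noise_cepstrum):
        raise ValueError("cepstral vectors must have the same length")
    whitened = [test_cepstrum[0]]  # zeroth coefficient is not modified
    whitened += [t - n for t, n in zip(test_cepstrum[1:], noise_cepstrum[1:])]
    return whitened
```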
- the filtered test and reference feature vectors are then passed separately through second stage feature extractor devices 30 and 32 respectively, which are operative to transform the filtered feature vectors to feature vectors which are appropriate for the chosen pattern matching method, as described hereinbelow.
- the first and second stage feature extraction devices are chosen together to produce the features necessary for the selected pattern matching method.
- the two stages are necessary to enable the filter estimation to be performed with whichever feature type a designer desires, whether for reasons of computation ease or speed.
- the second stage feature extraction devices 30 and 32 can be absent if the respective input feature vectors are already suitable for the selected pattern matching method.
- the first stage feature extraction device 24 can be absent (provided the second stage feature extraction device 30 exists). In that case, the test vector includes speech samples.
- the filtered test vectors of the test utterance and the filtered reference vectors of the entirety of reference templates are passed to a local scoring or matching unit 34, operative to calculate a score between the filtered test feature vectors and each of the corresponding filtered reference feature vectors.
- the unit 34 also receives data from a boundary detector 35 which indicates the beginning and ending points of speech in the test utterance. Any frames, or vectors, of the test utterance which are outside of the beginning and ending points of speech will not be utilized in the scoring of unit 34.
- the boundary detector 35 receives the test utterance from the patterned signal detection device 16 and determines the beginning and ending points usually via inspection of the energy contained in the patterned signal.
- Suitable boundary detectors 35 are described in the following article, which is incorporated herein by reference: L. Lamel, L. Rabiner, A. Rosenberg and J. Wilpon, "An Improved Endpoint Detector for Isolated Word Recognition," IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-29, pp. 777-785, 1981.
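As a rough illustration of energy-based boundary detection, the sketch below marks the first and last frames whose energy exceeds a fixed threshold. The fixed threshold and the function names are assumptions made for this example. The endpoint detector in the cited article is considerably more elaborate.

```python
from typing import List, Optional, Sequence, Tuple


def frame_energy(frame: Sequence[float]) -> float:
    """Sum of squared samples in one frame."""
    return sum(s * s for s in frame)


def find_endpoints(
    frames: Sequence[Sequence[float]], threshold: float
) -> Optional[Tuple[int, int]]:
    """Return (first, last) indices of frames above threshold, or None if none."""
    active: List[int] = [
        i for i, frame in enumerate(frames) if frame_energy(frame) > threshold
    ]
    if not active:
        return None
    return active[0], active[-1]
```

Frames outside the returned range would then be excluded from scoring.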
- the local scoring unit 34 typically uses a local distortion measure which is robust to white noise.
- Example local distortion measures are WLR and projection distortion measures as described in the previously mentioned article by D. Mansour and B.H. Juang, which article is incorporated herein by reference.
- the output of unit 34 is a set of local similarity scores where each score indicates the similarity between a frame of the test utterance and single frames of each one of the reference templates.
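The set of local scores can be pictured as a matrix with one row per test frame and one column per reference frame. The sketch below uses a cosine-based distortion as a simple stand-in. It is not the WLR or projection measure from the cited article, and the function names are hypothetical. The cosine form is chosen here only because it ignores the scaling of the feature vectors.

```python
import math
from typing import List, Sequence


def cosine_distortion(test_vec: Sequence[float], ref_vec: Sequence[float]) -> float:
    """Return 1 - cos(angle) between two feature vectors (0 means same direction)."""
    dot = sum(t * r for t, r in zip(test_vec, ref_vec))
    norm = math.sqrt(sum(t * t for t in test_vec)) * math.sqrt(
        sum(r * r for r in ref_vec)
    )
    if norm == 0.0:
        return 1.0  # assumption: treat a zero vector as maximally dissimilar
    return 1.0 - dot / norm


def local_scores(
    test_frames: Sequence[Sequence[float]], ref_frames: Sequence[Sequence[float]]
) -> List[List[float]]:
    """Matrix of distortions: rows are test frames, columns are reference frames."""
    return [[cosine_distortion(t, r) for r in ref_frames] for t in test_frames]
```

One such matrix would be computed per reference template.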
- the set of local scores is then provided to a decision procedure 36, described hereinbelow, for determining the index, code or symbol of the reference template which the test utterance best matches. These indices, codes or symbols are the overall output of the matching procedure.
- the decision is global in the sense that it is based on the local scores of many test feature vectors or frames. This is necessary so as to match a number of frames which make up an uttered word.
- the global score is typically carried out using a standard Dynamic Time Warping (DTW) procedure on the local scores, described in the following article, incorporated herein by reference: H. Sakoe and S.
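The global scoring step can be sketched as a basic dynamic time warping over a matrix of local scores. This is a minimal version under stated assumptions: it uses the simplest step pattern (horizontal, vertical and diagonal moves) with no slope constraints or path-length normalization, so it may differ from the procedure in the cited article. The names `dtw_score` and `best_template` are hypothetical.

```python
from typing import List, Sequence


def dtw_score(local: Sequence[Sequence[float]]) -> float:
    """Minimum accumulated distortion over monotonic paths through local."""
    n, m = len(local), len(local[0])
    inf = float("inf")
    acc = [[inf] * m for _ in range(n)]
    acc[0][0] = local[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best = min(
                acc[i - 1][j] if i > 0 else inf,
                acc[i][j - 1] if j > 0 else inf,
                acc[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            acc[i][j] = local[i][j] + best
    return acc[n - 1][m - 1]


def best_template(scores_per_template: Sequence[Sequence[Sequence[float]]]) -> int:
    """Index of the reference template with the lowest global DTW score."""
    totals: List[float] = [dtw_score(s) for s in scores_per_template]
    return min(range(len(totals)), key=totals.__getitem__)
```

Each entry of `scores_per_template` is the local score matrix between the test utterance and one reference template, and the returned index identifies the best match.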
- the process can also be performed via Hidden Markov Modeling (HMM) or DTW-VQ in two stages.
- the first stage is the VQ method described above operating on the test and reference vectors and providing symbols representing the test and reference vectors.
- the second stage is a scoring stage.
- the local scoring is between symbols.
- a global score, providing a score for a group of reference symbols forming a reference word, is then calculated on the local scores via DTW.
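The two DTW-VQ stages can be illustrated as follows: each vector is first replaced by the index of its nearest codebook centroid, and local scores are then taken between symbols. The squared Euclidean distance is an assumption made here for simplicity, since the source does not name the distortion used at this stage. The function names are hypothetical.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    vectors: Sequence[Sequence[float]], codebook: Sequence[Sequence[float]]
) -> List[int]:
    """Stage 1: map each vector to the index (symbol) of its nearest centroid."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(v, codebook[k]))
        for v in vectors
    ]


def symbol_local_scores(
    test_symbols: Sequence[int],
    ref_symbols: Sequence[int],
    codebook: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Stage 2: local scores between symbols, taken as centroid distances."""
    return [
        [squared_distance(codebook[t], codebook[r]) for r in ref_symbols]
        for t in test_symbols
    ]
```

The resulting matrix of symbol-level scores is what a DTW pass would then accumulate into a global score.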
- in scoring via HMM, a model of a word is first built from the symbols and the global score is then calculated between models using the Viterbi algorithm or the forward-backward algorithm. Both algorithms are described in the article by L.R.
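A minimal Viterbi sketch for a discrete HMM is shown below, working in the log domain. It assumes that a word model is given as log initial, log transition and log emission probabilities and that a test symbol sequence is scored against it. This is one possible reading of the scoring step, and the function name and argument layout are hypothetical.

```python
import math
from typing import List, Sequence


def viterbi_log_score(
    symbols: Sequence[int],
    log_init: Sequence[float],
    log_trans: Sequence[Sequence[float]],
    log_emit: Sequence[Sequence[float]],
) -> float:
    """Log-probability of the single best state path for the symbol sequence.

    log_trans[p][s] is the log-probability of moving from state p to state s;
    log_emit[s][k] is the log-probability of state s emitting symbol k.
    """
    n_states = len(log_init)
    delta: List[float] = [
        log_init[s] + log_emit[s][symbols[0]] for s in range(n_states)
    ]
    for sym in symbols[1:]:
        delta = [
            max(delta[p] + log_trans[p][s] for p in range(n_states))
            + log_emit[s][sym]
            for s in range(n_states)
        ]
    return max(delta)
```

The word whose model yields the highest score for the test sequence would be selected. Zero probabilities can be represented as `float("-inf")`.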
- Fig. 2 shows a schematic block diagram of the architecture implementing the system of Fig. 1.
- a user codec 40 such as an Intel 913, from Intel Corporation, receives the analog signal from the input device 10 and interfaces with digital signal processing circuitry 42, typically a TMS 320C25 from Texas Instruments Corporation.
- a memory storage area 44 which typically includes a static random-access memory such as a 32K by 8 bit memory with an access time of 100 nsec, is connected to the digital signal processing circuitry 42 by means of a standard address data and read-write control bus.
- the operations of Fig. 1 are typically carried out by software run on the digital signal processing circuitry 42.
- the VOX of unit 16 is typically incorporated in software run on the digital signal processing circuitry 42.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Complex Calculations (AREA)
- Image Analysis (AREA)
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US08/190,087 US5487129A (en) | 1991-08-01 | 1992-07-30 | Speech pattern matching in non-white noise |
| JP5503742A JPH06510375A (ja) | 1991-08-01 | 1992-07-30 | Speech pattern matching in non-white noise |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| IL99041 | 1991-08-01 | | |
| IL9904191A IL99041A (en) | 1991-08-01 | 1991-08-01 | Speech pattern matching in non-white noise |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO1993003480A1 (fr) | 1993-02-18 |
Family
ID=11062774
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US1992/006351 Ceased WO1993003480A1 (fr) | Speech pattern matching in non-white noise | 1991-08-01 | 1992-07-30 |
Country Status (4)
| Country | Link |
|---|---|
| JP (1) | JPH06510375A (fr) |
| AU (1) | AU2447592A (fr) |
| IL (1) | IL99041A (fr) |
| WO (1) | WO1993003480A1 (fr) |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US4720802A (en) * | 1983-07-26 | 1988-01-19 | Lear Siegler | Noise compensation arrangement |
| US4737976A (en) * | 1985-09-03 | 1988-04-12 | Motorola, Inc. | Hands-free control system for a radiotelephone |
| US4829578A (en) * | 1986-10-02 | 1989-05-09 | Dragon Systems, Inc. | Speech detection and recognition apparatus for use with background noise of varying levels |
| US4926488A (en) * | 1987-07-09 | 1990-05-15 | International Business Machines Corporation | Normalization of speech by adaptive labelling |
-
1991
- 1991-08-01 IL IL9904191A patent/IL99041A/en not_active IP Right Cessation
-
1992
- 1992-07-30 AU AU24475/92A patent/AU2447592A/en not_active Abandoned
- 1992-07-30 JP JP5503742A patent/JPH06510375A/ja active Pending
- 1992-07-30 WO PCT/US1992/006351 patent/WO1993003480A1/fr not_active Ceased
Also Published As
| Publication number | Publication date |
|---|---|
| IL99041A (en) | 1996-03-31 |
| IL99041A0 (en) | 1992-07-15 |
| AU2447592A (en) | 1993-03-02 |
| JPH06510375A (ja) | 1994-11-17 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | AK | Designated states | Kind code of ref document: A1. Designated state(s): AT AU BB BG BR CA CH CS DE DK ES FI GB HU JP KP KR LK LU MG MN MW NL NO PL RO RU SD SE US |
| | AL | Designated countries for regional patents | Kind code of ref document: A1. Designated state(s): AT BE CH DE DK ES FR GB GR IT LU MC NL SE BF BJ CF CG CI CM GA GN ML MR SN TD TG |
| | DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 20040101) | |
| | WWE | WIPO information: entry into national phase | Ref document number: 08190087. Country of ref document: US |
| | REG | Reference to national code | Ref country code: DE. Ref legal event code: 8642 |
| | NENP | Non-entry into the national phase | Ref country code: CA |
| | 122 | EP: PCT application non-entry in European phase | |