
HK1035600B - System and method for noise-compensated speech recognition - Google Patents


Info

Publication number
HK1035600B
Authority
HK
Hong Kong
Prior art keywords
noise
input signal
speech
speech recognition
signal
Prior art date
Application number
HK01105667.0A
Other languages
Chinese (zh)
Other versions
HK1035600A1 (en)
Inventor
G. C. Sih
Ning Bi
Original Assignee
Qualcomm Incorporated
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US09/018,257 (US6381569B1)
Application filed by Qualcomm Incorporated
Publication of HK1035600A1
Publication of HK1035600B

Description

System and method for noise compensated speech recognition
Background
I. Field of the invention
The present invention relates to voice processing. More particularly, the present invention relates to a system and method for automatic recognition of spoken words or phrases.
Description of the related Art
Digital processing of voice signals has been widely adopted, particularly in cellular telephone and PCS applications. One digital voice processing technique is speech recognition. The use of speech recognition is gaining importance for safety reasons. For example, voice recognition may replace the manual task of pressing buttons on a cellular telephone keypad, which is especially important when the user is placing a call while driving. When using a telephone without voice recognition, the driver must remove one hand from the steering wheel and look at the phone keypad while pressing the buttons to dial the call. These actions increase the likelihood of a traffic accident. Voice recognition, in contrast, allows the driver to place calls while keeping both hands on the steering wheel and a constant watch on the road. Hands-free car kits containing voice recognition will likely be a legal requirement in future systems for safety reasons.
The most common type of speaker-dependent voice recognition in use today operates in two phases: a training phase and a recognition phase. During the training phase, the speech recognition system prompts the user to speak each word in the vocabulary once or twice so that the system can learn the characteristics of the user's pronunciation of these particular words or phrases. The recognition vocabulary is typically small (fewer than 50 words), and the voice recognition system achieves high recognition accuracy only for the users on which it was trained. An example vocabulary for a hands-free in-vehicle device includes the digits on the keypad; the keywords "call," "send," "dial," "cancel," "clear," "add," "delete," "history," "program," "yes," and "no"; and the names of 20 commonly called colleagues, friends, or family members. Once training is complete, the user can initiate a call in the recognition phase by speaking the trained keywords. For example, if the name "John" is one of the trained names, the user places a call to John by saying the phrase "call John." The voice recognition system recognizes the words "call" and "John" and dials the number that the user previously entered as John's telephone number.
A block diagram of the training unit 6 of a speaker-dependent voice recognition system is shown in fig. 1. The training unit 6 receives an input s(n), a set of digitized speech samples for a word or phrase to be trained. The speech signal s(n) is passed through a parameter determination block 7, which produces a template of N parameters {p(n), n = 1 … N} that captures the characteristics of the user's pronunciation of the specific word. The parameter determination block 7 may employ any of several speech parameter determination techniques, which are well known in the art. An exemplary embodiment of a parameter determination technique is a vocoder encoder; see U.S. Pat. No. 5,414,796, entitled "Variable Rate Vocoder," assigned to the assignee of the present invention and incorporated herein by reference. Another example of a parameter determination technique is the Fast Fourier Transform (FFT), in which the N parameters are the N FFT coefficients. Other embodiments derive parameters from the FFT coefficients. Each spoken word produces a template of N parameters that is stored in the template database 8. After training on all M vocabulary words is completed, the template database 8 contains M templates, each containing N parameters. The template database 8 is stored in some type of non-volatile memory so that the templates remain after power is removed.
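The FFT-based parameter determination mentioned above can be illustrated with a short sketch. This is a naive DFT in pure Python, kept deliberately small; the frame length, number of parameters, and the test tone are illustrative assumptions, not values from the patent.

```python
import math

def dft_template(samples, n_params=8):
    # Naive DFT magnitudes serve as a crude parameter template.
    # A real system would use an FFT or a vocoder front end.
    N = len(samples)
    template = []
    for k in range(n_params):
        re = sum(samples[n] * math.cos(2 * math.pi * k * n / N) for n in range(N))
        im = -sum(samples[n] * math.sin(2 * math.pi * k * n / N) for n in range(N))
        template.append(math.hypot(re, im))
    return template

# Training: one template per spoken vocabulary word (here, a synthetic tone).
word_samples = [math.sin(2 * math.pi * 5 * n / 64) for n in range(64)]
template = dft_template(word_samples)
```

Each trained word would produce one such template; with M words, the template database then holds M templates of N parameters, as described above.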
Fig. 2 is a block diagram of a speech recognition unit 10, which operates during the recognition phase of a speaker-dependent speech recognition system. The speech recognition unit 10 contains a template database 14, which in general is the template database 8 produced by the training unit 6. The input to the speech recognition unit 10 is digitized input speech x(n), which is the speech to be recognized. The input speech x(n) enters a parameter determination block 12, which uses the same parameter determination technique as the parameter determination block 7 of the training unit 6. The parameter determination block 12 generates a recognition template of N parameters {t(n), n = 1 … N} that forms a feature model of the input speech x(n). The recognition template t(n) is then passed to a pattern comparison block 16, which performs a pattern comparison between the template t(n) and all of the templates stored in the template database 14. The distances between the template t(n) and each template in the template database 14 are passed to a decision block 18, which selects from the template database 14 the template closest to the recognition template t(n). The output of decision block 18 is a decision on which vocabulary word was spoken.
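Blocks 16 and 18 together amount to a nearest-template search. A minimal sketch, assuming a hypothetical three-parameter template database and Euclidean distance as the distance measure (the patent does not prescribe a particular one):

```python
def euclidean_distance(t, p):
    # Distance between a recognition template and a stored template.
    return sum((a - b) ** 2 for a, b in zip(t, p)) ** 0.5

def recognize(recognition_template, template_db):
    # Decision block 18: pick the vocabulary word whose stored
    # template is closest to the recognition template.
    return min(template_db,
               key=lambda word: euclidean_distance(recognition_template,
                                                   template_db[word]))

template_db = {"call": [1.0, 0.2, 0.1],
               "send": [0.1, 0.9, 0.4],
               "yes":  [0.3, 0.3, 0.8]}
decision = recognize([0.9, 0.25, 0.15], template_db)
```

With a real parameter set, the same structure applies; only the template length and the distance measure change.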
Recognition accuracy is a measure of how correctly a recognition system recognizes words in the recognition vocabulary. For example, a recognition accuracy of 95% means that the recognition unit correctly recognizes 95 words out of 100. In conventional speech recognition systems, recognition accuracy is severely degraded in the presence of noise. The main reason for the loss of accuracy is that training typically occurs in a quiet environment, but recognition typically occurs in a noisy one. For example, a hands-free in-vehicle speech recognition system is typically trained while the vehicle is parked in a garage or driveway, so the engine and air conditioning are off and the windows are usually rolled up. Recognition, however, is used while the vehicle is moving, so the engine is running, road and wind noise are present, the windows may be down, and so on. Because of the difference in noise level between the training and recognition phases, the recognition template does not match well with any of the templates obtained during training. This increases the likelihood of a recognition error or failure.
Fig. 3 depicts a speech recognition unit 20 that must perform speech recognition in the presence of noise. As shown in fig. 3, the adder 22 adds the speech signal x(n) to the noise signal w(n) to obtain the noise-corrupted speech signal r(n). It should be understood that the adder 22 is not a physical element of the system but a model of the noisy environment. The noise-corrupted speech signal r(n) is input to the parameter determination block 24, producing a noise-corrupted template t1(n). The pattern comparison block 28 compares the template t1(n) with all the templates in the template database 26, which was constructed in a quiet environment. Since the noise-corrupted template t1(n) does not match any of the training templates well, the decision made by decision block 30 has a high probability of being a recognition error or failure.
Summary of the Invention
The present invention is a system and method for automatically recognizing spoken words in the presence of noise. A speaker-dependent speech recognition system operates in two phases: a training phase and a recognition phase. During the training phase of a conventional speech recognition system, the user is prompted to speak all of the words in a prescribed vocabulary. The digitized speech samples for each word are processed to produce a parameter template characterizing the spoken word. The output of the training phase is a library of these templates. In the recognition phase, the user speaks a particular word to initiate the desired action. The spoken word is digitized and processed to generate a template, which is compared with all the templates generated during training. The closest match determines the action to be performed. The main impairment limiting the accuracy of speech recognition systems is the presence of noise. Noise added during recognition severely degrades recognition accuracy because that noise was not present during training, when the template database was generated. The present invention recognizes that the specific noise present at recognition time must be taken into account to improve recognition accuracy.
Thus, instead of storing parameter templates, the improved speech processing system and method stores digitized speech samples for each spoken word during the training phase. The output of the training phase is therefore a digitized speech database. In the recognition phase, the noise characteristics of the acoustic environment are continuously monitored. When the user speaks a word to be recognized, a noise-compensated template database is constructed by adding a noise signal to each signal in the speech database and determining parameters for each noise-added speech signal. In one embodiment, the added noise signal is an artificially synthesized noise signal with characteristics similar to the actual noise. In another embodiment, a time window of the noise occurring just before the user speaks the word to be recognized is recorded and used. Since the template database is constructed with the same type of noise that is present in the word to be recognized, the speech recognition unit can find a good match among the templates, improving recognition accuracy.
Brief Description of Drawings
The features, objects, and advantages of the present invention will become more apparent from the detailed description set forth below when taken in conjunction with the accompanying drawings, in which like reference numerals identify corresponding elements throughout.
FIG. 1 is a block diagram of a speech recognition system training unit;
FIG. 2 is a block diagram of a speech recognition unit;
FIG. 3 is a block diagram of a speech recognition unit that performs speech recognition on noise-corrupted speech input;
FIG. 4 is a block diagram of an improved speech recognition system training unit; and
fig. 5 is a block diagram of an exemplary improved speech recognition unit.
Detailed description of the preferred embodiments
The present invention provides a system and method for improving speech recognition accuracy in the presence of noise. It takes advantage of recent advances in computing power and memory integration and modifies the training and recognition stages to account for the presence of noise during recognition. The function of the speech recognition unit is to find the stored template that most closely matches the recognition template computed from the noise-corrupted voice. Since the characteristics of the noise vary with time and place, the present invention recognizes that the best time to construct the template database is during the recognition stage.
Fig. 4 shows a block diagram of an improved training unit 40 for the speech recognition system. Unlike the conventional training method of fig. 1, the training unit 40 is modified to eliminate the parameter determination step. Instead of storing parameter templates, digitized speech samples of the actual words are stored. Accordingly, training unit 40 receives the speech samples s(n) as input and stores the digitized speech samples s(n) in a speech database 42. After training, the speech database 42 contains M speech signals, where M is the number of words in the vocabulary. Whereas prior-art parameter determination systems and methods, which store only speech parameters, lose information about the speech characteristics, this system and method retain all of the speech information for the recognition stage.
Fig. 5 is a block diagram of an improved speech recognition unit 50 for use with training unit 40. The input to the speech recognition unit 50 is a noise-corrupted speech signal r(n), obtained by adding the speech signal x(n) to the noise signal w(n) in adder 52. As before, the adder 52 is not a physical element of the system but a model of the noisy environment.
The speech recognition unit 50 comprises a speech database 60 containing the digitized speech samples recorded during the training phase. The speech recognition unit 50 further comprises a parameter determination block 54, through which the noise-corrupted speech signal r(n) is passed to generate a noise-corrupted template t1(n). As in conventional voice recognition systems, the parameter determination block 54 may employ any speech parameter determination technique.
A typical parameter determination technique employs Linear Predictive Coding (LPC) analysis, which models the vocal tract as a digital filter. With LPC analysis, LPC cepstral coefficients c(m) can be calculated as the parameters representing the speech signal. The coefficients c(m) are calculated by the following procedure. First, for a frame of speech samples, the noise-corrupted speech signal r(n) is windowed with a window function v(n):
y(n) = r(n)v(n), 0 <= n <= N-1 (1)
In the present exemplary embodiment, the window function v(n) is a Hamming window and the frame size N is 160. The autocorrelation coefficients of the windowed samples are then calculated as:

R(k) = Σ_{n=k}^{N-1} y(n)y(n-k), 0 <= k <= P (2)
In a typical embodiment, P, the number of autocorrelation coefficients calculated, equals the order of the LPC predictor, which is 10. The LPC coefficients are then computed directly from the autocorrelation values using Durbin's recursion, which may be stated as follows:

1. E(0) = R(0), i = 1 (3)
2. k_i = [R(i) - Σ_{j=1}^{i-1} α_j^(i-1) R(i-j)] / E(i-1) (4)
3. α_i^(i) = k_i (5)
4. α_j^(i) = α_j^(i-1) - k_i α_{i-j}^(i-1), 1 <= j <= i-1 (6)
5. E(i) = (1 - k_i^2) E(i-1) (7)
6. If i < P, set i = i + 1 and go to step 2. (8)

The final solution for the LPC coefficients is

a_j = α_j^(P), 1 <= j <= P (9)
The LPC coefficients are then converted to LPC cepstral coefficients using the following equations:

c(0) = ln(R(0)) (10)
c(m) = a_m + Σ_{k=1}^{m-1} (k/m) c(k) a_{m-k}, 1 <= m <= P (11)
c(m) = Σ_{k=m-P}^{m-1} (k/m) c(k) a_{m-k}, m > P (12)
it should be appreciated that other techniques may be used for parameter determination instead of LPC cepstral coefficients.
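The chain of equations (1)-(12) can be sketched end to end in pure Python: Hamming window, autocorrelation, Durbin's recursion, and the LPC-to-cepstrum conversion. The numbers follow the document's stated choices (N = 160, P = 10); the test frame is an arbitrary noisy sine, not data from the patent.

```python
import math
import random

def lpc_cepstrum(r, P=10):
    # Equation (1): Hamming window over the frame.
    N = len(r)
    y = [r[n] * (0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)))
         for n in range(N)]
    # Equation (2): autocorrelation coefficients R(0)..R(P).
    R = [sum(y[n] * y[n - k] for n in range(k, N)) for k in range(P + 1)]
    # Equations (3)-(9): Durbin's recursion for the LPC coefficients.
    E = R[0]
    a = [0.0] * (P + 1)
    for i in range(1, P + 1):
        k_i = (R[i] - sum(a[j] * R[i - j] for j in range(1, i))) / E
        prev = a[:]
        a[i] = k_i
        for j in range(1, i):
            a[j] = prev[j] - k_i * prev[i - j]
        E *= (1.0 - k_i ** 2)
    # Equations (10)-(11): LPC cepstral coefficients c(0)..c(P).
    c = [math.log(R[0])]
    for m in range(1, P + 1):
        c.append(a[m] + sum((k / m) * c[k] * a[m - k] for k in range(1, m)))
    return a[1:], c

rng = random.Random(1)
frame = [math.sin(2 * math.pi * 7 * n / 160) + 0.1 * rng.gauss(0.0, 1.0)
         for n in range(160)]
lpc_coeffs, cepstra = lpc_cepstrum(frame)
```

The small additive noise in the test frame keeps the recursion numerically well conditioned; a noiseless sine would drive the prediction error E(i) toward zero.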
In addition, the signal r(n) is passed to a speech detection block 56, which determines the presence or absence of speech. The speech detection block 56 may use any technique to determine whether speech is present. One such method is described in the above-mentioned U.S. Pat. No. 5,414,796, entitled "Variable Rate Vocoder." This technique analyzes the level of voice activity to decide whether speech is present, where the voice activity level is the signal energy compared with an estimate of the background noise energy. First, the energy E(n) is calculated for each frame, which in a preferred embodiment consists of 160 samples. The background noise energy estimate B(n) is then updated using the following equation:
B(n) = min[E(n), 5059644, max(1.00547*B(n-1), B(n-1)+1)] (13)
If B(n) < 160000, three thresholds are calculated from B(n) as follows:
T1(B(n)) = -(5.544613×10^-6)*B^2(n) + 4.047152*B(n) + 362 (14)
T2(B(n)) = -(1.529733×10^-5)*B^2(n) + 8.750045*B(n) + 1136 (15)
T3(B(n)) = -(3.957050×10^-5)*B^2(n) + 18.89962*B(n) + 3347 (16)
If B(n) > 160000, the three thresholds are calculated as:
T1(B(n)) = -(9.043945×10^-8)*B^2(n) + 3.535748*B(n) - 62071 (17)
T2(B(n)) = -(1.986007×10^-7)*B^2(n) + 4.941658*B(n) + 223951 (18)
T3(B(n)) = -(4.838477×10^-7)*B^2(n) + 8.630020*B(n) + 645864 (19)
The speech detection method indicates that speech is present when the energy E(n) is greater than the threshold T2(B(n)), and that no speech is present when E(n) is less than T2(B(n)). In another embodiment, this method may be extended by calculating the background noise energy estimate and the thresholds in two or more frequency bands. In addition, it should be understood that the numerical values in equations (13)-(19) were determined experimentally and may be modified depending on the circumstances.
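The background update and thresholding above can be sketched directly. Only T2 is implemented here, since the text uses it for the speech/no-speech decision; the example energies are invented for illustration.

```python
def update_background(E, B_prev):
    # Equation (13): smoothed background-noise energy estimate.
    return min(E, 5059644, max(1.00547 * B_prev, B_prev + 1))

def threshold_T2(B):
    # Equations (15) and (18): the middle threshold, chosen by
    # the magnitude of the background estimate B(n).
    if B < 160000:
        return -(1.529733e-5) * B * B + 8.750045 * B + 1136
    return -(1.986007e-7) * B * B + 4.941658 * B + 223951

def speech_present(E, B):
    # Speech is declared when the frame energy exceeds T2(B(n)).
    return E > threshold_T2(B)

B = update_background(200000, 5000.0)  # quiet background creeps up slowly
```

The min/max structure in equation (13) lets the estimate fall quickly when a quieter frame arrives but rise only slowly, so brief speech bursts do not inflate the background estimate.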
When the speech detection block 56 determines that speech is not present, it sends a control signal that activates the noise analysis, modeling and synthesis block 58. Note that in the absence of speech, the received signal r(n) is identical to the noise signal w(n).
When the noise analysis, modeling and synthesis block 58 is activated, it analyzes the characteristics of the noise signal r(n), models them, and synthesizes a noise signal w1(n) with characteristics similar to the actual noise w(n). An exemplary embodiment for performing noise analysis, modeling, and synthesis is shown in U.S. Pat. No. 5,646,991, entitled "Noise Replacement System and Method in an Echo Canceller," assigned to the assignee of the present invention and incorporated herein by reference. The method performs noise analysis by passing the noise signal r(n) through a prediction error filter:

A(z) = 1 - Σ_{i=1}^{P} a_i z^{-i} (20)

Here P is the order of the predictor, which is 5 in the present exemplary embodiment. The LPC coefficients a_i are calculated as explained previously, using equations (1) to (9). Once the LPC coefficients are obtained, white noise is passed through a noise synthesis filter to produce synthesized noise samples having the same spectral characteristics, described by:

H(z) = 1 / A(z) = 1 / (1 - Σ_{i=1}^{P} a_i z^{-i}) (21)

which is the inverse of the filter used for noise analysis. After each synthesized noise sample is scaled so that the synthesized noise energy equals the actual noise energy, the output is the synthesized noise w1(n).
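A sketch of that synthesis step: white Gaussian noise is driven through the all-pole filter 1/A(z), and the output is scaled so its energy equals the measured noise energy. The coefficients used here are made-up stand-ins for values that equations (1)-(9) would produce from the actual noise.

```python
import math
import random

def synthesize_noise(a, num_samples, target_energy, seed=0):
    # All-pole synthesis: w1(n) = x(n) + sum_i a_i * w1(n-1-i),
    # i.e. the inverse of the prediction-error filter A(z).
    rng = random.Random(seed)
    w1 = []
    for n in range(num_samples):
        x = rng.gauss(0.0, 1.0)
        s = x + sum(a[i] * w1[n - 1 - i] for i in range(min(len(a), n)))
        w1.append(s)
    # Scale so the synthesized noise energy matches the actual noise energy.
    energy = sum(v * v for v in w1)
    gain = math.sqrt(target_energy / energy) if energy > 0 else 0.0
    return [gain * v for v in w1]

a = [0.5, -0.1]  # assumed (stable) LPC coefficients; P = 2 for brevity
w1 = synthesize_noise(a, 160, target_energy=160.0)
```

A real implementation would check the filter's stability (all poles inside the unit circle) before synthesis; the assumed coefficients here are stable by construction.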
The synthesized noise w1(n) is added by adder 62 to each set of digitized speech samples in the speech database 60 to produce a set of synthesized noise-corrupted speech samples. Each set of synthesized noise-corrupted speech samples then passes through a parameter determination block 64, which uses the same parameter determination technique as parameter determination block 54. The parameter determination block 64 generates a parameter template for each set of speech samples and stores these templates in the noise-compensated template database 66. The noise-compensated template database 66 is thus a set of templates constructed as if the same type of noise present during recognition had been present during training. Note that there are many possible methods for generating the estimated noise w1(n) besides the one disclosed in U.S. Pat. No. 5,646,991. Another embodiment simply records a time window of the actual noise that occurs while the user is silent and uses this noise signal as the estimated noise w1(n). Recording the noise window just before the word to be recognized is spoken is an exemplary embodiment of this method. Yet another approach is to average several noise windows obtained over a specified time period.
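Adder 62 and parameter determination block 64 combine as follows: the synthesized noise is added sample-by-sample to each stored word, and a fresh template is derived from the noisy result. Here `toy_template` is a deliberately trivial stand-in for a real parameter determination technique such as LPC cepstra, and the two-word database is invented for illustration:

```python
def build_noise_compensated_db(speech_db, w1, make_template):
    # For each word: add the synthesized noise (adder 62), then
    # derive a parameter template (block 64) from the noisy samples.
    compensated = {}
    for word, samples in speech_db.items():
        noisy = [s + w for s, w in zip(samples, w1)]
        compensated[word] = make_template(noisy)
    return compensated

def toy_template(x):
    # Trivial two-number "template": energy in each half of the samples.
    h = len(x) // 2
    return [sum(v * v for v in x[:h]), sum(v * v for v in x[h:])]

speech_db = {"yes": [0.0, 1.0, 0.0, -1.0], "no": [1.0, 1.0, 1.0, 1.0]}
synth_noise = [0.1, -0.1, 0.1, -0.1]
compensated_db = build_noise_compensated_db(speech_db, synth_noise, toy_template)
```

Because the database is rebuilt whenever the noise estimate changes, the stored templates always reflect the current acoustic environment, which is the core idea of the scheme.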
Referring again to fig. 5, the pattern comparison block 68 compares the noise-corrupted template t1(n) with all the templates in the noise-compensated template database 66. Since the effects of noise are included in the templates of the noise-compensated template database 66, decision block 70 is able to find a good match for t1(n). Taking the influence of noise into account in this manner improves the accuracy of the speech recognition system.
The previous description of the preferred embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the basic principles defined herein may be applied to other embodiments without the exercise of inventive faculty. Thus, the present invention is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (15)

1. A speech recognition system, comprising:
a training unit for receiving signals of words to be trained, generating a digitized sample for each of said words, and storing said digitized samples in a speech database; and
a speech recognition unit for receiving a noise-corrupted input signal to be recognized, generating a noise-compensated template database by applying the effects of noise to the digitized samples of the speech database, and providing speech recognition results for the noise-corrupted input signal based on the noise-compensated template database,
wherein the speech recognition unit comprises:
a speech detection unit configured to receive the noise-corrupted input signal and determine whether speech is present in the input signal, wherein the input signal is designated as a noise signal when it is determined that no speech is present in the input signal; and
a noise unit activated upon determining that no speech is present in the input signal, the noise unit analyzing the noise signal and synthesizing a synthesized noise signal having the characteristics of the noise signal, the synthesized noise signal being used to apply the noise effects to the digitized samples of the speech database.
2. The speech recognition system of claim 1, wherein the speech detection unit determines whether speech is present by analyzing a level of speech activity in the input signal.
3. The speech recognition system of claim 1, wherein the noise unit analyzes and synthesizes the synthesized noise signal using a Linear Predictive Coding (LPC) technique.
4. The speech recognition system of claim 1, wherein the synthesized noise signal corresponds to a window of the noise signal recorded before the input signal to be recognized.
5. The speech recognition system of claim 1, wherein the synthesized noise signal corresponds to respective windowed averages of the noise signal recorded over a predetermined time.
6. A speech recognition system, comprising: a training unit for receiving signals of words to be trained, generating a digitized sample for each of said words, and storing said digitized samples in a speech database; and
a speech recognition unit for receiving a noise-corrupted input signal to be recognized, generating a noise-compensated template database by applying the effects of noise to the digitized samples of the speech database, and providing speech recognition results for the noise-corrupted input signal based on the noise-compensated template database,
wherein the speech recognition unit comprises:
a first parameter determining unit for receiving said noise-corrupted input signal and generating a parameter template representative of said input signal according to a predetermined parameter determination technique;
a second parameter determination unit for receiving said speech database, said database having a noise contribution to said digitized samples and generating said noise compensated template database according to said predetermined parameter determination technique;
a pattern comparison unit for comparing said parameter template representing said input signal with said noise compensated template database to determine a best match to identify said speech recognition result;
a speech detection unit for receiving the noise-corrupted input signal and determining whether speech is present in the input signal, wherein the input signal is designated as a noise signal when it is determined that no speech is present in the input signal; and
a noise unit activated upon determining that there is no speech in the input signal, the noise unit to analyze the noise signal and synthesize a synthesized noise signal having characteristics of the noise signal, the synthesized noise signal to apply a noise contribution to the digitized samples of the speech database.
7. The speech recognition system of claim 6, wherein the parameter determination technique is a Linear Predictive Coding (LPC) analysis technique.
8. The speech recognition system of claim 6, wherein the speech detection unit determines whether speech is present by analyzing a level of speech activity in the input signal.
9. The speech recognition system of claim 6, wherein the noise unit analyzes and synthesizes the synthesized noise signal using a Linear Predictive Coding (LPC) technique.
10. The speech recognition system of claim 6, wherein the synthesized noise signal corresponds to the noise signal window previously recorded for the input signal to be recognized.
11. The speech recognition system of claim 6, wherein the synthesized noise signals correspond to respective windowed averages of the noise signals recorded over a predetermined time.
12. A speech recognition unit of a speech recognition system for recognizing an input signal, said speech recognition unit taking into account the effects of a noisy environment, comprising:
means for storing a digitized sample of words of a vocabulary in a speech database; means for applying a noise impact to the digitized samples of the vocabulary to generate noise-impacted digitized samples of the vocabulary;
means for generating a noise-compensated template database from said noise-affected digitized samples; and
means for determining a speech recognition result of said input signal based on said noise-compensated template database,
Wherein the means for applying a noise influence comprises: means for determining whether speech is present in the input signal, wherein the input signal is designated as a noise signal when it is determined that speech is not present in the input signal; and
means for analyzing said noise signals and synthesizing a synthesized noise signal, said synthesized noise signal being added to said digitized samples of said vocabulary.
13. A speech recognition unit of a speech recognition system for recognizing an input signal, said speech recognition unit taking into account the effects of a noisy environment, comprising:
means for storing a digitized sample of words of a vocabulary in a speech database; means for applying a noise impact to the digitized samples of the vocabulary to generate noise-impacted digitized samples of the vocabulary;
means for generating a noise-compensated template database from said noise-affected digitized samples;
means for determining a speech recognition result of said input signal based on said noise compensation template database;
first parameter determining means for receiving said input signal and generating a parameter template representative of said input signal in accordance with a predetermined parameter determination technique; and
second parameter determining means for receiving said noise-affected digitized samples of said vocabulary and generating the templates of said noise-compensated template database according to said predetermined parameter determination technique;
wherein the means for determining the speech recognition result compares the parameter template representing the input signal with a template of the noise-compensated template database to determine a best match to identify the speech recognition result,
wherein the means for applying a noise influence comprises:
means for determining whether speech is present in the input signal, wherein the input signal is designated as a noise signal when it is determined that speech is not present in the input signal; and
means for analyzing said noise signals and synthesizing a synthesized noise signal, said synthesized noise signal being added to said digitized samples of said vocabulary.
14. A speech recognition method taking into account the effects of a noisy environment, characterized in that it comprises the steps of:
generating a digitized sample of each trained term, each said term belonging to a vocabulary;
storing the digitized samples into a voice database;
receiving an input signal to be recognized;
applying a noise impact to the digitized samples of the vocabulary to produce noise-impacted digitized samples of the vocabulary;
generating a noise-compensated template database from the noise-affected digitized samples; and
providing a speech recognition result of the input signal affected by the noise according to the noise compensation template database,
wherein the step of applying the noise influence comprises the steps of:
judging whether the input signal has voice, and when the input signal is judged not to have voice, designating the input signal as a noise signal; and
analyzing the noise signal and synthesizing a synthesized noise signal that is added to the digitized samples of the vocabulary to produce the noise-impacted digitized samples.
15. A speech recognition method taking into account the effects of a noisy environment, characterized in that it comprises the steps of:
generating a digitized sample of each trained term, each said term belonging to a vocabulary;
storing the digitized samples into a voice database;
receiving an input signal to be recognized;
applying a noise impact to the digitized samples of the vocabulary to produce noise-impacted digitized samples of the vocabulary;
generating a noise-compensated template database from the noise-affected digitized samples;
providing a speech recognition result of the input signal affected by the noise according to the noise compensation template database,
generating a template representing parameters of said input signal in accordance with a predetermined parameter determination technique; and
generating a template of said noise compensation template database according to said predetermined parameter determination technique;
wherein said step of providing a speech recognition result compares a parameter template representing said input signal with said templates of said noise-compensated template database to determine a best match to identify said speech recognition result,
wherein the step of applying a noise effect comprises the steps of:
determining the presence or absence of speech in the input signal, wherein the input signal is designated as a noise signal when the absence of speech in the input signal is determined; and
analyzing the noise signal and synthesizing a synthesized noise signal that is added to the digitized samples of the vocabulary to produce the noise-affected digitized samples.
HK01105667.0A 1998-02-04 1999-02-03 System and method for noise-compensated speech recognition HK1035600B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US09/018,257 1998-02-04
US09/018,257 US6381569B1 (en) 1998-02-04 1998-02-04 Noise-compensated speech recognition templates
PCT/US1999/002280 WO1999040571A1 (en) 1998-02-04 1999-02-03 System and method for noise-compensated speech recognition

Publications (2)

Publication Number Publication Date
HK1035600A1 HK1035600A1 (en) 2001-11-30
HK1035600B true HK1035600B (en) 2006-07-21

