WO2004114280A1 - Apparatus and method for speech recognition of degraded speech by speech coder

Info

Publication number: WO2004114280A1
Application number: PCT/KR2003/001217
Authority: WO
Inventors: Hochong Park; Sang Wook Han
Original assignee: Kwangwoon Foundation
Current assignee: Kwangwoon Foundation
Priority date: 2003-06-20
Filing date: 2003-06-20
Publication date: 2004-12-29
Anticipated expiration: 2005-12-20
Also published as: AU2003244242A1

Abstract

Disclosed are a speech recognition method and apparatus that include a process of compensating for speech degraded by a speech coder. The aim is to prevent the speech coder from degrading speech recognition performance, and to improve that performance, when a voice signal is not inputted to the speech recognition apparatus directly but is instead reproduced through the encoding and decoding processes of the speech coder while being transferred over a digital communication network. The speech recognition method and apparatus according to the present invention include a process of increasing the speech recognition performance when the speech recognition apparatus is trained either with only an original voice signal that has not passed through the speech coder or with both the original voice signal and signals degraded by a plurality of speech coders, and a process of compensating for the degradation caused by the speech coder regardless of the type of the speech coder.

Description

APPARATUS AND METHOD FOR SPEECH RECOGNITION OF DEGRADED SPEECH BY SPEECH CODER

Technical Field

The present invention relates to a speech recognition method and apparatus. More specifically, the present invention relates to a speech recognition method and apparatus that include a process of compensating for speech degraded by a speech coder, in order to prevent the speech coder from degrading speech recognition performance and to improve that performance when a speech signal is not inputted to the speech recognition apparatus directly but is instead reproduced through the encoding and decoding processes of the speech coder while being transferred over a digital communication network.

The speech recognition method and apparatus according to the present invention include a process of increasing the speech recognition performance when the speech recognition apparatus is trained either with only an original speech signal that has not passed through the speech coder or with both the original speech signal and signals degraded by a plurality of speech coders, and a process of compensating for the degradation caused by the speech coder regardless of the type of the speech coder.

The speech recognition process used in the present invention follows the same procedure as the prior art, except that a new process for compensating for the degradation of a signal by the speech coder is added to the step of obtaining the speech parameters used in the recognition apparatus. Further, the speech recognition process recognizes predetermined isolated words. By applying the distortion compensating process to a speaker-independent isolated-word recognition method and apparatus, which can recognize speech regardless of the user, the speech recognition performance thereof is improved.

Background Art

Speech recognition is a technique in which a computer automatically recognizes a speech signal uttered by a human. Because the machine understands the speech signal, interaction between the machine and the human becomes possible. With the rapid development of speech recognition techniques, speech recognition has come into use in various fields, and new products and services based on it have been developed. At present, the speech signal is usually inputted directly to a speech recognition system through a microphone. Examples include command recognition in a computer, automatic document input devices, telephone number recognition in a portable phone, and speech recognition toys. However, cases in which a speech command is inputted through a communication network also occur frequently. Namely, an initially uttered speech signal passes through the communication network and is inputted to a speech recognition system, which is located at a server of the other party, and the speech recognition system recognizes the speech signal. When the initially uttered speech signal passes through the communication network, it is modified, which lowers the performance of speech recognition. With the increasing use of digital mobile communication and internet communication, the use of digital communication has been increasing rapidly. Services in which a speech command is inputted to a communication terminal and transferred to a speech recognition system through digital communication, and in which the speech recognition system recognizes speech based on the speech command, are expected to increase rapidly.

Speech recognition through a digital communication network has a different structure from that of the conventional direct input system through a microphone.

FIG. 1 is a view illustrating a speech recognition system that operates through a digital communication network. When an input speech signal 101 is inputted to a communication terminal 102, a speech encoder 103 of the communication terminal 102 converts the input speech signal 101 into digital data 104 to be transmitted over a digital communication channel. The digital data 104 is transmitted to the other party through a digital communication network. A speech decoder 107 restores the transmitted data 106 to a speech signal 108. A speech recognition system 109 performs speech recognition on the speech signal 108 from the speech decoder 107 and outputs a recognized result 110. The decoded speech signal 108 differs from the input speech signal 101 because of the operations of the speech encoder 103 and the speech decoder 107. As a result, the recognition performance of the speech recognition system 109 is lower when it recognizes the decoded speech signal 108 than when it recognizes the original input speech signal 101. Since the recognition information included in the decoded speech signal 108 is changed by the operations of the speech encoder 103 and the speech decoder 107, speech recognition is performed with information different from what the speech recognition system 109 anticipates, thus lowering the recognition performance. In speech recognition that does not pass through the communication network, the speech signal 101 is directly inputted to the speech recognition system 109. The recognition information is therefore unchanged, and an excellent speech recognition performance is obtained.

In an isolated word hidden Markov model (HMM), a speech recognition system is trained, word by word, on a plurality of previously determined words to be recognized, using speech signals of various speakers. A feature parameter of the speech signal corresponding to each word is obtained during the training. The feature parameter is stochastically analyzed by an HMM and is classified by word. In actual recognition, the same feature parameter used in the training is obtained from an unknown input speech signal. The feature parameter is compared with the previously prepared information, and the word corresponding to the input signal is searched for based on the compared result. A perceptual linear prediction (PLP) parameter is widely used as the feature parameter for recognizing isolated words.

FIG. 2 is a view illustrating a prior-art process of obtaining a PLP parameter from an input signal. A discrete Fourier transform (DFT) calculator 202 receives an input signal 201 in frames and outputs a DFT coefficient magnitude 203 to a critical-band integrator 204. The critical-band integrator 204 integrates the DFT coefficient magnitude over perceptually meaningful frequency bands according to the Mel scale and outputs a power 205 for each band to a loudness compensator 206. The Mel scale is given by the following equation (1).

$$\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \qquad (1)$$
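As a small illustration of equation (1), the following Python sketch converts between Hz and the Mel scale and places band edges at equal Mel spacing. The function names, the inverse mapping, and the idea of equally spaced band edges are assumptions made for illustration; the text does not state how many bands the critical-band integrator 204 uses or where their edges lie.

```python
import math


def hz_to_mel(f_hz):
    """Equation (1): Mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Algebraic inverse of equation (1)."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(num_bands, f_max_hz):
    """Band edges in Hz, equally spaced on the Mel scale (illustrative assumption)."""
    mel_max = hz_to_mel(f_max_hz)
    return [mel_to_hz(mel_max * i / num_bands) for i in range(num_bands + 1)]
```

For a signal sampled at 8 kHz, `mel_band_edges(num_bands, 4000.0)` would give edges that are narrow at low frequencies and wide at high frequencies, which is the perceptual motivation for the scale.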

The loudness compensator 206 compensates the power 205 for each band from the integrator 204 for the band-dependent difference in loudness perceived by a human, and outputs a compensated power 207 to a multiplier 208. The multiplier 208 applies a multiplication to the compensated power 207 from the loudness compensator 206 and outputs an intensity 209 to an inverse DFT calculator 210. The inverse DFT calculator 210 applies an inverse DFT to the intensity 209 from the multiplier 208 and outputs an autocorrelation coefficient 211 to a Durbin calculator 212. The Durbin calculator 212 applies the Durbin recursion to the autocorrelation coefficient 211 and outputs a linear predictive coding (LPC) coefficient 213 to a cepstrum calculator 214. The cepstrum calculator 214 receives the LPC coefficient 213 from the Durbin calculator 212 and outputs a cepstrum coefficient 215 for a PLP model. The cepstrum coefficient 215 is the PLP feature parameter used in training on the speech signal.
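The last two stages of this chain, the Durbin calculator 212 and the cepstrum calculator 214, can be sketched as follows. This is a minimal sketch of the textbook Levinson-Durbin recursion and the standard LPC-to-cepstrum recursion, assuming the predictor convention x[n] ≈ a_1 x[n-1] + ... + a_p x[n-p]. The function names, the sign convention, and the omission of the gain term c_0 are assumptions, since the text does not specify them.

```python
def levinson_durbin(r, order):
    """Solve the LPC normal equations from autocorrelation values r[0..order].

    Returns (a, err), where a[k-1] is the predictor coefficient a_k and
    err is the final prediction error energy.
    """
    a = [0.0] * (order + 1)  # a[0] is unused; a[k] holds a_k
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient of stage i
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= 1.0 - k * k
    return a[1:], err


def lpc_to_cepstrum(a, num_ceps):
    """Cepstrum c_1..c_num_ceps of the all-pole model 1 / (1 - sum_k a_k z^-k)."""
    p = len(a)
    c = [0.0] * (num_ceps + 1)  # c[0] (gain term) is omitted in this sketch
    for n in range(1, num_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]
```

For a first-order model with autocorrelation r[k] = 0.5^k, the recursion recovers a_1 = 0.5 and a_2 = 0, and the cepstrum follows the known closed form c_n = 0.5^n / n.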

In order to achieve accurate speech recognition, the signal inputted for recognition should have features similar to those of the speech signal of each word used in the training process of the speech recognition device. When the features of the speech signal of a word are significantly modified, the feature parameter obtained differs from the feature parameter obtained for the same word in the training process, thus causing inaccurate recognition. In FIG. 1, the speech recognition system 109 is trained with the initial input speech signal 101, but actual recognition is performed on the decoded speech signal 108. Due to the difference in features between the initial input speech signal 101 and the decoded speech signal 108, different feature parameters are obtained for the same word, thus lowering the recognition performance.

Referring to FIG. 1, conventionally, in order to improve speech recognition performance, the decoded speech signal 108 is used to train the speech recognition system 109. In this case, since signals having the same features are used for training and recognition, the recognition performance increases. However, various kinds of speech encoders and speech decoders are used as the speech encoder 103 and the speech decoder 107 in real digital communication, and the features of the distortion introduced into the speech signal differ according to the kind of speech encoder and speech decoder. Accordingly, the decoded speech signals 108 must be obtained for every kind of speech encoder and decoder, and a speech recognition system 109 must be trained to recognize each of them. Since the speech recognition system of FIG. 1 then requires a different speech recognition system 109 for each kind of speech coder, it is inefficient to implement. For example, IS-96 QCELP8k, IS-733 QCELP13k, and IS-127 EVRC speech coders are used in CDMA mobile communication. G.729 CS-ACELP and G.723.1 speech coders are used in internet communication. An AMR speech coder is used in IMT-2000 communication systems.

Another technique for improving speech recognition performance in FIG. 1 trains one speech recognition system 109 with the input speech signal 101 and the signals 108 decoded by all kinds of speech encoders 103 and decoders 107. In this case, since the speech recognition system 109 has learned the speech features of every signal, the deterioration in recognition performance for each input signal is prevented to some extent. However, since the speech recognition system 109 is not optimally trained for each kind of signal, it achieves only an average performance rather than the best performance for each signal. In addition, whenever a speech encoder 103 and decoder 107 are added or eliminated, the speech recognition system must be trained again to reflect the change.

Disclosure of Invention

Therefore, the present invention has been made in view of the above-mentioned problems, and it is an object of the present invention to provide a method and an apparatus which improve the speech recognition performance for a speech signal inputted through a digital communication network when recognizing speech using a PLP parameter in an isolated word HMM.

It is another object of the present invention to provide a method and an apparatus which compensate for a distortion of a speech signal decoded by a speech coder in a digital communication network and which improve speech recognition performance using the distortion-compensated information.

Also, it is an object of the present invention to provide a method and an apparatus which improve a recognition performance for a speech signal distorted by a plurality of speech coders using one speech recognition system trained with an original speech signal.

Also, it is an object of the present invention to provide a method and an apparatus which improve a recognition performance for a speech signal distorted by speech coders using one speech recognition system trained with an original speech signal and signals distorted by a plurality of speech coders.

Also, it is an object of the present invention to provide a method and an apparatus for improving speech recognition performance for speech coders which are widely used in digital communication networks.

According to the present invention, there is provided an apparatus for compensating for speech degraded by a speech coder, in which the distortion is compensated in the step of obtaining a feature parameter during a speech recognition process. The apparatus for obtaining the feature parameter includes a device for obtaining frequency-domain DFT coefficients from an input speech signal, a device for sorting the DFT coefficient magnitudes in order of magnitude, a device for applying a weight function to the DFT coefficient magnitudes sorted in order of magnitude, a device for restoring the DFT coefficient magnitudes to their original frequency order, and a device for obtaining a PLP parameter from the weighted DFT coefficient magnitudes. The speech recognition process is then performed according to a conventional method using the PLP parameter obtained by the present invention.

Brief Description of the Drawings

The foregoing and other objects, features and advantages of the present invention will become more apparent from the following detailed description when taken in conjunction with the accompanying drawings in which:

FIG. 1 is a view for showing a method which recognizes a speech signal decoded by a speech coder in a digital communication network;

FIG. 2 is a view for showing a method which obtains a PLP parameter for speech recognition in prior art;

FIG. 3 is a view for illustrating a process of obtaining a PLP parameter based on an input signal in a method and an apparatus for compensating for a distortion occurring due to a speech coder according to an embodiment of the present invention;

FIG. 4 is a view for showing weight value functions which are used in the present invention;

FIG. 5 is a view for comparing a performance of a conventional speech recognition and that of a speech recognition according to an embodiment of the present invention; and

FIG. 6 is a view for comparing a performance of another conventional speech recognition and that of a speech recognition according to an embodiment of the present invention.

Best Mode for Carrying Out the Invention

Hereinafter, a preferred embodiment of the present invention will be described with reference to the accompanying drawings.

FIG. 3 is a view for illustrating a process of obtaining a PLP parameter based on an input signal in a method and an apparatus for compensating for a distortion occurring due to a speech coder according to an embodiment of the present invention.

A DFT calculator 302 receives a speech signal 301 decoded by a speech coder and outputs a DFT coefficient magnitude 303. The speech signal 301 is sampled at 8 kHz, and each sample is linearly quantized with 16 bits. The speech signal 301 is inputted to the DFT calculator 302 in frames of 25 msec, and the frame is shifted along the time axis in steps of 10 msec. The DFT calculator 302 according to an embodiment of the present invention applies a window to each 25 msec frame, performs a 256-point DFT, and outputs the magnitude of each DFT coefficient in dB scale to a magnitude sorting section 304. The magnitude sorting section 304 sorts the DFT coefficient magnitude 303 in order from the greatest value to the least value, and outputs the DFT coefficient magnitude 305 sorted in order of magnitude to a weighting module 306. The weighting module 306 applies a weight value to the sorted DFT coefficient magnitude 305 and outputs a weighted DFT coefficient magnitude 307 to an order restoring device 308.
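The framing and DFT step described above can be sketched in Python as follows. The frame length (200 samples), frame shift (80 samples), and DFT size follow from the 8 kHz sampling rate, 25 msec frame, 10 msec shift, and 256-point DFT stated in the text. The Hamming window, the zero padding of each frame to 256 samples, the use of 20·log10 for the dB scale, the small floor that avoids log(0), and all names are assumptions; the text only says that a window operation is performed. A direct DFT is used so that the sketch needs only the standard library; an FFT routine would normally be used instead.

```python
import cmath
import math

SAMPLE_RATE_HZ = 8000
FRAME_LEN = 200    # 25 msec at 8 kHz
FRAME_SHIFT = 80   # 10 msec at 8 kHz
DFT_SIZE = 256


def split_frames(signal):
    """Return overlapping 25 msec frames taken every 10 msec."""
    last_start = len(signal) - FRAME_LEN
    return [signal[s:s + FRAME_LEN] for s in range(0, last_start + 1, FRAME_SHIFT)]


def dft_magnitude_db(frame, floor=1e-10):
    """Magnitudes in dB of DFT bins 0..127 of one windowed, zero-padded frame."""
    n = len(frame)
    # Hamming window (assumed; the window type is not given in the text).
    windowed = [x * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    padded = windowed + [0.0] * (DFT_SIZE - n)
    magnitudes = []
    for k in range(DFT_SIZE // 2):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / DFT_SIZE)
                    for i, x in enumerate(padded))
        magnitudes.append(20.0 * math.log10(max(abs(coeff), floor)))
    return magnitudes
```

Keeping only bins 0 to 127 matches the 128 DFT coefficients that the weight function of FIG. 4 is described as operating on.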

The weighting module 306 is the main device for compensating for a distortion caused by a speech coder. The weighting module 306 applies a relatively greater weight value to a DFT coefficient having a small magnitude in order to compensate for the distortion. A speech coder compresses a speech signal so that the distortion is not audible. Owing to the masking characteristic of the human auditory sense, a greater distortion is allowed to occur in a frequency band having a lower energy. That is, since a frequency band having a lower energy is masked by an adjacent frequency band having a higher energy, a distortion component in the lower-energy band is not perceived. When the speech signal is compressed using this characteristic, the energy of the lower-energy frequency band is relatively reduced. This energy difference is one reason for the lowered performance of speech recognition. Accordingly, the weighting module 306 applies a relatively greater weight value to the DFT coefficient magnitude of a frequency band having a lower energy, where the relatively greater distortion occurs, in order to compensate for the reduced energy of that band. This compensation improves the performance of speech recognition. During the compensation, the weight applied to a given frequency is not fixed; it varies with the frequency distribution of the energy of the speech signal. To this end, the DFT coefficients are sorted in order of magnitude and a weight value is applied to the sorted DFT coefficients, which has the effect of applying a variable weight function in the frequency domain. In this way, the weighting module 306 applies a greater weight value to a smaller DFT coefficient. The weight function actually used varies according to the extent or range of the distortion compensation.

FIG. 4 shows a weight function according to an embodiment of the present invention. In the weight function, the 128 DFT coefficients obtained from the 256-point DFT are sorted in order of magnitude. The index is assigned so that the greatest DFT coefficient has index 128 and the least DFT coefficient has index 1. In detail, FIG. 4-1 shows the weight values applied when the speech recognition device is trained with an initially uttered speech signal 101. FIG. 4-2 shows the weight values applied when the speech recognition device is trained with the initially uttered speech signal 101 and speech signals 108 decoded by a plurality of speech coders.
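The sort, weight, and restore steps can be sketched in Python as follows. The rank convention follows the text (rank 1 for the least coefficient, rank 128 for the greatest). Everything else is an assumption for illustration: the actual weight values of FIG. 4 are not reproduced in the text, so `linear_rank_weights` is a purely hypothetical stand-in, and the text does not state whether the weight multiplies the dB magnitude, multiplies the linear magnitude, or is added as a gain; simple multiplication of the values passed in is assumed here.

```python
def linear_rank_weights(n, w_smallest=1.2, w_largest=1.0):
    """Hypothetical weight function: largest weight at rank 1, falling linearly."""
    if n == 1:
        return [w_largest]
    step = (w_largest - w_smallest) / (n - 1)
    return [w_smallest + step * r for r in range(n)]


def compensate(magnitudes, weights_by_rank):
    """Weight each coefficient by its magnitude rank and keep the frequency order.

    weights_by_rank[0] is applied to the smallest coefficient (rank 1) and
    weights_by_rank[-1] to the greatest (rank len(magnitudes)).
    """
    # Frequency indices sorted from smallest to greatest magnitude.
    order = sorted(range(len(magnitudes)), key=lambda i: magnitudes[i])
    weighted = [0.0] * len(magnitudes)
    for rank0, freq_index in enumerate(order):
        # Writing back at freq_index restores the original frequency order.
        weighted[freq_index] = magnitudes[freq_index] * weights_by_rank[rank0]
    return weighted
```

Because the weight depends on rank rather than on frequency, the same table of 128 weights produces a different frequency-domain weighting for every frame, which is the variable weighting the description refers to.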

An order restoring device 308 receives the weighted DFT coefficient magnitude 307 from the weighting module 306 and outputs a weighted DFT coefficient magnitude 309 which is re-sorted into the original frequency order. Accordingly, the DFT coefficient magnitude 309 is the result of applying to the DFT coefficient magnitude 303 a weight value that varies with the frequency distribution of the energy. The DFT coefficient magnitude 309 is inputted to a PLP parameter obtaining process 310, which is the prior-art process of obtaining a PLP parameter for speech recognition shown in FIG. 2, and a weighted PLP parameter 311 is finally outputted. Speech recognition is performed using the PLP parameter 311. The PLP parameter obtaining process 310 includes steps 202, 204, 206, 208, 210, 212, and 214. An embodiment of speech recognition using the weighted PLP parameter and the performance measured for it according to the present invention will now be described.

Industrial Applicability

A total of 39 parameters, consisting of 13th-order PLP parameters, 13th-order delta parameters, and 13th-order delta-delta parameters, are used for speech recognition. FIG. 5 shows a comparison of the performance of a Korean isolated word recognition system trained with the original input speech signal 101 shown in FIG. 1. FIG. 5-1 shows the recognition performance of a conventional speech recognition system. It is found that the recognition performance of the conventional speech recognition system deteriorates due to the distortion introduced by the EVRC and QCELP8k speech coders. FIG. 5-2 shows the recognition performance of the Korean isolated word recognition device according to an embodiment of the present invention. It is found that the recognition performance for the EVRC and QCELP8k speech coders is significantly increased. Although the same method is used for both the EVRC and QCELP8k speech coders, an enhancement in recognition performance is achieved for each. The EVRC and QCELP8k speech coders both operate based on CELP. Although their parameter quantization methods, codebook structures, and distortion characteristics differ, the speech recognition performance is improved by using the same distortion compensating method according to the present invention.
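The 39-parameter feature vector mentioned above can be assembled as in the following Python sketch. The text states only that 13 PLP, 13 delta, and 13 delta-delta parameters are used; the regression formula for the deltas, the window of two frames on each side, the edge handling by repeating the first and last frames, and all names are assumptions chosen because they are common practice, not details taken from the text.

```python
def deltas(features, window=2):
    """Regression-based time derivative of a list of per-frame coefficient lists."""
    num_frames = len(features)
    denom = 2.0 * sum(n * n for n in range(1, window + 1))
    result = []
    for t in range(num_frames):
        row = []
        for d in range(len(features[t])):
            acc = 0.0
            for n in range(1, window + 1):
                ahead = features[min(t + n, num_frames - 1)][d]   # repeat last frame
                behind = features[max(t - n, 0)][d]               # repeat first frame
                acc += n * (ahead - behind)
            row.append(acc / denom)
        result.append(row)
    return result


def stack_features(plp):
    """Concatenate static, delta, and delta-delta parameters for every frame."""
    d1 = deltas(plp)
    d2 = deltas(d1)
    return [static + first + second for static, first, second in zip(plp, d1, d2)]
```

With 13 PLP parameters per frame, `stack_features` returns 39 values per frame, which would then be passed to the HMM for training or recognition.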

FIG. 6 shows a comparison of the performance of one speech recognition system trained with the speech signal 101 shown in FIG. 1 and the speech signals 108 decoded by the EVRC and IS-96 QCELP8k speech coders. FIG. 6-1 shows the recognition performance of a conventional speech recognition system. Since this conventional speech recognition system is partially trained with a signal from each speech coder, its recognition rate in FIG. 6-1 is greater than that of the conventional speech recognition system of FIG. 5-1. FIG. 6-2 shows the recognition rate of a speech recognition system according to an embodiment of the present invention. The recognition rates for the EVRC and QCELP8k speech coders in FIG. 6-2 are greater than the corresponding rates in FIG. 6-1. That is, when one speech recognition system is trained with the output signals of a plurality of speech coders, the recognition rate is further improved by the speech recognition system according to the present invention.

Claims

1. A speech recognition apparatus and method which recognize a decoded signal by a speech coder in a digital communication network, wherein a speech recognition system is trained by using an initial input signal which has not passed through a speech coder, the speech recognition system is not trained by a speech signal passing through a speech coder and decoded by the speech coder when recognizing the speech signal passing through a speech coder and decoded by the speech coder.

2. The speech recognition apparatus and method according to claim 1, wherein a distortion component for the speech signal due to the speech coder is compensated to improve a speech recognition performance when recognizing the speech signal passing through a speech coder and decoded by the speech coder.

3. The speech recognition apparatus and method as claimed in claim 2, wherein the same distortion compensating method is applied to all speech coders in order to compensate for a distortion component in the speech signal passing a speech coder and decoded by the speech coder.

4. The speech recognition apparatus and method according to claim 2 or 3, wherein a weight value by frequency bands is applied when obtaining a feature parameter for a recognition from the speech signal in order to compensate for a distortion component in the speech signal passing through a speech coder and decoded by the speech coder.

5. The speech recognition apparatus and method according to claim 4, wherein different weight values are applied to a speech signal inputted every time when a weight value by frequency bands is applied.

6. The speech recognition apparatus and method according to claim 5, wherein different weight values are applied to a speech signal inputted every time according to a frequency distribution of an energy of the speech signal when a variable weight value by frequency bands is applied.

7. The speech recognition apparatus and method according to claim 5 or 6, wherein a frequency coefficient is obtained through a frequency transform of an input signal, all frequency coefficients are sorted in an order of magnitude, and a predetermined weight value is applied to all the frequency coefficients which are sorted in the order of magnitude.

8. The speech recognition apparatus and method according to claim 6 or 7, wherein a great weight value is applied to a frequency band of a low energy when the weight value by frequency bands is applied.

9. The speech recognition apparatus and method according to claim 6 or 7, wherein a weight value function of FIG. 4-1 is used when a predetermined weight value is applied to all the frequency coefficients which are sorted in the order of magnitude.

10. A speech recognition apparatus and method which recognizes a decoded signal by a predetermined speech coder in a digital communication network, wherein one speech recognition device is trained by using all decoded signals for all speech coders used in the digital communication network when recognizing the speech signal passing through a speech coder and decoded by the speech coder.

11. The speech recognition apparatus and method according to claim 10, wherein a distortion component for the speech signal due to the speech coder is compensated to improve a speech recognition performance when recognizing the speech signal passing through a speech coder and decoded by the speech coder.

12. The speech recognition apparatus and method according to claim 11, wherein the same distortion compensating method is applied to all speech coders in order to compensate for a distortion component in the speech signal passing a speech coder and decoded by the speech coder.

13. The speech recognition apparatus and method according to claim 11 or 12, wherein a weight value by frequency bands is applied when obtaining a feature parameter for a recognition from the speech signal in order to compensate for a distortion component in the speech signal passing through a speech coder and decoded by the speech coder.

14. The speech recognition apparatus and method according to claim 13, wherein different weight values are applied to a speech signal inputted every time when a weight value by frequency bands is applied.

15. The speech recognition apparatus and method according to claim 14, wherein different weight values are applied to a speech signal inputted every time according to a frequency distribution of an energy of the speech signal when a variable weight value by frequency bands is applied.

16. The speech recognition apparatus and method according to claim 14 or 15, wherein a frequency coefficient is obtained through a frequency transform of an input signal, all frequency coefficients are sorted in the order of magnitude, and a predetermined weight value is applied to all the frequency coefficients which are sorted in the order of magnitude.

17. The speech recognition apparatus and method according to claim 15 or 16, wherein a great weight value is applied to a frequency band of a low energy when the weight value by frequency bands is applied.

18. The speech recognition apparatus and method according to claim 16, wherein a weight value function of FIG. 4-2 is used when a predetermined weight value is applied to all the frequency coefficients which are sorted in the order of magnitude.