HK1151884A - System and method for Cantonese speech recognition using an optimized phone set
- Publication number
- HK1151884A
- Authority
- HK
- Hong Kong
- Prior art keywords
- phone
- optimized
- phone set
- phones
- syllable
Description
This application is a divisional of Chinese patent application No. 200410008562.7, filed March 24, 2004, which relates to a system and method for Cantonese speech recognition using an optimized phone set.
Technical Field
The present invention relates generally to electronic speech recognition systems, and more particularly to a system and method for Cantonese speech recognition using an optimized phone set.
Background
Implementing robust and efficient human-machine communication between system users and electronic devices is a significant consideration for system designers and manufacturers. Voice-controlled operation of electronic devices is a desirable interface for many system users. For example, voice-controlled operation allows a user to perform other tasks simultaneously: a person may operate an electronic organizer by voice control while driving a vehicle. Hands-free operation of electronic systems is also desirable for users who have physical impairments or other special requirements.
Hands-free operation of electronic devices may be accomplished through a variety of voice-activated electronic systems. Voice-activated electronic systems thus advantageously allow a user to communicate with the electronic device in situations where it is inconvenient or potentially dangerous to use conventional input devices. Electronic entertainment systems may also utilize speech recognition technology to allow users to interact with a system by speaking into it.
However, effectively implementing such a system can be a significant challenge for system designers. For example, further demands to increase the functionality and performance of the system may require greater processing power and additional hardware resources. An increase in processing or hardware requirements brings corresponding drawbacks in the form of increased production costs and operational inefficiencies.
In addition, enhancing the system's ability to perform various advanced operations may provide additional benefits to the system user, but may also place increased demands on the control and management of various system components. For example, an enhanced electronic system that effectively recognizes words and phrases in Cantonese would benefit from an efficient implementation because of the large amount and complexity of the digital data involved. For all of the foregoing reasons, implementing robust and efficient methods for system users to interface with electronic devices remains an important consideration for system designers and manufacturers.
Disclosure of Invention
In accordance with the present invention, a system and method for implementing a Cantonese speech recognizer with an optimized phone set is disclosed. In one embodiment, the recognizer may be configured to compare input speech data to phone strings from a vocabulary dictionary implemented according to an optimized Cantonese phone set. The optimized Cantonese phone set may be implemented with a sub-syllabic phonetic technique to separately include consonantal phones and vocalic phones. For reasons of system efficiency, the optimized Cantonese phone set is preferably implemented in a compact manner to include only the minimum number of consonantal phones and vocalic phones required to accurately represent Cantonese speech during the speech recognition process.
In some embodiments, the optimized Cantonese phone set may include the following consonantal phones: b, d, g, p, t, k, m, n, ng, f, l, h, z, c, s, w and j. In addition, the optimized Cantonese phone set may also include the following vocalic phones: aa, i, u, e, o, yu, oe, eo, a, eu, aai, aau, ai, au, ei, oi, ou, eoi, ui, and iu. In various embodiments, the optimized Cantonese phone set may also include a closure phone "cl" and a silence phone "sil". Because a relatively small number of phones is used, the optimized Cantonese phone set provides an efficient and compact representation of phones for accurately recognizing Cantonese speech.
In some embodiments, the optimized Cantonese phone set may advantageously represent each diphthong with a single unified diphthong phone. For example, the optimized Cantonese phone set may include the following unified diphthong phones: eu, aai, aau, ai, au, ei, oi, ou, eoi, ui, and iu. Furthermore, in Cantonese, lip rounding typically occurs with the "g" sound or the "k" sound. In some embodiments, the optimized Cantonese phone set may effectively represent lip rounding by utilizing the separate phone "w", which is already present in the Cantonese phone set.
Further, in Cantonese, stops are primarily associated with the sounds corresponding to "b", "d", "g", "p", "t", and "k". According to the present invention, the optimized Cantonese phone set can advantageously represent "b", "d", "g", "p", "t", and "k" using two different techniques, depending on the context of the corresponding sound in the phrase. In a syllable-initial context, where the stop is at the beginning of a syllable, the optimized Cantonese phone set may represent both the consonant and the preceding closure with the appropriate consonant phone ("b", "d", "g", "p", "t", or "k").
Furthermore, in a syllable-final/mid-phrase context, where the stop is located at the end of a word in the middle of a phrase, the optimized Cantonese phone set can represent the consonant and the preceding closure with the appropriate phone ("p", "t", or "k"). In a syllable-final/phrase-end context, where the stop is located at the end of a word at the end of a phrase, the optimized Cantonese phone set can instead utilize the same closure phone "cl" to represent any of "p", "t", or "k" as a closure only, without any subsequently released consonant. The present invention thus provides an efficient system and method for implementing a Cantonese speech recognizer with an optimized phone set.
Drawings
FIG. 1 is a block diagram for one embodiment of a computer system, according to the present invention;
FIG. 2 is a block diagram for one embodiment of the memory of FIG. 1, in accordance with the present invention;
FIG. 3 is a block diagram for one embodiment of the speech detector of FIG. 2, in accordance with the present invention;
FIG. 4 is a diagram illustrating one embodiment of the Hidden Markov Models (HMMs) of FIG. 2, in accordance with the present invention;
FIG. 5 is a diagram illustrating one embodiment of the dictionary of FIG. 2, in accordance with the present invention;
FIG. 6 is a diagram illustrating an optimized Cantonese phone set according to an embodiment of the present invention;
FIG. 7 is a diagram illustrating a technique for processing diphthongs according to one embodiment of the present invention;
FIG. 8 is a diagram illustrating a technique for processing lip rounding according to one embodiment of the present invention;
FIG. 9 is a diagram illustrating a technique for handling stop consonants according to one embodiment of the present invention.
Detailed Description
The present invention relates to improvements in speech recognition systems. The following description is presented to enable one of ordinary skill in the art to make and use the invention and is provided in the context of a patent application and its requirements. Various modifications to the preferred embodiment will be readily apparent to those skilled in the art, and the generic principles herein may be applied to other embodiments. Thus, the present invention is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features described herein.
The present invention includes systems and methods for implementing a Cantonese speech recognizer with an optimized phone set, which may include a recognizer configured to compare input speech data to phone strings from a dictionary implemented according to the optimized Cantonese phone set. The optimized Cantonese phone set may be implemented with a sub-syllabic phonetic technique to separately include consonantal phones and vocalic phones. For reasons of system efficiency, the optimized Cantonese phone set is preferably implemented in a compact manner to include only the minimum number of consonantal phones and vocalic phones required to accurately represent Cantonese speech during speech recognition.
Referring now to FIG. 1, a block diagram for one embodiment of a computer system 110 is shown, according to the present invention. The FIG. 1 embodiment includes a sound sensor 112, an amplifier 116, an analog-to-digital converter 120, a central processing unit (CPU) 128, a memory 130, and an input/output interface 132. In alternate embodiments, computer system 110 may readily include various other elements or functions in addition to, or instead of, those elements or functions discussed in conjunction with the FIG. 1 embodiment.
Sound sensor 112 detects sound energy and converts the detected sound energy into an analog speech signal, which is provided over line 114 to amplifier 116. Amplifier 116 amplifies the received analog speech signal and provides the amplified analog speech signal to analog-to-digital converter 120 via line 118. Analog-to-digital converter 120 then converts the amplified analog speech signal into corresponding digital speech data and provides the digital speech data over line 122 to system bus 124.
CPU 128 then accesses the digital speech data on system bus 124 and analyzes and processes the digital speech data to perform speech detection in accordance with software instructions contained in memory 130. The operation of CPU 128 and the software instructions in memory 130 are further discussed below in conjunction with FIGS. 2-7. After processing the speech data, CPU 128 provides the results of the speech detection analysis to other devices (not shown) via input/output interface 132. In alternate embodiments, the present invention may readily be implemented in a variety of devices other than the computer system 110 shown in FIG. 1.
Referring now to FIG. 2, a block diagram for one embodiment of the memory 130 of FIG. 1 is shown, in accordance with the present invention. Memory 130 may alternatively comprise a variety of storage device configurations, including random access memory (RAM) and storage devices such as floppy disks or hard disk drives. In the FIG. 2 embodiment, memory 130 includes, but is not limited to, a speech detector 210, Hidden Markov Models (HMMs) 212, a vocabulary dictionary 214, and a language model 216. In alternate embodiments, memory 130 may readily include various other elements or functions in addition to, or instead of, those elements or functions discussed in conjunction with the FIG. 2 embodiment.
In the FIG. 2 embodiment, speech detector 210 includes a series of software modules that are executed by CPU 128 to analyze and recognize speech data, and which are further described below with reference to FIG. 3. In alternate embodiments, speech detector 210 may readily be implemented using various other software and/or hardware configurations. HMMs 212 and dictionary 214 may be used by speech detector 210 to implement the speech recognition functionality of the present invention. One embodiment of HMMs 212 is discussed further below in conjunction with FIG. 4, and one embodiment of dictionary 214 is discussed further below in conjunction with FIG. 5. Language model 216 may include a word sequence or "grammar" model that predicts the next word from a previous word.
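As an illustration of a model that predicts the next word from a previous word, the following is a minimal sketch of a bigram count model. It is an assumption for illustration only: the source does not say how language model 216 is built, and the names `train_bigrams` and `predict_next` and the English command words are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional


def train_bigrams(sentences: List[List[str]]) -> Dict[str, Counter]:
    """Count how often each word follows each previous word."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for words in sentences:
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts: Dict[str, Counter], prev: str) -> Optional[str]:
    """Return the most frequent follower of `prev`, or None if unseen."""
    if prev not in counts:
        return None
    return counts[prev].most_common(1)[0][0]


# Hypothetical training phrases (placeholders, not taken from the source).
model = train_bigrams([
    ["play", "music"],
    ["play", "music"],
    ["play", "video"],
    ["stop", "music"],
])
next_word = predict_next(model, "play")
```

A recognizer could use such counts to favor word sequences that are more likely in the application's command vocabulary.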
Referring now to FIG. 3, a block diagram for one embodiment of the FIG. 2 speech detector 210 is shown, according to the present invention. The speech detector 210 includes, but is not limited to, a feature extractor 310, an endpoint detector 312, and a recognizer 314. In alternate embodiments, the speech detector 210 may readily include various other elements or functions in addition to, or instead of, those elements or functions discussed in conjunction with the FIG. 3 embodiment.
In the FIG. 3 embodiment, analog-to-digital converter 120 (FIG. 1) provides digital speech data to feature extractor 310 via system bus 124. Feature extractor 310 responsively generates feature vectors, which are provided to recognizer 314 via path 320. Feature extractor 310 also generates speech energy, which is provided to endpoint detector 312 via path 322. Endpoint detector 312 analyzes the speech energy and responsively determines the endpoints of the utterance represented by the speech energy. The endpoints indicate the beginning and end of the utterance in time. Endpoint detector 312 then provides the endpoints to recognizer 314 via path 324.
Recognizer 314 is preferably configured to recognize words in a predetermined vocabulary represented in dictionary 214 (FIG. 2). The vocabulary words in dictionary 214 may correspond to any desired commands, instructions, or other communications for computer system 110. The recognized vocabulary words or instructions are then output to system 110 via path 332.
In practice, each word from dictionary 214 may be associated with a corresponding phone string (a string of individual phones) that represents the word. Hidden Markov Models (HMMs) 212 (FIG. 2) may include trained stochastic representations for each phone from a predetermined phone set that may be effectively used to represent the words in dictionary 214. Recognizer 314 then compares the input feature vectors from path 320 to the appropriate HMMs 212 for each phone string from dictionary 214 to determine which word produces the highest recognition score. The word corresponding to the highest recognition score is thereby identified as the recognized word.
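The role of the endpoint detector can be illustrated with a minimal sketch that marks the first and last frames whose energy exceeds a fixed threshold. This is an assumption for illustration: the source does not describe how endpoint detector 312 analyzes the speech energy, and `find_endpoints` and `threshold` are hypothetical names.

```python
from typing import List, Optional, Tuple


def find_endpoints(frame_energy: List[float],
                   threshold: float) -> Optional[Tuple[int, int]]:
    """Return (first, last) indices of frames whose energy exceeds threshold.

    Assumption: a fixed energy threshold separates speech from background.
    Returns None when no frame exceeds the threshold.
    """
    active = [i for i, e in enumerate(frame_energy) if e > threshold]
    if not active:
        return None
    return active[0], active[-1]


# Hypothetical per-frame energies: quiet, then an utterance, then quiet.
energies = [0.1, 0.2, 3.5, 4.0, 2.8, 0.3, 0.1]
endpoints = find_endpoints(energies, threshold=1.0)
```

Practical detectors usually add smoothing and minimum-duration rules so that brief noise bursts are not mistaken for speech.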
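The selection of the highest-scoring word can be sketched as follows. This is a toy illustration under stated assumptions, not the patented implementation: each trained HMM is replaced by a single hypothetical mean value per phone, feature vectors are reduced to single numbers, and the score is the best left-to-right alignment of frames to the word's phone string. All names and values (`PHONE_MEANS`, `DICTIONARY`, `align_score`, `recognize`) are hypothetical.

```python
import math
from typing import Dict, List

# Toy stand-ins for trained HMMs 212: one hypothetical mean per phone.
PHONE_MEANS: Dict[str, float] = {"s": 1.0, "ei": 4.0, "g": 2.0, "w": 3.0, "o": 5.0}

# Toy dictionary 214: each word maps to its phone string.
DICTIONARY: Dict[str, List[str]] = {"sei": ["s", "ei"], "gwo": ["g", "w", "o"]}


def frame_score(phone: str, x: float) -> float:
    """Score one frame against one phone (negative squared distance)."""
    return -((x - PHONE_MEANS[phone]) ** 2)


def align_score(phones: List[str], frames: List[float]) -> float:
    """Best score over monotonic alignments of frames to the phone string.

    Each frame either stays in the current phone or advances to the next;
    the alignment must start in the first phone and end in the last.
    """
    n = len(phones)
    best = [-math.inf] * n
    best[0] = frame_score(phones[0], frames[0])
    for x in frames[1:]:
        new = [-math.inf] * n
        for j in range(n):
            prev = max(best[j], best[j - 1] if j > 0 else -math.inf)
            if prev > -math.inf:
                new[j] = prev + frame_score(phones[j], x)
        best = new
    return best[-1]


def recognize(frames: List[float]) -> str:
    """Return the dictionary word with the highest alignment score."""
    return max(DICTIONARY, key=lambda w: align_score(DICTIONARY[w], frames))
```

For example, `recognize([1.1, 0.9, 4.2, 3.9])` selects "sei" because those frames lie close to the toy means for "s" and "ei".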
Referring now to FIG. 4, a block diagram for one embodiment of the HMMs 212 of FIG. 2 is shown, according to the present invention. In the FIG. 4 embodiment, HMMs 212 preferably include model 1 (412(a)) through model N (412(c)). In alternate embodiments, HMMs 212 may readily include various other elements or functions in addition to, or instead of, those elements or functions discussed in conjunction with the FIG. 4 embodiment.
In the FIG. 4 embodiment, HMMs 212 may readily be implemented to include any desired number of models 412, which may include any desired type of information. In the FIG. 4 embodiment, each model 412 from HMMs 212 may correspond to a different particular phone from the predetermined phone set used by recognizer 314 (FIG. 3). One embodiment of an optimized Cantonese phone set is discussed further below in conjunction with FIGS. 6-9.
Referring now to FIG. 5, a block diagram of dictionary 214 of FIG. 2 is shown, in accordance with one embodiment of the present invention. In the FIG. 5 embodiment, dictionary 214 preferably includes word 1 (512(a)) through word N (512(c)). In alternate embodiments, dictionary 214 may readily include various other elements or functions in addition to, or instead of, those elements or functions discussed in conjunction with the FIG. 5 embodiment.
In the FIG. 5 embodiment, dictionary 214 is readily implemented to include any desired number of words 512, which may include any type of information. In the FIG. 5 embodiment, each word 512 from dictionary 214 may also include a corresponding phone string of individual phones from a predetermined phone set, as discussed above with reference to FIG. 3. The individual phones of the aforementioned phone strings preferably form a sequential representation of the pronunciation of the corresponding word in dictionary 214. One embodiment of an optimized Cantonese phone set is discussed further below in conjunction with FIGS. 6-9.
Referring now to FIG. 6, shown is a diagram of an optimized Cantonese phone set in accordance with one embodiment of the present invention. In alternate embodiments, the present invention may readily perform speech recognition using various other elements or functions in addition to, or instead of, those elements or functions discussed in conjunction with the FIG. 6 embodiment.
In the FIG. 6 embodiment, phone set 610 includes 39 individual phones: 17 consonantal phones plus a closure phone "cl" and a silence phone "sil" (all shown on the left side of FIG. 6), and 20 vocalic phones, including a set of diphthongs (all shown on the right side of FIG. 6). In the FIG. 6 embodiment, phone set 610 is implemented to represent speech in the Cantonese language of southern China.
Since Cantonese is usually written in Chinese characters rather than Roman letters, the phone set 610 of FIG. 6 (except for the closure phone "cl" and the silence phone "sil") is represented using the Cantonese romanization scheme known as "jyutping", developed by the Linguistic Society of Hong Kong (LSHK). Further information on "jyutping" and the Linguistic Society of Hong Kong can be found on the World Wide Web at cptt91.cityu.edu.hk/lshk. In alternate embodiments, the present invention may utilize optimized Cantonese phone sets represented in various other romanization schemes.
In the FIG. 6 embodiment, phone set 610 includes the following consonantal phones: b, d, g, p, t, k, m, n, ng, f, l, h, z, c, s, w and j. Furthermore, phone set 610 may also include the following vocalic phones: aa, i, u, e, o, yu, oe, eo, a, eu, aai, aau, ai, au, ei, oi, ou, eoi, ui, and iu. In the FIG. 6 embodiment, phone set 610 may also include a closure phone "cl" and a silence phone "sil". Because a relatively small number of phones is used, phone set 610 provides an efficient and compact representation of phones for accurately recognizing Cantonese speech.
The reduced number of individual phones in phone set 610 conserves processing resources and memory in computer system 110. Furthermore, the reduced total number of phones substantially lowers the burden associated with training Hidden Markov Models (HMMs) 212. However, in various alternative embodiments, the present invention may be implemented to include phones that are additional to, or different from, those shown in the FIG. 6 embodiment.
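The phone inventory listed above can be written down directly as data. The sketch below uses hypothetical names (`CONSONANTS`, `VOWELS`, `SPECIAL`, `unknown_phones`), confirms the stated total of 39 phones, and shows one possible use: checking that a dictionary entry's phone string contains only phones from the set.

```python
from typing import List

# Phones exactly as listed for phone set 610 in the FIG. 6 embodiment.
CONSONANTS = "b d g p t k m n ng f l h z c s w j".split()
VOWELS = "aa i u e o yu oe eo a eu aai aau ai au ei oi ou eoi ui iu".split()
SPECIAL = ["cl", "sil"]  # closure phone and silence phone

PHONE_SET = frozenset(CONSONANTS + VOWELS + SPECIAL)


def unknown_phones(phone_string: List[str]) -> List[str]:
    """Return the phones in a phone string that are not in PHONE_SET."""
    return [p for p in phone_string if p not in PHONE_SET]


total = len(PHONE_SET)  # 17 consonantal + 20 vocalic + 2 special phones
```

A check of this kind rejects entries written with phones from a different inventory, such as a combined "gw" phone.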
Conventional Chinese speech recognition systems typically utilize a phone set implemented with a sub-syllabic approach in which syllables are represented as rimes or half-syllables. In contrast, the optimized Cantonese phone set 610 of the present invention advantageously utilizes a sub-syllabic phonetic technique in which syllables are further divided into sub-units represented by appropriate combinations of consonantal phones and vocalic phones, providing greater granularity in the speech representation process. Furthermore, phone set 610 represents the various sounds of Cantonese without utilizing corresponding tonal information as part of the different phones. In addition to providing greater flexibility, the foregoing phonetic technique has the additional advantage of requiring fewer total phones in phone set 610.
Phone set 610 of FIG. 6 may be organized into various linguistic categories based on characteristics of the corresponding phones. For purposes of illustration, one such organization is given below in Table 1, in which the categories on the left correspond to the phones of phone set 610 on the right. In alternate embodiments, phone set 610 may also be organized in various ways other than that shown in Table 1.
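Because the phones carry no tonal information, syllables that differ only in tone map to the same phone string. The sketch below assumes that input syllables are written in jyutping with a trailing tone digit from 1 to 6; the source does not describe the input format, and `strip_tone` is a hypothetical helper.

```python
import re

# Jyutping writes the tone as a single digit 1-6 at the end of a syllable.
_TONE_DIGIT = re.compile(r"[1-6]$")


def strip_tone(syllable: str) -> str:
    """Remove a trailing jyutping tone digit, if present."""
    return _TONE_DIGIT.sub("", syllable)


toneless = strip_tone("sei3")
```

With the tone removed, "si1" and "si6" both reduce to "si" and therefore share one phone string.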
Table 1:
Referring now to FIG. 7, a diagram 710 illustrating a technique for processing diphthongs is shown, in accordance with one embodiment of the present invention. In alternate embodiments, the present invention may readily process diphthongs using various other techniques or functions in addition to, or instead of, those techniques or functions discussed in conjunction with the FIG. 7 embodiment.
In the FIG. 7 embodiment, each diphthong (a sequence of vowel sounds within a single syllable) is advantageously represented in the optimized Cantonese phone set 610 (FIG. 6) by a single unified phone. For example, in the FIG. 7 embodiment, phone set 610 may include the following unified diphthong phones: eu, aai, aau, ai, au, ei, oi, ou, eoi, ui, and iu. The present invention can effectively utilize unified diphthong phones to conserve processing and memory resources. Further, because the vowel sounds within Cantonese diphthongs follow one another relatively rapidly, representing each diphthong as a unified phone can prevent various problems in the speech recognition process.
For purposes of illustration, in the example of FIG. 7, block 714 includes an exemplary Cantonese word "sei". In block 716, the word "sei" is represented, as in conventional practice, by three separate units: "s", "e", and "i". In accordance with the present invention, in block 718, the word "sei" is effectively represented by only two phones from phone set 610 ("s" and "ei"). Any type of Cantonese diphthong (or other diphthong) may be represented using unified phones according to the present invention, as shown in the example of FIG. 7.
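One way to obtain unified diphthong phones from a romanized syllable is greedy longest-match segmentation against the phone inventory. The following sketch is an assumption for illustration (the source does not say how phone strings are produced), and `segment` is a hypothetical helper operating on a toneless jyutping syllable.

```python
from typing import List

CONSONANTS = "b d g p t k m n ng f l h z c s w j".split()
MONOPHTHONGS = "aa i u e o yu oe eo a".split()
DIPHTHONGS = "eu aai aau ai au ei oi ou eoi ui iu".split()

PHONES = set(CONSONANTS + MONOPHTHONGS + DIPHTHONGS)
MAX_LEN = max(len(p) for p in PHONES)  # longest phone spelling (3 letters)


def segment(syllable: str) -> List[str]:
    """Split a toneless syllable into phones, preferring the longest match."""
    phones: List[str] = []
    i = 0
    while i < len(syllable):
        # Try the longest candidate first so "ei" wins over "e" + "i".
        for size in range(min(MAX_LEN, len(syllable) - i), 0, -1):
            piece = syllable[i:i + size]
            if piece in PHONES:
                phones.append(piece)
                i += size
                break
        else:
            raise ValueError(f"cannot segment {syllable!r} at position {i}")
    return phones


sei_phones = segment("sei")
```

Preferring the longest match is what keeps "ei" and "aau" as single phones instead of splitting them into their component vowels.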
Referring now to FIG. 8, a diagram 810 illustrating a technique for processing lip rounding is shown, in accordance with one embodiment of the present invention. In alternate embodiments, the present invention may readily handle lip rounding using various other techniques or functions in addition to, or instead of, those techniques or functions discussed in conjunction with the FIG. 8 embodiment.
Lip rounding may include producing a "w" sound after certain consonants. In Cantonese, lip rounding typically occurs with the "g" sound or the "k" sound. Conventional phone sets typically include both a plain "g" phone and a separate "gw" phone (the lip-rounded variant). In the FIG. 8 embodiment, optimized Cantonese phone set 610 (FIG. 6) instead represents lip rounding with the separate phone "w" following the consonant. The present invention effectively utilizes the separate phone "w" for lip rounding to provide greater accuracy in the speech recognition process.
Furthermore, because the phone "w" is already present in phone set 610, this technique requires no additional processing or memory resources. Rather than representing lip-rounded consonants as separate phones, the technique treats lip rounding as close enough to the "w" phone to justify merging the two.
For purposes of illustration, in the example of FIG. 8, block 814 includes an exemplary Cantonese word "gwo". In block 816, the word "gwo" is represented, as in conventional practice, by two separate units: "gw" and "o". In accordance with the present invention, in block 818, the word "gwo" is accurately represented by three phones from phone set 610 ("g", "w", and "o"). Any type of Cantonese lip rounding (or other type of lip rounding) may be represented using the separate phone "w" in accordance with the present invention, as shown in the example of FIG. 8.
Referring now to FIG. 9, a diagram 910 illustrating a technique for handling stop consonants is shown, in accordance with one embodiment of the present invention. In alternate embodiments, the present invention may readily handle stop consonants using various other techniques or functions in addition to, or instead of, those techniques or functions discussed in conjunction with the FIG. 9 embodiment.
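The FIG. 8 example can be sketched as a conversion from a conventional phone string, which uses combined lip-rounded phones, to the optimized representation. The mapping table and the function name `to_optimized` are hypothetical; the "kw" entry is included by analogy with "gw", since the text states that lip rounding also occurs with "k".

```python
from typing import Dict, List

# Conventional lip-rounded phones and their optimized two-phone equivalents.
LIP_ROUNDED: Dict[str, List[str]] = {
    "gw": ["g", "w"],
    "kw": ["k", "w"],  # assumed by analogy with "gw"
}


def to_optimized(conventional: List[str]) -> List[str]:
    """Replace each combined lip-rounded phone with consonant + "w"."""
    optimized: List[str] = []
    for phone in conventional:
        optimized.extend(LIP_ROUNDED.get(phone, [phone]))
    return optimized


gwo_phones = to_optimized(["gw", "o"])
```

Phones without lip rounding pass through unchanged, so the same function can be applied to every entry in a dictionary.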
In conventional phonetic practice, a stop consonant is typically modeled as an initial closure of the mouth, a build-up of air pressure, and then a release of that pressure in the form of a particular consonant. In Cantonese, stops are primarily associated with the sounds corresponding to "b", "d", "g", "p", "t", and "k". In the FIG. 9 embodiment, optimized Cantonese phone set 610 (FIG. 6) advantageously utilizes two different techniques to represent "b", "d", "g", "p", "t", and "k", depending on the context of the corresponding sound in the phrase.
In the FIG. 9 embodiment, block 914 shows a syllable-initial context in which the stop is located at the beginning of a syllable. As shown in diagram 910, phone set 610 may utilize the appropriate consonantal phone ("b", "d", "g", "p", "t", or "k") in the syllable-initial context to represent both the consonant and the preceding closure. Block 916 shows a syllable-final/mid-phrase context in which the stop is located at the end of a word in the middle of a phrase. In that context, phone set 610 may represent the consonant and the preceding closure with the appropriate phone ("p", "t", or "k"). Block 918 shows a syllable-final/phrase-end context in which the stop is located at the end of a word at the end of a phrase. In that context, phone set 610 may utilize the same closure phone "cl" to represent any of "p", "t", or "k" as a closure only, without any subsequently released consonant.
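The three contexts of FIG. 9 can be sketched as a rule applied when a phrase is converted into a phone string. This is a minimal illustration under stated assumptions: a phrase is given as a list of syllables, each already split into phones, and `phrase_to_phones` is a hypothetical helper rather than a function described in the source.

```python
from typing import List

FINAL_STOPS = {"p", "t", "k"}


def phrase_to_phones(syllables: List[List[str]]) -> List[str]:
    """Flatten a phrase into phones, applying the phrase-end stop rule.

    Syllable-initial and mid-phrase stops keep their own phone. A stop at
    the end of the last syllable of the phrase is replaced by "cl".
    """
    phones: List[str] = []
    last = len(syllables) - 1
    for index, syllable in enumerate(syllables):
        current = list(syllable)
        # A final stop is a stop that closes a syllable of two or more phones.
        if index == last and len(current) > 1 and current[-1] in FINAL_STOPS:
            current[-1] = "cl"
        phones.extend(current)
    return phones


# Example syllables "sap" and "baat": the mid-phrase "p" is kept,
# while the phrase-final "t" becomes the closure phone.
phrase = phrase_to_phones([["s", "a", "p"], ["b", "aa", "t"]])
```

Because "p", "t", and "k" all map to the same "cl" phone at the end of a phrase, no additional phones are needed to model unreleased final stops.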
The invention has been explained with reference to preferred embodiments. Other embodiments will be apparent to those skilled in the art in light of this disclosure. For example, the present invention is readily implemented using structures and techniques other than those described in the preferred embodiments above. Furthermore, the present invention may also be effectively used with systems other than those described above as preferred embodiments. These and other variations of the preferred embodiments are therefore intended to be covered by the present invention, which is limited only by the appended claims.
Claims (37)
1. A system for performing a speech recognition procedure, comprising:
a recognizer configured to compare input speech data with phone strings from a vocabulary dictionary implemented according to an optimized phone set, the optimized phone set being implemented with phonetic techniques to separately provide consonantal phones and vocalic phones, and being implemented in a compact manner to include a minimum required number of the consonantal phones and the vocalic phones; and
a processor configured to control the recognizer to thereby perform the speech recognition process,
wherein the optimized phone set represents various sounds of a tonal language without utilizing corresponding tonal information as part of different phones in the optimized phone set.
2. The system of claim 1, wherein the recognizer and the processor are implemented as part of a consumer electronics device.
3. The system of claim 1 wherein said optimized phone set conserves processing resources and memory resources while performing said speech recognition process.
4. The system of claim 1 wherein said optimized phone set reduces training requirements for performing a recognizer training process for initially implementing said recognizer.
5. The system of claim 1 wherein each of said phone strings comprises a different phone series from said optimized phone set, each of said phone strings corresponding to a different word from said vocabulary dictionary.
6. The system of claim 5 wherein said recognizer compares said input speech data to hidden Markov models of said phone strings from said vocabulary dictionary to thereby select recognized words during said speech recognition process.
7. The system of claim 1 wherein said optimized phone set includes phones for b, d, g, p, t, k, m, n, ng, f, l, h, z, c, s, w, j, cl, sil, aa, i, u, e, o, yu, oe, eo, a, eu, aai, aau, ai, au, ei, oi, ou, eoi, ui and iu.
8. The system of claim 1 wherein said optimized phone set includes consonantal phones b, d, g, p, t, k, m, n, ng, f, l, h, z, c, s, w, and j.
9. The system of claim 1 wherein said optimized phone set includes a closure phone "cl" and a silence phone "sil".
10. The system of claim 1 wherein said optimized phone set includes vowel phones aa, i, u, e, o, yu, oe, eo, a, eu, aai, aau, ai, au, ei, oi, ou, eoi, ui, and iu.
11. The system of claim 1 wherein said optimized phone set represents certain diphthongs using unified diphthongs to thereby conserve processing and memory resources while providing more accurate characterization to said speech recognition process.
12. The system of claim 11 wherein said optimized phone set includes unified diphthong phones eu, aai, aau, ai, au, ei, oi, ou, eoi, ui and iu.
13. The system of claim 1 wherein said optimized phone set represents a lip rounding by using a separate lip rounding phone "w" after the consonantal phone "g".
14. The system of claim 1 wherein said optimized phone set represents a lip rounding by using a separate lip rounding phone "w" after the consonantal phone "k".
15. The system of claim 1 wherein said input speech data includes a syllable initial environment in which a stop is located at the beginning of a syllable, said optimized phone set responsively representing corresponding consonants and preceding closure utilizing the appropriate consonant phone "p", "t", or "k" in said syllable initial environment.
16. The system of claim 1 wherein said input speech data includes a syllable final/phrase intermediate environment where a stop is located at the end of a word in the phrase intermediate, said optimized phone set responsively representing corresponding consonants and preceding closure using appropriate consonant phones "p", "t", or "k" in said syllable final/phrase intermediate environment.
17. The system of claim 1 wherein said input speech data includes a syllable-final/end-of-phrase environment in which a stop is located at the end of a word at the end of a phrase, said optimized phone set responsively representing any of the "p", "t", or "k" consonants as only closure sounds with the same closure phone "cl" in said syllable-final/end-of-phrase environment without any subsequently released consonant sounds.
18. The system of claim 1 wherein said input speech data includes a syllable initial environment in which a first stop is located at the beginning of a syllable, a syllable final/phrase intermediate environment in which a second stop is located at the end of a first word in the middle of a phrase, and a syllable final/phrase end environment in which a third stop is located at the end of a second word at the end of said phrase, said optimized phone set representing corresponding consonants and preceding closure using appropriate consonant phones "b", "d", "g", "p", "t", or "k" in said syllable initial environment, said optimized phone set responsively representing said corresponding consonants and said preceding closure using said appropriate consonant phones "p", "t", or "k" in said syllable final/phrase intermediate environment, and said optimized phone set responsively using the same closure phone "cl" in said syllable final/phrase end environment to represent any of "p", "t" or "k" as a closure only, without any subsequently released consonant sound.
19. A method for performing a speech recognition procedure, comprising the steps of:
configuring a recognizer to compare input speech data with phone strings from a vocabulary dictionary implemented according to an optimized phone set, the optimized phone set being implemented with phonetic techniques to separately provide consonantal phones and vocalic phones, and being implemented in a compact manner to include a minimum required number of the consonantal phones and the vocalic phones; and
controlling, with a processor, the recognizer to thereby perform the speech recognition procedure,
wherein the optimized phone set represents various sounds of a tonal language without utilizing corresponding tonal information as part of different phones in the optimized phone set.
20. The method of claim 19, wherein the recognizer and the processor are implemented as part of a consumer electronics device.
21. The method of claim 19 wherein said optimized phone set conserves processing resources and memory resources while performing said speech recognition process.
22. The method of claim 19 wherein said optimized phone set reduces training requirements for performing a recognizer training process for initially implementing said recognizer.
23. The method of claim 19 wherein each of said phone strings comprises a different series of phones from said optimized phone set, each of said phone strings corresponding to a different word from said vocabulary dictionary.
24. The method of claim 23 wherein said recognizer compares said input speech data to hidden Markov models for said phone strings from said vocabulary dictionary to thereby select recognized words during said speech recognition process.
25. The method of claim 19 wherein said optimized phone set includes phones for b, d, g, p, t, k, m, n, ng, f, l, h, z, c, s, w, j, cl, sil, aa, i, u, e, o, yu, oe, eo, a, eu, aai, aau, ai, au, ei, oi, ou, eoi, ui and iu.
26. The method of claim 19 wherein said optimized phone set includes consonantal phones b, d, g, p, t, k, m, n, ng, f, l, h, z, c, s, w, and j.
27. The method of claim 19 wherein said optimized phone set includes a closure phone "cl" and a silence phone "sil".
28. The method of claim 19 wherein said optimized phone set includes vowel phones aa, i, u, e, o, yu, oe, eo, a, eu, aai, aau, ai, au, ei, oi, ou, eoi, ui, and iu.
29. The method of claim 19 wherein said optimized phone set represents certain diphthongs using unified diphthongs to thereby conserve processing and memory resources while providing more accurate characterization of said speech recognition process.
30. The method of claim 29 wherein said optimized phone set includes unified diphthong phones eu, aai, aau, ai, au, ei, oi, ou, eoi, ui and iu.
31. The method of claim 19 wherein said optimized phone set represents a lip rounding by using a separate lip rounding phone "w" after the consonantal phone "g".
32. The method of claim 19 wherein said optimized phone set represents a lip rounding by using a separate lip rounding phone "w" after the consonantal phone "k".
33. The method of claim 19 wherein said input speech data includes a syllable initial environment in which a stop is located at the beginning of a syllable, said optimized phone set responsively representing corresponding consonants and preceding closure utilizing the appropriate consonant phone "b", "d", "g", "p", "t", or "k" in said syllable initial environment.
34. The method of claim 19 wherein said input speech data includes a syllable final/phrase intermediate environment in which a stop is located at the end of a word in the middle of a phrase, said optimized phone set responsively representing the corresponding consonant and preceding closure using the appropriate consonant phone "p", "t", or "k" in said syllable final/phrase intermediate environment.
35. The method of claim 19 wherein said input speech data includes a syllable-final/end-of-phrase environment in which a stop is located at the end of a word at the end of a phrase, said optimized phone set responsively representing any of "p", "t", or "k" as a closure only, without any subsequently released consonant sound, by using the same closure phone "cl" in said syllable-final/end-of-phrase environment.
36. The method of claim 19 wherein said input speech data includes a syllable initial environment in which a first stop is at the beginning of a syllable, a syllable final/phrase intermediate environment in which a second stop is at the end of a first word in the middle of a phrase, and a syllable final/phrase end environment in which a third stop is at the end of a second word at the end of said phrase, said optimized phone set representing the corresponding consonant and preceding closure using the appropriate consonant phone "b", "d", "g", "p", "t", or "k" in said syllable initial environment, said optimized phone set responsively representing said corresponding consonant and preceding closure using the appropriate consonant phone "p", "t", or "k" in said syllable final/phrase intermediate environment, and said optimized phone set responsively using the same closure phone "cl" in said syllable final/phrase end environment to represent any of "p", "t", or "k" as a closure only, without any subsequently released consonant sound.
37. A system for performing a speech recognition procedure, comprising:
means for comparing input speech data with phone strings from a vocabulary dictionary implemented according to an optimized phone set, the optimized phone set being implemented with phonetic techniques to provide consonantal phones and vocalic phones, respectively, the optimized phone set being implemented in a miniaturized manner to include a minimum required number of the consonantal phones and the vocalic phones; and
means for controlling said means for comparing to thereby perform said speech recognition procedure,
wherein the optimized phone set represents various sounds of a tonal language without utilizing corresponding tonal information as part of different phones in the optimized phone set.
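The phone-set rules recited in the claims above (tone-free phones, a separate "w" after "g" or "k" for lip rounding, and the environment-dependent treatment of stops) can be illustrated with a minimal Python sketch. It is not taken from the patent. It assumes the input is written in a Jyutping-style romanization with a trailing tone digit, it treats every syllable as a word so that only the last syllable of a phrase counts as phrase-final, and it does not handle syllabic nasals. The names `syllable_to_phones`, `phrase_to_phones`, `INITIALS`, and `CODAS` are hypothetical.

```python
import re

# Phone inventory as listed in claims 25-28: 17 consonants, 2 special, 20 vowels.
CONSONANTS = "b d g p t k m n ng f l h z c s w j".split()
SPECIAL = ["cl", "sil"]  # closure phone and silence phone
VOWELS = "aa i u e o yu oe eo a eu aai aau ai au ei oi ou eoi ui iu".split()
PHONE_SET = set(CONSONANTS + SPECIAL + VOWELS)

# Hypothetical helper tables for this sketch (longest initials are tried first).
INITIALS = sorted(["gw", "kw"] + CONSONANTS, key=len, reverse=True)
FINAL_STOPS = ("p", "t", "k")
CODAS = ("ng", "m", "n") + FINAL_STOPS


def syllable_to_phones(syllable, phrase_final=False):
    """Map one romanized syllable (assumed Jyutping-style) to a phone list."""
    s = re.sub(r"[1-6]$", "", syllable)  # tone is dropped, not encoded in phones
    phones = []
    # Syllable-initial consonant: "gw"/"kw" become "g"/"k" plus a separate "w".
    for ini in INITIALS:
        if s.startswith(ini) and len(s) > len(ini):
            phones += [ini[0], "w"] if ini in ("gw", "kw") else [ini]
            s = s[len(ini):]
            break
    # Syllable-final consonant, if any.
    coda = None
    for c in CODAS:
        if s.endswith(c) and len(s) > len(c):
            coda, s = c, s[: -len(c)]
            break
    if s not in VOWELS:
        raise ValueError(f"unsupported syllable: {syllable!r}")
    phones.append(s)
    if coda in FINAL_STOPS:
        # Phrase-final stops are closure only ("cl"); otherwise keep p/t/k.
        phones.append("cl" if phrase_final else coda)
    elif coda:
        phones.append(coda)
    return phones


def phrase_to_phones(syllables):
    """Convert a phrase; only its last syllable is treated as phrase-final."""
    last = len(syllables) - 1
    out = []
    for i, syl in enumerate(syllables):
        out += syllable_to_phones(syl, phrase_final=(i == last))
    return out


print(phrase_to_phones(["baat3", "gwok3"]))
# ['b', 'aa', 't', 'g', 'w', 'o', 'cl']
```

In this example the "t" of the first syllable sits in the middle of the phrase and keeps its consonant phone, while the "k" that ends the phrase is reduced to the shared closure phone "cl". Because tone digits are discarded, syllables that differ only in tone map to the same phone string.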
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US10/395352 | 2003-03-24 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| HK1151884A (en) | 2012-02-10 |