WO2000031727A1 - Optimization device for optimizing a vocabulary of a speech recognition device - Google Patents
Optimization device for optimizing a vocabulary of a speech recognition device
- Publication number
- WO2000031727A1 (PCT application PCT/EP1999/008640)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- word
- information
- stored
- memory
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/19—Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
- G10L15/197—Probabilistic grammars, e.g. word n-grams
Definitions
The transition probability information UWI of 2% for the word sequence information WFI = A+C+D, stated in the sixth row of the table 18, indicates that the word "connection" follows the word sequence "international business" with a probability of 2%.

Reference information RI is stored in the reference memory 12. Since every person pronounces words in a slightly different way, phonemes and phoneme sequences are also pronounced slightly differently by each speaker. The speech recognition device 2 is adapted to the respective user of the speech recognition device 2 by means of the reference information RI stored in the reference memory 12.

The control information ST delivered by the microphone 4 to the calculation means 9 activates a speech recognition mode of the speech recognition device 2 and a speech recognition operation in the calculation means 9. Speech information SI of the spoken text is applied by the microphone 4 to the A/D converter stage 7 and from there, as digitized speech information SI, to the storage stage 8, where it is stored. When the calculation means 9 are ready for processing digitized speech information SI stored in the storage stage 8, the digitized speech information SI is read from the storage stage 8 by the calculation means 9. The calculation means 9, while utilizing the reference information RI stored in the reference memory 12, determine phoneme information PI of the phoneme sequences contained in the digitized speech information SI.

The calculation means 9 then compare determined phoneme information PI with the phoneme information PI(WI) stored in the lexicon memory 10. When this comparison shows a match between determined and stored phoneme information PI(WI), the stored word information WI assigned to this stored phoneme information PI(WI) is determined from the lexicon memory 10 as a recognized word. Word sequences are then formed by stringing together recognized words. The calculation means 9 compare the word sequence information WFI of the formed word sequences with the word sequence information WFI stored in the speech model memory 11. When there is a match, the stored transition probability information UWI assigned to this recognized word sequence in the speech model memory 11 is determined. By evaluating the transition probability information UWI stored in the speech model memory 11 for recognized word sequences, a plurality of possible word sequences and their overall transition probability information are determined. The word information WI of the composed word sequence whose overall transition probability information of the bigrams and trigrams included therein has the highest value is determined as the recognized text. Such a speech recognition operation is carried out in accordance with the so-called "Hidden Markov model" and has been known for a long time.
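The evaluation of the stored transition probability information can be pictured with a short sketch. The following Python fragment is a simplified illustration, not the patent's implementation: it scores candidate word sequences with bigram and trigram transition probabilities and selects the sequence with the highest overall value; all names, entries and the fallback constant are assumptions for the example.

```python
import math

# Simplified stand-in for the speech model memory 11: bigram and trigram
# transition probabilities keyed by the preceding context words. The entries
# and values are illustrative assumptions, not data from the patent.
BIGRAMS = {("international",): {"business": 0.10, "corporation": 0.03}}
TRIGRAMS = {("international", "business"): {"connection": 0.02,
                                            "machines": 0.05}}
FLOOR = 1e-6  # fallback probability for transitions not stored in the model

def transition_prob(context, word):
    """Look up P(word | context), preferring the longer trigram context."""
    if len(context) >= 2:
        p = TRIGRAMS.get(tuple(context[-2:]), {}).get(word)
        if p:
            return p
    if len(context) >= 1:
        p = BIGRAMS.get(tuple(context[-1:]), {}).get(word)
        if p:
            return p
    return FLOOR

def sequence_score(words):
    """Sum of log transition probabilities over the whole word sequence."""
    return sum(math.log(transition_prob(words[:i], words[i]))
               for i in range(1, len(words)))

candidates = [["international", "business", "connection"],
              ["international", "business", "machines"]]
print(max(candidates, key=sequence_score))  # highest overall probability wins
```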
The calculation means 9 deliver the word information WI of the recognized text to the monitor 16, on which the recognized text is displayed.

The speech recognition device 2 comprises an optimization device 19 which is arranged for optimizing the vocabulary of the speech recognition device 2, which vocabulary is stored in the lexicon memory 10. The optimization device 19 includes test means 20, word defining means 21 and determining means 22. The calculation means 9 are arranged for activating an optimization mode of the speech recognition device 2 and an optimization operation of the optimization device 19, in that the calculation means 9 deliver activation information AI to the test means 20. The calculation means 9 could be arranged, for example, for activating an optimization operation of the optimization device 19 after a certain number of speech recognition operations or after, for example, one week since the last optimization operation.

In the optimization mode, word sequences which often occur in spoken texts and are stored as word sequence information WFI in the speech model memory 11 are defined as one word, to provide a better recognition of these typical formulations, which will be discussed in more detail hereinafter. In the test means 20 is stored as a minimum value MW a value of, for example, 9%, which prescribes that word sequences containing typical formulations, in which a second word occurs with a probability of occurrence of at least 9% after a first word, are defined as one word.

The test means 20 are arranged for comparing transition probability information UWI stored in the speech model memory 11 with the minimum value MW. When stored transition probability information UWI has the minimum value MW or a higher value, the word sequence information WFI assigned to this transition probability information UWI in the speech model memory 11 can be determined from the speech model memory 11 by the test means 20. Such word sequence information WFI can be delivered by the test means 20 both to the word defining means 21 and to the determining means 22.

The word defining means 21 are arranged for defining as one word word sequence information WFI delivered thereto of a word sequence containing at least two words, and for storing the word sequence information WFI of a word sequence defined as one word as word information WI in the lexicon memory 10. The determining means 22 are arranged for comparing word sequence information WFI delivered to them by the test means 20 with the word sequence information WFI stored in the speech model memory 11. If word sequence information WFI delivered to the determining means 22 corresponds to word sequence information WFI stored in the speech model memory 11, or is contained in stored word sequence information WFI, the determining means 22 are arranged for delivering identification information II to the word defining means 21. The identification information II features a first memory location in the speech model memory 11 in which this word sequence information WFI is stored, but also all further memory locations in the speech model memory 11 in which this word sequence information WFI is stored as part of other word sequence information WFI. When the word defining means 21 receive identification information II delivered to them, they are arranged for erasing this word sequence information WFI and its transition probability information UWI at the first memory location featured by the identification information II, because this word sequence information WFI has already been stored as word information WI in the lexicon memory 10.

In this way the advantage is obtained that all three-word word sequences stored in the speech model memory 11 which contain a two-word typical formulation are subsequently stored in the speech model memory 11 as word sequences containing only two words.
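The interplay of the test means, word defining means and determining means can be sketched as follows. This is a hedged illustration only; the names, the data and the use of the 9% minimum value MW are assumptions for the example, not code from the patent.

```python
# Illustrative sketch of the optimization operation described above.
MW = 0.09  # minimum value stored in the test means 20

lexicon = {"international", "business", "connection"}
# Stand-in for the speech model memory 11: word sequence -> UWI.
speech_model = {
    ("international", "business"): 0.10,
    ("international", "business", "connection"): 0.02,
}

def optimize(lexicon, speech_model):
    # Test means 20: find two-word sequences whose UWI reaches MW.
    merged = [seq for seq, uwi in speech_model.items()
              if len(seq) == 2 and uwi >= MW]
    for pair in merged:
        # Word defining means 21: store the pair as one word in the lexicon.
        lexicon.add(" ".join(pair))
        for seq in list(speech_model):
            if seq == pair:
                # The pair's own entry is erased from the speech model.
                del speech_model[seq]
            elif len(seq) == 3 and seq[:2] == pair:
                # A trigram containing the pair becomes a two-word sequence.
                speech_model[(" ".join(pair), seq[2])] = speech_model.pop(seq)

optimize(lexicon, speech_model)
print(speech_model)  # {('international business', 'connection'): 0.02}
```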
Fig. 4 shows a third table 24 which contains the word information WI and phoneme information PI(WI) stored in the lexicon memory 10 after the optimization operation of the optimization device 19. Fig. 5 shows a fourth table 25 which contains the word sequence information WFI and transition probability information UWI stored in the speech model memory 11 after the optimization operation of the optimization device 19.

Upon receiving the activation information AI, the optimization device 19 starts the optimization operation. The determining means 22 then deliver the respective identification information II to the word defining means 21.

The speech recognition device 2 has training means 23 to which word information WI of a recognized text can be applied after a speech recognition operation. The training means 23 are arranged for extending a word sequence stored in the speech model memory 11 by a word often occurring in a recognized text before or after this word sequence, and for storing transition probability information UWI of the extended word sequence in the speech model memory 11, if the number of words that can be stored for each word sequence in the speech model memory 11 so permits. Since bigrams, having two words per word sequence, and trigrams, having three words per word sequence, can be stored in the speech model memory 11, the training means 23 are arranged for extending a bigram by a word often occurring in a recognized text before or after that bigram. When the training means 23 detect, for example, that a particular word often occurs in recognized texts before or after a bigram stored in the speech model memory 11, this bigram is extended by that word to a trigram, which is stored together with its transition probability information UWI in the speech model memory 11.

In this manner the advantage is obtained that a word sequence reduced from a trigram to a bigram in the speech model memory 11 by the word defining means 21 during an optimization operation can be extended again by the training means 23. Consequently, word sequences of typical formulations having a high probability of occurrence can be stored in the speech model memory 11, which word sequences can contain considerably more words than can otherwise be stored in a word sequence of the speech model memory 11. As a result, the range of the speech model is increased and typical formulations can be recognized considerably better during a speech recognition operation. Nevertheless, the required memory capacity of the speech model memory 11 is considerably smaller than if all possible combinations of, for example, up to four words per word sequence of the words stored in the lexicon memory 10 were stored in the speech model memory 11.
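A hedged sketch of this training step follows, under the assumption that frequent neighbors are found by simple counting. Only the following-word case is shown; the preceding-word case is analogous. All names and thresholds are invented for the example.

```python
from collections import Counter

# Hypothetical sketch of the training means 23: extend a stored bigram by a
# word that often follows it in recognized texts, if the speech model
# memory permits trigrams.
MAX_WORDS_PER_SEQUENCE = 3

def extend_bigram(recognized_words, bigram, speech_model, min_count=2):
    followers = Counter()
    total = 0
    for i in range(len(recognized_words) - 1):
        if tuple(recognized_words[i:i + 2]) == bigram:
            total += 1
            if i + 2 < len(recognized_words):
                followers[recognized_words[i + 2]] += 1
    if followers and len(bigram) + 1 <= MAX_WORDS_PER_SEQUENCE:
        word, count = followers.most_common(1)[0]
        if count >= min_count and total > 0:
            # Transition probability UWI of the extended word sequence.
            speech_model[bigram + (word,)] = count / total

model = {("sehr", "geehrte"): 0.05}
extend_bigram("sehr geehrte damen und herren sehr geehrte damen".split(),
              ("sehr", "geehrte"), model)
print(model)  # now also contains ('sehr', 'geehrte', 'damen')
```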
Fig. 6 shows a personal computer 26 in which a vocabulary generator 27 in accordance with a second example of embodiment of the invention is realized. The vocabulary generator 27 is arranged for generating and storing word information WI that forms the vocabulary of a speech recognition device. The vocabulary generator 27 has an input terminal 28 which forms input means and at which word information WI of stored text information can be applied as electric input signals to the vocabulary generator 27.

The personal computer 26 has a hard disk 29 which is connected to the input terminal 28. The hard disk 29 contains much text information or many documents respectively, generated, for example, with a text processing program. These documents are formed, for example, by letters, messages or other publications. The contents of these documents relate to a certain domain for which the vocabulary generator 27 is to generate a vocabulary. A certain domain may be, for example, the domain of radiology, botany or nuclear physics.

The input terminal 28 of the vocabulary generator 27 is further connected to the Internet and, through the Internet, to memory means 30 in which documents of the specific domain are stored and which may be formed, for example, by a data server of a university. The personal computer 26 is arranged, in a manner not further shown in Fig. 6, for delivering to an Internet search engine search words which state the specific domain.

The vocabulary generator 27 includes a storage stage 31 in which the word information WI can be stored. The vocabulary generator 27 is arranged for reading word information WI of text information stored on the hard disk 29 and in the memory means 30 of the Internet and for storing this word information WI in the storage stage 31. The storage stage 31 then contains many documents or much text information, respectively, of the specific domain.

The vocabulary generator 27 includes generating means 32 which are arranged for generating a vocabulary, in that at least word information WI of a first and a second word of the text information stored in the storage stage 31, and transition probability information UWI indicating the probability of occurrence of the second word after the first word in a word sequence of the text information, can be determined, and which are arranged for storing at least the first and the second word as word information WI in a lexicon memory 10 and for storing the transition probability information UWI in a speech model memory 11.

The generating means 32 examine all the relevant documents stored in the storage stage 31 to determine which words occur very often in the specific domain, and store these words as word information WI in the lexicon memory 10. These words form the vocabulary of the specific domain. The vocabulary generator 27 further includes a background lexicon memory 33 in which much word information WI and assigned phoneme information PI(WI) of a general vocabulary is stored. For a word of the generated vocabulary that is not contained in the background lexicon memory 33, the generating means 32 calculate phoneme information PI(WI) in accordance with statistical methods and store the calculated phoneme information PI(WI) of this word, assigned to its word information WI, in the lexicon memory 10.

The generating means 32 further test which word sequences often occur in the documents stored in the storage stage 31. The generating means 32 store often-occurring word sequences as bigrams and trigrams in the speech model memory 11. The transition probability information UWI of the bigrams and trigrams, determined by the generating means 32, is stored in the speech model memory 11, assigned to the bigrams and trigrams. After this, word information WI and phoneme information PI(WI) are stored in the lexicon memory 10, for example, in accordance with the information contained in the first table 17, and word sequence information WFI and transition probability information UWI are stored in the speech model memory 11, for example, in accordance with the information contained in the second table 18.
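The generation step can be sketched as follows. This is a simplified illustration under assumed names and thresholds, with whitespace tokenization standing in for whatever segmentation the generator actually uses.

```python
from collections import Counter

# Illustrative sketch of the generating means 32: build a domain vocabulary
# and bigram/trigram transition probabilities UWI from the documents stored
# in the storage stage 31. All thresholds are assumptions.
def generate_vocabulary(documents, top_words=64000, min_count=3):
    word_counts = Counter()
    sequence_counts, context_counts = Counter(), Counter()
    for doc in documents:
        tokens = doc.lower().split()
        word_counts.update(tokens)
        for n in (2, 3):  # bigrams and trigrams
            for i in range(len(tokens) - n + 1):
                seq = tuple(tokens[i:i + n])
                sequence_counts[seq] += 1
                context_counts[seq[:-1]] += 1

    # Lexicon memory 10: the most frequent words of the specific domain.
    lexicon = {w for w, _ in word_counts.most_common(top_words)}

    # Speech model memory 11: UWI = P(last word | preceding words).
    speech_model = {seq: count / context_counts[seq[:-1]]
                    for seq, count in sequence_counts.items()
                    if count >= min_count}
    return lexicon, speech_model
```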
The vocabulary generator 27 has an optimization device 19 which is used for optimizing a vocabulary generated by the vocabulary generator 27. The optimization device 19 of the second example of embodiment of the invention corresponds to the whole optimization device 19 of the first example of embodiment of the invention. After the vocabulary has been generated, the generating means 32 apply activation information AI to the test means 20 of the optimization device 19, after which the optimization operation described with reference to the first example of embodiment commences and the vocabulary is optimized.

Since the optimization device 19 is included in the vocabulary generator 27, the advantage is obtained that a vocabulary generated by the vocabulary generator 27 is optimized, and typical formulations occurring in the documents stored in the storage stage 31 can be recognized better when a speech recognizer that uses the vocabulary generated by the vocabulary generator 27 carries out a recognition operation. Furthermore, in an advantageous manner, the computation effort for recognizing a typical formulation in speech information SI of a spoken text is considerably smaller during a speech recognition operation of a speech recognizer that uses the vocabulary generated by the vocabulary generator 27.

The generating means 32, after at least a third word has been defined and stored by the word defining means 21, are arranged for extending a word sequence which is stored in the speech model memory 11 and includes the third word by one word which often occurs before or after this word sequence in the stored text information applied as an input signal to the generating means 32, and for storing the extended word sequence in the speech model memory 11 if the number of words that can be stored per word sequence in the speech model memory 11 so permits. For this purpose, the text information or documents respectively stored in the storage stage 31 are again tested by the generating means 32, and a word often occurring in these documents before or after a bigram stored in the speech model memory 11 is stored in the speech model memory 11 as a trigram, which is a combination of this word and the bigram. Transition probability information UWI of the trigram is determined by the generating means 32 and stored in the speech model memory 11 while being assigned to the trigram.

The test means 20 of the vocabulary generator 27 may further be arranged so that, after a positive result of the test of the transition probability information UWI of a word sequence, they determine occurrence probability information indicating how often this word sequence occurs in the stored text information, and further test whether the determined occurrence probability information has a minimum-occurrence value; the words of this word sequence would then not be defined as a third word until there is a positive result of this further test. Such an optimization device with such test means would then advantageously define as one word only those word sequences whose transition probability information UWI has a minimum value and which also occur relatively often in stored text information and thus also relatively often in speech information SI of a spoken text. A minimum-occurrence value may be, for example, a value of 2%, which indicates that in 100 words of stored text information the words of the word sequence occur twice as a word sequence.
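A minimal sketch of this two-stage test follows. The 9% minimum value MW and the 2% minimum-occurrence value are taken from the examples in the text; the function and parameter names are assumptions.

```python
# Hedged sketch of the further test described above: a word sequence is
# defined as one word only if its transition probability UWI reaches the
# minimum value MW and the sequence itself occurs often enough in the
# stored text information.
MW = 0.09              # minimum value for the transition probability
MIN_OCCURRENCE = 0.02  # minimum-occurrence value: 2 per 100 words

def should_define_as_word(uwi, sequence_count, total_word_count):
    if uwi < MW:                         # first test: transition probability
        return False
    occurrence = sequence_count / total_word_count
    return occurrence >= MIN_OCCURRENCE  # further test: occurrence frequency

print(should_define_as_word(uwi=0.10, sequence_count=4,
                            total_word_count=100))  # True
```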
In this way, the typical formulation "Sehr geehrte Damen und Herren" can be stored in a speech model memory by the word defining means during an optimization operation, and can be recognized very well, with little computation effort, during a subsequent speech recognition operation. The optimization device can also be used very well for optimizing a vocabulary containing composite elements, whereby a range of the speech model extending beyond a composite built from composite elements is achieved.
Abstract
In an optimization device (19) for optimizing a vocabulary of a speech recognition device (2), comprising a lexicon memory (10) in which word information (WI) of at least a first and a second word forming the vocabulary of a speech recognition device (2) can be stored, and comprising a speech model memory (11) in which at least a probability of occurrence of the second word after the first word can be stored as transition probability information (UWI) in a word sequence formed by these words, and comprising word defining means (21) for defining a third word and for storing the third word as word information (WI) in the lexicon memory (10) and for storing at least transition probability information (UWI) of the probability of occurrence of the third word in a word sequence after at least the first or the second word stored in the lexicon memory (10) in the speech model memory (11), test means (20) are provided which are arranged for testing whether transition probability information (UWI) of a word sequence stored in the speech model memory (11) has a minimum value (MW) and the word defining means (21) are arranged for defining the words of this word sequence as the third word when the test of the test means (20) shows a positive result.
Description
Optimization device for optimizing a vocabulary of a speech recognition device.
The invention relates to an optimization device for optimizing a vocabulary of a speech recognition device, comprising a lexicon memory in which the word information of at least a first and a second word forming the vocabulary of a speech recognition device can be stored, and comprising a speech model memory in which at least a probability of occurrence of the second word after the first word in a word sequence formed by these words can be stored as transition probability information, and comprising word defining means for defining a third word and for storing in the lexicon memory the third word as word information and for storing in the speech model memory at least transition probability information of the probability of occurrence of the third word in a word sequence after at least the first or the second word stored in the lexicon memory.
The invention further relates to a speech recognition device for recognizing phoneme information contained in speech information of a spoken text and for delivering word information of a recognized text, comprising input means which are supplied with speech information of a spoken text as input signals, and comprising speech recognition means which are arranged for recognizing phoneme information of the spoken text contained in the input signals and for delivering word information of a recognized text, and comprising output means which can deliver word information of a recognized text as output signals.
The invention further relates to a vocabulary generator for generating and storing word information that forms the vocabulary of a speech recognition device, comprising input means to which word information of stored text information can be fed as input signals, and comprising generator means which are arranged for generating a vocabulary, so that at least word information of the text information of a first and a second word and transition probability information indicating the probability of occurrence of the second word after the first word in a word sequence of the text information can be determined, and which generator means are arranged for storing at least the first and the second word as word information in a lexicon memory and the transition probability information in a speech model memory.
Such a speech recognition device for recognizing phoneme information contained in speech information of a spoken text and for delivering word information of a recognized text, of the type discussed in the second paragraph, comprising an optimization device for optimizing a vocabulary of a speech recognition device of the type defined in the first paragraph, is known from document WO 96/29695.
The known speech recognition device includes input means formed by a microphone terminal to which a microphone can be connected. The microphone can apply speech information as an electric input signal to the speech recognition device, which speech information comes from a text spoken by a user of the speech recognition device. The electric input signal can be applied to speech recognition means of the speech recognition device and the speech recognition means can deliver their recognized word information as recognized text to output means of the speech recognition device. A monitor by which word information of the recognized text can be displayed can be connected to the output means, which are formed by a monitor terminal.
For recognizing word information contained in the electric input signal, the speech recognition means include, inter alia, a lexicon memory. The lexicon memory stores as word information all the words that form the vocabulary of the speech recognition device.
Phoneme information is assignedly stored with word information, which phoneme information forms a phoneme sequence featuring the assignedly stored word.
During a speech recognition operation of the speech recognition device, phoneme sequences in a spoken text are determined by the speech recognition means and compared with phoneme sequences stored in the lexicon memory. If this comparison shows a match of a determined phoneme sequence and a stored phoneme sequence, stored word information assigned to this stored phoneme sequence is taken from the lexicon memory as a recognized word.
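This lexicon lookup can be pictured with a minimal sketch. The phoneme notation and all entries below are invented for illustration and do not come from the patent.

```python
# Minimal sketch of the lexicon lookup described above: a determined phoneme
# sequence that matches a stored phoneme sequence yields the assignedly
# stored word. The phoneme strings are illustrative assumptions.
lexicon = {
    "I-n-t-E-r-n-a-S-O-n-a-l": "international",  # phoneme sequence -> word
    "b-I-z-n-I-s": "business",
}

def recognize_word(determined_phonemes):
    # A match between determined and stored phoneme information selects the
    # assigned word information as the recognized word; no match yields None.
    return lexicon.get(determined_phonemes)

print(recognize_word("b-I-z-n-I-s"))  # -> "business"
```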
The speech recognition means further include a speech model memory which stores transition probability information for word sequences of words stored in the lexicon memory. The speech model memory stores word sequences of two words each, so-called bigrams, and word sequences of three words each, so-called trigrams.
For example, the word sequence "Sehr geehrte Damen und Herren" made up of bigrams and trigrams relatively often occurs as a typical formulation in spoken texts. In the speech model memory is stored as a probability of occurrence of the word "Damen" after the words "Sehr geehrte" the transition probability information of, for example, "5%". For the probability of occurrence of the word "Herren" after the words "Sehr geehrte" is stored the transition probability information of, for example, "4%". Since the composite word sequence "Sehr geehrte Damen und Herren" as a typical formulation occurs more frequently in a spoken text than the composite word sequence "Sehr geehrte Herren", the probability of occurrence of the trigram "Sehr geehrte Damen" is about "1%" higher than the probability of occurrence of the trigram "Sehr geehrte Herren".
During a speech recognition operation of the speech recognition device, a search is made in the speech model memory not only for words separately stored in the lexicon memory, but also for word sequences formed by stringing together recognized words. By evaluating the transition probability information stored in the speech model memory for recognized word sequences, the speech recognition device determines as recognized text the word information of the composite word sequence for which the transition probability information of the bigrams and trigrams contained therein has the highest value.
It is known to the expert that typical formulations in spoken texts are better recognizable the larger the number of words stored per word sequence in a speech model memory. For example, the composite word sequence "Sehr geehrte Damen und Herren" could be recognized very well by a speech recognition device if it were stored in the speech model memory as a word sequence having only one transition probability information signal with a very high value, because this word sequence would not have to be assembled from bigrams and trigrams by the speech recognition means during a speech recognition operation. Consequently, the computation effort of the speech recognition means would also be relatively small, which would be a further advantage.
In such a case, where word sequences stored in the speech model memory would consist of up to five words, a very large number of possible and sensible combinations of five words stored in the lexicon memory would exist and, therefore, a multiplicity of word sequences would have to be stored in the speech model memory. The required memory space in the speech model memory would therefore be very large and the speech recognition device would be expensive, which is a considerable disadvantage.
The number of different words forming the vocabulary of the speech recognition device and which can be recognized by the speech recognition device is again restricted as a consequence of the memory capacity of the lexicon memory. The known speech recognition device includes word defining means by which so-called composite elements can be defined as words. Composite elements are words which relatively often occur as composite words, so-called composites, in a spoken text. The word defining means are arranged for storing transition probability information in the speech model memory which information indicates the probability of occurrence of a composite element defined as one word in a word sequence after at least one further word stored in the lexicon memory.
Storing composite elements in the lexicon memory provides the advantage that not all possible composites formed by combinations of such composite elements need to be stored in the lexicon memory, so that the required memory space of the lexicon memory is reduced considerably.
The known speech recognition device now proves to have the disadvantage that typical formulations contained in a spoken text are not sufficiently well recognized during a speech recognition operation, because only transition probability information of word sequences having a maximum of three words can be stored in the speech model memory and this number of words per word sequence cannot, in essence, be increased as a result of the limited memory capacity of the speech model memory. Furthermore, in the known speech recognition device, the recognition of typical formulations contained in a spoken text is additionally degraded by the inclusion of words forming composite elements, because a composite formed, for example, by three composite elements already forms a complete word sequence and no further information about words and word sequences neighboring this word sequence can be determined by evaluating transition probability information of this word sequence.
The disadvantages stated above have also turned up in a speech recognition device of the type defined in the second paragraph when a vocabulary stored as word information in a lexicon memory was used, which vocabulary was generated by a vocabulary generator as mentioned in the third paragraph.
It is an object of the invention to eliminate the problems stated above and provide an improved optimization device of the type mentioned in the opening paragraph. This object is achieved with an optimization device of the type mentioned in the opening paragraph in that test means are provided which are arranged for testing whether transition probability information of a word sequence stored in the speech model memory has a minimum value, and in that the word defining means are arranged for defining the words of this word sequence as a third word when the test means give a positive test result.
In consequence, word sequences containing two or more words, which relatively often appear as a typical formulation in spoken texts or stored text information, can be stored in the lexicon memory as one word. Thus, transition probability information on word sequences can be stored in the speech model memory, which word sequences contain words which are themselves determined by two or more words stored as one word in the lexicon memory. This brings in the advantage that the number of words per word sequence stored in the speech model memory need not be increased and, nevertheless, formulations with a larger number of words of a word sequence can be recognized better and the speech model has, so to speak, a larger range in a word sequence formed by recognized words.
Additionally, there is the further advantage that when a speech recognition operation is in progress, the speech recognition means are capable of determining phoneme sequences contained in a spoken text more reliably. Each phoneme in a phoneme sequence is influenced in the way it is pronounced by the phonemes before and after the particular phoneme in the phoneme sequence. The phonemes of a phoneme sequence of a word are influenced at the word boundaries by adjacent words in the word sequence. Consequently, in case of a recognition of a word sequence comprising a larger number of words, a larger number of neighboring phonemes of neighboring words is known and, as a result, the associated word sequence can be recognized more reliably.
In an optimization device as claimed in claim 1 it has appeared to be advantageous to provide the measures in accordance with claim 2. Consequently, the advantage is obtained that transition probability information already determined for word sequences stored in the speech model memory, which contain one word defined by the word defining means plus two or more words, can be stored as transition probability information for word sequences containing the defined word even after the word has been defined.
It is a further object of the invention to eliminate the problems stated above and provide an improved speech recognition device of the type as stated at the beginning of the application in the second paragraph. This object is achieved with a speech recognition device of the type stated in the second paragraph in that an optimization device in accordance with claim 1 is provided.
As a result, the advantage is obtained that typical formulations in spoken texts can be recognized better and the recognition rate of the speech recognition device is improved, although practically no additional storage space is necessary for the speech model memory. Furthermore, the computation effort of the speech recognition means during a speech recognition operation is considerably smaller when typical formulations are recognized, which is highly advantageous.
In a speech recognition device as claimed in claim 3 it has proved to be advantageous to provide the measures in accordance with claim 4. As a result, the advantage is obtained that a word sequence defined as one word by the word defining means can be extended by further words contained in recognized texts before or after the defined word, so that the so-called range of the speech model is further extended.
It is a further object of the invention to eliminate the problems discussed above and provide an improved vocabulary generator of the type stated in the third paragraph. This object is achieved with a vocabulary generator of the type discussed in the third paragraph in that an optimization device in accordance with claim 1 is provided.
As a result, the advantage is obtained that a vocabulary generated in the vocabulary generator and stored in the lexicon memory as word information is optimized to the effect that the previously stated advantages are obtained with a speech recognition device.
In a vocabulary generator in accordance with claim 5 it has proved to be advantageous to provide the measures in accordance with claim 6. As a result, the advantage is obtained that a word sequence which the word defining means defined as one word can be extended by further words often contained in the stored text information before or after the defined word, so that the so-called range of the speech model is further extended.
In a vocabulary generator as claimed in claim 5, it has proved to be advantageous to provide the measures in accordance with claim 7. As a result, the advantage is obtained that only those word sequences are defined as one word that occur relatively often in stored texts and thus also relatively often in spoken texts, so that the required memory space in the speech model memory is further reduced.
These and other aspects of the invention are apparent from and will be elucidated with reference to the embodiments described hereinafter.
In the drawings: Fig. 1 diagrammatically shows in the form of a block diagram a speech recognition device including test means for testing whether transition probability information stored in a speech model memory has a minimum value,
Fig. 2 shows a first table containing word information and phoneme information stored in a lexicon memory of the speech recognition device in accordance with Fig. 1,
Fig. 3 shows a second table containing word sequence information and transition probability information stored in the speech model memory of the speech recognition device as shown in Fig. 1,
Fig. 4 shows a third table containing word information and phoneme information stored in the lexicon memory of the speech recognition device as shown in Fig. 1 which, after a word sequence comprising two words has been defined, is stored as one word by word defining means of the speech recognition device,

Fig. 5 shows a fourth table containing word sequence information and transition probability information stored in the speech model memory of the speech recognition device as shown in Fig. 1, which word sequence information and transition probability information is stored as one word after the word defining means have defined a word sequence containing two words,

Fig. 6 diagrammatically shows in the form of a block diagram a vocabulary generator including test means for testing whether transition probability information stored in a speech model memory has a minimum value.
Fig. 1 shows in the form of a block diagram a personal computer 1 in which a speech recognition device 2 in accordance with a first example of embodiment of the invention is realized. The speech recognition device is arranged for recognizing phoneme information PI contained in speech information SI of a text spoken by a user of the speech recognition device 2, and for delivering word information WI of a recognized text. The speech recognition device 2 includes input means formed by an input terminal 3. A microphone 4 can be connected to the input terminal 3. The microphone 4 can deliver speech information SI of a spoken text as an electric input signal to the input terminal 3 of the speech recognition device 2. The microphone 4 has a control key 5 by which control information ST can be delivered to the speech recognition device 2.
When a user of the speech recognition device 2 wishes to speak a spoken text to be recognized into the microphone 4, the user is to actuate the control key 5. Subsequently, speech information SI contained in the spoken text can be delivered to the input terminal 3 and the control information ST to the speech recognition device 2.
The speech recognition device 2 includes speech recognition means 6 which are arranged for recognizing phoneme information PI of a spoken text, which phoneme information is contained in the speech information SI of the input signal, and for delivering word information WI of a recognized text. For this purpose, the speech recognition means 6 include an A/D converter stage 7, a storage stage 8, calculation means 9, a lexicon memory 10, a speech model memory 11 and a reference memory 12.
Speech information SI delivered to the input terminal 3 can be delivered to the A/D converter stage 7 as an electric input signal. The A/D converter stage 7 can deliver digitized speech information SI to the storage stage 8. The storage stage 8 stores the digitized speech information SI delivered thereto. In an audio reproduction mode of the speech recognition device 2, which mode can be activated in a manner not further shown in Fig. 1, digitized speech information SI stored in the storage stage 8 can be applied to a D/A converter stage 13. The D/A converter stage 13, in the audio reproduction mode, can deliver analog speech information SI as electric output signals to a loudspeaker 14 for the acoustic reproduction of a text spoken into the microphone 4 by a user of the speech recognition device 2.
The calculation means 9 are formed by a microprocessor and connected via an address/data bus to the lexicon memory 10, the speech model memory 11 and the reference memory 12. To the calculation means 9 can be applied the digitized speech information SI stored in the storage stage 8 and the control information ST coming from the microphone 4. The calculation means 9 can determine word information WI of a recognized text while utilizing the information stored in the lexicon memory 10, the speech model memory 11 and the reference memory 12, which will be discussed in further detail hereinafter. The calculation means 9 can deliver word information WI of a recognized text to an output terminal 15 which forms output means. A monitor 16, on which word information WI of a recognized text delivered by the output terminal 15 can be displayed, can be connected to the output terminal 15.
In the lexicon memory 10 can be stored the word information WI of a maximum of 64,000 individual words which form the vocabulary of the speech recognition device 2. The speech recognition device 2 correctly recognizes only those words contained in the speech information SI of a spoken text that are also stored in the lexicon memory 10.
For each word information WI of a word in the lexicon memory 10 can be stored a phoneme sequence as phoneme information PI(WI) featuring the word. Phonemes of a phoneme sequence are the smallest distinguishable acoustic units into which digitized speech information SI can be subdivided. The acoustic pronunciation of a phoneme in a phoneme sequence is influenced by the phonemes surrounding the relevant phoneme in the phoneme sequence. The first phoneme of a phoneme sequence of a word thus depends on the last phoneme of a phoneme sequence of a previous word in a word sequence, as also the last phoneme of a phoneme sequence of a word depends on the first phoneme of a phoneme sequence of the next word in the word sequence. In a speech recognition operation of the
speech recognition device 2 it is therefore very important for the correct recognition of a word to know the words surrounding the word of the word sequence to be recognized, or to adopt these words as predetermined values.
A first table 17 of Fig. 2 contains word information WI and phoneme information PI(WI) assignedly stored in the lexicon memory 10. For a simple explanation, the letters A, B, C to F are stated in the table 17 to represent the word information WI. The first table 17 for example contains for a word "international" the word information WI = A, for a word "machines" the word information WI = B, for a word "business" the word information WI = C, for a word "connection" the word information WI = D, for a word "the" the word information WI = E and for a word "corporation" the word information WI = F. The six word information signals WI stated in the first table 17 represent a plurality of word information signals WI stored in the lexicon memory 10. The vocabulary of the speech recognition device 2 thus also contains the six words denoted as word information WI in the first table 17. A probability of occurrence of a second word stored in the lexicon memory 10 after a first word stored in the lexicon memory 10, in a word sequence formed by these words, can be stored as transition probability information UWI in the speech model memory 11 of the speech recognition device 2. In the speech model memory 11 can be stored word sequences of two words each, so-called bigrams, and word sequences of three words each, so-called trigrams.
Fig. 3 shows a second table 18 containing word sequence information WFI of word sequences and assigned transition probability information UWI stored in the speech model memory 11. For example, the third row of the second table 18 contains the information that in a word sequence formed by the words "international" and "business" in speech information SI of a spoken text the word "business" having the word information WI = C follows the word "international" having the word information WI = A with a statistical probability of 10%. When during a speech recognition operation the word "international" is recognized, it may be assumed with a probability of 10% that the next word present in the spoken text will be the word "business", which in the previously discussed context, with the acoustic pronunciation of the last phoneme of the word "international" and of the first phoneme of the word "business", is very important for a correct recognition of the word.
Transition probability information UWI of word sequence information WFI = A+C+D of 2%, stated in the sixth row of the second table 18, indicates that the word "connection" follows the word sequence "international business" with a probability of 2%. Transition probability information UWI of word sequence information WFI = E+A+C of 5%, stated in the seventh row of the second table 18, indicates that the word "business" follows the word sequence "the international" with a probability of 5%.
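Merely by way of illustration, the bookkeeping of the lexicon memory 10 and the speech model memory 11 described above can be sketched as follows in Python. The identifiers, the dictionary layout and the omission of the phoneme information PI(WI) are assumptions made for clarity, not structures prescribed by the patent.

```python
# Illustrative sketch of the lexicon memory 10 and the speech model
# memory 11; identifiers and layout are assumptions, not the patent's
# implementation.

# Lexicon memory 10: word information WI mapped to the words themselves
# (phoneme information PI(WI) is omitted here for brevity).
lexicon = {
    "A": "international",
    "B": "machines",
    "C": "business",
    "D": "connection",
    "E": "the",
    "F": "corporation",
}

# Speech model memory 11: word sequence information WFI (bigrams and
# trigrams, here tuples of WI keys) mapped to transition probability
# information UWI, following the values of the second table 18.
speech_model = {
    ("A", "C"): 0.10,       # "business" follows "international": 10%
    ("A", "C", "D"): 0.02,  # "connection" follows "international business": 2%
    ("E", "A", "C"): 0.05,  # "business" follows "the international": 5%
}

def transition_probability(history, word):
    """Look up the UWI of `word` occurring after the sequence `history`."""
    return speech_model.get(tuple(history) + (word,), 0.0)

print(transition_probability(["A"], "C"))       # 0.1
print(transition_probability(["E", "A"], "C"))  # 0.05
```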
There may be observed that in the speech model memory 11 the word information WI is not stored again in the word sequence information WFI as in the lexicon memory 10; instead, to save storage space in the speech model memory 11, address pointers to the memory locations in the lexicon memory 10 of the relevant word information WI are stored in the speech model memory 11. For example, an address pointer to the second row of the first table 17 and an address pointer to the fourth row of the first table 17 are stored on the third row of the second table 18 for the word sequence information WFI = A+C.
Reference information RI is stored in the reference memory 12. Since each human being has a different type of acoustic pronunciation of a word, also phonemes and phoneme sequences are pronounced slightly differently by each human being. The speech recognition device 2 is adapted to the respective user of the speech recognition device 2 by means of reference information RI stored in the reference memory 12.
When a user speaks a text into the microphone 4, and simultaneously presses the control key 5, the control information ST delivered by the microphone 4 to the calculation means 9 activates a speech recognition mode in the speech recognition device 2 and a speech recognition operation in the calculation means 9. Speech information SI of the spoken text is applied by the microphone 4 to the A/D converter stage 7 and from there as digitized speech information SI to the storage stage 8 and stored there.
When the calculation means 9 are ready for processing digitized speech information SI stored in the storage stage 8, the digitized speech information SI is read from the storage stage 8 by the calculation means 9. The calculation means 9, while utilizing the reference information RI stored in the reference memory 12, determine phoneme information PI of phoneme sequences contained in the digitized speech information SI. The calculation means 9 then compare determined phoneme information PI with phoneme information PI(WI) stored in the lexicon memory 10. When there is a match between determined and stored phoneme information PI(WI) after this comparison, stored word information WI assigned to this stored phoneme information PI(WI) is determined as a recognized word from the lexicon memory 10.
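As a minimal sketch of this comparison step, assuming illustrative phoneme symbols and an exact-match lookup (a real recognizer scores approximate acoustic matches against the reference information RI), the lexicon lookup could look as follows:

```python
# Hedged sketch: determined phoneme information PI is compared with the
# phoneme information PI(WI) stored in the lexicon memory 10; on a match
# the assigned word information WI is the recognized word. Phoneme
# symbols and the exact-match strategy are illustrative assumptions.

lexicon_by_phonemes = {
    ("dh", "ax"): "the",
    ("b", "ih", "z", "n", "ax", "s"): "business",
}

def lookup_word(determined_phonemes):
    """Return the stored word whose phoneme sequence matches, else None."""
    return lexicon_by_phonemes.get(tuple(determined_phonemes))

print(lookup_word(["dh", "ax"]))  # "the"
```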
During the speech recognition operation of the calculation means 9, word sequences are formed by stringing together recognized words. The calculation means 9 compare word sequence information WFI of formed word sequences with word sequence
information WFI stored in the speech model memory 11. When there is a match, the stored transition probability information UWI is determined which is assigned to this recognized word sequence in the speech model memory 11.
By evaluating transition probability information UWI stored in the speech model memory 11 for recognized word sequences, a plurality of possible word sequences composed of recognized words, together with their overall transition probability information, are determined. The word information WI of that composed word sequence whose overall transition probability information of the bigrams and trigrams included therein has the highest value is determined as the recognized text. Such a speech recognition operation is carried out in accordance with the so-called "Hidden Markov Model" and has been known for a long time. Through the output terminal 15 the calculation means 9 deliver word information WI of a text recognized by the calculation means 9 to the monitor 16 for the recognized text to be displayed.
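The speech-model side of this selection can be sketched as follows. A real decoder combines these scores with acoustic scores in a Viterbi-style search over the Hidden Markov Model, which is omitted here; the candidate lists, the probability values and the floor for unseen n-grams are illustrative assumptions.

```python
import math

# Hedged sketch of selecting the composed word sequence whose bigram and
# trigram transition probabilities yield the highest overall value. Only
# the speech-model contribution described above is shown.

speech_model = {
    ("the", "international"): 0.08,
    ("international", "business"): 0.10,
    ("the", "international", "business"): 0.05,
}

def overall_score(words, floor=1e-6):
    """Sum of log UWI over all bigrams and trigrams in the sequence.
    Unseen n-grams get a small floor probability (an assumption)."""
    score = 0.0
    for n in (2, 3):
        for i in range(len(words) - n + 1):
            score += math.log(speech_model.get(tuple(words[i:i + n]), floor))
    return score

candidates = [
    ["the", "international", "business"],
    ["the", "international", "busyness"],  # acoustically similar alternative
]
print(max(candidates, key=overall_score))  # ['the', 'international', 'business']
```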
The speech recognition device 2 comprises an optimization device 19 which is arranged for optimizing the vocabulary of the speech recognition device 2, which vocabulary is stored in the lexicon memory 10. For this purpose, the optimization device 19 includes test means 20, word defining means 21 and determining means 22. The calculation means 9 are arranged for activating an optimization mode of the speech recognition device 2 and an optimization operation of the optimization device 19 in that the calculation means 9 deliver activation information AI to the test means 20. The calculation means 9 could be arranged, for example, for activating an optimization operation of the optimization device 19 after a certain number of speech recognition operations or after, for example, one week since the last optimization operation. During an optimization operation of the optimization device 19, word sequences often occurring in spoken texts and stored as word sequence information WFI in the speech model memory 11 are defined as one word to provide a better recognition of these typical formulations, which will be discussed in more detail hereinafter.
In the test means 20, a value of 9% is stored as a minimum value MW, which prescribes that word sequences forming typical formulations, in which a second word occurs with an occurrence probability of at least 9% after a first word, are defined as one word. By defining the minimum value MW, the memory capacity to be expected for the speech model memory 11 and the calculation circuitry needed in the calculation means 9 for a typical formulation during a speech recognition operation can be predefined.
When activation information AI occurs, the test means 20 are arranged for comparing transition probability information UWI stored in the speech model memory 11 with
the minimum value MW. When transition probability information UWI stored in the speech model memory 11 has the minimum value MW, or a higher value, stored word sequence information WFI assigned to this transition probability information UWI in the speech model memory 11 can be determined from the speech model memory 11 by the test means 20. Such word sequence information WFI can be delivered by the test means 20 both to the word defining means 21 and to the determining means 22.
The word defining means 21 are arranged for defining as one word the word sequence information WFI delivered thereto of a word sequence containing at least two words, and for storing the word sequence information WFI of a word sequence defined as one word as word information WI in the lexicon memory 10. As a result, the advantage is obtained that typical formulations in spoken texts can already be recognized as one word and not as a word sequence by means of information stored in the speech model memory 11. Advantageously, this considerably reduces the calculation circuitry of the calculation means 9 when a word sequence defined as one word is recognized. In the case of a positive result of the test of the test means 20, the determining means 22 are arranged for comparing word sequence information WFI delivered to the determining means 22 by the test means 20 with the word sequence information WFI stored in the speech model memory 11. If word sequence information WFI delivered to the determining means 22 corresponds to word sequence information WFI stored in the speech model memory 11, or is contained in stored word sequence information WFI, the determining means 22 are arranged for delivering identification information II to the word defining means 21.
Identification information II features a first memory location in the speech model memory 11 in which this word sequence information WFI is stored. However, identification information II also features all the further memory locations in the speech model memory 11 in which this word sequence information WFI contained in other word sequence information WFI is stored.
When the word defining means 21 receive identification information II delivered to them, they are arranged for erasing this word sequence information WFI and its transition probability information UWI at the first memory location featured by the identification information II, because this word sequence information WFI has been stored as word information WI in the lexicon memory 10. When receiving identification information II applied thereto, the word defining means 21 are further arranged for storing the word information WI of the word sequence defined as one word in the word sequence information WFI of the further memory locations of the speech model memory 11 featured by the identification information II.
As a result, the advantage is obtained that all the three-word word sequences stored in the speech model memory 11 which contain a two-word word sequence forming a typical formulation are stored in the speech model memory 11 as word sequences containing only two words.
Next an optimization operation of the optimization device 19 will be described. Fig. 4 shows a third table 24 which contains word information WI and phoneme information PI(WI) stored in the lexicon memory 10 after the optimization operation of the optimization device 19. Fig. 5 shows a fourth table 25 which contains the word sequence information WFI and transition probability information UWI stored in the speech model memory 11 after the optimization operation of the optimization device 19.
When the information of the first table 17 is stored in the lexicon memory 10 and the information of the second table 18 in the speech model memory 11, and the calculation means 9 deliver activation information AI to the test means 20, the optimization device 19 starts the optimization operation. The test means 20 then test which transition probability information UWI stored in the speech model memory 11 has a value greater than or equal to the minimum value MW. During this test the test means 20 establish that the transition probability information UWI of the word sequence information WFI = A+C stored on the third row of the second table 18 has the value 10%. The word sequence "international business" therefore represents a typical formulation. After this, the test means 20 deliver the word sequence information WFI = A+C both to the word defining means 21 and to the determining means 22. After receiving the word sequence information WFI = A+C, the word defining means 21 store the word sequence "international business" in the lexicon memory 10 as one word by means of the word information WI = G indicated on the last row of the third table 24. A combination of the phoneme information PI(A) and PI(C) of the words "international" and "business" is assigned to the word information WI = G as phoneme information PI(G) and stored in the lexicon memory 10.
The determining means 22 search for the word sequence information WFI = A+C received from the test means 20 in the word sequence information WFI stored in the speech model memory 11, which is contained in the second table 18. The word sequence information WFI = A+C is found in the stored word sequence information WFI = A+C, WFI = A+C+D and WFI = E+A+C by the determining means 22. The determining means 22 then deliver respective identification information II to the word defining means 21.
Upon receipt of this identification information II, the word defining means 21 erase the word sequence information WFI = A+C, which still appears on the third row of the second table 18 but is already erased in the fourth table 25, because this word sequence, defined as one word having the word information WI = G, no longer forms a word sequence. When the identification information II is received, the word defining means 21 further replace the word sequence information WFI = A+C by the word information WI = G in the word sequence information WFI = A+C+D and WFI = E+A+C contained in the sixth and seventh rows of the second table 18, so as to obtain the word sequence information WFI = G+D and WFI = E+G represented in the fourth table 25 on the fifth and sixth rows.
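The optimization operation just described can be summarized in the following hedged Python sketch, which mirrors the transition from the second table 18 to the fourth table 25; the value of MW, the new key "G" and all function names are illustrative assumptions.

```python
MW = 0.09  # minimum value stored in the test means 20 (9%)

lexicon = {"A": "international", "B": "machines", "C": "business",
           "D": "connection", "E": "the", "F": "corporation"}

speech_model = {("A", "C"): 0.10, ("A", "C", "D"): 0.02, ("E", "A", "C"): 0.05}

def optimize(lexicon, speech_model, new_wi):
    """Define bigrams with UWI >= MW as one word and rewrite the model."""
    for wfi, uwi in list(speech_model.items()):
        if len(wfi) == 2 and uwi >= MW:
            # Word defining means 21: store the bigram as one word WI = G.
            lexicon[new_wi] = lexicon[wfi[0]] + " " + lexicon[wfi[1]]
            del speech_model[wfi]  # erase the bigram's own entry
            # Determining means 22: find and rewrite containing sequences.
            for other in list(speech_model):
                for i in range(len(other) - 1):
                    if other[i:i + 2] == wfi:
                        speech_model[other[:i] + (new_wi,) + other[i + 2:]] = \
                            speech_model.pop(other)
                        break

optimize(lexicon, speech_model, "G")
print(lexicon["G"])  # international business
print(speech_model)  # {('G', 'D'): 0.02, ('E', 'G'): 0.05}
```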
As a result of the optimization operation described above, the advantage is obtained that the trigrams WFI = A+C+D and WFI = E+A+C are stored as the bigrams WFI = G+D and WFI = E+G in the speech model memory 11 and that a speech recognition operation for these word sequences is possible with less calculation circuitry of the calculation means 9. Furthermore, the advantage is obtained that the transition probability information UWI of the word sequence information WFI = A+C+D and WFI = E+A+C contained in the second table 18 continues to be assigned to the word sequence information WFI = G+D and WFI = E+G even after the vocabulary has been optimized, so that no information already determined is lost.

The speech recognition device 2 has training means 23 to which word information WI of a recognized text can be applied after a speech recognition operation, and which training means 23 are arranged for extending a word sequence stored in the speech model memory 11 by a word often occurring in a recognized text before or after this word sequence, and for storing transition probability information UWI of the extended word sequence in the speech model memory 11, if the number of words that can be stored for each word sequence in the speech model memory 11 so permits. Since bigrams, having two words per word sequence, and trigrams, having three words per word sequence, can be stored in the speech model memory 11, the training means 23 are arranged for extending a bigram by a word often occurring in a recognized text before or after the bigram. When the training means 23 detect, for example, that before the word sequence
"international business connection" stored as a bigram in the speech model memory 11 as the word sequence information WFI = G+D, the word "the" having the word information WI=E relatively often occurs in texts recognized by the calculation means 9, the training means 23 are arranged for storing a word sequence "the international business connection" indicated on
the seventh row of the fourth table 25 and having the word sequence information WI = E+G+D and associated transition probability information UWI of 3% determined by the training means 23.
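A hedged sketch of this extension step by the training means 23 follows; the counting threshold and the simple relative-frequency estimate for the new UWI are illustrative assumptions, since the patent does not specify how the 3% is determined.

```python
from collections import Counter

# Hedged sketch of the training means 23: a word that often occurs
# directly before a stored bigram in recognized text extends the bigram
# to a trigram, within the maximum of three words per stored sequence.
# Threshold and the relative-frequency UWI estimate are assumptions.

MAX_WORDS_PER_SEQUENCE = 3

def extend_bigrams(recognized_words, speech_model, min_count=3):
    counts = Counter()
    for i in range(1, len(recognized_words) - 1):
        bigram = tuple(recognized_words[i:i + 2])
        if bigram in speech_model and len(bigram) + 1 <= MAX_WORDS_PER_SEQUENCE:
            counts[(recognized_words[i - 1],) + bigram] += 1
    for trigram, c in counts.items():
        if c >= min_count:
            speech_model[trigram] = c / len(recognized_words)

# The merged word G ("international business") is treated as one token.
model = {("international business", "connection"): 0.02}   # bigram WFI = G+D
tokens = ["the", "international business", "connection"] * 4
extend_bigrams(tokens, model)
print(model)  # now also contains ("the", "international business", "connection")
```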
As a result, the advantage is obtained that a word sequence reduced from a trigram to a bigram in the speech model memory 11 by the word defining means 21 during an optimization operation can be extended by the training means 23. Consequently, word sequences of typical formulations having a high probability of occurrence can be stored in the speech model memory 11, which word sequences can contain considerably more words than can otherwise be stored in a word sequence of the speech model memory 11. As a result, the range of the speech model is increased and typical formulations can be recognized considerably better during a speech recognition operation. As only the word sequences are selected that have a high probability of occurrence, and as only these word sequences stored as words are extended beyond the maximum number of three words per word sequence defined for the speech model memory 11, it is advantageously achieved that the required memory capacity of the speech model memory 11 is considerably smaller than if all the possible combinations of, for example, a maximum of four words per word sequence of words stored in the lexicon memory 10 were stored in the speech model memory 11.
When the training means 23 detect, for example, that in recognized texts the word sequence "International Business Machines Corporation", of which the words are written with initial capitals because they indicate a company name, occurs relatively often, the four words can be stored as three word information signals WI = H ("International Business"), WI = I ("Machines") and WI = J ("Corporation") in the lexicon memory 10. This word sequence can then be stored as a trigram in the speech model memory 11 under the word sequence information WFI = H+I+J. This provides the advantage that although only word sequences with a maximum of three words can be stored in the speech model memory 11, the word sequence "International Business Machines Corporation" having four words can very well be recognized during a speech recognition operation and, in addition, the initial capitals of the words of the word sequence can also be detected.
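For illustration, this company-name example can be sketched as follows; the WI keys H, I and J are from the description, while the dictionary layout and the UWI placeholder value are assumptions.

```python
# Hedged sketch of the capitalized company-name example: four words are
# stored as three word information signals so that the sequence fits the
# three-word maximum of the speech model memory 11.

lexicon = {
    "H": "International Business",
    "I": "Machines",
    "J": "Corporation",
}
speech_model = {("H", "I", "J"): 0.01}  # UWI value is an assumed placeholder

recognized = " ".join(lexicon[k] for k in ("H", "I", "J"))
print(recognized)  # International Business Machines Corporation
```

Fig. 6 shows a personal computer 26 in which a vocabulary generator 27 in accordance with a second example of embodiment of the invention is realized. The vocabulary generator 27 is arranged for generating and storing word information WI that forms the vocabulary of a speech recognition device. The vocabulary generator 27 has an input terminal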
28 which forms input means and to which word information WI of stored text information can be applied as electric input signals.
The personal computer 26 has a hard disk 29 which is connected to the input terminal 28. The hard disk 29 contains much text information and many documents, respectively, generated, for example, with a text processing program. These documents are formed, for example, by letters, messages or other publications. The contents of these documents relate to a certain domain for which the vocabulary generator 27 is to generate a vocabulary. A certain domain may be, for example, the domain of radiology, botany or nuclear physics. The input terminal 28 of the vocabulary generator 27 is further connected to the
Internet and, through the Internet, to memory means 30 which may be formed, for example, by a data server of a university. The personal computer 26 is arranged, in a manner not further shown in Fig. 6, for delivering to an Internet search engine search words which state the specific domain. To the input terminal 28 are thus connected memory means 30 of the Internet in which documents of the specific domain are stored.
The vocabulary generator 27 includes a storage stage 31 in which the word information WI can be stored. The vocabulary generator 27 is arranged for reading word information WI of text information stored on the hard disk 29 and in the memory means 30 of the Internet and for storing this word information WI in the storage stage 31. The storage stage 31 then contains many documents, or much text information respectively, of the specific domain. The vocabulary generator 27 includes generating means 32 which are arranged for generating a vocabulary, in that at least word information WI of the text information of a first and a second word and transition probability information UWI, which indicates the probability of occurrence of the second word after the first word in a word sequence of the text information stored in the storage stage 31, can be determined, and which are arranged for storing at least the first and the second word as word information WI in a lexicon memory 10 and for storing the transition probability information UWI in a speech model memory 11. For this purpose, the generating means 32 test all the relevant documents stored in the storage stage 31 to establish which words very often occur in the specific domain and store these words as word information WI in the lexicon memory 10. These words form the vocabulary of the specific domain.
The vocabulary generator 27 further includes a background lexicon memory 33 in which much word information WI and assigned phoneme information PI(WI) of a general vocabulary is stored. When the generating means 32 detect a word that often occurs in the
documents stored in the storage stage 31 and have stored it in the lexicon memory 10, the generating means 32 search for this word in the background lexicon memory 33. When this word is found in the background lexicon memory 33, the generating means 32 determine the assigned phoneme information PI(WI) of this word on the basis of the background lexicon memory 33, and store the phoneme information PI(WI) of this word, assigned to its word information WI, in the lexicon memory 10. When this word cannot be found in the background lexicon memory 33, the generating means 32 calculate phoneme information PI(WI) for this word in accordance with statistical methods and store the calculated phoneme information PI(WI) of this word, assigned to its word information WI, in the lexicon memory 10. The generating means 32 further test which word sequences often occur in the documents stored in the storage stage 31. The generating means 32 store often occurring word sequences as bigrams and trigrams in the speech model memory 11. Transition probability information UWI determined by the generating means 32 for the bigrams and trigrams is stored in the speech model memory 11 while being assigned to these bigrams and trigrams. When the vocabulary generator 27 has finished generating a vocabulary for a certain domain, word information WI and phoneme information PI(WI) is stored in the lexicon memory 10, for example, in accordance with the information contained in the first table 17, and word sequence information WFI and transition probability information UWI is stored in the speech model memory 11, for example, in accordance with the information contained in the second table 18.
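This generation step can be sketched as follows; the counting thresholds, the background lexicon contents and the trivial letter-to-sound fallback stand in for the statistical methods the patent leaves unspecified.

```python
from collections import Counter

# Hedged sketch of the generating means 32: frequent words of the domain
# documents form the vocabulary; phoneme information PI(WI) comes from
# the background lexicon memory 33 when available, otherwise from a
# stand-in for a statistical grapheme-to-phoneme method. Thresholds and
# the fallback are illustrative assumptions.

background_lexicon = {"business": ("b", "ih", "z", "n", "ax", "s")}

def fallback_phonemes(word):
    return tuple(word)  # trivially one "phoneme" per letter

def generate_vocabulary(documents, min_word=2, min_ngram=2):
    words, ngrams = Counter(), Counter()
    for doc in documents:
        tokens = doc.lower().split()
        words.update(tokens)
        for n in (2, 3):  # bigrams and trigrams
            for i in range(len(tokens) - n + 1):
                ngrams[tuple(tokens[i:i + n])] += 1
    lexicon = {w: background_lexicon.get(w, fallback_phonemes(w))
               for w, c in words.items() if c >= min_word}
    total = sum(words.values())
    speech_model = {g: c / total for g, c in ngrams.items() if c >= min_ngram}
    return lexicon, speech_model

docs = ["the business connection", "the business model", "a business model"]
lexicon, speech_model = generate_vocabulary(docs)
print(sorted(lexicon))   # ['business', 'model', 'the']
print(speech_model)      # frequent bigrams with relative-frequency UWI
```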
The vocabulary generator 27 has an optimization device 19 which is used for optimizing a vocabulary generated by the vocabulary generator 27. The optimization device 19 of the second example of embodiment of the invention corresponds to the whole optimization device 19 of the first example of embodiment of the invention. When the vocabulary generator 27 has finished generating a vocabulary, the generating means 32 apply activation information AI to the test means 20 of the optimization device 19, after which the optimization operation described with reference to the first example of embodiment commences and the vocabulary is optimized.
As the optimization device 19 is included in the vocabulary generator 27, the advantage is obtained that a vocabulary generated by the vocabulary generator 27 is optimized and typical formulations can be better recognized in the documents stored in the storage stage 31 when a speech recognizer that uses the vocabulary generated by the vocabulary generator 27 carries out a recognition operation. Furthermore, in an advantageous manner, the calculation circuitry for recognizing a typical formulation in speech information SI of a spoken
text is considerably smaller with a speech recognition operation of a speech recognizer that uses the vocabulary generated by the vocabulary generator 27.
The generating means 32, after at least a third word has been defined and stored by the word defining means 21, are arranged for extending a word sequence stored in the speech model memory 11 and including the third word by one word, which word often occurs before or after this word sequence in stored text information applied as an input signal to the generating means 32, and for storing the extended word sequence in the speech model memory 11 if the number of words that can be stored per word sequence in the speech model memory 11 so permits. After an optimization operation of the optimization device 19, text information or documents, respectively, stored in the storage stage 31 are again tested by the generating means 32, and a word often occurring in these documents before or after a bigram stored in the speech model memory 11 is stored in the speech model memory 11 as a trigram, which is a combination of this word and the bigram. Transition probability information UWI of the trigram is determined by the generating means 32 and stored in the speech model memory 11 while being assigned to the trigram.
This achieves the advantage that the so-called range of the speech model is further extended and typical formulations are even better recognizable with a speech recognizer that uses the vocabulary generated and optimized by the vocabulary generator 27.
It may be observed that the test means 20 of the vocabulary generator 27, after a positive result of the test of the transition probability information UWI of a word sequence, can also be used for determining occurrence probability information indicating how often this word sequence occurs in the stored text information, and for further testing whether the determined occurrence probability information has a minimum-occurrence value, and could be arranged for defining the words of this word sequence as a third word only when there is a positive result of the further test. Such an optimization device with such test means would then advantageously define as one word only those word sequences whose transition probability information UWI has a minimum value and which also relatively often occur in stored text information and thus also relatively often in speech information SI of a spoken text. A minimum-occurrence value may be, for example, a value of 2%, which indicates that in 100 words of stored text information the words of the word sequence occur twice as a word sequence.
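Such a further test could be sketched as below; the token counting and the names are illustrative assumptions.

```python
# Hedged sketch of the further test: a word sequence passes only if it
# occurs in the stored text information at least as often as the
# minimum-occurrence value (here 2%, i.e. twice per 100 words).

MIN_OCCURRENCE = 0.02

def passes_occurrence_test(word_sequence, tokens):
    n = len(word_sequence)
    hits = sum(1 for i in range(len(tokens) - n + 1)
               if tuple(tokens[i:i + n]) == tuple(word_sequence))
    return bool(tokens) and hits / len(tokens) >= MIN_OCCURRENCE

tokens = ("international business news about other topics " * 20).split()
print(passes_occurrence_test(("international", "business"), tokens))  # True
```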
It may be observed that, for example, the typical formulation "Sehr geehrte Damen und Herren" ("Dear Sir or Madam") can be stored in a speech model memory by word defining means during
an optimization operation, and can be recognized very well with little calculation circuitry during a next speech recognition operation.
It may be observed that the optimization device according to the invention can be used very well for optimizing a vocabulary containing composite components, whereby a range of the speech model extending beyond a composite word built from composite components is achieved.
Claims
1. An optimization device (19) for optimizing a vocabulary of a speech recognition device (2), comprising a lexicon memory (10) in which the word information (WI) of at least a first and a second word forming the vocabulary of a speech recognition device (2) can be stored, and comprising a speech model memory (11) in which at least a probability of occurrence of the second word after the first word in a word sequence formed by these words can be stored as transition probability information (UWI), and comprising word defining means (21) for defining a third word and for storing in the lexicon memory (10) the third word as word information (WI) and for storing in the speech model memory (11) at least transition probability information (UWI) of the probability of occurrence of the third word in a word sequence after at least the first or the second word stored in the lexicon memory (10), characterized in that test means (20) are provided which are arranged for testing whether transition probability information (UWI) of a word sequence stored in the speech model memory (11) has a minimum value (MW), and in that the word defining means (21) are arranged for defining the words of this word sequence as a third word when the test means (20) give a positive test result.
2. An optimization device (19) as claimed in claim 1, characterized in that determining means (22) are provided which, in the event of a positive result of the test of the test means (20), are arranged for determining, as appropriate, at least transition probability information (UWI) already stored in the speech model memory (11), which transition probability information indicates the probability of occurrence of a specific word stored as word information (WI) in the lexicon memory (10) before or after respectively, the word sequence defined as the third word, which determined transition probability information (UWI) can be stored in the speech model memory (11) as a probability of occurrence of specific words before or after respectively, the third word in a word sequence.
3. A speech recognition device (2) for recognizing phoneme information (PI) contained in speech information (SI) of a spoken text and for delivering word information (WI) of a recognized text, comprising input means (3), to which speech information (SI) of a spoken text can be applied as input signals, and comprising speech recognition means (6) for recognizing phoneme information (PI) of the spoken text contained in the input signals and for delivering word information (WI) of a recognized text, and comprising output means (15) from which word information (WI) of a recognized text can be delivered as an output signal, characterized in that an optimization device (19) in accordance with claim 1 is provided.
4. A speech recognition device (2) as claimed in claim 3, characterized in that training means (23) are provided to which word information (WI) of a recognized text can be applied and which are arranged for extending a word sequence stored in the speech model memory (11) by a word often present in recognized text before or after this word sequence, and for storing transition probability information (UWI) of the extended word sequence in the speech model memory (11), if the number of words per word sequence that can be stored in the speech model memory (11) so permits.
5. A vocabulary generator (27) for generating and storing word information (WI) forming the vocabulary of a speech recognition device (2), comprising input means (28) to which word information (WI) of stored text information can be applied as an input signal, and comprising generating means (32) which are arranged for generating a vocabulary, so that at least word information (WI) of the text information of a first and a second word and transition probability information (UWI) indicating the probability of occurrence of the second word after the first word in a word sequence of the text information can be determined, and which are arranged for storing at least the first and the second word as word information (WI) in a lexicon memory (10) and the transition probability information (UWI) in a speech model memory (11), characterized in that an optimization device (19) as claimed in claim 1 is provided.
6. A vocabulary generator (27) as claimed in claim 5, characterized in that the generating means (32), after at least a third word has been defined and stored by the word defining means (21), are arranged for extending a word sequence including the third word, which word sequence is stored in the speech model memory (11), by a word that often occurs before or after this word sequence in stored text information applied as an input signal to the generating means (32), and arranged for storing the extended word sequence in the speech model memory (11) if the number of words per word sequence that can be stored in the speech model memory (11) so permits.
7. A vocabulary generator (27) as claimed in claim 5, characterized in that, after a positive result of the test of the transition probability information (UWI) of a word sequence, the test means (20) are arranged for determining probability-of-occurrence information indicating how often this word sequence occurs in the stored text information, and for further testing whether the determined probability-of-occurrence information has a minimum-occurrence value, and are not arranged for defining the words of this word sequence as a third word until the result of the further test is positive.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP98890348.0 | 1998-11-24 | ||
| EP98890348 | 1998-11-24 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2000031727A1 (en) | 2000-06-02 |
Family ID: 8237219
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/EP1999/008640 (WO2000031727A1, Ceased) | Optimization device for optimizing a vocabulary of a speech recognition device | 1998-11-24 | 1999-11-11 |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2000031727A1 (en) |
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP0935238A2 (en) * | 1998-02-06 | 1999-08-11 | Philips Patentverwaltung GmbH | Method of speech recognition using statistical language models |
Non-Patent Citations (5)
| Title |
|---|
| DELIGNE S ET AL: "LANGUAGE MODELING BY VARIABLE LENGTH SEQUENCES: THEORETICAL FORMULATION AND EVALUATION OF MULTIGRAMS", PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP),US,NEW YORK, IEEE, 1995, pages 169 - 172, XP000657957, ISBN: 0-7803-2432-3 * |
| GIACHIN E P: "PHRASE BIGRAMS FOR CONTINUOUS SPEECH RECOGNITION", PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP),US,NEW YORK, IEEE, 1995, pages 225 - 228, XP000657971, ISBN: 0-7803-2432-3 * |
| HWANG K: "VOCABULARY OPTIMIZATION BASED ON PERPLEXITY", IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP),US,LOS ALAMITOS, IEEE COMP. SOC. PRESS, 1997, pages 1419 - 1422, XP000822723, ISBN: 0-8186-7920-4 * |
| KLAKOW D: "Language-model optimization by mapping of corpora", PROCEEDINGS OF THE 1998 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP '98 (CAT. NO.98CH36181), PROCEEDINGS OF THE 1998 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, SEATTLE, WA, USA, 12-1, 1998, New York, NY, USA, IEEE, USA, pages 701 - 704 vol.2, XP002124207, ISBN: 0-7803-4428-6 * |
| RIES K ET AL: "Class phrase models for language modeling", PROCEEDINGS ICSLP 96. FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING (CAT. NO.96TH8206), PROCEEDING OF FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING. ICSLP '96, PHILADELPHIA, PA, USA, 3-6 OCT. 1996, 1996, New York, NY, USA, IEEE, USA, pages 398 - 401 vol.1, XP002130925, ISBN: 0-7803-3555-4 * |
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP1550489A4 (en) * | 2002-09-13 | 2005-10-19 | Konami Corp | Game device, game device control method, program, program distribution device, information storage medium |
| US7753796B2 (en) | 2002-09-13 | 2010-07-13 | Konami Digital Entertainment Co., Ltd. | Game device, game device control method, program, program distribution device, information storage medium |
| WO2015171875A1 (en) * | 2014-05-07 | 2015-11-12 | Microsoft Technology Licensing, Llc | Language model optimization for in-domain application |
| US9972311B2 (en) | 2014-05-07 | 2018-05-15 | Microsoft Technology Licensing, Llc | Language model optimization for in-domain application |
| US9734826B2 (en) | 2015-03-11 | 2017-08-15 | Microsoft Technology Licensing, Llc | Token-level interpolation for class-based language models |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | AK | Designated states | Kind code of ref document: A1; Designated state(s): JP KR |
| | AL | Designated countries for regional patents | Kind code of ref document: A1; Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE |
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
| | 122 | Ep: pct application non-entry in european phase | |