WO2012093451A1 - Speech recognition system, speech recognition method, and speech recognition program - Google Patents
- Publication number
- WO2012093451A1 (PCT/JP2011/007203)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- word
- hypothesis
- section
- repair
- transparent
- Prior art date
- Legal status
- Ceased
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/065—Adaptation
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/19—Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
- G10L15/197—Probabilistic grammars, e.g. word n-grams
Definitions
- The present invention relates to a speech recognition system, a speech recognition method, and a speech recognition program.
- In recent years, speech recognition technology has come to be applied not only to read-style speech directed from a person to a machine but also to natural utterances between people.
- Rephrasing is a phenomenon in which a word string is re-uttered, either repeated as it is or replaced with another word string.
- The repair target section is a section that is restated by a subsequent utterance.
- The repair section is an utterance section in which the preceding utterance section is restated.
- The non-fluent section is a section, such as a filler or an interjection, in which some sound is uttered between the repair target section and the repair section in order to connect to the subsequent repair section, without itself rephrasing the preceding utterance.
- The repair target section may also be referred to as the section before rephrasing.
- The repair section may also be referred to as the section after rephrasing.
- The non-fluent section may be included in the section before rephrasing or in the section after rephrasing; it may also belong to neither and form a separate section, or it may be absent altogether.
- The span from the repair target section to the repair section may be referred to simply as the rephrasing section.
- Non-Patent Document 2 describes a language analysis system that analyzes, in a unified manner, sentences containing disfluencies such as rephrasing.
- The system described in Non-Patent Document 2 performs language analysis on text input and is realized as an extension of dependency analysis.
- As in Non-Patent Document 2, it is common for a language analysis system to examine long-distance information, as dependency analysis does; a speech recognition system, however, commonly uses the N-gram language model. For this reason, a speech recognition system using the N-gram language model cannot see long-distance information and cannot uniformly analyze speech containing disfluencies such as rephrasing.
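The locality of the N-gram model can be sketched with a toy bigram scorer; the vocabulary and probabilities below are illustrative assumptions, not values from the patent.

```python
import math

# Toy bigram (N = 2) language model: log-probabilities for word pairs.
# All vocabulary items and probabilities are made-up illustration values.
BIGRAM_LOGP = {
    ("<s>", "pen"): math.log(0.2),
    ("pen", "de"): math.log(0.5),
    ("de", "kaite"): math.log(0.4),
}
DEFAULT_LOGP = math.log(1e-4)  # back-off score for unseen pairs

def sentence_logp(words):
    """Score a word chain: each word is conditioned only on its
    predecessor, so anything farther back (e.g. a rephrased phrase)
    is invisible to the model."""
    logp, prev = 0.0, "<s>"
    for w in words:
        logp += BIGRAM_LOGP.get((prev, w), DEFAULT_LOGP)
        prev = w
    return logp

# A disfluent insertion destroys the local context even though the
# surrounding words are unchanged:
assert sentence_logp(["pen", "n-", "de", "kaite"]) < sentence_logp(["pen", "de", "kaite"])
```

This is why a disfluency sitting between otherwise fluent words drags the whole chain's likelihood down under a plain N-gram model.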
- Accordingly, an object of the present invention is to provide a speech recognition system, a speech recognition method, and a speech recognition program that are robust against rephrasing even when the N-gram language model is used as the language model of the speech recognition system.
- The speech recognition system according to the present invention includes: hypothesis search means for searching for an optimal solution while generating hypotheses, each being a chain of words to be searched as a recognition result candidate for input speech data; rephrase determination means for calculating the rephrase likelihood of a word or word string included in a hypothesis being searched and determining whether the word or word string is a rephrasing; and transparent word hypothesis generation means for generating, when the word or word string is determined to be a rephrasing, a transparent word hypothesis, that is, a hypothesis in which a word or word string included in the non-fluent section or the repair section of the rephrasing section containing that word or word string is treated as a transparent word. The hypothesis search means searches for the optimal solution with the transparent word hypothesis generated by the transparent word hypothesis generation means included among the hypotheses to be searched.
- In the speech recognition method according to the present invention, an optimal solution is searched for input speech data while hypotheses, each being a chain of words to be searched as a recognition result candidate, are generated. The rephrase likelihood of a word or word string included in a hypothesis being searched is calculated to determine whether the word or word string is a rephrasing. When it is determined to be a rephrasing, a transparent word hypothesis is generated in which a word or word string included in the non-fluent section or the repair section of the rephrasing section containing that word or word string is treated as a transparent word, and the optimal solution is searched with the generated transparent word hypothesis included among the hypotheses to be searched.
- The speech recognition program according to the present invention causes a computer to execute: a hypothesis search process of searching for an optimal solution while generating hypotheses, each being a chain of words to be searched as a recognition result candidate for input speech data; a process of calculating the rephrase likelihood of a word or word string included in a hypothesis being searched and determining whether the word or word string is a rephrasing; and a transparent word hypothesis generation process of generating, when the word or word string is determined to be a rephrasing, a transparent word hypothesis in which a word or word string included in the non-fluent section or the repair section of the rephrasing section containing that word or word string is treated as a transparent word. The hypothesis search process searches for the optimal solution with the generated transparent word hypothesis included among the hypotheses to be searched.
- According to the present invention, a speech recognition system, a speech recognition method, and a speech recognition program can be provided that are robust against rephrasing even when the N-gram language model is used as the language model of the speech recognition system.
- FIG. 10 is an explanatory diagram illustrating an example of a hypothesis generated by regarding a word string of a non-fluent section and a repair section as a transparent word.
- FIG. 10 is an explanatory diagram illustrating an example of a hypothesis generated by regarding the word strings of a repair target section and a non-fluent section as transparent words. The drawings also include a block diagram showing an outline of the present invention.
- FIG. 1 is a block diagram illustrating a configuration example of a speech recognition system according to a first embodiment of this invention.
- The voice recognition system shown in FIG. 1 includes a voice input unit 1, a voice recognition unit 2, and a result output unit 3.
- The voice recognition unit 2 includes a hypothesis search unit 21, a determination unit 22, and a hypothesis generation unit 23.
- The voice input unit 1 captures the voice of the speaker as voice data.
- The voice data is captured as, for example, a sequence of acoustic feature values.
- The voice recognition unit 2 performs voice recognition with the voice data captured by the voice input unit 1 as input, and outputs a voice recognition result.
- The result output unit 3 outputs the voice recognition result.
- The hypothesis search unit 21 calculates the likelihood of each hypothesis, expands hypotheses by connecting phonemes and words to each hypothesis, and searches for the solution.
- The determination unit 22 assumes a repair target section, a non-fluent section, and a repair section in the word chain of each hypothesis, obtains the rephrase likelihood under that assumption, and determines that a rephrasing has occurred when the likelihood is equal to or greater than a threshold.
- The hypothesis generation unit 23 generates hypotheses in which the words in the word strings of the non-fluent section and the repair section are treated as transparent words.
- The rephrase likelihood can be calculated using indices such as acoustic information (the presence or absence of silent intervals and of sudden changes in power, pitch, or speech speed), the types of words in the non-fluent section, and the closeness of the words in the repair target section and the repair section. These indices may be used singly, or combined linearly or nonlinearly.
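As a concrete illustration of combining such indices linearly, the following sketch uses hypothetical feature names, weights, and threshold; none of these values come from the patent.

```python
# Hypothetical rephrase-likelihood indices combined linearly.
# Feature names, weights, and the threshold are illustrative
# assumptions, not parameters from the patent.
WEIGHTS = {
    "silence_len": 1.5,      # length of the silent interval (seconds)
    "pitch_reset": 0.8,      # 1.0 if pitch changes suddenly at the boundary
    "filler_word": 2.0,      # 1.0 if the non-fluent section is a known filler
    "word_similarity": 1.2,  # closeness of repair target and repair words
}
THRESHOLD = 2.0

def rephrase_score(features):
    """Weighted sum of the indices; a nonlinear combiner could be
    substituted here without changing the surrounding search flow."""
    return sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS)

features = {"silence_len": 0.3, "filler_word": 1.0, "word_similarity": 0.7}
assert rephrase_score(features) >= THRESHOLD  # treated as a rephrasing
```

A pause plus a known filler plus phonetically close words pushes the score over the threshold, which is the condition under which transparent word hypotheses are generated below.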
- The voice input unit 1 is realized by, for example, a voice input device such as a microphone.
- The voice recognition unit 2 (including the hypothesis search unit 21, the determination unit 22, and the hypothesis generation unit 23) is realized by, for example, an information processing apparatus such as a CPU that operates according to a program.
- The result output unit 3 is realized by, for example, an information processing apparatus such as a CPU that operates according to a program, and an output device such as a monitor.
- FIG. 2 is a flowchart showing an example of the operation of the speech recognition system of the present embodiment.
- First, the voice input unit 1 captures a speaker's utterance as voice data (step A101).
- Next, the voice recognition unit 2 performs voice recognition with the captured voice data as input.
- In the voice recognition unit 2, the hypothesis search unit 21 first calculates the likelihood of the intra-word hypotheses, in which the word is not yet fixed, for the input voice data (step A102).
- Next, the hypothesis search unit 21 gives a language likelihood based on the confirmed word to each hypothesis that has reached a word end (step A103).
- An intra-word hypothesis is the unit by which, while the speech data is searched along the time axis from the front, words sharing the same leading phonemes are handled as one hypothesis in the portion where the word is not yet determined.
- The hypothesis search unit 21 performs the likelihood calculation for an intra-word hypothesis, where the word is not yet fixed, in the form of "acoustic likelihood + approximate language likelihood". When a hypothesis reaches the end of a word and the word is finalized, the language likelihood of the word chain is calculated exactly and summed as "acoustic likelihood + language likelihood".
- Next, the determination unit 22 enumerates, in order, the sets of repair target section, non-fluent section, and repair section in the confirmed word string, and takes out the first set (step A104).
- Here, the determination unit 22 treats a hypothesis generated by the hypothesis search unit 21 (that is, a hypothesis being searched) as one kind of word string and, based on predetermined rephrasing section setting information, assumes a repair target section, a non-fluent section, and a repair section. The repair section is assumed to include the confirmed word.
- The repair target section, the non-fluent section, and the repair section may each be, for example, a section of consecutive words; allowing up to L words in the repair target section, up to M words in the non-fluent section, and up to N words in the repair section, all combinations of the numbers of words the sections can take may be enumerated (L, M, N ≥ 0).
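This enumeration can be sketched as follows; the function name and the convention that the repair section must contain at least one confirmed word are assumptions made here for illustration.

```python
from itertools import product

def enumerate_reword_sections(words, L, M, N):
    """Enumerate assumed (repair target, non-fluent, repair) sections as
    contiguous spans ending at the newest confirmed word, allowing up to
    L, M, and N words per section. Requiring the repair section to hold
    at least one word is an assumption made for this sketch."""
    end = len(words)
    combos = []
    for l, m, n in product(range(L + 1), range(M + 1), range(1, N + 1)):
        if l + m + n > end:
            continue  # the assumed sections cannot exceed the word string
        start = end - (l + m + n)
        combos.append((words[start:start + l],           # repair target
                       words[start + l:start + l + m],   # non-fluent
                       words[start + l + m:end]))        # repair
    return combos

combos = enumerate_reword_sections(["pen", "n-", "aoi"], 1, 1, 1)
assert (["pen"], ["n-"], ["aoi"]) in combos
```

Each tuple in the result is one assumed rephrasing section set, handed in turn to the rephrase likelihood calculation.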
- In the following, a combination of repair target section, non-fluent section, and repair section enumerated in step A104 may be referred to as an assumed rephrasing section set, and the section formed by connecting them may be referred to as an assumed rephrasing section.
- Next, the determination unit 22 calculates the rephrase likelihood for the assumed rephrasing section set extracted in step A104 (step A105).
- The rephrase likelihood can be calculated using indices such as acoustic information (whether there is a silent section, and whether there are sudden changes in power, pitch, or speech speed), the types of words in the non-fluent section, and the closeness of the words in the repair target section and the repair section.
- Next, the determination unit 22 determines whether the calculated rephrase likelihood is equal to or greater than a threshold (step A106).
- If it is equal to or greater than the threshold, the hypothesis generation unit 23 generates a hypothesis in which the non-fluent section and the repair section of that assumed rephrasing section set are treated as transparent words (step A107).
- A transparent word is a word that is treated as linguistically absent in the speech recognition process. When calculating the language likelihood of a hypothesis containing a transparent word, the likelihood is therefore computed with that word removed. More specifically, the hypothesis search unit 21 calculates the language likelihood of the hypothesis with the N-gram language model as if the transparent word did not exist.
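A minimal sketch of likelihood calculation with transparent words, assuming a toy bigram model with made-up scores: the flagged word is skipped so that the N-gram context chains across it.

```python
# Illustrative bigram log-scores (made-up values, not from the patent).
BIGRAM_LOGP = {("<s>", "pen"): -1.0, ("pen", "kaite"): -1.2,
               ("pen", "n-"): -6.0, ("n-", "kaite"): -6.5}
DEFAULT_LOGP = -10.0

def logp_with_transparent(words, transparent):
    """Score a hypothesis; a word flagged transparent is skipped, so the
    N-gram context chains across it as if the word were absent."""
    logp, prev = 0.0, "<s>"
    for word, is_transparent in zip(words, transparent):
        if is_transparent:
            continue  # transparent word: no language cost, context unchanged
        logp += BIGRAM_LOGP.get((prev, word), DEFAULT_LOGP)
        prev = word
    return logp

plain = logp_with_transparent(["pen", "n-", "kaite"], [False, False, False])
skipped = logp_with_transparent(["pen", "n-", "kaite"], [False, True, False])
assert skipped > plain  # skipping the filler restores the fluent context
```

The transparent word hypothesis thus recovers the language likelihood the fluent word chain would have had, which is what lets it compete in the search.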
- In step A108, the determination unit 22 checks whether any of the enumerated assumed rephrasing section sets remain unprocessed. If any remain (Yes in step A108), it takes one of the remaining sets and returns to step A105.
- When the processing from step A105 to step A107 has been completed for all of the enumerated assumed rephrasing section sets (No in step A108), the process proceeds to step A109.
- In step A109, the hypothesis search unit 21 determines whether the hypothesis search has been completed up to the end of the speech. If it has not (No in step A109), the hypothesis search unit 21 adds the hypotheses generated in step A107, returns to step A102, and performs the hypothesis search for the next speech frame (the processing from steps A102 to A108 is performed on the next speech frame).
- If the hypothesis search has been completed up to the end of the speech (Yes in step A109), the result output unit 3 outputs the hypothesis that finally has the maximum likelihood as the speech recognition result (step A110).
- As described above, in this embodiment, rephrasing sections are assumed one after another for the hypotheses under search, the rephrase likelihood is calculated, and the words of the non-fluent section and the repair section of each section determined to be a rephrasing are treated as transparent words, so that the search can proceed as if the rephrased portion were absent.
- FIG. 3 is a block diagram illustrating a configuration example of the speech recognition system according to the second embodiment of this invention.
- The speech recognition system shown in FIG. 3 differs from the first embodiment shown in FIG. 1 in that the speech recognition unit 2 further includes a result generation unit 24.
- In this embodiment, the hypothesis generation unit 23 generates not only hypotheses that treat the words in the word strings of the non-fluent section and the repair section as transparent words, but also hypotheses that treat the words in the word strings of the repair target section and the non-fluent section as transparent words.
- The result generation unit 24 generates a speech recognition result that combines the maximum likelihood hypothesis obtained when the word string on the repair target section side is treated as transparent words with the maximum likelihood hypothesis obtained when the word string on the repair section side is treated as transparent words.
- FIG. 4 is a flowchart showing an example of the operation of the voice recognition system of this embodiment.
- The operation of this embodiment differs from that of the first embodiment in two respects: the system holds a transparency flag that determines whether to generate hypotheses treating the words of the repair target section and the non-fluent section (the repair target section side) as transparent words, or hypotheses treating the words of the non-fluent section and the repair section (the repair section side) as transparent words; and two maximum likelihood hypotheses are obtained, one when the repair-target-section-side word string is treated as transparent words and one when the repair-section-side word string is treated as transparent words.
- First, the voice input unit 1 captures the utterance of the speaker as voice data (step A201).
- Next, the voice recognition system sets the transparency flag held in the system to the repair section side at the timing when the voice data is captured (step A202).
- The transparency flag is information indicating whether transparent words are created on the repair target section side or on the repair section side.
- Next, the hypothesis search unit 21 calculates the likelihood of the intra-word hypotheses, in which the word is not yet fixed, for the input voice data (step A203). The hypothesis search unit 21 then gives a language likelihood based on the confirmed word to each hypothesis that has reached a word end (step A204).
- Next, the determination unit 22 enumerates, in order, the sets of repair target section, non-fluent section, and repair section in the confirmed word string, and takes out the first set (step A205).
- These sections include the confirmed words. The repair target section, the non-fluent section, and the repair section may each be a section of consecutive words; for example, allowing up to L words in the repair target section, up to M words in the non-fluent section, and up to N words in the repair section, all combinations may be enumerated (L, M, N ≥ 0).
- Next, the determination unit 22 calculates the rephrase likelihood for the enumerated repair target section, non-fluent section, and repair section (step A206).
- The rephrase likelihood can be calculated using indices such as acoustic information (whether there is a silent section, and whether there are sudden changes in power, pitch, or speech speed), the types of words in the non-fluent section, and the closeness of the words in the repair target section and the repair section.
- Next, the determination unit 22 determines whether the calculated rephrase likelihood is equal to or greater than a threshold (step A207). If it is (Yes in step A207), the hypothesis generation unit 23 generates a hypothesis in which the repair target section and the non-fluent section are treated as transparent words when the transparency flag held in the system is on the repair target section side, and a hypothesis in which the non-fluent section and the repair section are treated as transparent words when the flag is on the repair section side (step A208). For the hypotheses generated by the hypothesis generation unit 23, the hypothesis search unit 21 calculates the language likelihood with the N-gram language model as if the transparent words did not exist.
- If the rephrase likelihood is less than the threshold (No in step A207), the process proceeds to step A209.
- In step A209, the determination unit 22 checks whether any of the enumerated sets of repair target section, non-fluent section, and repair section remain. If any remain (Yes in step A209), the processing from step A205 to step A208 is performed on those sets.
- In step A210, the hypothesis search unit 21 determines whether the hypothesis search has been completed up to the end of the speech. If it has not (No in step A210), the processing from step A203 to step A209 is performed on the next speech frame.
- If the hypothesis search has been completed up to the end of the speech (Yes in step A210), it is determined whether the current transparency flag is on the repair section side (step A211). If it is, the flag is set to the repair target section side (step A212), and the processing from step A203 to step A210 is similarly performed on the input voice.
- When the hypothesis search with the transparency flag on the repair target section side is also complete, the result generation unit 24 compares the maximum likelihood hypothesis obtained in the earlier repair-section-side pass with the maximum likelihood hypothesis obtained in the later repair-target-section-side pass. The result generation unit 24 then checks whether the repair section is treated as transparent words in the repair-section-side maximum likelihood hypothesis and whether the repair target section is treated as transparent words in the repair-target-section-side maximum likelihood hypothesis, and generates a result that combines these two maximum likelihood hypotheses for the rephrasing section (step A213).
- When the repair section is not treated as transparent words in the repair-section-side maximum likelihood hypothesis, or the repair target section is not treated as transparent words in the repair-target-section-side maximum likelihood hypothesis, the result generation unit 24 regards these sections as not being a rephrasing section, skips the combination process, and generates the result using the maximum likelihood hypothesis from ordinary likelihood determination as the maximum likelihood hypothesis of the section. That is, the result generation unit 24 combines the two maximum likelihood hypotheses for a rephrasing section only after confirming that, in both of them, a hypothesis with the corresponding section treated as transparent words was selected as the maximum likelihood hypothesis.
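The combination step can be sketched as follows; the dictionary representation of a maximum likelihood hypothesis and the field names are assumptions made for illustration.

```python
def combine_hypotheses(repair_side, target_side):
    """Combine the two maximum likelihood hypotheses around the common
    non-fluent section (step A213). Each argument is a dict of labelled
    word lists; this representation and its field names are assumptions
    made for this sketch."""
    head = repair_side["repair_target"]   # words kept audible in the first pass
    filler = repair_side["non_fluent"]    # the common non-fluent section
    repair = target_side["repair"]        # words kept audible in the second pass
    tail = target_side["after"]           # common words after the repair section
    return head + filler + repair + tail

result = combine_hypotheses(
    {"repair_target": ["pen", "de"], "non_fluent": ["n-"]},
    {"repair": ["aoi", "no", "de"], "after": ["kaite"]},
)
assert result == ["pen", "de", "n-", "aoi", "no", "de", "kaite"]
```

This mirrors the combination order described above: repair target section, non-fluent section, repair section, then the common words following the repair section.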
- Finally, the result output unit 3 outputs the result generated by the result generation unit 24 (step A214).
- As described above, in this embodiment, both a transparent word hypothesis in which the repair target section and the non-fluent section are treated as transparent words and a transparent word hypothesis in which the non-fluent section and the repair section are treated as transparent words are generated.
- Therefore, when the repair target section and the non-fluent section are treated as transparent words, the N-gram language model can be applied across the word before the repair target section, the repair section, and the word after the repair section.
- Likewise, when the non-fluent section and the repair section are treated as transparent words, the N-gram language model can be applied across the word before the repair target section, the repair target section, and the word after the repair section.
- In this embodiment, transparent word hypotheses are generated first from the repair section side, but they may instead be generated first from the repair target section side.
- It is also possible to generate both types of transparent word hypotheses in one rephrase determination: one in which the word strings of the non-fluent section and the repair section are treated as transparent words, and one in which the word strings of the repair target section and the non-fluent section are treated as transparent words.
- In step A101, the voice input unit 1 captures the speaker's utterance "pen de, n-, aoi no de kaite" (roughly, "with a pen, um, write with a blue one") as voice data.
- Next, in step A102, the hypothesis search unit 21 calculates, with the acquired speech data as input, the likelihood of the intra-word hypotheses in which the word is not yet determined.
- This processing corresponds, for example, to performing the acoustic likelihood calculation with the phoneme model of /i/ in the word "kaite" ("write") of this utterance and adding the language likelihood of the preceding word chain such as "aoi no de" ("with a blue one").
- In step A103, the hypothesis search unit 21 gives a language likelihood based on the confirmed word to each hypothesis that has reached a word end.
- FIG. 5 is an explanatory diagram showing an example of hypotheses searched in this example.
- In FIG. 5, each ellipse indicates a word (word hypothesis) to be searched as a recognition result candidate.
- The numerical value attached to each word hypothesis represents the log likelihood of the word chain formed by linking that word hypothesis to the preceding word hypotheses.
- For example, following the preceding word chain "pen de" ("with a pen"), the word chain hypothesis "pen de, n-" is given a language likelihood, and a log likelihood of "-60" is given to it. Likewise, a word chain hypothesis such as "pan de, n-" ("pan" meaning "bread") may be calculated, and a log likelihood of "-50" is given to it.
- Next, in step A104, the determination unit 22 enumerates the possible combinations of repair target section, non-fluent section, and repair section in the confirmed word string and takes out the first set.
- The repair section includes the word confirmed in step A103. The repair target section, the non-fluent section, and the repair section may each be a section of consecutive words; all combinations may be enumerated as continuous sections allowing up to L words in the repair target section, up to M words in the non-fluent section, and up to N words in the repair section.
- Here, assume that the repair target section, the non-fluent section, and the repair section are one word each: for example, a section set with the word "pen" as the repair target section, "n-" as the non-fluent section, and the word "aoi" ("blue") as the repair section.
- In step A105, the determination unit 22 calculates the rephrase likelihood for the one assumed rephrasing section set of one hypothesis extracted in step A104.
- Acoustic information such as the length of silent sections and the presence or absence of sudden changes in power, pitch, and speech speed is used as indices of the rephrase likelihood.
- For example, using learning data in which repair target sections, non-fluent sections, and repair sections have been tagged in advance together with acoustic information, the length of silent sections and the time derivatives of power, pitch, and speech speed are modeled as features with a Gaussian mixture distribution, and the likelihood under that model is calculated.
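A minimal sketch of such a Gaussian-mixture score, with a hypothetical two-component diagonal-covariance model over two features (silence length and pitch change); all parameters are invented for illustration, not learned from tagged data.

```python
import math

# Hypothetical two-component diagonal-covariance Gaussian mixture over
# two acoustic features (silence length, pitch change). Every parameter
# here is invented for illustration.
GMM = [
    # (mixture weight, means, variances)
    (0.6, [0.30, -0.5], [0.01, 0.20]),
    (0.4, [0.10,  0.0], [0.04, 0.10]),
]

def gmm_log_likelihood(x):
    """Log-likelihood of a feature vector under the mixture model."""
    total = 0.0
    for weight, mu, var in GMM:
        logp = 0.0
        for xi, mi, vi in zip(x, mu, var):
            # per-dimension univariate Gaussian log-density
            logp += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
        total += weight * math.exp(logp)
    return math.log(total)

# A long pre-repair pause with a pitch reset scores higher under this
# "rephrase" model than a fluent continuation does:
assert gmm_log_likelihood([0.3, -0.5]) > gmm_log_likelihood([0.0, 1.5])
```

In practice the model would be trained on the tagged learning data mentioned above and the resulting likelihood compared against the threshold of step A106.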
- In step A106, the determination unit 22 determines whether the rephrase likelihood of the extracted assumed rephrasing section is equal to or greater than the threshold. If it is, the process proceeds to step A107; if it is less, the process proceeds to step A108.
- In step A107, the hypothesis generation unit 23 generates, for each hypothesis whose rephrase likelihood is equal to or greater than the threshold, a hypothesis that regards the word strings of the non-fluent section and the repair section as transparent words, removes the words so regarded from the language likelihood calculation, and recalculates the likelihood. The recalculation of the language likelihood of the generated hypothesis may be executed by the hypothesis search unit 21.
- For example, FIG. 7 shows an example of the hypotheses generated when the non-fluent section is assumed to be "n-" and the repair section is assumed to be "aoi no" ("a blue one"): the non-fluent section "n-" and the repair section "aoi no" are treated as transparent words.
- In step A108, the determination unit 22 checks whether other combinations of repair target section, non-fluent section, and repair section enumerated in step A104 remain. If any remain, it takes out one of the remaining combinations and repeats the processing from step A105 to step A107 in the same way.
- In step A109, the hypothesis search unit 21 determines whether the hypothesis search has been completed up to the end of the speech. If the end of the speech has not been reached, the process returns to step A102, and the hypothesis search unit 21 adds the hypotheses generated in step A107 and searches the hypotheses for the next speech frame. If the end of the speech has been reached, the process proceeds to step A110.
- In step A110, the result output unit 3 outputs the finally most likely hypothesis, "pen de kaite" ("write with a pen"), as the speech recognition result for the speech.
- In step A201, the voice input unit 1 captures the speaker's utterance "pen de, n-, aoi no de kaite" (roughly, "with a pen, um, write with a blue one") as voice data.
- In step A202, the speech recognition system sets the transparency flag, which determines whether to generate hypotheses treating the words of the repair target section and the non-fluent section (the repair target section side) as transparent words or hypotheses treating the words of the non-fluent section and the repair section (the repair section side) as transparent words, to the repair section side.
- The operation from step A203 to step A210 is the same as the operation from step A102 to step A109 in the first embodiment.
- In step A211, since the transparency flag is initially set to the repair section side, the process proceeds to step A212, where the transparency flag is set to the repair target section side.
- From step A203 to step A207, the operation is again the same as in the first embodiment.
- In step A208, since the transparency flag is now on the repair target section side, the hypothesis generation unit 23 generates, for each hypothesis whose rephrase likelihood is equal to or greater than the threshold, a hypothesis in which the word strings of the repair target section and the non-fluent section are regarded as transparent words. The hypothesis generation unit 23 then recalculates the likelihood with the words regarded as transparent words removed from the language likelihood calculation.
- FIG. 8 shows an example of the hypotheses generated for this utterance when the repair target section is assumed to be "pan" ("bread") or "pen" and the non-fluent section is assumed to be "n-".
- With "pan de" (or "pen de") and "n-" regarded as transparent words, the language likelihood is given as the word chain beginning with "aoi no de kaite" ("write with a blue one"). Accordingly, the log likelihood given to the word chain "pan de, n-" from the beginning of the sentence and to the word chain "pen de, n-" from the beginning of the sentence becomes "0", and a high log likelihood of "-20" is given to the word chain continuing with "aoi" ("blue").
- In step A209, it is determined whether there is another combination, as in the first embodiment. If there is no other combination, it is determined in step A210 whether the hypothesis search has been completed up to the end of the speech. If the hypothesis search is complete, the process proceeds to step A211. This time, since the transparency flag has already been switched to the repair target section side, the process proceeds from step A211 to step A213.
- In step A213, the result generation unit 24 generates a speech recognition result using two maximum likelihood hypotheses: “write with a pen” (in the English example, “a bed is made of wood, please write with a pen”), the maximum likelihood hypothesis when the transparency flag is on the repair target section side, and the maximum likelihood hypothesis when the transparency flag is on the repair section side (in the English example, “a brown one is made of wood, please write with a blue one”).
- Specifically, the result generation unit 24 first extracts “pen” (in the English example, “a bed”), the word string of the repair target section that is not a transparent word in the maximum likelihood hypothesis on the repair section side, together with “n-” (in the English example, “you know”). Next, the result generation unit 24 extracts “n-” (in the English example, “you know”), the word string of the non-fluent section treated as a transparent word in the maximum likelihood hypothesis on the repair target section side, and “blue” (in the English example, “a blue one”), the word string of the repair section that is not a transparent word.
- Then, the result generation unit 24 arranges the word strings in the order of the repair target section, the non-fluent section, and the repair section around the common non-fluent section, and appends the common word string after the repair section.
- As a result, “Pen n- blue so write” is obtained (in the English example, “a pen, a bed, you know, a brown one is made of wood, please write with a blue one”).
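The combination performed by the result generation unit can be sketched as follows. The section labels (TARGET, FILLER, REPAIR, REST) and the tuple representation are our own assumptions made for illustration, not notation from the patent:

```python
# Each hypothesis is a list of (word, section, is_transparent) tuples.

def merge_hypotheses(target_side, repair_side):
    """Combine the two maximum likelihood hypotheses into one result
    that keeps every word of the rephrased section visible."""
    # Repair target words survive in the repair-section-side hypothesis
    # (there, only the filler and the repair section were transparent).
    target_words = [w for w, sec, _ in repair_side if sec == "TARGET"]
    # The filler (non-fluent section) is common to both hypotheses.
    filler_words = [w for w, sec, _ in target_side if sec == "FILLER"]
    # Repair words survive in the repair-target-side hypothesis.
    repair_words = [w for w, sec, _ in target_side if sec == "REPAIR"]
    # Words after the repair section are shared; take them once.
    tail = [w for w, sec, _ in target_side if sec == "REST"]
    return target_words + filler_words + repair_words + tail

# Toy hypotheses for the "pen, uh, blue" example (illustrative only).
target_side = [  # repair target + filler were transparent here
    ("pen", "TARGET", True), ("n-", "FILLER", True),
    ("blue", "REPAIR", False), ("so", "REST", False), ("write", "REST", False),
]
repair_side = [  # filler + repair section were transparent here
    ("pen", "TARGET", False), ("n-", "FILLER", True),
    ("blue", "REPAIR", True), ("so", "REST", False), ("write", "REST", False),
]

print(" ".join(merge_hypotheses(target_side, repair_side)))
# prints: pen n- blue so write
```

The merged chain contains the repair target, the filler, and the repair section in utterance order, with no word left out as transparent.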
- In other words, it suffices to produce a speech recognition result in which every word in the rephrased section appears without being treated as transparent, by combining the word chain in the rephrased section indicated by the maximum likelihood hypothesis obtained from the search process that generates transparent word hypotheses treating the repair target section side as transparent words, with the word chain in the rephrased section indicated by the maximum likelihood hypothesis obtained from the search process that generates transparent word hypotheses treating the repair section side as transparent words.
- In step A214, the result generated in step A213 is output.
- In this example, “Pen n- blue so write” is output as the speech recognition result (in the English example, “a bed, a pen, you know, a brown one is made of wood, please write with a blue one”).
- As described above, the N-gram language model is appropriately applied to the word string before the repair target section, the repair target section, the non-fluent section, the repair section, and the word string after the repair section. Therefore, misrecognition in utterances that include rephrasing can be reduced.
- FIG. 9 is a block diagram showing an outline of the present invention.
- The speech recognition apparatus includes a hypothesis search means 101, a rephrase determination means 102, and a transparent word hypothesis generation means 103.
- The hypothesis search means 101 (for example, the hypothesis search unit 21) generates hypotheses, which are chains of words searched as recognition result candidates for the input speech data, and searches for an optimal solution. In addition, the hypothesis search means 101 searches hypotheses that include the transparent word hypotheses generated by the transparent word hypothesis generation means 103 described later.
- The rephrase determination means 102 calculates the rephrasing likelihood of a word or word string included in a hypothesis being searched by the hypothesis search means 101, and determines whether the word or word string is a rephrasing.
- When the rephrase determination means 102 determines that a rephrasing has occurred, the transparent word hypothesis generation means 103 (for example, the hypothesis generation unit 23) generates a transparent word hypothesis, which is a hypothesis in which a word or word string included in the non-fluent section or the repair section of the rephrased section containing the word or word string is treated as a transparent word.
- The rephrase determination means 102 may assume, for a word or word string included in a hypothesis being searched by the hypothesis search means 101, combinations of a repair target section, a non-fluent section, and a repair section that include the word or word string in the repair section, and the transparent word hypothesis generation means 103 may generate a hypothesis in which the word or word string included in the non-fluent section or the repair section of a combination determined to be a rephrasing by the rephrase determination means 102 is treated as a transparent word.
- The transparent word hypothesis generation means 103 may generate, as transparent word hypotheses, a repair-target-section-side transparent word hypothesis in which a word or word string included in the repair target section or the non-fluent section is treated as a transparent word, and a repair-section-side transparent word hypothesis in which a word or word string included in the non-fluent section or the repair section is treated as a transparent word, and the hypothesis search means 101 may search for an optimal solution with the repair-target-section-side transparent word hypothesis and the repair-section-side transparent word hypothesis generated by the transparent word hypothesis generation means 103 included in the hypotheses to be searched.
- FIG. 10 is a block diagram showing another configuration example of the speech recognition system according to the present invention.
- As shown in FIG. 10, the speech recognition system according to the present invention may include a result generation means 104 (for example, the result generation unit 24) that generates a speech recognition result.
- In that case, the hypothesis search means 101 performs a first search process of searching for an optimal solution with the generated repair-target-section-side transparent word hypothesis included in the hypotheses to be searched, and a second search process of searching for an optimal solution with the generated repair-section-side transparent word hypothesis included in the hypotheses to be searched, and the result generation means 104 may output a speech recognition result that combines the speech recognition result of the first search process with the speech recognition result of the second search process.
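The two search processes described above can be sketched as a minimal two-pass driver. The candidate list, per-word scores, and section labels are invented for illustration and do not come from the patent; a real system would run a full hypothesis search instead of picking from a fixed candidate set:

```python
# Toy candidate hypotheses: (words, parallel section labels).
CANDIDATES = [
    (["pan", "n-", "blue", "write"], ["TARGET", "FILLER", "REPAIR", "REST"]),
    (["pen", "n-", "blue", "write"], ["TARGET", "FILLER", "REPAIR", "REST"]),
    (["pen", "n-", "brown", "write"], ["TARGET", "FILLER", "REPAIR", "REST"]),
]

# Hypothetical per-word log scores used once transparent words are removed.
WORD_SCORE = {"pan": -3.0, "pen": -1.0, "blue": -1.0, "brown": -2.0,
              "n-": -5.0, "write": -0.5}

def score(words, sections, transparent_sections):
    # Transparent words are removed before scoring.
    visible = [w for w, s in zip(words, sections)
               if s not in transparent_sections]
    return sum(WORD_SCORE[w] for w in visible)

def best(transparent_sections):
    # Stand-in for a real hypothesis search: pick the best-scoring candidate.
    return max(CANDIDATES, key=lambda c: score(*c, transparent_sections))

# First search process: repair target section and filler are transparent,
# so the search effectively decides the repair section wording ("blue").
first = best({"TARGET", "FILLER"})
# Second search process: filler and repair section are transparent,
# so the search effectively decides the repair target wording ("pen").
second = best({"FILLER", "REPAIR"})
print(first[0], second[0])
```

Each pass is insensitive to the words it treats as transparent, which is why the two results must then be combined into one output that keeps every section visible.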
- For a section determined to be a rephrasing, when the maximum likelihood hypothesis indicated as the speech recognition result of the first search process is a repair-target-section-side transparent word hypothesis and the maximum likelihood hypothesis indicated as the speech recognition result of the second search process is a repair-section-side transparent word hypothesis, the result generation means 104 may combine the word chain in the rephrased section indicated by the repair-target-section-side transparent word hypothesis with the word chain in the rephrased section indicated by the repair-section-side transparent word hypothesis, and output a speech recognition result indicating a word chain that includes all words in the rephrased section without treating any of them as transparent words.
- The speech recognition system may also include a result output means (for example, the result output unit 3) that outputs the speech recognition result, and the result output means may output a speech recognition result to which information on the repair target section, the non-fluent section, or the repair section is added, in addition to the text information indicated by the word chain of the maximum likelihood hypothesis.
- In the speech recognition method according to the present invention, the hypothesis search means searches for an optimal solution while generating hypotheses, which are chains of words searched as recognition result candidates for the input speech data.
- When a rephrasing is determined, the transparent word hypothesis generation means generates a repair-target-section-side transparent word hypothesis in which a word or word string included in the repair target section or the non-fluent section is treated as a transparent word, and a repair-section-side transparent word hypothesis in which a word or word string included in the non-fluent section or the repair section is treated as a transparent word; the hypothesis search means performs a first search process of searching for an optimal solution with the generated repair-target-section-side transparent word hypothesis included in the hypotheses to be searched, and a second search process of searching for an optimal solution with the generated repair-section-side transparent word hypothesis included in the hypotheses to be searched; and the result output means may output a speech recognition result that combines the speech recognition result of the first search process with the speech recognition result of the second search process.
- The speech recognition program according to the present invention may cause a computer to execute: a rephrase determination process of calculating the rephrasing likelihood of a word or word string included in a hypothesis being searched and determining whether the word or word string is a rephrasing; a first transparent word hypothesis generation process of generating, when a rephrasing is determined, a repair-target-section-side transparent word hypothesis in which a word or word string included in the repair target section or the non-fluent section is treated as a transparent word; a second transparent word hypothesis generation process of generating, when a rephrasing is determined, a repair-section-side transparent word hypothesis in which a word or word string included in the non-fluent section or the repair section is treated as a transparent word; a first search process of searching for an optimal solution with the generated repair-target-section-side transparent word hypothesis included in the hypotheses to be searched; a second search process of searching for an optimal solution with the generated repair-section-side transparent word hypothesis included in the hypotheses to be searched; and a result output process of outputting a speech recognition result that combines the speech recognition result of the first search process with the speech recognition result of the second search process.
- The present invention can be widely applied to speech recognition systems in general.
- In particular, the present invention can be suitably applied to speech recognition systems that recognize speech spoken by people, such as lecture speech and dialogue speech.
Abstract
Description
FIG. 1 is a block diagram showing a configuration example of the speech recognition system according to the first embodiment of the present invention. The speech recognition system shown in FIG. 1 includes a speech input unit 1, a speech recognition unit 2, and a result output unit 3. The speech recognition unit 2 includes a hypothesis search unit 21, a determination unit 22, and a hypothesis generation unit 23.
Next, a second embodiment of the present invention will be described. FIG. 3 is a block diagram showing a configuration example of the speech recognition system according to the second embodiment of the present invention. The speech recognition system shown in FIG. 3 differs from the first embodiment shown in FIG. 1 in that the speech recognition unit 2 includes a result generation unit 24.
2 Speech recognition unit
21 Hypothesis search unit
22 Determination unit
23 Hypothesis generation unit
24 Result generation unit
3 Result output unit
101 Hypothesis search means
102 Determination means
103 Transparent word hypothesis generation means
104 Result generation means
Claims (10)
- A hypothesis search means for generating, for input speech data, hypotheses that are chains of words searched as recognition result candidates, and searching for an optimal solution;
a rephrase determination means for calculating the rephrasing likelihood of a word or word string included in a hypothesis being searched by the hypothesis search means, and determining whether the word or word string is a rephrasing; and
a transparent word hypothesis generation means for generating, when the rephrase determination means determines that a rephrasing has occurred, a transparent word hypothesis, which is a hypothesis in which a word or word string included in the non-fluent section or the repair section of the rephrased section containing the word or word string is treated as a transparent word;
wherein the hypothesis search means searches for an optimal solution with the transparent word hypothesis generated by the transparent word hypothesis generation means included in the hypotheses to be searched,
a speech recognition system characterized by the above. - The rephrase determination means assumes, for a word or word string included in a hypothesis being searched by the hypothesis search means, combinations of a repair target section, a non-fluent section, and a repair section that include the word or word string in the repair section, calculates the rephrasing likelihood for each assumed combination of repair target section, non-fluent section, and repair section, and determines, for each combination, whether it is a rephrasing by determining whether the calculated rephrasing likelihood is greater than or equal to a predetermined threshold,
and the transparent word hypothesis generation means generates a hypothesis in which a word or word string included in the non-fluent section or the repair section of a combination determined to be a rephrasing by the rephrase determination means is treated as a transparent word:
The speech recognition system according to claim 1. - The transparent word hypothesis generation means generates, as transparent word hypotheses, a repair-target-section-side transparent word hypothesis in which a word or word string included in the repair target section or the non-fluent section is treated as a transparent word, and a repair-section-side transparent word hypothesis in which a word or word string included in the non-fluent section or the repair section is treated as a transparent word,
and the hypothesis search means searches for an optimal solution with the repair-target-section-side transparent word hypothesis and the repair-section-side transparent word hypothesis generated by the transparent word hypothesis generation means included in the hypotheses to be searched:
The speech recognition system according to claim 1 or claim 2. - A result generation means for generating a speech recognition result is provided,
the hypothesis search means performs a first search process of searching for an optimal solution with the generated repair-target-section-side transparent word hypothesis included in the hypotheses to be searched, and a second search process of searching for an optimal solution with the generated repair-section-side transparent word hypothesis included in the hypotheses to be searched,
and the result generation means generates a speech recognition result that combines the speech recognition result of the first search process with the speech recognition result of the second search process:
The speech recognition system according to claim 3. - For a section determined to be a rephrasing, when the maximum likelihood hypothesis indicated as the speech recognition result of the first search process is a repair-target-section-side transparent word hypothesis and the maximum likelihood hypothesis indicated as the speech recognition result of the second search process is a repair-section-side transparent word hypothesis, the result generation means combines the word chain in the rephrased section indicated by the repair-target-section-side transparent word hypothesis with the word chain in the rephrased section indicated by the repair-section-side transparent word hypothesis, and generates a speech recognition result indicating a word chain that includes all words in the rephrased section without treating any of them as transparent words:
The speech recognition system according to claim 4. - A result output means for outputting a speech recognition result is provided,
and the result output means outputs a speech recognition result to which information on the repair target section, the non-fluent section, or the repair section is added, in addition to the text information indicated by the word chain of the maximum likelihood hypothesis:
The speech recognition system according to any one of claims 1 to 5. - In a process in which a hypothesis search means searches for an optimal solution for input speech data while generating hypotheses that are chains of words searched as recognition result candidates,
the rephrasing likelihood of a word or word string included in a hypothesis being searched is calculated, and it is determined whether the word or word string is a rephrasing,
and, when a rephrasing is determined, a transparent word hypothesis is generated, which is a hypothesis in which a word or word string included in the non-fluent section or the repair section of the rephrased section containing the word or word string is treated as a transparent word,
whereby the hypothesis search means searches for an optimal solution with the generated transparent word hypothesis included in the hypotheses to be searched:
A speech recognition method characterized by the above. - In a process in which a hypothesis search means searches for an optimal solution for input speech data while generating hypotheses that are chains of words searched as recognition result candidates,
a transparent word hypothesis generation means generates, when a rephrasing is determined, a repair-target-section-side transparent word hypothesis in which a word or word string included in the repair target section or the non-fluent section is treated as a transparent word, and a repair-section-side transparent word hypothesis in which a word or word string included in the non-fluent section or the repair section is treated as a transparent word,
whereby the hypothesis search means performs a first search process of searching for an optimal solution with the generated repair-target-section-side transparent word hypothesis included in the hypotheses to be searched, and a second search process of searching for an optimal solution with the generated repair-section-side transparent word hypothesis included in the hypotheses to be searched,
and a result output means outputs a speech recognition result that combines the speech recognition result of the first search process with the speech recognition result of the second search process:
The speech recognition method according to claim 7. - Causing a computer to execute,
in the course of a hypothesis search process of searching for an optimal solution for input speech data while generating hypotheses that are chains of words searched as recognition result candidates:
a rephrase determination process of calculating the rephrasing likelihood of a word or word string included in a hypothesis being searched and determining whether the word or word string is a rephrasing, and
a transparent word hypothesis generation process of generating, when a rephrasing is determined, a transparent word hypothesis, which is a hypothesis in which a word or word string included in the non-fluent section or the repair section of the rephrased section containing the word or word string is treated as a transparent word,
and causing the hypothesis search process to search for an optimal solution with the generated transparent word hypothesis included in the hypotheses to be searched:
A speech recognition program for the above. - Causing a computer to execute:
a rephrase determination process of calculating the rephrasing likelihood of a word or word string included in a hypothesis being searched and determining whether the word or word string is a rephrasing,
a first transparent word hypothesis generation process of generating, when a rephrasing is determined, a repair-target-section-side transparent word hypothesis in which a word or word string included in the repair target section or the non-fluent section is treated as a transparent word,
a second transparent word hypothesis generation process of generating, when a rephrasing is determined, a repair-section-side transparent word hypothesis in which a word or word string included in the non-fluent section or the repair section is treated as a transparent word,
a first search process of searching for an optimal solution with the generated repair-target-section-side transparent word hypothesis included in the hypotheses to be searched,
a second search process of searching for an optimal solution with the generated repair-section-side transparent word hypothesis included in the hypotheses to be searched, and
a result output process of outputting a speech recognition result that combines the speech recognition result of the first search process with the speech recognition result of the second search process:
The speech recognition program according to claim 9.
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2012551755A JPWO2012093451A1 (ja) | 2011-01-07 | 2011-12-22 | 音声認識システム、音声認識方法および音声認識プログラム |
| US13/994,462 US20130268271A1 (en) | 2011-01-07 | 2011-12-22 | Speech recognition system, speech recognition method, and speech recognition program |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2011-002307 | 2011-01-07 | ||
| JP2011002307 | 2011-01-07 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2012093451A1 true WO2012093451A1 (ja) | 2012-07-12 |
Family
ID=46457320
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2011/007203 Ceased WO2012093451A1 (ja) | 2011-01-07 | 2011-12-22 | 音声認識システム、音声認識方法および音声認識プログラム |
Country Status (3)
| Country | Link |
|---|---|
| US (1) | US20130268271A1 (ja) |
| JP (1) | JPWO2012093451A1 (ja) |
| WO (1) | WO2012093451A1 (ja) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| KR20220121182A (ko) * | 2021-02-24 | 2022-08-31 | 주식회사 바이칼에이아이 | 문서 분류 모델 기법 및 유창성 태깅에 기반한 인지장애 예측 방법 및 시스템 |
Families Citing this family (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9047562B2 (en) * | 2010-01-06 | 2015-06-02 | Nec Corporation | Data processing device, information storage medium storing computer program therefor and data processing method |
| US20130325482A1 (en) * | 2012-05-29 | 2013-12-05 | GM Global Technology Operations LLC | Estimating congnitive-load in human-machine interaction |
| US10186257B1 (en) * | 2014-04-24 | 2019-01-22 | Nvoq Incorporated | Language model for speech recognition to account for types of disfluency |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPH1124693A (ja) * | 1997-06-27 | 1999-01-29 | Nec Corp | 音声認識装置 |
| JP2001188558A (ja) * | 1999-12-27 | 2001-07-10 | Internatl Business Mach Corp <Ibm> | 音声認識装置、方法、コンピュータ・システム及び記憶媒体 |
| JP2007057844A (ja) * | 2005-08-24 | 2007-03-08 | Fujitsu Ltd | 音声認識システムおよび音声処理システム |
| JP2007093789A (ja) * | 2005-09-27 | 2007-04-12 | Toshiba Corp | 音声認識装置、音声認識方法および音声認識プログラム |
| JP2007225931A (ja) * | 2006-02-23 | 2007-09-06 | Advanced Telecommunication Research Institute International | 音声認識システム及びコンピュータプログラム |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8457967B2 (en) * | 2009-08-15 | 2013-06-04 | Nuance Communications, Inc. | Automatic evaluation of spoken fluency |
- 2011-12-22 US US13/994,462 patent/US20130268271A1/en not_active Abandoned
- 2011-12-22 JP JP2012551755A patent/JPWO2012093451A1/ja active Pending
- 2011-12-22 WO PCT/JP2011/007203 patent/WO2012093451A1/ja not_active Ceased
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| KR20220121182A (ko) * | 2021-02-24 | 2022-08-31 | 주식회사 바이칼에이아이 | 문서 분류 모델 기법 및 유창성 태깅에 기반한 인지장애 예측 방법 및 시스템 |
| KR102844417B1 (ko) * | 2021-02-24 | 2025-08-11 | 주식회사 바이칼에이아이 | 문서 분류 모델 기법 및 유창성 태깅에 기반한 인지장애 예측 방법 및 시스템 |
Also Published As
| Publication number | Publication date |
|---|---|
| JPWO2012093451A1 (ja) | 2014-06-09 |
| US20130268271A1 (en) | 2013-10-10 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Shor et al. | Personalizing ASR for dysarthric and accented speech with limited data | |
| US10074363B2 (en) | Method and apparatus for keyword speech recognition | |
| US8321218B2 (en) | Searching in audio speech | |
| EP2387031B1 (en) | Methods and systems for grammar fitness evaluation as speech recognition error predictor | |
| KR102052031B1 (ko) | 발음평가 방법 및 상기 방법을 이용하는 발음평가 시스템 | |
| Wester | Pronunciation modeling for ASR–knowledge-based and data-derived methods | |
| CN113808571A (zh) | 语音合成方法、装置、电子设备以及存储介质 | |
| CN105336322A (zh) | 多音字模型训练方法、语音合成方法及装置 | |
| US20110218802A1 (en) | Continuous Speech Recognition | |
| WO2019126881A1 (en) | System and method for tone recognition in spoken languages | |
| CN114627896B (zh) | 语音评测方法、装置、设备及存储介质 | |
| Mao et al. | Applying multitask learning to acoustic-phonemic model for mispronunciation detection and diagnosis in l2 english speech | |
| JP6875819B2 (ja) | 音響モデル入力データの正規化装置及び方法と、音声認識装置 | |
| Hu et al. | A DNN-based acoustic modeling of tonal language and its application to Mandarin pronunciation training | |
| CN113763939A (zh) | 基于端到端模型的混合语音识别系统及方法 | |
| US9076436B2 (en) | Apparatus and method for applying pitch features in automatic speech recognition | |
| Abdullah et al. | Central Kurdish automatic speech recognition using deep learning | |
| WO2012093451A1 (ja) | 音声認識システム、音声認識方法および音声認識プログラム | |
| Yamasaki et al. | Transcribing and aligning conversational speech: A hybrid pipeline applied to french conversations | |
| Kubis et al. | Back transcription as a method for evaluating robustness of natural language understanding models to speech recognition errors | |
| Mao et al. | Integrating articulatory features into acoustic-phonemic model for mispronunciation detection and diagnosis in l2 english speech | |
| Habeeb et al. | An ensemble technique for speech recognition in noisy environments | |
| Li et al. | Improving mandarin tone mispronunciation detection for non-native learners with soft-target tone labels and blstm-based deep models | |
| WO2012093661A1 (ja) | 音声認識装置、音声認識方法および音声認識プログラム | |
| Anzai et al. | Recognition of utterances with grammatical mistakes based on optimization of language model towards interactive CALL systems |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 11855126 Country of ref document: EP Kind code of ref document: A1 |
|
| ENP | Entry into the national phase |
Ref document number: 2012551755 Country of ref document: JP Kind code of ref document: A |
|
| WWE | Wipo information: entry into national phase |
Ref document number: 13994462 Country of ref document: US |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| 122 | Ep: pct application non-entry in european phase |
Ref document number: 11855126 Country of ref document: EP Kind code of ref document: A1 |