WO2012093661A1 - Speech recognition device, speech recognition method, and speech recognition program - Google Patents
Speech recognition device, speech recognition method, and speech recognition program
- Publication number
- WO2012093661A1 (PCT/JP2012/000044)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- word
- hypothesis
- section
- transparent
- rephrasing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
Definitions
- the present invention relates to a voice recognition device, a voice recognition method, and a voice recognition program.
- speech recognition technology has been applied not only to speech read aloud from a person to a machine, but also to more natural speech between people.
- when speech recognition is performed on person-to-person utterances, two phenomena are major causes of recognition errors: rephrasing and broken-off utterances.
- rephrasing is the phenomenon of re-speaking a word string, either as it is or replaced by another word string.
- a broken-off utterance is the phenomenon of stopping partway through a word.
- the section that is restated by a subsequent utterance is called the pre-rephrase section, the section uttered in order to restate it is called the post-rephrase section, and the section connecting these two sections is called the rephrasing section.
- the section before rephrasing is often accompanied by hesitations and broken-off words.
- Patent Document 1 describes a speech recognition device that can robustly recognize speech containing rephrasing.
- in that device, speech recognition means performs recognition by searching, with a hypothesis search unit, for which word string was uttered, given the speech data as input,
- a section recognition unit receives the recognition result as input, assumes pre- and post-rephrase sections, and re-recognizes the pre-rephrase section.
- specifically, the section recognition unit assumes each phrase in turn to be the post-rephrase section and the preceding phrase to be the pre-rephrase section, and sequentially re-recognizes the pre-rephrase section using a dictionary built from the words in the post-rephrase section and the subwords of similar words. A determination unit then judges which of the original recognition result and the section recognition result is the more likely speech recognition result, and an output unit outputs the result judged more likely.
- the speech recognition result in the section after rephrasing is often wrong due to the influence of misrecognition in the section before rephrasing.
- in such a method, which post-processes the speech recognition result after recognition has finished, the rephrasing cannot be handled properly unless the rephrased part was recognized accurately in the first place. That is, when speech containing a rephrase is recognized, the word chain of the rephrased part becomes unnatural, so the language likelihood of that section's word chain becomes low, and the rephrased part is often misrecognized. Once such a recognition error has occurred at the recognition stage, it cannot be corrected afterwards.
- an object of the present invention is therefore to provide a speech recognition device, a speech recognition method, and a program that are robust against rephrasing and broken-off utterances.
- a speech recognition apparatus according to the present invention includes: hypothesis search means that, for input speech data, generates hypotheses, i.e., chains of words searched as recognition result candidates, and searches for an optimal solution; rephrase determination means that calculates the rephrase likelihood of a word or word string included in a hypothesis being searched by the hypothesis search means and determines whether the word or word string is a rephrase; and transparent word hypothesis generation means that, when a rephrase is determined, generates a transparent word hypothesis, i.e., a hypothesis in which the word or word string included in the pre-rephrase section related to that word or word string is treated as a transparent word. The hypothesis search means searches for the optimal solution with the transparent word hypotheses generated by the transparent word hypothesis generation means included among the hypotheses to be searched.
- a speech recognition method according to the present invention is characterized in that, in the process in which the hypothesis search means searches for the optimal solution while generating hypotheses, i.e., chains of words searched as recognition result candidates for the input speech data, the rephrase likelihood of a word or word string included in a hypothesis being searched is calculated, whether the word or word string is a rephrase is determined, and, when it is determined to be a rephrase, a transparent word hypothesis is generated in which the word or word string included in the pre-rephrase section related to that word or word string is treated as a transparent word, whereby the hypothesis search means searches for the optimal solution with the generated transparent word hypotheses included among the hypotheses to be searched.
- a speech recognition program according to the present invention causes a computer to execute: hypothesis search processing that searches for an optimal solution while generating hypotheses, i.e., chains of words searched as recognition result candidates for the input speech data; rephrase determination processing that, during the hypothesis search, calculates the rephrase likelihood of a word or word string included in a hypothesis being searched and determines whether it is a rephrase; and transparent word hypothesis generation processing that, when a rephrase is determined, generates a transparent word hypothesis in which the word or word string included in the pre-rephrase section related to that word or word string is treated as a transparent word. In the hypothesis search processing, the optimal solution is searched for with the transparent word hypotheses generated by the transparent word hypothesis generation processing included among the hypotheses to be searched.
- according to the present invention, erroneous recognition in the post-rephrase section caused by misrecognition in the pre-rephrase section can be prevented. As a result, a speech recognition apparatus, method, and program that are robust against rephrasing and broken-off utterances can be provided.
- FIG. 1 is a block diagram showing a configuration example of a speech recognition apparatus according to the present invention.
- the speech recognition apparatus shown in FIG. 1 includes a speech input unit 101, a speech recognition unit 102, and a result output unit 106. Further, the speech recognition unit 102 includes a hypothesis search unit 103, a determination unit 104, and a hypothesis generation unit 105.
- the voice input unit 101 captures the speaker's utterance as voice data.
- the voice data is captured, for example, as a sequence of acoustic feature vectors.
- the voice recognition unit 102 receives voice data, performs voice recognition, and outputs a recognition result.
- the result output unit 106 displays the recognition result by the voice recognition unit 102.
- the hypothesis search unit 103 calculates the likelihood of each hypothesis, expands hypotheses by connecting further phonemes and words to each hypothesis, and searches for solutions.
- the determination unit 104 assumes pre- and post-rephrase sections within the word chain of each hypothesis, calculates the rephrase likelihood under that assumption, and judges a word chain whose rephrase likelihood is at or above a threshold to be a rephrase hypothesis.
- the hypothesis generation unit 105 generates a hypothesis in which each word of the word string in the pre-rephrase section of a rephrase hypothesis is treated as a transparent word.
- the voice input unit 101 is realized by a voice input device such as a microphone, for example.
- the voice recognition unit 102 (including the hypothesis search unit 103, the determination unit 104, and the hypothesis generation unit 105) is realized, for example, by an information processing device such as a CPU operating according to a program.
- the result output unit 106 is realized, for example, by an information processing device such as a CPU operating according to a program, together with an output device such as a monitor.
- as indices of rephrase likelihood, acoustic information such as the presence of silent intervals, sudden changes in power, pitch, or speaking rate, and the acoustic similarity between subwords before and after the rephrase, as well as linguistic indices such as the presence of consecutive words of the same semantic class across the pre- and post-rephrase sections, can be used. These indices may be used individually, or combined, for example by linear combination.
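As one concrete reading of "integrated by linear combination", the following sketch combines several indices with assumed weights and an assumed threshold; the feature names and all numeric values are illustrative only and are not taken from the patent.

```python
# Sketch of combining rephrase indices by linear combination.
# Feature names, weights, and the threshold are illustrative assumptions:
# the patent only states that acoustic and linguistic indices may be used
# alone or integrated, e.g., by linear combination.

def rephrase_score(features, weights, bias=0.0):
    """Linearly combine per-pair indices into one rephrase likelihood score."""
    return bias + sum(weights[name] * value for name, value in features.items())

weights = {
    "silence_len": 1.5,   # longer pause before the restart -> more likely a rephrase
    "pitch_reset": 0.8,   # abrupt pitch change at the section boundary
    "subword_sim": 2.0,   # acoustic similarity of pre/post subwords
    "same_class": 1.0,    # consecutive words of the same semantic class
}

features = {"silence_len": 0.4, "pitch_reset": 0.2, "subword_sim": 0.9, "same_class": 1.0}
score = rephrase_score(features, weights)
is_rephrase = score >= 3.0   # the threshold is also an assumed value
```

A learned logistic regression over the same features would be a natural refinement of this hand-weighted combination.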
- in this way, the speech recognition apparatus judges rephrasing based on the rephrase likelihood, an index of how likely the words or word strings in the assumed pre- and post-rephrase sections constitute a rephrase, and dynamically generates a hypothesis that treats the word string of the pre-rephrase section as transparent words.
- by using such transparent words, the speech recognition apparatus suppresses the deterioration of language likelihood caused by the rephrasing phenomenon.
- FIG. 2 is a flowchart showing an example of the operation of the speech recognition apparatus shown in FIG.
- the voice input unit 101 captures a speaker's utterance as voice data (step S1).
- the voice recognition unit 102 then performs voice recognition using the captured voice data as input.
- the hypothesis searching unit 103 calculates the likelihood of the intra-word hypothesis using the speech data captured by the speech input unit 101 as an input (step S2).
- an intra-word hypothesis is the unit used to handle, while the speech data is searched along the time axis from the beginning, a portion where the word is not yet fixed; words sharing the same leading phonemes are handled there as one hypothesis.
- the hypothesis search unit 103 performs likelihood calculation for intra-word hypotheses, where the word is not fixed, in the form of "acoustic likelihood + approximate language likelihood". When a hypothesis reaches the end of a word and the word is finalized, the language likelihood of the word chain is calculated exactly, and the total becomes "acoustic likelihood + language likelihood".
- the hypothesis search unit 103 gives a language likelihood based on the confirmed word for the hypothesis that has reached the end of the word (step S3).
- the determination unit 104 enumerates all possible sets of a pre-rephrase section and a post-rephrase section within the determined word string, then takes out the first set (step S4).
- here, the determination unit 104 treats the hypothesis generated by the hypothesis search unit 103 (that is, the hypothesis being searched) as a chain of words and assumes the pre- and post-rephrase sections based on predetermined rephrase-section setting information.
- the determination unit 104 includes in the post-rephrase section the word determined in the immediately preceding step S3, i.e., the word whose intra-word hypothesis likelihood calculation was completed in step S2 and which has just reached its word end.
- the pre- and post-rephrase sections may each be, for example, a single word, or they may be contiguous sections of up to N words before and up to M words after. In that case, all combinations of 1 to N words and 1 to M words may be enumerated.
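The enumeration described above can be sketched as follows. The function name and the convention that the post-rephrase span ends at the just-confirmed word are assumptions consistent with the description, not code from the patent.

```python
# Illustrative enumeration of hypothetical rephrase section pairs (step S4):
# for a confirmed word string, list every pair of a contiguous pre-rephrase
# span (1..N words) immediately followed by a contiguous post-rephrase span
# (1..M words) that ends at the just-confirmed word.

def enumerate_pairs(words, n_max, m_max):
    """Return (pre_span, post_span) pairs where the post span ends at the last word."""
    pairs = []
    end = len(words)                      # the post-rephrase span ends here
    for m in range(1, m_max + 1):         # post-rephrase span length
        post_start = end - m
        for n in range(1, n_max + 1):     # pre-rephrase span length
            pre_start = post_start - n
            if pre_start < 0:
                break                     # span would run off the front of the string
            pairs.append((tuple(words[pre_start:post_start]),
                          tuple(words[post_start:end])))
    return pairs

pairs = enumerate_pairs(["do", "you", "know", "some", "someone"], n_max=2, m_max=2)
# pairs include (("some",), ("someone",)) and (("know",), ("some", "someone"))
```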
- hereinafter, a set of a pre-rephrase section and a post-rephrase section enumerated in step S4 may be called a hypothetical rephrase section pair, and the section connecting the two a hypothetical rephrasing section.
- the determination unit 104 calculates the rephrase likelihood for the hypothetical rephrase section pair extracted in step S4 (step S5).
- as indices, acoustic information such as the presence of silent sections or sudden changes in power, pitch, or speaking rate, the acoustic similarity between subwords before and after the rephrase, and linguistic indices such as the presence of consecutive words of the same class across the pre- and post-rephrase sections can be used.
- the determination unit 104 then determines whether the rephrase likelihood is at or above a threshold (step S6).
- the determination unit 104 proceeds to step S7 if the rephrase likelihood is at or above the threshold, and to step S8 if it is below the threshold.
- in step S7, the hypothesis generation unit 105 generates, for each hypothesis containing a hypothetical rephrase section pair whose rephrase likelihood is at or above the threshold, a hypothesis that treats the word string of the pre-rephrase section as transparent words.
- a transparent word is a word treated as non-linguistic in the speech recognition process: when the language likelihood of a hypothesis is calculated, transparent words are removed from the word chain before the likelihood is computed.
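How a transparent word affects the language likelihood can be sketched with an assumed bigram table; the log-probability values below are invented for illustration and only the mechanism (skipping transparent words before scoring the chain) reflects the description.

```python
# Minimal sketch of scoring a word chain while skipping transparent words,
# under an assumed bigram log-probability table. The table values are made
# up; the point is only that transparent words are removed from the chain
# before the language likelihood is computed.

BIGRAM_LOGP = {            # assumed log probabilities
    ("know", "some"): -6.0,
    ("know", "someone"): -3.0,
    ("some", "someone"): -7.0,
}
DEFAULT_LOGP = -8.0        # back-off value for unseen bigrams

def chain_logp(words, transparent=frozenset()):
    """Sum bigram log-probabilities over the chain, skipping transparent words."""
    visible = [w for w in words if w not in transparent]
    return sum(BIGRAM_LOGP.get(pair, DEFAULT_LOGP)
               for pair in zip(visible, visible[1:]))

plain = chain_logp(["know", "some", "someone"])              # -6.0 + -7.0 = -13.0
with_tw = chain_logp(["know", "some", "someone"], {"some"})  # scores "know someone": -3.0
```

With "some" transparent, the chain is scored as "know someone", so the correct post-rephrase word keeps a high language likelihood.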
- in step S8, the determination unit 104 checks whether any of the hypothetical rephrase section pairs enumerated in step S4 remain unprocessed. If any remain, it returns to step S4 and takes out one of the remaining pairs (Yes in step S8). When the processing of steps S5 to S7 has been completed for all enumerated pairs (No in step S8), the determination unit 104 proceeds to step S9.
- in step S9, the determination unit 104 determines whether the hypothesis search has been completed up to the end of the speech. If the end has not been reached (No in step S9), the process returns to step S2, and, after the hypotheses generated in step S7 have been added (or substituted for the hypotheses determined to be rephrases), the hypothesis search continues on the next speech frame. When the end of the speech is reached (Yes in step S9), the process proceeds to step S10.
- in step S10, the result output unit 106 outputs the hypothesis that finally has the maximum likelihood as the speech recognition result.
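The loop of steps S2 to S10 can be sketched end-to-end in miniature. Everything below is a toy reduction assumed for illustration: one candidate word per frame stands in for acoustic decoding, a small bigram table stands in for the language model, and a lookup table stands in for the step S5 rephrase-likelihood computation.

```python
# Toy end-to-end sketch of the search loop in FIG. 2 (steps S2-S10).
# A hypothesis is (words, transparent_indices); rephrase_pairs maps
# (pre_word, post_word) -> assumed rephrase likelihood.

def decode(frame_words, lm_logp, rephrase_pairs, threshold=0.5):
    """Return the maximum-likelihood hypothesis, allowing transparent words."""
    hyps = [((), frozenset())]
    for cands in frame_words:                         # S2/S3: extend hypotheses
        hyps = [(words + (w,), tw) for words, tw in hyps for w in cands]
        extra = []
        for words, tw in hyps:                        # S4-S7: rephrase check
            if len(words) >= 2:
                pre, post = words[-2], words[-1]
                if rephrase_pairs.get((pre, post), 0.0) >= threshold:
                    # mark the pre-rephrase word as transparent
                    extra.append((words, tw | {len(words) - 2}))
        hyps += extra                                 # S9: carry both variants along
    def score(hyp):                                   # language score, skipping transparent
        words, tw = hyp
        visible = [w for i, w in enumerate(words) if i not in tw]
        return sum(lm_logp.get(p, -8.0) for p in zip(visible, visible[1:]))
    return max(hyps, key=score)                       # S10: maximum-likelihood hypothesis

lm = {("know", "someone"): -3.0, ("know", "some"): -6.0, ("some", "someone"): -7.0}
best = decode([["know"], ["some"], ["someone"]], lm, {("some", "someone"): 0.9})
# best marks "some" as transparent, so the visible chain scored is "know someone"
```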
- as described above, the speech recognition apparatus dynamically, i.e., during the recognition search, treats the word or word string included in the pre-rephrase section of a hypothetical rephrase section pair with high rephrase likelihood as transparent words, so the drop in language likelihood of the correct hypothesis in the post-rephrase section can be suppressed. If transparent words were not dynamically applied to the extracted pre-rephrase section, misrecognition of the pre-rephrase section would often worsen the language likelihood of the correct hypothesis for the post-rephrase section, and the post-rephrase section would often be misrecognized.
- in the above description, the rephrase likelihood of words or word strings included in the hypothesis being searched is calculated sequentially, and a transparent word hypothesis is generated whenever a word or word string is determined to be a rephrase.
- however, the timing of the rephrase determination is not limited to this. It suffices that the hypothesis search unit 103 can treat the hypotheses generated as a result of the rephrase determination (hypotheses containing transparent words) as search targets, together with or in place of the hypotheses being searched. It is also possible to define the timing or conditions for performing the rephrase determination and to apply it sequentially to the hypotheses searched so far. As one example, the rephrase determination could be performed when multiple word hypotheses are detected in the same section.
- in step S1, the voice input unit 101 captures the speaker's utterance "Do you know some someone who can speak Japanese?" as voice data.
- in step S2, the hypothesis search unit 103 calculates the likelihood of intra-word hypotheses, where the word is not yet fixed, for the captured speech data. For example, for the /i/ phoneme of the word "speak" in the example utterance, this corresponds to computing the acoustic likelihood against the /i/ or /u/ phoneme models and adding the language likelihood of the hypothesis's word chain, such as "can" or "can't".
- in step S3, the hypothesis search unit 103 assigns a language likelihood based on the confirmed word to each hypothesis that has reached a word end.
- FIG. 3 is an explanatory diagram showing examples of hypotheses searched in this example. This process will be described more specifically using the example shown in FIG. In FIG. 3, each ellipse indicates a word (word hypothesis) to be searched as a recognition result candidate.
- the numerical value attached to each word hypothesis represents the log likelihood of the word chain in which each word hypothesis is linked to the preceding word hypothesis.
- in step S4, the determination unit 104 enumerates the possible pairs of a pre-rephrase section and a post-rephrase section in the determined word string, and takes out the first pair.
- as before, the determination unit 104 includes the word determined in step S3 in the post-rephrase section.
- each of the pre- and post-rephrase sections may be, for example, a single word, or all combinations may be enumerated over contiguous sections allowing up to N words for the pre-section and up to M words for the post-section.
- FIG. 4 is an explanatory diagram showing an example of enumeration of hypothetical rephrasing sections.
- for example, when the pre-rephrase section is one word and the post-rephrase section is two words, the pre-rephrase section is "know" and the post-rephrase section is "some someone", which yields one hypothetical rephrase section pair.
- including that pair, a total of two pairs are enumerated under this setting in FIG. 4.
- the setting information in FIG. 4 specifies the allowed numbers of words in the pre- and post-rephrase sections, and under it a total of four pairs, including ("you know" + "some someone"), are enumerated.
- in step S5, the determination unit 104 calculates the rephrase likelihood for the one hypothetical rephrase section pair extracted in step S4.
- in this example, acoustic information such as the length of silent sections and the presence of sudden changes in power, pitch, and speaking rate is used as the rephrase index.
- the acoustic information is modeled by a Gaussian mixture over features such as pause length and the time derivatives of power, pitch, and speaking rate, using training data pre-tagged with rephrase sections.
- the determination unit 104 calculates the likelihood with the model.
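Under the assumption of made-up mixture parameters (the description specifies only that such boundary features are modeled with a Gaussian mixture trained on tagged data), a minimal diagonal-covariance GMM scorer can be sketched as:

```python
# Sketch of scoring a section boundary with a Gaussian mixture. The
# mixture parameters and the feature vector are assumed values, not
# learned ones; a diagonal covariance is used for simplicity.

import math

def gauss_logpdf(x, mean, var):
    """Diagonal-covariance Gaussian log-density."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_loglik(x, weights, means, vars_):
    """log sum_k w_k N(x; mean_k, var_k), computed with log-sum-exp for stability."""
    logs = [math.log(w) + gauss_logpdf(x, m, v)
            for w, m, v in zip(weights, means, vars_)]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))

# x = (pause_length_sec, d_power, d_pitch); two assumed mixture components
weights = [0.6, 0.4]
means = [(0.3, -2.0, 5.0), (0.8, -5.0, 10.0)]
vars_ = [(0.05, 1.0, 4.0), (0.1, 2.0, 9.0)]

loglik = gmm_loglik((0.35, -2.5, 6.0), weights, means, vars_)
```

In practice one GMM would be trained on boundaries tagged as rephrases and the log-likelihood (or a likelihood ratio against a non-rephrase model) compared to the threshold of step S6.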
- in step S6, the determination unit 104 determines whether the rephrase likelihood of the extracted hypothetical rephrase section pair is at or above the threshold.
- it proceeds to step S7 when the rephrase likelihood is at or above the threshold, and to step S8 when it is below.
- in step S7, the hypothesis generation unit 105 generates, for each hypothesis whose rephrase likelihood is at or above the threshold, a hypothesis that treats the word string of the pre-rephrase section as transparent words; the words regarded as transparent are removed from the word chain and the language likelihood is recalculated. Note that this recalculation of the generated hypothesis's language likelihood may be executed by the hypothesis search unit 103.
- FIG. 5 is an explanatory diagram showing an example of a hypothesis generated when it is assumed that the section before rephrasing is “some” and the section after rephrasing is “someone” in this utterance example.
- in this hypothesis, "some", the pre-rephrase section, is excluded, and the language likelihood is assigned by regarding the word chain as "Do you know someone who can speak Japanese". For this reason, the log likelihood given to the word chain "know some" is 0 (since "some" is transparent), while the higher log likelihood of −30 is given to the word chain "know someone". The acoustic likelihood is unchanged.
- in step S8, the determination unit 104 checks whether any of the hypothetical rephrase section pairs enumerated in step S4 remain. If so, it returns to step S4 and takes out one of the remaining pairs.
- in step S9, the determination unit 104 determines whether the hypothesis search has been completed up to the end of the speech. If not, the process returns to step S2, and the hypothesis search for the next speech frame is performed with the hypotheses generated in step S7 added. When the end of the speech is reached, the process proceeds to step S10.
- in step S10, the result output unit 106 outputs the hypothesis that finally has the maximum likelihood as the speech recognition result.
- without this processing, the language likelihood of the word chain in the rephrase section "some someone" is low, so the "someone" part tends to be misrecognized.
- in contrast, in this example the word "some", included in the pre-rephrase section of the hypothetical rephrase section pair judged likely to be a rephrase, is dynamically treated as a transparent word. The drop in language likelihood of the word chain that follows can therefore be suppressed, the correct hypothesis "Do you know someone who can speak Japanese" is more easily retained as the maximum-likelihood hypothesis, and misrecognition in utterances containing rephrases is reduced.
- in a second exemplary embodiment, the acoustic similarity between the pre-rephrase section and subwords of the post-rephrase section is used as the rephrase index of the determination unit 104.
- specifically, subwords starting from the first phoneme of the post-rephrase section are generated, and the phoneme edit distance between each subword and the pre-rephrase section is calculated.
- suppose the pre-rephrase section is "some" and the post-rephrase section is "someone".
- the subwords of the section after rephrasing are “so”, “some”, “someo”, and “someone”.
- in this case, the phoneme edit distance between the pronunciation of "some" (the pre-rephrase section) and the subword "some" is zero.
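A minimal sketch of this index follows; representing the phoneme sequences as plain character strings is an illustrative simplification (a real system would compare phoneme symbols).

```python
# Sketch of the second embodiment's index: generate the subword prefixes
# of the post-rephrase section and take the minimum phoneme edit distance
# to the pre-rephrase section. A distance of 0 (as for "some" vs. the
# "some" prefix of "someone") signals a likely rephrase.

def edit_distance(a, b):
    """Classic Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def min_prefix_distance(pre, post):
    """Minimum edit distance between `pre` and the non-empty prefixes of `post`."""
    return min(edit_distance(pre, post[:k]) for k in range(1, len(post) + 1))

d = min_prefix_distance("some", "someone")   # the prefix "some" matches exactly -> 0
```

The distance could then be converted into a similarity score (for example, normalized by the pre-section length) and fed into the threshold test of step S6.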
- in a third exemplary embodiment, a linguistic index indicating the presence of consecutive words of the same class is used as the rephrase index of the determination unit 104.
- the presence of consecutive same-class words is determined from the semantic similarity of the words using a thesaurus. For example, when words denoting fruit are judged to be uttered consecutively across the pre- and post-rephrase sections, as in "apple banana", the rephrase likelihood may be determined to be above the threshold.
- alternatively, the semantic similarity of the words that are adjacent across the pre- and post-rephrase sections may be computed, with higher similarity giving a higher rephrase likelihood used for the determination.
- when attached function words accompany the content words, as in "apple-wa banana-wa" (rendered literally as "apple is banana is"), the semantic similarity may be obtained between the words after excluding the attached function words.
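A toy sketch of this class check follows, with an assumed thesaurus table and function-word list; none of these entries come from the patent, which only says that a thesaurus is used and that attached function words may be stripped first.

```python
# Sketch of the third embodiment's linguistic index with a toy thesaurus.
# The class table and the function-word list are assumptions standing in
# for a real thesaurus and a real morphological analysis.

TOY_THESAURUS = {          # word -> semantic class (assumed entries)
    "apple": "fruit",
    "banana": "fruit",
    "someone": "person",
}

FUNCTION_WORDS = {"wa", "ga", "is", "a"}   # assumed attached-word list

def strip_function_words(tokens):
    return [t for t in tokens if t not in FUNCTION_WORDS]

def same_class(pre_tokens, post_tokens):
    """True if the last content word before the boundary and the first
    content word after it belong to the same thesaurus class."""
    pre = strip_function_words(pre_tokens)
    post = strip_function_words(post_tokens)
    if not pre or not post:
        return False
    return TOY_THESAURUS.get(pre[-1]) is not None and \
           TOY_THESAURUS.get(pre[-1]) == TOY_THESAURUS.get(post[0])

hit = same_class(["apple", "is"], ["a", "banana"])   # fruit == fruit
```

A graded variant would return a similarity in [0, 1] from thesaurus distance instead of a boolean, matching the "higher similarity, higher rephrase likelihood" option above.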
- in a fourth exemplary embodiment, the indices used in the first to third exemplary embodiments are linearly combined and used as the rephrase index of the determination unit 104.
- in a fifth exemplary embodiment, in step S9 of the first to fourth exemplary embodiments, the speech recognition apparatus determines whether the hypothesis search has been completed up to the end of the speech. When it determines that the end has not been reached, on returning to step S2 it replaces the hypotheses determined to contain a rephrase section with the hypotheses generated in step S7, and then performs the hypothesis search for the next speech frame.
- that is, the hypothesis search unit 103 adds the hypotheses generated in step S7 to the search-target hypotheses, removes from the search targets those hypotheses that do not treat the word or word string of a section pair determined to be a rephrase as transparent words, and then performs the hypothesis search for the next speech frame.
- in this way, a result excluding the hypotheses determined to contain a rephrase section can be output as the recognition result. Since recognition results that may be corrupted by the rephrased part are removed, adverse effects on subsequent processing can be prevented and the processing load can be reduced.
- FIG. 6 is a block diagram showing an outline of the present invention.
- the speech recognition apparatus includes hypothesis search means 11, rephrase determination means 12, and transparent word hypothesis generation means 13.
- the hypothesis search means 11 (for example, the hypothesis search unit 103) generates a hypothesis that is a chain of words to be searched as a recognition result candidate for the input speech data and searches for an optimal solution. Further, the hypothesis search means 11 searches the hypothesis to be searched including the transparent word hypothesis generated by the transparent word hypothesis generation means 13 described later.
- the rephrase determination means 12 (for example, the determination unit 104) calculates the rephrase likelihood of a word or word string included in a hypothesis being searched by the hypothesis search means 11, and determines whether the word or word string is a rephrase.
- the transparent word hypothesis generation means 13 (for example, the hypothesis generation unit 105) generates, when the rephrase determination means 12 determines a rephrase, a transparent word hypothesis, i.e., a hypothesis in which the word or word string included in the pre-rephrase section of that word or word string is treated as a transparent word.
- the rephrase determination means 12 may assume, for a word or word string included in a hypothesis being searched by the hypothesis search means 11, combinations of a pre-rephrase section and a post-rephrase section that contains the word or word string, calculate the rephrase likelihood for each combination, and thereby determine whether each combination is a rephrase.
- as the rephrase index, the speech recognition apparatus may use, for example, the length of silent sections or the presence of sudden changes in power, pitch, or speaking rate within the rephrasing section.
- the acoustic similarity between the word or word string included in the pre-rephrase section and subwords of the word or word string included in the post-rephrase section may also be used.
- the presence of consecutive words belonging to the same semantic class across the pre- and post-rephrase sections may also be used.
- the hypothesis search means 11 may perform the search by adding the transparent word hypothesis generated by the transparent word hypothesis generation means 13 to the existing hypothesis.
- alternatively, the hypothesis search means 11 may add the transparent word hypotheses generated by the transparent word hypothesis generation means 13 to the existing hypotheses and perform the search excluding the hypotheses that do not treat the word or word string included in the post-rephrase section of a combination determined to be a rephrase by the rephrase determination means 12 as transparent words.
- the present invention can be widely used for general speech recognition systems.
- the present invention can be suitably applied to a speech recognition system that recognizes speech spoken by people such as lecture speech and dialogue speech.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Description
102 Speech recognition unit
103 Hypothesis search unit
104 Determination unit
105 Hypothesis generation unit
106 Result output unit
11 Hypothesis search means
12 Rephrase determination means
13 Transparent word hypothesis generation means
Claims (9)
- 1. A speech recognition device comprising: hypothesis search means for searching for an optimal solution while generating, for input speech data, hypotheses that are chains of words searched as recognition result candidates; rephrase determination means for calculating the rephrase likelihood of a word or word string included in a hypothesis being searched by the hypothesis search means and determining whether the word or word string is a rephrase; and transparent word hypothesis generation means for generating, when the rephrase determination means determines a rephrase, a transparent word hypothesis that is a hypothesis in which the word or word string included in the pre-rephrase section related to that word or word string is treated as a transparent word, wherein the hypothesis search means searches for the optimal solution with the transparent word hypotheses generated by the transparent word hypothesis generation means included among the hypotheses to be searched.
- 2. The speech recognition device according to claim 1, wherein the rephrase determination means assumes, for a word or word string included in a hypothesis being searched by the hypothesis search means, combinations of a pre-rephrase section and a post-rephrase section that contains the word or word string, calculates the rephrase likelihood for each assumed combination, and determines whether the combination is a rephrase by determining whether the calculated likelihood is at or above a predetermined threshold, and wherein the transparent word hypothesis generation means generates a hypothesis in which the word or word string included in the pre-rephrase section of a combination determined to be a rephrase is treated as a transparent word.
- 3. The speech recognition device according to claim 2, wherein the length of silent sections or the presence of sudden changes in power, pitch, or speaking rate within the rephrasing section is used as the rephrase index.
- 4. The speech recognition device according to claim 2 or 3, wherein the acoustic similarity between the word or word string included in the pre-rephrase section and subwords of the word or word string included in the post-rephrase section is used as the rephrase index.
- 5. The speech recognition device according to any one of claims 2 to 4, wherein the presence of consecutive words belonging semantically to the same class across the pre- and post-rephrase sections is used as the rephrase index.
- 6. The speech recognition device according to any one of claims 1 to 5, wherein the hypothesis search means performs the search with the transparent word hypotheses generated by the transparent word hypothesis generation means added to the existing hypotheses.
- 7. The speech recognition device according to any one of claims 1 to 6, wherein the hypothesis search means adds the transparent word hypotheses generated by the transparent word hypothesis generation means to the existing hypotheses, and, for a word, word string, or combination of pre- and post-rephrase sections determined to be a rephrase by the rephrase determination means, performs the search excluding the hypotheses that do not treat the word or word string included in the post-rephrase section of that combination as a transparent word.
- 8. A speech recognition method characterized in that, in the process in which hypothesis search means searches for an optimal solution while generating, for input speech data, hypotheses that are chains of words searched as recognition result candidates, the rephrase likelihood of a word or word string included in a hypothesis being searched is calculated, whether the word or word string is a rephrase is determined, and, when it is determined to be a rephrase, a transparent word hypothesis is generated in which the word or word string included in the pre-rephrase section related to that word or word string is treated as a transparent word, whereby the hypothesis search means searches for the optimal solution with the generated transparent word hypotheses included among the hypotheses to be searched.
- 9. A speech recognition program for causing a computer to execute, in the course of hypothesis search processing that searches for an optimal solution while generating, for input speech data, hypotheses that are chains of words searched as recognition result candidates: rephrase determination processing that calculates the rephrase likelihood of a word or word string included in a hypothesis being searched and determines whether the word or word string is a rephrase; and transparent word hypothesis generation processing that, when a rephrase is determined, generates a transparent word hypothesis in which the word or word string included in the pre-rephrase section related to that word or word string is treated as a transparent word, wherein in the hypothesis search processing the optimal solution is searched for with the transparent word hypotheses generated by the transparent word hypothesis generation processing included among the hypotheses to be searched.
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US13/977,382 US20130282374A1 (en) | 2011-01-07 | 2012-01-05 | Speech recognition device, speech recognition method, and speech recognition program |
| JP2012551857A JPWO2012093661A1 (ja) | 2011-01-07 | 2012-01-05 | 音声認識装置、音声認識方法および音声認識プログラム |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2011-002306 | 2011-01-07 | ||
| JP2011002306 | 2011-01-07 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2012093661A1 true WO2012093661A1 (ja) | 2012-07-12 |
Family
ID=46457512
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2012/000044 Ceased WO2012093661A1 (ja) | 2011-01-07 | 2012-01-05 | Speech recognition device, speech recognition method, and speech recognition program |
Country Status (3)
| Country | Link |
|---|---|
| US (1) | US20130282374A1 (ja) |
| JP (1) | JPWO2012093661A1 (ja) |
| WO (1) | WO2012093661A1 (ja) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2017211513A (ja) * | 2016-05-26 | 2017-11-30 | Nippon Telegraph and Telephone Corp | Speech recognition device, method thereof, and program |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9047562B2 (en) * | 2010-01-06 | 2015-06-02 | Nec Corporation | Data processing device, information storage medium storing computer program therefor and data processing method |
| US20150058006A1 (en) * | 2013-08-23 | 2015-02-26 | Xerox Corporation | Phonetic alignment for user-agent dialogue recognition |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPH07230293A (ja) * | 1994-02-17 | 1995-08-29 | Sony Corp | Speech recognition device |
| JPH11194793A (ja) * | 1997-12-26 | 1999-07-21 | Nec Corp | Speech word processor |
| JP2006235298A (ja) * | 2005-02-25 | 2006-09-07 | Mitsubishi Electric Corp | Speech recognition network generation method, speech recognition device, and program therefor |
| JP2006277676A (ja) * | 2005-03-30 | 2006-10-12 | Toshiba Corp | Information retrieval device, information retrieval method, and information retrieval program |
| JP2007057844A (ja) * | 2005-08-24 | 2007-03-08 | Fujitsu Ltd | Speech recognition system and speech processing system |
2012
- 2012-01-05 WO PCT/JP2012/000044 patent/WO2012093661A1/ja not_active Ceased
- 2012-01-05 JP JP2012551857A patent/JPWO2012093661A1/ja active Pending
- 2012-01-05 US US13/977,382 patent/US20130282374A1/en not_active Abandoned
Non-Patent Citations (2)
| Title |
|---|
| JIN'ICHI MURAKAMI: "Frame Synchronous Full Search Algorithm and Applied for Spontaneous Speech Recognition", IEICE TECHNICAL REPORT, vol. 95, no. 123, 23 June 1995 (1995-06-23), pages 57 - 64 * |
| KOTARO FUNAKOSHI ET AL.: "Processing of Self-Repair Expressions Using Word Semantic Roles" [Tango no Imi Yakuwari o Mochiita Jiko Shufuku Hyogen no Shori], PROCEEDINGS OF THE ANNUAL MEETING OF THE ASSOCIATION FOR NATURAL LANGUAGE PROCESSING, vol. 8, 18 March 2002 (2002-03-18), pages 655 - 658 * |
Also Published As
| Publication number | Publication date |
|---|---|
| JPWO2012093661A1 (ja) | 2014-06-09 |
| US20130282374A1 (en) | 2013-10-24 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US9911413B1 (en) | Neural latent variable model for spoken language understanding | |
| US8972243B1 (en) | Parse information encoding in a finite state transducer | |
| JP4301102B2 (ja) | Speech processing device, speech processing method, program, and recording medium | |
| CN1160699C (zh) | Speech recognition system | |
| CN102013253B (zh) | Speech recognition method and speech recognition system based on differences in speech-unit speaking rate | |
| Wester | Pronunciation modeling for ASR–knowledge-based and data-derived methods | |
| Lin et al. | OOV detection by joint word/phone lattice alignment | |
| Prabhavalkar et al. | Less is more: Improved RNN-T decoding using limited label context and path merging | |
| EP3734595A1 (en) | Methods and systems for providing speech recognition systems based on speech recordings logs | |
| Chang et al. | Turn-taking prediction for natural conversational speech | |
| Liu et al. | Rnn-t based open-vocabulary keyword spotting in mandarin with multi-level detection | |
| Yamasaki et al. | Transcribing and aligning conversational speech: A hybrid pipeline applied to french conversations | |
| KR101122591B1 (ko) | Speech recognition apparatus and method based on keyword recognition | |
| Catania et al. | Automatic Speech Recognition: Do Emotions Matter? | |
| WO2012093661A1 (ja) | Speech recognition device, speech recognition method, and speech recognition program | |
| Mao et al. | Integrating articulatory features into acoustic-phonemic model for mispronunciation detection and diagnosis in l2 english speech | |
| WO2012093451A1 (ja) | Speech recognition system, speech recognition method, and speech recognition program | |
| Kou et al. | Fix it where it fails: Pronunciation learning by mining error corrections from speech logs | |
| KR20050101695A (ko) | Statistical speech recognition system and method using recognition results | |
| Li et al. | Improving entity recall in automatic speech recognition with neural embeddings | |
| Kim et al. | Improving end-to-end contextual speech recognition via a word-matching algorithm with backward search | |
| Hwang et al. | Building a highly accurate Mandarin speech recognizer with language-independent technologies and language-dependent modules | |
| Anzai et al. | Recognition of utterances with grammatical mistakes based on optimization of language model towards interactive CALL systems | |
| Taguchi et al. | Learning lexicons from spoken utterances based on statistical model selection | |
| Breslin et al. | Continuous asr for flexible incremental dialogue |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 12732012 Country of ref document: EP Kind code of ref document: A1 |
|
| ENP | Entry into the national phase |
Ref document number: 2012551857 Country of ref document: JP Kind code of ref document: A |
|
| WWE | Wipo information: entry into national phase |
Ref document number: 13977382 Country of ref document: US |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| 122 | Ep: pct application non-entry in european phase |
Ref document number: 12732012 Country of ref document: EP Kind code of ref document: A1 |