
WO2010100853A1 - Language model adaptation device, speech recognition device, language model adaptation method, and computer-readable recording medium - Google Patents

Language model adaptation device, speech recognition device, language model adaptation method, and computer-readable recording medium

Info

Publication number
WO2010100853A1
WO2010100853A1 (PCT/JP2010/001134, JP2010001134W)
Authority
WO
WIPO (PCT)
Prior art keywords
topic
language model
text
sections
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/JP2010/001134
Other languages
French (fr)
Japanese (ja)
Inventor
寺尾真
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC Corp
Original Assignee
NEC Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Corp filed Critical NEC Corp
Publication of WO2010100853A1
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/18: Speech classification or search using natural language modelling
    • G10L15/1815: Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G10L15/183: Speech classification or search using natural language modelling using context dependencies, e.g. language models

Definitions

  • The present invention relates to a language model adaptation device that adapts a language model used in speech recognition according to the recognition result of the recognition target speech, a speech recognition device using the same, a language model adaptation method, and a computer-readable recording medium on which a program for realizing these is recorded.
  • The most widely used statistical language model in speech recognition is the N-gram language model. The N-gram language model is a probability model that assumes that the generation probability of a word at a given position depends only on the immediately preceding N-1 words. That is, the generation probability of the i-th word w_i is given by P(w_i | w_{i-N+1}^{i-1}), where w_{i-N+1}^{i-1} denotes the word string from the (i-N+1)-th through the (i-1)-th word. The generation probability P(w_1^n) of the word string w_1^n = (w_1, w_2, ..., w_n) is then expressed by (Equation 1):

    P(w_1^n) = \prod_{i=1}^{n} P(w_i \mid w_{i-N+1}^{i-1})    (Equation 1)

  • The N-gram language model is created by, for example, maximum likelihood estimation using large-scale training text data. Maximum likelihood estimation is a method of learning the model parameters so that the generation probability of the training text data is maximized.
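  • As a concrete illustration of the maximum likelihood estimate, the following sketch counts N-grams in training text and derives the conditional probabilities used in (Equation 1). It is a minimal sketch, not the patent's implementation; the function names and the toy corpus are illustrative:

    from collections import defaultdict

    def ngram_counts(sentences, n):
        # Count each n-gram and its (n-1)-word history in the training text.
        ngrams, histories = defaultdict(int), defaultdict(int)
        for words in sentences:
            padded = ["<s>"] * (n - 1) + words + ["</s>"]
            for i in range(n - 1, len(padded)):
                history = tuple(padded[i - n + 1:i])
                ngrams[history + (padded[i],)] += 1
                histories[history] += 1
        return ngrams, histories

    def ngram_prob(w, history, ngrams, histories):
        # Maximum likelihood estimate: P(w | history) = count(history, w) / count(history).
        h = tuple(history)
        return ngrams[h + (w,)] / histories[h] if histories[h] else 0.0

    # Toy corpus; an actual base language model would be trained on large-scale text.
    corpus = [["the", "news", "today"], ["the", "news", "tonight"]]
    ngrams, histories = ngram_counts(corpus, n=2)
    print(ngram_prob("news", ("the",), ngrams, histories))  # 1.0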
  • However, it is usually difficult to prepare training text data whose content completely matches the recognition target speech, so a language model created in advance does not always appropriately represent the appearance tendency of words in that speech. A technique for adapting a language model to the characteristics of the recognition target speech is therefore desired.
  • A cache model has been proposed as a technique for adapting a language model to the characteristics of the speech to be recognized (see, for example, Non-Patent Document 1). The cache model adapts the language model to the recognition target speech by exploiting the property of language that the same words and phrases tend to be used repeatedly.
  • In the following, the previously created N-gram language model is denoted P_BASE(w_i | w_{i-N+1}^{i-1}), and the case of adapting it with the cache model to obtain a language model that gives the generation probability of the i-th word w_i in the recognition target is described.
  • First, taking the M words w_{i-M}^{i-1} that appear immediately before the word w_i as the cache interval, the cache probability P_CACHE(w_i | w_{i-M}^{i-1}) is calculated by (Equation 2). The recognition result may be used as the immediately preceding M words w_{i-M}^{i-1}.

    P_{CACHE}(w_i \mid w_{i-M}^{i-1}) = \frac{1}{M} \sum_{j=i-M}^{i-1} \delta(w_i, w_j)    (Equation 2)

  • Here \delta(\cdot,\cdot) is the Kronecker delta, a function that is 1 when its two arguments are equal and 0 otherwise. The cache probability P_CACHE(w_i | w_{i-M}^{i-1}) is therefore the word distribution within the cache interval of the M words immediately preceding w_i, and is considered to represent the appearance tendency of words in the vicinity of the word w_i to be recognized. The cache length M is determined experimentally in advance.
  • Next, using the cache probability P_CACHE(w_i | w_{i-M}^{i-1}), the previously created N-gram language model P_BASE(w_i | w_{i-N+1}^{i-1}) is adapted to obtain an adapted language model P_ADAPT(w_i | w_{i-N+1}^{i-1}, w_{i-M}^{i-1}) that gives the generation probability of the i-th word w_i. Specifically, P_BASE and P_CACHE may be linearly interpolated by (Equation 3), where \lambda is a constant between 0 and 1 determined experimentally in advance:

    P_{ADAPT}(w_i \mid w_{i-N+1}^{i-1}, w_{i-M}^{i-1}) = \lambda P_{BASE}(w_i \mid w_{i-N+1}^{i-1}) + (1 - \lambda) P_{CACHE}(w_i \mid w_{i-M}^{i-1})    (Equation 3)

  • The language model P_ADAPT obtained in this way is a model that reflects the appearance tendency of words in the recognition target speech on top of the language model P_BASE created in advance.
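  • As a minimal sketch of (Equation 2) and (Equation 3), the cache probability and the interpolated model can be computed as follows; all identifiers are illustrative, and p_base stands in for the pre-trained model P_BASE:

    def cache_prob(w, preceding_words, M):
        # (Equation 2): relative frequency of w among the last M recognized words.
        cache = preceding_words[-M:]
        if not cache:
            return 0.0
        return sum(1 for v in cache if v == w) / len(cache)

    def adapted_prob(w, history, preceding_words, p_base, M=200, lam=0.7):
        # (Equation 3): linear interpolation of the base N-gram model and the cache.
        return lam * p_base(w, history) + (1 - lam) * cache_prob(w, preceding_words, M)

  • In practice, preceding_words would be the recognition result so far, and M and lam (the constant λ) would be fixed experimentally, as the text states.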
  • In Non-Patent Document 1 the immediately preceding M words are treated equally, but a method has also been proposed that calculates the cache probability on the assumption that words closer to w_i have greater influence (see, for example, Non-Patent Document 2). In Non-Patent Document 2, the cache probability P_CACHE is calculated by (Equation 4) so that a word's influence decreases with its distance from w_i:

    P_{CACHE}(w_i \mid w_1^{i-1}) = \beta \sum_{j=1}^{i-1} \delta(w_i, w_j) \, e^{-\alpha (i-j)}    (Equation 4)

  • Here \alpha is the decay rate of a word's influence and \beta is a normalization constant. In (Equation 4) the cache interval is all the words preceding w_i.
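  • The decaying variant can be sketched in the same way; the exponential form of the decay and the explicit renormalization below are assumptions consistent with the stated roles of α and β, not a verbatim rendering of Non-Patent Document 2:

    import math

    def decaying_cache_prob(w, preceding_words, alpha, vocab):
        # (Equation 4): each past occurrence of a word contributes e^(-alpha * distance).
        n = len(preceding_words)
        def score(v):
            return sum(math.exp(-alpha * (n + 1 - j))
                       for j, u in enumerate(preceding_words, start=1) if u == v)
        z = sum(score(v) for v in vocab)  # beta corresponds to 1/z
        return score(w) / z if z else 0.0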
  • In the techniques of Non-Patent Document 1 and Non-Patent Document 2 described above, when the topic suddenly changes in the recognition target speech, and when the same topic recurs at intervals in the recognition target speech, there is a problem that the adaptation effect of the language model is insufficient. The reason is that these techniques adapt the language model using a cache interval that is set without considering topic changes in the recognition target speech. A topic here means, for example, a news genre such as "politics", "economics", or "sports", or an individual news item such as "election of the House of Representatives" or "XX company goes bankrupt" in the case of the audio of a news program, or an individual agenda item in the case of the audio of a conference.
  • FIG. 11 is a diagram for explaining the conventional problem, and shows a news program composed of news on various topics. In FIG. 11, the topics of sections 1 to 4 are "sports", "politics", "economics", and "sports", respectively. "Sports" appears in two places, meaning that after a sports item was taken up at the beginning of the program, the same sports item was taken up again in detail in the sports corner in the second half of the program.
  • Suppose that the language model is adapted by the technique disclosed in Non-Patent Document 1 or Non-Patent Document 2 and an adapted language model to be used at time T is created. The adapted language model used at time T should preferably reflect the appearance tendency of words in section 1 and section 4, which have the same topic (sports) as time T, while the appearance tendency of words in section 2 and section 3, which have topics different from time T, should not be used for the adaptation.
  • However, in the conventional techniques the cache interval is simply the interval immediately preceding time T, so section 3, whose topic (economics) differs from the topic to be adapted to (sports), may be included in the cache interval, while section 1, which is distant from time T, may not be included. As a result, a cache probability that appropriately reflects the appearance tendency of the words to be adapted to cannot be obtained, and the adaptation effect becomes insufficient. In the techniques of Non-Patent Document 1 and Non-Patent Document 2, this problem cannot be avoided because the cache interval is set without considering topic changes in the recognition target speech.
  • An object of the present invention is to solve the above problem and to provide a language model adaptation device, a speech recognition device, a language model adaptation method, and a computer-readable recording medium with which a sufficient adaptation effect of the language model can be obtained even when the topic changes suddenly and even when the same topic recurs at intervals.
  • In order to achieve the above object, a language model adaptation device according to the present invention is a language model adaptation device that performs adaptation of a base language model, and comprises: a dividing unit that divides input text into a plurality of sections; a topic analysis unit that determines the topic included in each of the plurality of sections and creates, for each determined topic, a topic model representing the appearance tendency of words in that topic; and an adapted language model creation unit that, for each of the plurality of sections, adapts the base language model using the topic model corresponding to the topic included in the section, thereby creating an adapted language model.
  • Also, a speech recognition device according to the present invention is a speech recognition device that performs speech recognition while adapting a language model, and comprises: a speech recognition unit that performs speech recognition on speech data using the language model; a dividing unit that divides the text obtained by the speech recognition into a plurality of sections; a topic analysis unit that determines the topic included in each of the plurality of sections and creates, for each determined topic, a topic model representing the appearance tendency of words in that topic; an adapted language model creation unit that, for each of the plurality of sections, adapts the base language model using the topic model corresponding to the topic included in the section, thereby creating an adapted language model; and a re-recognition unit that performs speech recognition, using each adapted language model created by the adapted language model creation unit, on the speech data of the section corresponding to that adapted language model.
  • Furthermore, a language model adaptation method according to the present invention is a language model adaptation method for adapting a base language model, and comprises the steps of: (a) dividing text into a plurality of sections; (b) determining the topic included in each of the plurality of sections and creating, for each determined topic, a topic model representing the appearance tendency of words in that topic; and (c) for each of the plurality of sections, adapting the base language model using the topic model corresponding to the topic included in the section, thereby creating an adapted language model.
  • Furthermore, a computer-readable recording medium according to the present invention records a program that causes a computer to execute adaptation of a base language model, the program including instructions that cause the computer to carry out the steps of: (a) dividing text into a plurality of sections; (b) determining the topic included in each of the plurality of sections and creating, for each determined topic, a topic model representing the appearance tendency of words in that topic; and (c) for each of the plurality of sections, adapting the base language model using the topic model corresponding to the topic included in the section, thereby creating an adapted language model.
  • According to the language model adaptation device, the speech recognition device, the language model adaptation method, and the computer-readable recording medium of the present invention, a sufficient adaptation effect of the language model can be obtained even when a topic changes suddenly and even when the same topic is repeated at intervals.
  • FIG. 1 is a block diagram showing a schematic configuration of a language model adaptation apparatus according to an embodiment of the present invention.
  • FIG. 2 is a flowchart showing the operation of the language model adaptation apparatus in the embodiment of the present invention.
  • FIG. 3 is a block diagram illustrating a schematic configuration of the language model adaptation device and the speech recognition device according to the first embodiment of the present invention.
  • FIG. 4 is a flowchart showing operations of the language model adaptation device and the speech recognition device according to the first exemplary embodiment of the present invention.
  • FIG. 5 is a diagram for explaining a specific example of the recognition result in the first embodiment of the present invention.
  • FIG. 6 is a diagram for explaining a specific example of the division result in the first embodiment of the present invention.
  • FIG. 7 is a diagram illustrating a specific example of the topic determination result in the first embodiment of the present invention.
  • FIG. 8 is a diagram for explaining another specific example of the division result according to the first embodiment of the present invention.
  • FIG. 9 is a diagram illustrating a specific example of the topic determination result in the second embodiment of the present invention.
  • FIG. 10 is a block diagram illustrating an example of a computer that can implement the language model adaptation device and the speech recognition device according to the first and second embodiments of the present invention.
  • FIG. 11 is a diagram for explaining a conventional problem.
  • FIG. 1 is a block diagram showing a schematic configuration of a language model adaptation apparatus according to an embodiment of the present invention.
  • The language model adaptation apparatus 100 according to the present embodiment is an apparatus that executes adaptation of a language model serving as a base (hereinafter referred to as the "base language model").
  • As shown in FIG. 1, the language model adaptation apparatus 100 includes a dividing unit 105, a topic analysis unit 107, and an adapted language model creation unit 110. The language model adaptation apparatus 100 further includes a base language model storage unit 102 that stores the base language model.
  • In the present embodiment, the base language model is a known N-gram language model, that is, a probability model that gives word generation probabilities on the assumption that the generation probability of a word at a given position depends only on the immediately preceding N-1 words.
  • The dividing unit 105 divides the input text into a plurality of sections. The topic analysis unit 107 determines the topic included in each of the plurality of sections. Here, "determining a topic" is not limited to identifying a specific topic; it also includes determining whether the topic of one section and the topic of another section are the same or similar. After determining the topics, the topic analysis unit 107 creates, for each determined topic, a topic model representing the appearance tendency of words in that topic. The adapted language model creation unit 110 adapts the base language model for each of the plurality of sections using the topic model corresponding to the topic included in the section, creates an adapted language model, and outputs the created adapted language model.
  • FIG. 2 is a flowchart showing the operation of the language model adaptation apparatus in the embodiment of the present invention.
  • The language model adaptation method according to the present embodiment is carried out by operating the language model adaptation apparatus 100; accordingly, the description of the method is given through the following description of the operation of the apparatus 100.
  • First, the dividing unit 105 receives input text (step S1). The text input in step S1 may be any text; specific examples include text created using the base language model, such as text obtained by speech recognition or by machine translation.
  • Next, the dividing unit 105 divides the input text into a plurality of sections (step S2). The dividing unit 105 can divide, for example, by the number of words in the text or by the number of utterances (in the case of speech recognition). It can also divide based on change points of the word distribution in the text, on positions where preset topic-boundary expressions appear in the text, or by using models that represent the appearance tendency of words for preset topics.
  • Next, the topic analysis unit 107 determines, for each section divided by the dividing unit 105, one or more topics included in the section (step S3). For example, the topic analysis unit 107 classifies the plurality of sections into groups based on the similarity between the texts of the sections, and determines that sections belonging to the same group share a common topic; in the present embodiment, topic determination thus means determining whether a topic is common between sections. Alternatively, in step S3 the topic analysis unit 107 can determine the topics by assuming a probability model that represents the appearance tendency of words in each section with the topic as a hidden variable, and then learning the parameters of the probability model using the text of each section as training data.
  • Next, the topic analysis unit 107 creates, for each determined topic, a topic model representing the appearance tendency of words in that topic (step S4).
  • Next, for each divided section, the adapted language model creation unit 110 adapts the base language model using the topic models of the one or more topics that the topic analysis unit 107 has determined to be included in the section, and creates an adapted language model adapted to that section (step S5). Finally, the language model adaptation apparatus 100 outputs the adapted language models created in step S5 (step S6) and ends the process.
  • As described above, the language model adaptation apparatus 100 divides text into a plurality of sections and performs, for each topic, in other words for the set of sections sharing a common topic, an adaptation that reflects the appearance tendency of words in that topic. Therefore, according to the present embodiment, a sufficient adaptation effect of the language model can be obtained even when the topic changes suddenly in the recognition target speech from which the text originates, and even when the same topic recurs at intervals.
  • When text obtained by speech recognition is targeted, the language model adaptation apparatus 100 may include a speech recognition unit, as shown in the first and second embodiments below.
  • FIG. 3 is a block diagram illustrating a schematic configuration of the language model adaptation device and the speech recognition device according to the first embodiment of the present invention.
  • the language model adaptation apparatus 100 in the first embodiment shown in FIG. 3 constitutes a part of the speech recognition apparatus 10 and adapts the language model in accordance with the characteristics of the recognition target speech. Further, the speech recognition apparatus 10 according to the first embodiment recognizes speech data with higher accuracy using the language model adapted by the language model adaptation apparatus 100.
  • As shown in FIG. 3, the language model adaptation apparatus 100 includes, as described in the embodiment above, a base language model storage unit 102, a dividing unit 105, a topic analysis unit 107, and an adapted language model creation unit 110. Furthermore, in the first embodiment, the language model adaptation apparatus 100 includes a recognition result storage unit 104, a division result storage unit 106, a topic determination result storage unit 108, a topic model storage unit 109, and an adapted language model storage unit 111.
  • In addition to the language model adaptation apparatus 100, the speech recognition apparatus 10 according to the first embodiment includes a speech data storage unit 101, a speech recognition unit 103, a re-recognition unit 112, and a re-recognition result storage unit 113. Each of these units operates as follows.
  • The speech data storage unit 101 stores the speech data to be recognized. As the speech data, various kinds of audio such as conference speech, lecture speech, and broadcast speech are conceivable. These speech data may be archive data prepared in advance, or data input in real time from a microphone or the like.
  • The base language model storage unit 102 stores the base language model used by the speech recognition unit 103. As the base language model, a known N-gram language model can be used, constructed in advance by learning from a large amount of text data. In the first embodiment, as described later, adapting this base language model to the characteristics of the recognition target speech makes it possible to recognize the speech data with higher accuracy.
  • The speech recognition unit 103 reads the speech data stored in the speech data storage unit 101 and performs speech recognition on it using the base language model stored in the base language model storage unit 102. The speech recognition unit 103 then outputs the recognition result as text (text data) to the recognition result storage unit 104, which stores it (see FIG. 5).
  • The dividing unit 105 divides the recognition result text stored in the recognition result storage unit 104 into a plurality of sections, and outputs the division result to the division result storage unit 106, which stores it (see FIG. 6).
  • The topic analysis unit 107 refers to the recognition result text stored in the recognition result storage unit 104 and the division result stored in the division result storage unit 106, and determines, for each divided section, the topic included in that section. The topic analysis unit 107 outputs the determination result to the topic determination result storage unit 108 for storage. Furthermore, the topic analysis unit 107 creates, for each determined topic, a topic model representing the appearance tendency of words in that topic, and outputs the created topic model to the topic model storage unit 109 for storage.
  • The adapted language model creation unit 110 refers to the topic determination result stored in the topic determination result storage unit 108 and the topic models stored in the topic model storage unit 109. The adapted language model creation unit 110 then creates, for each divided section, an adapted language model adapted to that section from the base language model, using the topic model of the topic included in the section, and outputs the created adapted language model to the adapted language model storage unit 111 for storage.
  • The re-recognition unit 112 uses the adapted language model for each section stored in the adapted language model storage unit 111 to execute speech recognition (re-recognition) again on the corresponding section of the speech data stored in the speech data storage unit 101. The re-recognition unit 112 then outputs the re-recognition result to the re-recognition result storage unit 113 as text data.
  • Next, the operations of the language model adaptation device and the speech recognition device according to the first embodiment will be described with reference to the flowchart of FIG. 4, using FIGS. 5 to 8 for specific examples and referring to FIG. 3 as appropriate. In the first embodiment as well, the language model adaptation method is carried out by operating the language model adaptation device 100, so the description of the method is given through the following description of the operation of the device 100.
  • First, the speech recognition unit 103 performs speech recognition on the speech data using the base language model and converts it into text (step S201). Specifically, the speech recognition unit 103 reads the speech data stored in the speech data storage unit 101 and detects a plurality of utterances by a known speech detection technique. Furthermore, the speech recognition unit 103 converts the speech data into text by applying a known large-vocabulary continuous speech recognition technique to each utterance, and outputs the text to the recognition result storage unit 104.
  • In step S201, the base language model stored in the base language model storage unit 102 is used as the language model for speech recognition. As the acoustic model, for example, a known HMM acoustic model with phonemes as units can be used.
  • In step S201, for example, as shown in FIG. 5, the utterance from 10 to 13 seconds of the speech data is recognized as "Good evening, here is the news", and the utterance from 16 to 20 seconds as "The first news today is the baseball Japan Series".
  • Next, the dividing unit 105 divides the recognition result text stored in the recognition result storage unit 104 into a plurality of sections, and outputs the division result to the division result storage unit 106 (step S202). In the first embodiment, the recognition result is divided for each predetermined number of utterances; alternatively, it may be divided for each predetermined number of words. In the example of FIG. 6, the recognition result text is divided every five utterances: the recognition results are divided into section 1 composed of utterances 1 to 5, section 2 composed of utterances 6 to 10, and so on, and each section is associated with the recognition result text of the utterances constituting it. The number of utterances included in one section can be determined by conducting experiments in advance.
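  • As a concrete illustration, the division of step S202 amounts to chunking the utterance list into consecutive groups; the sketch below uses illustrative names and the five-utterance setting of FIG. 6:

    def divide_into_sections(utterances, utterances_per_section=5):
        # Split the recognized utterances into consecutive fixed-size sections.
        return [utterances[i:i + utterances_per_section]
                for i in range(0, len(utterances), utterances_per_section)]

    # sections[0] holds utterances 1-5 (section 1), sections[1] holds utterances 6-10, ...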
  • Next, the topic analysis unit 107 determines, for each divided section, the topic included in the section, and outputs the determination result to the topic determination result storage unit 108 (step S203). Subsequently, the topic analysis unit 107 creates, for each determined topic, a topic model indicating the appearance tendency of words in that topic, and outputs it to the topic model storage unit 109 (step S204).
  • In step S203, the topic analysis unit 107 clusters the sections based on the similarity of their semantic content, and judges that sections assigned to the same cluster have the same topic. In the example of FIG. 7, section 1, section 2, section 9, and section 10 are assigned to the same cluster, indicating that these sections are determined to be sections on the same topic (topic 1). Similarly, sections 3 and 4 are determined to be sections on topic 2, and sections 5, 6, 7, and 8 to be sections on topic 3.
  • For the clustering, for example, the text of each section may be expressed as a document vector and the cosine similarity between document vectors may be used as the inter-section similarity. A document vector is a vector in which each dimension holds the weight of a word appearing in the document; here, the weight of each word appearing in the text of the section may be used. Examples of word weights include the appearance frequency of the word in the document and the word weight calculated by the known TF-IDF method. For example, when the document vector of section 1 is d_1 and the document vector of section 2 is d_2, the cosine similarity \cos\theta of section 1 and section 2 is calculated by (Equation 5):

    \cos\theta = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert}    (Equation 5)

  • The cosine similarity \cos\theta is a real number from 0 to 1, and the closer the value is to 1, the more similar the two sections are.
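  • For instance, (Equation 5) over bag-of-words frequency vectors can be computed as sketched below; raw term frequency is used as the word weight, which is one of the options mentioned above (a TF-IDF weight would be a drop-in replacement):

    import math
    from collections import Counter

    def doc_vector(words):
        # Document vector: one dimension per distinct word, weighted by its frequency.
        return Counter(words)

    def cosine_similarity(d1, d2):
        # (Equation 5): cos(theta) = (d1 . d2) / (|d1| |d2|), between 0 and 1 here.
        dot = sum(w * d2.get(t, 0) for t, w in d1.items())
        norm1 = math.sqrt(sum(w * w for w in d1.values()))
        norm2 = math.sqrt(sum(w * w for w in d2.values()))
        return dot / (norm1 * norm2) if norm1 and norm2 else 0.0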
  • As the clustering algorithm, a known method such as the k-means method or hierarchical clustering may be used. The procedure when the k-means method is used with K clusters is as follows: (1) the document vectors of K randomly selected sections are taken as the representative points of the K clusters; (2) the document vector of every section is assigned to the nearest representative point, that is, the one with the highest similarity; (3) the centroid of the document vectors in each cluster is computed and used as its new representative point. Steps (2) and (3) are executed repeatedly, and the clustering is terminated when the sections (document vectors) assigned to each cluster no longer change.
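  • A schematic rendering of this procedure, reusing the cosine_similarity helper from the previous sketch (illustrative code, not the patent's implementation):

    import random

    def kmeans_sections(vectors, K, max_iters=100):
        # Cluster section document vectors: (1) seed, (2) assign, (3) re-center, repeat.
        reps = random.sample(vectors, K)         # (1) K random sections as representatives
        assignment = None
        for _ in range(max_iters):
            new_assignment = [max(range(K), key=lambda k: cosine_similarity(v, reps[k]))
                              for v in vectors]  # (2) assign to the most similar representative
            if new_assignment == assignment:     # stop when assignments no longer change
                break
            assignment = new_assignment
            for k in range(K):                   # (3) centroid of each cluster as new representative
                members = [v for v, a in zip(vectors, assignment) if a == k]
                if members:
                    keys = set().union(*members)
                    reps[k] = {t: sum(m.get(t, 0) for m in members) / len(members)
                               for t in keys}
        return assignment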
  • The simplest hierarchical clustering procedure is as follows: when the total number of sections is N, (1) N clusters each containing only one section are created; (2) the two clusters with the highest similarity are merged. If this merging is repeated until the number of clusters reaches K, K clusters are obtained. As the similarity between two clusters, a known method such as the shortest distance method, the longest distance method, or the Ward method may be used, and as the similarity between two sections, the cosine similarity of the document vectors can be used.
  • Clustering methods other than those listed here may also be used. Likewise, the similarity between two sections is not limited to the cosine similarity; any measure can be used as long as it can measure the similarity of semantic content between two sections. The number of clusters K may be determined, for example, in proportion to the total number of sections or to the total number of utterances, or of course by other methods.
  • As described above, in step S203 the sections divided by the dividing unit 105 are clustered, and the result is output to the topic determination result storage unit 108.
  • In step S204, the topic analysis unit 107 creates a topic model for each topic determined in step S203. A topic model of a certain topic is a model representing the appearance tendency of words in that topic. Specifically, the topic model can be created as follows: (1) the texts of the plurality of sections determined by the clustering to belong to the same topic (cluster) are collected; (2) a probability model representing the word generation probability is learned from the text collected for each topic (cluster). As the probability model, for example, an N-gram language model may be used.
  • In the example of FIG. 7, the sections determined to be topic 1 are sections 1, 2, 9, and 10, so the texts of these four sections are collected as one group and the topic model of topic 1 is learned from them. Similarly, the topic model of topic 2 may be learned from the texts of sections 3 and 4, and the topic model of topic 3 from the texts of sections 5, 6, 7, and 8.
  • As described above, in step S204 a topic model representing the appearance tendency of words is created for each topic obtained by the clustering and output to the topic model storage unit 109. Therefore, even when a plurality of topics are included in the recognition target speech, a topic model representing the appearance tendency of words can be created separately for each topic. Moreover, even if the same topic appears several times at intervals, a single topic model can be created that aggregates the appearance tendency of words over all the sections of that topic.
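  • Creating the topic models of step S204 from the clustering result can be sketched as follows; pooling the text per cluster follows the procedure above, while the add-one smoothing is an illustrative choice rather than something the text specifies:

    from collections import Counter

    def learn_topic_models(section_texts, assignment, vocab):
        # Pool the text of the sections in each cluster, then estimate P(w | topic).
        pooled = {}
        for words, topic in zip(section_texts, assignment):
            pooled.setdefault(topic, []).extend(words)
        models = {}
        for topic, words in pooled.items():
            counts = Counter(words)
            total = len(words) + len(vocab)  # add-one smoothing over the vocabulary
            models[topic] = {w: (counts[w] + 1) / total for w in vocab}
        return models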
  • Next, the adapted language model creation unit 110 refers to the determination result stored in the topic determination result storage unit 108 and the topic models stored in the topic model storage unit 109, and creates, for each section obtained by the division, an adapted language model adapted to that section (step S205). The created adapted language models are output to the adapted language model storage unit 111.
  • Specifically, the adapted language model creation unit 110 creates a different adapted language model for each section by adapting the base language model stored in the base language model storage unit 102 using the topic model of the topic determined for that section. In the example of FIG. 7, for section 1, section 2, section 9, and section 10, an adapted language model is created by adapting the base language model with the topic model of topic 1; for sections 3 and 4, an adapted language model is created with the topic model of topic 2; and for sections 5 to 8, with the topic model of topic 3.
  • Examples of the adaptation method include linear interpolation between the base language model and the corresponding topic model. Specifically, when the base language model P_BASE and the topic model P_TOPIC_1 corresponding to topic 1 are both trigram models, the adapted language model P_ADAPT_1 corresponding to the sections of topic 1 is obtained by (Equation 6):

    P_{ADAPT\_1}(w_i \mid w_{i-2} w_{i-1}) = \lambda P_{BASE}(w_i \mid w_{i-2} w_{i-1}) + (1 - \lambda) P_{TOPIC\_1}(w_i \mid w_{i-2} w_{i-1})    (Equation 6)

  • Here \lambda is a constant between 0 and 1 that can be determined experimentally in advance. In this example the base language model, the topic model, and the adapted language model are all trigrams, but other N-gram language models such as bigrams or unigrams may be used.
  • Alternatively, when the topic model P_TOPIC_1 is a unigram model, the base language model may be adapted by the known unigram rescaling method, in which case the adapted language model P_ADAPT_1 is obtained by (Equation 7), where Z(w_{i-2} w_{i-1}) is a normalization constant ensuring that the probabilities sum to 1:

    P_{ADAPT\_1}(w_i \mid w_{i-2} w_{i-1}) = \frac{1}{Z(w_{i-2} w_{i-1})} \cdot \frac{P_{TOPIC\_1}(w_i)}{P_{BASE}(w_i)} \cdot P_{BASE}(w_i \mid w_{i-2} w_{i-1})    (Equation 7)
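  • Under the standard formulation of unigram rescaling assumed for (Equation 7) above, an adapted probability can be sketched as follows; the identifiers are illustrative, with p_topic_uni and p_base_uni standing for the topic and base unigram models:

    def unigram_rescaled_prob(w, history, p_base, p_topic_uni, p_base_uni, vocab):
        # Assumed (Equation 7): P_ADAPT(w|h) proportional to (P_TOPIC(w)/P_BASE(w)) * P_BASE(w|h).
        def scaled(v):
            return (p_topic_uni(v) / p_base_uni(v)) * p_base(v, history)
        z = sum(scaled(v) for v in vocab)  # renormalize so the probabilities sum to 1
        return scaled(w) / z if z else 0.0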
  • The adapted language model finally created for each section by the processing of steps S201 to S205 reflects the appearance tendency of words in the sections having the same topic as that section, while it is not influenced by the appearance tendency of words in sections having different topics. For example, the adapted language model created for section 9 learns the appearance tendency of words from the texts of sections 1, 2, and 10, which have the same topic as section 9, whereas the texts of sections 3 to 8, whose topics differ from section 9, are not used for the learning.
  • Accordingly, the adapted language model created for each section by the present invention is a model that can accurately predict the appearance tendency of words in the section even when the topic changes suddenly in the recognition target speech or when the same topic recurs in it. In other words, since the appearance tendency of words differs from topic to topic, it is generally desirable in language model adaptation to use only the appearance tendency of words within the same topic, and the present invention makes such adaptation possible.
  • A cache model (see Non-Patent Document 1) is a typical method for adapting a language model using the appearance tendency of words in the recognition target speech. With the cache model, however, when an adapted language model is created for, say, section 9 (topic 1), the adaptation is performed based on the appearance tendency of words in the immediately preceding section 8 (topic 3). Since the appearance tendencies of words in the two topics are considered to differ, an appropriate adapted language model cannot be obtained. In contrast, the present invention can adapt the base language model without causing such topic inconsistency, so a sufficient adaptation effect can be obtained even when various topics appear in the recognition target speech.
  • Finally, the re-recognition unit 112 uses the adapted language model for each section stored in the adapted language model storage unit 111 to execute speech recognition (re-recognition) again on the corresponding section of the speech data stored in the speech data storage unit 101 (step S206). The re-recognition result is output to the re-recognition result storage unit 113 as text data. In the example of FIG. 6, utterances 1 to 5 are re-recognized with the adapted language model created for section 1, utterances 6 to 10 with the adapted language model created for section 2, and so on for all sections.
  • As described above, the language model adapted to each section by the present invention is a model that can accurately predict the appearance tendency of words in that section even when there are topic changes or repetitions in the recognition target speech. For this reason, the re-recognition unit 112 can recognize the speech data with high accuracy.
  • In the example described above, the dividing unit 105 divides the recognition result text into a plurality of sections by splitting the recognition result every predetermined number of utterances, but the sections may also be allowed to overlap. FIG. 8 shows such an example: a division result in which every 10 utterances are grouped into one section and the sections are shifted by two utterances each, together with the topic determination result for each section. Even with such overlapping sections, the topic analysis unit 107 can operate in exactly the same way.
  • When the re-recognition unit 112 re-recognizes using the adapted language model corresponding to each section, it may re-recognize only the utterances at the center of the section. In the example of FIG. 8, utterance 13 and utterance 14 may be recognized using the adapted language model corresponding to section 5, and utterance 15 and utterance 16 using the adapted language model corresponding to section 6. In this way, the utterances to be re-recognized can be prevented from overlapping.
  • The dividing unit 105 can also detect boundaries of semantic content in the recognition result and divide the recognition result text based on the detection result. For example, following the method described in Reference Document 1, the dividing unit 105 may detect change points of the word distribution in the recognition result text and divide the text at the detected change points. The dividing unit 105 can also divide the recognition result text at the positions where topic boundary expressions appear, when expressions representing topic boundaries are determined in advance; examples of topic boundary expressions include topic-changing expressions such as "Well then", "Now", and "Next". Alternatively, the dividing unit 105 can obtain the topic model sequence that best matches the recognition result text by the method described in Reference Document 2, regard the topic change points of that sequence as boundaries, and divide the recognition result text at those change points.
  • Note that the topic models used here are not the topic models created by the topic analysis unit 107; they are topic models prepared in advance independently of the recognition result text, each representing the appearance tendency of words for a predetermined topic. Such topic models can be obtained, for example, by learning in advance from a large amount of text data to which topic information has been assigned.
  • When the recognition result is divided mechanically, for example every fixed number of utterances, each section does not necessarily correspond to a single topic, and a plurality of topics may be mixed in sections near topic boundaries. If the dividing unit 105 divides the recognition result text into a plurality of sections based on semantic content boundaries, each section is more likely to correspond to a single topic. For this reason, the clustering accuracy and the topic model estimation accuracy in the topic analysis unit 107 improve, and the accuracy of the finally obtained adapted language model further improves.
  • As described above, according to the first embodiment, a sufficient adaptation effect of the language model can be obtained even when the topic changes suddenly in the recognition target speech and even when the same topic recurs at intervals in the recognition target speech. Furthermore, if the recognition target speech is re-recognized using the adapted language models generated by the language model adaptation device 100 of the first embodiment, a recognition result with higher accuracy can be obtained.
  • FIG. 9 is a diagram illustrating a specific example of the topic determination result in the second embodiment of the present invention.
  • In the second embodiment, the topic analysis unit 107 assumes, with the topic z as a hidden variable, a probability model that represents the appearance probability P(w | d) of a word w in a section d as in (Equation 8):

    P(w \mid d) = \sum_{z} P(w \mid z) \, P(z \mid d)    (Equation 8)

  • The topic z is a hidden variable that cannot be observed from the outside, and the number of topics is set to an appropriate value in advance. The number of topics z may be set, for example, in proportion to the total number of sections or to the total number of utterances, as when the number of clusters was determined in the first embodiment, or it may be set by another method.
  • In (Equation 8), P(w | z) represents the word appearance probability in the topic z, and P(z | d) represents the proportion of each topic z included in the section d. The probability model of (Equation 8) can therefore be regarded as a model in which the word appearance probability P(w | d) in each section is a mixture of the word appearance probabilities P(w | z) of the topics, weighted by the topic proportions P(z | d) of the section.
  • In the second embodiment, the topics included in each section are determined by learning the parameters of the probability model of (Equation 8) using the text of each section as training data (see S203 in FIG. 4), and at the same time the topic models are created (see S204 in FIG. 4). That is, in the second embodiment, P(z | d) corresponds to the topic determination result of section d, and P(w | z) corresponds to the topic model of topic z.
  • The parameters of the probability model of (Equation 8) can be estimated by maximum likelihood estimation on the training data using the EM algorithm, as described in Reference Document 3: the computation of the posterior P(z | d, w) in the E-step and the re-estimation of P(w | z) and P(z | d) in the M-step are repeated alternately. Here n(w, d) denotes the number of appearances of the word w in the section d, and r denotes the number of iterations.
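  • A compact EM loop for the mixture model of (Equation 8) is sketched below; the update equations follow the standard derivation for this kind of model, which we assume matches Reference Document 3, and all identifiers are illustrative:

    import random

    def plsa_em(n_wd, sections, vocab, n_topics, iters=50):
        # Learn P(w|z) and P(z|d) of (Equation 8) by EM; n_wd[(w, d)] = n(w, d).
        rng = random.Random(0)
        p_w_z = {z: {w: rng.random() for w in vocab} for z in range(n_topics)}
        for z in p_w_z:  # normalize each topic's word distribution
            s = sum(p_w_z[z].values())
            p_w_z[z] = {w: v / s for w, v in p_w_z[z].items()}
        p_z_d = {d: {z: 1.0 / n_topics for z in range(n_topics)} for d in sections}
        for _ in range(iters):
            # E-step: responsibilities P(z | d, w) for every observed (w, d) pair.
            resp = {}
            for (w, d), n in n_wd.items():
                denom = sum(p_w_z[z][w] * p_z_d[d][z] for z in range(n_topics))
                resp[(w, d)] = {z: p_w_z[z][w] * p_z_d[d][z] / denom
                                for z in range(n_topics)}
            # M-step: accumulate expected counts, then renormalize both distributions.
            num_wz = {z: {w: 0.0 for w in vocab} for z in range(n_topics)}
            num_zd = {d: {z: 0.0 for z in range(n_topics)} for d in sections}
            for (w, d), n in n_wd.items():
                for z in range(n_topics):
                    num_wz[z][w] += n * resp[(w, d)][z]
                    num_zd[d][z] += n * resp[(w, d)][z]
            for z in range(n_topics):
                total = sum(num_wz[z].values())
                p_w_z[z] = {w: c / total for w, c in num_wz[z].items()}
            for d in sections:
                total = sum(num_zd[d].values())
                p_z_d[d] = {z: c / total for z, c in num_zd[d].items()}
        return p_w_z, p_z_d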
  • As shown in FIG. 9, the topic determination result in the second embodiment is expressed as a combination of a plurality of topics. For example, section 1 is determined to contain topic 1 with weight 0.8, topic 2 with weight 0.1, and topic 5 with weight 0.1, where each weight is the value computed as P(z | d) for section 1.
  • In the second embodiment as well, the adapted language model creation unit 110 refers to the topic determination result stored in the topic determination result storage unit 108 and the topic models stored in the topic model storage unit 109. Based on these, the adapted language model creation unit 110 creates, for each section obtained by the division, an adapted language model adapted to that section, and outputs it to the adapted language model storage unit 111 (see S205 in FIG. 4). Since the topic analysis unit 107 has already obtained the topic proportions P(z | d) for each section, the adapted language model creation unit 110 can adapt the base language model for each section using the topic models of the topics included in the section, weighted by their proportions.
  • According to the second embodiment, the language model can be adapted after the topics in the recognition target speech are expressed as combinations of a plurality of basic topics and the appearance tendency of words in each basic topic has been extracted. For example, even in a case where "section 1 contains topic 1, section 2 contains topic 1 and topic 2, and section 3 contains topic 2", according to the second embodiment the proportions of the topics contained in each section can be determined, and then the topic models of topic 1 and topic 2 can each be created; in other words, a topic model can be created for each topic even when topics are mixed within sections.
  • Furthermore, in the second embodiment, when the adapted language model of each section is created, the plurality of topic models are weighted according to the proportions of the topics included in the section, so the adaptation effect of the language model improves.
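  • One way to realize this weighting is sketched below, under the assumption that the adapted model of a section interpolates the base model with the P(z | d)-weighted mixture of topic unigram models; the exact combination rule is our assumption, not fixed by the text above:

    def adapted_prob_mixture(w, history, d, p_base, p_w_z, p_z_d, lam=0.7):
        # Mixture of topic models weighted by the section's topic proportions P(z|d),
        # interpolated with the base N-gram model as in the first embodiment.
        topic_mix = sum(p_z_d[d][z] * p_w_z[z][w] for z in p_z_d[d])
        return lam * p_base(w, history) + (1 - lam) * topic_mix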
  • The language model adaptation device and the speech recognition device in the first and second embodiments described above can be realized, for example, by using a program (language model adaptation program) that causes a computer to execute steps S201 to S206 shown in FIG. 4. That is, if the language model adaptation program is installed in a computer and executed, the language model adaptation device and the speech recognition device are realized.
  • FIG. 10 is a block diagram illustrating an example of a computer that can implement the language model adaptation device and the speech recognition device according to the first and second embodiments of the present invention.
  • the computer 300 includes a data processing device 320 including a CPU and a storage device 330 including a magnetic disk, a semiconductor memory, and the like.
  • In the storage area of the storage device 330, a speech data storage unit 331, a base language model storage unit 332, a recognition result storage unit 333, a division result storage unit 334, a topic determination result storage unit 335, a topic model storage unit 336, an adapted language model storage unit 337, and a re-recognition result storage unit 338 are constructed; that is, the storage device 330 functions as these storage units 331 to 338.
  • The language model adaptation program 310 is loaded into the data processing device 320 by being read from a computer-readable recording medium or a storage device, or by transmission via a network, and controls the operation of the data processing device 320. Under the control of the language model adaptation program 310, the data processing device 320 functions as the speech recognition unit 103, the dividing unit 105, the topic analysis unit 107, the adapted language model creation unit 110, and the re-recognition unit 112 (see FIG. 3) and executes the processing. Examples of the computer-readable recording medium include an optical disk, a magnetic disk, a magneto-optical disk, a semiconductor memory, and a floppy disk.
  • The language model adaptation device, speech recognition device, language model adaptation method, and computer-readable recording medium according to the present invention have the following characteristics.
  • (1) A language model adaptation device for adapting a base language model, comprising: a dividing unit that divides input text into a plurality of sections; a topic analysis unit that determines the topic included in each of the plurality of sections and creates, for each determined topic, a topic model representing the appearance tendency of words in that topic; and an adapted language model creation unit that, for each of the plurality of sections, adapts the base language model using the topic model corresponding to the topic included in the section, thereby creating an adapted language model.
  • The language model adaptation device according to (1) above, further comprising a speech recognition unit that performs speech recognition on speech data using the base language model and inputs the text obtained by the speech recognition to the dividing unit.
  • The language model adaptation device according to (1) above, wherein the topic analysis unit classifies the plurality of sections into groups based on the similarity between the texts of the sections, determines that the topics of sections belonging to the same group are common, and creates, for each group, the topic model based on the appearance tendency of words in the texts of the sections belonging to that group.
  • The language model adaptation device according to (1) above, wherein the topic analysis unit assumes a probability model representing the appearance tendency of words in each of the plurality of sections with the topic as a hidden variable, and learns the parameters of the probability model using the text of each of the plurality of sections as training data, the learning determining the topics and creating the topic models.
  • A speech recognition device that performs speech recognition while adapting a language model, comprising: a speech recognition unit that performs speech recognition on speech data using the language model; a dividing unit that divides the text obtained by the speech recognition into a plurality of sections; a topic analysis unit that determines the topic included in each of the plurality of sections and creates, for each determined topic, a topic model representing the appearance tendency of words in that topic; an adapted language model creation unit that, for each of the plurality of sections, adapts the base language model using the topic model corresponding to the topic included in the section, thereby creating an adapted language model; and a re-recognition unit that performs speech recognition, using each adapted language model created by the adapted language model creation unit, on the speech data of the section corresponding to that adapted language model.
  • A language model adaptation method for adapting a base language model, comprising the steps of: (a) dividing text into a plurality of sections; (b) determining the topic included in each of the plurality of sections and creating, for each determined topic, a topic model representing the appearance tendency of words in that topic; and (c) for each of the plurality of sections, adapting the base language model using the topic model corresponding to the topic included in the section, thereby creating an adapted language model.
  • The language model adaptation method above, further comprising the step of performing speech recognition on speech data using the base language model, thereby generating the text to be divided in step (a).
  • The language model adaptation method above, wherein step (b) comprises classifying the plurality of sections into groups based on the similarity between the texts of the sections, determining that the topics of sections belonging to the same group are common, and creating, for each group, the topic model based on the appearance tendency of words in the texts of the sections belonging to that group.
  • A computer-readable recording medium recording a program for causing a computer to execute adaptation of a base language model, the program including instructions that cause the computer to carry out the steps of: (a) dividing text into a plurality of sections; (b) determining the topic included in each of the plurality of sections and creating, for each determined topic, a topic model representing the appearance tendency of words in that topic; and (c) for each of the plurality of sections, adapting the base language model using the topic model corresponding to the topic included in the section, thereby creating an adapted language model.
  • The computer-readable recording medium according to (18) above, wherein in step (b), a probability model representing the appearance tendency of words in each of the plurality of sections is assumed with the topic as a hidden variable, and the parameters of the probability model are learned using the text of each of the plurality of sections as training data, the learning determining the topics and creating the topic models.
  • The present invention can be applied to uses such as automatic speech recognition systems that recognize speech data containing various topics, such as conference speech, lecture speech, and broadcast speech, and output text information, and computer programs for realizing such systems. The present invention can also be applied to uses such as information retrieval systems that search such speech data using the text information of the recognition results.
  • Reference numerals: 10 speech recognition apparatus; 100 language model adaptation apparatus; 101 speech data storage unit; 102 base language model storage unit; 103 speech recognition unit; 104 recognition result storage unit; 105 dividing unit; 106 division result storage unit; 107 topic analysis unit; 108 topic determination result storage unit; 109 topic model storage unit; 110 adapted language model creation unit; 111 adapted language model storage unit; 112 re-recognition unit; 113 re-recognition result storage unit; 300 computer; 310 language model adaptation program; 320 data processing device; 330 storage device; 331 speech data storage unit; 332 base language model storage unit; 333 recognition result storage unit; 334 division result storage unit; 335 topic determination result storage unit; 336 topic model storage unit; 337 adapted language model storage unit; 338 re-recognition result storage unit.


Abstract

A language model adaptation device (100) which adapts a base language model. The language model adaptation device (100) is provided with: a segmentation unit (105) which segments an inputted text into a plurality of segments; a topic analysis unit (107) which identifies topics contained in the respective segments and creates, on the basis of each identified topic, topic models expressing the appearance tendency of words in the topics; and an adapted language model creation unit (110) which adapts the base language model on a segment-by-segment basis using the topic model corresponding to the topics contained in the segment, thereby creating an adapted language model.

Description

Language model adaptation apparatus, speech recognition apparatus, language model adaptation method, and computer-readable recording medium

 The present invention relates to a language model adaptation device that adapts a language model used in speech recognition according to the recognition result of the recognition target speech, a speech recognition device using the same, a language model adaptation method, and a computer-readable recording medium on which a program for realizing these is recorded.

 In the speech recognition process that converts the content of utterances contained in speech data into text, statistical language models that give the generation probability of word strings are used. The most widely used statistical language model in speech recognition is the N-gram language model. The N-gram language model is a probability model that assumes that the generation probability of a word at a given position depends only on the immediately preceding N-1 words.

 That is, the generation probability of the i-th word w_i is given by P(w_i | w_{i-N+1}^{i-1}), where w_{i-N+1}^{i-1} denotes the (i-N+1)-th through (i-1)-th words. Usually N = 2 or N = 3 is used, called bigram and trigram, respectively. With an N-gram language model, the generation probability P(w_1^n) of the word string w_1^n = (w_1, w_2, ..., w_n) is expressed by (Equation 1):

    P(w_1^n) = \prod_{i=1}^{n} P(w_i \mid w_{i-N+1}^{i-1})    (Equation 1)

 An N-gram language model is created by, for example, maximum likelihood estimation using large-scale training text data. Maximum likelihood estimation is a method of learning the model parameters so that the generation probability of the training text data is maximized. However, it is usually difficult to prepare training text data whose content completely matches the recognition target speech, so a language model created in advance does not always appropriately represent the appearance tendency of words in the recognition target speech. A technique for adapting a language model to the characteristics of the recognition target speech is therefore desired.

 認識対象音声の特徴にあわせて言語モデルを適応化する技術として、キャッシュモデルが提案されている(例えば、非特許文献1参照。)。キャッシュモデルは、「同じ単語や言い回しは繰り返し使われやすい」という言葉の性質を利用して、言語モデルを認識対象音声に適応化する手法である。以下では、あらかじめ作成したN-gram言語モデルをPBASE(w|wi-N+1 i-1)とし、これをキャッシュモデルにより適応化することで、認識対象中のi番目の単語wの生成確率を与える言語モデルを求める場合について説明する。 A cache model has been proposed as a technique for adapting a language model in accordance with the characteristics of speech to be recognized (see Non-Patent Document 1, for example). The cache model is a method of adapting the language model to the speech to be recognized by utilizing the property of the word that “the same word or phrase is easily used repeatedly”. In the following, the previously created N-gram language model is P BASE (w i | w i-N + 1 i-1 ), and this is adapted by the cache model, so that the i-th word w i in the recognition target The case of obtaining a language model that gives a generation probability will be described.

 First, taking the M words w_{i-M}^{i-1} that appear immediately before the word w_i as the cache interval, the cache probability P_CACHE(w_i | w_{i-M}^{i-1}) is computed by Equation 2 below. The recognition result can be used as the immediately preceding M words w_{i-M}^{i-1}.

$$P_{\mathrm{CACHE}}(w_i \mid w_{i-M}^{i-1}) = \frac{1}{M} \sum_{j=i-M}^{i-1} \delta(w_j, w_i) \qquad \text{(Equation 2)}$$

 Here δ(·,·) is the Kronecker delta, which is 1 when its two arguments are equal and 0 otherwise. The cache probability P_CACHE(w_i | w_{i-M}^{i-1}) is therefore the word distribution within the cache interval of the M words immediately preceding w_i, and it can be regarded as representing the word occurrence tendency in the vicinity of the word w_i being recognized. The cache length M is determined experimentally in advance.

 Next, using the cache probability P_CACHE(w_i | w_{i-M}^{i-1}), the previously prepared N-gram language model P_BASE(w_i | w_{i-N+1}^{i-1}) is adapted to obtain an adapted language model P_ADAPT(w_i | w_{i-N+1}^{i-1}, w_{i-M}^{i-1}) that gives the generation probability of the i-th word w_i. Concretely, P_BASE and P_CACHE are linearly interpolated by Equation 3 below.

$$P_{\mathrm{ADAPT}}(w_i \mid w_{i-N+1}^{i-1}, w_{i-M}^{i-1}) = \lambda\, P_{\mathrm{BASE}}(w_i \mid w_{i-N+1}^{i-1}) + (1-\lambda)\, P_{\mathrm{CACHE}}(w_i \mid w_{i-M}^{i-1}) \qquad \text{(Equation 3)}$$

 Here λ is a constant between 0 and 1, determined experimentally in advance. The language model P_ADAPT obtained in this way is based on the previously prepared language model P_BASE while also reflecting the word occurrence tendencies of the speech being recognized.
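A minimal sketch of Equations 2 and 3. The callable `p_base` stands in for the pre-built N-gram model, and the values of λ and M are illustrative only, since the text states both are fixed experimentally.

```python
from collections import Counter

def cache_probability(word, recognized_words, M=200):
    """Equation 2: relative frequency of `word` among the M most recently
    recognized words (the cache interval)."""
    cache = recognized_words[-M:]
    if not cache:
        return 0.0
    return Counter(cache)[word] / len(cache)

def adapted_probability(word, context, recognized_words, p_base, lam=0.8, M=200):
    """Equation 3: linear interpolation of the base N-gram probability and the
    cache probability. `p_base(word, context)` is a placeholder for
    P_BASE(w_i | w_{i-N+1}^{i-1})."""
    return lam * p_base(word, context) + (1.0 - lam) * cache_probability(word, recognized_words, M)
```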

 While Non-Patent Document 1 treats the immediately preceding M words as equally important, a method has also been proposed that computes the cache probability on the assumption that words closer to w_i have greater influence (see, for example, Non-Patent Document 2). In Non-Patent Document 2, the cache probability P_CACHE is computed by Equation 4 below so that a word's influence decreases with its distance from w_i.

$$P_{\mathrm{CACHE}}(w_i \mid w_1^{i-1}) = \beta \sum_{j=1}^{i-1} \delta(w_j, w_i)\, e^{-\alpha (i-j)} \qquad \text{(Equation 4)}$$

 In Equation 4, α is the decay rate of a word's influence and β is a normalization constant. Note that in Equation 4 the cache interval is taken to be all words preceding the word w_i.
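A corresponding sketch of the decaying cache, assuming the exponential weight e^{-α(i-j)} of Equation 4; the normalization constant β is realized here as explicit division by the total weight of the history. The value of α is illustrative.

```python
import math

def decaying_cache_probability(word, recognized_words, alpha=0.005):
    """Equation 4 (sketch): each earlier occurrence of `word` contributes a
    weight decaying exponentially with its distance from the current position;
    dividing by the total weight plays the role of beta."""
    i = len(recognized_words) + 1  # position of the word being predicted
    weights = [math.exp(-alpha * (i - j)) for j in range(1, i)]
    total = sum(weights)
    if total == 0.0:
        return 0.0
    matched = sum(w for w, v in zip(weights, recognized_words) if v == word)
    return matched / total
```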

[Non-Patent Document 1] R. Kuhn and R. de Mori, "A Cache-Based Natural Language Model for Speech Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 570-583, 1990.
[Non-Patent Document 2] P. R. Clarkson and A. J. Robinson, "Language Model Adaptation Using Mixtures and an Exponentially Decaying Cache," Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 799-802, 1997.

 However, the techniques disclosed in Non-Patent Documents 1 and 2 share a problem: when the topic of the speech to be recognized changes abruptly, or when the same topic recurs after an interval, the adaptation effect of the language model is insufficient.

 The reason is that these techniques adapt the language model using a cache interval that is set without regard to topic changes in the speech being recognized. Here, a topic may be, for example, a news genre such as "politics," "economics," or "sports" in the case of a news program; an individual news item such as "the House of Representatives election campaign" or "company XX goes bankrupt"; or, in the case of meeting audio, an item on the meeting agenda.

 This problem is explained concretely below, taking the recognition of news program audio as an example. FIG. 11 is a diagram for explaining the conventional problem, showing a news program made up of news items on various topics. The topics of sections 1 through 4 are "sports," "politics," "economics," and "sports," respectively. "Sports" appears in two places: a sports item is covered at the start of the program, and the same item is then covered again in detail in the sports corner in the second half of the program.

 Now consider adapting the language model with the technique of Non-Patent Document 1 or 2 to create an adapted language model for use at time T. The important point is that word occurrence tendencies differ completely from topic to topic. In sports news, for example, the names of athletes can be expected to appear with high probability, whereas in economic news economic terms can be expected to appear with high probability. The adapted language model used at time T should therefore reflect the word occurrence tendencies of sections 1 and 4, which share the same topic (sports) as time T, while the word occurrence tendencies of sections 2 and 3, whose topics differ from that of time T, should not be used for adaptation.

 With the technique of Non-Patent Document 1 or 2, however, as in the example of FIG. 11, section 3, whose topic (economics) differs from the topic to be adapted to (sports), may be included in the cache interval, while section 1, which is distant from time T, may fall outside it. As a result, a cache probability that properly reflects the occurrence tendencies of the relevant words cannot be obtained, and the adaptation effect is insufficient. Because these techniques set the cache interval without considering topic changes in the speech being recognized, this problem cannot be avoided.

 An object of the present invention is to solve the above problem and to provide a language model adaptation device, a speech recognition device, a language model adaptation method, and a computer-readable recording medium that can obtain a sufficient adaptation effect of the language model even when the topic changes abruptly and even when the same topic recurs after an interval.

 To achieve the above object, a language model adaptation device according to the present invention is a language model adaptation device that adapts a base language model, and includes:
 a dividing unit that divides an input text into a plurality of sections;
 a topic analysis unit that determines the topic contained in each of the plurality of sections and, for each determined topic, creates a topic model representing the occurrence tendency of words in that topic; and
 an adapted language model creation unit that, for each of the plurality of sections, adapts the base language model using the topic model corresponding to the topic contained in that section, thereby creating an adapted language model.

 To achieve the above object, a speech recognition device according to the present invention is a speech recognition device that performs speech recognition while adapting a language model, and includes:
 a speech recognition unit that performs speech recognition on speech data using the language model;
 a dividing unit that divides the text obtained by the speech recognition into a plurality of sections;
 a topic analysis unit that determines the topic contained in each of the plurality of sections and, for each determined topic, creates a topic model representing the occurrence tendency of words in that topic;
 an adapted language model creation unit that, for each of the plurality of sections, adapts the base language model using the topic model corresponding to the topic contained in that section, thereby creating an adapted language model; and
 a re-recognition unit that, using the adapted language model created by the adapted language model creation unit, performs speech recognition on the speech data of the section to which that adapted language model corresponds.

 To achieve the above object, a language model adaptation method according to the present invention is a method for adapting a base language model, and includes the steps of:
 (a) dividing a text into a plurality of sections;
 (b) determining the topic contained in each of the plurality of sections and, for each determined topic, creating a topic model representing the occurrence tendency of words in that topic; and
 (c) for each of the plurality of sections, adapting the base language model using the topic model corresponding to the topic contained in that section, thereby creating an adapted language model.

 To achieve the above object, a computer-readable recording medium according to the present invention records a program for causing a computer to adapt a base language model, the program including instructions that cause the computer to execute the steps of:
 (a) dividing a text into a plurality of sections;
 (b) determining the topic contained in each of the plurality of sections and, for each determined topic, creating a topic model representing the occurrence tendency of words in that topic; and
 (c) for each of the plurality of sections, adapting the base language model using the topic model corresponding to the topic contained in that section, thereby creating an adapted language model.

 With the above features, the language model adaptation device, speech recognition device, language model adaptation method, and computer-readable recording medium of the present invention can obtain a sufficient adaptation effect of the language model even when the topic changes abruptly and even when the same topic recurs after an interval.

FIG. 1 is a block diagram showing the schematic configuration of a language model adaptation device according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the operation of the language model adaptation device according to the embodiment of the present invention.
FIG. 3 is a block diagram showing the schematic configuration of the language model adaptation device and speech recognition device according to Example 1 of the present invention.
FIG. 4 is a flowchart showing the operation of the language model adaptation device and speech recognition device according to Example 1 of the present invention.
FIG. 5 is a diagram explaining a specific example of the recognition result in Example 1 of the present invention.
FIG. 6 is a diagram explaining a specific example of the division result in Example 1 of the present invention.
FIG. 7 is a diagram explaining a specific example of the topic determination result in Example 1 of the present invention.
FIG. 8 is a diagram explaining another specific example of the division result in Example 1 of the present invention.
FIG. 9 is a diagram explaining a specific example of the topic determination result in Example 2 of the present invention.
FIG. 10 is a block diagram showing an example of a computer capable of realizing the language model adaptation devices and speech recognition devices of Examples 1 and 2 of the present invention.
FIG. 11 is a diagram for explaining a conventional problem.

 (Embodiment)
 A language model adaptation device and adaptation method according to an embodiment of the present invention are described below with reference to FIGS. 1 and 2. First, the configuration of the language model adaptation device of this embodiment is described using FIG. 1. FIG. 1 is a block diagram showing the schematic configuration of the language model adaptation device according to the embodiment of the present invention.

 As shown in FIG. 1, the language model adaptation device 100 of this embodiment is a device that adapts a language model serving as a base (hereinafter referred to as the "base language model"). The language model adaptation device 100 includes a dividing unit 105, a topic analysis unit 107, and an adapted language model creation unit 110.

 In this embodiment, the language model adaptation device 100 further includes a base language model storage unit 102 that stores the base language model. The base language model can be, for example, a known N-gram language model, i.e., a probability model that gives word generation probabilities on the assumption that the generation probability of a word at a given point depends only on the immediately preceding N-1 words.

 The dividing unit 105 divides the input text into a plurality of sections. The topic analysis unit 107 determines the topic contained in each of the sections. Here, "determining the topic" is not limited to identifying a specific topic; it also includes determining whether the topic of one section is the same as, or similar to, the topic of another section. After determining the topics, the topic analysis unit 107 creates, for each determined topic, a topic model representing the occurrence tendency of words in that topic.

 The adapted language model creation unit 110 then adapts the base language model for each section using the topic model corresponding to the topic contained in that section, creating an adapted language model. The created adapted language model is then output.

 Next, the operation of the language model adaptation device 100 shown in FIG. 1 is described using FIG. 2. FIG. 2 is a flowchart showing the operation of the language model adaptation device according to the embodiment of the present invention. In this embodiment, the language model adaptation method is carried out by operating the language model adaptation device 100, so the description of the method is given as the following description of the device's operation.

 As shown in FIG. 2, when a text is first input to the language model adaptation device 100, the dividing unit 105 accepts the input of the text (step S1). Any text may be input in step S1; concrete examples include text created using the base language model, such as text obtained by speech recognition or text obtained by machine translation.

 Next, the dividing unit 105 divides the input text into a plurality of sections (step S2). In step S2, the dividing unit 105 can divide the text based, for example, on the number of words in the text or on the number of utterances (in the case of speech recognition). Alternatively, it can divide the text based on change points in the word distribution within the text, on the positions at which predetermined topic-boundary expressions appear in the text, or by using models representing the occurrence tendencies of words related to predetermined topics.
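As one concrete possibility (a sketch of only the simplest strategy the text allows), fixed-size division by utterance count can be written as:

```python
def split_into_sections(utterances, per_section=5):
    """Divide a list of recognized utterances into consecutive,
    non-overlapping sections of a fixed number of utterances."""
    return [utterances[i:i + per_section]
            for i in range(0, len(utterances), per_section)]
```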

 Next, the topic analysis unit 107 determines, for each section produced by the dividing unit 105, one or more topics contained in that section (step S3). In step S3, the topic analysis unit 107 classifies the sections into groups based on the similarity between their texts and determines that sections belonging to the same group share a common topic. In this embodiment, topic determination thus takes the form of judging whether topics are common across sections.

 Alternatively, in step S3 the topic analysis unit 107 can determine the topics by assuming a probability model that represents the occurrence tendency of words in each section with the topic as a hidden variable, and then learning the parameters of that probability model using the text of each section as training data.

 Subsequently, the topic analysis unit 107 creates, for each determined topic, a topic model representing the occurrence tendency of words in that topic (step S4). The adapted language model creation unit 110 then adapts the base language model for each divided section using the topic models of the one or more topics that the topic analysis unit 107 determined to be contained in that section, thereby creating an adapted language model suited to that section (step S5).

 Finally, the language model adaptation device 100 outputs the adapted language model created in step S5 (step S6) and ends the process.

 As described above, in this embodiment the language model adaptation device 100 divides the text into a plurality of sections and adapts the base language model per topic, in other words per group of sections sharing a common topic, so as to reflect the word occurrence tendencies of that topic. Consequently, a sufficient adaptation effect of the language model is obtained even when the topic changes abruptly in the recognized speech or other source from which the text originates, and even when the same topic recurs after an interval. When the target is text obtained by speech recognition, the language model adaptation device 100 of this embodiment may also include a speech recognition unit, as shown in Examples 1 and 2 below.

 Next, the present invention is described using examples. A language model adaptation device, a speech recognition device including it, and a language model adaptation method according to Example 1 of the present invention are described in detail below with reference to FIGS. 3 to 8. First, the configuration of the language model adaptation device and speech recognition device of Example 1 is described using FIG. 3. FIG. 3 is a block diagram showing the schematic configuration of the language model adaptation device and speech recognition device according to Example 1 of the present invention.

 The language model adaptation device 100 of Example 1 shown in FIG. 3 forms part of the speech recognition device 10 and adapts the language model to the characteristics of the speech to be recognized. The speech recognition device 10 of Example 1 then uses the language model adapted by the language model adaptation device 100 to recognize the speech data with higher accuracy.

 As shown in FIG. 3, the language model adaptation device 100 of Example 1 includes, as described in the embodiment above, the base language model storage unit 102, the dividing unit 105, the topic analysis unit 107, and the adapted language model creation unit 110. In Example 1 it further includes a recognition result storage unit 104, a division result storage unit 106, a topic determination result storage unit 108, a topic model storage unit 109, and an adapted language model storage unit 111. In addition to the language model adaptation device 100, the speech recognition device 10 of Example 1 includes a speech data storage unit 101, a speech recognition unit 103, a re-recognition unit 112, and a re-recognition result storage unit 113. These units operate as follows.

 The speech data storage unit 101 stores the speech data to be recognized. Various kinds of speech data are possible, such as meeting audio, lecture audio, or broadcast audio. The speech data may be archive data prepared in advance or data input in real time from a microphone or the like.

 The base language model storage unit 102 stores the base language model used by the speech recognition unit 103. A known N-gram language model can be used as the base language model, constructed beforehand by training on a large amount of text data. In Example 1, as described later, the base language model is adapted to the characteristics of the speech to be recognized, which makes it possible to recognize the speech data with higher accuracy.

 The speech recognition unit 103 reads the speech data stored in the speech data storage unit 101 and performs speech recognition on it using the base language model stored in the base language model storage unit 102. The speech recognition unit 103 then outputs the recognition result as text (text data) to the recognition result storage unit 104, which stores the recognition result (see FIG. 5).

 The dividing unit 105 divides the recognition result text stored in the recognition result storage unit 104 into a plurality of sections and outputs the division result to the division result storage unit 106, which stores it (see FIG. 6).

 The topic analysis unit 107 refers to the recognition result text stored in the recognition result storage unit 104 and the division result stored in the division result storage unit 106, and determines the topic contained in each of the divided sections. It outputs the determination result to the topic determination result storage unit 108, which stores it. Furthermore, for each determined topic, the topic analysis unit 107 creates a topic model representing the occurrence tendency of words in that topic and outputs the created topic model to the topic model storage unit 109, which stores it.

 The adapted language model creation unit 110 refers to the topic determination results stored in the topic determination result storage unit 108 and the topic models stored in the topic model storage unit 109. For each divided section, it uses the topic model of the topic contained in that section to create, from the base language model, an adapted language model suited to that section. It outputs the created adapted language model to the adapted language model storage unit 111, which stores it.

 The re-recognition unit 112 uses the per-section adapted language models stored in the adapted language model storage unit 111 to perform speech recognition again (re-recognition) on the corresponding sections of the speech data stored in the speech data storage unit 101. It then outputs the re-recognition result as text data to the re-recognition result storage unit 113.

 Next, the operation of the language model adaptation device 100 and speech recognition device 10 of Example 1 is described using FIGS. 4 to 8. FIG. 4 is a flowchart showing the operation of the language model adaptation device and speech recognition device according to Example 1 of the present invention. FIG. 5 explains a specific example of the recognition result, FIG. 6 a specific example of the division result, FIG. 7 a specific example of the topic determination result, and FIG. 8 another specific example of the division result, all in Example 1 of the present invention.

 In Example 1, the language model adaptation method is carried out by operating the language model adaptation device 100, so the description of the method is given as the description of the device's operation, with reference to FIG. 3 as appropriate.

 First, as shown in FIG. 4, the speech recognition unit 103 performs speech recognition on the speech data using the base language model and converts the data into text (step S201). Specifically, the speech recognition unit 103 reads the speech data stored in the speech data storage unit 101 and detects a plurality of utterances using a known speech detection technique. It then applies a known large-vocabulary continuous speech recognition technique to each utterance, converting the speech data into text, which it outputs to the recognition result storage unit 104.

 In step S201, the base language model stored in the base language model storage unit 102 is used as the language model for speech recognition. As the acoustic model, for example, a known HMM-based acoustic model with phonemes as units can be used.

 As a result of step S201, as shown for example in FIG. 5, the utterance from 10 to 13 seconds of the speech data is recognized as "Good evening, here is the news," and the utterance from 16 to 20 seconds is recognized as "Today's first news item is baseball's Japan Series."

 Next, the dividing unit 105 divides the recognition result text stored in the recognition result storage unit 104 into a plurality of sections and outputs the division result to the division result storage unit 106 (step S202). In Example 1 the recognition result is divided per predetermined number of utterances, but it may instead be divided, for example, per predetermined number of words.

 In the example of FIG. 6, the recognition result text is divided every five utterances: section 1 consists of utterances 1 to 5, section 2 of utterances 6 to 10, and so on. Naturally, each section is associated with the recognition result text of the utterances that constitute it. The number of utterances included in one section can be determined by prior experiment.

 Next, the topic analysis unit 107 determines, for each divided section, the topic contained in that section and outputs the determination result to the topic determination result storage unit 108 (step S203). Subsequently, for each determined topic, the topic analysis unit 107 creates a topic model representing the occurrence tendency of words in that topic and outputs it to the topic model storage unit 109 (step S204).

 First, an example of the method of determining the topics contained in the divided sections in step S203 is described. In Example 1, the topic analysis unit 107 clusters the sections based on the similarity of their semantic content, and sections assigned to the same cluster are taken to share the same topic; the topic of each section is determined in this way.

 As shown in FIG. 7, as a result of clustering, sections 1, 2, 9, and 10 are assigned to the same cluster, meaning that these sections are determined to concern the same topic (topic 1). Similarly, sections 3 and 4 are determined to concern topic 2, and sections 5, 6, 7, and 8 to concern topic 3.

 By clustering the sections based on the similarity of their semantic content in this way, it can be determined which other sections share the same topic as each section. Note that at this point it need not be determined what each topic (topic 1, topic 2, topic 3) concretely is; this poses no problem for the present invention.

 Clustering the sections requires a similarity measure between two sections. Since the aim is to gather sections on the same topic into the same cluster, the similarity between two sections should reflect the similarity of their semantic content. For this purpose, the text of each section obtained by division can be treated as a document, and the cosine similarity between document vectors can be used. A document vector is obtained by assigning the weight of each word appearing in a document to one dimension; in Example 1, the weight of each word appearing in the text of a section is used. Possible word weights include the word's frequency in the document and weights computed by the known TF-IDF method. For example, when the document vector of section 1 is d_1 and that of section 2 is d_2, the cosine similarity cos θ between sections 1 and 2 is computed by Equation 5 below.

$$\cos\theta = \frac{d_1 \cdot d_2}{\|d_1\|\,\|d_2\|} \qquad \text{(Equation 5)}$$

 The cosine similarity cos θ is a real number between 0 and 1, and the closer its value is to 1, the more similar the two sections can be considered. It is not necessary to use all words appearing in a section as the dimensions of the document vector; only words of predetermined parts of speech, such as nouns and verbs, may be used. In this way the document-vector similarity becomes a measure that better reflects similarity of content.
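A sketch of the document-vector construction and Equation 5, using raw term frequency as the word weight (TF-IDF weighting would simply be substituted here); the optional content-word filter corresponds to restricting the dimensions to predetermined parts of speech.

```python
import math
from collections import Counter

def document_vector(section_words, content_words=None):
    """Term-frequency vector of one section's text. If `content_words` is
    given (e.g., the set of nouns and verbs), all other words are dropped."""
    if content_words is not None:
        section_words = [w for w in section_words if w in content_words]
    return Counter(section_words)

def cosine_similarity(d1, d2):
    """Equation 5: cosine of the angle between two sparse vectors."""
    dot = sum(c * d2[w] for w, c in d1.items() if w in d2)
    n1 = math.sqrt(sum(c * c for c in d1.values()))
    n2 = math.sqrt(sum(c * c for c in d2.values()))
    return dot / (n1 * n2) if n1 > 0 and n2 > 0 else 0.0
```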

 In Example 1, a known clustering algorithm such as the k-means method or a hierarchical clustering method may be used. With K clusters, the k-means procedure is as follows: (1) choose K sections at random and take their document vectors as the representative points of the K clusters; (2) assign the document vector of every section to the nearest representative point, i.e., the one with the highest similarity; (3) compute the centroid of the document vectors in each cluster and take it as the new representative point. Steps (2) and (3) are repeated, and clustering ends when the sections (document vectors) assigned to each cluster no longer change.
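A sketch of this k-means procedure over sparse section vectors, assuming the `cosine_similarity` helper from the previous sketch; the iteration cap is a safeguard added here, not part of the procedure described.

```python
import random
from collections import Counter

def k_means(vectors, K, max_iterations=50):
    """Cluster section document vectors into K clusters:
    (1) random representative points, (2) assign each vector to the most
    similar representative, (3) recompute centroids; stop when assignments
    no longer change. Returns one cluster index per section."""
    centroids = [dict(v) for v in random.sample(vectors, K)]
    assignment = None
    for _ in range(max_iterations):
        new_assignment = [max(range(K), key=lambda k: cosine_similarity(v, centroids[k]))
                          for v in vectors]
        if new_assignment == assignment:
            break  # assignments stable: clustering has converged
        assignment = new_assignment
        for k in range(K):
            members = [v for v, a in zip(vectors, assignment) if a == k]
            if members:
                total = Counter()
                for m in members:
                    total.update(m)
                centroids[k] = {w: c / len(members) for w, c in total.items()}
    return assignment
```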

 The simplest hierarchical clustering procedure is as follows. With N sections in total: (1) create N clusters each containing a single section; (2) merge the two clusters with the highest similarity. Repeating this merging until K clusters remain yields K clusters. As the similarity between two clusters, a known criterion such as the shortest-distance (single-linkage) method, the longest-distance (complete-linkage) method, or Ward's method may be used.
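A corresponding sketch of this merge-based procedure, again assuming the `cosine_similarity` helper; cluster similarity here is the maximum pairwise section similarity, i.e., the shortest-distance (single-linkage) criterion.

```python
def agglomerative_clustering(vectors, K):
    """Start with one cluster per section and repeatedly merge the most
    similar pair until K clusters remain. Returns lists of section indices."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > K:
        best_sim, best_pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = max(cosine_similarity(vectors[i], vectors[j])
                          for i in clusters[a] for j in clusters[b])
                if sim > best_sim:
                    best_sim, best_pair = sim, (a, b)
        a, b = best_pair
        clusters[a].extend(clusters[b])  # merge the most similar pair
        del clusters[b]
    return clusters
```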

 Whichever clustering method is used, the cosine similarity of document vectors can serve as the similarity between two sections. Of course, clustering methods other than those listed here may be used, and the inter-section similarity is not limited to cosine similarity: any measure of the similarity of semantic content between two sections will do.

 The number of clusters K may be set, for example, as K = α · N, where N is the total number of sections to be clustered and α is a proportionality constant; with α = 0.2, for instance, each cluster contains five sections on average. K may also be made proportional to the total number of utterances rather than the total number of sections, or, of course, determined by yet another method.

 In this way, in step S203 the sections produced by the dividing unit 105 are clustered and the result is output to the topic determination result storage unit 108. This processing makes it possible to determine the topic of each section of the recognition result even when the speech to be recognized contains multiple topics.

 Next, the method of creating a topic model for each topic determined by clustering is described. The topic model of a topic is a model representing the occurrence tendency of words in that topic. It can be created as follows: (1) pool the texts of the sections determined by clustering to share the same topic (cluster); (2) from the text pooled for each topic (cluster), train a probability model representing word generation probabilities. An N-gram language model may be used as this probability model.

 In the example of FIG. 7, the sections determined to be topic 1 are sections 1, 2, 9, and 10, so the texts of these four sections are first gathered into one group. The topic model of topic 1 is then obtained by training an N-gram language model on the gathered text, for example by maximum likelihood estimation; concretely, a unigram model (N = 1), a bigram model (N = 2), a trigram model (N = 3), or the like may be trained. Similarly, the topic model of topic 2 is trained on the texts of sections 3 and 4, and the topic model of topic 3 on the texts of sections 5, 6, 7, and 8.
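A minimal sketch of this step for the unigram case: pool the words of each cluster and estimate P_TOPIC(w) by maximum likelihood (relative frequency). Higher-order N-grams with smoothing would be trained the same way in a real system.

```python
from collections import Counter

def train_unigram_topic_models(section_texts, topic_of_section):
    """`section_texts` is a list of word lists, one per section;
    `topic_of_section` gives each section's cluster/topic index.
    Returns one unigram distribution per topic."""
    pooled = {}
    for words, topic in zip(section_texts, topic_of_section):
        pooled.setdefault(topic, []).extend(words)
    models = {}
    for topic, words in pooled.items():
        counts = Counter(words)
        total = sum(counts.values())
        models[topic] = {w: c / total for w, c in counts.items()}
    return models
```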

 In this way, in step S204 a topic model representing word occurrence tendencies is created for each topic obtained by clustering, and the topic models are output to the topic model storage unit 109. Even when the speech to be recognized contains multiple topics, a separate topic model representing word occurrence tendencies can thus be created for each topic. Moreover, even when the same topic appears several times with intervals in between, the word occurrence tendencies across all sections of that topic are pooled into a single topic model.

 Next, the adapted language model creation unit 110 refers to the determination results stored in the topic determination result storage unit 108 and the topic models stored in the topic model storage unit 109, and creates, for each section obtained by division, an adapted language model suited to that section (step S205). The created adapted language models are output to the adapted language model storage unit 111.

 Specifically, the adapted language model creation unit 110 adapts the base language model stored in the base language model storage unit 102 using the topic model of the topic determined for each section, thereby creating a different adapted language model for each section. In the example of FIG. 7, for sections 1, 2, 9, and 10, an adapted language model is created by adapting the base language model with the topic model of topic 1; similarly, for sections 3 and 4, an adapted language model is created by adapting the base language model with the topic model of topic 2.

 One adaptation method is linear interpolation between the base language model and the corresponding topic model. Specifically, when the base language model P_BASE and the topic model P_TOPIC_1 corresponding to topic 1 are both trigram models, the adapted language model P_ADAPT_1 for the sections of topic 1 is obtained by Equation 6 below.

$$P_{\mathrm{ADAPT\_1}}(w_i \mid w_{i-2}, w_{i-1}) = \lambda\, P_{\mathrm{BASE}}(w_i \mid w_{i-2}, w_{i-1}) + (1-\lambda)\, P_{\mathrm{TOPIC\_1}}(w_i \mid w_{i-2}, w_{i-1}) \qquad \text{(Equation 6)}$$

 In Equation 6, λ is a constant between 0 and 1 that can be determined by prior experiment. Although the base language model, topic model, and adapted language model are all trigrams in this example, other N-gram language models such as bigrams or unigrams may be used for each. When the topic model P_TOPIC_1 is a unigram model, the base language model may instead be adapted by the known unigram rescaling method, in which case the adapted language model P_ADAPT_1 is obtained by Equation 7 below.

$$P_{\mathrm{ADAPT\_1}}(w_i \mid w_{i-2}, w_{i-1}) \propto \frac{P_{\mathrm{TOPIC\_1}}(w_i)}{P_{\mathrm{BASE}}(w_i)}\, P_{\mathrm{BASE}}(w_i \mid w_{i-2}, w_{i-1}) \qquad \text{(Equation 7)}$$
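A sketch of Equation 7; the callables `p_base`, `p_base_unigram`, and `p_topic_unigram` are placeholders for the respective models, and the explicit renormalization over the vocabulary realizes the proportionality constant in the equation.

```python
def unigram_rescaling(word, context, vocabulary, p_base, p_base_unigram, p_topic_unigram):
    """Equation 7 (sketch): scale the base N-gram probability by the ratio of
    the topic unigram to the base unigram probability, then renormalize over
    the vocabulary so the result is a proper distribution for this context."""
    def score(w):
        return (p_topic_unigram(w) / p_base_unigram(w)) * p_base(w, context)
    z = sum(score(w) for w in vocabulary)
    return score(word) / z
```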

 The adapted language model finally created for each section by the processing of steps S201 to S205 reflects the word occurrence tendencies of the sections sharing that section's topic, while remaining unaffected by the word occurrence tendencies of sections with different topics. In the example of FIG. 7, the adapted language model created for section 9 learns the word occurrence tendencies in the texts of sections 1, 2, and 10, which share section 9's topic, while the texts of sections 3 to 8, whose topics differ from that of section 9, are not used for training.

 The adapted language model created for each section by the present invention is therefore a model that can accurately predict the word occurrence tendencies of that section even when the topic changes abruptly in the speech to be recognized, or when the same topic recurs after an interval. Since word occurrence tendencies differ between topics, it is generally desirable in language model adaptation to adapt using only the word occurrence tendencies of the same topic, and the present invention makes such adaptation possible.

 A representative conventional technique that adapts the language model using word occurrence tendencies in the speech to be recognized is the cache model (see Non-Patent Document 1). With a cache model, when creating an adapted language model for, say, section 9 (topic 1), the adaptation is based on the word occurrence tendencies of the immediately preceding section 8 (topic 3). Since section 8's topic differs from section 9's, their word occurrence tendencies can be expected to differ, so an appropriate adapted language model cannot be obtained. In contrast, as Example 1 shows, the present invention can adapt the base language model without such topic mismatches, and a sufficient adaptation effect is therefore obtained even when various topics appear in the speech to be recognized.

 Finally, after step S205, the re-recognition unit 112 uses the per-section adapted language models stored in the adapted language model storage unit 111 to perform speech recognition (re-recognition) again on the corresponding sections of the speech data stored in the speech data storage unit 101 (step S206). The re-recognition result is output as text data to the re-recognition result storage unit 113.

 In the example of FIG. 7, for instance, utterances 1 to 5 are re-recognized with the adapted language model created for section 1, utterances 6 to 10 with the adapted language model created for section 2, and so on for all sections.

 As described above, the language model adapted to each section by the present invention can accurately predict the word occurrence tendencies of that section even when topics change or recur in the speech to be recognized. The re-recognition unit 112 can therefore recognize the speech data with high accuracy.

 In Example 1, the dividing unit 105 divided the recognition result text into sections by cutting it every predetermined number of utterances, but the sections may also overlap. FIG. 8 shows such an example: ten utterances are grouped into one section, and the window is shifted by two utterances at a time, producing the division result and per-section topic determination result shown. Even when the sections overlap in this way, the topic analysis unit 107 operates exactly as before.

 By giving the sections such overlap, Example 1 can track topic changes in the speech to be recognized at a fine time granularity even when the number of utterances per section is large. When the re-recognition unit 112 re-recognizes with the adapted language model corresponding to each section, it suffices, for example, to re-recognize only the utterances at the center of the section.

 Concretely, utterances 13 and 14 (the two central utterances of utterances 9 to 18) are recognized with the adapted language model corresponding to section 5, and utterances 15 and 16 (the two central utterances of utterances 11 to 20) with the adapted language model corresponding to section 6. In this way, the utterances to be re-recognized do not overlap.
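A sketch of the overlapped division of FIG. 8 and of selecting the central utterances for re-recognition, using 1-based utterance numbers; the window width and step match the figure's 10-utterance window shifted by 2.

```python
def overlapping_sections(num_utterances, window=10, step=2):
    """Return 1-based (start, end) utterance numbers for each overlapped
    section: a `window`-utterance block shifted `step` utterances at a time."""
    return [(s, min(s + window - 1, num_utterances))
            for s in range(1, max(num_utterances - window + 2, 2), step)]

def utterances_to_rerecognize(start, end, step=2):
    """Only the `step` utterances at the center of a section are re-recognized,
    so no utterance is decoded twice across sections."""
    mid_start = (start + end + 1) // 2 - step // 2
    return list(range(mid_start, mid_start + step))

print(utterances_to_rerecognize(9, 18))   # [13, 14], as for section 5 above
print(utterances_to_rerecognize(11, 20))  # [15, 16], as for section 6 above
```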

 In Example 1, the dividing unit 105 can also detect boundaries of semantic content in the recognition result and divide the recognition result text based on the detection result. For example, following the method described in Reference 1, the dividing unit 105 may detect change points of the word distribution within the recognition result text and divide the text at the detected change points.

(Reference 1) Marti A. Hearst, "Multi-Paragraph Segmentation of Expository Text," 32nd Annual Meeting of the Association for Computational Linguistics, pp. 9-16, 1994.

 When topic-boundary expressions, i.e., expressions that mark a change of topic, are defined in advance, the dividing unit 105 can also divide the recognition result text at the positions where such expressions appear. Examples are topic-shifting phrases such as "sate" ("well then"), "soredewa" ("now then"), and "tsugi ni" ("next").

 Furthermore, when topic models for various topics have been prepared in advance, the dividing unit 105 can detect topic change points by finding the topic model sequence that best matches the recognition result text, using the method described in Reference 2, and divide the recognition result text at those change points. Note that the topic models here are not the topic models created by the topic analysis unit 107; they are topic models prepared in advance, independently of the recognition result text, that represent the occurrence tendencies of words related to predetermined topics. Such topic models can be obtained, for example, by training beforehand on a large amount of text data annotated with topic information.

(Reference 2)
J. P. Yamron, I. Carp, L. Gillick, S. Lowe, and P. van Mulbregt, "A Hidden Markov Model Approach to Text Segmentation and Event Tracking," IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 333-336, 1998.
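
As an illustrative sketch that only loosely follows Reference 2, and assuming each pretrained topic model is a unigram distribution over words, a Viterbi pass with a fixed topic-switch penalty can decode the best-matching topic sequence; boundaries are then hypothesized wherever the decoded topic changes:

```python
import math

def viterbi_topic_segmentation(utterances, topic_models, switch_penalty=5.0):
    """Decode the most likely topic per utterance with an HMM-style pass.

    utterances: list of token lists.
    topic_models: list of dicts mapping word -> P(w|z) for each topic z.
    Returns the decoded topic index per utterance; a boundary is
    hypothesized wherever consecutive topics differ.
    """
    K = len(topic_models)
    floor = 1e-7  # probability floor for unseen words

    def loglik(tokens, z):
        return sum(math.log(topic_models[z].get(w, floor)) for w in tokens)

    score = [loglik(utterances[0], z) for z in range(K)]
    back = []
    for tokens in utterances[1:]:
        prev, score, ptr = score, [], []
        for z in range(K):
            # stay on the same topic for free, or switch and pay a penalty
            best_q = max(range(K),
                         key=lambda q: prev[q] - (switch_penalty if q != z else 0.0))
            ptr.append(best_q)
            score.append(prev[best_q]
                         - (switch_penalty if best_q != z else 0.0)
                         + loglik(tokens, z))
        back.append(ptr)
    path = [max(range(K), key=lambda z: score[z])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```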

 When the recognition-result text is divided at every fixed number of utterances or fixed number of words, each section does not necessarily represent a coherent topic, and near a topic boundary a single section may contain a mixture of topics. By contrast, when the dividing unit 105 divides the recognition-result text into sections based on the likelihood of semantic boundaries, each section comes to represent a coherent topic. This improves the clustering accuracy and the topic-model estimation accuracy in the topic analysis unit 107, and further improves the accuracy of the finally obtained adapted language models.

 As described above, according to Embodiment 1, a sufficient adaptation effect of the language model is obtained even when the topic changes abruptly in the recognition target speech, and even when the same topic recurs at intervals in the recognition target speech. Moreover, re-recognizing the recognition target speech using the adapted language models generated by the language model adaptation device 100 of Embodiment 1 yields recognition results of still higher accuracy.

 Next, a language model adaptation device according to Embodiment 2 of the present invention, a speech recognition device including it, and a language model adaptation method are described in detail with reference to the drawings. Embodiment 2 differs from Embodiment 1 in the operation of the topic analysis unit 107 and the adapted language model creation unit 110; in all other respects it is the same as Embodiment 1, so the description of the common parts is omitted. The differences are described with reference to FIG. 9, which illustrates a specific example of the topic determination result in Embodiment 2 of the present invention.

 In Embodiment 2, as the probability model P(w|d) giving the appearance probability of a word w in each section d produced by the dividing unit 105, the topic analysis unit 107 assumes the model defined by Equation 8 below, in which the topic z is a hidden variable. The topic analysis unit 107 then learns the parameters of this probability model using the text of each section as training data.

$$P(w \mid d) = \sum_{z} P(w \mid z)\, P(z \mid d) \qquad \text{(Equation 8)}$$

 Here, the topic z is a hidden variable that cannot be observed from the outside, and the number of topics is set to an appropriate value in advance. As when the number of clusters was determined in Embodiment 1, the number of topics z may be set in proportion to the total number of sections or the total number of utterances, or it may be set by some other method.

 The parameter P(w|z) in Equation 8 represents the appearance probability of a word in topic z, and the parameter P(z|d) represents the proportion of each topic z contained in section d. That is, the probability model of Equation 8 can be regarded as expressing the appearance probability P(w|d) of a word w in section d as a weighted sum of word appearance probabilities over a plurality of topics.

 Accordingly, in Embodiment 2, learning the parameters of the probability model of Equation 8 with the text of each section as training data both determines the topics contained in each section (see S203 in FIG. 4) and creates the respective topic models (see S204 in FIG. 4). That is, in Embodiment 2, P(z|d) is output to the topic determination result storage unit 108 as the determination result of the topics contained in section d, and P(w|z) is output to the topic model storage unit 109 as the topic model representing the appearance tendency of words in topic z.

 The parameters of the probability model of Equation 8 can be estimated by maximum likelihood estimation on the training data using the EM algorithm, as described in Reference 3 below. That is, it suffices to iterate the computation of P(z|w,d) by Equation 9 below, the estimation of the parameter P(w|z) by Equation 10 below, and the estimation of P(z|d) by Equation 11 below.

$$P^{(r)}(z \mid w, d) = \frac{P^{(r)}(w \mid z)\, P^{(r)}(z \mid d)}{\sum_{z'} P^{(r)}(w \mid z')\, P^{(r)}(z' \mid d)} \qquad \text{(Equation 9)}$$

$$P^{(r+1)}(w \mid z) = \frac{\sum_{d} n(w, d)\, P^{(r)}(z \mid w, d)}{\sum_{w'} \sum_{d} n(w', d)\, P^{(r)}(z \mid w', d)} \qquad \text{(Equation 10)}$$

$$P^{(r+1)}(z \mid d) = \frac{\sum_{w} n(w, d)\, P^{(r)}(z \mid w, d)}{\sum_{w} n(w, d)} \qquad \text{(Equation 11)}$$

(Reference 3)
D. Gildea and T. Hofmann, "Topic-Based Language Models Using EM," Proceedings of the 6th European Conference on Speech Communication and Technology, 1999.

 Here, in Equations 10 and 11, n(w,d) denotes the number of occurrences of the word w in section d, and in Equations 9 to 11, r denotes the iteration number.
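
A minimal sketch of this EM procedure (a standard PLSA fit; the random initialization, fixed iteration count, and dictionary-based data structures are assumptions of the sketch):

```python
import random
from collections import defaultdict

def train_plsa(sections, num_topics, iterations=50, seed=0):
    """Fit P(w|z) and P(z|d) by EM on the word counts n(w, d).

    sections: list of token lists, one per section d.
    Returns (p_w_z, p_z_d): p_w_z[z][w] = P(w|z), p_z_d[d][z] = P(z|d).
    """
    rng = random.Random(seed)
    vocab = sorted({w for sec in sections for w in sec})
    n = [defaultdict(int) for _ in sections]  # n[d][w]
    for d, sec in enumerate(sections):
        for w in sec:
            n[d][w] += 1

    # random initialization, normalized per topic
    p_w_z = [{w: rng.random() for w in vocab} for _ in range(num_topics)]
    for dist in p_w_z:
        s = sum(dist.values())
        for w in dist:
            dist[w] /= s
    p_z_d = [[1.0 / num_topics] * num_topics for _ in sections]

    for _ in range(iterations):
        new_wz = [defaultdict(float) for _ in range(num_topics)]
        new_zd = [[0.0] * num_topics for _ in sections]
        for d in range(len(sections)):
            for w, cnt in n[d].items():
                # E-step (Equation 9): posterior P(z|w,d)
                post = [p_w_z[z][w] * p_z_d[d][z] for z in range(num_topics)]
                total = sum(post) or 1.0
                for z in range(num_topics):
                    resp = cnt * post[z] / total
                    new_wz[z][w] += resp   # numerator of Equation 10
                    new_zd[d][z] += resp   # numerator of Equation 11
        # M-step (Equations 10 and 11): renormalize
        for z in range(num_topics):
            s = sum(new_wz[z].values()) or 1.0
            p_w_z[z] = {w: new_wz[z].get(w, 0.0) / s for w in vocab}
        for d in range(len(sections)):
            s = sum(new_zd[d]) or 1.0
            p_z_d[d] = [v / s for v in new_zd[d]]
    return p_w_z, p_z_d
```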

 As shown in FIG. 9, the topic determination result in Embodiment 2 is expressed as a combination of a plurality of topics. For example, it indicates that section 1 is a combination of topic 1 with weight 0.8, topic 2 with weight 0.1, and topic 5 with weight 0.1; these values are the P(z|d) computed for section 1.

 In Embodiment 2 as well, the adapted language model creation unit 110 refers to the topic determination results stored in the topic determination result storage unit 108 and the topic models stored in the topic model storage unit 109. Based on these, the adapted language model creation unit 110 creates, for each section obtained by the division, an adapted language model suited to that section, and outputs it to the adapted language model storage unit 111 (see S205 in FIG. 4).

 In Embodiment 2, moreover, the topic analysis unit 107 has already obtained the topic proportions P(z|d) for each section d and the word appearance probabilities P(w|z) for each topic z. The word appearance probability P(w|d) for each section d can therefore be obtained from Equation 8, and an adapted language model for each section can be created by adapting the base language model stored in the base language model storage unit 102 using P(w|d). Specifically, the linear interpolation shown in Equation 12 below or the unigram rescaling shown in Equation 13 below may be used.

$$P_a(w \mid h) = \lambda\, P_B(w \mid h) + (1 - \lambda)\, P(w \mid d) \qquad \text{(Equation 12)}$$

where $P_B$ denotes the base language model, $h$ the word history, and $\lambda$ ($0 \le \lambda \le 1$) an interpolation weight.

$$P_a(w \mid h) \propto \frac{P(w \mid d)}{P_B(w)}\, P_B(w \mid h) \qquad \text{(Equation 13)}$$

where $P_B(w)$ is the unigram probability of $w$ under the base language model.
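
A minimal sketch of Equation 8 together with these two adaptation schemes, at the unigram level and with models represented as plain word-to-probability dictionaries (the treatment of longer N-gram histories and the value of λ are left open):

```python
def section_unigram(p_w_z, p_z_d):
    """Equation 8: P(w|d) as the topic-weighted mixture of P(w|z).

    p_w_z: list of dicts, p_w_z[z][w] = P(w|z).
    p_z_d: topic weights for one section d, p_z_d[z] = P(z|d).
    """
    vocab = p_w_z[0].keys()
    return {w: sum(p_z_d[z] * p_w_z[z][w] for z in range(len(p_w_z)))
            for w in vocab}

def linear_interpolation(p_base, p_section, lam=0.7):
    """Equation 12: interpolate the base model with the section model P(w|d)."""
    vocab = set(p_base) | set(p_section)
    return {w: lam * p_base.get(w, 0.0) + (1 - lam) * p_section.get(w, 0.0)
            for w in vocab}

def unigram_rescaling(p_base_ngram, p_base_unigram, p_section):
    """Equation 13: scale each base probability for a given history h by
    P(w|d) / P_B(w), then renormalize over the predicted words."""
    scaled = {w: p * p_section.get(w, 0.0) / p_base_unigram[w]
              for w, p in p_base_ngram.items()
              if p_base_unigram.get(w, 0.0) > 0.0}
    total = sum(scaled.values()) or 1.0
    return {w: p / total for w, p in scaled.items()}
```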

 In Embodiment 2, the topics in the recognition target speech are expressed as combinations of a plurality of basic topics, the appearance tendencies of words in those basic topics are extracted, and the language model can then be adapted. For example, even in a case such as "section 1 contains topic 1, section 2 contains topic 1 and topic 2, and section 3 contains topic 2," Embodiment 2 can determine the proportion of the topics contained in each section and then create separate topic models for topic 1 and topic 2.

 Thus, even when clustering the sections is difficult because each section contains a plurality of topics, Embodiment 2 makes it possible to create a topic model for each individual topic. When the language model is then adapted for a given section, the plurality of topic models are weighted according to the proportions of the topics contained in that section, so the adaptation effect of the language model improves.

 The language model adaptation devices and speech recognition devices of Embodiments 1 and 2 described above can be realized, for example, by using a program (a language model adaptation program) that causes a computer to execute steps S201 to S206 shown in FIG. 4. That is, installing the language model adaptation program on a computer and executing it realizes the language model adaptation device and the speech recognition device.

 The language model adaptation device and speech recognition device of Embodiment 1 or 2 as realized by a program are now described with reference to FIG. 10. FIG. 10 is a block diagram showing an example of a computer capable of realizing the language model adaptation devices and speech recognition devices of Embodiments 1 and 2 of the present invention.

 As shown in FIG. 10, the computer 300 includes a data processing device 320 configured with a CPU and the like, and a storage device 330 configured with a magnetic disk, a semiconductor memory, or the like.

 In the storage area of the storage device 330, a speech data storage unit 331, a base language model storage unit 332, a recognition result storage unit 333, a division result storage unit 334, a topic determination result storage unit 335, a topic model storage unit 336, an adapted language model storage unit 337, and a re-recognition result storage unit 338 are constructed, and the storage device 330 functions as these storage units 331 to 338.

 The language model adaptation program 310 is loaded into the data processing device 320 by being read from a computer-readable recording medium or a storage device, or by being transmitted over a network, and controls the operation of the data processing device 320. The speech recognition unit 103, the dividing unit 105, the topic analysis unit 107, the adapted language model creation unit 110, and the re-recognition unit 112 shown in FIG. 3 are thereby constructed on the data processing device 320. Under the control of the language model adaptation program 310, the data processing device 320 functions as the speech recognition unit 103, the dividing unit 105, the topic analysis unit 107, the adapted language model creation unit 110, and the re-recognition unit 112 (see FIG. 3), and executes their processing. Examples of the computer-readable recording medium include an optical disk, a magnetic disk, a magneto-optical disk, a semiconductor memory, and a floppy disk.

 Although the present invention has been described above with reference to the embodiments and examples, the present invention is not limited to the above embodiments and examples. Various changes that those skilled in the art can understand may be made to the configuration and details of the present invention within the scope of the present invention.

 This application claims priority based on Japanese Patent Application No. 2009-50151 filed on March 4, 2009, the entire disclosure of which is incorporated herein.

 The language model adaptation device, speech recognition device, language model adaptation method, and computer-readable recording medium of the present invention have the following features.

(1) A language model adaptation device that adapts a base language model, comprising:
a dividing unit that divides an input text into a plurality of sections;
a topic analysis unit that determines the topics contained in each of the plurality of sections and, for each determined topic, creates a topic model representing the appearance tendency of words in that topic; and
an adapted language model creation unit that, for each of the plurality of sections, adapts the base language model using the topic models corresponding to the topics contained in that section, thereby creating an adapted language model.

(2) The language model adaptation device according to (1) above, further comprising a speech recognition unit that performs speech recognition on speech data using the base language model and inputs the text obtained by the speech recognition to the dividing unit.

(3) The language model adaptation device according to (1) above, wherein the topic analysis unit classifies the plurality of sections into groups based on the similarity between the texts of the sections, determines that sections belonging to the same group share a common topic, and creates, for each group, the topic model based on the appearance tendency of words in the texts of the sections belonging to that group.

(4) The language model adaptation device according to (1) above, wherein the topic analysis unit assumes a probability model representing the appearance tendency of words in each of the plurality of sections, with the topics as hidden variables, learns the parameters of the probability model using the text of each of the plurality of sections as training data, and performs the determination of the topics and the creation of the topic models through this learning.

(5) The language model adaptation device according to (1) above, wherein the dividing unit divides the text based on the number of words or the number of utterances.

(6) The language model adaptation device according to (1) above, wherein the dividing unit detects change points of the word distribution in the text and divides the text based on the detected change points.

(7) The language model adaptation device according to (1) above, wherein the dividing unit divides the text based on positions where a preset topic boundary expression appears in the text.

(8) The language model adaptation device according to (1) above, wherein the dividing unit divides the text using models representing the appearance tendency of words related to preset topics.

(9) A speech recognition device that performs speech recognition while adapting a language model, comprising:
a speech recognition unit that performs speech recognition on speech data using the language model;
a dividing unit that divides the text obtained by the speech recognition into a plurality of sections;
a topic analysis unit that determines the topics contained in each of the plurality of sections and, for each determined topic, creates a topic model representing the appearance tendency of words in that topic;
an adapted language model creation unit that, for each of the plurality of sections, adapts the language model using the topic models corresponding to the topics contained in that section, thereby creating an adapted language model; and
a re-recognition unit that performs speech recognition on the speech data of the section corresponding to each adapted language model, using the adapted language model created by the adapted language model creation unit.

(10) A language model adaptation method for adapting a base language model, comprising:
(a) dividing a text into a plurality of sections;
(b) determining the topics contained in each of the plurality of sections and, for each determined topic, creating a topic model representing the appearance tendency of words in that topic; and
(c) for each of the plurality of sections, adapting the base language model using the topic models corresponding to the topics contained in that section, thereby creating an adapted language model.

(11) The language model adaptation method according to (10) above, further comprising (d) performing speech recognition on speech data using the base language model, thereby generating the text to be divided in step (a).

(12) The language model adaptation method according to (10) above, wherein in step (b) the plurality of sections are classified into groups based on the similarity between the texts of the sections, sections belonging to the same group are determined to share a common topic, and the topic model is created for each group based on the appearance tendency of words in the texts of the sections belonging to that group.

(13) The language model adaptation method according to (10) above, wherein in step (b) a probability model representing the appearance tendency of words in each of the plurality of sections is assumed, with the topics as hidden variables, the parameters of the probability model are learned using the text of each of the plurality of sections as training data, and the determination of the topics and the creation of the topic models are performed through this learning.

(14) The language model adaptation method according to (10) above, wherein in step (a) the text is divided based on the number of words or the number of utterances.

(15) The language model adaptation method according to (10) above, wherein in step (a) change points of the word distribution in the text are detected and the text is divided based on the detected change points.

(16) The language model adaptation method according to (10) above, wherein in step (a) the text is divided based on positions where a preset topic boundary expression appears in the text.

(17) The language model adaptation method according to (10) above, wherein in step (a) the text is divided using models representing the appearance tendency of words related to preset topics.

(18) A computer-readable recording medium recording a program for causing a computer to execute adaptation of a base language model, the program including instructions for causing the computer to execute:
(a) dividing a text into a plurality of sections;
(b) determining the topics contained in each of the plurality of sections and, for each determined topic, creating a topic model representing the appearance tendency of words in that topic; and
(c) for each of the plurality of sections, adapting the base language model using the topic models corresponding to the topics contained in that section, thereby creating an adapted language model.

(19) The computer-readable recording medium according to (18) above, wherein the program further causes the computer to execute (d) performing speech recognition on speech data using the base language model, thereby generating the text to be divided in step (a).

(20) The computer-readable recording medium according to (18) above, wherein in step (b) the plurality of sections are classified into groups based on the similarity between the texts of the sections, sections belonging to the same group are determined to share a common topic, and the topic model is created for each group based on the appearance tendency of words in the texts of the sections belonging to that group.

(21) The computer-readable recording medium according to (18) above, wherein in step (b) a probability model representing the appearance tendency of words in each of the plurality of sections is assumed, with the topics as hidden variables, the parameters of the probability model are learned using the text of each of the plurality of sections as training data, and the determination of the topics and the creation of the topic models are performed through this learning.

(22) The computer-readable recording medium according to (18) above, wherein in step (a) the text is divided based on the number of words or the number of utterances.

(23) The computer-readable recording medium according to (18) above, wherein in step (a) change points of the word distribution in the text are detected and the text is divided based on the detected change points.

(24) The computer-readable recording medium according to (18) above, wherein in step (a) the text is divided based on positions where a preset topic boundary expression appears in the text.

(25) The computer-readable recording medium according to (18) above, wherein in step (a) the text is divided using models representing the appearance tendency of words related to preset topics.

 The present invention is applicable to uses such as an automatic speech recognition system that recognizes speech data containing a variety of topics, such as meeting speech, lecture speech, and broadcast speech, and outputs text information, and a program for realizing such an automatic speech recognition system on a computer. It is also applicable to uses such as an information retrieval system for searching such speech data using the text information of the recognition results.

 10 Speech recognition device
 100 Language model adaptation device
 101 Speech data storage unit
 102 Base language model storage unit
 103 Speech recognition unit
 104 Recognition result storage unit
 105 Dividing unit
 106 Division result storage unit
 107 Topic analysis unit
 108 Topic determination result storage unit
 109 Topic model storage unit
 110 Adapted language model creation unit
 111 Adapted language model storage unit
 112 Re-recognition unit
 113 Re-recognition result storage unit
 300 Computer
 310 Language model adaptation program
 320 Data processing device
 330 Storage device
 331 Speech data storage unit
 332 Base language model storage unit
 333 Recognition result storage unit
 334 Division result storage unit
 335 Topic determination result storage unit
 336 Topic model storage unit
 337 Adapted language model storage unit
 338 Re-recognition result storage unit

Claims (25)

1. A language model adaptation device that adapts a base language model, comprising:
a dividing unit that divides an input text into a plurality of sections;
a topic analysis unit that determines the topics contained in each of the plurality of sections and, for each determined topic, creates a topic model representing the appearance tendency of words in that topic; and
an adapted language model creation unit that, for each of the plurality of sections, adapts the base language model using the topic models corresponding to the topics contained in that section, thereby creating an adapted language model.

2. The language model adaptation device according to claim 1, further comprising a speech recognition unit that performs speech recognition on speech data using the base language model and inputs the text obtained by the speech recognition to the dividing unit.

3. The language model adaptation device according to claim 1 or 2, wherein the topic analysis unit classifies the plurality of sections into groups based on the similarity between the texts of the sections, determines that sections belonging to the same group share a common topic, and creates, for each group, the topic model based on the appearance tendency of words in the texts of the sections belonging to that group.

4. The language model adaptation device according to claim 1 or 2, wherein the topic analysis unit assumes a probability model representing the appearance tendency of words in each of the plurality of sections, with the topics as hidden variables, learns the parameters of the probability model using the text of each of the plurality of sections as training data, and performs the determination of the topics and the creation of the topic models through this learning.

5. The language model adaptation device according to any one of claims 1 to 4, wherein the dividing unit divides the text based on the number of words or the number of utterances.

6. The language model adaptation device according to any one of claims 1 to 4, wherein the dividing unit detects change points of the word distribution in the text and divides the text based on the detected change points.

7. The language model adaptation device according to any one of claims 1 to 4, wherein the dividing unit divides the text based on positions where a preset topic boundary expression appears in the text.

8. The language model adaptation device according to any one of claims 1 to 4, wherein the dividing unit divides the text using models representing the appearance tendency of words related to preset topics.

9. A speech recognition device that performs speech recognition while adapting a language model, comprising:
a speech recognition unit that performs speech recognition on speech data using the language model;
a dividing unit that divides the text obtained by the speech recognition into a plurality of sections;
a topic analysis unit that determines the topics contained in each of the plurality of sections and, for each determined topic, creates a topic model representing the appearance tendency of words in that topic;
an adapted language model creation unit that, for each of the plurality of sections, adapts the language model using the topic models corresponding to the topics contained in that section, thereby creating an adapted language model; and
a re-recognition unit that performs speech recognition on the speech data of the section corresponding to each adapted language model, using the adapted language model created by the adapted language model creation unit.

10. A language model adaptation method for adapting a base language model, comprising:
(a) dividing a text into a plurality of sections;
(b) determining the topics contained in each of the plurality of sections and, for each determined topic, creating a topic model representing the appearance tendency of words in that topic; and
(c) for each of the plurality of sections, adapting the base language model using the topic models corresponding to the topics contained in that section, thereby creating an adapted language model.

11. The language model adaptation method according to claim 10, further comprising (d) performing speech recognition on speech data using the base language model, thereby generating the text to be divided in step (a).

12. The language model adaptation method according to claim 10 or 11, wherein in step (b) the plurality of sections are classified into groups based on the similarity between the texts of the sections, sections belonging to the same group are determined to share a common topic, and the topic model is created for each group based on the appearance tendency of words in the texts of the sections belonging to that group.

13. The language model adaptation method according to claim 10 or 11, wherein in step (b) a probability model representing the appearance tendency of words in each of the plurality of sections is assumed, with the topics as hidden variables, the parameters of the probability model are learned using the text of each of the plurality of sections as training data, and the determination of the topics and the creation of the topic models are performed through this learning.

14. The language model adaptation method according to any one of claims 10 to 13, wherein in step (a) the text is divided based on the number of words or the number of utterances.

15. The language model adaptation method according to any one of claims 10 to 13, wherein in step (a) change points of the word distribution in the text are detected and the text is divided based on the detected change points.

16. The language model adaptation method according to any one of claims 10 to 13, wherein in step (a) the text is divided based on positions where a preset topic boundary expression appears in the text.

17. The language model adaptation method according to any one of claims 10 to 13, wherein in step (a) the text is divided using models representing the appearance tendency of words related to preset topics.

18. A computer-readable recording medium recording a program for causing a computer to execute adaptation of a base language model, the program including instructions for causing the computer to execute:
(a) dividing a text into a plurality of sections;
(b) determining the topics contained in each of the plurality of sections and, for each determined topic, creating a topic model representing the appearance tendency of words in that topic; and
(c) for each of the plurality of sections, adapting the base language model using the topic models corresponding to the topics contained in that section, thereby creating an adapted language model.

19. The computer-readable recording medium according to claim 18, wherein the program further causes the computer to execute (d) performing speech recognition on speech data using the base language model, thereby generating the text to be divided in step (a).

20. The computer-readable recording medium according to claim 18 or 19, wherein in step (b) the plurality of sections are classified into groups based on the similarity between the texts of the sections, sections belonging to the same group are determined to share a common topic, and the topic model is created for each group based on the appearance tendency of words in the texts of the sections belonging to that group.

21. The computer-readable recording medium according to claim 18 or 19, wherein in step (b) a probability model representing the appearance tendency of words in each of the plurality of sections is assumed, with the topics as hidden variables, the parameters of the probability model are learned using the text of each of the plurality of sections as training data, and the determination of the topics and the creation of the topic models are performed through this learning.

22. The computer-readable recording medium according to any one of claims 18 to 21, wherein in step (a) the text is divided based on the number of words or the number of utterances.

23. The computer-readable recording medium according to any one of claims 18 to 21, wherein in step (a) change points of the word distribution in the text are detected and the text is divided based on the detected change points.

24. The computer-readable recording medium according to any one of claims 18 to 21, wherein in step (a) the text is divided based on positions where a preset topic boundary expression appears in the text.

25. The computer-readable recording medium according to any one of claims 18 to 21, wherein in step (a) the text is divided using models representing the appearance tendency of words related to preset topics.
PCT/JP2010/001134 2009-03-04 2010-02-22 Language model adaptation device, speech recognition device, language model adaptation method, and computer-readable recording medium Ceased WO2010100853A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2009050151 2009-03-04
JP2009-050151 2009-03-04

Publications (1)

Publication Number Publication Date
WO2010100853A1 true WO2010100853A1 (en) 2010-09-10

Family

ID=42709423

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2010/001134 Ceased WO2010100853A1 (en) 2009-03-04 2010-02-22 Language model adaptation device, speech recognition device, language model adaptation method, and computer-readable recording medium

Country Status (1)

Country Link
WO (1) WO2010100853A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000075886A (en) * 1998-08-28 2000-03-14 Atr Onsei Honyaku Tsushin Kenkyusho:Kk Statistical language model generator and voice recognition device
JP2002091484A (en) * 2000-09-14 2002-03-27 Mitsubishi Electric Corp Language model generation device, speech recognition device using the same, language model generation method, speech recognition method using the same, computer-readable recording medium recording language model generation program, and computer-readable recording speech recognition program Recording media
WO2005122143A1 (en) * 2004-06-08 2005-12-22 Matsushita Electric Industrial Co., Ltd. Speech recognition device and speech recognition method
JP2007280364A (en) * 2006-03-10 2007-10-25 Nec (China) Co Ltd Method and device for switching/adapting language model
WO2008004666A1 (en) * 2006-07-07 2008-01-10 Nec Corporation Voice recognition device, voice recognition method and voice recognition program

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JUN SAKO ET AL.: "Wadai no Renzoku/Furenzoku Henka o Koryo shita Topic Model ni Motozuku Onsei Ninshiki", INFORMATION PROCESSING SOCIETY OF JAPAN KENKYU HOKOKU, vol. 2008, no. 123, 2 December 2008 (2008-12-02), pages 55 - 60 *
YUYA AKITA ET AL.: "Kokkai Onsei Ninshiki no Tameno Hatsuon Model Seisei to Gengo Model Tekio", REPORT OF THE 2005 SPRING MEETING, THE ACOUSTICAL SOCIETY OF JAPAN, - 8 March 2005 (2005-03-08), pages 5 - 6 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10635709B2 (en) 2008-12-24 2020-04-28 Comcast Interactive Media, Llc Searching for segments based on an ontology
US12153617B2 (en) 2008-12-24 2024-11-26 Comcast Interactive Media, Llc Searching for segments based on an ontology
US11468109B2 (en) 2008-12-24 2022-10-11 Comcast Interactive Media, Llc Searching for segments based on an ontology
US11531668B2 (en) 2008-12-29 2022-12-20 Comcast Interactive Media, Llc Merging of multiple data sets
US11562737B2 (en) 2009-07-01 2023-01-24 Tivo Corporation Generating topic-specific language models
US10559301B2 (en) * 2009-07-01 2020-02-11 Comcast Interactive Media, Llc Generating topic-specific language models
US11978439B2 (en) 2009-07-01 2024-05-07 Tivo Corporation Generating topic-specific language models
WO2012151743A1 (en) * 2011-05-10 2012-11-15 Nokia Corporation Methods, apparatuses and computer program products for providing topic model with wording preferences
JP2015141368A (en) * 2014-01-30 2015-08-03 日本電信電話株式会社 Language model creation device, voice recognition device, method and program for the same
CN106297800A (en) * 2016-08-10 2017-01-04 中国科学院计算技术研究所 A kind of method and apparatus of adaptive speech recognition
WO2020162229A1 (en) * 2019-02-06 2020-08-13 日本電信電話株式会社 Speech recognition device, retrieval device, speech recognition method, retrieval method, and program
WO2021256043A1 (en) * 2020-06-16 2021-12-23 日本電信電話株式会社 Estimation device, estimation method, learning device, learning method and program
JP7425368B2 (en) 2020-06-16 2024-01-31 日本電信電話株式会社 Estimation device, estimation method, learning device, learning method and program
JPWO2021256043A1 (en) * 2020-06-16 2021-12-23
CN113407792B (en) * 2021-07-06 2024-03-26 亿览在线网络技术(北京)有限公司 Topic-based text input method
CN113407792A (en) * 2021-07-06 2021-09-17 亿览在线网络技术(北京)有限公司 Topic-based text input method

Similar Documents

Publication Publication Date Title
US11545142B2 (en) Using context information with end-to-end models for speech recognition
US11043214B1 (en) Speech recognition using dialog history
KR102871460B1 (en) Dialect phoneme adaptive training system and method
US10943583B1 (en) Creation of language models for speech recognition
US10121467B1 (en) Automatic speech recognition incorporating word usage information
US9934777B1 (en) Customized speech processing language models
US10176802B1 (en) Lattice encoding using recurrent neural networks
US10134388B1 (en) Word generation for speech recognition
KR102871441B1 (en) Acoustic information based language modeling system and method
JP6066354B2 (en) Method and apparatus for reliability calculation
CN106463113B (en) Predicting pronunciation in speech recognition
CN104681036B (en) A kind of detecting system and method for language audio
WO2010100853A1 (en) Language model adaptation device, speech recognition device, language model adaptation method, and computer-readable recording medium
Lugosch et al. Donut: Ctc-based query-by-example keyword spotting
JP2019514045A (en) Speaker verification method and system
JPWO2008001485A1 (en) Language model generation system, language model generation method, and language model generation program
JP2014157323A (en) Voice recognition device, acoustic model learning device, and method and program of the same
JP6031316B2 (en) Speech recognition apparatus, error correction model learning method, and program
US20240296838A1 (en) Machine learning model updating
CN112509560B (en) Voice recognition self-adaption method and system based on cache language model
KR20180038707A (en) Method for recogniting speech using dynamic weight and topic information
CN120199247B (en) Intelligent customer service voice interaction method and system based on voice recognition
Savargiv et al. Persian speech emotion recognition
Moyal et al. Phonetic search methods for large speech databases
Mary et al. Searching speech databases: features, techniques and evaluation measures

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10748456

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 10748456

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP