JP2006259641A

JP2006259641A - Voice recognition device and program

Info

Publication number: JP2006259641A
Application number: JP2005080732A
Authority: JP
Inventors: Katsuhiko Shirai; 克彦白井; Hideaki Kikuchi; 英明菊池; Takashi Okubo; 崇大久保
Original assignee: Waseda University
Current assignee: Waseda University
Priority date: 2005-03-18
Filing date: 2005-03-18
Publication date: 2006-09-28

Abstract

PROBLEM TO BE SOLVED: To easily add additional information consisting of characters such as special characters, symbols, and/or figures to a speech-recognized character string.

SOLUTION: A speech recognition device 12 includes a voice detection unit 15 that detects various voice data on the speech input from a microphone 11; a main conversion unit 17 that creates a character string corresponding to the input speech from the voice data detected by the voice detection unit 15; an additional information conversion unit 18 that creates, from the same voice data, additional information to be added to the character string; and a conversion coupling unit 19 that creates output information by adding the additional information to the character string. The additional information conversion unit 18 includes correction means 21 that corrects the prosodic information detected by the voice detection unit 15, and additional information selection means 22 that estimates the speaker's emotion contained in the input speech from the corrected prosodic information and selects additional information expressing that emotion.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a speech recognition device and a speech recognition program, and more particularly to a speech recognition device and a speech recognition program that make it easy to add additional information, such as special characters like emoticons and pictograms, when characters are input by voice.

In recent years, with the spread of personal computers and mobile phones, text-based communication such as e-mail has become widespread. Along with this, greater variety in communication has been pursued. As a result, exchanges between users, mainly young people, use not only plain sentences but also expressions that combine them with emoticons and pictograms composed of various characters and symbols (hereinafter referred to as "special characters"). Thanks to recent improvements in word processor conversion dictionaries, these special characters can be input with a keyboard, numeric keypad, or the like in the same way as ordinary characters.

Meanwhile, speech recognition technology for converting human speech into a character string (text) has been developed, and this technology makes it possible to input text into e-mail and the like by voice. As one such technology, a speech recognition system is known that partially changes the color or font of the character string converted from the input speech, according to changes in the input speech information (see Patent Document 1).
Patent Document 1: JP 2001-134290 A

However, although that speech recognition system can change the color and font of the converted character string, it cannot input additional information such as the special characters described above. Moreover, adding such additional information to the converted character string with current speech recognition technology requires troublesome input from the user. That is, to input a special character by voice, the user must utter a combination of speech indicating that a special character is being input and speech indicating its content, for example "emoticon 1", "emoticon, joy", or "emoticon, smiling", which makes the input operation cumbersome.

The present invention has been devised in view of these disadvantages. Its object is to provide a speech recognition device and a speech recognition program that can easily add, to a speech-recognized character string, additional information consisting of characters such as special characters, symbols, and/or figures.

(1) To achieve the above object, the present invention includes an additional information conversion unit that creates additional information to be added to a character string converted from input speech, and the additional information conversion unit creates the additional information based on prosodic information of the input speech.

(2) Preferably, the additional information conversion unit estimates the emotion contained in the input speech from the prosodic information and creates additional information expressing that emotion.

(3) Preferably, the additional information conversion unit further determines the content of the additional information based on the prosodic information of the whole target portion of the input speech to which the additional information is added and the prosodic information of the final mora portion within that target portion.
In this specification, a "mora" is a unit corresponding to a rhythmic beat. For example, the utterance "onsei" (おんせい) is counted as four morae.

(4) The additional information conversion unit may also determine, for each section delimited by pauses of at least a predetermined duration, whether the additional information should be added.

(5) Further, the additional information conversion unit may create the additional information using normalized prosodic information.

(6) The present invention is also a speech recognition program for causing a computer to execute speech recognition processing, the program causing the computer to create additional information that is added, based on prosodic information of the input speech, to a character string converted from the input speech.

According to the present invention, the additional information is created based on the prosodic information of the speaker, who is the user. The speaker's emotion can therefore be estimated from the prosodic information, and additional information corresponding to that emotion can be created automatically. The speaker does not need to separately utter speech indicating the content of the additional information, so the additional information can be added to the character string accurately with a simple input method that is easy for the user to adopt.

Embodiments of the present invention will be described below with reference to the drawings.

FIG. 1 is a schematic configuration diagram of a speech recognition system to which the speech recognition device according to the present embodiment is applied. In this figure, a speech recognition system 10 is built into a terminal such as a personal computer, a portable information terminal, or a mobile phone. It includes a microphone 11 to which the speech of the speaker, who is the user, is input; a speech recognition device 12 that creates output information consisting of characters, symbols, and/or figures from the speech input to the microphone 11; and a display device 13, such as a liquid crystal display, that displays the output information created by the speech recognition device 12. The microphone 11 and the display device 13 of the present embodiment are of known construction, and their detailed description is omitted here.

The speech recognition device 12 is implemented in hardware and/or software. It is made up of a processor and the like, with a plurality of program modules and/or processing circuits, and has installed on it a program capable of executing the various processes described below.

Specifically, the speech recognition device 12 includes a voice detection unit 15 that detects various voice data on the speech input from the microphone 11; a main conversion unit 17 that creates a character string corresponding to the input speech from the voice data detected by the voice detection unit 15; an additional information conversion unit 18 that creates, from the voice data detected by the voice detection unit 15, additional information to be added to the character string; and a conversion coupling unit 19 that creates output information by adding the additional information created by the additional information conversion unit 18 to the character string created by the main conversion unit 17.

The voice detection unit 15 is arranged to detect the voice data required by the main conversion unit 17 and the additional information conversion unit 18. As part of that data it detects prosodic information, namely the frequency (Hz) that represents pitch, the volume (dB), and the speaking rate (number of morae per unit time).

The main conversion unit 17 renders the input speech data from the microphone 11 as text using one or more of the various known speech recognition techniques. Because the speech recognition technique itself is not the gist of the present invention, its detailed description is omitted.

The additional information conversion unit 18 includes correction means 21 that corrects the prosodic information detected by the voice detection unit 15, and additional information selection means 22 that estimates the emotion contained in the input speech from the corrected prosodic information and selects additional information expressing that emotion.

The correction means 21 performs normalization (standardization) on the prosodic information detected by the voice detection unit 15. Even for speech expressing the same emotion, prosodic values such as voice pitch differ in magnitude between individuals, for example between men and women. Normalization is performed so that the additional information selection means 22 can apply a fixed set of criteria. As described later, the mean and standard deviation of each item of prosodic information, which are data specific to the individual speaker, are obtained in advance, and they are used to normalize each item of prosodic information of the speech that the speaker inputs during speech recognition.

When the speaker inputs speech, the additional information selection means 22 treats each speech section delimited by pauses of at least a predetermined duration as a target portion to which additional information may be added. Based on the values of the prosodic information corrected by the correction means 21, it decides whether to add additional information to that portion and, if so, which kind. In the present embodiment, five kinds of additional information are provided: emphasis, which renders the whole character string of the target portion in bold; a paragraph change, which inserts a line break at the end of the character string of the target portion; and an emoticon (anger), an emoticon (joy), and an emoticon (sadness), each appended to the end of the character string of the target portion. The additional information is not limited to these five kinds. Various other information, such as characters including other emoticons and pictograms, symbols, and/or figures, may be adopted in association with the prosodic information.

The selection criteria for each kind of additional information take as parameters the prosodic information of the whole target portion and the prosodic information of the final mora, which is the last beat of the utterance. The decision depends on whether these values exceed predetermined reference values and, specifically, follows the decision tree shown as a flowchart in FIG. 2. This decision tree is set in advance and is built from data accumulated by having a plurality of speakers utter the same sentence with emotions specified beforehand. Alternatively, the decision tree may be built by learning from the user's own utterances.

For the decision by the decision tree of FIG. 2, the additional information selection means 22 in the present embodiment obtains at least the mean frequency and the mean volume as the prosodic information of the whole target portion, and at least the phrase-final rise in frequency, the mean volume, and the speaking rate as the prosodic information of the final mora. The phrase-final rise in frequency is the slope of the straight line obtained by fitting a line to the frequency contour of the whole final mora by the method of least squares. For frequency and volume, the maximum, the minimum, and the range between them may additionally be obtained for both the whole target portion and the final mora and used with a more complex decision tree, which further improves recognition accuracy.
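The phrase-final rise described above can be sketched as an ordinary least-squares line fit. This is only an illustration: the function name `phrase_final_rise`, the sampling of the contour as (time, frequency) pairs, and the units (seconds and Hz) are assumptions, not details given in the specification.

```python
from typing import Sequence


def phrase_final_rise(times: Sequence[float], f0: Sequence[float]) -> float:
    """Return the slope (Hz per second) of the least-squares line through
    the (time, frequency) samples of the final mora.

    The sample format is an assumption made for this sketch.
    """
    n = len(times)
    if n != len(f0) or n < 2:
        raise ValueError("need at least two (time, frequency) samples")
    mean_t = sum(times) / n
    mean_f = sum(f0) / n
    # Sum of squared deviations in time, and the time/frequency cross term.
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (f - mean_f) for t, f in zip(times, f0))
    if sxx == 0:
        raise ValueError("all samples share the same time stamp")
    return sxy / sxx


# A contour rising by 10 Hz every 10 ms gives a slope of about 1000 Hz/s.
slope = phrase_final_rise([0.00, 0.01, 0.02, 0.03], [200.0, 210.0, 220.0, 230.0])
```

A positive slope corresponds to a rising pitch at the end of the phrase, and a negative slope to a falling one.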

Next, the speech recognition procedure performed by the speech recognition device 12 will be described.

First, initial setting is performed. The speaker who will use the speech recognition device 12 is asked to utter several kinds of sentences a few times each, and that speech is input to the microphone 11. The voice detection unit 15 then detects the frequency, volume, and speaking rate, which constitute the prosodic information of the speech. Next, the correction means 21 calculates the mean and standard deviation of each of the frequency, volume, and speaking rate detected by the voice detection unit 15. As described later, these means and standard deviations are used to normalize the prosodic information of the input speech during conversion to the output information.
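The initial setting step can be sketched as computing one (mean, standard deviation) pair per prosodic feature from the enrollment utterances. The function name `calibrate`, the feature keys, and the sample values are hypothetical. The specification does not say whether the population or the sample standard deviation is intended, so the use of `statistics.pstdev` here is an assumption.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def calibrate(samples: Dict[str, List[float]]) -> Dict[str, Tuple[float, float]]:
    """Map each prosodic feature name to its (mean, standard deviation),
    computed over the speaker's enrollment utterances.

    Population standard deviation is an assumption of this sketch.
    """
    return {name: (mean(values), pstdev(values)) for name, values in samples.items()}


# Hypothetical enrollment measurements for one speaker.
profile = calibrate({
    "frequency_hz": [180.0, 200.0, 220.0],
    "volume_db": [58.0, 60.0, 62.0],
    "rate_morae_per_s": [6.5, 7.0, 7.5],
})
```

Storing the result per speaker keeps the later normalization step independent of individual differences such as voice pitch.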

After the above initial setting, the speaker, as the user, inputs the speech to be recognized to the microphone 11. Suppose, for example, that the speaker angrily says "Nande kinou konakatta no" ("Why didn't you come yesterday?"). Wishing to add an emoticon (anger) at the end of the sentence, the speaker inserts pauses before and after "konakatta no" ("didn't you come") and pronounces that part as if truly expressing anger: the final mora portion "tano" is spoken quickly, and the pitch shifts gradually from low to high (from low frequency to high frequency).

The main conversion unit 17 then converts the input speech into a character string (text) matching its content. In the example above, the input speech is converted into the character string "Nande kinou konakatta no" ("Why didn't you come yesterday?").

The additional information conversion unit 18, for its part, creates additional information for each portion delimited by pauses. In the example above, the decision on whether to add additional information, and its selection, are made separately for the first part, "nande kinou" ("why yesterday"), and the second part, "konakatta no" ("didn't you come").
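The division into pause-delimited portions can be sketched as follows. The specification only requires pauses of at least a predetermined duration. The input format (a time-ordered list of voiced intervals in seconds), the function name `split_on_pauses`, and the 0.3 s default threshold are assumptions made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) of a voiced stretch, in seconds


def split_on_pauses(voiced: List[Interval], min_pause: float = 0.3) -> List[Interval]:
    """Merge voiced intervals separated by gaps shorter than min_pause.

    Each returned interval is one target portion for additional information.
    The 0.3 s default is a hypothetical value, not taken from the specification.
    """
    segments: List[Interval] = []
    for start, end in voiced:
        if segments and start - segments[-1][1] < min_pause:
            # Gap too short to count as a pause: extend the current portion.
            segments[-1] = (segments[-1][0], end)
        else:
            segments.append((start, end))
    return segments


print(split_on_pauses([(0.0, 0.4), (0.45, 0.9), (1.5, 2.2)]))
# [(0.0, 0.9), (1.5, 2.2)]
```

In the running example, the two resulting portions would correspond to "nande kinou" and "konakatta no".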

First, the detected values from the voice detection unit 15, that is, each item of prosodic information of the input speech (frequency, volume, and speaking rate), are normalized by the correction means 21 to obtain a corrected value for each item. Each corrected value is obtained by the following equation, using the means and standard deviations of the frequency, volume, and speaking rate of the individual speaker's voice obtained during the initial setting.
Corrected value = (detected value − mean) / standard deviation   (1)
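Equation (1) is the standard z-score. A minimal sketch follows; the function name `normalize` and the example numbers are hypothetical, and the guard against a zero standard deviation is an addition not discussed in the specification.

```python
def normalize(detected: float, mean_value: float, std_dev: float) -> float:
    """Apply equation (1): (detected value - mean) / standard deviation."""
    if std_dev == 0:
        # Not covered by the specification; added so the sketch fails loudly.
        raise ValueError("standard deviation must be non-zero")
    return (detected - mean_value) / std_dev


# A 250 Hz utterance from a speaker whose mean is 200 Hz with a 25 Hz deviation.
print(normalize(250.0, 200.0, 25.0))  # 2.0
```

A corrected value of 2.0 means the utterance is two standard deviations above that speaker's usual pitch, regardless of how high or low the speaker's voice is in absolute terms.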

Then, for each target portion, additional information is selected by the decision tree of FIG. 2 using the corrected values obtained. In the example above, the decision tree of FIG. 2 judges from the corrected values that the part "nande kinou" ("why yesterday"), which was spoken normally, gets no additional information, while the emoticon (anger) is selected as additional information for the emotionally charged part "konakatta no" ("didn't you come").
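The kind of threshold-based decision tree referred to here can be sketched as below. The actual branches and reference values of FIG. 2 are not reproduced in this text, so every threshold, the branch order, the `Prosody` field names, and the returned labels are invented purely for illustration. Only the five feature types and the five kinds of additional information come from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Prosody:
    """Normalized (z-score) prosodic features of one target portion."""
    whole_freq: float    # mean frequency of the whole portion
    whole_volume: float  # mean volume of the whole portion
    final_rise: float    # phrase-final rise of frequency in the final mora
    final_volume: float  # mean volume of the final mora
    final_rate: float    # speaking rate of the final mora


def select_additional_info(p: Prosody) -> Optional[str]:
    """Return a label for the additional information, or None for no addition.

    All thresholds below are hypothetical and do NOT reproduce FIG. 2.
    """
    if p.whole_volume > 1.0:
        if p.final_rise > 1.0 and p.final_rate > 0.5:
            return "emoticon_angry"
        return "bold"
    if p.whole_freq > 1.0:
        return "emoticon_joy"
    if p.whole_freq < -1.0 and p.final_volume < -0.5:
        return "emoticon_sad"
    if p.final_rate < -1.0:
        return "paragraph_break"
    return None


neutral = select_additional_info(Prosody(0.0, 0.0, 0.0, 0.0, 0.0))
angry = select_additional_info(Prosody(0.2, 1.5, 1.4, 0.3, 0.9))
```

With these made-up thresholds, `neutral` is `None` and `angry` is `"emoticon_angry"`, mirroring the outcome described for the two portions of the example utterance.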

The conversion coupling unit 19 then adds the additional information created by the additional information conversion unit 18 to the character string created by the main conversion unit 17. In the example above, the emoticon (anger) created by the additional information conversion unit 18 is appended after the character string "Nande kinou konakatta no" ("Why didn't you come yesterday?") created by the main conversion unit 17. The output information completed in this way is displayed on the display device 13.
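The coupling step can be sketched as joining each recognized portion with its selected additional information. The label names, the ASCII emoticons, and the use of `<b>` tags for bold are hypothetical choices for this sketch. The specification only states that bold covers the whole target portion and that line breaks and emoticons go at its end.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical renderings; the specification does not fix concrete emoticons.
EMOTICONS: Dict[str, str] = {
    "emoticon_angry": ">:(",
    "emoticon_joy": ":)",
    "emoticon_sad": ":(",
}


def combine(portions: List[Tuple[str, Optional[str]]]) -> str:
    """Join (text, additional-information label) pairs into the output string."""
    parts: List[str] = []
    for text, info in portions:
        if info == "bold":
            parts.append(f"<b>{text}</b>")        # emphasis on the whole portion
        elif info == "paragraph_break":
            parts.append(text + "\n")             # line break after the portion
        elif info in EMOTICONS:
            parts.append(text + EMOTICONS[info])  # emoticon appended at the end
        else:
            parts.append(text)                    # no additional information
    return "".join(parts)


print(combine([("何で昨日", None), ("来なかったの", "emoticon_angry")]))
# 何で昨日来なかったの>:(
```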

According to this embodiment, therefore, the additional information expressing emphasis or emotion is determined with reference to the prosodic features that people actually exhibit when expressing emphasis or emotion. When adding special characters such as emoticons to a character string, the user simply speaks with the corresponding emotion, so special characters can be added easily without any special operation or input.

In addition, because the additional information is chosen using the prosodic information in which the features of emotional expression appear, namely the prosodic information of the whole target portion and that of its final mora, the rate at which the additional information corresponding to the user's emotion is correctly identified can be greatly improved.

Besides building the whole speech recognition system 10 into a single piece of hardware such as a terminal, as in the above embodiment, the microphone 11, the display device 13, and similar components may be left on the user's terminal while the speech recognition device 12 is built into a computer such as a server at a remote location, so that speech recognition is performed remotely over a communication network such as the Internet.

The configuration of each part of the device in the present invention is not limited to the illustrated example, and various modifications are possible as long as substantially the same effect is obtained.

FIG. 1 is a schematic configuration diagram of the speech recognition system according to the present embodiment.
FIG. 2 is a flowchart for explaining the decision tree method.

Explanation of symbols

10 Speech recognition system
11 Microphone
12 Speech recognition device
13 Display device
15 Voice detection unit
17 Main conversion unit
18 Additional information conversion unit
19 Conversion coupling unit
21 Correction means
22 Additional information selection means

Claims

A speech recognition apparatus comprising an additional information conversion unit that creates additional information to be added to a character string converted from input speech, wherein the additional information conversion unit creates the additional information based on prosodic information of the input speech.

The speech recognition apparatus according to claim 1, wherein the additional information conversion unit estimates an emotion embedded in the input speech from the prosodic information, and creates additional information expressing the emotion.

The speech recognition apparatus according to claim 1 or 2, wherein the additional information conversion unit determines the content of the additional information based on the prosodic information of the whole target portion of the input speech to which the additional information is added and the prosodic information of the final mora portion within the target portion.

The speech recognition apparatus according to claim 1, 2 or 3, wherein the additional information conversion unit determines whether or not the additional information should be added for each section divided by a pause of a predetermined time or more.

The speech recognition apparatus according to claim 1, wherein the additional information conversion unit creates the additional information using the normalized prosodic information.

A speech recognition program for causing a computer to execute speech recognition processing, the program causing the computer to create additional information that is added, based on prosodic information of input speech, to a character string converted from the input speech.