JP2006030384A - Text-to-speech synthesizer and text-to-speech synthesis method

Info

Publication number: JP2006030384A
Application number: JP2004206080A
Authority: JP
Inventors: Toshiyuki Kumakura; 俊之熊倉; Erika Kumakura; 恵理香熊倉
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2004-07-13
Filing date: 2004-07-13
Publication date: 2006-02-02

Abstract

[Problem] To provide a text-to-speech synthesizer and a text-to-speech synthesis method that can synthesize speech fully expressing the characteristics of each language, even when the input text contains a mixture of languages.
[Solution] The languages contained in mixed-language input text are determined, and the input text is divided by determined language. A text division unit 2 outputs the divided text portions to text-to-speech synthesis units 3_1 to 3_n provided for each language. The speech waveforms synthesized by the text-to-speech synthesis units 3_1 to 3_n are integrated by a speech waveform integration unit 4.
[Selected drawing] FIG. 1

Description

The present invention relates to a text-to-speech synthesizer and a text-to-speech synthesis method capable of processing a plurality of languages.

Text-to-speech synthesis mechanically synthesizes speech from input text, obtained for example by typing, on the basis of the mechanism of human speech production. FIG. 9 shows the configuration of a conventional text-to-speech synthesizer 100. The text-to-speech synthesizer 100 comprises a phonetic symbol string generation unit 101 that converts input text into a phonetic symbol string, a prosody generation unit 102 that converts the phonetic symbol string into prosodic data, and a waveform generation unit 103 that generates a speech waveform from the prosodic data.

The phonetic symbol string generation unit 101 divides the input text into morphemes, the smallest meaningful linguistic units, and generates a phonetic symbol string by referring to a dictionary that stores information on each morpheme such as its notation, part of speech, and reading. The phonetic symbol string represents the input text using phonetic symbols, accent symbols, pause symbols, tone symbols, and the like.

The prosody generation unit 102 then analyzes the input phonetic symbol string and determines the duration, fundamental frequency, and power of each phoneme. The information determined for each phoneme is output to the waveform generation unit 103 as prosodic data. Here, a phoneme is a unit of sound used in a given language and is the smallest unit that distinguishes meaning.

In the waveform generation unit 103, the prosodic data output from the prosody generation unit 102 is separated into phoneme string information, phoneme time information, and pitch pattern information. The phoneme string information contains a phoneme string; the acoustic feature parameters corresponding to the selected phoneme string are stretched or compressed along the time axis according to the phoneme time information, have their pitch modified according to the pitch pattern information, and are converted into a speech waveform.

Through the above processing, the text-to-speech synthesizer 100 converts input text obtained by typing or the like into a speech waveform.

Patent Document 1: JP 2001-14305 A

Text-to-speech synthesizers that can process a plurality of languages, such as Japanese and English, are known (see, for example, Patent Document 1). The electronic document processing apparatus described in Patent Document 1, for example, handles a plurality of languages by selecting a speech synthesis engine on the basis of attribute information indicating the language in which the electronic document is written.

However, Patent Document 1 does not describe how to select a speech synthesis engine when, for example, a plurality of languages are mixed within a single input text. With the technique of Patent Document 1, when a single input text contains a mixture of languages, speech could not be synthesized with the prosody of each language.

The present invention has been made in view of this problem, and its object is to provide a text-to-speech synthesizer and a text-to-speech synthesis method that can synthesize speech fully expressing the characteristics of each language, even when the input text contains a mixture of languages.

To achieve the above object, a text-to-speech synthesizer according to the present invention processes input text having text portions in two or more languages, and comprises: language discrimination means for determining the languages contained in the input text and dividing the input text into text portions by language; a plurality of speech synthesis means, provided for each language, for converting the text portions into speech waveforms; and speech waveform integration means for integrating the speech waveforms converted for each language. The language discrimination means outputs each text portion to one of the plurality of speech synthesis means according to the language of that text portion.

Another text-to-speech synthesizer according to the present invention processes input text having text portions in two or more languages, and comprises: determination means for determining the languages contained in the input text; language processing means for converting the text portion of each determined language into a phonetic symbol string; a plurality of prosody generation means, provided for each language, for converting the phonetic symbol strings into prosodic data; and speech waveform generation means for converting the prosodic data into a speech waveform. The language processing means outputs each phonetic symbol string to one of the plurality of prosody generation means according to that phonetic symbol string.

A text-to-speech synthesis method according to the present invention processes input text having text portions in two or more languages, and comprises: a language discrimination step of determining the languages contained in the input text and dividing the input text into text portions by language; a plurality of speech synthesis steps, provided for each language, of converting the text portions into speech waveforms; and a speech waveform integration step of integrating the speech waveforms converted for each language. The language discrimination step outputs each text portion to one of the plurality of speech synthesis steps according to the language of that text portion.

Another text-to-speech synthesis method according to the present invention processes input text having text portions in two or more languages, and comprises: a determination step of determining the languages contained in the input text; a language processing step of converting the text portion of each determined language into a phonetic symbol string; a plurality of prosody generation steps, provided for each language, of converting the phonetic symbol strings into prosodic data; and a speech waveform generation step of converting the prosodic data into a speech waveform. The language processing step outputs each phonetic symbol string to one of the plurality of prosody generation steps according to that phonetic symbol string.

According to the present invention, the languages contained in input text having text portions in two or more languages are determined, and each text portion is output to the speech synthesis means for its language, so that speech can be synthesized with the prosody of each language.

Likewise, the languages contained in input text having text portions in two or more languages are determined, the text portion of each language is converted into a phonetic symbol string, and each phonetic symbol string is output to the prosody generation means for its language, so that speech can be synthesized with the prosody of each language.

Embodiments of the present invention are described in detail below with reference to the drawings. The text-to-speech synthesizer shown as a specific example of the present invention determines the languages contained in input text in which a plurality of languages are mixed, and synthesizes speech that fully expresses the characteristics of each language.

(First Embodiment)
FIG. 1 shows the overall configuration of a text-to-speech synthesizer 10 according to the first embodiment. The text-to-speech synthesizer 10 comprises a language determination unit 1 that determines the languages contained in input text in which a plurality of languages are mixed, a text division unit 2 that divides the input text by determined language, text-to-speech synthesis units 3_1 to 3_n, provided for each language, that synthesize speech waveforms for the divided text portions, and a speech waveform integration unit 4 that integrates the speech waveforms synthesized by the text-to-speech synthesis units 3_1 to 3_n.

The language determination unit 1 determines the languages contained in input text having text portions in two or more languages. The language determination unit 1 also attaches a tag indicating language information to each determined language. For example, the tagged text for the mixed Japanese and English sentence "これはdictionaryです。" ("This is a dictionary.") is written as follows.
<lang=japanese>これは</lang> <lang=english>dictionary</lang> <lang=japanese>です</lang>
Character strings enclosed in <> are tags, and everything else is text. <lang=japanese> indicates that Japanese begins at that point in the sentence, and </lang> indicates that the language ends.
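The tagged form above can be read back into ordered (language, text) pairs. The following Python sketch is an illustration only: the function name `parse_tagged_text` and the regular expression are assumptions, and the tag syntax is taken directly from the example rather than from any defined grammar.

```python
import re
from typing import List, Tuple

# One <lang=NAME>...</lang> span; syntax assumed from the example above.
_TAG_RE = re.compile(r"<lang=([a-z]+)>(.*?)</lang>", re.DOTALL)

def parse_tagged_text(tagged: str) -> List[Tuple[str, str]]:
    """Return (language, text) pairs in the order they appear."""
    return [(m.group(1), m.group(2)) for m in _TAG_RE.finditer(tagged)]

segments = parse_tagged_text(
    "<lang=japanese>これは</lang> "
    "<lang=english>dictionary</lang> "
    "<lang=japanese>です</lang>"
)
```

Keeping the pairs in order matters because the later stages process and join the portions in their original sequence.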

The language is determined, for example, by one of the following methods.
1. Determination from externally supplied information
For example, when attribute information such as a tag is attached to each word, the language is determined from the language information contained in that attribute information.
2. Determination by character type
For example, hiragana, katakana, and kanji are judged to be Japanese, and alphabetic characters are judged to be English.
3. Determination by dictionary lookup
For example, the language is determined by matching the input text against words stored in a dictionary.
4. Determination by character code
The language is determined by analyzing the character code, a sequence of numerical values defined according to, for example, the type of language.
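Method 2 (determination by character type) might be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the helper names are hypothetical, character type is read from Unicode character names, and punctuation or digits are simply attached to the surrounding language run. Kanji alone cannot distinguish Japanese from Chinese, so a real system would likely combine this with the other methods.

```python
import unicodedata
from typing import List, Optional, Tuple

def char_language(ch: str) -> str:
    """Classify one character by script (method 2 in the list above)."""
    name = unicodedata.name(ch, "")
    if any(k in name for k in ("HIRAGANA", "KATAKANA", "CJK UNIFIED IDEOGRAPH")):
        return "japanese"
    if "LATIN" in name:  # also covers full-width Latin letters
        return "english"
    return "other"       # punctuation, digits, spaces

def split_by_language(text: str, default: str = "japanese") -> List[Tuple[str, str]]:
    """Group consecutive characters of the same language into text portions."""
    segments: List[Tuple[str, str]] = []
    current: Optional[str] = None
    for ch in text:
        lang = char_language(ch)
        if lang == "other":          # assumption: inherit the current run
            lang = current or default
        if lang == current:
            segments[-1] = (lang, segments[-1][1] + ch)
        else:
            segments.append((lang, ch))
            current = lang
    return segments

def tag_text(text: str) -> str:
    """Emit the <lang=...> tagged form used in the example above."""
    return " ".join(f"<lang={l}>{t}</lang>" for l, t in split_by_language(text))

tagged = tag_text("これはdictionaryです。")
```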

The text division unit 2 splits the input text into the text portions tagged by language, and outputs each text portion to the text-to-speech synthesis unit 3_1 to 3_n for its language on the basis of the language information in the tag.

The speech waveform integration unit 4 integrates the speech waveforms synthesized by the text-to-speech synthesis units 3_1 to 3_n for the respective languages. Each time a speech waveform is input, the speech waveform integration unit 4 sends a waveform arrival notification to the text division unit 2; on receiving an end command from the text division unit 2, it outputs the speech waveform integrated up to that point.

Next, the operation of the text-to-speech synthesizer 10 is described with reference to the flowchart in FIG. 2. First, the language determination unit 1 determines the languages contained in the input text (step S1) and attaches a tag indicating language information to each determined language so that the text can be divided into text portions by language.

The tagged text is split by language in the text division unit 2 (step S2). The text division unit 2 first outputs the leading text portion of the input text to the text-to-speech synthesis unit 3_1 to 3_n for its language, according to the language information in the tag. That text-to-speech synthesis unit converts the input text portion into a speech waveform (step S3) and outputs it to the speech waveform integration unit 4.

When the speech waveform integration unit 4 receives a speech waveform output from a text-to-speech synthesis unit, it sends the text division unit 2 a waveform arrival notification indicating that the waveform has arrived. On receiving the waveform arrival notification from the speech waveform integration unit 4, the text division unit 2 outputs the second text portion of the input text to the text-to-speech synthesis unit for its language, according to the language information. The second text portion is likewise converted into a speech waveform by the text-to-speech synthesis units 3_1 to 3_n and sent to the speech waveform integration unit 4.

The speech waveform integration unit 4 integrates the speech waveforms output from the text-to-speech synthesis units 3_1 to 3_n in the order in which they were input to it (step S4), and sends the text division unit 2 a signal indicating that the waveform has arrived. The third and subsequent text portions are processed in the same way until the text portions of the input text are exhausted (step S5).

On receiving the waveform arrival notification for the final text portion, the text division unit 2 sends an end command to the speech waveform integration unit 4. On receiving the end command, the speech waveform integration unit 4 outputs the speech waveform combined up to that point (step S6).
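The handshake in steps S2 to S6 can be sketched in Python as below. Everything here is a simplified assumption: the class and function names are hypothetical, waveforms are plain lists of floats, and the per-language synthesizers are placeholders that return constant samples rather than real speech.

```python
from typing import Callable, Dict, List, Tuple

Waveform = List[float]

class WaveformIntegrator:
    """Stands in for the speech waveform integration unit 4."""

    def __init__(self) -> None:
        self._buffer: Waveform = []

    def receive(self, waveform: Waveform) -> bool:
        # Append in arrival order (step S4) and acknowledge arrival.
        self._buffer.extend(waveform)
        return True  # plays the role of the waveform arrival notification

    def finish(self) -> Waveform:
        # End command received: output what has been integrated (step S6).
        return list(self._buffer)

def synthesize_mixed(
    segments: List[Tuple[str, str]],
    synthesizers: Dict[str, Callable[[str], Waveform]],
) -> Waveform:
    """Stands in for the text division unit 2 driving the flow of FIG. 2."""
    integrator = WaveformIntegrator()
    for lang, text in segments:               # one text portion at a time (S5)
        waveform = synthesizers[lang](text)   # per-language synthesis (S3)
        if not integrator.receive(waveform):  # wait for the notification
            raise RuntimeError("waveform arrival was not acknowledged")
    return integrator.finish()                # end command after last portion

# Placeholder synthesizers: one constant-valued sample per character.
synthesizers = {
    "japanese": lambda text: [1.0] * len(text),
    "english": lambda text: [2.0] * len(text),
}
result = synthesize_mixed(
    [("japanese", "これは"), ("english", "dictionary"), ("japanese", "です")],
    synthesizers,
)
```

Because the next portion is dispatched only after the previous waveform is acknowledged, arrival order and text order coincide, which is why integrating in arrival order is sufficient.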

In this way, the text-to-speech synthesizer 10 of the first embodiment outputs each text portion to the text-to-speech synthesis unit 3_1 to 3_n for its language, so that even input text with text portions in two or more languages can be synthesized with the prosody of each language.

(Second Embodiment)
FIG. 3 shows the configuration of a text-to-speech synthesizer 20 according to the second embodiment. The text-to-speech synthesizer 20 comprises a language determination unit 21 that determines the languages contained in input text in which a plurality of languages are mixed, a phonetic symbol string generation unit 22 that converts the text portion of each word into a phonetic symbol string, a switching processing unit 23 that outputs each phonetic symbol string to one of prosody generation units 24_1 to 24_n provided for each language, the per-language prosody generation units 24_1 to 24_n that convert phonetic symbol strings into prosodic data, a prosody connection unit 25 that connects the prosodic data from the prosody generation units 24_1 to 24_n, and a waveform generation unit 26 that generates a speech waveform on the basis of the prosodic data.

Like the language determination unit 1 of the first embodiment, the language determination unit 21 determines the languages contained in input text having text portions in two or more languages. The language determination unit 21 also attaches a tag indicating language information to each determined language. For example, the tagged text for the mixed Japanese and English sentence "これはdictionaryです。" ("This is a dictionary.") is written as follows.
<lang=japanese>これは</lang> <lang=english>dictionary</lang> <lang=japanese>です</lang>
Character strings enclosed in <> are tags, and everything else is text. <lang=japanese> indicates that Japanese begins at that point in the sentence, and </lang> indicates that the language ends. Although the description uses tags in which language information is written, embodiments of the present invention are not limited to this.

The language is determined, for example, by one of the following methods.
1. Determination from externally supplied information
For example, when attribute information such as a tag is attached to each word, the language is determined from the language information contained in that attribute information.
2. Determination by character type
For example, hiragana, katakana, and kanji are judged to be Japanese, and alphabetic characters are judged to be English.
3. Determination by dictionary lookup
For example, the language is determined by matching the input text against words stored in a dictionary.
4. Determination by character code
The language is determined by analyzing the character code, a sequence of numerical values defined according to, for example, the type of language.

As shown in FIG. 4, the phonetic symbol string generation unit 22 comprises a text analysis unit 221 that analyzes the input text and generates morpheme information, and a prosodic information generation unit 222 that generates phonetic symbols on the basis of the morpheme information. Here, a morpheme is the smallest unit of a character string that carries meaning, such as a stem, prefix, or suffix, and is somewhat smaller than a word. A phonetic symbol string is a representation using phonetic symbols, accent symbols, and the like.

The text analysis unit 221 has text analysis rules 223 and a dictionary 224. The text analysis rules 223 store rules on the ordering of morphemes (grammar, a connection matrix, morpheme N-grams, and so on) and rules for assigning the necessary information to morphemes not registered in the dictionary 224. The dictionary 224 holds information on the registered morphemes, such as notation, part of speech, and reading.

The prosodic information generation unit 222 has prosodic information generation rules 225 for analyzing the morpheme information and obtaining prosodic information. The prosodic information generation rules 225 hold rules, based on the morpheme information, for phrasing (how phrases are delimited) and for changing readings. To obtain more detailed information for prosody generation on each input morpheme, the unit is also connected to the dictionary 224 used by the text analysis unit 221. Information for prosody generation includes, for example, how readings change when several morphemes combine to form a compound word, and the pattern by which the accent nucleus position moves.

In the configuration of the phonetic symbol string generation unit 22 shown in FIG. 4, a single dictionary 224 holds both the morpheme information and the information for prosody generation, but separate dictionaries may be kept. That is, the same processing can be performed with the text analysis unit 221 connected to a dictionary holding only morpheme information and the prosodic information generation unit 222 connected to a dictionary holding only information for prosody generation. Also, the text analysis rules 223, the dictionary 224, and the prosodic information generation rules 225 are described as storing the rules for every language, but the rules may instead be held independently for each language.

Processing of the input text is now described. The input text may be a single sentence or may consist of several sentences.

The input text is divided into text portions by language in the language determination unit 21. In the text analysis unit 221, each text portion is divided into morphemes using the text analysis rules 223 and the dictionary 224, and information on each morpheme is obtained from the dictionary 224. For morphemes not registered in the dictionary 224, the necessary morpheme information is generated using the text analysis rules 223. The morpheme information obtained in this way is sent to the prosodic information generation unit 222.
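The dictionary lookup with a fallback for unregistered morphemes can be illustrated with a small Python sketch. Note the substitution: it uses greedy longest-match segmentation, which is a simplification and not the grammar, connection-matrix, or N-gram rules the text mentions. The dictionary entries, the `Morpheme` type, and the fallback rule (tag an unregistered character as "unknown" and reuse its surface form as the reading) are all hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple

class Morpheme(NamedTuple):
    surface: str   # notation
    pos: str       # part of speech
    reading: str

# Hypothetical stand-in for dictionary 224: surface -> (part of speech, reading).
DICTIONARY: Dict[str, Tuple[str, str]] = {
    "これ": ("pronoun", "kore"),
    "は": ("particle", "wa"),
    "です": ("auxiliary", "desu"),
}

def analyze(text: str, dictionary: Dict[str, Tuple[str, str]] = DICTIONARY) -> List[Morpheme]:
    """Split text into morphemes by greedy longest match against the dictionary."""
    max_len = max((len(k) for k in dictionary), default=1)
    morphemes: List[Morpheme] = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            surface = text[i:i + length]
            if surface in dictionary:
                pos, reading = dictionary[surface]
                morphemes.append(Morpheme(surface, pos, reading))
                i += length
                break
        else:
            # Fallback rule for an unregistered morpheme (assumed behavior).
            morphemes.append(Morpheme(text[i], "unknown", text[i]))
            i += 1
    return morphemes

morphemes = analyze("これはXです")
```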

On the basis of the morpheme information received from the text analysis unit 221, the prosodic information generation unit 222 adds prosodic information to the input sentence using the prosodic information generation rules 225. Prosodic information differs by language; for Japanese it includes, for example, the accent nucleus position, accent strength, pause position, pause length, and changes in reading. Because prosodic information depends on the design of the apparatus, it need not include all of the above items, nor is it limited to them.

The prosodic information obtained by the prosodic information generation unit 222 is expressed as a phonetic symbol string and output toward the prosody generation units. A phonetic symbol string represents the input text using phonetic symbols, accent symbols, pause symbols, tone symbols, and the like.

These symbols may be defined independently, or existing symbols may be adopted. For example, phonetic and accent symbols such as IPA (International Phonetic Alphabet) or SAMPA (Speech Assessment Methods Phonetic Alphabet) may be used, as may pause and tone symbols such as ToBI (Tone and Break Indices). In this embodiment, for convenience, romaji is used for the Japanese portions and original symbols modeled on SAMPA are used for the English portions. For the input text "これはdictionaryです。", for example, the phonetic symbol string output from the phonetic symbol string generation unit 22 is written as follows.
<lang=japanese>korewa</lang> <lang=english>dIkS@neri:</lang> <lang=japanese>desu</lang>
Character strings enclosed in <> are tags, and everything else is the phonetic symbol string. <lang=japanese> indicates that Japanese begins at that point in the sentence, and </lang> indicates that the language ends. Although the description uses tags in which language information is written, embodiments of the present invention are not limited to this.

The switching processing unit 23 selects the prosody generation unit 24_1 to 24_n to which each phonetic symbol string is output. For example, it switches the output destination among the prosody generation units 24_1 to 24_n according to the language information attached to the phonetic symbol string. Given the input phonetic symbol string <lang=japanese>korewa</lang> <lang=english>dIkS@neri:</lang> <lang=japanese>desu</lang>, the switching processing unit 23 outputs "korewa" and "desu" to the Japanese prosody generation unit and "dIkS@neri:" to the English prosody generation unit.
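The routing performed by the switching processing unit 23 could look roughly like the following Python sketch. The function `switch_output`, the recorder-style generators, and the string placeholder for prosodic data are all hypothetical; only the tag format and the example symbols come from the text.

```python
import re
from typing import Callable, Dict, List, Tuple

_TAG_RE = re.compile(r"<lang=([a-z]+)>(.*?)</lang>")

def switch_output(
    tagged_symbols: str,
    generators: Dict[str, Callable[[str], str]],
) -> List[Tuple[str, str]]:
    """Send each phonetic symbol string to the generator for its language.

    Results are kept in input order so that a later stage (the prosody
    connection unit 25 in the text) can join them.
    """
    results: List[Tuple[str, str]] = []
    for match in _TAG_RE.finditer(tagged_symbols):
        lang, symbols = match.group(1), match.group(2)
        results.append((lang, generators[lang](symbols)))
    return results

# Placeholder prosody generators that record what they receive.
received: Dict[str, List[str]] = {"japanese": [], "english": []}

def make_recorder(lang: str) -> Callable[[str], str]:
    def generator(symbols: str) -> str:
        received[lang].append(symbols)
        return f"prosody({symbols})"  # stand-in for real prosodic data
    return generator

generators = {lang: make_recorder(lang) for lang in received}
routed = switch_output(
    "<lang=japanese>korewa</lang> "
    "<lang=english>dIkS@neri:</lang> "
    "<lang=japanese>desu</lang>",
    generators,
)
```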

FIG. 5 shows the configuration of one prosody generation unit 24_n among the prosody generation units 24_1 to 24_n. The prosody generation unit 24_n comprises a duration determination unit 51 for obtaining the duration of each phoneme, a fundamental frequency determination unit 52 for obtaining the fundamental frequency of each phoneme, and a power determination unit 53 for obtaining the volume of each phoneme.

The duration determination unit 51 has a phonetic symbol string analysis unit 54 and a rule application unit 55. The phonetic symbol string analysis unit 54 has analysis rules 56 for analyzing the phonetic symbol string. The analysis rules 56 store rules for obtaining, for example, the accent nucleus position, accent strength, pause position, pause length, and changes in reading from the phonetic symbol string.

The rule application unit 55 has generation rules 57 for determining the duration of each phoneme. The generation rules 57 store rules that decide in which phonological environments the default duration is lengthened or shortened, together with the default duration of each phoneme. Here, the phonological environment indicates whether the phoneme is a vowel, what the preceding and following phonemes are, the position of the phoneme within its syllable, where the accent nucleus is located, and so on.

The duration determination unit 51 uses the generation rules 57 to determine the duration of each phoneme. The information needed to apply the generation rules 57 is obtained by analyzing the input phonetic symbol string. The determined duration information is sent to the fundamental frequency determination unit 52.
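As a rough illustration of rule-based duration determination, the sketch below scales a default duration according to the phonological environment. The default values, the scaling factors, and the two rules themselves (lengthening on the accent nucleus and at the end of the string) are assumptions chosen for the example; the patent does not specify them.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Phoneme:
    symbol: str
    is_vowel: bool
    on_accent_nucleus: bool = False

# Hypothetical default durations in milliseconds.
DEFAULT_MS = {"d": 50.0, "e": 90.0, "s": 80.0, "u": 60.0}
FALLBACK_MS = 70.0  # used for symbols without a stored default (assumption)

def phoneme_duration(p: Phoneme, following: Optional[Phoneme]) -> float:
    """Stretch the default duration according to the phonological environment."""
    d = DEFAULT_MS.get(p.symbol, FALLBACK_MS)
    if p.is_vowel and p.on_accent_nucleus:
        d *= 1.2  # assumed rule: lengthen a vowel on the accent nucleus
    if p.is_vowel and following is None:
        d *= 1.5  # assumed rule: lengthen a vowel at the end of the string
    return d

def determine_durations(phonemes: List[Phoneme]) -> List[float]:
    """Apply the rules to every phoneme, passing its right-hand neighbor."""
    out = []
    for i, p in enumerate(phonemes):
        following = phonemes[i + 1] if i + 1 < len(phonemes) else None
        out.append(phoneme_duration(p, following))
    return out

desu = [Phoneme("d", False), Phoneme("e", True, True),
        Phoneme("s", False), Phoneme("u", True)]
durations = determine_durations(desu)
```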

The fundamental frequency determination unit 52 includes a phonetic symbol string analysis unit 58, a rule application unit 59, and a time expansion/contraction unit 60. The phonetic symbol string analysis unit 58 has analysis rules 61 for analyzing the phonetic symbol string. The analysis rules 61 store rules for obtaining, for example, the accent nucleus position, accent strength, pause positions, pause lengths, and changes in reading from the phonetic symbol string.

The rule application unit 59 has generation rules 62 for generating a fundamental frequency pattern. The generation rules 62 store fundamental frequency pattern values for tone categories such as accent type, together with rules for selecting and modifying those patterns.

The rule selection method can take various forms depending on the device. In this embodiment, whether a rule is applicable is judged on the basis of the accent type of the word to which the phoneme belongs, the accent strength, and the sentence structure, such as dependency relations.

The fundamental frequency pattern data included in the generation rules holds fundamental frequency values that span multiple phonemes. The patterns are classified by prediction factor, such as accent type and tone, so that they can be selected by rule.

The phonetic symbol string analysis unit 58 applies the analysis rules 61 to the input phonetic symbol string to obtain the information needed to apply the generation rules 62. From that information, the rule application unit 59 uses the generation rules 62 to select and modify the most suitable pattern. The time expansion/contraction unit 60 stretches or compresses the selected pattern to match the duration of each phoneme already determined by the duration determination unit 51, and thereby determines the fundamental frequency of each phoneme.
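One way to picture the time expansion/contraction step is to treat the selected pattern as values sampled uniformly over the whole span and to read off one value per phoneme at its temporal midpoint. This is a sketch under those assumptions (uniform sampling, linear interpolation, midpoint read-out); the patent does not state how the pattern is stored or warped, and the function names are hypothetical.

```python
from typing import List, Sequence

def interpolate(pattern: Sequence[float], t: float) -> float:
    """Linearly interpolate a pattern sampled uniformly on [0, 1] at position t."""
    if len(pattern) == 1:
        return pattern[0]
    pos = t * (len(pattern) - 1)
    i = min(int(pos), len(pattern) - 2)
    frac = pos - i
    return pattern[i] * (1.0 - frac) + pattern[i + 1] * frac

def f0_per_phoneme(pattern: Sequence[float], durations_ms: Sequence[float]) -> List[float]:
    """Warp the pattern onto the phoneme durations and return one F0 value per phoneme."""
    total = sum(durations_ms)  # assumed positive when durations_ms is non-empty
    out = []
    elapsed = 0.0
    for d in durations_ms:
        midpoint = (elapsed + d / 2.0) / total  # normalized time of the phoneme center
        out.append(interpolate(pattern, midpoint))
        elapsed += d
    return out

# A rising pattern (in Hz) spread over two phonemes of equal duration.
f0 = f0_per_phoneme([100.0, 200.0], [50.0, 50.0])
```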

The power determination unit 53 includes a phonetic symbol string analysis unit 63 and a rule application unit 64. The phonetic symbol string analysis unit 63 has analysis rules 65 for analyzing the phonetic symbol string. The analysis rules 65 store rules for obtaining, for example, the accent nucleus position, accent strength, pause positions, pause lengths, and changes in reading from the phonetic symbol string.

The rule application unit 64 has generation rules 66 for determining the power of each phoneme.

The power determination unit 53 uses the generation rules 66 to determine the power of each phoneme. The information needed to apply the generation rules 66 is obtained by analyzing the input phonetic symbol string. The determined power information is sent to the output generation unit 67.

The duration of each phoneme determined by the duration determination unit 51, the fundamental frequency of each phoneme determined by the fundamental frequency determination unit 52, and the power of each phoneme determined by the power determination unit 53 are assembled into prosodic data by the output generation unit 67 and output from the prosody generation unit 24_n.

The prosody connection unit 25 arranges and concatenates the prosodic data output from the prosody generation units 24_1 to 24_n so that it matches the order of the phonetic symbol strings input to the switching processing unit 23. When prosodic data arrives, the prosody connection unit 25 sends an arrival notification to the switching processing unit 23. Because the switching processing unit 23 outputs the next phonetic symbol string only after receiving the arrival notification, the prosodic data is integrated in the order of the input text. When the switching processing unit 23 receives the arrival notification for the last phonetic symbol string, it sends an end notification to the prosody connection unit 25. On receiving the end notification, the prosody connection unit 25 outputs the prosodic data to the waveform generation unit 26.
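The arrival and end notifications can be sketched as a simple synchronous handshake. In this simplified model the arrival notification is the return value of `receive` and the end notification is a call to `finish`; the class and function names are hypothetical, and a real device might exchange these notifications asynchronously.

```python
from typing import Callable, Dict, List, Tuple

class ProsodyConnector:
    """Collects prosodic data in arrival order and releases it on the end notification."""

    def __init__(self) -> None:
        self.buffer: List[object] = []

    def receive(self, prosody: object) -> bool:
        """Store one piece of prosodic data; the return value models the arrival notification."""
        self.buffer.append(prosody)
        return True

    def finish(self) -> List[object]:
        """Handle the end notification by handing over the concatenated data."""
        return list(self.buffer)

def run_switching(segments: List[Tuple[str, str]],
                  generators: Dict[str, Callable[[str], object]],
                  connector: ProsodyConnector) -> List[object]:
    """Emit one segment at a time, waiting for each arrival notification before the next."""
    for lang, symbols in segments:
        prosody = generators[lang](symbols)
        if not connector.receive(prosody):
            raise RuntimeError("no arrival notification received")
    return connector.finish()  # sent after the last segment has arrived

segments = [("japanese", "korewa"), ("english", "dIkS@neri:"), ("japanese", "desu")]
generators = {"japanese": lambda s: ("ja", s), "english": lambda s: ("en", s)}
ordered = run_switching(segments, generators, ProsodyConnector())
```

Because the next segment is not emitted until the previous one has arrived, the output order matches the input order even though the segments pass through different generators.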

FIG. 6 shows the configuration of the waveform generation unit 26. The waveform generation unit 26 generates a speech waveform from the prosodic data output from the prosody generation units 24_1 to 24_n. The waveform generation unit 26 includes a prosodic data distribution unit 261 that distributes the input prosodic data by type of information, a segment selection unit 262 that generates the acoustic feature parameters of the phoneme string, a parameter correction unit 263 that corrects the acoustic feature parameters, and a waveform assembly unit 264 that synthesizes the speech signal waveform while changing the pitch. The segment selection unit 262 has speech data 265 that stores parameters indicating acoustic features.

The prosodic data distribution unit 261 divides the input prosodic data into phoneme string information, phoneme duration information, and pitch pattern information, and outputs them to the segment selection unit 262, the parameter correction unit 263, and the waveform assembly unit 264, respectively.

Based on the input phoneme string information, the segment selection unit 262 refers to the speech data 265, selects the phoneme strings contained in the phoneme string information, and sequentially reads out and outputs the acoustic feature parameters corresponding to the selected phoneme strings from the speech data 265.

The speech data 265 is the same kind of data used in existing rule-based speech synthesizers, for example parameters indicating the acoustic features of speech, such as cepstrum coefficients. The length of each segment may be CV, CVC (C: consonant, V: vowel), or another unit matching the synthesis unit; in corpus-based synthesis methods it is not fixed to any particular synthesis unit.

The parameter correction unit 263 stretches or compresses, along the time axis, the acoustic feature parameters arranged on the time axis according to the phoneme string, so that each matches the phoneme length given by the phoneme duration information input from the prosodic data distribution unit 261. It also corrects the acoustic feature parameters to avoid mismatches at the junctions between segments.

The waveform assembly unit 264 synthesizes and outputs the speech signal waveform based on the sequence of acoustic feature parameters output from the parameter correction unit 263, changing the pitch according to the pitch pattern information from the prosodic data distribution unit 261.

The text-to-speech synthesizer 20 configured in this way outputs each text portion to the prosody generation unit 24_1 to 24_n of the corresponding language according to the language information of the phonetic symbol string. As a result, even input text containing text portions in two or more languages can be synthesized with the prosody of each language.

Although the text-to-speech synthesizer 20 has a prosody generation unit 24_1 to 24_n for each language, a single prosody generation unit may instead be used in which each set of rules contains the rules for every language. In that case, speech can be synthesized with the prosody of each language without the switching processing unit 23 and the prosody connection unit 25.

(Third embodiment)
FIG. 7 shows the configuration of the text-to-speech synthesizer 30 in the third embodiment. The text-to-speech synthesizer 30 includes a language determination unit 21 that determines the languages contained in input text in which a plurality of languages are mixed, a phonetic symbol string generation unit 22 that converts the text portion of each word into a phonetic symbol string, a switching processing unit 31 that switches the output of the phonetic symbol strings to the prosody generation units 24_1 to 24_n provided for each language according to the language determined by a language determination unit 32, the prosody generation units 24_1 to 24_n for each language that convert the phonetic symbol strings into prosodic data, a prosody connection unit 25 that connects the prosodic data from the prosody generation units 24_1 to 24_n, and a waveform generation unit 26 that generates a speech waveform based on the prosodic data.

That is, the text-to-speech synthesizer 30 in the third embodiment adds a language determination unit 32 to the configuration of the text-to-speech synthesizer 20 in the second embodiment, and the switching processing unit 31 switches the output of the phonetic symbol strings to the prosody generation units 24_1 to 24_n according to the language determined by the language determination unit 32. Parts corresponding to those of the configuration shown in FIG. 3 described above are given the same reference numerals.

The language determination unit 32 determines the base language of the phonetic symbol string input to the switching processing unit 31. The base language is comparable to a speaker's native language: it is used to generate prosody as if a single speaker were uttering the whole text.

Next, the method by which the language determination unit 32 determines the language is described, using the phonetic symbol string <lang=japanese>korewa</lang> <lang=english>dIkS@neri:</lang> <lang=japanese>desu</lang> of the input text 「これはdictionaryです。」 ("This is a dictionary."). A character string enclosed in <> is a tag, and everything else is a phonetic symbol string. <lang=japanese> indicates that Japanese begins at that point in the sentence, and </lang> indicates that the language ends. Although the description uses tags that carry language information, embodiments of the present invention are not limited to this.

1. The language assigned to the longest section in the input phonetic symbol string
In the above example, "dIkS@neri:" has the largest number of phonemes in the input phonetic symbol string, so English is determined as the base language.
2. The language assigned to the largest number of sections in the input phonetic symbol string
In the above example, Japanese has two sections and English has one, so Japanese is determined as the base language.
3. The language assigned to the first section in the input phonetic symbol string
In the above example, the first section <lang=japanese>korewa</lang> is Japanese, so Japanese is determined as the base language.
4. The language assigned to the last section in the input phonetic symbol string
In the above example, the last section <lang=japanese>desu</lang> is Japanese, so Japanese is determined as the base language.
5. The language assigned to both the first and the last sections in the input phonetic symbol string
In the above example, the first and last sections are both Japanese, so Japanese is determined as the base language.
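The five determination methods can be sketched as small functions over a list of (language, phonetic symbols) pairs. The function names are hypothetical, and the length of a section is approximated here by its character count, which is an assumption standing in for a true phoneme count.

```python
from collections import Counter
from typing import List, Optional, Tuple

Segment = Tuple[str, str]  # (language, phonetic symbols)

def by_longest_section(segs: List[Segment]) -> str:
    """Method 1: language of the longest section (character count as a proxy)."""
    return max(segs, key=lambda s: len(s[1]))[0]

def by_most_sections(segs: List[Segment]) -> str:
    """Method 2: language assigned to the largest number of sections."""
    return Counter(lang for lang, _ in segs).most_common(1)[0][0]

def by_first_section(segs: List[Segment]) -> str:
    """Method 3: language of the first section."""
    return segs[0][0]

def by_last_section(segs: List[Segment]) -> str:
    """Method 4: language of the last section."""
    return segs[-1][0]

def by_first_and_last(segs: List[Segment]) -> Optional[str]:
    """Method 5: language shared by the first and last sections, else None."""
    return segs[0][0] if segs[0][0] == segs[-1][0] else None

segs = [("japanese", "korewa"), ("english", "dIkS@neri:"), ("japanese", "desu")]
```

Returning `None` from method 5 when the first and last sections differ is a design choice of this sketch; it lets a caller fall back to another method or to the per-tag routing of the second embodiment.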

A language appearing in the input phonetic symbol string may also be selected arbitrarily as the base language. Alternatively, a list of base languages may be held and the base language selected arbitrarily from it, or a language specified from outside may be used as the base language.

When the base language cannot be determined by any one or more of the above methods, the prosody generation unit of the language indicated by the tag's language information can be used, as in the second embodiment described above.

As shown in FIG. 8, the switching processing unit 31 includes a reconversion unit 311 that reconverts the phonetic symbol string and a language conversion unit 312 that converts the language information.

The reconversion unit 311 reconverts the phonetic symbol string according to the language determined by the language determination unit 32, referring to conversion rules that store information about readings. For example, when Japanese is determined as the base language, the phonetic symbol string is reconverted into a Japanese phonetic symbol string. In this case, for example, the phonetic symbol string <lang=english>dIkS@neri:</lang> is reconverted into <lang=english>dikushonarii</lang>.

The language conversion unit 312 changes the language information of the phonetic symbol string according to the language determined by the language determination unit 32. For example, when Japanese is determined as the base language, the language information attached to the phonetic symbol string is set to Japanese. In this case, for example, the phonetic symbol string <lang=english>dIkS@neri:</lang> is converted into <lang=japanese>dIkS@neri:</lang>. In other words, the language information attached to the phonetic symbol string, which designates the output destination, is converted to the language determined by the language determination unit 32.

By having the reconversion unit 311 and the language conversion unit 312 in this way, a phonetic symbol string of a language other than the base language can be converted into a phonetic symbol string of the base language before prosody is generated. Alternatively, even if the input phonetic symbol string contains phonetic symbol strings of languages other than the base language, prosody can be generated from them as they are.

For example, by converting the phonetic symbol string <lang=japanese>korewa</lang> <lang=english>dIkS@neri:</lang> <lang=japanese>desu</lang> of the input text 「これはdictionaryです。」 into <lang=english>korewa</lang> <lang=english>dIkS@neri:</lang> <lang=english>desu</lang>, it is possible to express an utterance in which only "dictionary" is pronounced in fluent English while the rest is broken Japanese.

Converting it into <lang=japanese>korewa</lang> <lang=japanese>dikushonarii</lang> <lang=japanese>desu</lang> instead expresses a fluent Japanese utterance.

The language determination unit 32 may also determine different base languages by different determination methods, and the phonetic symbol string and its language information may each be converted based on a different base language. In this case, for example, the phonetic symbol string <lang=japanese>korewa</lang> <lang=english>dIkS@neri:</lang> <lang=japanese>desu</lang> of the input text 「これはdictionaryです。」 can be converted into <lang=english>korewa</lang> <lang=english>dikushonarii</lang> <lang=english>desu</lang>, which expresses an utterance in broken Japanese.
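The three conversions described above differ only in which base language is given to the reconversion step and which to the language conversion step. The sketch below makes that explicit; the one-entry reconversion table and the function names are hypothetical stand-ins for the conversion rules that the reconversion unit 311 refers to.

```python
from typing import Dict, List, Tuple

Segment = Tuple[str, str]  # (language tag, phonetic symbols)

# Hypothetical reconversion table keyed by (source language, base language).
RECONVERSION: Dict[Tuple[str, str], Dict[str, str]] = {
    ("english", "japanese"): {"dIkS@neri:": "dikushonarii"},
}

def reconvert(segs: List[Segment], base: str) -> List[Segment]:
    """Rewrite the symbols of non-base sections into the base language's reading."""
    out = []
    for lang, symbols in segs:
        table = RECONVERSION.get((lang, base), {})
        out.append((lang, table.get(symbols, symbols)))
    return out

def convert_language(segs: List[Segment], base: str) -> List[Segment]:
    """Replace every language tag with the base language, leaving the symbols as they are."""
    return [(base, symbols) for _, symbols in segs]

segs = [("japanese", "korewa"), ("english", "dIkS@neri:"), ("japanese", "desu")]

# English prosody throughout, original symbols kept.
english_tags = convert_language(segs, "english")
# Japanese reading and Japanese prosody throughout.
fluent_japanese = convert_language(reconvert(segs, "japanese"), "japanese")
# Japanese reading with English prosody: different bases for the two steps.
mixed_bases = convert_language(reconvert(segs, "japanese"), "english")
```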

Furthermore, because the prosody of various languages can be specified, not only that of the base language, distinctive utterances can be expressed.

In the first to third embodiments described above, a tag indicating language information is attached for each determined language. However, the present invention is not limited to this; for example, a different phonetic symbol set may be used for each language.

FIG. 1 is a block diagram showing the configuration of a text-to-speech synthesizer according to the present invention.
FIG. 2 is a flowchart explaining the operation of the text-to-speech synthesizer according to the present invention.
FIG. 3 is a block diagram showing the configuration of a text-to-speech synthesizer according to the present invention.
FIG. 4 is a block diagram showing the configuration of the phonetic symbol string generation unit.
FIG. 5 is a block diagram showing the configuration of the prosody generation unit.
FIG. 6 is a block diagram showing the configuration of the waveform generation unit.
FIG. 7 is a block diagram showing the configuration of a text-to-speech synthesizer according to the present invention.
FIG. 8 is a block diagram showing the configuration of the switching processing unit.
FIG. 9 is a block diagram showing the configuration of a conventional text-to-speech synthesizer.

Explanation of symbols

1 language determination unit, 2 text division unit, 3_1 to 3_n text-to-speech synthesis units, 4 speech waveform integration unit, 10 text-to-speech synthesizer, 20 text-to-speech synthesizer, 21 language determination unit, 22 phonetic symbol string generation unit, 23 switching processing unit, 24_1 to 24_n prosody generation units, 25 prosody connection unit, 26 waveform generation unit, 30 text-to-speech synthesizer, 31 switching processing unit, 32 language determination unit, 100 text-to-speech synthesizer, 101 phonetic symbol string generation unit, 102 prosody generation unit, 103 waveform generation unit

Claims

1. A text-to-speech synthesizer that processes input text having text portions in two or more languages, comprising:
language determination means for determining the languages included in the input text and dividing the input text into text portions for each language;
a plurality of speech synthesis means, provided for each of the languages, for converting the text portions into speech waveforms; and
speech waveform integration means for integrating the speech waveforms converted for each language,
wherein the language determination means outputs each text portion to the plurality of speech synthesis means according to the language of the text portion.

2. The text-to-speech synthesizer according to claim 1, wherein the language determination means determines the language based on attribute information given to the text portion in advance.

3. The text-to-speech synthesizer according to claim 1, wherein the language determination means determines the language by referring to a dictionary for identifying the text portion.

4. The text-to-speech synthesizer according to claim 1, wherein the language determination means determines the language based on the character type and/or character code of the text portion.

5. A text-to-speech synthesizer that processes input text having text portions in two or more languages, comprising:
determination means for determining the languages included in the input text;
language processing means for converting the determined text portion of each language into a phonetic symbol string;
a plurality of prosody generation means, provided for each of the languages, for converting the phonetic symbol string into prosodic data; and
speech waveform generation means for converting the prosodic data into a speech waveform,
wherein the language processing means outputs the phonetic symbol string to the plurality of prosody generation means according to the phonetic symbol string.

6. The text-to-speech synthesizer according to claim 5, wherein the determination means determines the language based on attribute information given to the text portion in advance.

7. The text-to-speech synthesizer according to claim 5, wherein the determination means determines the language by referring to a dictionary for identifying the text portion.

8. The text-to-speech synthesizer according to claim 5, wherein the determination means determines the language based on the character type and/or character code of the text portion.

9. The text-to-speech synthesizer according to claim 5, wherein the language processing means further includes language determining means for determining a base language of the phonetic symbol string.

10. The text-to-speech synthesizer according to claim 9, wherein the language determining means determines the language assigned to the longest section in the phonetic symbol string.

11. The text-to-speech synthesizer according to claim 9, wherein the language determining means determines the language assigned to the largest number of sections in the phonetic symbol string.

12. The text-to-speech synthesizer according to claim 9, wherein the language determining means determines the language assigned to the section that appears first and/or the section that appears last in the phonetic symbol string.

13. The text-to-speech synthesizer according to claim 9, wherein the language processing means includes reconversion means for reconverting the phonetic symbol string according to the language determined by the language determining means.

14. The text-to-speech synthesizer according to claim 9, wherein the language processing means includes language conversion means for converting the language information of the phonetic symbol string according to the language determined by the language determining means.

15. A text-to-speech synthesis method for processing input text having text portions in two or more languages, comprising:
a language determination step of determining the languages included in the input text and dividing the input text into text portions for each language;
a plurality of speech synthesis steps, provided for each of the languages, of converting the text portions into speech waveforms; and
a speech waveform integration step of integrating the speech waveforms converted for each language,
wherein the language determination step outputs each text portion to the plurality of speech synthesis steps according to the language of the text portion.

16. The text-to-speech synthesis method according to claim 15, wherein the language determination step determines the language based on attribute information given to the text portion in advance.

17. The text-to-speech synthesis method according to claim 15, wherein the language determination step determines the language by referring to a dictionary for identifying the text portion.

18. The text-to-speech synthesis method according to claim 15, wherein the language determination step determines the language based on the character type and/or character code of the text portion.

19. A text-to-speech synthesis method for processing input text having text portions in two or more languages, comprising:
a determination step of determining the languages included in the input text;
a language processing step of converting the determined text portion of each language into a phonetic symbol string;
a plurality of prosody generation steps, provided for each of the languages, of converting the phonetic symbol string into prosodic data; and
a speech waveform generation step of converting the prosodic data into a speech waveform,
wherein the language processing step outputs the phonetic symbol string to the plurality of prosody generation steps according to the phonetic symbol string.

20. The text-to-speech synthesis method according to claim 19, wherein the determination step determines the language based on attribute information given to the text portion in advance.

21. The text-to-speech synthesis method according to claim 19, wherein the determination step determines the language by referring to a dictionary for identifying the text portion.

22. The text-to-speech synthesis method according to claim 19, wherein the determination step determines the language based on the character type and/or character code of the text portion.

23. The text-to-speech synthesis method according to claim 19, wherein the language processing step further includes a language determining step of determining a base language of the phonetic symbol string.

24. The text-to-speech synthesis method according to claim 23, wherein the language determining step determines the language assigned to the longest section in the phonetic symbol string.

25. The text-to-speech synthesis method according to claim 23, wherein the language determining step determines the language assigned to the largest number of sections in the phonetic symbol string.

26. The text-to-speech synthesis method according to claim 23, wherein the language determining step determines the language assigned to the section that appears first and/or the section that appears last in the phonetic symbol string.

27. The text-to-speech synthesis method according to claim 23, wherein the language processing step includes a reconversion step of reconverting the phonetic symbol string according to the language determined in the language determining step.

28. The text-to-speech synthesis method according to claim 23, wherein the language processing step includes a language conversion step of converting the language information of the phonetic symbol string according to the language determined in the language determining step.