JP2014062970A - Voice synthesis, device, and program

Info

Publication number: JP2014062970A
Application number: JP2012206980A
Authority: JP
Inventors: Kenji Nagamatsu; 健司永松; Takeo Mori; 竹雄森
Original assignee: Hitachi ULSI Systems Co Ltd
Current assignee: Hitachi Solutions Technology Ltd
Priority date: 2012-09-20
Filing date: 2012-09-20
Publication date: 2014-04-10
Anticipated expiration: 2032-09-20
Also published as: JP6013104B2

Abstract

Problem: In speech synthesis processing that generates synthesized speech in an utterance style with large variations in prosody and voice quality, such as colloquial speech, to raise the sound quality (the sense of a real human voice) of the characteristic parts of colloquial speech (the colloquial expression parts).
Solution: A (quasi) real voice selection unit 103 selects, as specified by a (quasi) real voice ID 114, a real voice or a voice comparable to a real voice for the characteristic part of the colloquial speech. A voice quality information extraction unit 105 analyzes the prosody and voice quality of that voice. Based on the result, a voice quality-considered waveform synthesis unit 110 adjusts the prosody of the remaining synthesized speech parts and generates the synthesized speech waveform taking the analyzed voice quality into account. This yields synthesized speech in which prosody and voice quality do not differ greatly between the real voice part and the synthesized speech part.
Selected drawing: FIG. 1A
Kind Code: A1

Description

The present invention relates to a speech synthesizer, and in particular to speech synthesis technology for generating synthesized speech in a given utterance style.

Text-to-speech synthesis technology converts text into speech and reads it aloud, and text-to-speech synthesis systems are built on it. Applications include guidance voices in car navigation, e-mail reading and voice dialogue interfaces on mobile phones and smartphones, screen readers for visually impaired users, and read-aloud functions for electronic books.

Speech synthesis technology has grown in importance in recent years. In addition to the spread of car navigation systems, mobile phones, and smartphones, the move toward universal design for people with disabilities and the elderly has made voice-based user interfaces more widely used than ever. Furthermore, electronic book readers have recently begun to spread, increasing the need for the speech synthesis technology that reading aloud requires.

Recent demand for speech synthesis increasingly comes from voice dialogue processing in car navigation systems, mobile phones, and similar devices. In such applications there is also a growing need to synthesize speech in a spoken, colloquial style so that the user can carry on the conversation naturally. Some conventional speech synthesis techniques can already generate colloquial speech by adjusting the prosody (intonation, rhythm, intensity, and so on) of the synthesized speech (see Patent Document 1).

Patent Document 1: JP 2011-28131 A
Patent Document 2: JP 2008-107454 A
Patent Document 3: JP 2009-020264 A

In actual colloquial speech, however, prosody and voice quality change considerably in the expressions characteristic of the colloquial style (for example, the sentence-final "kanaa" (かなぁ)). Conventional speech synthesis techniques place so much emphasis on reproducing colloquial prosody that they do not adequately reproduce actual colloquial speech in terms of voice quality.

An object of the present invention is to solve the above problem by providing a speech synthesis method, device, and program that, in speech synthesis processing that generates synthesized speech in an utterance style with large variations in prosody and voice quality, such as colloquial speech, can raise the sound quality (the sense of a real human voice) of the characteristic parts of colloquial speech (the colloquial expression parts).

A further object of the present invention is to provide a speech synthesis method, device, and program in which, when a real voice is inserted into the colloquial expression part, no large difference in prosody or voice quality arises between the real voice part and the synthesized speech part.

To achieve the above objects, the present invention provides a speech synthesis method in a processing unit that outputs synthesized speech. The processing unit selects real voice data corresponding to a characteristic part of the utterance style in the input text, or quasi real voice data that is comparable to a real voice in prosody and voice quality (hereinafter collectively called (quasi) real voice data); extracts prosody information and voice quality information from the selected (quasi) real voice data; adjusts the prosody information of the synthesized speech corresponding to the input text based on the extracted prosody information; generates synthesized speech waveform data corresponding to the input text taking the extracted voice quality information into account; and joins the generated synthesized speech waveform data and the selected (quasi) real voice data at the connection end points.

To achieve the above objects, the present invention also provides a speech synthesizer comprising a processing unit and a storage unit. The processing unit selects (quasi) real voice data corresponding to a characteristic part of the utterance style in the input text; extracts prosody information and voice quality information from the selected (quasi) real voice data; adjusts the prosody information of the synthesized speech corresponding to the input text based on the extracted prosody information; generates synthesized speech waveform data corresponding to the input text taking the extracted voice quality information into account; and joins the generated synthesized speech waveform data and the selected (quasi) real voice data at the connection end points and outputs the result.

To achieve the above objects, the present invention further provides a speech synthesis program executed by the processing unit of a speech synthesizer. The program causes the processing unit to select (quasi) real voice data corresponding to a characteristic part of the utterance style in the input text; extract prosody information and voice quality information from the selected (quasi) real voice data; adjust the prosody information of the synthesized speech corresponding to the input text based on the extracted prosody information; generate synthesized speech waveform data corresponding to the input text taking the extracted voice quality information into account; and join the generated synthesized speech waveform data and the selected (quasi) real voice data at the connection end points.

According to the present invention, in speech synthesis processing that generates synthesized speech in an utterance style with large variations in prosody and voice quality, such as colloquial speech, the overall sound quality (the sense of a real human voice) can be improved by using a real voice, or voice data of comparable quality, for the parts characteristic of the utterance style (the colloquial expression parts). In addition, differences in prosody and voice quality between the (quasi) real voice and the remaining synthesized speech parts can be kept as small as possible, which reduces the unnatural impression that synthesized speech gave in the prior art.

A diagram explaining an example of the overall functional configuration of the speech synthesizer of Embodiment 1.
A diagram explaining an example of the hardware configuration of the speech synthesizer of Embodiment 1.
A diagram showing an example of data stored in the (quasi) real voice storage unit according to Embodiment 1.
A diagram showing an example of output data of the input analysis unit according to Embodiment 1.
A diagram showing an example of output data of the prosody generation unit according to Embodiment 1.
A diagram showing an example of data stored in the (quasi) real voice storage unit according to Embodiment 1.
A diagram showing an example of output data of the prosody information extraction unit according to Embodiment 1.
A flowchart of the overall processing of the prosody generation unit according to Embodiment 1.
A flowchart of the prosody information generation processing of the prosody generation unit according to Embodiment 1.
A schematic diagram of the prosody adjustment processing in the prosody generation unit according to Embodiment 1.
A diagram showing an example of output data of the prosody generation unit according to Embodiment 1.
A flowchart of the processing of the voice quality extraction unit according to Embodiment 1.
A diagram showing an example of data stored in the (quasi) real voice storage unit according to Embodiment 1.
A flowchart of the processing of the voice quality-considered waveform synthesis unit according to Embodiment 1.
A diagram explaining the overall functional configuration of the speech synthesizer according to Embodiment 2.
A diagram showing an example of data in the reading scenario storage unit according to Embodiment 2.
A diagram showing an example of data in the global voice quality information storage unit according to Embodiment 2.
A flowchart of the processing of the voice quality-considered waveform synthesis unit according to Embodiment 2.
A diagram explaining the overall functional configuration of the speech synthesizer according to Embodiment 3.
A diagram showing an example of the structure of content data according to Embodiment 3.
A diagram showing an example of the structure of content data according to Embodiment 3.
A diagram showing an example of the structure of global voice quality information according to Embodiment 1.

Various embodiments are described below in order, with reference to the drawings. In this specification, "real voice" means a real human voice in the ordinary sense; "quasi real voice" means synthesized voice that is comparable to a real voice in prosody and voice quality; and "(quasi) real voice" covers both. Likewise, "real voice data" means real voice data in the ordinary sense, "quasi real voice data" means synthesized voice data comparable to a real voice in prosody and voice quality, and "(quasi) real voice data" means both real voice data and quasi real voice data.

The first embodiment relates to a speech synthesizer and system. The configuration of the speech synthesizer and system of this embodiment realizes the basic function of converting read-aloud text entered by the user into synthesized speech. Below, each unit of internal processing in the speech synthesizer and system is called a "unit", but these need not be realized as hardware devices; their functions can also be realized as a system implemented by a software program, or as a program.

FIG. 1A illustrates the overall functions of the speech synthesizer and system of this embodiment. When the speech synthesizer of this embodiment is given a (quasi) real voice ID 114 in addition to the input text (mixed kanji-kana text) 112 to be read aloud, it performs speech synthesis processing and outputs synthesized speech 113. To realize this processing, it has the processing units found in an ordinary speech synthesizer: an input analysis unit 106, a prosody generation unit 107, a voice quality-considered waveform synthesis unit 110, a connection synthesis unit 111, a prosody model 108, and a segment database (DB) 109. In addition, it has units characteristic of this embodiment: a (quasi) real voice storage unit 101; a quasi real voice generation unit 102 that generates synthesized voice comparable to a real voice in prosody and voice quality; a (quasi) real voice selection unit 103 that selects a (quasi) real voice; a prosody information extraction unit 104; and a voice quality information extraction unit 105. The voice quality-considered waveform synthesis unit 110 is further characterized in that it performs special waveform synthesis processing based on the voice quality information extracted by the voice quality information extraction unit 105. These processing units in the device and system of FIG. 1A can be realized, for example, by program execution in the processing unit described next.

The processing performed by each functional unit of this embodiment is described below. Before that, FIG. 1B shows a specific example of the hardware configuration of a speech synthesizer and system that realizes this overall functional configuration. This hardware configuration can be used not only for the first embodiment but also for the speech synthesizers and systems of the other embodiments described later.

In the configuration of FIG. 1B, 120 is an ordinary computer such as a personal computer. Internally it has a central processing unit (CPU) 121 serving as the processing unit, a memory (MM) 122 serving as a storage unit, a network interface 123 connected to a network 124, and an input/output unit (I/O) 126, all interconnected by an internal bus 128. Externally it has a hard disk drive (HDD) 125 serving as a storage unit and a speaker 127 connected to the input/output unit 126. The HDD 125, an external storage unit, stores the various functional programs for speech synthesis, which are loaded into the MM 122 at run time and executed by the CPU 121, as well as various databases such as the segment database (DB). The speaker 127 connected to the input/output unit 126 outputs the synthesized speech produced by the speech synthesizer and system.

Returning to the functional configuration of Embodiment 1 in FIG. 1A, the input text 112, written in mixed kanji and kana, is the text to be read aloud, such as "よろしくお願いします" ("Thank you in advance") or "今日の天気は晴れかなぁ" ("I wonder if the weather will be fine today"). Each of these examples is a single sentence, but a longer text combining several sentences can of course be input. Although FIG. 1A assumes Japanese kanji-kana text, other languages such as English or Chinese may also be used. In that case, the internal speech synthesis processing must use programs and data that support that language.

The (quasi) real voice ID 114 is information that specifies which (quasi) real voice to use for which character string in the input text 112. FIG. 2 shows an example of the database 200 in the (quasi) real voice storage unit 101, which stores each ID together with its corresponding character string and (quasi) real voice. As stated above, a (quasi) real voice is either an ordinary real voice or synthesized voice comparable to a real voice in prosody and voice quality. The user of the speech synthesizer inputs, for example, "2" as the (quasi) real voice ID 114 together with the input text 112 "今日の天気は晴れかなぁ". The user consults the (quasi) real voice database 200 shown in FIG. 2 to decide the ID of the (quasi) real voice to be used for part of the input text. For a colloquial expression such as "kanaa" (かなぁ), for example, recordings uttered with various prosody (intonation) patterns are possible. The user selects the (quasi) real voice with the desired prosody and specifies the corresponding (quasi) real voice ID, "1" or "2".

The input analysis unit 106 receives the input text 112 and the (quasi) real voice ID 114 and analyzes them. For the input text 112, it performs language analysis that determines the pronunciation, accent, and other properties of the words in the text, and passes the analysis result to the prosody generation unit 107.

The (quasi) real voice ID 114, on the other hand, is passed unchanged to the (quasi) real voice selection unit 103. Other embodiments described later cover the cases where the (quasi) real voice ID 114 is not given as a simple ID, or where no (quasi) real voice ID is given at all. In those cases, the input analysis unit 106 passes to the (quasi) real voice selection unit 103 the information needed to select a (quasi) real voice, taken from the result of analyzing the input text 112.

The language analysis performed by the input analysis unit 106 is already implemented in a variety of speech synthesizers. The information output from the input analysis unit 106 is, for example, like the information 300 shown in FIG. 3. The information 300 in FIG. 3 is the minimum required; the unit may also output a wider range of additional information.

The prosody generation unit 107 uses the input analysis result (FIG. 3) to determine the prosody information that the synthesized speech should have. This prosody information includes, for example, the length (ms), pitch (Hz), and intensity (dB) of each phoneme in the input text. The prosody generation unit 107 uses the prosody model 108 to determine it.

FIG. 4 shows an example of the prosody information output from the prosody generation unit 107 as information 400. In this embodiment, however, the prosody generation unit 107 adjusts the generated prosody information 400 of the input text based on the prosody information of the (quasi) real voice output from the prosody information extraction unit 104. This processing is described later.

The voice quality-considered waveform synthesis unit 110 generates synthesized speech data based on the prosody information 400 of FIG. 4 output from the prosody generation unit 107. The basic processing here can be the waveform synthesis used in conventional speech synthesis, which normally generates synthesized speech by referring to speech data called the segment DB 109. One applicable technique stores a large number of speech waveform data items (segments) in phoneme or syllable units in the segment DB, selects from them the optimal sequence of segments for constructing the input text, and then applies waveform connection processing or waveform superposition processing to generate the synthesized speech. Waveform connection can use algorithms such as cross-fade overlapping, and waveform superposition can use algorithms such as time-domain pitch-synchronous overlap-add (TD-PSOLA). Alternatively, a parametric synthesis technique can be used: speech analysis parameters are stored in the DB in phoneme or syllable units, the parameters for the phonemes of the input text are retrieved, and they are passed through inverse filter processing to convert them into a speech waveform. In recent years there are also methods that model the speech analysis parameters with a probabilistic model called an HMM (hidden Markov model) to achieve smoother parameter trajectories.
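The segment selection step described above, choosing the optimal sequence of segments from the segment DB, is commonly implemented as a dynamic-programming search. The following is a minimal sketch under stated assumptions, not the method of this specification: the text does not define how "optimal" is measured, so the target cost, the join cost, and the tuple layout `(name, start_hz, end_hz)` are all hypothetical and chosen only for illustration.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Viterbi-style search for the cheapest segment sequence.

    targets[i] is the desired spec for position i, candidates[i] the list of
    segments available for it. Both cost functions are supplied by the caller
    (the specification does not define them). Assumes at least one target.
    """
    best = [target_cost(targets[0], u) for u in candidates[0]]
    back = []  # back[i-1][k]: best predecessor index for candidate k at position i
    for i in range(1, len(targets)):
        new_best, pointers = [], []
        for u in candidates[i]:
            costs = [best[j] + join_cost(p, u)
                     for j, p in enumerate(candidates[i - 1])]
            j = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[j] + target_cost(targets[i], u))
            pointers.append(j)
        best = new_best
        back.append(pointers)
    # Trace back from the cheapest final candidate.
    k = min(range(len(best)), key=best.__getitem__)
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return [candidates[i][k] for i, k in enumerate(path)]


# Hypothetical costs: segments are (name, start_hz, end_hz) tuples.
def pitch_target_cost(target_hz, unit):
    return abs(target_hz - (unit[1] + unit[2]) / 2.0)

def pitch_join_cost(prev_unit, next_unit):
    return abs(prev_unit[2] - next_unit[1])


targets = [200.0, 220.0]
candidates = [
    [("a1", 200.0, 200.0), ("a2", 195.0, 218.0)],
    [("b1", 220.0, 220.0)],
]
chosen = select_units(targets, candidates, pitch_target_cost, pitch_join_cost)
print([u[0] for u in chosen])
```

In this toy data, segment `a1` matches its target pitch exactly but joins poorly to `b1`, so the search prefers `a2`. The example illustrates why selection is done over the whole sequence rather than one segment at a time.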

The waveform synthesis described above is the basic processing performed by the voice quality-considered waveform synthesis unit 110. In this embodiment, the unit additionally uses the voice quality information of the (quasi) real voice output from the voice quality information extraction unit 105 when carrying out the segment selection or the adjustment of speech analysis parameters described above. This makes it possible to generate synthesized speech whose voice quality is close to that of the (quasi) real voice. This processing is described later.

The connection synthesis unit 111 combines the synthesized speech output from the voice quality-considered waveform synthesis unit 110 with the (quasi) real voice data output from the (quasi) real voice selection unit 103. For the input text 112 "今日は晴れかなぁ" ("I wonder if it will be fine today") and the (quasi) real voice ID 114 "2", the voice quality-considered waveform synthesis unit 110 outputs synthesized speech for the whole text, and at the same time the (quasi) real voice selection unit 103 outputs the (quasi) real voice data corresponding to ID "2". The connection synthesis unit 111 replaces the part of the synthesized speech corresponding to "kanaa" (かなぁ) with the (quasi) real voice data for "kanaa" output from the (quasi) real voice selection unit 103. This replacement can use, for example, cross-fade overlapping. It can also use the processing for connecting real voice and synthesized speech disclosed by the present inventors and others in Patent Document 2 and Patent Document 3. As a result, synthesized speech 113 is output whose sound quality is improved by the use of the specified (quasi) real voice.
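The cross-fade replacement mentioned above can be sketched as follows. This is an illustrative simplification, not the connection processing of Patent Documents 2 or 3: it assumes both signals are plain lists of samples at the same sampling rate, that the replaced expression is sentence-final, and that a linear fade is acceptable. The function names and the `overlap` parameter are hypothetical.

```python
def crossfade_join(left, right, overlap):
    """Join two sample sequences, linearly cross-fading over `overlap` samples."""
    overlap = max(0, min(overlap, len(left), len(right)))
    split = len(left) - overlap
    mixed = []
    for k in range(overlap):
        w = (k + 1) / (overlap + 1)  # weight of the incoming (right) signal
        mixed.append((1.0 - w) * left[split + k] + w * right[k])
    return list(left[:split]) + mixed + list(right[overlap:])


def replace_sentence_final(synth, start, real_voice, overlap):
    """Replace synth[start:] with real_voice.

    The first `overlap` samples of the replaced region are kept from the
    synthetic signal so that it fades out while the real voice fades in.
    """
    return crossfade_join(synth[:start + overlap], real_voice, overlap)


# Toy signals: a constant "synthetic" signal and a silent "real voice".
synth = [1.0] * 20
real_voice = [0.0] * 10
out = replace_sentence_final(synth, start=6, real_voice=real_voice, overlap=4)
print(len(out), [round(v, 2) for v in out[4:12]])
```

With this layout the output length is `start + len(real_voice)`, so the real voice begins at the same position where the synthetic version of the expression began.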

The above describes the main flow of the speech synthesis processing that converts the input text 112 into the synthesized speech 113. The functional units characteristic of the speech synthesizer of this embodiment are described below.

The (quasi) real voice storage unit 101 stores (quasi) real voice waveform data corresponding to various expressions (Japanese character strings). The list 500 shown in FIG. 5 is an example of the data stored in the unit. In this example, fragments of (quasi) real voice are stored for expressions used at the end of sentences in colloquial (spoken-language) speech. The stored data need not be limited to sentence-final expressions; (quasi) real voice data can be associated in the same way with any expression that requires real-voice quality. Each stored waveform is accompanied by additional information for later use by the connection synthesis unit 111: the length of the voice (ms) and the pitch (Hz) at both end points.

The quasi real voice generation unit 102 is a processing unit that generates quasi real voice, that is, synthesized voice data whose quality is comparable to a real voice. Its purpose is to make more kinds of voice available than those stored in the (quasi) real voice storage unit 101. If the (quasi) real voice data stored in the (quasi) real voice storage unit 101 is judged sufficient to secure the required sound quality, the quasi real voice generation unit 102 is not needed. Its processing is therefore described in Embodiment 2. The description of this embodiment continues on the assumption that the quasi real voice generation unit 102 does not operate. As a variation of this embodiment, however, when the stored (quasi) real voice data alone is judged insufficient to secure the required sound quality, the quasi real voice generation unit 102 can generate the required quasi real voice, that is, synthesized voice comparable to a real voice in prosody and voice quality.

続いて、（準）肉声音声選択部１０３では、入力解析部を通して受け渡された（準）肉声音声ＩＤ１１４をもとに、対応する（準）肉声音声データを（準）肉声音声格納部１０１、または準肉声音声生成部１０２から取得する。上述の例、（準）肉声音声ＩＤ１１４として「２」が指定された場合は、図５に示される（準）肉声音声格納部１０１に格納されたデータの中からＩＤ＝２に対応する（準）肉声音声データ、およびその付加情報が選択される。ここで選択されたデータ類は韻律情報抽出部１０４、声質情報抽出部１０５、および接続合成部１１１に渡される。 Subsequently, in the (semi) voice voice selection unit 103, based on the (semi) voice voice ID 114 passed through the input analysis unit, the corresponding (semi) speech voice data is converted to the (semi) speech voice storage unit 101, Alternatively, it is obtained from the quasi-real voice generation unit 102. In the above example, when “2” is designated as the (quasi) real voice ID 114, it corresponds to ID = 2 from the data stored in the (quasi) real voice storage unit 101 shown in FIG. ) Real voice data and its additional information are selected. The data selected here is passed to the prosodic information extraction unit 104, the voice quality information extraction unit 105, and the connection synthesis unit 111.

韻律情報抽出部１０４では、（準）肉声音声選択部１０３から出力された肉声音声データに対して韻律情報の抽出処理を行う。韻律情報とは音声データを構成する各音素の長さ（ｍｓ）、高さ（Ｈｚ）、強さ（ｄＢ）などの情報であり、従来から幅広く利用されているさまざまな音声信号処理アルゴリズムを用いることで分析することができる。例えば、音の高さ（Ｈｚ）の分析には、自己相関関数のピーク検出によるピッチ抽出アルゴリズムや、ＬＰＣ残差波形の自己相関関数のピーク検出アルゴリズムなどを用いることができる。音の強さに関しても二乗平均値などのパワー分析アルゴリズムなどが存在する。 The prosody information extraction unit 104 performs prosody information extraction processing on the real voice data output from the (quasi) real voice selection unit 103. Prosodic information is information such as the length (ms), height (Hz), and strength (dB) of each phoneme constituting speech data, and uses various speech signal processing algorithms that have been widely used in the past. Can be analyzed. For example, a pitch extraction algorithm based on autocorrelation function peak detection, an LPC residual waveform autocorrelation function peak detection algorithm, or the like can be used for sound pitch (Hz) analysis. There are also power analysis algorithms such as the mean square value for sound intensity.
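The pitch and intensity analysis mentioned above can be illustrated with a minimal Python sketch. It is not the implementation of the prosody information extraction unit 104: the function names, the 60 to 400 Hz search range, and the use of plain (unnormalized) autocorrelation on a single frame are all assumptions made for illustration.

```python
import math

def estimate_pitch_hz(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate pitch (Hz) from the autocorrelation peak within the lag range for [fmin, fmax]."""
    min_lag = max(1, int(sample_rate / fmax))
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_value = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Autocorrelation of the frame with itself shifted by `lag` samples.
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    # No positive peak found: treat the frame as unvoiced and report 0 Hz.
    return sample_rate / best_lag if best_lag else 0.0

def power_db(frame):
    """Mean-square power of the frame in dB (relative to an amplitude of 1.0)."""
    mean_square = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(mean_square) if mean_square > 0 else float("-inf")
```

In practice the frame would be a short voiced stretch of the (quasi) real voice data; the LPC-residual variant named in the text would apply the same peak search to the residual signal instead of the raw waveform.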

しかし、音声を構成する音素系列の決定やその時間位置の分析は容易ではない。さまざまな文献で開示されている音声認識アルゴリズムを用いることで、肉声音声データを音素列に変換（書き起こし）して、各音素の時間位置を決定することはできる。しかし、現状の音声認識アルゴリズムでは書き起こしの認識精度はあまり高くない。そこで、（準）肉声音声選択部１０３から（準）肉声音声データとともに受け渡される文字列表現を音素列に変換し、その時間位置のみを音声認識アルゴリズムで決定する手法をとることができる。現状の音声認識アルゴリズムでも、音素列が確定してさえいればある程度は精度も高く各音素の時間位置の決定は可能である。このようなさまざまな音声信号処理アルゴリズムを用いることで、韻律情報抽出部１０４は図６に示すような韻律情報を肉声音声データに対して生成する。もちろん、（準）肉声音声格納部１０１に格納されている音声データリスト５００に対して、事前にこの韻律情報の抽出（さらには人手による修正）を行っておき、音声データとともに（準）肉声音声格納部１０１に格納しておく構成も可能である。 However, it is not easy to determine a phoneme sequence that constitutes speech and to analyze its time position. By using speech recognition algorithms disclosed in various documents, it is possible to convert (transcribe) real voice data into a phoneme string and determine the time position of each phoneme. However, the current speech recognition algorithm does not have very high recognition accuracy of transcription. Therefore, it is possible to take a technique of converting the character string representation transferred together with the (quasi) real voice data from the (quasi) real voice selection unit 103 into a phoneme string and determining only the time position by the speech recognition algorithm. Even with the current speech recognition algorithm, as long as the phoneme string is determined, the time position of each phoneme can be determined with high accuracy to some extent. By using such various speech signal processing algorithms, the prosody information extraction unit 104 generates prosody information as shown in FIG. 6 for real voice data. Of course, this prosodic information is extracted (and further manually corrected) from the speech data list 500 stored in the (semi) voice voice storage unit 101, and the (semi) speech voice together with the speech data. A configuration for storing in the storage unit 101 is also possible.

韻律情報抽出部１０４から出力された図６に示す韻律情報６００は韻律生成部１０７に受け渡される。韻律生成部１０７は、入力解析部１０６から渡された入力テキストの解析結果３００をもとに合成音声の韻律情報４００を生成するが、本実施例の音声合成装置においては、この韻律情報抽出部１０４から受け渡された肉声音声データの韻律情報６００を用いて、合成音声の韻律情報４００の調整を行うことを特徴としている。この調整処理により、入力テキスト１１２をそのまま音声合成した場合の韻律（抑揚やリズムなど）が、指定された（準）肉声音声ＩＤ１１４と組み合わせられた場合にどう変化するかが解析され、最初に生成された韻律情報にその変化を施した調整結果を最終的な韻律情報４００として出力する。これにより、上記選択された肉声音声データとスムーズにつながるように韻律情報が調整されることになる。この韻律生成部１０７での韻律情報調整処理の詳細については後述する。 The prosody information 600 shown in FIG. 6 output from the prosody information extraction unit 104 is transferred to the prosody generation unit 107. The prosody generation unit 107 generates the prosody information 400 of the synthesized speech based on the input text analysis result 300 passed from the input analysis unit 106. In the speech synthesizer of the present embodiment, this prosody information extraction unit The prosody information 400 of synthesized speech is adjusted using the prosody information 600 of the real voice data passed from 104. This adjustment process analyzes how the prosody (intonation, rhythm, etc.) when the input text 112 is directly speech-synthesized is combined with the specified (quasi) real voice ID 114, and is generated first. An adjustment result obtained by changing the prosodic information is output as final prosodic information 400. As a result, the prosodic information is adjusted so as to be smoothly connected to the selected real voice data. Details of the prosody information adjustment processing in the prosody generation unit 107 will be described later.

声質情報抽出部１０５では、（準）肉声音声選択部１０３から出力された肉声音声データに対して声質情報の抽出処理を行う。声質情報とは音声データがどのような声色であるかを示す情報であり、例えばケプストラム係数やＦＦＴなどによる周波数分析係数などが該当する。この声質情報の構成に関してはさまざまな形態が考えられ、後述する別の実施例などでいくつかの実施形態について詳細に説明を行う。また、この声質情報抽出部１０５で抽出される声質情報には少なくとも、肉声音声データの両端付近の声質を示す短時間声質情報（両端それぞれに対して）と、肉声音声データ全域に対する声質を示す全域声質情報の２種類が存在する。前者の短時間声質情報としては、肉声音声データの両端それぞれに対して、ある時間長（例えば２０ｍｓ）の短時間音声データ（フレームと呼ぶ）を切り出し、その音声に対して上述の声質分析処理を施すことができる。一方、後者の全域声質情報の構成方法にはさまざまなものが考えられる。例えば、最も単純には肉声音声データ全体に対して上述の声質分析処理を施す方法がある。後述する別の実施例ではこの他の構成についても説明する。ここの説明では、この声質情報抽出部１０５から、短時間声質情報として肉声音声データ両端の短時間フレームケプストラム係数ベクトルが、全域声質情報として肉声音声データ全体に対するケプストラム係数ベクトルが出力されるとして説明を続ける。 The voice quality information extraction unit 105 performs voice quality information extraction processing on the real voice data output from the (quasi) real voice selection unit 103. The voice quality information is information indicating the voice color of the voice data, and corresponds to, for example, a cepstrum coefficient, a frequency analysis coefficient by FFT, or the like. Various configurations can be considered for the configuration of the voice quality information, and some embodiments will be described in detail in another example described later. The voice quality information extracted by the voice quality information extraction unit 105 includes at least short-time voice quality information (for both ends) indicating the voice quality near both ends of the real voice data, and the entire area indicating the voice quality for the entire real voice data. There are two types of voice quality information. As the short time voice quality information of the former, short time voice data (referred to as a frame) having a certain length (for example, 20 ms) is cut out from both ends of the real voice data, and the above voice quality analysis process is performed on the voice. Can be applied. On the other hand, there are various conceivable methods for configuring the latter global voice quality information. 
For example, the simplest method is to apply the voice quality analysis described above to the entire real voice data. Other configurations are described in other embodiments below. For the remainder of this description, the voice quality information extraction unit 105 is assumed to output the short-time frame cepstrum coefficient vectors at both ends of the real voice data as the short-time voice quality information, and the cepstrum coefficient vector for the entire real voice data as the global voice quality information.

声質考慮波形合成部１１０では、声質情報抽出部１０５から出力される声質情報（短時間および全域）を受け取り、それを用いた上で合成音声の波形合成処理を行う。すでに上述したように、この波形合成処理では、入力テキストを構成する各音素（音節）に対応する素片データを素片ＤＢ１０９から取り出し、接続コストと呼ばれる素片データ間の音の近さを示す値が素片系列全体で最小となるような素片系列（最適素片系列）を探索する。この接続コストの計算において、前記声質情報（短時間および全域）を勘案することで、指定された（準）肉声音声データと声質がスムーズにつながるような合成音声を生成できるようになる。ここでの処理の詳細は後述する。 The voice quality-considered waveform synthesis unit 110 receives voice quality information (short time and whole area) output from the voice quality information extraction unit 105 and uses it to perform waveform synthesis processing of synthesized speech. As already described above, in this waveform synthesis process, the segment data corresponding to each phoneme (syllable) constituting the input text is extracted from the segment DB 109, and the closeness of the sound between the segment data called connection cost is shown. Search for a unit sequence (optimal unit sequence) whose value is minimum in the entire unit sequence. In calculating the connection cost, by considering the voice quality information (short time and whole area), it is possible to generate a synthesized voice that smoothly connects the designated (quasi-) real voice data and the voice quality. Details of the processing here will be described later.
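The search for the segment sequence with the smallest total cost can be sketched as a generic dynamic-programming search. This is only an illustration of the general unit-selection idea: the function name `select_units` and the two cost callbacks are hypothetical, and the way the voice quality information enters the connection cost is left to the caller, since the text defers those details.

```python
def select_units(candidates, target_cost, join_cost):
    """Return (total_cost, units) for the cheapest sequence, one unit per phoneme.

    candidates  : list (one entry per phoneme) of lists of candidate units
    target_cost : function (phoneme_index, unit) -> cost
    join_cost   : function (previous_unit, unit) -> connection cost
    """
    # best[j] holds the cheapest (cost, path) that ends in candidate j of the current phoneme.
    best = [(target_cost(0, u), [u]) for u in candidates[0]]
    for i in range(1, len(candidates)):
        new_best = []
        for u in candidates[i]:
            # Cheapest way to reach `u` from any candidate of the previous phoneme.
            cost, path = min(
                ((prev_cost + join_cost(prev_path[-1], u), prev_path)
                 for prev_cost, prev_path in best),
                key=lambda item: item[0],
            )
            new_best.append((cost + target_cost(i, u), path + [u]))
        best = new_best
    return min(best, key=lambda item: item[0])
```

For example, with units represented only by a pitch value and a connection cost equal to the pitch jump, `select_units([[100, 200], [150, 210], [205, 90]], lambda i, u: 0, lambda a, b: abs(a - b))` picks the sequence 200, 210, 205.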

以上で、図１Ａに示す本実施例による音声合成装置の各機能処理の概要を説明した。この処理の結果、指定された入力テキスト１１２を音声合成した合成音声を、同じく指定された（準）肉声音声ＩＤに対応する肉声音声データと置換・接続して生成することが可能となり、その合成音声は肉声音声データ部分の声質や韻律をもとに調整されることで、声質や韻律がスムーズに肉声音声データとつながるような合成音声にすることが可能となる。具体例で示すと、「今日は晴れかなぁ」という入力テキスト、および「かなぁ」に対応する肉声音声データとして図５のＩＤ＝２の音声を指定した場合、出力される合成音声は「かなぁ」の部分に図５の（準）肉声音声データを使用し、その前の部分の「今日は晴れ」に相当する合成音声は、末尾の「かなぁ」の肉声音声データの声質や韻律に合うように調整されたものとなる。結果、末尾の肉声「かなぁ」と韻律や声質が合った（近い）「今日は晴れかなぁ」の合成音声が出力されることとなる。 The above has outlined each functional process of the speech synthesizer according to this embodiment shown in FIG. 1A. As a result of this processing, synthesized speech for the specified input text 112 can be generated with part of it replaced by, and connected to, the real voice data corresponding to the specified (quasi) real voice ID. Because the synthesized speech is adjusted based on the voice quality and prosody of the real voice data portion, its voice quality and prosody connect smoothly to the real voice data. As a concrete example, suppose the input text is 「今日は晴れかなぁ」 ("I wonder if it will be sunny today") and the speech with ID = 2 in FIG. 5 is specified as the real voice data for 「かなぁ」 ("I wonder"). The output then uses the (quasi) real voice data of FIG. 5 for the 「かなぁ」 portion, and the synthesized speech for the preceding portion 「今日は晴れ」 ("sunny today") is adjusted to match the voice quality and prosody of the sentence-final real voice 「かなぁ」. As a result, synthesized speech for 「今日は晴れかなぁ」 is output whose prosody and voice quality match (are close to) the real voice 「かなぁ」 at the end.

続いて、本実施例の音声合成装置で特徴的な部分についてその処理フローも含めてより詳細に説明する。
図７、図８に示すフローチャートを用いて、韻律生成部１０７の詳細な処理の流れの一例を説明する。 Next, the characteristic part of the speech synthesizer of the present embodiment will be described in detail including the processing flow.
An example of the detailed processing flow of the prosody generation unit 107 will be described using the flowcharts shown in FIGS. 7 and 8.

韻律生成処理がスタートするとまず、入力解析部１０６から出力された言語解析結果３００を受け取り、その言語解析結果をもとに入力テキストを構成する各音素（音節）の韻律情報（長さ、高さ、強さ）を生成する。この韻律情報生成処理７０３は、現在利用されているさまざまな音声合成プログラムで実施されている韻律情報生成処理をそのまま利用することができる。そこでは例えば図８のフローチャートに示す処理が実施されている。 When the prosodic generation process starts, first, the language analysis result 300 output from the input analysis unit 106 is received, and the prosodic information (length, height) of each phoneme (syllable) constituting the input text based on the language analysis result. , Strength). This prosodic information generation processing 703 can directly use prosodic information generation processing implemented in various speech synthesis programs currently used. For example, the process shown in the flowchart of FIG. 8 is performed.

図８において、言語解析結果３００が入力されると各単語の発音を音素列（または音節列）へと分解する（ステップ８０２）。この結果、図４の「音素」列に示すような音素記号の系列へと変換される。 In FIG. 8, when the language analysis result 300 is input, the pronunciation of each word is decomposed into a phoneme string (or syllable string) (step 802). As a result, it is converted into a sequence of phoneme symbols such as that shown in the "phoneme" column of FIG. 4.

つづいて、各音素に対して環境情報ベクトルを生成する（ステップ８０３）。環境情報ベクトルとは、言語解析によって解析された情報３００をもとに、それぞれの位置の音素が持っているさまざまな言語環境情報をベクトル形式で抽出したものである。そのベクトルに含まれる言語環境情報には例えば、その音素が子音であるか／母音であるか（０または１）、その音素が含まれる単語（アクセント句）の音節数（モーラ数）、その音素が含まれる単語の品詞、その音素の直前の音素が子音であるか／母音であるか（０または１）、その音素の直後の音素が子音であるか／母音であるか（０または１）、その音素の直前に単語（アクセント句）の境界が存在するか／しないか（０または１）、その音素の直後に単語（アクセント句）の境界が存在するか／しないか（０または１）などがありえる。 Next, an environment information vector is generated for each phoneme (step 803). The environment information vector represents, in vector form, the various items of linguistic context that the phoneme at each position has, extracted from the information 300 obtained by language analysis. The linguistic context in the vector can include, for example: whether the phoneme is a consonant or a vowel (0 or 1); the number of syllables (morae) in the word (accent phrase) containing the phoneme; the part of speech of the word containing the phoneme; whether the immediately preceding phoneme is a consonant or a vowel (0 or 1); whether the immediately following phoneme is a consonant or a vowel (0 or 1); whether a word (accent phrase) boundary exists immediately before the phoneme (0 or 1); and whether a word (accent phrase) boundary exists immediately after the phoneme (0 or 1).
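As an illustration of step 803, the following Python sketch builds one such vector for a phoneme. The element order, the romanized vowel set, the numeric part-of-speech code `pos_id`, and the choice of 0 for a missing neighbor at the sentence edge are assumptions for this example, not details given in the text.

```python
VOWELS = {"a", "i", "u", "e", "o"}  # assumed romanized vowel symbols

def is_vowel(phoneme):
    """1 for a vowel, 0 for a consonant."""
    return 1 if phoneme.lower() in VOWELS else 0

def environment_vector(phonemes, index, mora_count, pos_id, boundary_before, boundary_after):
    """Build the environment information vector for phonemes[index]."""
    # Neighbors outside the utterance are encoded as 0 (an assumption of this sketch).
    prev_flag = is_vowel(phonemes[index - 1]) if index > 0 else 0
    next_flag = is_vowel(phonemes[index + 1]) if index + 1 < len(phonemes) else 0
    return [
        is_vowel(phonemes[index]),  # consonant (0) or vowel (1)
        mora_count,                 # morae in the containing word (accent phrase)
        pos_id,                     # part of speech as a numeric code
        prev_flag,                  # preceding phoneme: consonant (0) or vowel (1)
        next_flag,                  # following phoneme: consonant (0) or vowel (1)
        int(boundary_before),       # word boundary immediately before (0 or 1)
        int(boundary_after),        # word boundary immediately after (0 or 1)
    ]
```

For the phoneme sequence `["KY", "O", "O", "W", "A"]`, calling `environment_vector(phonemes, 1, 2, 3, False, False)` gives `[1, 2, 3, 0, 1, 0, 0]`.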

つづいて、上記生成された環境情報ベクトルを参照して、入力テキストを構成する各音素の継続長の決定処理が行われる（ステップ８０４）。この決定アルゴリズムには、例えば決定木を用いる手法、数量化Ｉ類や重回帰分析による回帰直線で予測する手法などさまざまな実施例が存在する。決定木による継続長決定手法を用いる場合は、事前に大量の音声データを用いて機械学習された決定木データを参照する。決定木データには、木のルートノード（始点）から出発して、木の各ノードに設定されている質問（環境情報ベクトルの第Ｎ要素が値ｋよりも大きいか／小さいか等）を、前記抽出された環境情報ベクトルを用いて分岐していき、最終的に到達した葉ノードに設定されている継続長値（ｍｓ）を、その環境情報ベクトルを持つ音素の継続長として出力する。数量化Ｉ類や重回帰直線による予測の場合も同様に、各音素の環境情報ベクトルと、ある重み係数ベクトルとの間で内積計算することでその音素の継続長を決定する。このような音素ごとの継続長決定処理を、入力テキストを構成するすべての音素に対して実行することで、入力テキスト全体の継続長値（ｍｓ）が決定される。 Subsequently, referring to the generated environment information vector, a continuation length determination process for each phoneme constituting the input text is performed (step 804). There are various examples of this determination algorithm, for example, a method using a decision tree, a method of predicting with a regression line by quantification type I or multiple regression analysis. When the continuation length determination method using a decision tree is used, decision tree data that has been machine-learned using a large amount of speech data in advance is referred to. In the decision tree data, starting from the root node (starting point) of the tree, questions set to each node of the tree (whether the Nth element of the environment information vector is larger / smaller than the value k, etc.) Branching is performed using the extracted environment information vector, and the continuation length value (ms) set for the finally reached leaf node is output as the continuation length of the phoneme having the environment information vector. Similarly, in the case of prediction based on the quantification type I or the multiple regression line, the continuation length of the phoneme is determined by calculating the inner product between the environment information vector of each phoneme and a certain weight coefficient vector. 
Executing this per-phoneme continuation length (duration) determination process for every phoneme constituting the input text determines the continuation length values (ms) for the entire input text.
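The two prediction styles described for step 804 can be sketched as follows. The dictionary layout of the tree nodes, the function names, and the optional `bias` term are assumptions for illustration; the text specifies only that each node asks a threshold question about one vector element and that the regression case is an inner product with a weight vector.

```python
def tree_duration(node, env):
    """Walk a decision tree from the root to a leaf and return the leaf's duration (ms).

    Internal nodes: {"index": N, "threshold": k, "yes": node, "no": node}
    Leaf nodes:     {"duration_ms": value}
    """
    while "duration_ms" not in node:
        # Question at this node: is element N of the environment vector larger than k?
        branch = "yes" if env[node["index"]] > node["threshold"] else "no"
        node = node[branch]
    return node["duration_ms"]

def linear_duration(env, weights, bias=0.0):
    """Regression-style prediction: inner product of the vector with a weight vector."""
    return bias + sum(w * x for w, x in zip(weights, env))
```

A toy tree such as `{"index": 0, "threshold": 0.5, "yes": {"duration_ms": 90.0}, "no": {"duration_ms": 55.0}}` assigns 90 ms to vowels and 55 ms to consonants when element 0 is the vowel flag; real trees would be trained on a large speech corpus as the text states.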

つづいて、各音素の音高（Ｈｚ）の決定処理を実施する（ステップ８０５）。こちらも同様に、決定木や重回帰予測などの手法を用いて、前記抽出された環境情報ベクトルから各音素の音の高さ（Ｈｚ）を決定する。最後に、ステップ８０６で音素ごとの強さ（ｄＢ）を決定する。ここの処理も上記と同様の手法を用いることができる。 Subsequently, a process for determining the pitch (Hz) of each phoneme is performed (step 805). Similarly, the pitch (Hz) of each phoneme is determined from the extracted environment information vector by using a method such as a decision tree or multiple regression prediction. Finally, in step 806, the strength (dB) for each phoneme is determined. The same method as described above can be used for the processing here.

上記の説明では、継続長の決定処理（ステップ８０４）を行ってから音高の決定処理（ステップ８０５）、強さの決定処理（ステップ８０６）を実施する形態で説明したが、環境情報ベクトルに含まれる情報の種類によっては、先に音高の決定を行ってから継続長を決定するという実施形態や、継続長と音高・強さとを同時に決定する実施形態も考えられる。この順番は本実施例の実施に対しては重要ではない。 In the above description, the continuation length determination process (step 804) is performed, and then the pitch determination process (step 805) and the strength determination process (step 806) are performed. Depending on the type of information included, an embodiment in which the duration is determined after the pitch is determined first, or an embodiment in which the duration and the pitch / strength are determined at the same time are also conceivable. This order is not important for the implementation of this embodiment.

以上の図８の処理の流れにより、図７のフローチャートの韻律情報生成処理（ステップ７０３）は実現できる。この結果、韻律情報生成処理（ステップ７０３）から図４に示すような音素ごとの韻律情報列が出力される。 The prosodic information generation process (step 703) in the flowchart of FIG. 7 can be realized by the process flow of FIG. As a result, the prosodic information sequence for each phoneme as shown in FIG. 4 is output from the prosodic information generation process (step 703).

韻律情報生成処理（ステップ７０３）につづいて、韻律情報抽出部１０４で抽出された（準）肉声音声データの韻律情報６００を受け取り、その情報をもとに、前記ステップ７０３の韻律情報生成処理で生成された韻律情報４００の調整処理が実施される。この調整処理は、（準）肉声音声データが持つ韻律（全体的な話速、音の高さ、強さなど）に合うように、前記生成された入力テキストの韻律情報４００を調整する処理である。 Following the prosody information generation process (step 703), the prosody information 600 of the (quasi) real voice data extracted by the prosody information extraction unit 104 is received and, based on it, the prosody information 400 generated in step 703 is adjusted. This adjustment process adjusts the generated prosody information 400 of the input text so that it matches the prosody (overall speech rate, pitch, intensity, and so on) of the (quasi) real voice data.

図７に示すフローチャートでは、まず継続長の調整処理（ステップ７０５）を実行する。調整の仕方にはさまざまな方法が考えられるが、もっとも簡単には文全体（または息継ぎフレーズ単位など）での平均継続長が、図６に示す（準）肉声音声データの韻律情報６００の平均継続長と近くなるように個々の音素の継続長値を増減させる手法がありえる。ここでの増減は、日本語音声においては話速の違いが母音音素に大きく影響するという知見から母音音素の継続長に対してのみ増減処理を行うという方法を採りえる。もちろん、話速変化が子音音素に及ぼす影響と母音音素に及ぼす影響とをそれぞれ定量化して、各々切り替えて増減処理を施すという手法も考えられる。さらには、その（準）肉声音声データを切り出した元となる文音声の対応区間（例えば、「かなぁ」に対応する（準）肉声音声データが元々、「そろそろ終わりかなぁ」という肉声音声から切り出されたものであった場合、「かなぁ」の直前の部分「そろそろ終わり」に相当する部分）の平均継続長を図５に示す（準）肉声音声格納部１０１に付加情報として格納しておき、その値を用いて調整処理を行うという手法もありえる。 In the flowchart shown in FIG. 7, the continuation length (duration) adjustment process (step 705) is executed first. Various adjustment methods are conceivable. The simplest is to increase or decrease the duration of individual phonemes so that the average duration over the whole sentence (or per breath group, for example) comes close to the average duration in the prosody information 600 of the (quasi) real voice data shown in FIG. 6. Based on the finding that, in Japanese speech, differences in speech rate mainly affect vowel phonemes, the increase or decrease can be applied only to the durations of vowel phonemes. Of course, it is also conceivable to quantify separately the effect of a speech-rate change on consonant phonemes and on vowel phonemes and to apply a different increase or decrease to each. A further possibility is to store, as additional information in the (quasi) real voice storage unit 101 shown in FIG. 5, the average duration of the corresponding section of the sentence utterance from which the (quasi) real voice data was cut out, and to use that value for the adjustment. For example, if the (quasi) real voice data for 「かなぁ」 was originally cut out of a real-voice utterance of 「そろそろ終わりかなぁ」, the corresponding section is the portion 「そろそろ終わり」 immediately before 「かなぁ」.
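The simplest variant of the duration adjustment in step 705, scaling only vowel durations so that the mean phoneme duration approaches that of the (quasi) real voice data, might look like the sketch below. The function name, the romanized vowel set, and the `min_scale` floor that keeps vowels from collapsing are assumptions for illustration.

```python
VOWELS = {"a", "i", "u", "e", "o"}  # assumed romanized vowel symbols

def adjust_durations(phonemes, durations_ms, target_mean_ms, min_scale=0.5):
    """Scale vowel durations so the mean phoneme duration approaches target_mean_ms."""
    vowel_total = sum(d for p, d in zip(phonemes, durations_ms) if p.lower() in VOWELS)
    if vowel_total == 0:
        return list(durations_ms)  # nothing to scale
    consonant_total = sum(durations_ms) - vowel_total
    # Total vowel time needed for the whole utterance to hit the target mean.
    wanted_vowel_total = target_mean_ms * len(durations_ms) - consonant_total
    # The floor prevents vowels from shrinking below min_scale of their original length.
    scale = max(wanted_vowel_total / vowel_total, min_scale)
    return [d * scale if p.lower() in VOWELS else d
            for p, d in zip(phonemes, durations_ms)]
```

Here `target_mean_ms` would be the average duration taken from the prosody information 600, or the stored average of the original utterance section when that variant is used.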

つづいて、全体ピッチの調整処理（ステップ７０６）を実行する。このピッチ（音高）の調整処理でも同様に、文全体（または息継ぎフレーズ単位など）での音高値が、当該（準）肉声音声データの始点の音高とスムーズにつながるように調整処理が行われる。この調整処理もさまざまな実施形態が考えられるが、（準）肉声音声データと接続される端点に対応する音素箇所にて音高が一致するような調整が必要である（図９ａ）。これを一致させるには、図９ｂにあるように、ステップ７０３で生成された入力テキスト全体に対する音高曲線を上下に平行移動して、接続端点の音素箇所にて音高が一致するようにすれば良い。この平行移動によって、先頭音素などの音高がある一定値よりも低くなった場合には、さらに図９ｃのように接続端点を回転の中心点として音高曲線を回転させて音高が低くなりすぎない／高くなりすぎないようにすることができる。さらに、より適切には図９ｄのように音高曲線の始点位置と接続端点位置の高さを固定した上で曲線の上下の幅を拡大・縮小することで音高変化を調整することもできる。この場合の拡大・縮小割合は、（準）肉声音声データ部分の音高変化幅の値をもとに適切な予測手法（回帰予測、決定木等による予測など）を用いて決定する。また、前述のように、この肉声音声データが切り出された元の文音声の対応区間の音高変化幅の値をもとに予測する手法も考えられる。 Next, the overall pitch adjustment process (step 706) is executed. Here too, the pitch values over the whole sentence (or per breath group, for example) are adjusted so that they connect smoothly to the pitch at the start point of the (quasi) real voice data. Various embodiments of this adjustment are conceivable, but the pitch must be made to coincide at the phoneme position corresponding to the end point that connects to the (quasi) real voice data (FIG. 9a). To achieve this, as shown in FIG. 9b, the pitch curve for the entire input text generated in step 703 can be translated up or down so that the pitch coincides at the phoneme position of the connection end point. If this translation makes the pitch of the first phoneme or another phoneme fall below a certain value, the pitch curve can additionally be rotated about the connection end point, as in FIG. 9c, so that the pitch does not become too low or too high. More appropriately still, as in FIG. 9d, the pitch movement can be adjusted by fixing the heights of the start point and the connection end point of the pitch curve and expanding or shrinking the vertical range of the curve. The expansion or reduction ratio in this case is determined with a suitable prediction method (regression, decision tree, and so on) based on the pitch range of the (quasi) real voice data portion. As mentioned above, it is also conceivable to predict it from the pitch range of the corresponding section of the original sentence utterance from which the real voice data was cut out.
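The three pitch-curve operations of FIGS. 9b to 9d can be sketched on a per-phoneme list of pitch values whose last element is the connection end point. This is an interpretation rather than the patented procedure: the function names are hypothetical, the "rotation" is approximated by a linear tilt that leaves the end point fixed, and the scaling ratio is passed in directly instead of being predicted.

```python
def shift_to_match(curve_hz, target_end_hz):
    """FIG. 9b: translate the whole curve so its last point equals target_end_hz."""
    delta = target_end_hz - curve_hz[-1]
    return [f + delta for f in curve_hz]

def tilt_about_end(curve_hz, min_start_hz):
    """FIG. 9c (approximation): raise the start to min_start_hz, keeping the end point fixed."""
    n = len(curve_hz) - 1
    if n == 0 or curve_hz[0] >= min_start_hz:
        return list(curve_hz)
    lift = min_start_hz - curve_hz[0]
    # The added offset falls linearly from `lift` at the start to 0 at the connection end point.
    return [f + lift * (n - i) / n for i, f in enumerate(curve_hz)]

def scale_range(curve_hz, ratio):
    """FIG. 9d: expand or shrink the curve around the line joining its fixed end points."""
    n = len(curve_hz) - 1
    if n == 0:
        return list(curve_hz)
    start, end = curve_hz[0], curve_hz[-1]
    result = []
    for i, f in enumerate(curve_hz):
        baseline = start + (end - start) * i / n  # straight line from start to end
        result.append(baseline + ratio * (f - baseline))
    return result
```

For example, `shift_to_match([120, 150, 130, 110], 140)` returns `[150, 180, 160, 140]`, so the synthesized curve ends at the 140 Hz start pitch assumed for the real voice segment.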

最後に、上記のようにして継続長や音高（さらには図７のフローチャートでは示さなかったが音の強さ）を調整した結果として、図１０に示した韻律情報１０００を出力する。 Finally, the prosody information 1000 shown in FIG. 10 is output as the result of adjusting the duration and pitch (and also the intensity, although this is not shown in the flowchart of FIG. 7) as described above.

このように図７に示すフローチャートの処理を実行することで、韻律生成部１０７は、指定された（準）肉声音声ＩＤの（準）肉声音声データとスムーズに韻律がつながる合成音声の韻律情報を生成することが可能となる。 By executing the processing of the flowchart shown in FIG. 7 as described above, the prosody generation unit 107 generates the prosody information of the synthesized speech that smoothly connects the prosody with the (semi) real voice data of the specified (semi) real voice ID. Can be generated.

本実施例の音声合成装置で特徴的な部分として、次に声質情報抽出部１０５の処理についてのフローチャートを示す図１１を用いて説明する。 Next, as another characteristic part of the speech synthesizer of this embodiment, the processing of the voice quality information extraction unit 105 will be described with reference to FIG. 11, which shows its flowchart.

図１１に示すフローチャートによる声質情報抽出処理は、上記で既に説明した声質情報抽出部の処理内容とはまた違った実施形態となっている。この声質情報抽出処理では、（準）肉声音声選択部１０３から出力された（準）肉声音声データに対して、その始端と終端における短時間声質情報を計算するのに加えて、その（準）肉声音声データ全体での全域声質情報の計算を行う。ここでの説明では声質情報の具体的な計算アルゴリズムとしてケプストラム係数ベクトル（例えば１６次元）を用いることとする。さらに全域声質情報の構成方法としては、その（準）肉声音声データにおいて声色が特徴的な（全体の声質印象に対して支配的な）時刻位置を事前に複数設定しておき、それぞれの時刻位置での短時間声質情報（ケプストラム係数ベクトル）に指定された重み係数をかけ合わせて、すべての時刻位置での短時間声質情報を加重加算した結果の声質情報（ケプストラム係数ベクトル）を全域声質情報として出力する。 The voice quality information extraction processing according to the flowchart shown in FIG. 11 is an embodiment different from the processing content of the voice quality information extraction unit already described above. In this voice quality information extraction process, in addition to calculating short-time voice quality information at the start and end of (quasi) real voice data output from the (quasi) real voice selection unit 103, Calculates voice quality information for all voice data. In this description, a cepstrum coefficient vector (for example, 16 dimensions) is used as a specific algorithm for calculating voice quality information. Furthermore, as a method of configuring the global voice quality information, a plurality of time positions that are characteristic of the voice color (dominant to the overall voice quality impression) in the (quasi) real voice data are set in advance, and each time position is set. Voice quality information (cepstrum coefficient vector) as a result of weighted addition of short-time voice quality information at all time positions by multiplying the short-time voice quality information (cepstrum coefficient vector) with the specified weighting coefficient as global voice quality information Output.

以下、フローチャート図１１を用いて、声質情報抽出部１０５の処理について説明する。 Hereinafter, the processing of the voice quality information extraction unit 105 will be described with reference to the flowchart of FIG.

まず処理がスタートすると、ステップ１１０２にて（準）肉声音声選択部１０３から出力された（準）肉声音声データを受け取る。つづいて、ステップ１１０３では、その（準）肉声音声データに対して短時間声質情報を計算する。短時間声質情報はその（準）肉声音声データの始端位置と終端位置の２点において計算される。具体的には始端位置と終端位置（実際にはそれぞれ、無音部をとばして音声パワーがある閾値以上となった箇所）での短時間音声フレーム（例えば２０ｍｓ）を切り出し、その音声フレームに対してケプストラム分析を実施する。この結果、短時間声質情報として、始端・終端に対応する２個の声質情報（ケプストラム係数ベクトル）が出力される。 First, when the process starts, in step 1102, (quasi) real voice data output from the (quasi) real voice selection unit 103 is received. In step 1103, short-time voice quality information is calculated for the (quasi) real voice data. The short-time voice quality information is calculated at two points of the start position and the end position of the (quasi) real voice data. Specifically, a short-time audio frame (for example, 20 ms) is cut out at the start position and the end position (actually, the silent portion is skipped and the sound power exceeds a certain threshold), and the audio frame is extracted. Perform cepstrum analysis. As a result, two pieces of voice quality information (cepstrum coefficient vectors) corresponding to the start and end points are output as short time voice quality information.
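The frame extraction in step 1103 (skip leading and trailing silence, then cut a short frame at each end) can be sketched as below. The function name, the frame-by-frame scan, and the default power threshold are assumptions; the cepstrum analysis itself is not shown and would be applied to the two returned frames.

```python
def edge_frames(samples, sample_rate, frame_ms=20, power_threshold=1e-4):
    """Return (head_frame, tail_frame): the first and last frames whose power reaches the threshold."""
    frame_len = int(sample_rate * frame_ms / 1000)

    def power(start):
        frame = samples[start:start + frame_len]
        return sum(x * x for x in frame) / frame_len

    # Scan forward from the beginning, one frame at a time, skipping silent frames.
    forward = range(0, len(samples) - frame_len + 1, frame_len)
    head = next((s for s in forward if power(s) >= power_threshold), None)
    # Scan backward from the end in the same way.
    backward = range(len(samples) - frame_len, -1, -frame_len)
    tail = next((s for s in backward if power(s) >= power_threshold), None)
    if head is None or tail is None:
        raise ValueError("no frame reaches the power threshold")
    return samples[head:head + frame_len], samples[tail:tail + frame_len]
```

A finer scan (for example, sample by sample) would locate the speech onset more precisely at the price of more computation.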

つづいて、全域声質情報の計算処理ステップを実施する。 Subsequently, a calculation process step for global voice quality information is performed.

まず、ステップ１１０６にて出力される全域声質情報ベクトルＶを初期化し、つづいてステップ１１０７にて、当該（準）肉声音声データに対する声質特徴点の数を取得する。ここで、声質特徴点とはその（準）肉声音声データにおいて声質（音色）が特徴的な時刻位置をあらわす。（準）肉声音声格納部１０１に格納されている（準）肉声音声データそれぞれに対して、図１２の情報１２００に示すように特徴点の個数、および各特徴点位置の時刻と重みを追加して格納しておく。ステップ１１０７では、（準）肉声音声選択部１０３から（準）肉声音声データとあわせて対応する特徴点の関連情報が受け渡されてあり、そこから特徴点の個数を取得することとなる。 First, the global voice quality information vector V output in step 1106 is initialized, and in step 1107, the number of voice quality feature points for the (quasi) real voice data is acquired. Here, the voice quality feature point represents a time position where the voice quality (timbre) is characteristic in the (quasi) real voice data. For each (semi) voice voice data stored in the (semi) voice voice storage unit 101, the number of feature points and the time and weight of each feature point position are added as shown in the information 1200 of FIG. And store it. In step 1107, the related information of the corresponding feature points is transferred from the (quasi) real voice selection unit 103 together with the (quasi) real voice data, and the number of feature points is acquired therefrom.

ステップ１１０８もステップ１１０７と同様に、すでに受け渡されている特徴点関連情報（各特徴点の時刻Ｔｉと重みＷｉ（１≦ｉ≦Ｎ））を取得する。 Similarly to step 1107, step 1108 also acquires feature point related information (time Ti and weight Wi (1 ≦ i ≦ N) of each feature point) that has already been delivered.

つづくステップ１１１０からステップ１１１３までは、変数ｉを１からＮまで増加させながら繰り返して実行される。ステップ１１１０は、（準）肉声音声データからｉ番目の特徴点時刻における短時間音声フレーム（例えば２０ｍｓ）の切り出し処理を実施する。つづくステップ１１１１でその短時間声質情報（ケプストラム係数ベクトル）を計算し、その値をこの特徴点に対する重みＷｉをかけあわせた上で出力変数Ｖに加算していく。 The subsequent steps 1110 to 1113 are repeatedly executed while increasing the variable i from 1 to N. In step 1110, a short time voice frame (for example, 20 ms) is cut out from the (quasi) real voice data at the i-th feature point time. In step 1111, the short-time voice quality information (cepstrum coefficient vector) is calculated, and the value is multiplied by the weight Wi for this feature point and added to the output variable V.

これらのステップを繰り返すことで、当該（準）肉声音声データに指定されている特徴点すべてにおける短時間声質情報を重み加算した結果となる全域声質情報を計算することができる。 By repeating these steps, it is possible to calculate global voice quality information that is the result of weighted addition of short-time voice quality information at all feature points specified in the (quasi) real voice data.
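The loop of steps 1106 to 1113 amounts to the weighted sum V = W1·C(T1) + ... + WN·C(TN), where C(Ti) is the short-time voice quality vector of the frame at feature point time Ti. A minimal sketch follows; the function name is hypothetical, the frame is assumed to start at Ti, and the short-time analysis (for example a 16-dimensional cepstrum) is supplied by the caller as `analyze`.

```python
def global_voice_quality(samples, sample_rate, feature_points, analyze, frame_ms=20):
    """Weighted sum of short-time voice quality vectors taken at the feature points.

    feature_points : list of (time_ms, weight) pairs, i.e. (Ti, Wi)
    analyze        : function frame -> list of floats (e.g. cepstrum coefficients)
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    total = None  # the output vector V, created once the dimension is known
    for time_ms, weight in feature_points:
        start = int(sample_rate * time_ms / 1000)
        frame = samples[start:start + frame_len]   # step 1110: cut out the short-time frame
        vector = analyze(frame)                    # step 1111: short-time voice quality
        if total is None:
            total = [0.0] * len(vector)            # step 1106: initialize V
        total = [v + weight * c for v, c in zip(total, vector)]
    return total
```

If the weights sum to 1, the result is a weighted average of the voice quality at the characteristic positions.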

そして、最後にステップ１１０４で、前記計算された始端・終端の短時間声質情報、および全域声質情報（それぞれ１６次元のケプストラム係数ベクトル）が声質抽出部１０５から出力される。 Finally, in step 1104, the calculated short-term voice quality information of the start and end points and global voice quality information (each 16-dimensional cepstrum coefficient vector) are output from the voice quality extraction unit 105.

また、図２１に示すように、本実施例の全域声質情報の計算方法の別の構成を採ることもできる。図２１の例は、（準）肉声音声データに対して、同図に示すように、分割なし（＝分割１）、２分割、３分割、…、ｋ分割と音声データ区間の分割を増やしていきながら、それぞれの分割区間で短時間声質情報を計算する。分割数ｉではｉ個の短時間声質情報が出力されるが、この中から全体を代表する代表声質情報を決定する。代表点の算出方法には、統計手法である平均値を用いる方法、メディアン（中央値）を用いる方法、さらにはクラスタリング手法による中心クラスタ（セントロイド）を用いる方法などが考えられる。こうして、ｋ個の代表声質情報が計算されると、それら各分割の代表声質情報を纏めて全域声質情報ベクトルであるｋ次元全域声質情報として出力する。 As shown in FIG. 21, a different configuration can also be adopted for calculating the global voice quality information in this embodiment. In the example of FIG. 21, the (quasi) real voice data is divided into progressively more sections: no division (= 1 division), 2 divisions, 3 divisions, ..., k divisions, and short-time voice quality information is calculated for each section of each division. For division count i, i pieces of short-time voice quality information are obtained, and from these a single representative voice quality that stands for the whole is determined. The representative can be computed with a statistical measure such as the mean or the median, or as the central cluster (centroid) obtained by a clustering method. Once the k representative voice quality values have been calculated in this way, the representatives for all the division counts are gathered and output as k-dimensional global voice quality information, which serves as the global voice quality information vector.
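The multi-division scheme of FIG. 21 can be sketched as follows, using the element-wise mean as the representative. The function names are hypothetical, sections are cut at equal sample counts as an assumption, and `analyze` again stands for whatever voice quality analysis is used; the median or a cluster centroid could replace `mean_vector`.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def multi_division_voice_quality(samples, k, analyze):
    """Return k representative vectors, one for each division count 1..k."""
    representatives = []
    for parts in range(1, k + 1):
        # Section boundaries for splitting the data into `parts` equal sections.
        bounds = [len(samples) * j // parts for j in range(parts + 1)]
        vectors = [analyze(samples[bounds[j]:bounds[j + 1]]) for j in range(parts)]
        representatives.append(mean_vector(vectors))
    return representatives
```

The returned list of k representatives corresponds to the k-dimensional global voice quality information described above.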

次に、図１３のフローチャートを用いて、本実施例の音声合成装置で特徴的な部分として、声質考慮波形合成部１１０の処理について説明する。
本フローチャートは素片選択型音声合成（ＵｎｉｔＳｅｌｅｃｔｉｏｎＳｐｅｅｃｈＳｙｎｔｈｅｓｉｓ）技術の基本的な構成に対して、本実施例に特徴的ないくつかの処理を加えたものとなっている。まず、素片選択型音声合成技術の基本構成を図１３により説明したのち、本実施例の音声合成装置に特徴的なステップについてさらに追加して説明を行う。 Next, processing of the voice quality-considered waveform synthesis unit 110 will be described as a characteristic part of the speech synthesis apparatus according to the present embodiment with reference to the flowchart of FIG.
This flowchart is obtained by adding some processes characteristic to the present embodiment to the basic configuration of the unit selection type speech synthesis (Unit Selection Speech Synthesis) technique. First, the basic configuration of the unit selection type speech synthesis technology will be described with reference to FIG. 13, and then additional steps characteristic to the speech synthesis apparatus of this embodiment will be described.

図１３の左側のフローチャートは素片選択手法による波形合成処理の流れを示す。まずステップ１３０２では、読み上げ文を構成する音素（音節、または特別な単位）ごとに、その素片候補をすべて列挙する。波形合成部１１０には韻律生成部１０７から図１０に示すような韻律情報１０００が渡されている。この韻律情報の音素欄で指定された「ＫＹ」や「Ｏ」などの音素それぞれに対して、素片ＤＢ１０９の中には複数個の素片候補が存在している。ステップ１３０２では、この「ＫＹ」に対する素片候補集合、「Ｏ」に対する素片候補集合というように、読み上げ文を構成する素片単位ごとに素片候補集合を探索・列挙する。つづいて、ステップ１３０３では列挙された素片候補すべてに対して、ターゲットコストと呼ばれるコスト値を計算する。 The flowchart on the left side of FIG. 13 shows the flow of the waveform synthesis process using the segment selection method. First, in step 1302, all the segment candidates are listed for each phoneme (syllable or special unit) constituting the reading sentence. The prosody information 1000 as shown in FIG. 10 is passed from the prosody generation unit 107 to the waveform synthesis unit 110. For each phoneme such as “KY” and “O” specified in the phoneme field of the prosodic information, a plurality of segment candidates exist in the segment DB 109. In step 1302, a segment candidate set is searched and enumerated for each segment unit constituting the reading sentence, such as a segment candidate set for “KY” and a segment candidate set for “O”. Subsequently, in step 1303, cost values called target costs are calculated for all of the listed segment candidates.

The flowchart in the center of FIG. 13 details this cost calculation. The target cost is calculated from how far the length, pitch, and intensity of each segment candidate deviate from the length, pitch, and intensity specified for each phoneme in the prosody information 1000. Various formulas are possible here, but because the choice does not materially affect the essence of the apparatus of this embodiment, the details are omitted. After the cost values for length, pitch, and intensity have been calculated between the values set in the prosody information and those of the segment, the cost calculation for global voice quality (step 1315) is performed as a step characteristic of this embodiment. Here, a cost value is calculated between the global voice quality information of the (quasi) real voice data output from the voice quality information extraction unit 105 and the global voice quality information of the segment under evaluation. The global voice quality information of the (quasi) real voice data takes the form described above. For the global voice quality information of the segment, on the other hand, several calculation methods are conceivable.
In the simplest method, the segment data is treated as a piece of (quasi) real voice data, and the extraction method described above for the voice quality information extraction unit 105 is applied to it to obtain global voice quality information for the segment. A distance (for example, the Euclidean distance) can then be calculated between the two pieces of global voice quality information (for example, 16th-order LPC cepstrum coefficient vectors).

Alternatively, the consonant sections and vowel sections within the segment data can be distinguished, and the global voice quality extraction method described above can be applied to the vowel sections only. This is because vowel sections carry more of the information that determines the speaker's voice quality, and the voice quality of that speech section, than consonant sections do. More generally, a weight can be set for each type of phoneme that makes up the segment data, thereby setting each phoneme's contribution to the global voice quality information of the segment data as a whole.
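The phoneme-type weighting described above can be sketched as a weighted mean of per-section feature vectors. This is a minimal illustration, not the patent's implementation: the weight values, the class labels `"vowel"` and `"consonant"`, and the function name are assumptions chosen for the example.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical weights per phoneme class. The text only states that weights
# depend on phoneme type and that vowels carry more voice quality information.
PHONEME_CLASS_WEIGHTS: Dict[str, float] = {"vowel": 1.0, "consonant": 0.2}


def weighted_global_voice_quality(
    sections: Sequence[Tuple[str, Sequence[float]]],
) -> List[float]:
    """Return the weighted mean of per-section feature vectors.

    Each section is (phoneme_class, feature_vector), where the vector could
    be, for example, LPC cepstrum coefficients analyzed over that section.
    """
    if not sections:
        raise ValueError("at least one phoneme section is required")
    dim = len(sections[0][1])
    total = [0.0] * dim
    weight_sum = 0.0
    for phoneme_class, vector in sections:
        if len(vector) != dim:
            raise ValueError("all feature vectors must have the same length")
        # Unknown classes contribute nothing (an assumption of this sketch).
        w = PHONEME_CLASS_WEIGHTS.get(phoneme_class, 0.0)
        weight_sum += w
        for i, value in enumerate(vector):
            total[i] += w * value
    if weight_sum == 0.0:
        raise ValueError("all sections have zero weight")
    return [t / weight_sum for t in total]
```

Setting the consonant weight to 0 reduces this to the vowel-only variant mentioned in the text.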

As another way to calculate global voice quality information for segment data, the global voice quality information described above can be extracted by the voice quality information extraction unit 105 from a speech section of fixed length (for example, an accent phrase) that contains the segment in the original recording from which the segment was cut, and stored in the segment DB 109 in association with the segment data. In this case, when segment candidates are searched for and enumerated in step 1302, the global voice quality information associated with each segment must be retrieved as well.

When global voice quality information of the kind shown in FIG. 21 is adopted, the global voice quality information calculated here for the segment data must have a corresponding structure. One simple approach is to calculate k-dimensional global voice quality information for each piece of segment data using the k-way division described above, and to calculate the distance between it and that of the (quasi) real voice data. However, segment data is usually shorter than (quasi) real voice data, so simply using the same number of divisions k makes the analysis window of each division too short. To address this, the k-way global voice quality information can be calculated over the original recording from which the segment was cut, and once the length of a division falls to or below the length of the segment data, the division range is restricted to the segment data itself. In other words, when the division unit is shorter than the segment data, short-time voice quality information is calculated within the segment data; when the division unit is longer than the segment data, voice quality information is calculated over the original recording from which the segment was cut.

The method described so far ignores the difference in length between the (quasi) real voice data and the original recording from which the segment was cut, so the division units do not correspond in length even for the same number of divisions k. To address this, instead of increasing the number of divisions, the duration of the division unit can be reduced step by step, for example 1000 ms, 500 ms, and 250 ms. This resolves the mismatch in analysis window length between division units.

Returning to the target cost calculation of step 1303: the cost values for length, pitch, and intensity calculated between the segment and the prosody information (the target), and the cost value for global voice quality information calculated in step 1315, which is specific to this embodiment, are finally combined as a weighted sum to determine the target cost of the segment (step 1316). Repeating this processing for every segment candidate (step 1317) completes the target cost calculation (step 1303).
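The weighted sum of steps 1315 and 1316 can be sketched as follows. The text deliberately leaves the cost formulas open, so the absolute-difference prosody costs, the unit weights, and every name here (`Prosody`, `target_cost`) are assumptions for illustration; only the Euclidean distance between global voice quality vectors is taken from the description.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class Prosody:
    """Length, pitch, and intensity of one phoneme (units are illustrative)."""
    length_ms: float
    pitch_hz: float
    intensity_db: float


def target_cost(
    target: Prosody,
    candidate: Prosody,
    reference_vq: Sequence[float],
    candidate_vq: Sequence[float],
    weights: Tuple[float, float, float, float] = (1.0, 1.0, 1.0, 1.0),
) -> float:
    """Weighted sum of prosody costs and the global voice quality cost.

    reference_vq is the global voice quality vector of the (quasi) real
    voice data; candidate_vq is that of the segment candidate.
    """
    w_len, w_pitch, w_int, w_vq = weights
    # Assumed prosody costs: plain absolute differences.
    cost = w_len * abs(target.length_ms - candidate.length_ms)
    cost += w_pitch * abs(target.pitch_hz - candidate.pitch_hz)
    cost += w_int * abs(target.intensity_db - candidate.intensity_db)
    # Step 1315: Euclidean distance between the two voice quality vectors.
    cost += w_vq * math.dist(reference_vq, candidate_vq)
    return cost
```

In practice the four terms live on different scales, so the weights would need tuning or the terms would need normalizing before they are summed.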

Next, the connection cost calculation (step 1304) is performed. The details of this step are shown in the flowchart on the right side of FIG. 13. This flowchart, too, contains parts that this embodiment changes from the basic configuration of unit selection speech synthesis.

In the connection cost calculation in the flowchart on the right side of FIG. 13, a pair of adjacent segment candidates x and y in the read-out sentence is first taken (step 1322); that is, a pair in which segment y is joined immediately after segment x to generate the synthesized speech. Because multiple candidates have been searched for and enumerated for each segment unit (for example, each phoneme) of the read-out sentence, the number of such pairs is enormous. Various techniques for reducing the number of combinations have been disclosed and implemented, but they are unrelated to the essence of this embodiment and are not described here.

In step 1323, short-time voice quality information is extracted at the end position of the left-hand segment x. The structure of short-time voice quality information has already been described. Because this connection cost calculation needs to compute distances between the short-time voice quality information extracted from the (quasi) real voice data, described above, and the short-time voice quality information of the segment extracted here, both kinds of short-time voice quality information should preferably be obtained by the same analysis method.

Next, the processing branches according to whether segment y is the phoneme (segment unit) corresponding to the start of the (quasi) real voice data. If the determination is NO, that is, if this is not a junction with (quasi) real voice data, the processing is the same as in ordinary speech synthesis: step 1325 calculates the short-time voice quality information at the start position of segment y, and step 1326 calculates the distance between the short-time voice quality information of segments x and y calculated above.

If, on the other hand, the position of segment y corresponds to the start of the (quasi) real voice data, the distance must be calculated between the short-time voice quality information at the end of segment x and the short-time voice quality information at the start point of the (quasi) real voice data calculated by the voice quality information extraction unit 105. Step 1327 retrieves the short-time voice quality information at the start point of the (quasi) real voice data, and step 1328 calculates its distance from the short-time voice quality information of segment x.

Finally, in step 1329, the calculated distance between the pieces of short-time voice quality information is multiplied by a weight and output as the final connection cost value. This weight need not be constant: it may be varied according to which branch, left or right, of the connection cost flowchart produced the distance, and embodiments that vary the weight according to, for example, the phoneme types of segments x and y are also possible.

The above processing is carried out for every combination of segments. As a result, a connection cost is set for every pair of segments (step 1304).
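The branch in steps 1323 to 1329 can be sketched as below. This is an illustrative reading of the flowchart description, assuming short-time voice quality information is a fixed-length feature vector and the distance is Euclidean; the function and parameter names and the default weights are hypothetical.

```python
import math
from typing import Optional, Sequence


def connection_cost(
    x_end_vq: Sequence[float],
    y_start_vq: Sequence[float],
    y_is_real_voice_start: bool = False,
    real_voice_start_vq: Optional[Sequence[float]] = None,
    weight_normal: float = 1.0,
    weight_real_voice: float = 1.0,
) -> float:
    """Connection cost between segment x and the unit that follows it.

    x_end_vq: short-time voice quality at the end of segment x (step 1323).
    y_start_vq: short-time voice quality at the start of segment y (step 1325).
    real_voice_start_vq: short-time voice quality at the start point of the
        (quasi) real voice data (step 1327), needed only at a junction.
    """
    if y_is_real_voice_start:
        if real_voice_start_vq is None:
            raise ValueError("real_voice_start_vq is required at a junction")
        # Steps 1327-1328: compare x with the real voice data, not with y.
        distance = math.dist(x_end_vq, real_voice_start_vq)
        return weight_real_voice * distance  # step 1329, junction weight
    # Steps 1325-1326: ordinary segment-to-segment join.
    distance = math.dist(x_end_vq, y_start_vq)
    return weight_normal * distance  # step 1329, ordinary weight
```

Using separate weights for the two branches reflects the remark that the weight may differ depending on which path of the flowchart produced the distance.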

Next, the segment sequence with the minimum cost is searched for (step 1305). Multiple segment candidates have been enumerated for each phoneme (syllable) of the read-out sentence, and this processing selects one of them for each. The selection adds up the target cost of each segment and the connection costs between it and the segments for the neighboring phonemes (syllables) on its left and right, and chooses the combination of segments with the smallest total cost. This search can be implemented with, for example, the Viterbi algorithm, which is widely used in speech synthesis.
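A minimal Viterbi-style search over the candidate lattice might look like the following. It is a generic dynamic-programming sketch under the assumption that target costs are precomputed per position and that connection costs are available through a callback; the names `select_units` and `join_cost` are hypothetical and no pruning is applied.

```python
from typing import Callable, List, Sequence, Tuple


def select_units(
    target_costs: Sequence[Sequence[float]],
    join_cost: Callable[[int, int, int], float],
) -> Tuple[List[int], float]:
    """Find the candidate sequence with the smallest total cost.

    target_costs[t][i] is the target cost of candidate i at position t.
    join_cost(t, i, j) is the connection cost between candidate i at
    position t - 1 and candidate j at position t.
    Returns (chosen candidate index per position, total cost).
    """
    if not target_costs or any(len(c) == 0 for c in target_costs):
        raise ValueError("every position needs at least one candidate")
    best = list(target_costs[0])  # best cost of a path ending at each candidate
    back: List[List[int]] = []    # back-pointers for positions 1..T-1
    for t in range(1, len(target_costs)):
        new_best: List[float] = []
        pointers: List[int] = []
        for j, tc in enumerate(target_costs[t]):
            # Pick the predecessor that minimizes accumulated + join cost.
            prev = min(range(len(best)), key=lambda i: best[i] + join_cost(t, i, j))
            new_best.append(best[prev] + join_cost(t, prev, j) + tc)
            pointers.append(prev)
        best = new_best
        back.append(pointers)
    last = min(range(len(best)), key=lambda i: best[i])
    total = best[last]
    path = [last]
    for pointers in reversed(back):  # trace the back-pointers to the start
        path.append(pointers[path[-1]])
    path.reverse()
    return path, total
```

The cost of this search grows with the square of the number of candidates per position, which is why the candidate-reduction techniques mentioned earlier matter in practice.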

Finally, in step 1306, the final synthesized speech is generated from the selected combination of segments. This can be done by applying, per segment or between segments, speech signal processing techniques known as waveform superposition and waveform concatenation. Various embodiments of these techniques are possible, but they are unrelated to the essence of this embodiment and are not described here.

Through the above processing, in speech synthesis that generates synthesized speech in utterance styles with large variation in prosody and voice quality, such as colloquial speech, the speech synthesis method and apparatus of this embodiment can improve the overall sound quality (the sense of a real voice) by using real voice, or speech data of comparable quality, for the parts characteristic of the utterance style (the colloquial expression parts). Furthermore, the differences in prosody and voice quality between the (quasi) real voice and the remaining synthesized parts can be made as small as possible, which reduces the unnaturalness of synthesized speech felt with the prior art.

This embodiment describes a system that uses the basic configuration of the speech synthesis apparatus of the first embodiment to synthesize multiple sentences in succession, as in a dialogue system.

This embodiment realizes a system that synthesizes multiple sentences in succession, or alternately with user responses, based on a scenario or the like, or through a mechanism that determines the system's next read-out sentence according to the user's response.

FIG. 14 illustrates the overall speech synthesis system of this embodiment. In addition to the basic configuration of the first embodiment (FIGS. 1A and 1B), this configuration has a reading scenario storage unit 1401, a read-out sentence generation unit 1402, and a global voice quality information storage unit 1403.

The flow of processing is described below, focusing on the functional processing units newly added in FIG. 14.
In this embodiment, the input to the speech synthesis system (the text and the (quasi) real voice ID) is not given as an isolated input each time; instead, the sentences to be synthesized are determined one after another from a scenario composed of multiple sentences.

The reading scenario storage unit 1401 stores such scenario data (data that specifies the flow of the sentences to be read). The reading scenario storage unit 1401 can be formed in part of the storage unit. Scenario data can be structured in various ways; the simplest is to store pairs of a read-out sentence and its corresponding (quasi) real voice ID as a list ordered along the flow of the scenario. In this case, the reading scenario storage unit 1401 stores, for example, scenario data 1500 as shown in FIG. 15.

The read-out sentence generation unit 1402 takes read-out sentences and their corresponding (quasi) real voice IDs from the reading scenario storage unit 1401 in order and inputs them to the speech synthesis apparatus of this embodiment. Needless to say, the read-out sentence generation unit 1402 can be realized, for example, by program execution on the CPU serving as the processing unit.

When scenario data 1500 as shown in FIG. 15 is stored, it suffices to take out the next sentence to be synthesized and its (quasi) real voice ID and input them when the preceding speech synthesis finishes. Scenario data can of course also be stored in more complex formats. For example, scenario data based on standardized specifications such as VoiceXML and SSML (Speech Synthesis Markup Language) could be used. In that case, because only the text of the read-out sentence can be specified in data of those formats, a separate correspondence table (FIG. 15 is one example) is held in order to determine the (quasi) real voice ID for each sentence. In a preferred form of this embodiment, the input text is analyzed and the (quasi) real voice data is determined using the correspondence table of FIG. 15.

The other unit added in the system of this embodiment, the global voice quality information storage unit 1403, adds and stores the global voice quality information extracted by the voice quality information extraction unit 105 each time. The structure of the stored global voice quality information is not particularly limited. The storage unit 1403 must hold at least the global voice quality information for the several most recent synthesized utterances. FIG. 16 shows an example 1600 of the information stored in the global voice quality information storage unit 1403. In this information 1600, global voice quality information is kept for the current read-out sentence, as history 0, and for read-out sentences up to three utterances back. The read-out sentences and their corresponding (quasi) real voice IDs are stored as well. What information must be stored here, like the length of the history, depends on how the voice quality-considered waveform synthesis uses the global voice quality information of past read-out sentences.
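One way to picture the history kept by the storage unit 1403 is a bounded queue indexed from the most recent utterance. The sketch below assumes a history of four entries (history 0 plus three past sentences, matching the FIG. 16 description); the class and field names are hypothetical, and the actual layout of FIG. 16 is not reproduced here.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Optional


@dataclass
class HistoryEntry:
    text: str                     # the read-out sentence
    real_voice_id: Optional[str]  # the (quasi) real voice ID, if any
    global_vq: List[float]        # global voice quality information


class GlobalVoiceQualityStore:
    """Keeps the most recent entries; index 0 is the current sentence."""

    def __init__(self, max_history: int = 4) -> None:
        # A deque with maxlen drops the oldest entry automatically.
        self._entries: Deque[HistoryEntry] = deque(maxlen=max_history)

    def push(self, entry: HistoryEntry) -> None:
        """Record a newly synthesized sentence as history 0."""
        self._entries.appendleft(entry)

    def get(self, history: int) -> Optional[HistoryEntry]:
        """Return the entry `history` utterances back, or None if absent."""
        if 0 <= history < len(self._entries):
            return self._entries[history]
        return None
```

After `push` is called for the current sentence, `get(1)` returns the entry for the sentence synthesized immediately before it.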

In addition to the voice quality information for the current read-out sentence output from the voice quality information extraction unit 105, the voice quality-considered waveform synthesis unit 110 also refers to the global voice quality information for past read-out sentences stored in the global voice quality information storage unit 1403 when generating synthesized speech for the current read-out sentence. This is described with reference to FIG. 17.

FIG. 17 is an example of a processing flowchart of the voice quality-considered waveform synthesis unit 110 in this embodiment.

The processing flowchart of FIG. 17 differs from that of the voice quality-considered waveform synthesis unit 110 of the first embodiment (FIG. 13) in the cost calculation for global voice quality. In FIG. 17, the cost calculation for global voice quality information is realized as several more detailed processing steps.

In step 1701, the global voice quality information for the sentence synthesized immediately before (hereinafter C1) is obtained from the information stored in the global voice quality information storage unit 1403. In the following step 1702, the global voice quality information for the current read-out sentence, that is, for its (quasi) real voice data, output from the voice quality information extraction unit 105 (hereinafter C0) is obtained.

Next, in step 1703, position information P is calculated, indicating where in the read-out sentence the segment whose target cost is currently being calculated is located. The position P is used to evaluate how far into the current sentence the global voice quality information of the (quasi) real voice data in the preceding read-out sentence exerts its influence. P can be, for example, the index of the segment in the phoneme sequence of the read-out sentence, or its time offset in seconds from the start of the read-out sentence, calculated from the prosody information 1000. An implementation is also conceivable in which a large value is added to P when a breath pause occurs between the start of the read-out sentence and the position of the segment. Voice quality is expected to change substantially (or be reset) at a breath pause, and applying such a nonlinear change to P makes it possible to take this into account. In yet another embodiment, P can be the index, counted from the start, of the breath phrase or accent phrase that contains the segment.
The influence of global voice quality information may well change stepwise with each accent phrase or phrase, which are semantically coherent units of speech.

In step 1704, weights for the global voice quality information C1 and C0 are calculated from the position information P. These weights determine whether the (quasi) real voice data of the preceding sentence or that of the current sentence has the stronger influence, and are calculated from the value of P described above. Various calculation methods are possible; the simplest is W1 = 1 - P/Pmax and W0 = 1 - W1, where Pmax is the position information P calculated for the segment corresponding to the last phoneme (syllable) of the read-out sentence. With this formula, the closer a segment is to the start of the read-out sentence, the greater the influence of the global voice quality information of the preceding sentence's (quasi) real voice data; that influence decreases in proportion as P increases, and the influence of the current sentence's (quasi) real voice data grows correspondingly. This formula is only an example and is the simplest case. In a more practical embodiment, the weights W1 and W0 can be looked up in a table that sets them in advance for each value (or each range of values) of P.
It is unlikely that the influence of the preceding sentence's global voice quality and that of the current sentence actually trade off linearly in this way. To mimic real speech, it is preferable to prepare a weight table tuned in advance.

In step 1705, the global voice quality information on the (quasi) real voice data side, which is used in the distance calculation for this segment, is adjusted. The global voice quality information of the current sentence is C0, but as noted above, depending on the position within the read-out sentence, the influence of the preceding sentence's global voice quality information may need to be included. The global voice quality information used in the distance calculation is therefore computed as C = W0·C0 + W1·C1.

Finally, in step 1706, the distance between the global voice quality information C calculated above and the global voice quality information of the segment (whose structure is described in the first embodiment) is calculated and used as the cost value for global voice quality.
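Steps 1704 to 1706 with the simplest linear weighting can be sketched as follows. The formulas W1 = 1 - P/Pmax, W0 = 1 - W1, and C = W0·C0 + W1·C1 come from the text; the clamping of P to the range 0 to Pmax, the Euclidean distance, and the function names are assumptions of this sketch.

```python
import math
from typing import List, Sequence, Tuple


def blend_weights(p: float, p_max: float) -> Tuple[float, float]:
    """Step 1704: return (W0, W1) using the linear rule W1 = 1 - P/Pmax."""
    if p_max <= 0:
        raise ValueError("p_max must be positive")
    p = min(max(p, 0.0), p_max)  # clamping is an assumption of this sketch
    w1 = 1.0 - p / p_max
    return 1.0 - w1, w1


def blended_reference(
    c0: Sequence[float], c1: Sequence[float], p: float, p_max: float
) -> List[float]:
    """Step 1705: C = W0*C0 + W1*C1, computed per vector component."""
    if len(c0) != len(c1):
        raise ValueError("C0 and C1 must have the same length")
    w0, w1 = blend_weights(p, p_max)
    return [w0 * a + w1 * b for a, b in zip(c0, c1)]


def global_vq_cost(
    unit_vq: Sequence[float],
    c0: Sequence[float],
    c1: Sequence[float],
    p: float,
    p_max: float,
) -> float:
    """Step 1706: distance between the segment's vector and the blend C."""
    return math.dist(unit_vq, blended_reference(c0, c1, p, p_max))
```

At P = 0 the reference equals C1 (the preceding sentence), and at P = Pmax it equals C0 (the current sentence). Replacing `blend_weights` with a table lookup gives the tuned, nonlinear variant that the text recommends.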

Through the above flow, the voice quality of the synthesized speech for the current sentence can be corrected by referring to the global voice quality information of the immediately preceding sentence, and of earlier sentences as well, stored in the global voice quality information storage unit 1403.

According to this second embodiment, when multiple sentences are synthesized in succession, the voice quality at the start of the current synthesized speech can be adjusted to match the global voice quality of the (quasi) real voice data of the sentence synthesized immediately before. As a result, synthesized speech whose voice quality connects smoothly can be generated.

This embodiment applies the configuration of the second embodiment to a system in which read-aloud content is delivered from a server to a terminal such as a personal computer or an e-book terminal, and the content is read aloud by the speech synthesis system.

In this embodiment, for example for an e-book, the complete data, consisting of the text to be read with a group of (quasi) real voice data attached, is delivered, and the e-book terminal performs the speech synthesis of this embodiment while using the (quasi) real voice data recorded in the content. Beyond e-books, the same arrangement can be used in voice dialogue interfaces on smartphones and the like: when dialogue data (a scenario) is generated on the server side, the entire dialogue scenario, or a partial scenario generated according to the flow of the dialogue, is delivered from the server to the terminal.

FIG. 18 illustrates the overall speech synthesis system of this embodiment. The server 1800 has an ordinary computer configuration, with a content storage unit 1801 in its storage unit; the content transmission unit 1802 can be implemented by the CPU serving as its processing unit and a network interface.

The terminal-side computer 120, in addition to the configuration of the second embodiment (FIG. 14), has a content reception unit 1803, a content storage unit 1804, a content selection unit 1806, and an input generation unit 1805. The content storage unit 1804 is a storage unit; the others are functional processing units that can be realized by the network interface described earlier and by program execution on the CPU serving as the processing unit. On the other hand, the terminal does not have the real voice storage unit, the quasi-real voice generation unit, or the (quasi) real voice selection unit; the input generation unit 1805 performs part of their processing. The terminal-side device described above is connected via the network 124 to the content transmission unit 1802 and content storage unit 1801 on the server side.

The following description covers only the processing units on the terminal side.
The content reception unit 1803 receives content data from the server via the network and stores it in the content storage unit 1804. The content data assumed here is e-book or voice dialogue scenario data that contains, in addition to the text to be read, a group of (quasi) real voice data and a list giving the correspondence between read-out sentences and (quasi) real voice data. FIG. 19 shows content data 1900 as an example. FIG. 19(a) is the list of read-out sentences themselves; in e-book data it corresponds to the text of the book. FIG. 19(c) is the list of (quasi) real voice data. As shown, the (quasi) real voice data is embedded in the content itself and delivered with it. FIG. 19(b) is the data giving the correspondence between each read-out sentence and the (quasi) real voice data to be used in it.

FIG. 20 shows an example 2000 of content data in yet another embodiment. In content data 2000 of this form, the (quasi) real voice ID is specified inside the read-out sentence with a special expression (※N※). Storing all the necessary data together in a single piece of content data in this way makes it possible, for example, to change the speaking tone of the synthesized speech to suit the content.
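As an illustration only, the inline designation of FIG. 20 could be parsed as sketched below. The sketch assumes that N is a decimal number and that a marker simply marks a position in the sentence; how a marker relates to the surrounding text is not specified in this passage. The names `MARKER` and `split_markers` are hypothetical and do not appear in the embodiment.

```python
import re

# Assumed marker syntax: ※N※, where N is a decimal (quasi) real voice ID.
MARKER = re.compile(r"※(\d+)※")


def split_markers(sentence):
    """Split a read-out sentence into ordered segments.

    Returns a list of ("text", str) and ("voice_id", int) tuples.
    """
    segments = []
    pos = 0
    for match in MARKER.finditer(sentence):
        # Plain text between the previous marker and this one.
        if match.start() > pos:
            segments.append(("text", sentence[pos:match.start()]))
        segments.append(("voice_id", int(match.group(1))))
        pos = match.end()
    # Trailing text after the last marker.
    if pos < len(sentence):
        segments.append(("text", sentence[pos:]))
    return segments
```

Under these assumptions, the text segments would go to ordinary synthesis, and each voice ID would be looked up in the (quasi) real voice data carried by the same content data.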

The content selection unit 1806 is a mechanism by which the user selects the data to be used from among the multiple pieces of content data 2000 stored in the content storage unit 1804. When the system is implemented as an e-book terminal, this corresponds to a mechanism (such as a touch panel) for choosing, from the e-books stored on the terminal, the e-book content the user wants to read (or listen to). For dialogue content used in a spoken-dialogue interface, the selection may instead be made automatically by the spoken-dialogue system rather than by the user.

The content data selected in this way is passed to the input generation unit 1805. The processing of the input generation unit 1805 combines that of the reading sentence generation unit 1402, the input analysis unit 106, and the (quasi) real voice selection unit 103 in FIG. 14 of the second embodiment. When content data such as that of FIG. 19 is used, the input generation unit 1805 takes read-out sentences one at a time, in order, from the read-out sentence list in part (a) of FIG. 19. It then determines the (quasi) real voice data (and accompanying information) corresponding to each sentence ID from the correspondence table in part (b) of FIG. 19 and the voice data list in part (c) of FIG. 19. Finally, it passes the read-out sentence and the (quasi) real voice data (with accompanying information) to the prosody generation unit, the prosodic information extraction unit, the voice quality information extraction unit, and the connection synthesis unit, as appropriate.
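The lookup performed for FIG. 19 style content data can be sketched as follows. This is a minimal illustration, not the embodiment's implementation. It assumes that each sentence ID maps to at most one (quasi) real voice ID and that the three parts of FIG. 19 are held as a list and two dictionaries. The names `VoiceData` and `generate_inputs` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VoiceData:
    voice_id: int
    waveform: bytes      # placeholder for the recorded (quasi) real voice
    info: str = ""       # accompanying information


def generate_inputs(sentences, correspondence, voice_list):
    """Pair each read-out sentence with its (quasi) real voice data.

    sentences:      list of (sentence_id, text), as in FIG. 19(a)
    correspondence: dict sentence_id -> voice_id, as in FIG. 19(b)
    voice_list:     dict voice_id -> VoiceData, as in FIG. 19(c)
    Yields (text, VoiceData or None) in reading order.
    """
    for sentence_id, text in sentences:
        voice_id = correspondence.get(sentence_id)
        # Sentences with no entry in the correspondence table get None.
        voice = voice_list.get(voice_id) if voice_id is not None else None
        yield text, voice
```

A sentence paired with `None` would be synthesized without any (quasi) real voice portion; the others would have their voice data handed on to the downstream extraction and synthesis units.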

With this configuration, the (quasi) real voice data set for each piece of content can be used to read that content aloud in higher-quality synthesized speech.

Various embodiments of the present invention have been described above, but the invention is not limited to these embodiments and includes various modifications. For example, the embodiments above are described in detail to aid understanding of the invention, and the invention is not necessarily limited to implementations having every configuration described. Part of the configuration of one embodiment can be replaced with the configuration of another embodiment, and the configuration of another embodiment can be added to the configuration of one embodiment. For part of the configuration of each embodiment, other configurations can also be added, deleted, or substituted.

Furthermore, the description above has focused on examples in which each of the configurations, functions, processing units, and so on is realized by creating a program executed by a processing unit that implements part or all of them. Needless to say, part or all of them may instead be realized in hardware, for example by designing them as an integrated circuit.

101 (Quasi) real voice storage unit
102 Quasi-real voice generation unit
103 (Quasi) real voice selection unit
104 Prosodic information extraction unit
105 Voice quality information extraction unit
106 Input analysis unit
107 Prosody generation unit
108 Prosody model
109 Segment database (DB)
110 Voice quality-considered waveform synthesis unit
111 Connection synthesis unit
112 Input text (mixed kanji-kana text)
113 Synthesized speech
114 (Quasi) real voice ID
120 Computer
121 Central processing unit (CPU)
122 Memory (MM)
123 Interface (I/F)
124 Network
125 Hard disk controller (HDD)
126 Input/output unit (I/O)
127 Speaker
128 Bus
200 (Quasi) real voice database
901, 903 Curve
902 Connection end point
1401 Reading scenario storage unit
1402 Reading sentence generation unit
1403 Global voice quality information storage unit
1800 Server
1801 Content storage unit
1802 Content transmission unit
1803 Content receiving unit
1804 Content storage unit
1805 Input generation unit
1806 Content selection unit

Claims

A speech synthesis method performed by a processing unit that outputs synthesized speech, wherein
the processing unit:
selects, from the input text, real voice data corresponding to a characteristic part of the utterance style, or quasi-real voice data that approximates real voice in prosody and voice quality (hereinafter, "(quasi) real voice data");
extracts prosodic information and voice quality information from the selected (quasi) real voice data;
adjusts, based on the extracted prosodic information, the prosodic information of the synthesized speech corresponding to the input text, and generates synthesized speech waveform data corresponding to the input text in consideration of the extracted voice quality information; and
joins the generated synthesized speech waveform data and the selected (quasi) real voice data at a connection end point.

The speech synthesis method according to claim 1, wherein
the processing unit,
in order to select the (quasi) real voice data, determines the (quasi) real voice data to be selected by analyzing the input text.

The speech synthesis method according to claim 1, wherein
the processing unit,
in order to extract the voice quality information, extracts from the (quasi) real voice data short-time voice quality information corresponding to short intervals at both ends of the data, and global voice quality information corresponding to the whole of the (quasi) real voice data.

The speech synthesis method according to claim 3, wherein
the processing unit
uses, as the global voice quality information, frequency-domain coefficient information obtained by applying a frequency analysis method to the whole of the (quasi) real voice data.

The speech synthesis method according to claim 3, wherein
the processing unit
uses, as the global voice quality information, information obtained by taking the time and weight information of voice quality feature points attached in advance to the (quasi) real voice data and adding together the short-time voice quality information at each time position, weighted by the specified weights.

The speech synthesis method according to claim 3, wherein
the processing unit
uses, as the global voice quality information, N pieces of representative voice quality information obtained by repeating, for k = 1 to N, a process of dividing the (quasi) real voice data into k speech sections, extracting global voice quality information or short-time voice quality information at a specific position in each of the k sections, and determining representative voice quality information from the k extracted pieces of voice quality information.

The speech synthesis method according to claim 3, wherein
the processing unit,
when generating the synthesized speech waveform data in consideration of the extracted voice quality information, including the extracted global voice quality information, performs waveform synthesis processing that also takes into account the most recent several pieces of global voice quality information that have been saved.

The speech synthesis method according to claim 2, wherein
the processing unit,
in order to extract the voice quality information, extracts global voice quality information corresponding to the whole of the (quasi) real voice data and stores the extracted global voice quality information, and,
when generating the synthesized speech waveform data in consideration of the extracted voice quality information, performs waveform synthesis processing that takes into account the most recent several pieces of stored global voice quality information.

A speech synthesizer
comprising a processing unit and a storage unit, wherein
the processing unit:
selects, from the input text, (quasi) real voice data corresponding to a characteristic part of the utterance style;
extracts prosodic information and voice quality information from the selected (quasi) real voice data;
adjusts, based on the extracted prosodic information, the prosodic information of the synthesized speech corresponding to the input text, and generates synthesized speech waveform data corresponding to the input text in consideration of the extracted voice quality information; and
joins the generated synthesized speech waveform data and the selected (quasi) real voice data at a connection end point and outputs the result.

The speech synthesizer according to claim 9, wherein
the processing unit,
in order to select the (quasi) real voice data, determines the (quasi) real voice data to be selected by analyzing the input text.

The speech synthesizer according to claim 9, wherein
the processing unit,
in order to extract the voice quality information, extracts from the (quasi) real voice data short-time voice quality information corresponding to short intervals at both ends of the data, and global voice quality information corresponding to the whole of the (quasi) real voice data.

The speech synthesizer according to claim 11, wherein
the processing unit
stores the extracted global voice quality information in the storage unit, and,
when generating the synthesized speech waveform data in consideration of the extracted voice quality information, performs waveform synthesis processing that takes into account the most recent global voice quality information stored in the storage unit.

The speech synthesizer according to claim 10, wherein
the processing unit,
in order to extract the voice quality information, extracts global voice quality information corresponding to the whole of the (quasi) real voice data,
stores the extracted global voice quality information in the storage unit, and,
when generating the synthesized speech waveform data in consideration of the extracted voice quality information, performs waveform synthesis processing that takes into account the most recent several pieces of stored global voice quality information.

A speech synthesis program executed by a processing unit of a speech synthesizer,
the program causing the processing unit to:
select, from the input text, (quasi) real voice data corresponding to a characteristic part of the utterance style;
extract prosodic information and voice quality information from the selected (quasi) real voice data;
adjust, based on the extracted prosodic information, the prosodic information of the synthesized speech corresponding to the input text, and generate synthesized speech waveform data corresponding to the input text in consideration of the extracted voice quality information; and
join the generated synthesized speech waveform data and the selected (quasi) real voice data at a connection end point.

The speech synthesis program according to claim 14,
the program causing the processing unit,
in order to select the (quasi) real voice data, to determine the (quasi) real voice data to be selected by analyzing the input text.