JP2007079019A - Method and device for speech synthesis, and computer program

Info

Publication number: JP2007079019A
Application number: JP2005265778A
Authority: JP
Inventors: Tsutomu Kaneyasu; 勉兼安
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2005-09-13
Filing date: 2005-09-13
Publication date: 2007-03-29
Anticipated expiration: 2025-09-13
Also published as: JP4640063B2

Abstract

PROBLEM TO BE SOLVED: To emphasize a keyword part by changing a property of the keyword speech other than the speaker, volume, pitch, or speaking rate.
SOLUTION: A speech synthesis device is equipped with a prosody prediction information generation section which generates prosody prediction information 502; a scale calculation section which finds the scale of each symbol of a text body by mapping from feature quantities to a scale, based on the prosody prediction information and the symbols of the text body; a phoneme selection section 104 which selects text phonemes from a phoneme database 109; a keyword scale calculation section 107 which finds a scale based on the prosody prediction information and the symbols of the keywords in the text body; a keyword priority phoneme selection section 107 which selects keyword phonemes from the phoneme database; a phoneme replacement section 107 which replaces the phonemes of the keyword part of the text phonemes with the keyword phonemes; and a synthesis section 111 which connects the phonemes.
COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to speech synthesis for reading a text body aloud, and more particularly to a speech synthesis apparatus, a speech synthesis method, and a computer program that emphasize keywords with good sound quality.

Speech synthesis apparatuses that convert a text document stored on a PC (personal computer) into speech and read it aloud, based on pre-recorded natural human speech or the like, are generally known. Such an apparatus synthesizes speech based on a corpus in which natural speech that can be divided into part-of-speech units is recorded.

To synthesize speech, a conventional speech synthesis apparatus first performs, for example, morphological analysis and dependency analysis on the input text and converts it into phoneme symbols, accent symbols, and the like.

Next, using the phoneme symbols, the accent symbol string, and the part-of-speech information of the input text obtained from the morphological analysis result, the conventional apparatus estimates the phoneme duration (length of the voice), the fundamental frequency (pitch of the voice), the vowel-center power (loudness of the voice), and the like.

Next, the conventional apparatus uses dynamic programming or the like to select the combination of synthesis units (phoneme segments) stored in the waveform dictionary that is closest to the estimated phoneme duration, fundamental frequency, vowel-center power, and so on, and that yields the smallest distortion when the units are connected. The unit selection performed here uses a scale that matches perceptual characteristics.

Once the combination of synthesis units has been selected, the conventional apparatus synthesizes speech by connecting the phoneme segments according to the selected combination.

Conventional speech synthesis apparatuses also include ones that can read aloud with emphasis on the important parts of a document and on the parts that the author particularly wants to convey to the reader (see, for example, Patent Document 1).

Patent Document 1: JP-A-10-274999

However, conventional speech synthesis apparatuses compute the cost over the entire input text body and emphasize a keyword by changing the speaker, volume, pitch, or speaking rate of the keyword speech; no speech synthesis apparatus has emphasized the keyword part by changing anything else.

Moreover, because conventional apparatuses compute the cost over the entire input text body and emphasize the keyword by changing its speaker, volume, pitch, or speaking rate, it has been difficult to improve the sound quality of the emphasized keyword part.

The present invention has been made in view of the above problems, and its object is to provide a new and improved speech synthesis apparatus, speech synthesis method, and computer program capable of emphasizing a keyword part by changing something other than the speaker, volume, pitch, or speaking rate of the keyword speech.

To solve the above problems, according to a first aspect of the present invention, there is provided a speech synthesis apparatus comprising: a prosody prediction information generation unit that analyzes a text body and generates prosody prediction information predicting the feature quantities of each symbol (or each phoneme symbol) constituting the text body; a scale calculation unit that obtains the scale of each symbol constituting the text body by mapping from observable feature quantities to a scale determined based on perceptual experiments, based on the prosody prediction information and the symbols constituting the text body; a phoneme database in which a plurality of voices are recorded and the phonemes constituting those voices are stored; a phoneme selection unit that, for each symbol constituting the text body, selects a text phoneme from the phoneme database based on the scale obtained by the scale calculation unit; a keyword scale calculation unit that obtains the scale of each symbol constituting a keyword by mapping from observable feature quantities to a scale determined based on perceptual experiments, based on the prosody prediction information and the symbols constituting the keyword in the text body; a keyword priority phoneme selection unit that, for each symbol constituting the keyword, selects a keyword phoneme from the phoneme database based on the scale obtained by the keyword scale calculation unit; a phoneme replacement unit that replaces, among the text phonemes selected by the phoneme selection unit, the phonemes corresponding to the keyword part with the keyword phonemes selected by the keyword priority phoneme selection unit; and a synthesis unit that connects the phonemes and generates synthesized speech.

According to the present invention, the phoneme selection unit of the speech synthesis apparatus selects the text phonemes, the keyword priority phoneme selection unit selects the keyword phonemes, and the phoneme replacement unit replaces the corresponding part of the text phonemes with the keyword phonemes. With this configuration, independently of the phoneme selection unit selecting the text phonemes, the keyword scale calculation unit calculates a scale based on the symbols constituting the keyword and the prosody prediction information, and the keyword priority phoneme selection unit selects the keyword phonemes from that scale. Because the scale is obtained and the phonemes are selected with the target range limited to the keyword part, the synthesized speech of the keyword part becomes more natural, and because it also becomes clearer, the keyword part is emphasized.

The text phonemes are, for example, the phonemes selected for each symbol constituting the text body. The keyword phonemes are, for example, the phonemes selected for each symbol constituting a keyword in the text body.

The scale calculation unit and the phoneme selection unit according to the present invention are configured as separate units, but the invention is not limited to this example; for instance, the scale calculation unit and the phoneme selection unit may be configured as a single unit.

Likewise, the keyword scale calculation unit, the keyword priority phoneme selection unit, and the phoneme replacement unit according to the present invention are configured as separate units, but the invention is not limited to this example; for instance, these three units may be configured as a single unit.

The feature quantities are, for example, acoustic measures, physical quantities, and linguistic information; the physical quantities are, for example, the spectrum, mel-cepstrum, pitch, and sound duration, but are not limited to these examples. The perceptual evaluation quantity (psychological quantity) is, for example, a local cost (or local scale, or subcost), but is not limited to this example. The mapping from feature quantities to psychological quantities uses, for example, a cost function, but is not limited to this example.

When two or more keywords exist in the text body, the keyword priority phoneme selection unit may be configured to select the keyword phonemes in priority order, where the priorities are either assigned to the keywords in the order in which they appear from the beginning of the text body or obtained by converting weights assigned to the keywords.
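The two ordering schemes described above (order of appearance, or weights converted to priorities) can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Keyword` record, its field names, the convention that a larger weight means a higher priority, and the tie-break by position are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Keyword:
    text: str
    position: int        # index of the keyword's first symbol in the text body (hypothetical field)
    weight: float = 1.0  # assumed convention: larger weight = more important


def order_by_appearance(keywords):
    """Process keywords in the order they appear from the start of the text body."""
    return sorted(keywords, key=lambda k: k.position)


def order_by_weight(keywords):
    """Convert weights to priorities: heavier keywords first, ties broken by appearance."""
    return sorted(keywords, key=lambda k: (-k.weight, k.position))
```

Either function returns the sequence in which keyword phoneme selection would then be run, one keyword at a time.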

When two or more keywords exist in the text body, the keyword priority phoneme selection unit may be configured to weight each keyword and, in the order in which the keywords appear, obtain the scale of each symbol constituting a keyword based on the weight assigned to that keyword.

The phoneme replacement unit may be configured so that, each time the keyword priority phoneme selection unit selects keyword phonemes, it replaces the phonemes corresponding to that keyword part among the text phonemes selected by the phoneme selection unit with the keyword phonemes.

The phoneme replacement unit may be configured to decide whether to replace with the keyword phonemes by comparing the scale obtained by the keyword priority phoneme selection unit with the scale obtained by the phoneme selection unit.

The phoneme replacement unit may compare the difference between the scale obtained by the keyword priority phoneme selection unit and the scale obtained by the phoneme selection unit against a predetermined threshold.
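A minimal sketch of this threshold test follows. The source states only that the difference between the two scales is compared with a predetermined threshold; the direction of the comparison (replace when the keyword selection lowers the cost by more than the threshold), the function name, and the default threshold of zero are assumptions made for illustration.

```python
def should_replace(text_cost: float, keyword_cost: float, threshold: float = 0.0) -> bool:
    """Decide whether to swap in the keyword phonemes.

    text_cost:    cost of the keyword part within the whole-text selection
    keyword_cost: cost obtained by the keyword-only selection
    Assumption: lower cost means more natural speech, so replacement is
    worthwhile only when the improvement exceeds the threshold.
    """
    return (text_cost - keyword_cost) > threshold
```

For example, with a threshold of 1.0, a keyword part whose cost drops from 5.0 to 3.0 would be replaced, while a drop from 5.0 to 4.5 would leave the original text phonemes in place.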

To solve the above problems, according to another aspect of the present invention, there is provided a speech synthesis method comprising: a prosody prediction information generation step of analyzing a text body and generating prosody prediction information predicting the feature quantities of each symbol constituting the text body; a scale calculation step of obtaining the scale of each symbol constituting the text body by mapping from observable feature quantities to a scale determined based on perceptual experiments, based on the prosody prediction information and the symbols constituting the text body; a phoneme selection step of selecting, for each symbol constituting the text body, a text phoneme from a phoneme database storing one or more phonemes constituting recorded speech, based on the scale obtained in the scale calculation step; a keyword scale calculation step of obtaining the scale of each symbol constituting a keyword by mapping from the observable feature quantities to the scale determined based on the perceptual experiments, based on the prosody prediction information and the symbols constituting the keyword in the text body; a keyword priority phoneme selection step of selecting, for each symbol constituting the keyword, a keyword phoneme from the phoneme database based on the scale obtained in the keyword scale calculation step; a phoneme replacement step of replacing, among the text phonemes selected in the phoneme selection step, the phonemes corresponding to the keyword part with the keyword phonemes selected in the keyword priority phoneme selection step; and a synthesized speech generation step of connecting the phonemes and generating synthesized speech.

To solve the above problems, according to yet another aspect of the present invention, there is provided a computer program that causes a computer to execute a speech synthesis method comprising: a prosody prediction information generation step of analyzing a text body and generating prosody prediction information predicting the feature quantities of each symbol constituting the text body; a scale calculation step of obtaining the scale of each symbol constituting the text body by mapping from observable feature quantities to a scale determined based on perceptual experiments, based on the prosody prediction information and the symbols constituting the text body; a phoneme selection step of selecting, for each symbol constituting the text body, a text phoneme from a phoneme database storing one or more phonemes constituting recorded speech, based on the scale obtained in the scale calculation step; a keyword scale calculation step of obtaining the scale of each symbol constituting a keyword by mapping from the observable feature quantities to the scale determined based on the perceptual experiments, based on the prosody prediction information and the symbols constituting the keyword in the text body; a keyword priority phoneme selection step of selecting, for each symbol constituting the keyword, a keyword phoneme from the phoneme database based on the scale obtained in the keyword scale calculation step; a phoneme replacement step of replacing, among the text phonemes selected in the phoneme selection step, the phonemes corresponding to the keyword part with the keyword phonemes selected in the keyword priority phoneme selection step; and a synthesized speech generation step of connecting the phonemes and generating synthesized speech.

To solve the above problems, according to another aspect of the present invention, there is provided a speech synthesis apparatus that converts a text body written in kanji and kana into prosody prediction information; selects and determines, for each phoneme unit and according to the prosody prediction information, the optimal phoneme candidates from a corpus (phoneme database) holding recorded speech, by cost calculation (or scale calculation) using a cost function that maps observable feature quantities to psychological quantities; and obtains a synthesized waveform by connecting the waveform segments (waveform data) corresponding to the selected phonemes, the apparatus comprising a keyword priority phoneme selection unit that re-determines the optimal phoneme candidates by independently recalculating the cost for predetermined specific words (keywords).

The keyword priority phoneme selection unit may be configured to select the combination of phoneme candidates by comparing the cost values that result from the independent cost recalculation for the predetermined specific words (keywords) and that indicate how natural the synthesized sound of each specific word (keyword) part is.

When two or more predetermined specific words (keywords) exist, the keyword priority phoneme selection unit may be applied in the order in which the keywords appear from the beginning of the text body; or priorities may be assigned to the keywords and the unit applied in priority order; or weights may be assigned to the keywords and weighted cost recalculation performed in the order in which the keywords appear; or the weights assigned to the keywords may be converted into priorities and the unit applied in priority order.

In the keyword priority phoneme selection unit, the specific word (keyword) parts of the combination of phoneme candidates initially selected for the entire text body written in kanji and kana may be replaced with the combinations of phoneme candidates obtained by applying the keyword priority phoneme selection unit to each specific word (keyword), in the order in which the keywords appear, in priority order, or in the order resulting from a comparison of cost values recalculated with weights.

The apparatus may be configured with a keyword priority phoneme selection unit to which is added a means for determining whether to replace the specific word (keyword) parts of the combination of phoneme candidates initially selected for the entire text body with the combinations of phoneme candidates obtained by applying the keyword priority phoneme selection unit to each specific word (keyword).

The keyword priority phoneme selection unit with this added determination means may be configured to obtain, for each specific word (keyword) part in the text body, the cost (or scale) of that part computed when the phoneme candidates for the entire text were initially selected, compare it with the cost value obtained by applying the keyword priority phoneme selection unit, and, when replacement is needed, replace the part with the combination of phoneme candidates obtained by applying the keyword priority phoneme selection unit.

As described above, according to the present invention, independently of obtaining the scale of each symbol constituting the text body and selecting the text phonemes, the scale of each symbol constituting a keyword in the text body is obtained separately and the keyword phonemes are selected; replacing the phonemes of the keyword part of the text phonemes with the keyword phonemes allows the keyword part to be emphasized clearly.

Preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings. In the following description and the accompanying drawings, components having substantially the same functions and configurations are denoted by the same reference numerals, and redundant description is omitted.

(Speech synthesis apparatus)
First, the speech synthesis apparatus 100 according to the first embodiment will be described with reference to FIG. 1. FIG. 1 is a block diagram showing the schematic configuration of the speech synthesis apparatus according to the first embodiment.

As shown in FIG. 1, the speech synthesis apparatus 100 includes a text analysis unit 101, a prosody prediction unit 103, a phoneme selection unit 104, a keyword priority phoneme selection unit (or phoneme replacement unit) 107, a corpus (or phoneme database) 109, and a phoneme connection unit 111.

The present embodiment is described using the case where the text analysis unit 101 and the prosody prediction unit 103 are separate units, but it is not limited to this example; it can also be implemented when, for instance, the text analysis unit 101 and the prosody prediction unit 103 are configured as a single prosody prediction information generation unit.

As shown in FIG. 1, the text analysis unit 101 converts a text body written in kanji and kana into phoneme symbols. Here, a phoneme denotes, for example, a segmentable unit such as one represented by a phoneme symbol, but is not limited to this example.

More specifically, the text analysis unit 101 performs, for example, morphological analysis and dependency analysis on the text body written in kanji and kana, and outputs an accent symbol string and a morphological analysis result representing the part-of-speech information of the text body.

The prosody prediction unit 103 uses the phoneme symbols of the text body converted by the text analysis unit 101, the accent symbol string output from the text analysis unit 101, and the part-of-speech information of the text body obtained from the morphological analysis result of the text analysis unit 101 to predict the pitch (height of the voice: fundamental frequency F0), the phoneme duration (length of the voice), and the mel-cepstrum representing the components of the waveform. Details of the mel-cepstrum and related quantities are described in, for example, JP-A-2003-208188.

The phoneme selection unit 104 selects phonemes from the corpus 109, using the pitch, phoneme duration, and mel-cepstrum predicted by the prosody prediction unit 103 as parameters of the phoneme selection process. The corpus 109 is, for example, a phoneme database in which a plurality of voices differing in speaker, pitch, and so on are recorded and in which at least the one or more phonemes constituting the recorded voices are stored; it is held in a storage means such as a hard disk drive (HDD).

In the phoneme selection process, the phoneme selection unit 104 uses a scale that matches perceptual characteristics (hereinafter, cost). The cost function, which maps observable feature quantities (such as physical quantities) to a cost (or perceptual evaluation quantity (psychological quantity)) determined based on perceptual experiments, is constructed, for example, as the weighted sum of five subcost functions: a subcost for prosody, a subcost for pitch discontinuity, a subcost for phonetic environment substitution, a subcost for spectral discontinuity, and a subcost for phoneme suitability (for details see, for example, JP-A-2003-208188).
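The weighted sum of the five subcosts named above can be sketched as follows. The dictionary keys, the function name, and any weight values are illustrative assumptions; the source does not give the weights or the form of the individual subcost functions.

```python
# The five subcosts named in the text; the key names are assumptions for illustration.
SUBCOSTS = (
    "prosody",
    "pitch_discontinuity",
    "phonetic_environment",
    "spectral_discontinuity",
    "phoneme_suitability",
)


def total_cost(subcosts: dict, weights: dict) -> float:
    """Combine the five subcost values into one cost as a weighted sum."""
    return sum(weights[name] * subcosts[name] for name in SUBCOSTS)
```

In this form, retuning the cost function after a perceptual experiment amounts to changing the entries of `weights`.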

The feature quantities are, for example, acoustic measures, physical quantities, and linguistic information; the physical quantities are, for example, the spectrum, mel-cepstrum, pitch, sound duration, and rhythm (such as the rise and fall of the sound), but are not limited to these examples. The psychological quantity is, for example, a local cost (or local scale, or subcost), but is not limited to this example. The mapping from feature quantities to psychological quantities uses, for example, a cost function. The smaller the cost value obtained from the cost function, the more natural the synthesized sound tends to be.

Because the cost function according to the present embodiment must reflect the auditory impression, it is determined based on perceptual experiments. Perceptual experiments are carried out because it is difficult to know how much a physical quantity must change before the sound quality degrades. In a perceptual experiment, for example, subjects listen to synthesized speech reading a text aloud and rate its naturalness on a five-point scale (for example, from 1, "unnatural," to 5, "comparable to natural speech"), but the experiment is not limited to this example.

For the phonemes of the text body, the phoneme selection unit 104 uses the corpus 109 to obtain the subcost value for phoneme suitability, the subcost value for phonetic environment substitution, and the subcost value for prosody. The sum of these three subcost values is called the target cost value. The unit then performs target selection: it selects as candidates the phonemes whose target cost value lies within a certain margin of the minimum target cost value.
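The target selection step can be sketched as below, under the reading that candidates are kept when their target cost lies within a margin of the minimum. The tuple layout, the unit identifiers, and the `margin` parameter are assumptions; the source says only that the range has "a certain width".

```python
def preselect(candidates, margin):
    """Keep the units whose target cost is within `margin` of the best one.

    candidates: list of (unit_id, suitability, environment, prosody) tuples,
                i.e. the three subcost values that make up the target cost.
    Returns a list of (unit_id, target_cost) pairs in the original order.
    """
    if not candidates:
        return []
    scored = [(uid, s + e + p) for uid, s, e, p in candidates]
    best = min(cost for _, cost in scored)
    return [(uid, cost) for uid, cost in scored if cost <= best + margin]
```

A wider margin keeps more candidates for the later search over the whole sentence, at the price of more computation.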

As shown in FIG. 4, based on the phoneme candidates obtained by target selection, the phoneme selection unit 104 uses dynamic programming to select, from the start position of the text body ("body start" in FIG. 4) to its end position ("body end" in FIG. 4), the combination of phoneme candidates that minimizes the cost value obtained by adding the target cost value, the subcost value for pitch discontinuity, and the subcost value for spectral discontinuity. FIG. 4 is an explanatory diagram outlining an example of the phoneme selection process according to the present embodiment.
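The dynamic programming search described above can be sketched as a Viterbi-style pass over the candidate lattice. This is an illustrative sketch only: `target_cost` and `join_cost` are hypothetical callables (the join cost standing in for the sum of the pitch-discontinuity and spectral-discontinuity subcosts), unit identifiers are assumed unique within a position, and the candidate list is assumed non-empty.

```python
def select_units(candidates, target_cost, join_cost):
    """Find the candidate sequence with the smallest total cost.

    candidates:  list (one entry per symbol position) of lists of unit ids
    target_cost: function (position, unit) -> float
    join_cost:   function (previous_unit, unit) -> float
    Returns (minimum_total_cost, best_path).
    """
    # best[u] = (cost of the cheapest path ending in unit u, that path)
    best = {u: (target_cost(0, u), [u]) for u in candidates[0]}
    for i in range(1, len(candidates)):
        nxt = {}
        for u in candidates[i]:
            # Pick the predecessor that minimizes accumulated cost plus join cost.
            prev, (cost, path) = min(
                best.items(), key=lambda kv: kv[1][0] + join_cost(kv[0], u)
            )
            nxt[u] = (cost + join_cost(prev, u) + target_cost(i, u), path + [u])
        best = nxt
    return min(best.values(), key=lambda v: v[0])
```

Running the same routine on only the symbols of a keyword, rather than on the whole text body, would correspond to the independent keyword-limited selection that the invention adds.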

In FIG. 4, the top row, running from the start to the end of the body, is the sequence of phoneme symbols (or symbols), for example as output by the text analysis unit 101. The row below it is the sequence of phonemes (or text phonemes) selected from the corpus 109, for example on the basis of costs computed by the phoneme selection unit 104. The next row down is the sequence of phonemes (or keyword phonemes) selected from the corpus 109 on the basis of costs computed for the keyword portions by the keyword priority phoneme selection unit 107. The bottom row is the sequence obtained when, for example, the keyword-portion phonemes among those selected by the phoneme selection unit 104 (the second row from the top in FIG. 4) are replaced with the phonemes selected by the keyword priority phoneme selection unit 107.

In a conventional phoneme selection unit, as described for example in Japanese Patent Application Laid-Open No. 2001-100775, a speech waveform database consisting of speech-segment data of speech waveform signals corresponding to phoneme labels is stored, and speech is synthesized by concatenating speech segments to join an arbitrary phoneme string. In doing so, the phoneme selection unit generates a list of the phoneme pairs contained in the speech waveform database, calculates for each pair in the list the similarity of the prosodic feature parameters and the acoustic feature parameters, and, when the similarity is at or above a predetermined threshold, deletes from the database the speech-segment data of the waveform signal for one member of that pair, thereby reducing the amount of speech data.

The keyword priority phoneme selection unit 107, which is the characteristic part of the speech synthesizer 100, selects phonemes from the corpus 109 using the pitch, phoneme duration, and mel cepstrum predicted by the prosody prediction unit 103 as parameters of the phoneme selection process.

The phoneme selection performed by the keyword priority phoneme selection unit 107 uses a measure (or cost) that matches perceptual characteristics. The cost function, which maps observable feature quantities to psychological quantities, is constructed as the sum of five weighted sub-cost functions: the sub-cost for prosody, the sub-cost for pitch discontinuity, the sub-cost for phonological environment substitution, the sub-cost for spectral discontinuity, and the sub-cost for phoneme suitability. Details are described in, for example, Japanese Patent Application Laid-Open No. 2003-208188.
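The weighted sum of the five sub-costs can be written out as a short sketch. The key names and the weight values below are hypothetical; the source does not give concrete weights, which would come from the perceptual experiments.

```python
# The five sub-costs named in the text (key names are illustrative).
SUBCOSTS = (
    "prosody",
    "pitch_discontinuity",
    "environment_substitution",
    "spectral_discontinuity",
    "suitability",
)


def total_cost(subcosts, weights):
    """Weighted sum of the five sub-cost values for one candidate unit."""
    return sum(weights[name] * subcosts[name] for name in SUBCOSTS)
```

Keeping the weights in a separate mapping makes it straightforward to retune them without touching the sub-cost computations.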

For each phoneme of the text body, the keyword priority phoneme selection unit 107 uses the corpus 109 to obtain a sub-cost value for phoneme suitability, a sub-cost value for phonological environment substitution, and a sub-cost value for prosody. It adds the three sub-cost values together and performs target selection, taking as candidates the phonemes that fall within a range extending a certain margin from the minimum summed value (the target cost value).

As shown in FIG. 4, based on the phoneme candidates obtained by target selection, the keyword priority phoneme selection unit 107 uses dynamic programming to select, over the range within the text body from the keyword start position specified by the user (for example, "keyword start" in FIG. 4) to the keyword end position (for example, "keyword end" in FIG. 4), the combination of phoneme candidates that minimizes the cost value obtained by adding the target cost value, the sub-cost value for pitch discontinuity, and the sub-cost value for spectral discontinuity.

The keyword start position and keyword end position are specified, for example, as a number of phonemes counted from the start position of the text body. For example, as shown in FIG. 4, when the keyword (or keyword phoneme string) is "aoki", the keyword start position "a" can be specified by the number of phonemes from the start position "h" of the text body. The positions of "o", "k", and "i" can be specified by phoneme count in the same way.

The keyword position can be specified in several ways: by tagging the keyword portion in the text body; by first analyzing the text body, displaying its phoneme string to the user, and having the user designate the keyword portion within it; or by holding a set of keywords in advance, having the user select one of them, and searching the text body for the places that match the selected keyword.
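As an illustration of the search-based method, and of expressing positions as phoneme counts from the start of the text body, the following sketch locates every occurrence of a keyword phoneme string. The function name and the zero-based, inclusive `(start, end)` convention are assumptions made for this example.

```python
def find_keyword_spans(text_phonemes, keyword_phonemes):
    """Return (start, end) index pairs, counted in phonemes from the start of
    the text body, for every place where the keyword phoneme string occurs."""
    n, k = len(text_phonemes), len(keyword_phonemes)
    if k == 0:
        return []
    return [
        (i, i + k - 1)
        for i in range(n - k + 1)
        if text_phonemes[i:i + k] == keyword_phonemes
    ]
```

For a text body whose phoneme symbols are `h a j i m e a o k i` and the keyword `a o k i`, the keyword starts six phonemes after the leading `h`.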

As shown in FIG. 4, when there are two or more keywords, the keyword priority phoneme selection unit 107 calculates the cost values for the keyword portions, for example, in the order in which the keywords appear from the start to the end of the body shown in FIG. 4. The calculation is not limited to this example. It can also be carried out by assigning priorities to the keywords and calculating the cost values in priority order; by assigning weights to the keywords and recalculating weighted costs in order of appearance; or by converting the weights assigned to the keywords into priorities and calculating the cost values in priority order.
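The alternative processing orders can be sketched as below. The dictionary keys and mode names are hypothetical, and the rule that a larger weight maps to a higher priority is an assumption: the source says only that weights are converted into priorities.

```python
def order_keywords(keywords, mode="appearance"):
    """Order keywords for cost calculation.

    keywords: list of dicts with 'start' (phoneme offset) and, depending on
    the mode, 'priority' (1 = first) or 'weight'.
    """
    if mode == "appearance":
        return sorted(keywords, key=lambda k: k["start"])
    if mode == "priority":
        return sorted(keywords, key=lambda k: k["priority"])
    if mode == "weight":
        # Assumed conversion: heavier keywords are processed first.
        return sorted(keywords, key=lambda k: -k["weight"])
    raise ValueError(f"unknown mode: {mode}")
```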

For the keyword portions of the phoneme candidate combination selected for the entire text body, the keyword priority phoneme selection unit 107 obtains, in order of keyword appearance or in priority order, a weighted cost value by multiplying the weight by the cost value.

Next, the keyword priority phoneme selection unit 107 compares the weighted cost values thus obtained and, in ascending order of weighted cost value, replaces the keyword portions with the combinations of phoneme candidates obtained by keyword priority phoneme selection.
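The weighted comparison and replacement can be sketched as follows. The data layout (dictionaries with `start`, `units`, `cost`, and `weight`) is hypothetical, and the sketch assumes that each keyword selection has the same number of units as the span it replaces.

```python
def replace_by_weighted_cost(text_units, keyword_results):
    """Overwrite keyword spans of the text-body selection, smallest weighted cost first.

    text_units: units selected for the whole text body.
    keyword_results: list of dicts with
        'start'  - phoneme offset of the keyword in the text body,
        'units'  - units chosen by keyword priority selection,
        'cost'   - minimum cost value found for the keyword,
        'weight' - weight assigned to the keyword.
    Returns (new_units, processing_order_of_start_offsets).
    """
    result = list(text_units)
    ordered = sorted(keyword_results, key=lambda k: k["weight"] * k["cost"])
    for kw in ordered:
        start = kw["start"]
        result[start:start + len(kw["units"])] = kw["units"]
    return result, [kw["start"] for kw in ordered]
```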

The phoneme connection unit (or synthesis unit) 111 connects the selected phonemes (or waveform segments), for example, in order from the start position of the text body using a cross-correlation method and an interpolation method, and generates synthesized speech (or a synthesized waveform). When the synthesized speech is output, the user can listen to the text body.
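The source names cross-correlation and interpolation but does not detail them. The sketch below shows one common way such a join could work, as an assumption rather than the patented procedure: an unnormalized cross-correlation picks the alignment of the next segment, and a linear cross-fade interpolates across the overlap. It assumes `overlap >= 1` and `len(b) >= overlap + max_lag`.

```python
def best_lag(tail, head, max_lag):
    """Lag in [0, max_lag] at which `head` correlates best with `tail`."""
    def corr(lag):
        return sum(t * head[i + lag] for i, t in enumerate(tail))
    return max(range(max_lag + 1), key=corr)


def join_segments(a, b, overlap, max_lag):
    """Join waveform b onto waveform a (lists of samples)."""
    tail = a[-overlap:]
    aligned = b[best_lag(tail, b, max_lag):]
    # Linear interpolation from the end of a into the start of aligned b.
    mixed = [
        (1 - w) * x + w * y
        for x, y, w in (
            (tail[i], aligned[i], (i + 1) / (overlap + 1)) for i in range(overlap)
        )
    ]
    return a[:-overlap] + mixed + aligned[overlap:]
```

A practical system would normally normalize the correlation so that loud regions are not favored; that refinement is omitted here for brevity.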

The speech synthesizer 100 is a device that can output synthesized speech based on a text body and keywords, and by outputting that speech it can read the text body aloud. More specifically, the speech synthesizer 100 may be, for example, a PC equipped with a CPU, a memory, an HDD (hard disk drive), an input unit such as a mouse (not shown), and a display unit such as a liquid crystal display (not shown), but it is not limited to this example.

The display unit of the speech synthesizer 100 according to this embodiment outputs display screen data and audio data that the CPU has processed into a displayable form. Examples of the display unit include a TV and a liquid crystal display device; both are equipped with speakers and can output audio and moving images as well as still images.

The input unit is composed of, for example, operation means that can receive operation instructions from the user, such as a pointing device (a mouse, trackball, trackpad, stylus pen, touch panel, or joystick), a keyboard, buttons, switches, and levers, together with an input control circuit that generates an input signal and outputs it to the CPU.

By operating this input unit, the user of the speech synthesizer 100 can input various data to the speech synthesizer 100, select keywords, and instruct it to perform processing operations such as outputting synthesized speech.

A conventional speech synthesizer, as described for example in Japanese Patent Application Laid-Open No. H10-049193, comprises means for storing speech segments; means for extracting and outputting prosodic feature parameters and the like; means for storing the prosodic feature parameters and the like; means for determining a weight coefficient vector that represents the degree of contribution of the acoustic feature parameters; means for storing the weight coefficient vector; speech unit selection means that, for the phoneme string of an input natural utterance, searches for the combination of phoneme candidates that minimizes a cost including a target cost and a concatenation cost, and outputs index information; and means for synthesizing and outputting the input speech.

A further conventional speech synthesizer, improved to give better sound quality, comprises, as described for example in Japanese Patent Application Laid-Open No. 2003-208188, a language processing unit, a prosodic pattern generation unit, a phoneme unit selection unit, a waveform dictionary, and a speech waveform generation unit. The language processing unit performs morphological analysis and dependency analysis and converts the text into phoneme symbols, accent symbols, and the like. The prosodic pattern generation unit uses the phoneme symbols, the accent symbol string, and the part-of-speech information of the input text obtained from the morphological analysis result to estimate the phoneme duration (length of the voice), the fundamental frequency (pitch of the voice, F₀), the power at the vowel center (loudness of the voice), and so on. The phoneme unit selection unit uses a cost function based on physical quantities and on psychological quantities obtained directly from the results of perceptual experiments. The cost function consists of five sub-cost functions: the sub-cost for prosody, the sub-cost for pitch discontinuity, the sub-cost for phonological environment substitution, the sub-cost for spectral discontinuity, and the sub-cost for phoneme suitability.

A conventional speech synthesizer that reads text aloud, as described in Japanese Patent Application Laid-Open No. H10-274999, for example obtains the cost of a keyword from the entire text and changes at least one of the speaker, volume, pitch, and speaking rate of the speech, thereby generating a speech signal in which the output speech for the keyword portion can be distinguished from the output speech for the other portions.

(Speech synthesis method)
As shown in FIG. 2, when the user operates the input unit, the input consists of, for example, a text body containing one or more keywords to be emphasized, together with the start position and end position of each keyword to be emphasized (S201). The input is not limited to this example; the user may instead designate the entire range from the keyword start position to the keyword end position.

The keyword start position and keyword end position input in step S201 are specified, for example, as the number of phonemes from the start position of the text body, but are not limited to this example.

The text body specified by the user in step S201, which contains one or more keywords to be emphasized, is input to the text analysis unit 101 shown in FIG. 1.

As shown in FIG. 2, the text analysis unit 101 analyzes the text body, which is written for example in kanji and kana characters, and converts it into phoneme symbols (S203).

The text analysis performed by the text analysis unit 101 on the text body includes, for example, morphological analysis and dependency analysis. The text analysis unit 101 performs morphological analysis and dependency analysis on the text body and outputs an accent symbol string and a morphological analysis result representing the part-of-speech information of the text body.

In the speech synthesis method of the first embodiment, the output for the text body, which holds the phoneme symbols converted from the text body together with the morphological analysis result, is referred to as the text body intermediate language, as shown for example in FIG. 2, although it is not limited to this example.

The phoneme symbols 501 obtained by converting the text body (the text body phoneme symbols shown in FIG. 3A) are, for example, "hajime...oki..." as shown in FIG. 3A.

Next, the text body intermediate language output by the text analysis unit 101 is input to the prosody prediction unit 103 shown in FIG. 1.

As shown in FIG. 2, the prosody prediction unit 103 predicts the pitch (height of the voice, fundamental frequency F₀), the phoneme duration (length of the voice), the fundamental frequency (height of the voice, F₀), and the mel cepstrum, which represents the components of the waveform (S207).

When the prosody prediction unit 103 has predicted the pitch, phoneme duration, fundamental frequency, and mel cepstrum in step S207, the prosody prediction information holding these predictions is input to the phoneme selection unit 104.

The prosody prediction information is information that predicts the feature quantities of each phoneme symbol in terms of, for example, the pitch, phoneme duration, fundamental frequency, or mel cepstrum, but it is not limited to this example.

To explain the prosody prediction information with reference to FIG. 3A, the prosody prediction information 502 comprises, for each phoneme symbol 501 (each of the symbols "hajime...oki..." listed vertically in the prosody prediction information 502 of FIG. 3A), at least "start", the start time of the phoneme; "duration", the duration of the phoneme; "pitch", one or more pitch values of the phoneme; and "Melcep", one or more mel cepstra of the phoneme.
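The per-phoneme record described above could be represented as in the following sketch. The class name, field types, and units (seconds, Hz) are assumptions, as is the helper that derives each start time by accumulating the preceding durations; the source specifies only the four field names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProsodyEntry:
    symbol: str                     # phoneme symbol, e.g. "h"
    start: float                    # start time of the phoneme (assumed seconds)
    duration: float                 # duration of the phoneme (assumed seconds)
    pitch: List[float] = field(default_factory=list)         # one or more pitch values
    melcep: List[List[float]] = field(default_factory=list)  # one or more mel-cepstrum vectors


def build_prosody_info(symbols, durations):
    """Build entries whose start times are the running sum of the durations."""
    entries, t = [], 0.0
    for sym, dur in zip(symbols, durations):
        entries.append(ProsodyEntry(sym, t, dur))
        t += dur
    return entries
```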

As shown in FIG. 2, the phoneme selection unit 104 uses the corpus 109 to select the optimal phonemes from the body start position to the body end position of the text body (S214).

Describing step S214 more specifically, the phoneme selection unit 104 uses the corpus 109 to obtain, for each phoneme of the text body, a sub-cost value for phoneme suitability, a sub-cost value for phonological environment substitution, and a sub-cost value for prosody. It adds these three sub-cost values together and performs target selection, taking as candidates the phonemes that fall within a range extending a certain margin from the minimum summed value (the target cost value).

As shown in FIG. 4, based on the phoneme candidates obtained by target selection, the phoneme selection unit 104 uses dynamic programming to select, from the body start position to the body end position of the text body, the combination of phoneme candidates that minimizes the cost value obtained by adding the target cost value, the sub-cost value for pitch discontinuity, and the sub-cost value for spectral discontinuity. The smaller the cost value obtained from the cost function, the more natural the synthesized speech is taken to be, although this is not limiting.

When the phoneme selection unit 104 has completed phoneme selection for the entire text body (S214), processing moves to the keyword priority phoneme selection unit 107 in order to select the optimal phonemes for the keyword portions (S216, S219).

Referring to FIG. 3A, the phoneme candidate combination 503 for the entire text body describes, for the phoneme corresponding to each of the text body phoneme symbols 501 ("hajime...oki..." in the example of FIG. 3A), which speech file in the corpus 109 contains it and at which position from the beginning of that file it is located. The information in the combination 503 indicates, for example, that "the phoneme corresponding to the phoneme symbol "h" is the phoneme located third from the beginning of one speech file".

Further, as shown in FIG. 3B, the keyword priority phoneme selection unit 107 receives as input the prosody prediction information 502, the phoneme candidate combination 503 for the entire text body output from the phoneme selection unit 104, and the keyword positions specified by the user.

As shown in FIG. 2, the keyword priority phoneme selection unit 107 first determines whether the current keyword is the last keyword to appear in the text body (S216). The processing of step S501 shown in FIG. 5, described later, is substantially the same as that of step S216.

The keyword priority phoneme selection unit 107 performs keyword priority phoneme selection in the order in which the keywords appear from the body start position of the text body, or in priority order based on priorities assigned to the keywords, or in the priority order obtained by converting the weights assigned to one or more keywords in the text body into priorities. For this purpose, step S216 shown in FIG. 2 must determine whether the current keyword is the last one.

If step S216 determines that the keyword is not the last one, the keyword priority phoneme selection process shown in FIG. 2 is performed.

If, on the other hand, step S216 determines that the keyword is the last one, phoneme selection for the keyword portions is complete and the combination of phoneme candidates from which the waveform segments can be obtained is fixed, so processing moves to the phoneme connection unit 111 shown in FIG. 1.

Next, as shown in FIG. 2, the keyword priority phoneme selection unit 107 uses the corpus 109 to select the optimal phonemes for the keyword portions in the text body (S219).

The keyword priority phoneme selection process (S219) performed by the keyword priority phoneme selection unit 107 is now described with reference to FIG. 5. FIG. 5 is a flowchart outlining the keyword priority phoneme selection process according to the first embodiment.

As shown in FIG. 5, the keyword priority phoneme selection process (S219) of FIG. 2 includes a process of checking whether the current keyword is the last one (S501), a process of performing target selection for the keyword (S505), a process of setting the combination of phoneme candidates that minimizes the cost value for the keyword (S509), and a process of replacing the keyword portion of the phoneme candidate combination for the entire text body with the keyword's combination of phoneme candidates (S513).

In the process of performing target selection for the keyword (S505), the corpus 109 is used to obtain, for each phoneme of the keyword portion, a sub-cost value for phoneme suitability, a sub-cost value for phonological environment substitution, and a sub-cost value for prosody. The three sub-cost values are added together, and target selection takes as candidates the phonemes that fall within a range extending a certain margin from the minimum summed value (the target cost value).

In the process of setting the combination of phoneme candidates that minimizes the cost value for the keyword (S509), as shown in FIG. 4, dynamic programming is used, based on the phoneme candidates obtained by target selection, to select over the range from the user-specified keyword start position to the keyword end position in the text body the combination of phoneme candidates that minimizes the cost value obtained by adding the target cost value, the sub-cost value for pitch discontinuity, and the sub-cost value for spectral discontinuity.

The keyword start position and keyword end position are specified, for example, by the number of phonemes counted from the body start position of the text body. They are not limited to this example and may instead be specified, for example, by the number of phonemes counted from the body end position.

Methods of specifying the keyword position include, for example, tagging the keyword portion in the text body; first analyzing the text body, displaying its phoneme string to the user, and having the user designate the keyword portion; and holding one or more keywords in advance, having the user select a keyword, and searching the text body for the places that match the selected keyword. The methods are not limited to these examples.

As shown in FIG. 4, when there are two or more keyword phoneme candidates, the cost values for the keyword portions are calculated, for example, in the order in which the keywords appear from the body start position toward the body end position of the text body.

In this embodiment, the calculation of the cost values for the keyword portions is not limited to the order in which the keywords appear from the body start position toward the body end position. It can also be carried out, for example, by assigning priorities to the keywords and calculating in priority order.

As shown in FIG. 5, in the process of replacing the keyword portion of the phoneme candidate combination for the entire text body (S513), the keyword portions of the combination selected for the entire text body (such as the combination 503 in FIG. 3A) are replaced, in order of keyword appearance or in priority order, with the combinations of phoneme candidates obtained by keyword priority phoneme selection. As shown in FIG. 5, step S513 is executed successively, each time the phoneme selection process of step S214 shown in FIG. 2 is performed, and the phonemes are replaced.

After the replacement of the keyword portion's combination of phoneme candidates is complete (S513), processing returns to step S501, which is substantially the same as step S216 shown in FIG. 2, and the processing of steps S505 to S513 is repeated until the last keyword has been handled.
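The loop of FIG. 5 can be sketched as below. The callable `select_span` is hypothetical: it stands in for target selection (S505) followed by the minimum-cost search over the keyword span (S509), and is assumed to return one unit per phoneme in the span.

```python
def keyword_priority_selection(text_units, keyword_spans, select_span):
    """Replace each keyword span of the text-body selection in turn.

    text_units: units selected for the whole text body (step S214).
    keyword_spans: (start, end) phoneme offsets, inclusive, already in
        processing order (order of appearance or priority order).
    select_span(start, end): hypothetical callable covering S505 and S509.
    """
    result = list(text_units)
    for start, end in keyword_spans:      # loop ends after the last keyword (S501)
        units = select_span(start, end)   # S505 and S509
        result[start:end + 1] = units     # S513: overwrite the keyword portion
    return result
```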

A modification of the keyword priority phoneme selection process according to the first embodiment is now described with reference to FIG. 6. FIG. 6 is a flowchart outlining the keyword priority phoneme selection process according to the first embodiment.

As shown in FIG. 6, the keyword priority phoneme selection process includes a process of determining whether the current keyword is the last one (S502), a process of performing target selection for the keyword (S505), a process of setting the combination of phoneme candidates that minimizes the cost value and obtaining a weighted cost value (S510), and a process of replacing the keyword portion of the phoneme candidate combination for the entire text body with the keyword's combination of phoneme candidates (S521).

As shown in FIG. 6, the process (S510) sets, for the keyword, the combination of phoneme candidates that minimizes the cost value, as described above, and obtains a weighted cost value by multiplying the cost value by a weight. In the calculation of the cost values for the keyword portions (S510), weights are assigned to one or more keywords, and the weight and the cost value are multiplied in the order in which the keywords appear. Alternatively, the weights assigned to the keywords are converted into priorities and the calculation is performed in priority order.

次に，図６に示すように，最後のキーワードであるか否かを判定し，判定した結果最後のキーワードである場合（Ｓ５０２），図６の重み付きコスト値の小さい値をもつキーワード順に，テキスト本文全体の音韻候補の組合せの，キーワード箇所の音韻候補の組合せの置き換え処理（Ｓ５２１）に処理が移行する。 Next, as shown in FIG. 6, it is determined whether or not the current keyword is the last keyword (S502). If it is the last keyword, the process proceeds to the replacement process (S521), in which the keyword portions of the combination of phoneme candidates for the entire text body are replaced in ascending order of the keywords' weighted cost values.

図６に示すように，テキスト本文全体の音韻候補の組合せの，キーワード箇所の音韻候補の組合せの置き換え処理（Ｓ５２１）では，テキスト本文全体で選択した音韻候補の組合せの，キーワードが出現する順に，重みとコスト値とを乗算した重み付きコスト値を求め，各々を比較し，重み付きコスト値の小さい順で，キーワード優先音韻選択した音韻候補の組合せに，置き換える。 As shown in FIG. 6, in the replacement process (S521), a weighted cost value is obtained for each keyword by multiplying the weight by the cost value, in the order in which the keywords appear, and the weighted cost values are compared with one another. Then, in ascending order of weighted cost value, the keyword portions of the combination of phoneme candidates selected for the entire text body are replaced with the combinations of phoneme candidates selected by keyword priority phoneme selection.
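
As an illustration only, the ordering and replacement described for step S521 can be sketched in Python as follows. The KeywordResult record, its field names, and the representation of a phoneme candidate combination as a list of unit identifiers are assumptions made for this sketch and do not appear in the specification. Keyword spans are assumed not to overlap.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class KeywordResult:
    start: int          # index of the keyword's first phoneme in the text body
    end: int            # index one past the keyword's last phoneme
    weight: float       # weight assigned to the keyword (hypothetical field)
    cost: float         # minimum cost value found for the keyword in S510
    units: List[str]    # unit chosen for each phoneme of the keyword


def replace_by_weighted_cost(
    body_units: List[str], keywords: List[KeywordResult]
) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Replace keyword spans in ascending order of weight * cost (cf. S521)."""
    result = list(body_units)  # leave the caller's list untouched
    order = sorted(keywords, key=lambda k: k.weight * k.cost)
    for kw in order:
        result[kw.start:kw.end] = kw.units
    return result, [(kw.start, kw.end) for kw in order]
```

With two keywords whose weighted cost values are 10.0 and 4.0, the second keyword is handled first. For non-overlapping spans the final combination is the same whichever order is used.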

キーワード部分の音韻候補の組合せの置き換えが完了した後（Ｓ５２１），図１に示す音韻接続部１１１が実行する処理に移行する。 After the replacement of the combination of phoneme candidates for the keyword portion is completed (S521), the process proceeds to the process executed by the phoneme connection unit 111 shown in FIG. 1.

再び図２に戻ると，図２に示すように，音韻接続部１１１によって実行される波形セグメントを接続する処理（Ｓ２２３）は，現在取り扱っている波形セグメントと，次の波形セグメントとを接続する。 Returning to FIG. 2 again, as shown in FIG. 2, in the process of connecting waveform segments executed by the phoneme connection unit 111 (S223), the currently handled waveform segment is connected to the next waveform segment.

音韻接続部１１１によって，全ての波形セグメントの接続が完了すると（Ｓ２２３），図２に示すように，音韻接続部１１１によって接続された波形セグメントは，合成音（又は，合成音声）として出力される（Ｓ２２５）。上記合成音がスピーカ等の出力部から出力されることによって，音声合成装置１００は，テキスト本文のうちキーワード部分を強調しながら読上げることができる。 When the phoneme connection unit 111 has connected all the waveform segments (S223), the connected waveform segments are output as synthesized sound (or synthesized speech) (S225), as shown in FIG. 2. By outputting the synthesized sound from an output unit such as a speaker, the speech synthesizer 100 can read the text body aloud while emphasizing the keyword portions.

以上で，第１の実施の形態にかかる音声合成装置１００による音声合成方法の一連の処理について説明を終了するが，かかる音声合成方法によって以下に示すような優れた効果が存在する。 This concludes the description of the series of processes of the speech synthesis method performed by the speech synthesizer 100 according to the first embodiment. This speech synthesis method provides the following advantageous effect.
（１）テキスト本文中の強調させたいキーワード位置を予め利用者（ユーザ）が音声合成装置１００に対して指定することで，音声合成装置１００はキーワード部分に対する読上げに関して自然性が高い読上げ（自然な読上げ）をすることができ，出力される合成音の音質が良い上，キーワードの読上げが強調される。 (1) When the user designates in advance, to the speech synthesizer 100, the positions of the keywords to be emphasized in the text body, the speech synthesizer 100 can read the keyword portions aloud with high naturalness (natural reading), so that the output synthesized sound has good sound quality and the reading of the keywords is emphasized.

（第２の実施の形態について） (Second Embodiment)
次に，図７を参照しながら，第２の実施の形態にかかる音声合成方法について説明する。なお，以下説明する第２の実施の形態にかかる音声合成方法では，上記説明した第１の実施の形態にかかる音声合成方法と相違する点について詳細に説明し，その他の点については略同一であるとして説明は省略する。 Next, a speech synthesis method according to the second embodiment will be described with reference to FIG. 7. In the following, the points in which the speech synthesis method according to the second embodiment differs from the speech synthesis method according to the first embodiment described above are described in detail; the other points are substantially the same, and their description is omitted.

図７は，第２の実施の形態にかかるキーワード優先音韻選択処理の概略を示すフローチャートである。第２の実施の形態にかかる音声合成方法では，図２に示す第１の実施の形態にかかる音声合成方法のうちのキーワード優先音韻選択処理が改良されている点で，異なる。以下，図７を参照しながら説明するが，図７に示す第２の実施の形態にかかるキーワード優先音韻選択処理では，ステップＳ５１１等に示すように，キーワードに対するコスト値をＣ_ｋｅｙとする。 FIG. 7 is a flowchart showing an outline of the keyword priority phoneme selection process according to the second embodiment. The speech synthesis method according to the second embodiment differs from the speech synthesis method according to the first embodiment shown in FIG. 2 in that the keyword priority phoneme selection process is improved. In the following description with reference to FIG. 7, the cost value for a keyword in the keyword priority phoneme selection process according to the second embodiment is denoted C_key, as shown in step S511 and elsewhere.

図７のステップＳ５１５に示すように，テキスト本文中のキーワード部分に対して，音韻選択部１０４で求めた，音韻の適合性に関するサブコスト値と，音韻環境代替に関するサブコスト値と，韻律に関するサブコスト値とを，コーパス１０９を用いて取得し，上記３つのサブコスト値を足し合わせた，最小のサブコスト値（ターゲットコスト値）と，ピッチの不連続に関するサブコスト値と，スペクトルの不連続に関するサブコスト値とを，足し合わせたコスト値Ｃ_ｏｒｉｇを取得する。 As shown in step S515 of FIG. 7, for the keyword portion of the text body, the sub-cost value for phoneme suitability, the sub-cost value for phonemic environment substitution, and the sub-cost value for prosody, as obtained by the phoneme selection unit 104, are acquired using the corpus 109. The cost value C_orig is then acquired as the sum of the minimum sub-cost value obtained by adding these three sub-cost values (the target cost value), the sub-cost value for pitch discontinuity, and the sub-cost value for spectral discontinuity.

図７に示すキーワード優先音韻選択処理では，ステップＳ５１７に示すように，キーワードに対するコスト値をＣ_ｋｅｙと，テキスト本文中のキーワード部分で，音韻選択部１０４で求めたコスト値Ｃ_ｏｒｉｇとの差分をとり，その差分値と閾値（Ｔｈｒｅｓｈｏｌｄ）に対する大小を判定する。なお，本実施の形態に係る閾値は音声合成方法の使用環境などに応じて適宜変更することが可能である。 In the keyword priority phoneme selection process shown in FIG. 7, as shown in step S517, the difference between the cost value C_key for the keyword and the cost value C_orig obtained by the phoneme selection unit 104 for the keyword portion of the text body is calculated, and the difference is compared with a threshold (Threshold). The threshold according to the present embodiment can be changed as appropriate according to, for example, the environment in which the speech synthesis method is used.

（音声合成方法について） (Speech Synthesis Method)
図２に示すように，第２の実施の形態にかかる音声合成方法では，まず，第１の実施の形態にかかる音声合成方法と同様に，例えば，強調させたい１又は２以上のキーワードを含んだテキスト本文と，その強調させたいキーワードの開始位置と，キーワードの終了位置とを，利用者が入力する（Ｓ２０１）。なお，かかる例に限定されず，例えば，利用者は，キーワードの開始位置からキーワードの終了位置の範囲全てを指定しても良い。 As shown in FIG. 2, in the speech synthesis method according to the second embodiment, as in the first embodiment, the user first inputs, for example, a text body containing one or more keywords to be emphasized, together with the start position and the end position of each keyword to be emphasized (S201). The method is not limited to this example; for example, the user may designate the entire range from the keyword start position to the keyword end position.

次に，第２の実施の形態にかかる音声合成方法では，図２に示すようにテキスト解析部１０１は，例えば，漢字仮名文字で表現されたテキスト本文に対してテキスト解析し，音韻記号に変換する（Ｓ２０３）。 Next, in the speech synthesis method according to the second embodiment, as shown in FIG. 2, the text analysis unit 101 performs text analysis on a text body expressed by, for example, kanji kana characters and converts them into phonetic symbols. (S203).

テキスト解析部１０１は，上記漢字仮名文字などで表現されたテキスト本文に対し，例えば，形態素解析，係り受け解析を行い，アクセント記号列と，テキスト本文の品詞情報を表す形態素解析結果とを出力する。 The text analysis unit 101 performs, for example, morphological analysis and dependency analysis on the text body expressed in kanji and kana characters, and outputs an accent symbol string and a morphological analysis result representing the part-of-speech information of the text body.

なお，第２の実施の形態にかかる音声合成方法では，テキスト本文に対して，テキスト本文から変換された音韻記号と，上記形態素解析結果との情報を持つ出力結果を，例えば，テキスト本文中間言語とするが，かかる例に限定されない。 In the speech synthesis method according to the second embodiment, the output for the text body, which holds the phoneme symbols converted from the text body and the morphological analysis result, is referred to as, for example, the text body intermediate language, although it is not limited to this example.

ここで，第１の実施の形態にかかる音声合成方法においても説明したが，テキスト本文が変換された音韻記号５０１（図３Ａに示すテキスト本文音韻記号）は，例えば，図３Ａに示すように，「ｈａｊｉｍｅ…ｏｋｉ…」となる。 Here, as also explained for the speech synthesis method according to the first embodiment, the phoneme symbols 501 obtained by converting the text body (the text body phoneme symbols shown in FIG. 3A) are, for example, "hajime...oki..." as shown in FIG. 3A.

上記韻律予測部１０３により予測（Ｓ２０７）した情報（例えば，ピッチ，音韻継続時間長，基本周波数，メルケプストラム）を持つ韻律予測情報を，音韻選択部１０４に入力する。 Prosody prediction information having information (for example, pitch, phoneme duration, fundamental frequency, mel cepstrum) predicted by the prosody prediction unit 103 (S207) is input to the phoneme selection unit 104.

ここで，第１の実施の形態にかかる音声合成方法でも説明したが，図３Ａを参照すると，韻律予測情報５０２は，音韻記号５０１の音韻記号毎（図３Ａに示す韻律予測情報５０２の縦方向に図示された“ｈａｊｉｍｅ…ｏｋｉ…”の音韻記号毎）に，音韻の開始時間を表す“ｓｔａｒｔ”と，音韻の継続時間長を表す“ｄｕｒａｔｉｏｎ”と，音韻の１又は２以上のピッチを表す“ｐｉｔｃｈ”と，音韻の１又は２以上のメルケプストラムを表す“Ｍｅｌｃｅｐ”とから少なくとも構成される。 Here, as also explained for the speech synthesis method according to the first embodiment, referring to FIG. 3A, the prosody prediction information 502 includes, for each phoneme symbol of the phoneme symbols 501 (each of the phoneme symbols "hajime...oki..." listed vertically in the prosody prediction information 502 of FIG. 3A), at least "start", which represents the start time of the phoneme, "duration", which represents the duration of the phoneme, "pitch", which represents one or more pitch values of the phoneme, and "Melcep", which represents one or more mel-cepstrum values of the phoneme.
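
One possible in-memory layout for the prosody prediction information 502 is sketched below in Python. The four field names follow the description of FIG. 3A above. The class name PhonemeProsody, the helper with_start_times, the use of milliseconds, and the derivation of each start time from the preceding durations are assumptions of this sketch, not details given in the specification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PhonemeProsody:
    symbol: str                     # phoneme symbol, e.g. "h"
    start: float                    # start time of the phoneme (ms assumed)
    duration: float                 # duration of the phoneme (ms assumed)
    pitch: List[float] = field(default_factory=list)    # one or more pitch values
    melcep: List[float] = field(default_factory=list)   # one or more mel-cepstrum values


def with_start_times(symbols: List[str], durations: List[float]) -> List[PhonemeProsody]:
    """Build one record per phoneme, accumulating durations into start times."""
    rows = []
    t = 0.0
    for sym, dur in zip(symbols, durations):
        rows.append(PhonemeProsody(symbol=sym, start=t, duration=dur))
        t += dur
    return rows
```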

上記ステップＳ２１４について，第１の実施形態でも説明したが，より具体的に説明すると，音韻選択部１０４は，テキスト本文の音韻に対して，音韻の適合性に関するサブコスト値と，音韻環境代替に関するサブコスト値と，韻律に関するサブコスト値とを，コーパス１０９を用いて取得し，上記３つのサブコスト値を足し合わせた，最小のサブコスト値（ターゲットコスト値）から，ある程度の幅を持たせた値の範囲内（例えば，ターゲットコスト値に±５を加算した値の範囲内）に含まれる，音韻を候補としてターゲット選択する。 Step S214 was also described in the first embodiment. More specifically, for each phoneme of the text body, the phoneme selection unit 104 acquires, using the corpus 109, the sub-cost value for phoneme suitability, the sub-cost value for phonemic environment substitution, and the sub-cost value for prosody. It then performs target selection, selecting as candidates the phonemes whose summed sub-cost value falls within a certain margin of the minimum summed sub-cost value (the target cost value), for example within ±5 of the target cost value.
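
The target selection of step S214 can be illustrated by the following minimal Python sketch. The tuple layout of the candidates and the function name target_select are assumptions for illustration. Because the target cost value is the minimum of the summed sub-cost values, only the upper side of the ±5 range can contain candidates, so the sketch checks the upper bound only.

```python
from typing import List, Tuple

# (unit id, suitability sub-cost, environment-substitution sub-cost, prosody sub-cost)
Candidate = Tuple[str, float, float, float]


def target_select(candidates: List[Candidate], margin: float = 5.0) -> List[Tuple[str, float]]:
    """Keep the units whose summed sub-cost lies within `margin` of the minimum."""
    scored = [(uid, fit + env + pro) for uid, fit, env, pro in candidates]
    target_cost = min(score for _, score in scored)  # the target cost value
    return [(uid, score) for uid, score in scored if score <= target_cost + margin]
```

For example, with summed sub-costs of 6.0, 12.0, and 6.5, the target cost value is 6.0 and the unit with cost 12.0 is dropped.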

図４に示すように，音韻選択部１０４では，ターゲット選択で求めた音韻候補をもとに，テキスト本文の本文開始位置からテキスト本文の本文終了位置までの範囲について，ターゲットコスト値と，ピッチの不連続に関するサブコスト値と，スペクトルの不連続に関するサブコスト値とを，足し合わせたコスト値が最小となる音韻候補の組合せを，動的計画法を用いて選択する。なお，コスト関数から求まるコスト値が最小となればなるほど，合成音の自然性が高くなる傾向がある。 As shown in FIG. 4, based on the phoneme candidates obtained by target selection, the phoneme selection unit 104 uses dynamic programming to select, over the range from the start position to the end position of the text body, the combination of phoneme candidates that minimizes the cost value obtained by adding the target cost value, the sub-cost value for pitch discontinuity, and the sub-cost value for spectral discontinuity. The smaller the cost value obtained from the cost function, the more natural the synthesized sound tends to be.
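
The specification does not spell out the dynamic programming procedure. A common way to realize such a search is a Viterbi-style pass over the candidate lattice, sketched below in Python under that assumption. The function name select_units, the dictionary-per-phoneme input, and the single join_cost callback (standing in for the sum of the pitch and spectral discontinuity sub-costs) are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def select_units(
    target_costs: List[Dict[str, float]],
    join_cost: Callable[[str, str], float],
) -> Tuple[List[str], float]:
    """Return the unit sequence with the smallest total cost, and that cost.

    target_costs[i] maps each candidate unit for phoneme i to its target cost.
    join_cost(prev, cur) is the discontinuity cost of concatenating two units.
    """
    # best[u] = (cost of the cheapest path ending in unit u, that path)
    best = {u: (c, [u]) for u, c in target_costs[0].items()}
    for column in target_costs[1:]:
        new_best = {}
        for u, tc in column.items():
            # Pick the predecessor that gives the cheapest path into u.
            prev, (pc, path) = min(
                best.items(), key=lambda kv: kv[1][0] + join_cost(kv[0], u)
            )
            new_best[u] = (pc + join_cost(prev, u) + tc, path + [u])
        best = new_best
    total, path = min(best.values(), key=lambda v: v[0])
    return path, total
```

Choosing the cheapest unit for each phoneme independently can miss the cheapest combination, because a unit with a slightly higher target cost may join its neighbor with a smaller discontinuity. The same routine could be applied to the keyword span alone by passing only the columns between the keyword start and end positions.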

上記テキスト本文全体の音韻選択が完了すると（Ｓ２１４），次に，キーワード部分の最適な音韻を選択するために，キーワード優先音韻選択部１０７による処理に移行する（Ｓ２１６，Ｓ２１９）。 When the phoneme selection of the entire text body is completed (S214), the process proceeds to the process by the keyword priority phoneme selection unit 107 in order to select the optimum phoneme of the keyword part (S216, S219).

図３Ａに示すように，テキスト本文全体の音韻候補の組合せ５０３には，テキスト本文音韻記号５０１（図３Ａで示す例では，“ｈａｊｉｍｅ…ｏｋｉ…”）の各音韻記号に対応する音韻が，コーパス１０９内に存在するどの音声ファイルで，先頭から何番目に位置する音韻であるかを示す情報が記述されている。 As shown in FIG. 3A, the combination 503 of phoneme candidates for the entire text body describes, for the phoneme corresponding to each phoneme symbol of the text body phoneme symbols 501 ("hajime...oki..." in the example of FIG. 3A), which speech file in the corpus 109 contains the phoneme and at what position, counted from the beginning of that file, the phoneme is located.

図２に示すように，キーワード優先音韻選択部１０７は，第１の実施の形態でも説明したが，テキスト本文内の最後に出現するキーワードであるかどうかを判定する（Ｓ２１６）。 As shown in FIG. 2, as described in the first embodiment, the keyword priority phoneme selection unit 107 determines whether or not the keyword appears last in the text body (S216).

キーワード優先音韻選択部１０７は，テキスト本文の本文開始位置から出現するキーワード順に処理を行う，あるいは，テキスト本文のキーワードに付与された優先順位をもとに優先順に処理を行う，あるいは，テキスト本文中の１又は２以上のキーワードに付与された重みを優先順位に変換し優先順に処理を行う必要がある。そのためには，最後のキーワードであるかどうか図２に示すステップＳ２１６で判定する必要がある。 The keyword priority phoneme selection unit 107 performs processing in the order of keywords appearing from the text start position of the text body, or performs processing in order of priority based on the priority assigned to the keywords of the text body, or in the text body It is necessary to convert the weights assigned to one or more of the keywords into priority levels and perform processing in the priority order. For that purpose, it is necessary to determine whether or not it is the last keyword in step S216 shown in FIG.

次に，図２に示すように，キーワード優先音韻選択処理（Ｓ２１９）は，テキスト本文内のキーワード部分から，コーパス１０９を用いて，最適な音韻を選択する処理である。 Next, as shown in FIG. 2, the keyword priority phoneme selection process (S219) is a process for selecting an optimal phoneme from the keyword part in the text body using the corpus 109.

ここで，図７を参照しながら，第２の実施の形態にかかるキーワード優先音韻処理（Ｓ２１９）について説明する。図７は，第２の実施の形態にかかるキーワード優先音韻選択処理の概略を示すフローチャートである。 Here, the keyword priority phoneme process (S219) according to the second embodiment will be described with reference to FIG. FIG. 7 is a flowchart showing an outline of keyword priority phoneme selection processing according to the second embodiment.

図７に示すように，キーワード優先音韻選択処理（Ｓ２１９）は，最後のキーワードか否かを確認する処理（Ｓ５０１）と，キーワードに対してターゲット選択を行う処理（Ｓ５０５）と，キーワードに対してコスト値Ｃ_ｋｅｙが最小となる音韻候補の組合せを設定する処理（Ｓ５１１）と，テキスト本文中のキーワード部分で音韻選択部１０４で求めたコスト値Ｃ_ｏｒｉｇを取得する処理（Ｓ５１５）と，Ｃ_ｋｅｙ−Ｃ_ｏｒｉｇ＞ｔｈｒｅｓｈｏｌｄの関係が成立するか否かを検証する処理（Ｓ５１７）と，テキスト本文全体の音韻候補の組合せのキーワード箇所の音韻候補の組合せの置き換える処理（Ｓ５１３）との処理が，行われる。なお，最後のキーワードか否かを確認する処理（Ｓ５０１）は，上記説明のステップＳ２１６と実質的に同一である。 As shown in FIG. 7, the keyword priority phoneme selection process (S219) includes a process of checking whether or not the current keyword is the last keyword (S501), a process of performing target selection for the keyword (S505), a process of setting the combination of phoneme candidates that minimizes the cost value C_key for the keyword (S511), a process of acquiring the cost value C_orig obtained by the phoneme selection unit 104 for the keyword portion of the text body (S515), a process of verifying whether or not the relationship C_key - C_orig > threshold holds (S517), and a process of replacing the keyword portion of the combination of phoneme candidates for the entire text body with the combination of phoneme candidates selected for the keyword (S513). The process of checking whether or not the current keyword is the last keyword (S501) is substantially the same as step S216 described above.

図７に示すように，キーワードに対して，ターゲット選択を行う処理（Ｓ５０５）では，キーワード部分の音韻に対して，音韻の適合性に関するサブコスト値と，音韻環境代替に関するサブコスト値と，韻律に関するサブコスト値とを，コーパス１０９を用いて取得し，上記３つのサブコスト値を足し合わせた，最小のサブコスト値（ターゲットコスト値）から，ある程度の幅を持たせた値の範囲内に含まれる，音韻を候補としてターゲット選択する。 As shown in FIG. 7, in the process of performing target selection for a keyword (S505), the sub-cost values for phoneme suitability, phonemic environment substitution, and prosody are acquired for each phoneme of the keyword portion using the corpus 109, and the phonemes whose summed sub-cost value falls within a certain margin of the minimum summed sub-cost value (the target cost value) are selected as candidates.

図７に示すキーワードに対してコスト値Ｃ_ｋｅｙが最小となる音韻候補の組合せを設定する処理（Ｓ５１１）では，図４に示すように，ターゲット選択で求めた音韻候補（又は，音韻選択）をもとに，テキスト本文内の，ユーザが指定したキーワード開始位置から，キーワード終了位置まで，ターゲットコスト値と，ピッチの不連続に関するサブコスト値と，スペクトルの不連続に関するサブコスト値とを，足し合わせたコスト値が最小となる音韻候補の組合せを，動的計画法を用いて選択する。 In the process of setting the combination of phoneme candidates that minimizes the cost value C_key for the keyword shown in FIG. 7 (S511), as shown in FIG. 4, based on the phoneme candidates (or phoneme selection) obtained by target selection, dynamic programming is used to select, over the range from the keyword start position to the keyword end position designated by the user in the text body, the combination of phoneme candidates that minimizes the cost value obtained by adding the target cost value, the sub-cost value for pitch discontinuity, and the sub-cost value for spectral discontinuity.

キーワード開始位置，キーワード終了位置は，例えば，テキスト本文の本文開始位置からカウントした音韻数によって，指定することができるが，かかる例に限定されず，例えば，本文終了位置からカウントした音韻数によって指定される場合等でもよい。 The keyword start position and the keyword end position can be specified by, for example, the number of phonemes counted from the text start position of the text body, but not limited to such an example, for example, specified by the number of phonemes counted from the text end position It may be the case.

上記キーワード位置（キーワード開始位置〜キーワード終了位置）の指定方法としては，テキスト本文中のキーワード部分に対してタグ付けする方法と，テキスト本文を一旦テキスト解析しテキスト本文中の音韻列をユーザに表示しキーワード部分を指定する方法と，１又は２以上のキーワードをあらかじめ保持しユーザがキーワードを選択しその選択したキーワードとテキスト本文中で一致する箇所を検索する方法とがある。 The keyword position (keyword start position to keyword end position) can be specified by tagging the keyword part in the text body or by analyzing the text body once and displaying the phoneme string in the text body to the user. There are a method of designating a keyword portion and a method of holding one or more keywords in advance, selecting a keyword by the user, and searching for a location that matches the selected keyword in the text body.
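
As a rough illustration of the third designation method (searching the text body for a stored keyword) and of positions expressed as phoneme counts, consider the following Python sketch. The function names, the half-open (start, end) convention, and the phoneme sequences written as one letter per phoneme are assumptions made here for brevity.

```python
from typing import List, Tuple


def find_keyword_spans(body: List[str], keyword: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) spans, counted in phonemes from the start of the body,
    where the keyword's phoneme sequence occurs. `end` is exclusive."""
    n, m = len(body), len(keyword)
    if m == 0:
        return []
    return [(i, i + m) for i in range(n - m + 1) if body[i:i + m] == keyword]


def span_from_end(span: Tuple[int, int], n_phonemes: int) -> Tuple[int, int]:
    """Express the same span as phoneme counts measured from the end of the body."""
    start, end = span
    return n_phonemes - end, n_phonemes - start
```

A span measured from the end of the text body carries the same information as one measured from the start, provided the total number of phonemes is known.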

図４に示すように，キーワード音韻候補が２つ以上存在する場合，キーワード部分に関するコスト値の計算は，例えば，テキスト本文の本文開始位置から本文終了位置に向けて，キーワードの出現順に行われる，あるいは，キーワード部分に関するコスト値の計算は，１又は２以上のキーワードに優先順位を付与し，優先順に行われる，あるいは，図８に示すように，キーワード部分に関するコスト値の計算は，１又は２以上のキーワードに重みを付与し，キーワードが出現する順に，重みとコスト値とを乗算する処理が行われる，あるいは，キーワード部分に関するコスト値の計算は，複数のキーワードに付与した重みを，優先順位に変換し，優先順に行われる。 As shown in FIG. 4, when there are two or more keyword phoneme candidates, the cost values for the keyword portions are calculated, for example, in the order in which the keywords appear from the start position toward the end position of the text body. Alternatively, priorities are assigned to the one or more keywords and the calculation is performed in order of priority. Alternatively, as shown in FIG. 8, weights are assigned to the one or more keywords and the weight and the cost value are multiplied in the order in which the keywords appear. Alternatively, the weights assigned to the keywords are converted into priorities and the calculation is performed in order of priority.

図７に示すように，テキスト本文中のキーワード部分について音韻選択部１０４で求めたコスト値Ｃ_ｏｒｉｇを取得する処理（Ｓ５１５）では，テキスト本文中のキーワード部分に対して，音韻選択部１０４で求めた，音韻の適合性に関するサブコスト値と，音韻環境代替に関するサブコスト値と，韻律に関するサブコスト値とを，コーパス１０９を用いて取得し，上記３つのサブコスト値を足し合わせた，最小のサブコスト値（ターゲットコスト値）と，ピッチの不連続に関するサブコスト値と，スペクトルの不連続に関するサブコスト値とを，足し合わせたコスト値Ｃ_ｏｒｉｇを取得する。 As shown in FIG. 7, in the process of acquiring the cost value C_orig obtained by the phoneme selection unit 104 for the keyword portion of the text body (S515), the sub-cost value for phoneme suitability, the sub-cost value for phonemic environment substitution, and the sub-cost value for prosody, as obtained by the phoneme selection unit 104, are acquired for the keyword portion of the text body using the corpus 109. The cost value C_orig is then acquired as the sum of the minimum sub-cost value obtained by adding these three sub-cost values (the target cost value), the sub-cost value for pitch discontinuity, and the sub-cost value for spectral discontinuity.

上記コスト値Ｃ_ｏｒｉｇは，テキスト本文中のキーワード部分の範囲から求められたサブコスト値を基にして求められるコスト値であるのに対して，コスト値Ｃ_ｋｅｙは，テキスト本文の本文開始位置から本文終了位置までの範囲から求められたサブコスト値を基にして求められるキーワード部分のコスト値である。即ち，双方のコスト値Ｃ_ｏｒｉｇとコスト値Ｃ_ｋｅｙは，例えば，上記３つのサブコスト値を求めるための音韻の対象範囲が異なっている。上記コスト値Ｃ_ｏｒｉｇの方が，どちらかと言えばキーワード部分を構成する１又は２以上の音韻が本来兼ね備えている声質などの特質を直接的に数値に表現される傾向がある。 The cost value C_orig is a cost value obtained from the sub-cost values computed over the range of the keyword portion of the text body, whereas the cost value C_key is the cost value of the keyword portion obtained from the sub-cost values computed over the range from the start position to the end position of the text body. That is, the two cost values C_orig and C_key differ in, for example, the range of phonemes over which the three sub-cost values are computed. Of the two, the cost value C_orig tends to express more directly, as a numerical value, characteristics such as the voice quality inherent in the one or more phonemes constituting the keyword portion.

図７に示すように，Ｃ_ｋｅｙ−Ｃ_ｏｒｉｇ＞ｔｈｒｅｓｈｏｌｄの関係であるか判定する処理（Ｓ５１７）では，キーワードに対するコスト値をＣ_ｋｅｙと，テキスト本文中のキーワード部分で，音韻選択部１０４で求めたコスト値Ｃ_ｏｒｉｇとの差分をとり，ある閾値（Ｔｈｒｅｓｈｏｌｄ）に対して，大小を判定する。 As shown in FIG. 7, in the process of determining whether or not the relationship C_key - C_orig > threshold holds (S517), the difference between the cost value C_key for the keyword and the cost value C_orig obtained by the phoneme selection unit 104 for the keyword portion of the text body is calculated and compared with a certain threshold (Threshold).

上記大小判定した結果，差分値が，ある閾値（Ｔｈｒｅｓｈｏｌｄ）より大きい場合であって，上記不等式の関係を保つ場合，図７に示すテキスト本文全体の音韻候補の組合せの，キーワード箇所の音韻候補の組合せの置き換え（Ｓ５１３）に処理が移行する。 If, as a result of this comparison, the difference is greater than the threshold (Threshold), that is, if the above inequality holds, the process proceeds to the replacement of the keyword portion of the combination of phoneme candidates for the entire text body shown in FIG. 7 (S513).

上記求めた差分値が，ある閾値（Ｔｈｒｅｓｈｏｌｄ）以下の場合，図７に示す最後のキーワードであるか否かを判定する処理が行われる（Ｓ５０１）。 When the obtained difference value is equal to or less than a certain threshold value (Threshold), a process of determining whether or not it is the last keyword shown in FIG. 7 is performed (S501).
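
The branch taken at step S517 can be summarized by the short Python sketch below. The function names and the list-based representation of a phoneme candidate combination are illustrative assumptions. The comparison itself (replace only when C_key - C_orig is strictly greater than the threshold) follows the text above.

```python
from typing import List


def should_replace(c_key: float, c_orig: float, threshold: float) -> bool:
    """S517: true only when C_key - C_orig is strictly greater than the threshold."""
    return (c_key - c_orig) > threshold


def apply_keyword(
    body_units: List[str],
    start: int,
    end: int,
    keyword_units: List[str],
    c_key: float,
    c_orig: float,
    threshold: float,
) -> List[str]:
    """Return the unit sequence after the S517 test and, if it passes, the S513 replacement."""
    if should_replace(c_key, c_orig, threshold):
        return body_units[:start] + list(keyword_units) + body_units[end:]
    return list(body_units)  # difference <= threshold: keep the original selection
```

A difference exactly equal to the threshold does not trigger replacement, matching the "equal to or less than" case described above.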

図７に示すように，テキスト本文全体の音韻候補の組合せの，キーワード箇所の音韻候補の組合せの置き換え（Ｓ５１３）では，テキスト本文全体で選択した音韻候補の組合せの，キーワード部分に対して，キーワード優先音韻選択部１０７によりキーワード優先音韻選択した音韻候補の組合せに，置き換える。つまり，キーワード個所の音韻は，キーワード優先音韻選択部１０７によりキーワード優先音韻選択した音韻が採用され，その音韻が後述の合成音として出力されることとなる。 As shown in FIG. 7, in the replacement process (S513), the keyword portion of the combination of phoneme candidates selected for the entire text body is replaced with the combination of phoneme candidates selected by the keyword priority phoneme selection unit 107 through keyword priority phoneme selection. In other words, the phonemes selected by the keyword priority phoneme selection unit 107 are adopted for the keyword portion and are output as the synthesized sound described later.

また，第２の実施の形態にかかる音声合成方法では，キーワード優先音韻選択部１０７により選択された音韻の方が合成音として出力した際により一層クリアに明快に読み上げられるように閾値（Ｔｈｒｅｓｈｏｌｄ）が定められ，置換処理（Ｓ５１３）が行われるようにする。 In the speech synthesis method according to the second embodiment, the threshold (Threshold) is set so that the replacement process (S513) is performed when the phonemes selected by the keyword priority phoneme selection unit 107 would be read aloud more clearly and distinctly when output as synthesized sound.

ここで，図８を参照しながら，第２の実施の形態にかかるキーワード優先音韻選択処理の変形例について説明する。図８は，第２の実施の形態にかかるキーワード優先音韻選択処理の概略を示すフローチャートである。 Here, a modified example of the keyword priority phoneme selection process according to the second embodiment will be described with reference to FIG. FIG. 8 is a flowchart showing an outline of keyword priority phoneme selection processing according to the second embodiment.

図８に示すように，キーワード優先音韻選択処理（Ｓ２１９）は，最後のキーワードであるか否かを判定する処理（Ｓ５０１）と，キーワードに対して，ターゲット選択を行う処理（Ｓ５０９）と，キーワードに対してコスト値Ｃ_ｋｅｙが最小となる音韻候補の組合せを設定する処理（Ｓ５１１）と，テキスト本文中のキーワード部分で音韻選択部１０４が求めたコスト値に重みを乗算したＣ_ｏｒｉｇを取得する処理（Ｓ５１６）と，最後のキーワードか否かを判定する処理（Ｓ５０２）と，Ｃ_ｋｅｙ−Ｃ_ｏｒｉｇ＞ｔｈｒｅｓｈｏｌｄの関係であるか判定する処理（Ｓ５１７）と，テキスト本文全体の音韻候補の組合せを，キーワード箇所の音韻候補の組合せに置き換える処理（Ｓ５１３）とを含んでいる。 As shown in FIG. 8, the keyword priority phoneme selection process (S219) includes a process of determining whether or not the current keyword is the last keyword (S501), a process of performing target selection for the keyword (S509), a process of setting the combination of phoneme candidates that minimizes the cost value C_key for the keyword (S511), a process of acquiring C_orig by multiplying the cost value obtained by the phoneme selection unit 104 for the keyword portion of the text body by a weight (S516), a process of determining whether or not the current keyword is the last keyword (S502), a process of determining whether or not the relationship C_key - C_orig > threshold holds (S517), and a process of replacing the keyword portion of the combination of phoneme candidates for the entire text body with the combination of phoneme candidates selected for the keyword (S513).

図８に示すように，第２の実施の形態にかかるキーワード優先音韻選択処理（Ｓ２１９）は，図７に示す第２の実施の形態にかかるキーワード優先音韻選択処理と比べて，上記ステップＳ５１５の代わりにステップＳ５１６を含む点，さらに上記ステップＳ５０２を含む点で異なる。 As shown in FIG. 8, this keyword priority phoneme selection process (S219) according to the second embodiment differs from the keyword priority phoneme selection process according to the second embodiment shown in FIG. 7 in that it includes step S516 instead of step S515 and further includes step S502.

図８のステップＳ５１６では，テキスト本文中のキーワード部分に対して，音韻選択部１０４で求めた，音韻の適合性に関するサブコスト値と，音韻環境代替に関するサブコスト値と，韻律に関するサブコスト値とを，コーパス１０９を用いて取得し，上記３つのサブコスト値を足し合わせた，最小のサブコスト値（ターゲットコスト値）と，ピッチの不連続に関するサブコスト値と，スペクトルの不連続に関するサブコスト値とを，足し合わせたコスト値Ｃ_ｏｒｉｇを取得する。 In step S516 of FIG. 8, for the keyword portion of the text body, the sub-cost value for phoneme suitability, the sub-cost value for phonemic environment substitution, and the sub-cost value for prosody, as obtained by the phoneme selection unit 104, are acquired using the corpus 109. The cost value C_orig is then acquired as the sum of the minimum sub-cost value obtained by adding these three sub-cost values (the target cost value), the sub-cost value for pitch discontinuity, and the sub-cost value for spectral discontinuity.

また，図８のステップＳ５１６では，キーワードが出現する順に，重みと上記で求めたコスト値とを乗算した重み付きコスト値を求め，各々を比較し，重み付きコスト値を取得する。 Further, in step S516 in FIG. 8, a weighted cost value obtained by multiplying the weight and the cost value obtained above is obtained in the order in which the keywords appear, and each is compared to obtain a weighted cost value.

また，図８に示すように，テキスト本文中のキーワード部分で，音韻選択部１０４で求めたコスト値に重みを乗算したＣ_ｏｒｉｇを取得する処理（Ｓ５１６）では，テキスト本文中のキーワード部分に対して，音韻選択部１０４で求めた，音韻の適合性に関するサブコスト値と，音韻環境代替に関するサブコスト値と，韻律に関するサブコスト値とを，コーパスを用いて取得し，３つのサブコスト値を足し合わせた，最小のサブコスト値（ターゲットコスト値）と，ピッチの不連続に関するサブコスト値と，スペクトルの不連続に関するサブコスト値とを，足し合わせたコスト値Ｃ_ｏｒｉｇを取得し，コスト値Ｃ_ｏｒｉｇと，複数のキーワードに付与された重みとを，キーワードが出現する順に乗算し，重み付きコスト値Ｃ_ｏｒｉｇを求める。 Further, as shown in FIG. 8, in the process of acquiring C_orig by multiplying the cost value obtained by the phoneme selection unit 104 for the keyword portion of the text body by a weight (S516), the sub-cost value for phoneme suitability, the sub-cost value for phonemic environment substitution, and the sub-cost value for prosody, as obtained by the phoneme selection unit 104, are acquired for the keyword portion of the text body using the corpus. The cost value C_orig is acquired as the sum of the minimum sub-cost value obtained by adding these three sub-cost values (the target cost value), the sub-cost value for pitch discontinuity, and the sub-cost value for spectral discontinuity. The cost value C_orig is then multiplied by the weights assigned to the keywords, in the order in which the keywords appear, to obtain the weighted cost value C_orig.

図８に示すように，最後のキーワードであるか否かを判定する処理（Ｓ５０２）では，例えば，各キーワードで求められた，重み付きコスト値Ｃ_ｏｒｉｇを比較し，小さい重み付きコスト値を保持しているキーワード順で，テキスト本文内の最後に出現するキーワードであるかどうかを判定する。 As shown in FIG. 8, in the process of determining whether or not the current keyword is the last keyword (S502), for example, the weighted cost values C_orig obtained for the respective keywords are compared, and, taking the keywords in ascending order of weighted cost value, it is determined whether or not the current keyword is the last keyword appearing in the text body.
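
Steps S516 and S502 can be pictured with the following Python sketch, which weights each keyword's C_orig and derives an ascending processing order. The function name processing_order and the parallel-list inputs (one entry per keyword, in order of appearance) are assumptions made for illustration. How ties between equal weighted values are broken is not stated in the specification; the sketch keeps order of appearance.

```python
from typing import List, Tuple


def processing_order(
    c_orig_values: List[float], weights: List[float]
) -> Tuple[List[float], List[int]]:
    """Weight each keyword's C_orig and order the keywords by ascending weighted value.

    Both inputs are listed in the order in which the keywords appear.
    Returns the weighted values and the keyword indices in processing order.
    """
    weighted = [w * c for c, w in zip(c_orig_values, weights)]
    # sorted() is stable, so equal weighted values stay in order of appearance.
    order = sorted(range(len(weighted)), key=lambda i: weighted[i])
    return weighted, order
```

With C_orig values of 10.0, 8.0, and 6.0 and weights of 0.5, 1.0, and 1.0, the weighted values are 5.0, 8.0, and 6.0, so the first keyword is processed first, then the third, then the second.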

上記判定の結果（Ｓ５０２），最後のキーワードでない場合，図８に示す「Ｃ_ｋｅｙ−Ｃ_ｏｒｉｇ＞ｔｈｒｅｓｈｏｌｄ」の関係について検証する処理（Ｓ５１７）に移行する。 As a result of the determination (S502), if it is not the last keyword, the process proceeds to processing (S517) for verifying the relationship of “C _key −C _orig > threshold” shown in FIG.

一方，上記判定の結果（Ｓ５０２），最後のキーワードである場合，キーワード部分の音韻選択が完了し，波形セグメントを得ることが可能な音韻候補の組合せが決定するので，音韻接続部１１１によって行われる処理に移行する。 On the other hand, as a result of the determination (S502), if it is the last keyword, the phoneme selection of the keyword part is completed, and the combination of phoneme candidates from which waveform segments can be obtained is determined. Transition to processing.

第１の実施の形態にかかる音声合成方法にて説明したように，図２に示すように，音韻接続部１１１によって実行される波形セグメントを接続する処理（Ｓ２２３）は，現在取り扱っている波形セグメントと，次の波形セグメントとを接続する。 As described for the speech synthesis method according to the first embodiment, and as shown in FIG. 2, the process of connecting waveform segments executed by the phoneme connection unit 111 (S223) connects the waveform segment currently being handled to the next waveform segment.

音韻接続部１１１によって，全ての波形セグメントの接続が完了すると（Ｓ２２３），図２に示すように，音韻接続部１１１によって接続された波形セグメントは，合成音（又は，合成音声）として出力される（Ｓ２２５）。上記合成音がスピーカ等の出力部から出力されることによって，音声合成装置１００は，テキスト本文のうちキーワードを強調しながら読上げることができる。 When the phoneme connection unit 111 has connected all the waveform segments (S223), the connected waveform segments are output as synthesized sound (or synthesized speech) (S225), as shown in FIG. 2. By outputting the synthesized sound from an output unit such as a speaker, the speech synthesizer 100 can read the text body aloud while emphasizing the keywords.

以上で，第２の実施の形態にかかる音声合成方法について説明を終了するが，かかる音声合成方法によって，以下に示すような優れた効果が存在する。 This concludes the description of the speech synthesis method according to the second embodiment. This speech synthesis method provides the following advantageous effect.
（１）テキスト本文中の強調させたいキーワードの部分を，何の評価もせずに，キーワード優先音韻選択部１０７で選択した音韻の候補に置き換えるのではなく，キーワード部分に対して，音韻選択部１０４で求めたコスト値Ｃ_ｋｅｙとキーワード優先音韻選択部１０７で求めたコスト値Ｃ_ｏｒｉｇとを比較し，置き換えの必要の際は，キーワード優先音韻部１０７が選択した音韻候補の組合せに置き換えることで，キーワード部分のより自然性が高い読上げを可能とし，音質がよく，キーワードの読上げを強調させることができる。 (1) Rather than replacing the keyword portion to be emphasized in the text body with the phoneme candidates selected by the keyword priority phoneme selection unit 107 without any evaluation, the cost value C_key obtained by the phoneme selection unit 104 and the cost value C_orig obtained by the keyword priority phoneme selection unit 107 are compared for the keyword portion, and the keyword portion is replaced with the combination of phoneme candidates selected by the keyword priority phoneme selection unit 107 only when replacement is necessary. This enables the keyword portion to be read aloud more naturally, with good sound quality and with the reading of the keyword emphasized.

なお，上述した一連の処理は，専用のハードウェアにより行うこともできるし，ソフトウェアにより行うこともできる。一連の処理をソフトウェアによって行う場合には，そのソフトウェアを構成するプログラムが，汎用のコンピュータやマイクロコンピュータ等の情報処理装置にインストールされ，上記情報処理装置を音声合成装置１００として機能させる。 The series of processes described above can be performed by dedicated hardware or software. When a series of processing is performed by software, a program constituting the software is installed in an information processing apparatus such as a general-purpose computer or a microcomputer, and the information processing apparatus functions as the speech synthesizer 100.

上記プログラムは，コンピュータに内蔵されている記録媒体としてのハードディスクドライブ（ＨＤＤ）やＲＯＭに予め記録しておくことができる。 The above program can be recorded in advance in a hard disk drive (HDD) or ROM as a recording medium built in the computer.

あるいはまた，プログラムは，ハードディスクドライブに限らず，フレキシブルディスク，ＣＤ−ＲＯＭ（ＣｏｍｐａｃｔＤｉｓｃＲｅａｄＯｎｌｙＭｅｍｏｒｙ），ＭＯ（ＭａｇｎｅｔｏＯｐｔｉｃａｌ）ディスク，ＤＶＤ（ＤｉｇｉｔａｌＶｅｒｓａｔｉｌｅＤｉｓｃ），磁気ディスク，半導体メモリなどのリムーバブル記録媒体に，一時的あるいは永続的に格納（記録）しておくことができる。このようなリムーバブル記録媒体は，いわゆるパッケージソフトウエアとして提供することができる。 Alternatively, the program is not limited to a hard disk drive, but a removable recording medium such as a flexible disk, CD-ROM (Compact Disc Read Only Memory), MO (Magneto Optical) disk, DVD (Digital Versatile Disc), magnetic disk, and semiconductor memory. In addition, it can be stored (recorded) temporarily or permanently. Such a removable recording medium can be provided as so-called package software.

Besides being installed on the computer from a removable recording medium as described above, the program can be transferred from a download site to the computer wirelessly via an artificial satellite for digital satellite broadcasting, or by wire via a network such as a LAN (Local Area Network) or the Internet. The computer can receive the program transferred in this way and install it on its built-in hard disk drive.

In this specification, the processing steps describing the program that causes a computer to perform various processes need not be processed in time series in the order shown in the flowcharts. They also include processes executed in parallel or individually (for example, parallel processing or object-based processing).

The program may be processed by a single computer, or may be processed in a distributed manner by a plurality of computers.

The preferred embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to these examples. It is obvious that a person skilled in the art can conceive of various changes or modifications within the scope of the technical idea described in the claims, and these are naturally understood to belong to the technical scope of the present invention.

In the above embodiments, the case where the text analysis unit 101 and the prosody prediction unit 103 are separate has been described as an example, but the present invention is not limited to this example. For example, the invention can also be implemented when the text analysis unit 101 and the prosody prediction unit 103 are configured integrally.

The present invention is applicable to a speech synthesizer, a speech synthesis method, and a computer program capable of synthesizing speech for reading out a text body.

A block diagram showing a schematic configuration of the speech synthesizer according to the first embodiment.
A flowchart showing an outline of the speech synthesis method performed by the speech synthesizer according to the first embodiment.
An explanatory diagram showing an outline of the phoneme symbols, the prosody prediction information, and the combination of phoneme candidates for the entire text body according to the present embodiment.
An explanatory diagram schematically showing the flow of data in the speech synthesizer according to the present embodiment.
An explanatory diagram showing an example of an outline of the phoneme selection process according to the present embodiment.
A flowchart showing an outline of the keyword priority phoneme selection process according to the first embodiment.
A flowchart showing an outline of the keyword priority phoneme selection process according to the first embodiment.
A flowchart showing an outline of the keyword priority phoneme selection process according to the second embodiment.
A flowchart showing an outline of the keyword priority phoneme selection process according to the second embodiment.

Explanation of symbols

100 Speech synthesizer
101 Text analysis unit
103 Prosody prediction unit
104 Phoneme selection unit
107 Keyword priority phoneme selection unit
109 Corpus
111 Phoneme connection unit

Claims

A speech synthesizer comprising:
a prosodic prediction information generation unit that analyzes a text body and generates prosodic prediction information predicting a feature amount of each symbol constituting the text body;
a scale calculation unit that obtains a scale for each symbol constituting the text body, based on the prosodic prediction information and each symbol constituting the text body, by a mapping from observable feature amounts to a scale determined on the basis of a perceptual experiment;
a phoneme database storing one or more phonemes constituting recorded speech;
a phoneme selection unit that selects, for each symbol constituting the text body, a text phoneme from the phoneme database based on the scale obtained by the scale calculation unit;
a keyword scale calculation unit that obtains a scale for each symbol constituting a keyword in the text body, based on the prosodic prediction information and each symbol constituting the keyword, by the mapping from observable feature amounts to the scale determined on the basis of the perceptual experiment;
a keyword priority phoneme selection unit that selects, for each symbol constituting the keyword, a keyword phoneme from the phoneme database based on the scale of each symbol constituting the keyword obtained by the keyword scale calculation unit;
a phoneme replacement unit that replaces the phoneme corresponding to the keyword part in the text phoneme selected by the phoneme selection unit with the keyword phoneme selected by the keyword priority phoneme selection unit; and
a synthesis unit that connects the phonemes and generates synthesized speech.

The speech synthesizer according to claim 1, wherein, when there are two or more keywords in the text body, the keyword priority phoneme selection unit assigns a priority to each of the keywords in the order in which the keywords appear from the beginning of the text body, converts weights assigned to the keywords into priorities, and selects the keyword phonemes in order of priority.

The speech synthesizer according to claim 1, wherein, when there are two or more keywords in the text body, the keyword priority phoneme selection unit assigns a weight to each keyword and, in the order in which the keywords appear, obtains the scale of each symbol constituting the keyword based on the weighted value of that keyword.

The speech synthesizer according to claim 1, wherein, when the keyword priority phoneme selection unit selects a keyword phoneme, the phoneme replacement unit sequentially replaces the phoneme corresponding to the keyword part in the text phoneme selected by the phoneme selection unit with that keyword phoneme.

The speech synthesizer according to claim 1 or 4, wherein the phoneme replacement unit determines whether or not to replace with the keyword phoneme by comparing the scale obtained by the keyword priority phoneme selection unit with the scale obtained by the phoneme selection unit.

The speech synthesizer according to claim 5, wherein the phoneme replacement unit compares the difference between the scale obtained by the keyword priority phoneme selection unit and the scale obtained by the phoneme selection unit with a predetermined threshold value.

A speech synthesis method comprising:
a prosodic prediction information generation step of analyzing a text body and generating prosodic prediction information predicting a feature amount of each symbol constituting the text body;
a scale calculation step of obtaining a scale for each symbol constituting the text body, based on the prosodic prediction information and each symbol constituting the text body, by a mapping from observable feature amounts to a scale determined on the basis of a perceptual experiment;
a phoneme selection step of selecting, for each symbol constituting the text body, a text phoneme from a phoneme database storing one or more phonemes constituting recorded speech, based on the scale obtained in the scale calculation step;
a keyword scale calculation step of obtaining a scale for each symbol constituting a keyword in the text body, based on the prosodic prediction information and each symbol constituting the keyword, by the mapping from observable feature amounts to the scale determined on the basis of the perceptual experiment;
a keyword priority phoneme selection step of selecting, for each symbol constituting the keyword, a keyword phoneme from the phoneme database based on the scale of each symbol constituting the keyword obtained in the keyword scale calculation step;
a phoneme replacement step of replacing the phoneme corresponding to the keyword part in the text phoneme selected in the phoneme selection step with the keyword phoneme selected in the keyword priority phoneme selection step; and
a synthesized speech generation step of connecting the phonemes and generating synthesized speech.

A computer program for causing a computer to execute a speech synthesis method comprising:
a prosodic prediction information generation step of analyzing a text body and generating prosodic prediction information predicting a feature amount of each symbol constituting the text body;
a scale calculation step of obtaining a scale for each symbol constituting the text body, based on the prosodic prediction information and each symbol constituting the text body, by a mapping from observable feature amounts to a scale determined on the basis of a perceptual experiment;
a phoneme selection step of selecting, for each symbol constituting the text body, a text phoneme from a phoneme database storing one or more phonemes constituting recorded speech, based on the scale obtained in the scale calculation step;
a keyword scale calculation step of obtaining a scale for each symbol constituting a keyword in the text body, based on the prosodic prediction information and each symbol constituting the keyword, by the mapping from observable feature amounts to the scale determined on the basis of the perceptual experiment;
a keyword priority phoneme selection step of selecting, for each symbol constituting the keyword, a keyword phoneme from the phoneme database based on the scale of each symbol constituting the keyword obtained in the keyword scale calculation step;
a phoneme replacement step of replacing the phoneme corresponding to the keyword part in the text phoneme selected in the phoneme selection step with the keyword phoneme selected in the keyword priority phoneme selection step; and
a synthesized speech generation step of connecting the phonemes and generating synthesized speech.