JP2009244661A

JP2009244661A - Device, method, and program for speech synthesis

Info

Publication number: JP2009244661A
Application number: JP2008092126A
Authority: JP
Inventors: Masanori Kato; 正徳加藤
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2008-03-31
Filing date: 2008-03-31
Publication date: 2009-10-22
Anticipated expiration: 2028-03-31
Also published as: JP5158567B2

Abstract

PROBLEM TO BE SOLVED: To efficiently find speech segments to be deleted whose removal contributes to improved sound quality. SOLUTION: A speech synthesis device comprises: a candidate segment acquisition unit 3 that acquires, based on language processing of an input text, speech segments that may be used for synthesized speech as candidate segments; a segment selection unit 4 that calculates, based on the language processing result, a segment selection score indicating how suitable each candidate segment is for speech synthesis, and selects the most suitable candidate segments as optimal segments; and a waveform generation unit 6 that generates a synthesized speech waveform from the optimal segments. When the optimal segments include a segment designated for deletion, the candidate segment acquisition unit 3 reacquires candidate segments excluding that segment, and the waveform generation unit 6 regenerates the synthesized speech waveform. The device further includes an expected improvement index calculation unit 140 that calculates an expected improvement index indicating how likely sound quality is to improve when a segment is deleted. The expected improvement index calculation unit 140 makes the index larger when the segment selection score is higher. COPYRIGHT: (C)2010,JPO&INPIT

Description

The present invention relates to speech synthesis technology, and more particularly to a speech synthesis device, a speech synthesis method, and a speech synthesis program for synthesizing speech from text.

Various speech synthesis devices have been developed that analyze a text sentence and generate synthesized speech by rule-based synthesis from the speech information the sentence indicates. FIG. 11 is a block diagram showing the configuration of a typical rule-based speech synthesis device. The configuration and operation of such a device are described in detail in, for example, Non-Patent Documents 1 to 3 and Patent Documents 1 and 2.

The speech synthesis device shown in FIG. 11 includes a language processing unit 1, a prosody generation unit 2, a candidate segment acquisition unit 3, a segment selection unit 4, a segment information storage unit 5, and a waveform generation unit 6.

The segment information storage unit 5 stores the speech segments generated for each speech synthesis unit, together with attribute information for each speech segment. Here, a speech segment is information used to generate the waveform of synthesized speech and is often extracted from recorded natural speech waveforms. Examples of speech segments include the speech waveform itself cut out for each synthesis unit, linear prediction analysis parameters, and cepstrum coefficients.

The attribute information of a speech segment consists of phonological and prosodic information, such as the phoneme environment of the natural speech from which the segment was extracted, its pitch frequency, amplitude, and duration.

Phonemes, CV, CVC, VCV (where V is a vowel and C is a consonant), and the like are often used as speech synthesis units. The length of speech segments and the synthesis units are described in detail in Non-Patent Documents 1, 3, and 5.

The language processing unit 1 performs language processing on the input text sentence, such as morphological analysis, syntactic analysis, and reading assignment, which analyzes the reading and accent of the text. It outputs, as the language processing result, a symbol string representing the reading (such as phoneme symbols) together with the part of speech, conjugation, and accent type of each morpheme to the prosody generation unit 2, the candidate segment acquisition unit 3, and the segment selection unit 4.

Based on the language processing result output from the language processing unit 1, the prosody generation unit 2 generates prosodic information for the synthesized speech (information on pitch, duration, power, and the like; that is, information on the rhythm of speech produced by the loudness, length, and pitch of sounds) and outputs it to the segment selection unit 4 and the waveform generation unit 6.

The candidate segment acquisition unit 3 refers to the language processing result, picks out from the speech segments stored in the segment information storage unit 5 those that may be used for the synthesized speech, and passes them to the segment selection unit 4.

The segment selection unit 4 selects, from the segments supplied by the candidate segment acquisition unit 3, speech segments that fit the language processing result and the generated prosodic information well, and outputs them to the waveform generation unit 6 together with the attached information of the selected segments.

The waveform generation unit 6 generates, from the selected speech segments, waveforms whose prosody is close to the prosody generated by the prosody generation unit 2, concatenates those waveforms, and outputs the result as synthesized speech. Waveform synthesis is described in Non-Patent Document 4.

In the processing above, the segment selection unit 4 derives, from the input language processing result and prosodic information, information representing the characteristics of the target synthesized speech (hereinafter called the "target segment environment") for each predetermined synthesis unit. The target segment environment includes, for example, the names of the current, preceding, and following phonemes, the presence or absence of stress, the distance from the accent nucleus, the pitch frequency and power of the synthesis unit, the duration of the unit, the cepstrum, MFCCs (Mel Frequency Cepstral Coefficients), and the Δ values of these quantities (their change per unit time).

Next, given the target segment environment, the segment selection unit 4 selects from the segment information storage unit 5 multiple speech segments that match specific information specified by the target segment environment (mainly the current phoneme). The selected speech segments become the candidates for synthesis. For each candidate segment, a "score" (or "cost") is then calculated as an index of how appropriate the segment is for use in synthesis. Because the goal is to generate high-quality synthesized speech, a higher score (or lower cost), meaning higher appropriateness, corresponds to higher quality of the synthesized sound. The score can therefore be regarded as an index for estimating the degree of degradation in the quality of the synthesized speech. Non-Patent Document 6 uses a cost for selecting speech segments.

The scores calculated by the segment selection unit 4 comprise a unit score and a connection score. The unit score represents the estimated sound quality degradation caused by using a candidate segment under the target segment environment, and is calculated from the similarity between the segment environment of the candidate and the target segment environment. The connection score represents the estimated sound quality degradation caused by discontinuity between the segment environments of concatenated speech segments, and is calculated from the affinity between the segment environments of adjacent candidate segments.
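As a rough illustration of the two scores, the sketch below treats a segment environment as a small feature dictionary. The feature names (`pitch_hz`, `power_db`, `end_pitch_hz`, `start_pitch_hz`), the weights, and the negative weighted distance are assumptions made for this example only; the text states merely that the unit score derives from similarity to the target environment and the connection score from affinity between adjacent candidates.

```python
def unit_score(candidate_env, target_env, weights):
    """Return a score that rises as the candidate's environment nears the target's."""
    distance = sum(
        weight * abs(candidate_env[name] - target_env[name])
        for name, weight in weights.items()
    )
    return -distance  # zero is a perfect match; more negative is worse


def connection_score(left_env, right_env, pitch_weight=1.0):
    """Return a score that rises as pitch becomes more continuous across the joint."""
    pitch_jump = abs(left_env["end_pitch_hz"] - right_env["start_pitch_hz"])
    return -pitch_weight * pitch_jump


# Hypothetical environments for one synthesis unit.
target = {"pitch_hz": 120.0, "power_db": 60.0}
near = {"pitch_hz": 122.0, "power_db": 61.0}
far = {"pitch_hz": 150.0, "power_db": 55.0}
weights = {"pitch_hz": 1.0, "power_db": 0.5}
```

With these assumed values, `near` scores higher than `far`, which matches the idea that a higher score means less estimated degradation.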

Various methods for calculating the unit score and the connection score have been proposed. In general, the unit score is calculated from the information contained in the target segment environment, while the connection score uses the pitch frequency, cepstrum, MFCCs, short-time autocorrelation, and power at the segment concatenation boundary, as well as the Δ values of these quantities.

As described above, the unit score and the connection score are each calculated from multiple kinds of segment information (pitch frequency, cepstrum, power, and so on). After the unit score and the connection score have been calculated for each segment, the speech segment that maximizes both the connection score and the unit score is determined uniquely for each synthesis unit.

A segment obtained by this score maximization is called an optimal segment, because it has been selected from the candidate segments as the one best suited to synthesizing the speech. Once the segment selection unit 4 has determined the optimal segment for every synthesis unit, it outputs the resulting sequence of optimal segments (the optimal segment sequence) to the waveform generation unit 6 as the selection result.
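One common way to realize this kind of score maximization is a dynamic-programming (Viterbi-style) search over the candidates of each synthesis unit. The sketch below is an illustration under assumptions, not the method fixed by the text: it assumes the quantity to maximize is the sum of all unit scores and connection scores, and the function name, candidate labels, and score tables are hypothetical.

```python
def select_optimal_sequence(candidates, unit_score, connection_score):
    """Pick one candidate per synthesis unit so the summed score is maximal.

    candidates: list with one entry per synthesis unit, each a list of
    candidate labels (labels are assumed unique within a unit).
    Returns (best_total_score, best_sequence).
    """
    # best[c] = (score of the best path ending in candidate c, that path)
    best = {c: (unit_score(0, c), [c]) for c in candidates[0]}
    for i in range(1, len(candidates)):
        new_best = {}
        for c in candidates[i]:
            prev_score, prev_path = max(
                ((s + connection_score(p, c), path) for p, (s, path) in best.items()),
                key=lambda item: item[0],
            )
            new_best[c] = (prev_score + unit_score(i, c), prev_path + [c])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


# Toy score tables (made-up numbers).
unit_table = {"a1": 1.0, "a2": 2.0, "b1": 1.0, "b2": 1.5}
conn_table = {("a1", "b1"): 0.0, ("a1", "b2"): -3.0,
              ("a2", "b1"): -0.5, ("a2", "b2"): -0.2}
total, sequence = select_optimal_sequence(
    [["a1", "a2"], ["b1", "b2"]],
    lambda i, c: unit_table[c],
    lambda p, c: conn_table[(p, c)],
)
```

Here `sequence` plays the role of the optimal segment sequence passed to the waveform generation unit.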

Although the segment selection unit 4 calculates scores to select the optimal segments, the formulas and parameters used for the selection may be inadequate, so the segments that achieve the best sound quality are not always selected.

In addition, a segment containing degradation that the predefined scores cannot detect may be selected. For example, a segment contaminated with sudden noise is unsuitable as an optimal segment and should be excluded during selection. However, the unit score and connection score described above do not take sudden noise into account, so such a segment's score is not lowered and it may still be selected as an optimal segment.

To solve this problem, Patent Document 3 describes a synthesized speech editing method in which an editor listens to the generated synthesized speech, finds segments that appear to be of poor sound quality, and deletes them, that is, designates them as prohibited segments.

In Patent Document 3, segment deletion is designated unit by unit. For example, suppose the synthesis unit is the syllable and segment deletion is applied to synthesized speech with the utterance "konnichiwa". If the sound quality of "ko" is judged to be poor, the segment used for the unit "ko" is designated as a prohibited segment. Prohibited segments are no longer used for synthesized speech, so the sound quality improves at the places where those segments had been used. To find prohibited segments efficiently, Patent Document 3 proposes presenting the synthesized speech editor with segments whose sound quality is considered poor, that is, segments with low segment selection scores.

Patent Document 4 selects, for each frequency, the optimal speech segment for each synthesis unit based on an evaluation value, computes and stores the total evaluation value at that frequency, and uses the pitch pattern at the frequency with the best total evaluation value for synthesis.

Patent Document 1: JP 2005-091551 A
Patent Document 2: JP 2006-084854 A
Patent Document 3: JP 2006-313176 A
Patent Document 4: JP 2004-138728 A
Non-Patent Document 1: Huang, Acero, Hon: "Spoken Language Processing", Prentice Hall, pp. 689-836, 2001.
Non-Patent Document 2: Ishikawa: "Basics of Prosodic Control for Speech Synthesis", IEICE Technical Report, Vol. 100, No. 392, pp. 27-34, 2000.
Non-Patent Document 3: Abe: "Basics of Synthesis Units for Speech Synthesis", IEICE Technical Report, Vol. 100, No. 392, pp. 35-42, 2000.
Non-Patent Document 4: Moulines, Charpentier: "Pitch-Synchronous Waveform Processing Techniques for Text-to-Speech Synthesis Using Diphones", Speech Communication 9, pp. 435-467, 1990.
Non-Patent Document 5: Segi, Takagi, Ito: "A Concatenative Speech Synthesis Method Using Context Dependent Phoneme Sequences with Variable Length as Search Units", Proceedings of 5th ISCA Speech Synthesis Workshop, pp. 115-120, 2004.
Non-Patent Document 6: Kawai, Toda, Ni, Tsuzaki, Tokuda: "XIMERA: A New TTS from ATR Based on Corpus-Based Technologies", Proceedings of 5th ISCA Speech Synthesis Workshop, pp. 179-184, 2004.

However, the conventional speech synthesis device described in Patent Document 3 has the following problems.

The problem is that a location with a low segment selection score is not necessarily a location where improvement is likely. Because the segment with the highest score is basically the one selected as the optimal segment, deleting the optimal segment at a location where the score is low is unlikely to raise the score significantly. In other words, at a location where the score is low, a segment with good sound quality is unlikely to appear.

Furthermore, even if a segment containing degradation that is hard to exclude by score is deleted, the improvement in sound quality from the deletion is small if the score is low and the sound quality is poor to begin with. Searching intensively in low-score locations is therefore not an appropriate way to find prohibited segments and efficiently improve the quality of synthesized speech.

The invention described in Patent Document 4 does not adequately handle situations in which the number of speech segments is insufficient, so portions with poor sound quality tend to occur abruptly.

The present invention has been made in view of the above problems, and its object is to provide a speech synthesis device, a speech synthesis method, and a speech synthesis program that make it possible to efficiently find the speech segments to be deleted, that is, those whose use should be prohibited.

To solve the above problems, a speech synthesis device according to the present invention comprises: a language processing unit that performs, on an input text, language processing including analysis of the reading and accent of the input text, morphological analysis, and syntactic analysis; a prosody generation unit that generates, based on the result of the language processing, prosodic information on the loudness, length, and pitch of sounds; a candidate segment acquisition unit that acquires, based on the result of the language processing, speech segments that may be used for synthesized speech as candidate segments; a segment selection unit that calculates, based on the prosodic information and the result of the language processing, a segment selection score, which is an index of the appropriateness of each candidate segment for speech synthesis, and selects from the candidate segments the speech segments best suited to the synthesized speech as optimal segments; and a waveform generation unit that generates a synthesized speech waveform based on the optimal segments. When the optimal segments include a segment to be deleted, the candidate segment acquisition unit acquires the candidate segments again with that segment excluded, and the waveform generation unit regenerates the synthesized speech waveform. The speech synthesis device further comprises an expected improvement index calculation unit that calculates an expected improvement index representing how likely the sound quality is to improve when the segment to be deleted is deleted, and the expected improvement index calculation unit makes the expected improvement index larger when the segment selection score is larger.

To solve the above problems, a speech synthesis method according to the present invention comprises: a language processing step of performing, on an input text, language processing including analysis of the reading and accent of the input text, morphological analysis, and syntactic analysis; a prosody generation step of generating, based on the result of the language processing, prosodic information on the loudness, length, and pitch of sounds; a candidate segment acquisition step of acquiring, based on the result of the language processing, speech segments that may be used for synthesized speech as candidate segments; a segment selection step of calculating, based on the prosodic information and the result of the language processing, a segment selection score, which is an index of the appropriateness of each candidate segment for speech synthesis, and selecting from the candidate segments the speech segments best suited to the synthesized speech as optimal segments; and a waveform generation step of generating a synthesized speech waveform based on the optimal segments. When the optimal segments include a segment to be deleted, the candidate segment acquisition step acquires the candidate segments again with that segment excluded, and the waveform generation step regenerates the synthesized speech waveform. The speech synthesis method further comprises an expected improvement index calculation step of calculating an expected improvement index representing how likely the sound quality is to improve when the segment to be deleted is deleted, and the expected improvement index calculation step makes the expected improvement index larger when the segment selection score is larger.

To solve the above problems, a speech synthesis program according to the present invention causes a computer to execute: language processing performed on an input text, including analysis of the reading and accent of the input text, morphological analysis, and syntactic analysis; prosody generation processing that generates, based on the result of the language processing, prosodic information on the loudness, length, and pitch of sounds; candidate segment acquisition processing that acquires, based on the result of the language processing, speech segments that may be used for synthesized speech as candidate segments; segment selection processing that calculates, based on the prosodic information and the result of the language processing, a segment selection score, which is an index of the appropriateness of each candidate segment for speech synthesis, and selects from the candidate segments the speech segments best suited to the synthesized speech as optimal segments; and waveform generation processing that generates a synthesized speech waveform based on the optimal segments. When the optimal segments include a segment to be deleted, the candidate segment acquisition processing acquires the candidate segments again with that segment excluded, and the waveform generation processing regenerates the synthesized speech waveform. The speech synthesis program further causes the computer to execute expected improvement index calculation processing that calculates an expected improvement index representing how likely the sound quality is to improve when the segment to be deleted is deleted, and the expected improvement index calculation processing makes the expected improvement index larger when the segment selection score is larger.

According to the present invention, an expected improvement index representing how likely the sound quality is to improve when a segment to be deleted is deleted is calculated, and the index is made larger when the segment selection score is larger. This makes it possible to realize a speech synthesis device, a speech synthesis method, and a speech synthesis program that can efficiently find the speech segments to be deleted, that is, those whose use should be prohibited.

Next, the configuration of an embodiment of the present invention will be described in detail with reference to the drawings.
[First Embodiment]
FIG. 1 is a block diagram showing the configuration of a speech synthesis device according to the first embodiment of the present invention. The configuration of this embodiment shown in FIG. 1 includes a language processing unit 1, a prosody generation unit 2, a candidate segment acquisition unit 3, a segment selection unit 4, a segment information storage unit 5, a waveform generation unit 6, a prohibited segment information acquisition unit 11 that receives segment deletion commands designating the segments to be deleted, a prohibited segment information storage unit 12, an optimal segment information storage unit 13, and an expected improvement index calculation unit 140.

The segment information storage unit 5 stores the speech segments generated for each speech synthesis unit, together with attribute information for each speech segment.

The prohibited segment information storage unit 12 records those segments registered in the segment information storage unit 5 that have been designated as unusable for synthesized speech, that is, the segments to be excluded from the candidate segments.

In its initial state, the prohibited segment information storage unit 12 typically holds no segments; records accumulate as the synthesized speech editor designates segments for deletion, and all records are erased once editing of the synthesized speech is complete. Other usages are also possible, however: segments whose use is to be prohibited permanently can be recorded from the outset, or prohibited segments can be registered cumulatively without erasing the records after editing.
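The three usages described above (empty at start and cleared after editing, pre-registered permanent prohibitions, and cumulative registration) can be pictured with a small in-memory store. This is only a sketch: the class name, method names, and the use of integer segment identifiers are assumptions introduced for illustration.

```python
class ProhibitedSegmentStore:
    """Toy stand-in for the prohibited segment information storage unit 12."""

    def __init__(self, permanent_ids=(), keep_after_editing=False):
        self._permanent = set(permanent_ids)   # prohibited from the outset
        self._session = set()                  # added during the current edit
        self._keep_after_editing = keep_after_editing

    def prohibit(self, segment_id):
        """Record a segment named by a segment deletion command."""
        self._session.add(segment_id)

    def is_prohibited(self, segment_id):
        return segment_id in self._permanent or segment_id in self._session

    def finish_editing(self):
        """End the editing session: either discard or retain its records."""
        if self._keep_after_editing:
            self._permanent |= self._session
        self._session.clear()


# Typical usage: session records vanish, permanent ones remain.
store = ProhibitedSegmentStore(permanent_ids={7})
store.prohibit(42)
during_edit = (store.is_prohibited(42), store.is_prohibited(7))
store.finish_editing()
after_edit = (store.is_prohibited(42), store.is_prohibited(7))
```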

The optimal segment information storage unit 13 stores information on the segments selected by the segment selection unit 4. It therefore holds no records in the initial state, and its records are erased once editing of the synthesized speech is complete. Its contents are updated each time segment selection is executed.

Next, the operation of the speech synthesis device according to the first embodiment will be described in detail with reference to the block diagram of FIG. 1.

FIG. 2 is a flowchart for explaining the operation of the speech synthesis device according to the first embodiment of the present invention.

In FIG. 2, the language processing unit 1 performs analyses such as morphological analysis, syntactic analysis, and reading assignment on the input text sentence, and outputs the language processing result to the prosody generation unit 2, the candidate segment acquisition unit 3, and the segment selection unit 4 (step S101).

Based on the language processing result output from the language processing unit 1, the prosody generation unit 2 generates prosodic information for the synthesized speech and outputs it to the segment selection unit 4 and the waveform generation unit 6 (step S102).

The candidate segment acquisition unit 3 refers to the language processing result supplied from the language processing unit 1 and the prohibited segment information stored in the prohibited segment information storage unit 12, picks out from the speech segments registered in the segment information storage unit 5 those that may be used for the synthesized speech, and passes them to the segment selection unit 4. At this point, segments registered as prohibited segments and segments with a different reading are excluded from the candidates. Segments whose linguistic features (such as position relative to accent phrase boundaries or distance from the accent nucleus) differ markedly from the language processing result may also be excluded (step S103).
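The filtering in step S103 can be sketched as follows. The `Segment` fields, the function name, and the threshold `max_accent_gap` are hypothetical; in particular, using a single accent-distance gap to stand for "markedly different linguistic features" is an assumption made to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    segment_id: int
    reading: str          # label of the synthesis unit, e.g. a syllable
    accent_distance: int  # distance from the accent nucleus


def acquire_candidates(inventory, reading, prohibited_ids,
                       target_accent_distance=None, max_accent_gap=None):
    """Return segments usable for one synthesis unit.

    Excludes prohibited segments and segments with a different reading.
    Optionally also excludes segments whose accent distance is too far
    from the target (only when both optional arguments are given).
    """
    candidates = []
    for segment in inventory:
        if segment.segment_id in prohibited_ids:
            continue
        if segment.reading != reading:
            continue
        if (target_accent_distance is not None and max_accent_gap is not None
                and abs(segment.accent_distance - target_accent_distance) > max_accent_gap):
            continue
        candidates.append(segment)
    return candidates


inventory = [Segment(1, "ko", 0), Segment(2, "ko", 1),
             Segment(3, "n", 0), Segment(4, "ko", 5)]
loose = acquire_candidates(inventory, "ko", prohibited_ids={1})
strict = acquire_candidates(inventory, "ko", prohibited_ids={1},
                            target_accent_distance=0, max_accent_gap=2)
```

Re-running `acquire_candidates` after a new identifier is added to `prohibited_ids` corresponds to reacquiring candidates once a segment has been designated for deletion.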

The segment selection unit 4 selects, from the segments supplied by the candidate segment acquisition unit 3, speech segments that fit well with the language processing result supplied from the language processing unit 1 and the prosodic information generated by the prosody generation unit 2, and passes them, together with the attached information of the selected segments, to the optimal segment information storage unit 13 and the waveform generation unit 6. It also passes the candidate segments and the segment selection scores of each segment (unit score, connection score, and so on) calculated during the selection to the expected improvement index calculation unit 140 as segment selection information (step S104).

The waveform generation unit 6 generates, from the speech segments selected by the segment selection unit 4, waveforms whose prosody is close to the prosody generated by the prosody generation unit 2, concatenates those waveforms, and outputs the result as synthesized speech (step S105). The prosody of the generated waveform may faithfully reproduce the prosody generated by the prosody generation unit 2, or it may largely ignore that prosody and be derived solely from the prosody of the selected segments.

The optimum segment information storage unit 13 updates its stored information on the selected segments based on the segment selection information supplied from the segment selection unit 4 (step S106).

Based on the segment selection information supplied from the segment selection unit 4, the expected improvement index calculation unit 140 estimates, for each unit, the likelihood that deleting the segment will improve sound quality, and outputs the estimate as the expected improvement index (step S107).

When the expected improvement index is calculated from the segment selection information, the segment selection score is the main input. A high segment selection score means that sound quality is likely to improve when the segment is deleted, so the expected improvement index is set high. The segment selection scores used for this calculation are chiefly the unit score, the connection score, and a score derived from both (for example, their sum).

Deriving the expected improvement index from the segment selection score requires an estimate of the segment selection score after deletion. The ideal approach would be to obtain the post-deletion score before deleting, that is, to determine for each unit the segment selection score that results when the current optimum segment is removed.

Computing this score, however, generally requires deleting the optimum segment and rerunning segment selection for each unit in turn, which usually demands a very large amount of computation.

Promising ways for the expected improvement index calculation unit or the like to estimate the post-deletion segment selection score are therefore to use the selection score of the current optimum segment as the estimate, or to compute the estimate from the scores of several high-scoring segments.

When the estimate is computed from the scores of several high-scoring segments, the mean or a weighted sum of the scores is calculated for each unit and used as the estimate of the post-deletion segment selection score. For example, if the N highest scores are $S_1$ to $S_N$ ($S_1 > S_2 > \cdots > S_N > 0$) and the estimate $T$ is calculated as a weighted sum, $T$ is given by equation (1):

$$T = a_1 S_1 + a_2 S_2 + \cdots + a_N S_N \tag{1}$$

where $a_1, a_2, \ldots, a_N$ are positive real numbers. Because the segments most likely to be selected after deletion are the high-scoring ones, the weights should preferably satisfy $a_1 > a_2 > \cdots > a_N$. Computing the estimate with the optimum segment's score excluded is also promising: when only the optimum segment scores far above the rest and the next-best segments score low, deleting the optimum segment is likely to lower both the score and the sound quality substantially.
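The weighted-sum estimate described above, including the variant that leaves out the optimum segment's score, could be sketched as follows. The function name, the `skip_best` flag, and the sample scores and weights are assumptions chosen for illustration.

```python
def estimate_post_deletion_score(scores, weights, skip_best=False):
    """Weighted sum of the highest scores of one unit.

    scores    -- segment selection scores of the unit's candidates (any order)
    weights   -- positive weights, ideally decreasing (a1 > a2 > ... > aN)
    skip_best -- if True, leave out the optimum segment's score
    """
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")
    ranked = sorted(scores, reverse=True)
    if skip_best:
        ranked = ranked[1:]
    top = ranked[:len(weights)]  # the N highest remaining scores
    return sum(w * s for w, s in zip(weights, top))


unit_scores = [0.9, 0.5, 0.8]
t_all = estimate_post_deletion_score(unit_scores, [0.5, 0.3, 0.2])
t_without_best = estimate_post_deletion_score(unit_scores, [0.6, 0.4], skip_best=True)
```

With these sample values, `t_all` is 0.5·0.9 + 0.3·0.8 + 0.2·0.5 and `t_without_best` is 0.6·0.8 + 0.4·0.5. If fewer candidates remain than there are weights, the sketch simply uses the scores that exist.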

One way to obtain the expected improvement index from the estimated post-deletion segment selection score is to apply a linear function of the estimated score. With the estimated score denoted $x$ ($x > 0$) and the expected improvement index denoted $F(x)$, the relationship is given by equation (2):

$$F(x) = a x + b \tag{2}$$

where $a$ and $b$ are real numbers and $a > 0$.

Alternatively, as shown in FIG. 3, a table relating the estimated score to the expected improvement index can be prepared in advance and consulted to determine the index value. Other options are to use an exponential function, a function of second or higher order, or a polynomial function as F(x), or to output the estimated segment selection score itself. Whichever method is used, the condition the calculation must satisfy is that a higher estimated score tends to yield a higher expected improvement.
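The two mappings from estimated score to index, a linear function with a positive slope and a lookup table, might look like this in Python. The table boundaries and index values below are invented for the sketch; the actual contents of FIG. 3 are not reproduced in this text.

```python
import bisect


def linear_index(x, a=1.0, b=0.0):
    """Linear mapping F(x) = a*x + b with a > 0."""
    if a <= 0:
        raise ValueError("a must be positive so that F increases with x")
    return a * x + b


def table_index(x, boundaries, values):
    """Table lookup: boundaries are ascending score thresholds and
    values holds one index per interval (len(values) == len(boundaries) + 1)."""
    if len(values) != len(boundaries) + 1:
        raise ValueError("values must have one more entry than boundaries")
    return values[bisect.bisect_right(boundaries, x)]


# Hypothetical table: scores below 0.3 map to 1, 0.3 to 0.7 map to 2, the rest to 3.
boundaries = [0.3, 0.7]
values = [1, 2, 3]
idx_low = table_index(0.1, boundaries, values)
idx_mid = table_index(0.5, boundaries, values)
idx_high = table_index(0.9, boundaries, values)
```

Both mappings are non-decreasing in the estimated score, which is the only property the text requires.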

In step S108, the prohibited segment information acquisition unit 11 refers to the input segment deletion command and the optimum segment information stored in the optimum segment information storage unit 13 to determine the segments to be deleted, that is, the segments to be prohibited. If a segment deletion command has been received (YES), it passes the information on the segments to be prohibited to the prohibited segment information storage unit 12 and sends the candidate segment acquisition unit 3 a signal notifying it that the prohibited segment information storage unit 12 has been updated (step S109). The prohibited segment information storage unit 12 then updates the prohibited segment information based on the information it received (step S110).

A segment deletion command is obtained per synthesis unit. For example, if the synthesis unit is the syllable and the synthesized utterance is "こんにちわ" (konnichiwa), the deletion target is designated from among "こ" (ko), "ん" (n), "に" (ni), "ち" (chi), and "わ" (wa). A single position such as "こ" may be designated, or two positions such as "こ" and "ち". The segment at each designated position is uniquely identified from the optimum segment information.
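To make the "こんにちわ" example concrete, the following sketch resolves a deletion command, given as positions in the utterance, to the segments currently selected at those positions. The list-of-pairs layout for the optimum segment information and the segment IDs are hypothetical.

```python
def resolve_deletion_command(optimum_segments, positions):
    """Map designated unit positions to the IDs of the segments selected there.

    optimum_segments -- list of (unit_label, seg_id) in utterance order
    positions        -- 0-based positions designated by the deletion command
    """
    return {optimum_segments[pos][1] for pos in positions}


# Hypothetical optimum segment information for the utterance "こんにちわ".
optimum = [("こ", 101), ("ん", 205), ("に", 33), ("ち", 78), ("わ", 9)]

only_ko = resolve_deletion_command(optimum, [0])        # designate "こ" only
ko_and_chi = resolve_deletion_command(optimum, [0, 3])  # designate "こ" and "ち"
```

The returned IDs are what would be registered as prohibited segment information.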

Once a segment is deleted, the synthesized speech must be regenerated without it. Sending the candidate segment acquisition unit 3 the signal that the prohibited segment information storage unit 12 has been updated therefore causes the process to run again, from candidate segment acquisition through waveform generation.

On receiving the signal that the prohibited segment information storage unit 12 has been updated, the candidate segment acquisition unit 3 again refers to the language processing result and the prohibited segment information stored in the prohibited segment information storage unit 12, selects the speech segments that may be used in the synthesized speech, and passes them to the segment selection unit 4 (step S103).

The procedure from step S104 onward is then executed again. If no segment deletion command is received in step S108 (NO), the series of processes according to this embodiment ends.
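The loop formed by steps S103 to S110 can be sketched as below. This is a deliberately simplified model: selection picks the highest-scoring remaining candidate per unit and ignores connection scores, waveform generation is omitted, and deletion commands arrive as a prepared list rather than from a user. All names are assumptions.

```python
def synthesize_with_deletions(candidates_per_unit, deletion_commands):
    """Repeat candidate acquisition and selection until no deletion command arrives.

    candidates_per_unit -- list of dicts {seg_id: selection_score}, one per unit
    deletion_commands   -- iterable of position lists, one per round
    Returns the final optimum segment IDs and the set of prohibited IDs.
    Assumes every unit keeps at least one candidate.
    """
    prohibited = set()
    commands = iter(deletion_commands)
    while True:
        # S103: acquire candidates, leaving out prohibited segments.
        available = [
            {seg_id: score for seg_id, score in unit.items() if seg_id not in prohibited}
            for unit in candidates_per_unit
        ]
        # S104: select the best-scoring segment for each unit (simplified).
        optimum = [max(unit, key=unit.get) for unit in available]
        # S105 to S107 (waveform generation, bookkeeping, index output) are omitted.
        # S108: was a deletion command received?
        positions = next(commands, None)
        if not positions:
            return optimum, prohibited
        # S109, S110: register the designated segments as prohibited and rerun.
        prohibited.update(optimum[pos] for pos in positions)


units = [{1: 0.9, 2: 0.7}, {3: 0.8, 4: 0.6}]
final, banned = synthesize_with_deletions(units, [[0], [1]])
```

In this run segment 1 is deleted in the first round and segment 3 in the second, so the final selection falls back to segments 2 and 4.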

As described above, the speech synthesizer of this embodiment exploits the property that a higher segment selection score makes sound quality improvement by segment deletion more likely, and calculates the expected improvement index from the segment selection score. For each unit, the post-deletion segment selection score is estimated from the score of the optimum segment or of several high-scoring segments and is reflected in that unit's expected improvement index. The change in score and sound quality caused by deletion can thus be anticipated, yielding a more reliable expected improvement index than before.

The reason is that it is more efficient to search first the positions where improvement is likely than the positions where the segment selection score is low: the positions where improvement is likely are not those with low scores but those with high scores and poor sound quality.

Where the score is high, deleting a segment often leads to reselection of a segment whose score is only slightly lower and still high. An alternative segment with both a high score and high sound quality is therefore likely to be selected after deletion, so the quality of the synthesized speech can be improved efficiently.
[Second Embodiment]
The first embodiment calculates the expected improvement index from the segment selection score, but the segment selection score cannot be fully trusted as a measure of sound quality. If it were reliable enough for finding the prohibited segments that should be deleted, high-quality segments would already be chosen as the optimum segments at selection time, and the task of finding and designating prohibited segments would itself be unnecessary.

Using information other than the segment selection score therefore makes a more reliable expected improvement index possible. The second and third embodiments describe examples that calculate the expected improvement index from such information.

ここで、図４は、本発明の第２の実施の形態に係る音声合成装置の構成を示すブロック図である。 Here, FIG. 4 is a block diagram showing the configuration of the speech synthesizer according to the second embodiment of the present invention.

Compared with the first embodiment shown in FIG. 1, the configuration of this embodiment shown in FIG. 4 is characterized by a candidate segment acquisition unit 31 and an expected improvement index calculation unit 141. The detailed operation of the speech synthesizer according to the second embodiment is described below with reference to the block diagram of FIG. 4.

FIG. 5 is a flowchart for explaining the operation of the speech synthesizer according to the second embodiment of the present invention. Compared with FIG. 2, the flowchart for the first embodiment, steps S101, S102, S104, S105, S106, S108, S109, and S110 are the same; step S203 replaces step S103 of FIG. 2, and step S207 replaces step S107.

In step S203, the candidate segment acquisition unit 31 refers to the language processing result supplied from the language processing unit 1 and the prohibited segment information stored in the prohibited segment information storage unit 12, selects from the speech segments registered in the segment information storage unit 5 those that may be used in the synthesized speech, and passes them to the segment selection unit 4. It also passes the number of candidate segments for each unit to the expected improvement index calculation unit 141.

Based on the number of candidate segments supplied from the candidate segment acquisition unit 31, the expected improvement index calculation unit 141 estimates, for each unit, the likelihood that deleting the segment will improve sound quality, and outputs the estimate as the expected improvement index (step S207).

When the expected improvement index is calculated from the number of candidate segments, the index is basically set low where candidates are few, because the alternative segment is unlikely to achieve high sound quality even if deletion is designated. The number of segments registered in the segment information storage unit 5 and the number acquired by the candidate segment acquisition unit 31 often differ from unit to unit; in other words, the number of available substitutes depends on the unit type. With the syllable as the synthesis unit, for example, there may be many segments for "わ" (wa) or "お" (o) but few for "ヴぁ" (va).

A large number of candidates also means that segments with a wide variety of feature values (pitch frequency, duration, cepstrum, and so on) tend to be available. Where candidates are many, a segment with a high selection score is therefore likely to emerge as the optimum segment. Consequently, the more candidates there are, the more likely it is that an alternative segment with a high selection score will emerge, and the higher the expected improvement.

The expected improvement index can be calculated from the number of candidate segments with the same methods that the first embodiment uses to derive the index from the estimated score. Whichever method is used, the condition the calculation must satisfy is that a larger number of candidates tends to yield a higher expected improvement.
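A minimal sketch of the candidate-count variant follows, reusing the linear form with a positive slope and ranking units so that those with more substitutes come first. The slope value and the per-syllable counts are invented for illustration.

```python
def index_from_candidate_count(n_candidates, a=0.01, b=0.0):
    """Linear index that grows with the number of candidate segments (a > 0)."""
    if a <= 0:
        raise ValueError("a must be positive")
    return a * n_candidates + b


def rank_units_by_candidates(candidate_counts):
    """Sort (unit_label, count) pairs by decreasing expected improvement index."""
    return sorted(candidate_counts,
                  key=lambda item: index_from_candidate_count(item[1]),
                  reverse=True)


# Hypothetical candidate counts for three syllables.
counts = [("わ", 120), ("ヴぁ", 3), ("お", 95)]
ranking = rank_units_by_candidates(counts)
```

The rarely recorded syllable "ヴぁ" ends up last, reflecting that deleting its current segment leaves few alternatives.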

As described above, this embodiment exploits the property that more candidates make a segment with a high selection score more likely to emerge, and raises the expected improvement index where candidates are many. In particular, because the index is calculated without the segment selection score, it can be more useful than that of the first embodiment in situations where the segment selection score is not sufficiently reliable.
[Third Embodiment]
Next, a speech synthesizer according to the third embodiment of the present invention is described. FIG. 6 is a block diagram showing the configuration of the speech synthesizer according to the third embodiment.

Compared with the first embodiment shown in FIG. 1, the configuration of this embodiment shown in FIG. 6 is characterized by a prohibited segment information acquisition unit 112, an expected improvement index calculation unit 142, and a prohibited segment acquisition count calculation unit 152.

以下、図６のブロック図を参照しながら、第３の実施の形態による音声合成装置の詳細な動作について説明する。 The detailed operation of the speech synthesizer according to the third embodiment will be described below with reference to the block diagram of FIG.

FIG. 7 is a flowchart for explaining the operation of the speech synthesizer according to the third embodiment of the present invention. Compared with FIG. 2, the flowchart for the first embodiment, steps S101, S102, S103, S104, S105, S106, S108, and S110 are the same; step S307 replaces step S107 of FIG. 2, step S309 replaces step S109, and step S311 is added.

In step S307, the expected improvement index calculation unit 142 estimates, for each unit, the likelihood that deleting the segment will improve sound quality, based on the prohibited segment acquisition count supplied from the prohibited segment acquisition count calculation unit 152, and outputs the estimate as the expected improvement index.

When the expected improvement index is calculated from the prohibited segment acquisition count, the index is basically set low where the count is high, because the alternative segment is unlikely to achieve high sound quality even if deletion is designated. When deletion is performed several times on the same unit, segments are deleted in descending order of score. At a position where many deletions have been made, one is therefore waiting for a high-quality segment to turn up among the comparatively low-scoring candidates for that position, so the prospect of improving sound quality there declines.

The expected improvement index can be calculated from the prohibited segment acquisition count with the same methods that the first embodiment uses to derive the index from the estimated score. Whichever method is used, the condition the calculation must satisfy is that a smaller acquisition count tends to yield a higher expected improvement. When the linear function of equation (2) is used, the coefficient a must therefore be a negative real number.

In step S108, the prohibited segment information acquisition unit 112 refers to the input segment deletion command and the optimum segment information stored in the optimum segment information storage unit 13 to determine the segments to be deleted, that is, the segments to be prohibited. If a segment deletion command has been received (YES), it passes the information on the segments to be prohibited to the prohibited segment information storage unit 12 and the prohibited segment acquisition count calculation unit 152, and sends the candidate segment acquisition unit 3 and the prohibited segment acquisition count calculation unit 152 a signal notifying them that the prohibited segment information storage unit 12 has been updated (step S309).

In step S311, each time the prohibited segment acquisition count calculation unit 152 receives from the prohibited segment information acquisition unit 112 the signal that the prohibited segment information storage unit 12 has been updated, it updates the number of times prohibited segment information has been acquired and passes the count to the expected improvement index calculation unit 142.

The count held by the prohibited segment acquisition count calculation unit 152 is initialized to zero, so if no signal notifying an update of the prohibited segment information storage unit 12 is ever received, the unit outputs zero.

The prohibited segment acquisition count is kept per unit: the number of deletions is counted for each unit (for each syllable, if the synthesis unit is the syllable).
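The per-unit counter of step S311 and a linear index with a negative slope could be sketched as follows. The class name, method name, and coefficient values are assumptions for this sketch.

```python
class ProhibitedAcquisitionCounter:
    """Counts, per synthesis unit, how often a segment has been prohibited."""

    def __init__(self, num_units):
        self.counts = [0] * num_units  # initial value is zero for every unit

    def on_update(self, deleted_positions):
        """Called once per update signal with the positions just deleted."""
        for pos in deleted_positions:
            self.counts[pos] += 1
        return list(self.counts)


def index_from_deletion_count(count, a=-1.0, b=0.0):
    """Linear index that falls as the deletion count rises (a < 0)."""
    if a >= 0:
        raise ValueError("a must be negative")
    return a * count + b


counter = ProhibitedAcquisitionCounter(num_units=5)  # e.g. five syllables
counter.on_update([0, 3])         # first round: positions 0 and 3 deleted
counts = counter.on_update([0])   # second round: position 0 deleted again
indices = [index_from_deletion_count(c) for c in counts]
```

After the two rounds the first unit has been deleted twice and therefore receives the lowest index, while untouched units keep the highest.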

As described above, this embodiment exploits the property that a lower prohibited segment acquisition count makes a segment with a high selection score more likely to emerge, and raises the expected improvement index where the count is low.

Because the index is calculated without the segment selection score, it can be more useful than that of the first embodiment in situations where the segment selection score is not sufficiently reliable.

The method is also effective when deletions have already been repeated at several positions and, to improve the overall sound quality of the synthesized speech, one wants to give priority to positions where few deletions have been made.
[Fourth Embodiment]
Next, a speech synthesizer according to the fourth embodiment of the present invention is described. FIG. 8 is a block diagram showing the configuration of the speech synthesizer according to the fourth embodiment.

Compared with the first embodiment shown in FIG. 1, the configuration of this embodiment shown in FIG. 8 is characterized by a candidate segment acquisition unit 31, a prohibited segment information acquisition unit 112, an expected improvement index calculation unit 143, and a prohibited segment acquisition count calculation unit 152.

The candidate segment acquisition unit 31 is equivalent to the unit of the same number in the block diagram of the second embodiment (FIG. 4), and the prohibited segment information acquisition unit 112 and the prohibited segment acquisition count calculation unit 152 are equivalent to the units of the same numbers in the block diagram of the third embodiment (FIG. 6).

以下、図８のブロック図を参照しながら、第４の実施の形態による音声合成装置の詳細な動作について説明する。 The detailed operation of the speech synthesizer according to the fourth embodiment will be described below with reference to the block diagram of FIG.

FIG. 9 is a flowchart for explaining the operation of the speech synthesizer according to the fourth embodiment of the present invention. Compared with FIG. 2, the flowchart for the first embodiment, steps S101, S102, S104, S105, S106, S108, and S110 are the same.

However, step S203, equivalent to step S203 of FIG. 5, replaces step S103 of FIG. 2; step S309, equivalent to step S309 of FIG. 7, replaces step S109 of FIG. 2; step S407 replaces step S107 of FIG. 2; and step S311, equivalent to step S311 of FIG. 7, is added.

In step S407, the expected improvement index calculation unit 143 estimates, for each unit, the likelihood that deleting the segment will improve sound quality, based on the segment selection information supplied from the segment selection unit 4, the number of candidate segments supplied from the candidate segment acquisition unit 31, and the prohibited segment acquisition count supplied from the prohibited segment acquisition count calculation unit 152, and outputs the estimate as the expected improvement index.

Examples that use the segment selection information, the number of candidate segments, and the prohibited segment acquisition count independently were described in the first, second, and third embodiments; this embodiment describes how to use them in combination.

In this embodiment, an estimated score is first obtained from the segment selection information; the expected improvement index is then made larger the higher the estimated score, the larger the number of candidate segments, and the lower the prohibited segment acquisition count. One way to calculate the index is as a weighted sum of the estimated score, the number of candidate segments, and the prohibited segment acquisition count. For the estimated score $S_1$, the number of candidate segments $S_2$, and the prohibited segment acquisition count $S_3$, the expected improvement index $T$ is given by equation (3):

$$T = a_1 S_1 + a_2 S_2 + a_3 S_3 + b \tag{3}$$

where $a_1$, $a_2$, $a_3$, and $b$ are real numbers satisfying $a_1 > 0$, $a_2 > 0$, and $a_3 < 0$. Alternatively, as shown in FIG. 10, a table relating the estimated score, the number of candidate segments, and the prohibited segment acquisition count to the expected improvement index can be prepared in advance and consulted to determine the index value.
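The weighted sum of the three inputs, with the sign constraints stated above, might be written as follows. The default coefficient values are arbitrary placeholders; the specification gives no numerical values.

```python
def combined_index(estimated_score, n_candidates, n_deletions,
                   a1=1.0, a2=0.01, a3=-0.5, b=0.0):
    """Weighted sum of estimated score, candidate count, and deletion count.

    Requires a1 > 0, a2 > 0, a3 < 0 so that the index rises with the score
    and the candidate count and falls with the deletion count.
    """
    if not (a1 > 0 and a2 > 0 and a3 < 0):
        raise ValueError("need a1 > 0, a2 > 0, a3 < 0")
    return a1 * estimated_score + a2 * n_candidates + a3 * n_deletions + b


# Two units with the same estimated score: the other inputs break the tie.
unit_a = combined_index(0.8, n_candidates=100, n_deletions=1)
unit_b = combined_index(0.8, n_candidates=20, n_deletions=3)
```

Here `unit_a` exceeds `unit_b` because it has more candidates and fewer past deletions, which is the kind of correction between otherwise equal units that combining the inputs is meant to allow.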

Other options are to use an exponential function, a function of second or higher order, or a polynomial function in place of equation (3), or to output the estimated score, the number of candidate segments, and the prohibited segment acquisition count as they are.

The methods above calculate the expected improvement index directly from the estimated score, the number of candidate segments, and the prohibited segment acquisition count. Alternatively, an expected improvement index may be calculated separately by each of the methods described in the first, second, and third embodiments, and a single expected improvement index may then be calculated from those indices.

As described above, this embodiment calculates the expected improvement index from the segment selection score, the number of candidate segments, and the prohibited segment acquisition count, and can therefore obtain a more reliable expected improvement index than the first embodiment.

In particular, when one item of information (for example, the segment selection score) is equal between units, the expected improvement can be corrected using the other items (the number of candidate segments or the prohibited segment acquisition count), so a better expected improvement index can be expected than when each item is used alone.

The present invention is not limited to the speech synthesizers described in the embodiments, and its configuration and operation may be modified as appropriate without departing from the spirit of the invention.

The present invention can be implemented in hardware, software, or a combination of the two.

The present invention can be used in a speech synthesizer, a speech synthesis method, and a speech synthesis program that convert text into speech with high sound quality.

FIG. 1 is a block diagram showing the configuration of the speech synthesizer according to the first embodiment of the present invention.
FIG. 2 is a flowchart for explaining the operation of the speech synthesizer according to the first embodiment of the present invention.
FIG. 3 is an example of a table that the expected improvement index calculation unit of FIG. 1 uses to calculate the expected improvement index.
FIG. 4 is a block diagram showing the configuration of the speech synthesizer according to the second embodiment of the present invention.
FIG. 5 is a flowchart for explaining the operation of the speech synthesizer according to the second embodiment of the present invention.
FIG. 6 is a block diagram showing the configuration of the speech synthesizer according to the third embodiment of the present invention.
FIG. 7 is a flowchart for explaining the operation of the speech synthesizer according to the third embodiment of the present invention.
FIG. 8 is a block diagram showing the configuration of the speech synthesizer according to the fourth embodiment of the present invention.
FIG. 9 is a flowchart for explaining the operation of the speech synthesizer according to the fourth embodiment of the present invention.
FIG. 10 is an example of a table that the expected improvement index calculation unit of FIG. 8 uses to calculate the expected improvement index.
FIG. 11 is a block diagram showing a configuration example of a general speech synthesizer.

Explanation of symbols

1 Language processing unit
2 Prosody generation unit
3, 31 Candidate segment acquisition unit
4 Segment selection unit
5 Segment information storage unit
6 Waveform generation unit
11, 112 Use-prohibited segment information acquisition unit
12 Use-prohibited segment information storage unit
13 Optimal segment information storage unit
140, 141, 142, 143 Expected improvement index calculation unit
152 Use-prohibited segment acquisition count calculation unit

Claims

A speech synthesizer comprising: a language processing unit that performs language processing, including reading and accent analysis, morphological analysis, and syntax analysis, on an input text; a prosody generation unit that generates prosodic information on the intensity, duration, and pitch of sound based on the result of the language processing; a candidate segment acquisition unit that acquires, as candidate segments, speech segments that may be used for synthesized speech based on the result of the language processing; a segment selection unit that calculates, based on the prosodic information and the result of the language processing, a segment selection score, which is an index of how appropriate each candidate segment is for speech synthesis, and selects from the candidate segments the speech segments best suited to the synthesized speech as optimal segments; and a waveform generation unit that generates a synthesized speech waveform based on the optimal segments, wherein, when the optimal segments include a segment to be deleted, the candidate segment acquisition unit again acquires candidate segments excluding that segment and the waveform generation unit regenerates the synthesized speech waveform,
the speech synthesizer further comprising an expected improvement index calculation unit that calculates an expected improvement index representing how likely the sound quality is to improve when the segment to be deleted is deleted,
wherein the expected improvement index calculation unit increases the expected improvement index when the segment selection score is large.

The speech synthesizer according to claim 1, further comprising a deletion designation count calculation unit that calculates a deletion designation count, which is the number of times a segment to be deleted has been designated,
wherein the expected improvement index calculation unit increases the expected improvement index when the deletion designation count is small.

The speech synthesizer according to claim 1 or 2, wherein the expected improvement index calculation unit increases the expected improvement index when the number of candidate segments acquired by the candidate segment acquisition unit is large.

The speech synthesizer according to any one of claims 1 to 3, wherein the expected improvement index calculation unit estimates the segment selection score that would result after the segment to be deleted is deleted, and calculates the expected improvement index based on the estimated segment selection score.

The speech synthesizer according to claim 4, wherein the expected improvement index calculation unit estimates, as the segment selection score after the segment to be deleted is deleted, a weighted sum of those segment selection scores that show high values.

The speech synthesizer according to claim 4, wherein the expected improvement index calculation unit estimates, as the segment selection score after the segment to be deleted is deleted, the average of those segment selection scores that show high values.

A speech synthesis method comprising: a language processing procedure that performs language processing, including reading and accent analysis, morphological analysis, and syntax analysis, on an input text; a prosody generation procedure that generates prosodic information on the intensity, duration, and pitch of sound based on the result of the language processing; a candidate segment acquisition procedure that acquires, as candidate segments, speech segments that may be used for synthesized speech based on the result of the language processing; a segment selection procedure that calculates, based on the prosodic information and the result of the language processing, a segment selection score, which is an index of how appropriate each candidate segment is for speech synthesis, and selects from the candidate segments the speech segments best suited to the synthesized speech as optimal segments; and a waveform generation procedure that generates a synthesized speech waveform based on the optimal segments, wherein, when the optimal segments include a segment to be deleted, the candidate segment acquisition procedure again acquires candidate segments excluding that segment and the waveform generation procedure regenerates the synthesized speech waveform,
the speech synthesis method further comprising an expected improvement index calculation procedure that calculates an expected improvement index representing how likely the sound quality is to improve when the segment to be deleted is deleted,
wherein the expected improvement index calculation procedure increases the expected improvement index when the segment selection score is large.

The speech synthesis method according to claim 7, further comprising a deletion designation count calculation procedure that calculates a deletion designation count, which is the number of times a segment to be deleted has been designated,
wherein the expected improvement index calculation procedure increases the expected improvement index when the deletion designation count is small.

The speech synthesis method according to claim 7 or 8, wherein the expected improvement index calculation procedure increases the expected improvement index when the number of candidate segments acquired in the candidate segment acquisition procedure is large.

The speech synthesis method according to any one of claims 7 to 9, wherein the expected improvement index calculation procedure estimates the segment selection score that would result after the segment to be deleted is deleted, and calculates the expected improvement index based on the estimated segment selection score.

The speech synthesis method according to claim 10, wherein the expected improvement index calculation procedure estimates, as the segment selection score after the segment to be deleted is deleted, a weighted sum of those segment selection scores that show high values.

The speech synthesis method according to claim 10, wherein the expected improvement index calculation procedure estimates, as the segment selection score after the segment to be deleted is deleted, the average of those segment selection scores that show high values.

A speech synthesis program that causes a computer to execute: prosody generation processing that generates prosodic information on the intensity, duration, and pitch of sound based on the result of language processing, including reading and accent analysis, morphological analysis, and syntax analysis, performed on an input text; candidate segment acquisition processing that acquires, as candidate segments, speech segments that may be used for synthesized speech based on the result of the language processing; segment selection processing that calculates, based on the prosodic information and the result of the language processing, a segment selection score, which is an index of how appropriate each candidate segment is for speech synthesis, and selects from the candidate segments the speech segments best suited to the synthesized speech as optimal segments; and waveform generation processing that generates a synthesized speech waveform based on the optimal segments, wherein, when the optimal segments include a segment to be deleted, the candidate segment acquisition processing again acquires candidate segments excluding that segment and the waveform generation processing regenerates the synthesized speech waveform,
the program further causing the computer to execute expected improvement index calculation processing that calculates an expected improvement index representing how likely the sound quality is to improve when the segment to be deleted is deleted,
wherein the expected improvement index calculation processing increases the expected improvement index when the segment selection score is large.

The speech synthesis program according to claim 13, further causing the computer to execute deletion designation count calculation processing that calculates a deletion designation count, which is the number of times a segment to be deleted has been designated,
wherein the expected improvement index calculation processing increases the expected improvement index when the deletion designation count is small.

The speech synthesis program according to claim 13 or 14, wherein the expected improvement index calculation processing increases the expected improvement index when the number of candidate segments acquired in the candidate segment acquisition processing is large.

The speech synthesis program according to any one of claims 13 to 15, wherein the expected improvement index calculation processing estimates the segment selection score that would result after the segment to be deleted is deleted, and calculates the expected improvement index based on the estimated segment selection score.

The speech synthesis program according to claim 16, wherein the expected improvement index calculation processing estimates, as the segment selection score after the segment to be deleted is deleted, a weighted sum of those segment selection scores that show high values.

The speech synthesis program according to claim 16, wherein the expected improvement index calculation processing estimates, as the segment selection score after the segment to be deleted is deleted, the average of those segment selection scores that show high values.