JP2016118712A

JP2016118712A - Voice synthesis library generation device and voice synthesis device

Info

Publication number: JP2016118712A
Application number: JP2014259287A
Authority: JP
Inventors: Makoto Tachibana (橘 誠); Masashi Yoshida (吉田 雅史)
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2014-12-22
Filing date: 2014-12-22
Publication date: 2016-06-30
Anticipated expiration: 2034-12-22
Also published as: JP6497065B2

Abstract

PROBLEM TO BE SOLVED: To generate synthesized speech that a listener strongly perceives as ambiguous, while reducing the burden on a user who creates libraries for speech synthesis. SOLUTION: A generation unit 32 generates a second library L2 from a first library L1 for speech synthesis that contains segment data Q for each speech segment P. For each speech segment P, among the plurality of speech segments P, that contains a vowel other than a specific vowel, the generation unit 32 generates the second library L2 so that it contains the segment data Q that the first library L1 holds for the speech segment P in which that vowel is replaced with the specific vowel. SELECTED DRAWING: Figure 4

Description

The present invention relates to a technique for synthesizing speech by concatenating speech segments.

Concatenative (segment-connection) speech synthesis, which produces synthesized speech of a desired timbre by connecting a plurality of speech segments extracted from recorded speech, is known. For example, Patent Document 1 discloses a configuration in which a plurality of pieces of stationary sound data are recorded in advance for each speech segment, and stationary sound data for a target pitch is generated by selecting a plurality of pieces of stationary sound data (speech segments) according to the target pitch and interpolating between them.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2013-11863 (JP2013-11863A)

To synthesize speech of various voice qualities, a separate library of speech segments must be prepared for each voice quality. For example, to synthesize two types of speech that differ in phonological clarity (the clarity of the pronounced content), one library of ambiguous speech must be created by recording speech that the speaker pronounces vaguely, and another library of clear speech must be created by recording speech that the speaker pronounces clearly, so the burden of creating the libraries is large. In particular, because it is not easy to keep producing speech while holding the degree of ambiguity or clarity constant, an excessive burden is imposed on the speaker. In view of these circumstances, an object of the present invention is to generate synthesized speech that a listener strongly perceives as ambiguous while reducing the burden of creating libraries for speech synthesis.

To solve the above problem, a speech synthesis library generation device according to a first aspect of the present invention includes a generation unit that generates a second library from a first library for speech synthesis containing segment data for each speech segment. For each speech segment, among the plurality of speech segments, that contains a vowel other than a specific vowel, the generation unit generates a second library containing the segment data that the first library holds for the speech segment in which that vowel is replaced with the specific vowel. Because the second library is generated from the existing first library, the burden of creating libraries for speech synthesis is smaller than when the first library and the second library are prepared separately from different recordings. In addition, because the second library contains, for each speech segment with a vowel other than the specific vowel, the first library's segment data for the speech segment in which that vowel is replaced with the specific vowel, synthesized speech that a listener perceives as ambiguous can be generated from the segment data of the second library. For example, the vowel "u" (う) is pronounced with a smaller mouth opening than the other vowels ("a", "i", "e", "o"; あ, い, え, お), and listeners tend to perceive its phonology as ambiguous. Given this tendency, a configuration in which "u" is the specific vowel is preferable.

In a preferred example of the speech synthesis library generation device according to the first aspect, for a speech segment that contains a vowel and a specific consonant whose acoustic characteristics approximate those of the specific vowel, the generation unit generates a second library containing segment data that represents the section of the specific consonant within the first library's segment data for that speech segment. In this aspect, the second library holds, for such a speech segment, segment data representing only the specific-consonant section. Therefore, even when the first library has no segment data for a speech segment containing the specific vowel, synthesized speech that a listener perceives as ambiguous can be generated using the segment data representing the specific-consonant section.

In a preferred example of the speech synthesis library generation device according to the first aspect, for a speech segment that contains a first consonant and a vowel, the generation unit generates a second library containing the segment data that the first library holds for a speech segment containing the specific vowel and a second consonant different from the first consonant. In this aspect, even when the first library has no segment data containing the first consonant and the specific vowel, synthesized speech that a listener perceives as ambiguous can be generated using the segment data of the speech segment containing the second consonant and the specific vowel.
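
The fallback described above can be sketched as a lookup that first tries the first consonant with the specific vowel and only then tries a second consonant. This is a minimal sketch under stated assumptions: the table `FALLBACK_CONSONANT`, its single entry, and the function name `source_segment` are hypothetical, and the source does not say which second consonant is paired with which first consonant.

```python
from typing import Dict, Optional, Tuple

Diphone = Tuple[str, str]  # (consonant, vowel) in X-SAMPA-like symbols

SPECIFIC_VOWEL = "M"

# Hypothetical pairing of a first consonant with a second consonant.
# The source does not list concrete pairs; this entry is illustrative only.
FALLBACK_CONSONANT: Dict[str, str] = {"t": "ts"}


def source_segment(segment: Diphone, library: Dict[Diphone, object]) -> Optional[Diphone]:
    """Return the first-library key whose data stands in for `segment`.

    Tries [C1-M] first; if absent, tries [C2-M] for the paired consonant C2.
    Returns None when neither entry exists in the library.
    """
    consonant, _vowel = segment
    direct = (consonant, SPECIFIC_VOWEL)
    if direct in library:
        return direct
    alternative = FALLBACK_CONSONANT.get(consonant)
    if alternative is not None and (alternative, SPECIFIC_VOWEL) in library:
        return (alternative, SPECIFIC_VOWEL)
    return None


demo_library = {("t", "a"): "Q[t-a]", ("ts", "M"): "Q[ts-M]",
                ("s", "a"): "Q[s-a]", ("s", "M"): "Q[s-M]"}
```

With `demo_library`, `source_segment(("t", "a"), demo_library)` falls back to `("ts", "M")` because no `("t", "M")` entry exists, while `("s", "a")` resolves directly to `("s", "M")`.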

To solve the above problem, a speech synthesizer according to a second aspect of the present invention includes: a generation unit that generates a second library from a first library for speech synthesis containing segment data for each speech segment, the second library containing, for each speech segment among the plurality of speech segments that contains a vowel other than a specific vowel, the segment data that the first library holds for the speech segment in which that vowel is replaced with the specific vowel; a segment selection unit that selects first segment data, which is the first library's segment data for the speech segment corresponding to a pronunciation character to be synthesized, and second segment data, which is the second library's segment data for that speech segment; a segment mixing unit that mixes the first segment data and the second segment data; and a synthesis processing unit that generates synthesized speech using the segment data mixed by the segment mixing unit. In the second aspect, the second library contains, for each speech segment with a vowel other than the specific vowel, the first library's segment data for the speech segment in which that vowel is replaced with the specific vowel, so the burden of creating libraries for speech synthesis can be reduced as in the first aspect.
In addition, because synthesized speech is generated using the segment data obtained by mixing the first segment data and the second segment data, synthesized speech that a listener perceives as ambiguous (or clear) to a degree that depends on the degree of mixing can be generated.
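
To make the role of the segment mixing unit concrete, the following is a minimal sketch that blends two equal-length sample sequences with a ratio `k`. The linear blend of time-domain samples, the function name `mix_segments`, and the convention that `k = 1` means "second segment data only" are all assumptions for illustration; the passage above does not specify how the mixing is computed.

```python
from typing import List, Sequence


def mix_segments(first: Sequence[float], second: Sequence[float], k: float) -> List[float]:
    """Blend first (clear) and second (ambiguous) segment data sample by sample.

    k = 0.0 returns the first segment data unchanged, k = 1.0 returns the
    second segment data unchanged, and values in between interpolate linearly.
    """
    if not 0.0 <= k <= 1.0:
        raise ValueError("mixing ratio k must lie in [0, 1]")
    if len(first) != len(second):
        raise ValueError("this sketch assumes equal-length segment data")
    return [(1.0 - k) * a + k * b for a, b in zip(first, second)]
```

For instance, `mix_segments([1.0, 0.0], [0.0, 1.0], 0.25)` weights the first segment data by 0.75 and the second by 0.25.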

To solve the above problem, a speech synthesizer according to a third aspect of the present invention includes: a segment selection unit that selects, from a library for speech synthesis containing segment data for each speech segment, the segment data of the speech segment corresponding to a pronunciation character to be synthesized, and that, for a speech segment containing a vowel other than a specific vowel, selects the segment data of the speech segment in which that vowel is replaced with the specific vowel; and a synthesis processing unit that generates synthesized speech using the speech segments selected by the segment selection unit. In the third aspect, for a speech segment containing a vowel other than the specific vowel, the segment data of the speech segment in which that vowel is replaced with the specific vowel is selected from the existing library for speech synthesis. Compared with a configuration in which a plurality of libraries for different voice qualities are prepared in advance, synthesized speech that a listener perceives as ambiguous can therefore be generated while reducing the burden of creating libraries for speech synthesis. Furthermore, compared with the first aspect, which must store the second library generated from the first library, the third aspect has the advantage that the second library need not be stored.
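
The difference from the first aspect is that the substitution happens at selection time, so no second library is stored. A minimal sketch, assuming a dictionary-based library keyed by diphone and using [M] as the specific vowel (the function names and the dictionary layout are hypothetical):

```python
from typing import Dict, Tuple, TypeVar

Diphone = Tuple[str, str]
Q = TypeVar("Q")  # whatever type the library uses for segment data

VOWELS = frozenset({"a", "i", "M", "e", "o"})
SPECIFIC_VOWEL = "M"


def substitute_vowels(segment: Diphone) -> Diphone:
    """Replace every vowel in the diphone with the specific vowel."""
    first, second = segment
    return (SPECIFIC_VOWEL if first in VOWELS else first,
            SPECIFIC_VOWEL if second in VOWELS else second)


def select_ambiguous(segment: Diphone, library: Dict[Diphone, Q]) -> Q:
    """Select, from the single existing library, the data of the substituted segment.

    Raises KeyError when the substituted segment is not in the library;
    the fallbacks for that case are outside this sketch.
    """
    return library[substitute_vowels(segment)]


single_library = {("s", "a"): "Q[s-a]", ("s", "M"): "Q[s-M]", ("M", "M"): "Q[M-M]"}
```

Here `select_ambiguous(("s", "a"), single_library)` returns the entry stored for `("s", "M")`, and only `single_library` has to be kept in storage.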

In a preferred example of the speech synthesizer according to the third aspect, for a speech segment that contains a vowel and a specific consonant whose acoustic characteristics approximate those of the specific vowel, the segment selection unit selects segment data that represents the section of the specific consonant within the segment data of that speech segment. In this aspect, even when the library for speech synthesis has no segment data for a speech segment containing the specific vowel, synthesized speech that a listener perceives as ambiguous can be generated using the segment data representing the specific-consonant section.

In a preferred example of the speech synthesizer according to the third aspect, for a speech segment that contains a first consonant and a vowel, the segment selection unit selects the data of a speech segment containing the specific vowel and a second consonant different from the first consonant. In this aspect, even when the library for speech synthesis has no segment data containing the first consonant and the specific vowel, synthesized speech that a listener perceives as ambiguous can be generated using the segment data of the speech segment containing the second consonant and the specific vowel.

In a preferred example of the speech synthesizer according to the third aspect, for a speech segment containing a vowel other than the specific vowel, the segment selection unit selects first segment data of that speech segment and second segment data of the speech segment in which that vowel is replaced with the specific vowel. The speech synthesizer further includes a segment mixing unit that mixes the first segment data and the second segment data, and the synthesis processing unit generates synthesized speech using the segment data mixed by the segment mixing unit. Because the first segment data and the second segment data are mixed, this aspect has the advantage that speech of more varied voice qualities can be generated than with a configuration that generates synthesized speech from the second segment data alone.

In a preferred example of the speech synthesizer according to the second and third aspects, the speech synthesizer includes a variable setting unit that sets a mixing ratio according to a feature of the speech to be synthesized, and the segment mixing unit mixes the first segment data and the second segment data at the mixing ratio set by the variable setting unit. Because the mixing ratio between the first segment data and the second segment data is set according to a feature of the speech to be synthesized, the burden on the user is smaller than, for example, in a configuration in which the user specifies the mixing ratio. A good example of the feature is pitch. For example, given the tendency that speech in a higher register is more readily perceived as phonologically ambiguous, a configuration is preferable in which the variable setting unit sets the mixing ratio so that the proportion of the second segment data increases as the pitch to be synthesized increases.
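
One way to realize "a higher pitch gives a higher proportion of second segment data" is a clamped linear ramp over note numbers. This is a sketch under assumptions: the passage only requires that the ratio grow with pitch, so the linear shape, the function name `mixing_ratio`, and the default bounds `low=60` and `high=84` are hypothetical choices.

```python
def mixing_ratio(pitch: int, low: int = 60, high: int = 84) -> float:
    """Map a note number to the proportion of second (ambiguous) segment data.

    Returns 0.0 at or below `low`, 1.0 at or above `high`, and rises
    linearly in between, so the ratio never decreases as pitch increases.
    """
    if high <= low:
        raise ValueError("high must be greater than low")
    ratio = (pitch - low) / (high - low)
    return min(1.0, max(0.0, ratio))
```

With the default bounds, note number 72 lies halfway between 60 and 84, so `mixing_ratio(72)` evaluates to 0.5.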

The library generation device and the speech synthesizer according to each of the above aspects can be realized by dedicated hardware (electronic circuits), or by cooperation between a program and a general-purpose arithmetic processing device such as a CPU (Central Processing Unit). The program of the present invention can be provided in a form stored on a computer-readable recording medium and installed on a computer. The recording medium is, for example, a non-transitory recording medium. An optical recording medium (optical disc) such as a CD-ROM is a good example, but the recording medium may be of any known type, such as a semiconductor recording medium or a magnetic recording medium. The program of the present invention can also be provided, for example, by distribution over a communication network and installed on a computer. The present invention is also specified as a method of operating the speech synthesis library generation device (library generation method) and a method of operating the speech synthesizer (speech synthesis method) according to each of the aspects described above.

Block diagram of the speech synthesizer according to the first embodiment.
Explanatory diagram of the editing screen 40.
Explanatory diagram of the replacement processing of segment data Q by the generation unit 32.
Explanatory diagram of the segment data Q corresponding to a speech segment P.
Flowchart of the library generation processing of the first embodiment.
Block diagram of the speech synthesizer according to the second embodiment.
Flowchart of the segment selection processing of the second embodiment.
Block diagram of the speech synthesizer according to the third embodiment.
Explanatory diagram of the setting of the mixing ratio K in the third embodiment.
Block diagram of the speech synthesizer in the fourth embodiment.
Explanatory diagram of interpolation processing in a speech synthesizer according to a modification.
Explanatory diagram of interpolation processing in a speech synthesizer according to a modification.

<First Embodiment>
FIG. 1 is a block diagram of a speech synthesizer 100 according to a first embodiment of the present invention. The speech synthesizer 100 of the first embodiment is a signal processing device that generates a speech signal V of the singing voice of an arbitrary piece of music (hereinafter "synthesized music") by concatenative speech synthesis, which connects a plurality of speech segments.

As illustrated in FIG. 1, the speech synthesizer 100 is realized by a computer system (for example, an information processing device such as a mobile phone or a personal computer) that includes an arithmetic processing device 10, a storage device 12, a display device 14, an input device 16, and a sound emitting device 18. The display device 14 (for example, a liquid crystal display panel) displays images as instructed by the arithmetic processing device 10. The input device 16 is an operating device (for example, a keyboard or a pointing device such as a mouse) that the user operates to give various instructions to the speech synthesizer 100, and includes, for example, a plurality of controls operated by the user. A touch panel integrated with the display device 14 may also be used as the input device 16. The sound emitting device 18 (for example, a speaker or headphones) reproduces sound corresponding to the speech signal V.

The storage device 12 stores a program PGM executed by the arithmetic processing device 10 and various data used by the arithmetic processing device 10. Any known recording medium, such as a semiconductor recording medium or a magnetic recording medium, or a combination of several types of recording media may be used as the storage device 12. The storage device 12 of the first embodiment stores a first library L1 and synthesis information S, as described below.

The first library L1 is a set of a plurality of pieces of segment data Q corresponding to different speech segments P (a library for speech synthesis). Each speech segment P in the first embodiment is a phoneme chain (diphone) that connects two phonemes. The segment data Q of one speech segment P in the first library L1 is data representing the speech waveform of the section of a specific speaker's recorded speech that corresponds to that speech segment P. Specifically, each piece of segment data Q can be expressed as a sequence of samples of the time-domain speech waveform, or as a time series of frequency-domain spectra calculated for each frame of the speech waveform. Each piece of segment data Q in the first library L1 of this embodiment represents speech whose phonology a listener can clearly perceive.
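
As a rough illustration of this data layout, the sketch below models a library as a mapping from a diphone (a pair of phoneme symbols) to time-domain segment data. The type names, the `sample_rate` field and its default, and the toy sample values are assumptions for illustration; the source does not define a storage format.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# A speech segment P is a diphone: a pair of X-SAMPA phoneme symbols.
Diphone = Tuple[str, str]


@dataclass(frozen=True)
class SegmentData:
    """Segment data Q for one diphone (field layout is an assumption)."""
    samples: Tuple[float, ...]   # time-domain sample sequence of the waveform
    sample_rate: int = 44100     # assumed value; the source gives no rate


# A library maps each speech segment P to its segment data Q.
Library = Dict[Diphone, SegmentData]


def build_library(recordings: Dict[Diphone, Tuple[float, ...]]) -> Library:
    """Wrap raw sample tuples into SegmentData entries keyed by diphone."""
    return {segment: SegmentData(samples=samples)
            for segment, samples in recordings.items()}


# Toy data standing in for excerpts of a speaker's recorded speech.
first_library = build_library({
    ("s", "a"): (0.0, 0.1, 0.3),
    ("s", "M"): (0.0, 0.05, 0.1),
    ("a", "Sil"): (0.3, 0.1, 0.0),
})
```

A frequency-domain variant would store a sequence of per-frame spectra in place of `samples`; the mapping from diphone to data stays the same.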

The synthesis information S is time-series data that specifies the singing voice of the synthesized music. For each note of the synthesized music, it specifies, in time series, a pronunciation character X1, a sounding period X2, and a pitch X3 (for example, a note number). The pronunciation character X1 specifies the pronounced content of the speech to be synthesized (that is, the lyrics of the synthesized music). The sounding period X2 is defined, for example, by a start time and a duration (or an end time).
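
The per-note fields can be sketched as a small record type. The class name `Note`, the choice of seconds as the time unit, and the use of start plus duration (rather than an end time) are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Note:
    lyric: str        # pronunciation character X1
    start: float      # start of sounding period X2, in seconds (assumed unit)
    duration: float   # length of sounding period X2, in seconds (assumed unit)
    pitch: int        # pitch X3 as a note number

    @property
    def end(self) -> float:
        """End time of the sounding period, derived from start and duration."""
        return self.start + self.duration


def as_time_series(notes: Iterable[Note]) -> List[Note]:
    """Order the notes by start time so they form a time series."""
    return sorted(notes, key=lambda note: note.start)


synthesis_info = as_time_series([
    Note(lyric="ら", start=0.5, duration=0.5, pitch=62),
    Note(lyric="さ", start=0.0, duration=0.5, pitch=60),
])
```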

The arithmetic processing device 10 (CPU) in FIG. 1 executes the program PGM stored in the storage device 12 to realize a plurality of functions for editing the synthesis information S and generating the speech signal V (a generation unit 32, a display control unit 33, a segment selection unit 34, an instruction reception unit 35, and a synthesis processing unit 36). A configuration in which the functions of the arithmetic processing device 10 are distributed over a plurality of devices, or a configuration in which a dedicated electronic circuit (for example, a DSP) realizes some of the functions of the arithmetic processing device 10, may also be adopted.

The instruction reception unit 35 receives instructions that the user gives by operating the input device 16. The display control unit 33 causes the display device 14 to display various images. Specifically, the display control unit 33 of the first embodiment causes the display device 14 to display the editing screen 40 of FIG. 2, on which the user checks the content of the synthesized music specified by the synthesis information S. As illustrated in FIG. 2, the note sequence image 42 of the editing screen 40 is a piano-roll coordinate plane on which a time axis (horizontal axis) and a pitch axis (vertical axis) intersect.

For each note specified by the synthesis information S, the display control unit 33 places a note figure 54 and the pronunciation character X1 on the editing screen 40. The note figure 54 is an image that represents one note of the synthesized music. Specifically, the position of the note figure 54 along the pitch axis is set according to the pitch X3 specified by the synthesis information S, and the position and display length of the note figure 54 along the time axis are set according to the sounding period X2 specified by the synthesis information S. The pronunciation character X1 is placed near the head of the note figure 54.

By operating the input device 16 as appropriate while checking the note sequence image 42 on the editing screen 40 of FIG. 2, the user can instruct the addition or movement of a note figure 54 and the addition or change of a pronunciation character X1. The display control unit 33 updates the editing screen 40 (the note figures 54 and pronunciation characters X1) according to the instructions that the instruction reception unit 35 receives from the user.

The segment selection unit 34 in FIG. 1 sequentially selects, from the first library L1, the segment data Q of the speech segments P corresponding to the pronunciation character X1 of each note specified by the synthesis information S. The synthesis processing unit 36 generates the speech signal V using the synthesis information S and the segment data Q that the segment selection unit 34 selected from the first library L1. Specifically, the synthesis processing unit 36 adjusts the speech waveform represented by each piece of selected segment data Q to the pitch X3 and sounding period X2 specified by the synthesis information S, and then connects the adjusted waveforms to generate the speech signal V. The speech signal V generated by the synthesis processing unit 36 is supplied to the sound emitting device 18, which reproduces the singing voice of the synthesized music. As described above, each piece of segment data Q represents clearly pronounced speech, so a listener perceives the sound reproduced from a speech signal V generated with the first library L1 as clear.
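
The selection step can be sketched as converting the lyrics of consecutive notes into a phoneme sequence, pairing adjacent phonemes into diphones, and looking each diphone up in the library. This is a sketch under assumptions: the table `KANA_TO_PHONEMES` is a hypothetical three-entry stand-in for a real lyric-to-phoneme conversion, the notes are treated as one uninterrupted phrase bounded by silence, and the pitch and duration adjustment performed by the synthesis processing unit is omitted.

```python
from typing import Dict, List, Tuple, TypeVar

Diphone = Tuple[str, str]
Q = TypeVar("Q")  # whatever type the library uses for segment data

# Hypothetical lyric-to-phoneme table in X-SAMPA; a real system needs a full one.
KANA_TO_PHONEMES: Dict[str, List[str]] = {
    "さ": ["s", "a"],
    "く": ["k", "M"],
    "あ": ["a"],
}


def to_diphones(lyrics: List[str]) -> List[Diphone]:
    """Turn a phrase of lyrics into the chain of diphones that spans it."""
    phonemes = ["Sil"]                      # phrase starts from silence
    for lyric in lyrics:
        phonemes.extend(KANA_TO_PHONEMES[lyric])
    phonemes.append("Sil")                  # phrase ends in silence
    return list(zip(phonemes, phonemes[1:]))


def select_segments(lyrics: List[str], library: Dict[Diphone, Q]) -> List[Q]:
    """Select segment data in order; raises KeyError for a missing diphone."""
    return [library[segment] for segment in to_diphones(lyrics)]
```

For the lyrics `["さ", "く"]`, the phoneme sequence is Sil, s, a, k, M, Sil, which yields the five diphones [Sil-s], [s-a], [a-k], [k-M], and [M-Sil].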

The generation unit 32 in FIG. 1 generates, from the existing first library L1, a second library L2 for generating synthesized speech whose phonology is perceived as more ambiguous than that of a speech signal V generated with the first library L1. Like the first library L1, the second library L2 is a set of a plurality of pieces of segment data Q corresponding to different speech segments P. As illustrated in FIG. 1, the second library L2 generated by the generation unit 32 is stored in the storage device 12 together with the first library L1 and is used by the synthesis processing unit 36 to generate the speech signal V.

Specifically, for each speech segment P that contains a vowel other than a specific vowel that listeners tend to perceive as ambiguous (hereinafter the "specific vowel"), the generation unit 32 generates a second library L2 containing the segment data Q that the first library L1 holds for the speech segment P in which that vowel is replaced with the specific vowel. The vowel [M] (う) is pronounced with a small mouth opening, among other factors, and tends to be perceived as more ambiguous than the other vowels ([a], [i], [e], [o]). Given this tendency, the first embodiment uses the vowel [M] as an example of the specific vowel. Phonemes are written in X-SAMPA (Extended Speech Assessment Methods Phonetic Alphabet). The symbol "Sil" denotes silence. In the following description, the symbol "V" denotes a vowel and the symbol "C" denotes a consonant.

FIG. 3 illustrates the replacement processing by the generation unit 32 in detail. It shows the content of the segment data Q corresponding to speech segments P that are common to the first library L1 and the second library L2. Because each speech segment P in the first library L1 is a phoneme chain that connects two phonemes, two types of speech segment P containing a vowel other than the specific vowel are possible: speech segments consisting of two vowels, and speech segments containing a vowel and a consonant.

(1) Vowel + vowel: [V-V]
FIG. 3(a) is an explanatory diagram of the segment data Q of the speech segments P [V-V], each consisting of two vowels, in the second library L2. As illustrated in FIG. 3(a), the first library L1 stores, for the speech segments P ([a-a], [i-i], [M-M], [e-e], [o-o]) corresponding to the five vowels (あ, い, う, え, お), segment data Q ([a-a], [i-i], [M-M], [e-e], [o-o]) representing the speech waveform of each speech segment P. For each speech segment P [V-V] ([a-a], [i-i], [M-M], [e-e], [o-o]), the generation unit 32 copies the segment data Q[M-M] that the first library L1 stores for the speech segment P [M-M], obtained by replacing the vowels other than the specific vowel ([a], [i], [e], [o]) with the specific vowel [M], and stores the copy as the segment data Q of that speech segment P [V-V] in the second library L2. For example, for the speech segment [a-a] of the second library L2, the generation unit 32 copies the segment data Q[M-M] stored in the first library L1 for the speech segment [M-M], obtained by replacing the phoneme [a] with the specific vowel [M]. The description above uses speech segments P composed of two identical vowels as examples, but the same applies to speech segments P composed of two different vowels ([a-i], [a-M], [a-e], [a-o], [i-a], [i-M], [i-e], [i-o], [M-a], [M-i], [M-e], [M-o], [o-a], [o-i], [o-M], [o-e]): the generation unit 32 copies the segment data Q[M-M] stored in the first library L1 for the speech segment P [M-M], obtained by replacing the vowels other than the specific vowel ([a], [i], [e], [o]) with the specific vowel [M], and stores the copy as the segment data Q of that speech segment P [V-V] in the second library L2.
For the speech segment P [M-M], which consists only of the specific vowel, the segment data Q[M-M] of that speech segment in the first library L1 is copied to the second library L2.

(2) Vowel + consonant or consonant + vowel: [C,V] ([C-V], [V-C])
FIG. 3(b) illustrates the segment data Q of the speech segments P[C,V], in which a vowel and a consonant are arranged, in the second library L2. As illustrated in FIG. 3(b), for each speech segment P[C,V], the generation unit 32 takes the speech segment [C,M] obtained by replacing the vowel other than the specific vowel with the specific vowel [M], copies the segment data Q[C,M] stored for that segment in the first library L1, and stores the copy as the segment data Q of the speech segment [C,V] in the second library L2. For example, for the speech segment P[s-a] of the second library L2, the segment data Q[s-M] stored in the first library L1 for the speech segment [s-M], obtained by replacing the vowel [a] with the specific vowel [M], is copied to the second library L2.
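The replacement rule of cases (1) and (2) can be sketched as a mapping from the name of a speech segment to the name of the segment whose data is copied. This is a minimal sketch under assumptions not stated in the description: segments are identified by hyphen-joined X-SAMPA strings such as "s-a", and the names `VOWELS`, `SPECIFIC_VOWEL`, and `substitute_segment` are hypothetical.

```python
# Hypothetical naming scheme: a speech segment is "<phoneme>-<phoneme>" in X-SAMPA.
VOWELS = {"a", "i", "M", "e", "o"}   # the five vowels named in the description
SPECIFIC_VOWEL = "M"                 # the specific vowel of the first embodiment


def substitute_segment(segment: str) -> str:
    """Return the segment name with every vowel replaced by the specific vowel."""
    phonemes = segment.split("-")
    return "-".join(SPECIFIC_VOWEL if p in VOWELS else p for p in phonemes)
```

Under this scheme, "a-a" and "a-i" both map to "M-M", "s-a" maps to "s-M", and a segment without a vowel such as "Sil-k" maps to itself.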

FIGS. 3(a) and 3(b), described above, illustrate the processing performed when the segment data Q of the speech segment P that includes the specific vowel [M] exists in the first library L1. In some cases, however, the segment data Q of a particular speech segment P that includes the specific vowel [M] does not exist in the first library L1. The exception processing executed in such cases is described below.

(3) Exception processing 1
As illustrated in FIG. 3(c), the first library L1 stores the speech segments P ([w-a], [w-i], [w-e], [w-o]) and the segment data Q ([w-a], [w-i], [w-e], [w-o]) in which a consonant [w] whose acoustic characteristics (for example, the phoneme perceived by a listener) approximate those of the specific vowel [M] (hereinafter the "specific consonant") is followed by each vowel other than the specific vowel [M]. On the other hand, assume that the speech segment P[w-M], in which the specific vowel [M] follows the specific consonant [w], and its segment data Q[w-M] are not stored in the first library L1.

From the segment data Q[w-V] of a speech segment P of the first library L1 that includes the specific consonant [w] and a vowel [V] ([a] in the example of FIG. 3(c)), the generation unit 32 takes the segment data Q[w] representing the section of the specific consonant [w], and generates a second library L2 that contains this segment data Q[w] for the speech segment P[w-V]. Specifically, as illustrated in FIG. 3(c), the segment data Q[w] representing the section of the specific consonant [w] within the segment data Q[w-a] of the speech segment P[w-a] in the first library L1 is copied as the segment data Q of the speech segment P[w-V] in the second library L2.

As understood from the above description, in the first embodiment, for a speech segment P[w-V] that includes the specific consonant [w] and a vowel [V], a second library L2 is generated that contains the portion of the segment data Q of that speech segment in the first library L1 corresponding to the section of the specific consonant [w]. Therefore, even when the segment data Q of the speech segment P that includes the specific vowel [M] does not exist in the first library L1, synthesized speech with ambiguous pronunciation can be generated by using the segment data Q[w] representing the section of the specific consonant [w].
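Exception processing 1 amounts to cropping the consonant section out of a stored segment. The sketch below assumes, purely for illustration, that segment data is held as a list of waveform samples together with the index at which the second phoneme begins; the description does not specify the data layout, so `SegmentData`, `boundary`, and `consonant_section` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SegmentData:
    samples: List[float]   # waveform samples of the whole two-phoneme segment
    boundary: int          # assumed field: index where the second phoneme starts


def consonant_section(q_wv: SegmentData) -> SegmentData:
    """Keep only the leading consonant section of a [w-V] segment (Q[w])."""
    head = q_wv.samples[:q_wv.boundary]
    return SegmentData(samples=head, boundary=len(head))
```

With this layout, passing the data of [w-a] yields data that contains only the [w] section, which would then be stored for every segment P[w-V] of the second library.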

(4) Exception processing 2
As illustrated in FIG. 3(d), assume that the first library L1 stores segment data Q ([h-a], [h-e], [h-o]) for the speech segments P ([h-a], [h-e], [h-o]) that include a consonant [h] (hereinafter the "first consonant") and the vowels [a], [e], [o], but stores no segment data Q for the speech segments P ([h-i], [h-M]) that include the first consonant [h] and the remaining vowels [i], [M]. On the other hand, the first library L1 stores segment data Q ([p\-a], [p\-M], [p\-e], [p\-o]) for the speech segments P ([p\-a], [p\-M], [p\-e], [p\-o]) that include a consonant [p\] different from the first consonant [h] (hereinafter the "second consonant") and the vowels [a], [M], [e], [o].

The generation unit 32 generates a second library L2 that contains the segment data Q[p\-M] of the first library L1, which includes the second consonant [p\] and the specific vowel [M]. Specifically, as illustrated in FIG. 3(d), among the segment data Q of the first library L1 (for example, [p\-a]), the segment data Q[p\-M] of the speech segment P[p\-M], which includes the second consonant [p\] and the specific vowel [M], is copied as the segment data Q of the speech segment P[h-V] in the second library L2.

As understood from the above description, in the first embodiment, for a speech segment P that includes the first consonant [h] and a vowel [V], a second library L2 is generated that contains the segment data Q of a speech segment P including the second consonant [p\], which differs from the first consonant [h], and the specific vowel [M]. Therefore, even when segment data Q including the first consonant [h] and the specific vowel [M] does not exist in the first library L1, synthesized speech with ambiguous pronunciation can be generated by using the segment data Q of the speech segment P that includes the second consonant [p\] and the specific vowel [M].
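Exception processing 2 can be sketched as a lookup through a table of substitute consonants. The table `FALLBACK_CONSONANT` and the function `fallback_segment` are hypothetical; the description gives only the single pair [h] and [p\], and only for segments of the form [h-V].

```python
from typing import Optional, Set

SPECIFIC_VOWEL = "M"
# Assumed table: first consonant -> second consonant used in its place.
FALLBACK_CONSONANT = {"h": "p\\"}


def fallback_segment(segment: str, library: Set[str]) -> Optional[str]:
    """For a [C-V] segment whose [C-M] data is missing, return the name of the
    substitute segment [C2-M] if the library stores it, otherwise None."""
    consonant, _vowel = segment.split("-")
    second = FALLBACK_CONSONANT.get(consonant)
    if second is None:
        return None
    candidate = f"{second}-{SPECIFIC_VOWEL}"
    return candidate if candidate in library else None
```

For a library that stores [p\-M] but not [h-M], every segment [h-a], [h-e], [h-o] resolves to "p\-M" under this sketch.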

FIGS. 3(a) to 3(d) concern speech segments P that include a vowel [V]. For a speech segment P that does not include a vowel [V], such as a segment in which a consonant [C] and silence [Sil] are arranged (for example, [Sil-k] or [k-Sil]), the segment data Q corresponding to that speech segment P in the first library L1 is copied to the second library L2.

As understood from the above description, the first library L1 and the second library L2 provide segment data Q for the same set of speech segments P, while the speech waveform represented by the segment data Q provided for each speech segment P may differ between them. The second library L2 generated by the above procedure is used to generate the speech signal V in the same way as the first library L1. That is, the segment selection unit 34 sequentially selects from the second library L2 the segment data Q of the speech segments P corresponding to the phonetic characters X1, and the synthesis processing unit 36 adjusts and concatenates the selected segment data Q, thereby generating a speech signal V of speech that listeners perceive as ambiguous. In the first embodiment, either clear speech synthesis using the first library L1 or ambiguous speech synthesis using the second library L2 is executed selectively, for example in accordance with an instruction given by the user through the input device 16.

FIG. 4 illustrates speech synthesis using the first library L1 and speech synthesis using the second library L2. FIG. 4(a) shows the time series of speech segments P selected for the phonetic characters X1 "maya" specified by the synthesis information S. FIG. 4(b) shows the time series of segment data Q read from the first library L1 during speech synthesis using the first library L1, and FIG. 4(c) shows the time series of segment data Q read from the second library L2 during speech synthesis using the second library L2.

When speech synthesis using the first library L1 (clear speech synthesis) is instructed, the segment selection unit 34 sequentially selects from the first library L1 the segment data Q illustrated in FIG. 4(b) ([Sil-m], [m-a], [a-a], [a-j], [j-a], [a-a], [a-Sil]). The synthesis processing unit 36 uses the segment data Q selected by the segment selection unit 34 to generate synthesized speech in which the phonetic characters X1 "maya" are clearly perceived by the listener.

On the other hand, when speech synthesis using the second library L2 (ambiguous speech synthesis) is instructed, the segment selection unit 34 sequentially selects from the second library L2 the segment data Q of FIG. 4(c) ([Sil-m], [m-M], [M-M], [M-j], [j-M], [M-M], [M-Sil]). The synthesis processing unit 36 uses the segment data Q selected by the segment selection unit 34 to generate the synthesized speech "muyu", which the user can perceive as an ambiguous pronunciation of the phonetic characters X1 "maya".
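The two segment sequences of FIG. 4 can be reproduced with a short sketch. The expansion rule used here (frame the phonemes with silence, chain adjacent pairs, and add a sustained [V-V] segment after each transition into a vowel) is inferred from the single "maya" example and is an assumption, as are the function names.

```python
VOWELS = {"a", "i", "M", "e", "o"}
SPECIFIC_VOWEL = "M"


def to_segments(phonemes):
    """Expand a phoneme list into a two-phoneme segment sequence framed by silence."""
    chain = ["Sil", *phonemes, "Sil"]
    segments = []
    for left, right in zip(chain, chain[1:]):
        segments.append(f"{left}-{right}")
        if right in VOWELS:
            # Assumed rule: each vowel is sustained by a [V-V] segment.
            segments.append(f"{right}-{right}")
    return segments


def make_ambiguous(segments):
    """Replace every vowel in every segment name with the specific vowel."""
    return ["-".join(SPECIFIC_VOWEL if p in VOWELS else p for p in s.split("-"))
            for s in segments]


clear = to_segments(["m", "a", "j", "a"])   # phonemes of "maya"
ambiguous = make_ambiguous(clear)
```

Here `clear` holds the sequence of FIG. 4(b) and `ambiguous` the sequence of FIG. 4(c).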

FIG. 5 is a flowchart of the processing by which the generation unit 32 of the first embodiment generates the second library L2. The processing of FIG. 5 starts, for example, when the user instructs generation of the second library L2.

The generation unit 32 selects one arbitrary speech segment P (hereinafter the "target segment") (SA1) and determines whether the target segment P includes a vowel (SA2). When the target segment P does not include a vowel (SA2: NO), the generation unit 32 copies the segment data Q stored for the target segment P in the first library L1 as the segment data Q of the target segment P in the second library L2 (SA3). When the target segment P includes a vowel (SA2: YES), the generation unit 32 determines whether the first library L1 contains the segment data Q of the speech segment P obtained by replacing the vowel of the target segment P with the specific vowel [M] (SA4). When the result is affirmative (SA4: YES), the generation unit 32 acquires that segment data Q from the first library L1 and stores it as the segment data Q of the target segment P in the second library L2 (SA5). When the segment data Q of the speech segment P obtained by replacing the vowel of the target segment P with the specific vowel [M] does not exist in the first library L1 (SA4: NO), the generation unit 32 prepares the segment data Q of the target segment P in the second library L2 by executing exception processing 1 or exception processing 2 described above (SA6). The above processing is repeated in sequence until all types of speech segments P have been processed (SA7: NO), so that a second library L2 containing segment data Q for each speech segment P is generated.
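The flow of FIG. 5 can be sketched as a single loop over the first library. The sketch assumes a library is a mapping from hyphen-joined segment names to opaque segment data, and leaves exception processing 1 and 2 to a caller-supplied function; `generate_second_library` and `handle_exception` are hypothetical names.

```python
from typing import Callable, Dict

VOWELS = {"a", "i", "M", "e", "o"}
SPECIFIC_VOWEL = "M"


def generate_second_library(
    first: Dict[str, object],
    handle_exception: Callable[[str, Dict[str, object]], object],
) -> Dict[str, object]:
    """Build the second library from the first, following steps SA1 to SA7."""
    second: Dict[str, object] = {}
    for target in first:                                     # SA1, repeated via SA7
        phonemes = target.split("-")
        if not any(p in VOWELS for p in phonemes):           # SA2: NO
            second[target] = first[target]                   # SA3: copy unchanged
            continue
        replaced = "-".join(SPECIFIC_VOWEL if p in VOWELS else p for p in phonemes)
        if replaced in first:                                # SA4: YES
            second[target] = first[replaced]                 # SA5
        else:                                                # SA4: NO
            second[target] = handle_exception(target, first) # SA6
    return second
```

The resulting mapping has the same keys as the first library, which matches the statement that both libraries cover the same set of speech segments.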

As described above, in the first embodiment, for each speech segment P of the first library L1 that includes a vowel other than the specific vowel [M] ([a], [i], [e], [o]), the segment data Q of the speech segment P obtained by replacing that vowel with the specific vowel [M] is copied to the second library L2. That is, the second library L2 used for synthesized speech with ambiguous pronunciation is generated from the segment data Q of the speech segments P of the existing first library L1. Therefore, compared with generating the first library L1 for clear speech and the second library L2 for ambiguous speech separately from actual recordings, synthesized speech that listeners perceive as ambiguous can be generated while reducing the burden of creating libraries for speech synthesis.

<Second Embodiment>
In the first embodiment, the second library L2 is generated in advance from the existing first library L1 and used for speech synthesis. In the second embodiment, when the speech signal V corresponding to the synthesis information S is generated, the segment selection unit 34 dynamically selects segment data Q from an existing library L under the same rules by which the second library L2 is generated from the first library L1 in the first embodiment, so that a speech signal V of ambiguous speech is generated without generating the second library L2. In each of the aspects described below, elements whose operation and function are the same as in the first embodiment are given the reference signs used in the description of the first embodiment, and their detailed description is omitted as appropriate.

FIG. 6 is a block diagram of the speech synthesizer 100 of the second embodiment. In the second embodiment, the generation unit 32 and the second library L2 of the first embodiment are omitted. The library L of FIG. 6 corresponds to the first library L1 of the first embodiment and contains segment data Q of clear speech for each speech segment P. When the user instructs generation of synthesized speech with clear pronunciation, the segment selection unit 34 of the second embodiment sequentially selects from the library L the segment data Q of each speech segment P corresponding to the phonetic characters X1, as in the first embodiment (the example of FIG. 4(b)). When the user instructs generation of synthesized speech with ambiguous pronunciation, the segment selection unit 34 sequentially selects from the library L the segment data Q illustrated in FIG. 4(c), following substantially the same rules as the replacement processing described above with reference to FIGS. 3(a) to 3(d). The segment selection performed by the segment selection unit 34 is described in detail below.

(1) Vowel + vowel: [V-V]
When the speech segment P corresponding to the phonetic characters X1 includes a vowel other than the specific vowel [M] ([a], [i], [e], [o]), the segment selection unit 34 selects from the library L the segment data Q of the speech segment P in which that vowel is replaced with the specific vowel [M].

(2) Vowel + consonant or consonant + vowel: [C,V] ([V-C], [C-V])
When the speech segment P corresponding to the phonetic characters X1 includes a consonant and a vowel other than the specific vowel [M] ([a], [i], [e], [o]), the segment selection unit 34 selects from the library L the segment data Q of the speech segment P in which the vowel other than the specific vowel [M] is replaced with the specific vowel [M].

(3) Exception processing 1
When the speech segment P corresponding to the phonetic characters X1 is a speech segment P[w-V] that includes the specific consonant [w] and a vowel [V] other than the specific vowel [M], the segment selection unit 34 selects from the library L the segment data Q[w-V] of that speech segment P[w-V] and generates, from this segment data Q[w-V], the segment data Q[w] representing the section of the specific consonant [w].

(4) Exception processing 2
When the speech segment P corresponding to the phonetic characters X1 is a speech segment P[h-V] that includes the first consonant [h] and a vowel [V] other than the specific vowel [M], the segment selection unit 34 selects from the library L the segment data Q of the speech segment P that includes the second consonant [p\], which differs from the first consonant [h], and the specific vowel [M].

The synthesis processing unit 36 generates synthesized speech using the segment data Q of the speech segments P selected by the segment selection unit 34. Because the segment data Q is selected under the same rules as in the first embodiment, the second embodiment can, like the first embodiment, generate both synthesized speech with clear pronunciation and synthesized speech with ambiguous pronunciation.

FIG. 7 is a flowchart of the processing by which the segment selection unit 34 of the second embodiment selects segment data Q. For example, when the user instructs the start of speech synthesis, the processing of FIG. 7 is executed in sequence for each speech segment P (target segment) corresponding to the phonetic characters X1 specified by the synthesis information S.

The segment selection unit 34 determines whether the target segment P includes a vowel other than the specific vowel [M] (SB1). When the target segment P does not include a vowel other than the specific vowel [M] (SB1: NO), the segment selection unit 34 selects the segment data Q of the target segment P from the library L (SB2). When the target segment P includes a vowel other than the specific vowel [M] (SB1: YES), the segment selection unit 34 determines whether the library L contains the segment data Q of the speech segment P in which that vowel is replaced with the specific vowel [M] (SB3). When the result is affirmative (SB3: YES), the segment selection unit 34 selects from the library L the segment data Q of the speech segment P in which the vowel of the target segment P is replaced with the specific vowel [M] (SB4). When the corresponding segment data Q does not exist in the library L (SB3: NO), the segment selection unit 34 prepares the segment data Q by exception processing 1 or exception processing 2 described above (SB5). The subsequent processing is the same as in the first embodiment, and its detailed description is omitted.

As understood from the above description, in the second embodiment, the segment selection unit 34 selects, for a speech segment P that does not include a vowel other than the specific vowel [M] ([a], [i], [e], [o]), the segment data Q of that speech segment P, and selects, for a speech segment P that includes such a vowel, the segment data Q of the speech segment P in which that vowel is replaced with the specific vowel [M]. That is, there is no need to prepare separate libraries for clear speech and ambiguous speech. Therefore, as in the first embodiment, synthesized speech with ambiguous pronunciation can be generated while reducing the burden of creating libraries for speech synthesis. Moreover, because the second embodiment does not need to generate or store the second library L2, it has the further advantage of requiring less capacity in the storage device 12 than the first embodiment, in which both the first library L1 and the second library L2 must be stored in the storage device 12.

<Third Embodiment>
The second embodiment selectively executes two operations: generating a speech signal V of clear speech by concatenating the segment data Q of the speech segments P corresponding to the phonetic characters X1, and generating a speech signal V of ambiguous speech by concatenating the segment data Q of the speech segments obtained by replacing the vowels of the speech segments corresponding to the phonetic characters X1 with the specific vowel [M]. In the third embodiment, for each speech segment P corresponding to the phonetic characters X1, the former segment data Q (hereinafter the "first segment data Q1") and the latter segment data Q (hereinafter the "second segment data Q2") are mixed.

FIG. 8 is a block diagram of the speech synthesizer 100 of the third embodiment. In the third embodiment, a variable setting unit 37 and a segment mixing unit 38 are added to the configuration of the second embodiment. The segment selection unit 34 of the third embodiment selects from the library L first segment data Q1 and second segment data Q2 for each speech segment P corresponding to the phonetic characters X1 specified by the synthesis information S. As described above, the first segment data Q1 is the segment data Q stored in the library L for each speech segment P corresponding to the phonetic characters X1, and the second segment data Q2 is segment data Q obtained by the same processing as in FIG. 7, for example the segment data Q of the speech segment P in which the vowel of the speech segment P corresponding to the phonetic characters X1 is replaced with the specific vowel [M]. In other words, the first segment data Q1 represents the speech waveform of the speech segment P pronounced clearly, and the second segment data Q2 represents the speech waveform of the speech segment P pronounced ambiguously.

The segment mixing unit 38 generates segment data Q by mixing, in accordance with a mixing ratio K, the first segment data Q1 and the second segment data Q2 that the segment selection unit 34 selects for one speech segment P. Specifically, as expressed by formula (1), the segment mixing unit 38 generates, as the segment data Q, a weighted sum of the first segment data Q1 and the second segment data Q2 to which weights corresponding to the mixing ratio K are applied.

The symbol α in formula (1) is a predetermined constant. As understood from formula (1), the larger the mixing ratio K, the greater the proportion of the second segment data Q2 in the segment data Q, and consequently the more ambiguous the phonemes of the speech of the speech signal V sound. The synthesis processing unit 36 generates the speech signal V by adjusting and concatenating the segment data Q mixed by the segment mixing unit 38 in the same manner as in the first embodiment.
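The exact form of formula (1) is not given in this text. One weighted sum that is consistent with the surrounding description (a constant α, and a share of Q2 that grows with K) is Q = (α − K)·Q1 + K·Q2, and the sketch below uses that form as an assumption. It further assumes that both segment data are equal-length lists of samples, which the description does not state; `mix_segments` is a hypothetical name.

```python
from typing import List, Sequence


def mix_segments(q1: Sequence[float], q2: Sequence[float],
                 k: float, alpha: float = 1.0) -> List[float]:
    """Weighted sum of Q1 and Q2 under the assumed form
    Q = (alpha - K) * Q1 + K * Q2 (not confirmed by the source)."""
    if len(q1) != len(q2):
        raise ValueError("Q1 and Q2 must have the same length")
    return [(alpha - k) * a + k * b for a, b in zip(q1, q2)]
```

With α = 1, K = 0 returns Q1 unchanged and K = 1 returns Q2, so intermediate values blend the clear and ambiguous waveforms.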

The variable setting unit 37 of FIG. 8 sets the mixing ratio K between the first segment data Q1 and the second segment data Q2 in accordance with a feature quantity of the speech to be synthesized. Here, assume the tendency that the higher the register of a voice, the more ambiguous the phonemes perceived by the listener become. In view of this tendency, the variable setting unit 37 of the third embodiment uses the pitch X3 specified for each note to be synthesized as the feature quantity and sets the mixing ratio K in accordance with that pitch X3. Specifically, the mixing ratio K is controlled so that it increases (that is, the speech of the speech signal V becomes more ambiguous) as the pitch X3 of the note becomes higher.

FIG. 9 illustrates how the variable setting unit 37 sets the mixing ratio K. The note sequence image 42 of the editing screen 40 illustrated in FIG. 9 shows a case in which the phonetic characters X1 "asayakeno" specified by the user are assigned to five notes. In the third embodiment, the mixing ratio K between the first segment data Q1 and the second segment data Q2 is set for each note within a specific section t on the time axis. The section t is the section for which the mixing ratio K is set, and is an arbitrary section such as a section selected by the user through an operation on the input device 16 or the entire span of the music to be synthesized.

The symbol R in FIG. 9 denotes the difference between the maximum value NH and the minimum value NL of the pitch X3 within the section t, that is, the distribution width of the pitch X3 within the section t (R = NH - NL). The variable setting unit 37 calculates, for each note in the section t, the mixing ratio K corresponding to the pitch X3 of the note by the calculation of Equation (2) below.

$$K = \beta \cdot \frac{X3 - NL}{R} \qquad (2)$$

The symbol β in Equation (2) is a predetermined constant (typically a positive number). As understood from Equation (2), a note whose pitch X3 equals the minimum value NL in the section t (X3 = NL) has the minimum mixing ratio K = 0, and a note whose pitch X3 equals the maximum value NH in the section t (X3 = NH) has the maximum mixing ratio K = β. That is, the mixing ratio K varies with the pitch X3 within the range from the minimum value 0 to the maximum value β, increasing as the pitch X3 becomes higher. As illustrated in FIG. 9, the mixing ratio K of each note is displayed as a graph in the variable area 44 of the editing screen 40.
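The per-note calculation described above can be sketched as follows. This is a minimal illustration, not the implementation of the variable setting unit 37: it assumes the linear form K = β · (X3 - NL) / R, which matches the stated boundary values (K = 0 at X3 = NL and K = β at X3 = NH). The function name `mixing_ratios`, the use of MIDI note numbers for the pitch X3, and the handling of a section in which all notes share one pitch (R = 0) are assumptions made for the example.

```python
from typing import List, Sequence


def mixing_ratios(pitches: Sequence[float], beta: float = 1.0) -> List[float]:
    """Return one mixing ratio K per note for the notes in section t.

    pitches: pitch X3 of each note in the section (here: MIDI note numbers).
    beta:    the constant β, i.e. the largest K assigned within the section.
    """
    if not pitches:
        return []
    n_low, n_high = min(pitches), max(pitches)  # NL and NH
    r = n_high - n_low                          # R = NH - NL
    if r == 0:
        # Assumption: the text does not cover R = 0, so fall back to K = 0.
        return [0.0 for _ in pitches]
    return [beta * (x3 - n_low) / r for x3 in pitches]


# Five notes, as in the "asayakeno" example; the pitch values are made up.
pitches = [60, 64, 67, 64, 62]
ratios = mixing_ratios(pitches, beta=0.8)
```

With these illustrative pitches, the lowest note (60) receives K = 0 and the highest note (67) receives K = β, with the other notes falling in between in proportion to their pitch.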

As understood from the above description, in the third embodiment, for each speech segment P corresponding to the phonetic characters X1, the first segment data Q1 of that speech segment in the library L is mixed with the second segment data Q2, also taken from the library L, of the speech segment P whose vowel has been replaced with the specific vowel [M]. Therefore, as in the first and second embodiments, synthesized speech with ambiguous pronunciation can be generated while reducing the burden of creating a library for speech synthesis.

As a configuration for mixing the first segment data Q1 and the second segment data Q2, a configuration that performs the mixing at a mixing ratio K given by an instruction from the user (hereinafter referred to as the "comparative example") could also be adopted. In the comparative example, however, the burden on the user who must specify the mixing ratio K can be large. In the third embodiment, the mixing ratio K is set variably according to a feature amount of the speech to be synthesized (specifically, the pitch X3 of each note), so the burden on the user is reduced compared with the comparative example. In the third embodiment in particular, the mixing ratio K is set so that the proportion of the second segment data Q2 increases as the pitch X3 of each note becomes higher. This has the advantage that natural synthesized speech can be generated that reproduces the tendency of phonemes to become more ambiguous in higher registers.

<Fourth embodiment>
The fourth embodiment presupposes the configuration of the first embodiment, in which the second library L2 is generated in advance from the first library L1. The first segment data Q1 selected from the first library L1 and the second segment data Q2 of the speech segment P selected from the second library L2 are mixed in the same manner as in the third embodiment to generate synthesized speech.

FIG. 10 is a block diagram of the speech synthesizer 100 of the fourth embodiment. The generation unit 32 in FIG. 10 generates the second library L2 from the existing first library L1 by the same method as in the first embodiment. Therefore, the fourth embodiment achieves the same effects as the first embodiment. The segment selection unit 34, for each speech segment P corresponding to the phonetic characters X1 to be synthesized, sequentially selects the first segment data Q1 from the first library L1 and the second segment data Q2 from the second library L2.

As illustrated in FIG. 10, the fourth embodiment adds a variable setting unit 37 and a segment mixing unit 38, both similar to those of the third embodiment, to the speech synthesizer 100 of the first embodiment. As in the third embodiment, the variable setting unit 37 sets the mixing ratio K for each note according to the pitch X3, which is the feature amount of each note. Specifically, the mixing ratio K is calculated, for example by Equation (2), so that it increases as the pitch X3 becomes higher. As in the third embodiment, the segment mixing unit 38 generates segment data Q by mixing the first segment data Q1 and the second segment data Q2 selected by the segment selection unit 34 according to the mixing ratio K set by the variable setting unit 37. Therefore, as in the third embodiment, the fourth embodiment reduces the burden on the user compared with the comparative example, in which the user specifies the mixing ratio K.
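The selection and mixing flow of the fourth embodiment can be sketched as follows. This is only an illustration under stated assumptions: segment data is represented by a hypothetical list of floats (for example, one spectral-envelope frame), each library is a plain dictionary keyed by a segment name, and the mixing is taken to be a weighted sum in which K is the weight of Q2. The text states only that a larger K gives a larger proportion of Q2, so the weighted sum and the restriction of K to [0, 1] are assumptions, as are all names in the code.

```python
from typing import Dict, List, Sequence

Frame = List[float]  # hypothetical stand-in for segment data Q


def mix_segment_data(q1: Frame, q2: Frame, k: float) -> Frame:
    """Blend Q1 and Q2; k = 0 yields Q1 unchanged and k = 1 yields Q2."""
    if len(q1) != len(q2):
        raise ValueError("Q1 and Q2 must have the same length")
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie in [0, 1] for this weighted sum")
    return [(1.0 - k) * a + k * b for a, b in zip(q1, q2)]


def mixed_sequence(
    segments: Sequence[str],
    ratios: Sequence[float],
    lib1: Dict[str, Frame],
    lib2: Dict[str, Frame],
) -> List[Frame]:
    """For each speech segment P, pick Q1 from L1 and Q2 from L2, then mix."""
    return [
        mix_segment_data(lib1[p], lib2[p], k)
        for p, k in zip(segments, ratios)
    ]


# Toy libraries; the keys and values are invented for the example.
lib1 = {"s-a": [1.0, 3.0], "a-k": [2.0, 4.0]}
lib2 = {"s-a": [3.0, 1.0], "a-k": [4.0, 2.0]}
mixed = mixed_sequence(["s-a", "a-k"], [0.0, 0.5], lib1, lib2)
```

In this sketch the first segment is taken entirely from the first library (K = 0), while the second is an even blend of the two libraries (K = 0.5).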

<Modifications>
Each of the above embodiments can be modified in various ways. Specific modifications are exemplified below. Two or more modes arbitrarily selected from the following examples may be combined as appropriate.

(1) In a configuration that sets the mixing ratio K between the first segment data Q1 and the second segment data Q2 for each note, as in the third embodiment, the mixing ratio K can change stepwise (discontinuously) from note to note. A configuration that interpolates the mixing ratios K of the notes so that K changes continuously over time is therefore also suitable. For example, as illustrated in FIG. 11, the mixing ratio K may be linearly interpolated so that, at the midpoint of the sounding period X2 of each note, it takes the value corresponding to the pitch X3 of that note (the value calculated by Equation (2)), thereby changing the mixing ratio K continuously across a plurality of notes.
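The midpoint interpolation of FIG. 11 can be sketched as follows. The representation of a note as a `(start, end, k)` tuple, the function name `interpolate_k`, and the choice to hold K constant before the first midpoint and after the last one are assumptions made for the example; the text does not specify the behavior outside the midpoints.

```python
from bisect import bisect_right
from typing import List, Tuple

Note = Tuple[float, float, float]  # (start time, end time, per-note K)


def interpolate_k(notes: List[Note], t: float) -> float:
    """Return K at time t, linearly interpolated between note midpoints.

    notes must be sorted by time. Each note's K applies exactly at the
    midpoint of its sounding period X2.
    """
    mids = [(start + end) / 2.0 for start, end, _ in notes]
    ks = [k for _, _, k in notes]
    if t <= mids[0]:
        return ks[0]   # assumption: hold the first value
    if t >= mids[-1]:
        return ks[-1]  # assumption: hold the last value
    i = bisect_right(mids, t)  # mids[i - 1] <= t < mids[i]
    w = (t - mids[i - 1]) / (mids[i] - mids[i - 1])
    return ks[i - 1] + w * (ks[i] - ks[i - 1])


notes = [(0.0, 1.0, 0.0), (1.0, 2.0, 1.0), (2.0, 4.0, 0.5)]
k_at_boundary = interpolate_k(notes, 1.0)  # halfway between two midpoints
```

At the boundary between the first two notes, the interpolated value lies halfway between their per-note ratios, so the step that would otherwise occur at the note boundary is removed.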

Alternatively, as illustrated in FIG. 12, the mixing ratio K corresponding to the pitch X3 of a note may be maintained during a part of the sounding period X2 of the note (hereinafter referred to as the "steady period") σ, and the mixing ratio K may be interpolated (linearly, in the example of FIG. 12) between the end point of the steady period σ in each sounding period X2 and the start point of the steady period σ in the immediately following sounding period X2. The steady period σ is, for example, the period that starts when a predetermined length (for example, 1/4 of the duration) has elapsed from the start point of the sounding period X2 and ends at the point the predetermined length before the end point of the sounding period X2. Although FIG. 11 and FIG. 12 illustrate linear interpolation, the specific interpolation method is arbitrary; for example, a known curve interpolation may be used.
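The steady-period variant of FIG. 12 can be sketched in the same way. Here the steady period σ of each note is obtained by trimming a fraction `margin` of the note's duration from both ends (1/4 in the example given in the text). The `(start, end, k)` tuple, the function name, and the choice to hold K before the first steady period and after the last one are assumptions made for the example.

```python
from typing import List, Tuple

Note = Tuple[float, float, float]  # (start time, end time, per-note K)


def k_with_steady_periods(notes: List[Note], t: float, margin: float = 0.25) -> float:
    """Return K at time t: constant inside each steady period, linear between.

    notes must be sorted by time and must not overlap; margin must be
    smaller than 0.5 so that every steady period has positive length.
    """
    if not 0.0 <= margin < 0.5:
        raise ValueError("margin must be in [0, 0.5)")
    # Steady period of each note: (sigma start, sigma end, K).
    spans = []
    for start, end, k in notes:
        d = (end - start) * margin
        spans.append((start + d, end - d, k))
    if t <= spans[0][0]:
        return spans[0][2]  # assumption: hold K before the first steady period
    for (_, b0, k0), (a1, _, k1) in zip(spans, spans[1:]):
        if t <= b0:
            return k0       # inside the steady period of the earlier note
        if t < a1:
            w = (t - b0) / (a1 - b0)
            return k0 + w * (k1 - k0)  # transition between steady periods
    return spans[-1][2]     # inside or after the last steady period


notes = [(0.0, 4.0, 0.0), (4.0, 8.0, 1.0)]
k_mid_transition = k_with_steady_periods(notes, 4.0)
```

Compared with midpoint interpolation, each note keeps its own ratio for the middle half of its duration, and the change is confined to the region around the note boundary.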

(2) In the third embodiment, the maximum value β of the mixing ratio K is a predetermined constant, but the maximum value β may instead be set variably according to an instruction given by the user through the input device 16. In addition, although the third embodiment illustrates a configuration in which the mixing ratio K increases as the pitch X3 becomes higher, the relationship between the pitch X3 and the mixing ratio K is not limited to this example. For example, the mixing ratio K may be calculated so that it decreases (the synthesized speech becomes clearer) as the pitch X3 becomes higher.

(3) The feature amount used to calculate the mixing ratio K is not limited to the pitch X3 exemplified in the third embodiment. For example, a configuration that sets the mixing ratio K according to the volume of each note may also be adopted. Assuming a tendency in which the phonemes perceived by a listener become more ambiguous as the volume decreases, a configuration in which the variable setting unit 37 sets the mixing ratio K so that it increases as the volume decreases is suitable.
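Modifications (2) and (3) can be illustrated together by a mapping that accepts any per-note feature amount and a direction flag. This is a sketch only: the text gives no formula for the volume-based case, so normalizing the feature over the section in the same way as the pitch is an assumption, as are the function name `ratio_from_feature` and the volume values.

```python
from typing import List, Sequence


def ratio_from_feature(
    values: Sequence[float], beta: float = 1.0, increasing: bool = True
) -> List[float]:
    """Map a per-note feature amount to K in [0, beta].

    increasing=True : larger feature value gives larger K (e.g. pitch).
    increasing=False: smaller feature value gives larger K (e.g. volume).
    """
    if not values:
        return []
    low, high = min(values), max(values)
    span = high - low
    if span == 0:
        # Assumption: with no variation in the section, fall back to K = 0.
        return [0.0 for _ in values]
    ratios = []
    for v in values:
        u = (v - low) / span  # position of the note within the section range
        ratios.append(beta * (u if increasing else 1.0 - u))
    return ratios


# Quieter notes receive a larger K; the volume values are made up.
volumes = [40.0, 80.0, 60.0]
k_by_volume = ratio_from_feature(volumes, beta=1.0, increasing=False)
```

Passing `increasing=False` with the pitch as the feature gives the reversed relationship mentioned in modification (2), and passing a user-specified `beta` corresponds to setting the maximum value β by instruction.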

(4) For the configurations of the third and fourth embodiments, which set the mixing ratio K according to a feature amount of the speech to be synthesized, neither the configuration of the first embodiment, in which the generation unit 32 generates the second library L2 from the first library L1, nor the configuration of the second embodiment, in which the segment selection unit 34 selects the segment data of a speech segment whose vowel has been replaced with the specific vowel [M], is essential. For example, in a configuration in which the first library L1 and the second library L2 are prepared in advance from recordings of different voice qualities, the first segment data Q1 selected from the first library L1 and the second segment data Q2 selected from the second library L2 may be mixed at a mixing ratio K that depends on, for example, the pitch X3.

(5) The first embodiment illustrates the speech synthesizer 100, which generates the voice signal V using the second library L2 generated by the generation unit 32. The present invention can also be realized as a device that generates the second library L2 from the existing first library L1 by the method exemplified in the first embodiment (a speech synthesis library generation device). In the speech synthesis library generation device, it does not matter whether the functions for speech synthesis (the segment selection unit 34 and the synthesis processing unit 36) are present.
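The library generation performed by such a device can be sketched as follows, based on the summary that the second library L2 holds, for each speech segment P containing a vowel other than the specific vowel, the segment data Q that the first library L1 stores for the segment with that vowel replaced by the specific vowel. The representation of a speech segment as a tuple of phoneme symbols, the vowel set, the function name, and the decision to skip segments that need no replacement or whose replaced form is absent from L1 are all assumptions made for the example.

```python
from typing import Dict, Tuple

Segment = Tuple[str, ...]  # hypothetical: a speech segment P as phoneme symbols

# Assumed vowel inventory; "M" stands for the specific vowel [M].
VOWELS = {"a", "i", "M", "e", "o"}


def generate_second_library(
    lib1: Dict[Segment, str], specific_vowel: str = "M"
) -> Dict[Segment, str]:
    """Build L2 from L1 by pointing each segment at its vowel-replaced data."""
    lib2: Dict[Segment, str] = {}
    for segment in lib1:
        replaced = tuple(specific_vowel if p in VOWELS else p for p in segment)
        if replaced == segment:
            continue  # no vowel other than the specific vowel: nothing to do
        if replaced in lib1:
            # L2 stores, under the original segment, the data of the
            # vowel-replaced segment taken from L1.
            lib2[segment] = lib1[replaced]
    return lib2


# Toy library; the strings stand in for real segment data Q.
lib1 = {
    ("s", "a"): "Q[s-a]",
    ("s", "M"): "Q[s-M]",
    ("a", "k"): "Q[a-k]",
    ("M", "k"): "Q[M-k]",
}
lib2 = generate_second_library(lib1)
```

Because the second library is derived purely by lookup in the first, no additional recording is needed to obtain it, which is the source of the reduced burden described in the text.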

DESCRIPTION OF SYMBOLS: 100 ... speech synthesizer, 10 ... arithmetic processing unit, 12 ... storage device, 14 ... display device, 16 ... input device, 18 ... sound emitting device, 32 ... generation unit, 33 ... display control unit, 34 ... segment selection unit, 35 ... instruction receiving unit, 36 ... synthesis processing unit, 37 ... variable setting unit, 38 ... segment mixing unit, 40 ... editing screen, 42 ... note sequence image, 44 ... variable area, 54 ... note image.

Claims

1. A speech synthesis library generation device comprising a generation unit that generates a second library from a first library for speech synthesis, the first library including segment data for each speech segment, wherein, for a speech segment that contains a vowel other than a specific vowel among a plurality of speech segments, the generation unit generates the second library so as to include the segment data, in the first library, of the speech segment obtained by replacing the vowel with the specific vowel.

2. A speech synthesizer comprising:
a generation unit that generates a second library from a first library for speech synthesis, the first library including segment data for each speech segment, wherein, for a speech segment that contains a vowel other than a specific vowel among a plurality of speech segments, the generation unit generates the second library so as to include the segment data, in the first library, of the speech segment obtained by replacing the vowel with the specific vowel;
a segment selection unit that selects first segment data, in the first library, of a speech segment corresponding to a phonetic character to be synthesized, and second segment data of that speech segment in the second library;
a segment mixing unit that mixes the first segment data and the second segment data; and
a synthesis processing unit that generates synthesized speech using the segment data after mixing by the segment mixing unit.

3. A speech synthesizer comprising:
a segment selection unit that selects segment data of a speech segment corresponding to a phonetic character to be synthesized from a speech synthesis library including segment data for each speech segment, and that, for a speech segment containing a vowel other than a specific vowel, selects the segment data of the speech segment in which the vowel is replaced with the specific vowel; and
a synthesis processing unit that generates synthesized speech using the segment data selected by the segment selection unit.

4. The speech synthesizer according to claim 3, wherein
the segment selection unit selects, for a speech segment containing a vowel other than the specific vowel, first segment data of that speech segment and second segment data of the speech segment obtained by replacing the vowel with the specific vowel,
the speech synthesizer further comprises a segment mixing unit that mixes the first segment data and the second segment data, and
the synthesis processing unit generates synthesized speech using the segment data after mixing by the segment mixing unit.

5. The speech synthesizer according to claim 2, further comprising a variable setting unit that sets a mixing ratio according to a feature amount of the speech to be synthesized, wherein
the segment mixing unit mixes the first segment data and the second segment data at the mixing ratio set by the variable setting unit.