JP2008158029A

JP2008158029A - Distributed data base system for voice synthesis

Info

Publication number: JP2008158029A
Application number: JP2006343966A
Authority: JP
Inventors: Tsutomu Kaneyasu; 勉兼安
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2006-12-21
Filing date: 2006-12-21
Publication date: 2008-07-10

Abstract

PROBLEM TO BE SOLVED: To obtain a distributed database system for speech synthesis that can handle the databases required for speech synthesis in an integrated manner when those databases are distributed.

SOLUTION: The system includes one or more voice servers that store the data required to synthesize speech characterized by a speaker and a speaking tone, and a management server 10 that holds a management index 12 of the data stored in each voice server. Within the management index 12, the management server 10 holds an audio file feature table 12b that expresses, by keywords describing their characteristics, the speaker and speaking tone of the audio file from which the data identified by the management index 12 was generated. When a request that designates, by keyword, the data required for speech synthesis is received, the management index 12 is searched and information indicating the location of the corresponding data is returned.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a distributed database system for speech synthesis that stores the data necessary for speech synthesis in a distributed manner.

Conventionally, corpus-based speech synthesis is one known method of performing speech synthesis (Non-Patent Document 1).
In addition, Patent Document 1 describes a technique whose stated aim is "to enable optimum unit-selection speech synthesis to be performed on a terminal device with relatively little computing power, in text-to-speech synthesis technology that synthesizes speech from text." That document states: "In text-to-speech synthesis that synthesizes speech from text, with regard to content generation and output, the result of the unit selection process is output as secondary content, so that the processing can be separated into the heavily loaded unit selection process and the lightly loaded speech waveform synthesis process. The unit selection process is thereby performed on the server side, and information on the units used is transmitted to the terminal as data for synthesis."
Patent Document 1: JP 2006-18133 A (abstract)
Non-Patent Document 1: IEICE, IEICE Technical Report, SP2005-18, pp. 37-41 (2005.5)

In speech synthesizers such as those described above, the data necessary for speech synthesis is usually stored in the device that executes the synthesis process, or in a storage device directly connected to the speech synthesizer.
However, owing to various constraints, administrative convenience, and other circumstances, the data necessary for speech synthesis is not always consolidated in one place; it may be stored in a distributed manner across multiple storage devices or servers. Moreover, the distributed data may keep growing in volume at each storage location and develop into independent databases.
The prior art described in Patent Document 1 distributes the unit selection process and the waveform synthesis process functionally, but it does not address distribution of the data. Needless to say, conventional speech synthesizers do not take such distributed data into account either.
For this reason, there has been a demand for a distributed database system for speech synthesis that can handle the databases necessary for speech synthesis in an integrated manner when those databases are distributed.

A distributed database system for speech synthesis according to the present invention comprises:
one or more voice servers that store data necessary for synthesizing speech characterized by a speaker and a tone; and
a management server that holds an index of the data stored in each of the voice servers,
wherein the management server
further holds a list that expresses, by keywords representing their characteristics, the speaker and tone of the audio file from which the data identified by the index was generated, and,
upon receiving a request that specifies by the keywords the data necessary for speech synthesis,
searches the index and the list and returns information indicating the location of the corresponding data.

According to the distributed database system for speech synthesis of the present invention, the management server manages the location of each item of voice data and the like stored in the voice servers. Therefore, even when the voice data and the like are distributed, they can be freely searched or combined to generate new voice data and the like.

Embodiment 1.
FIG. 1 shows a schematic configuration of a distributed database system for speech synthesis according to Embodiment 1 of the present invention.
The distributed database system for speech synthesis according to Embodiment 1 has voice servers 20, 21, ... (the ellipsis is omitted hereinafter) that store the data necessary for speech synthesis, and a management server 10 that holds an index of the data stored in each of these voice servers.
The management server 10 and the voice servers 20 and 21 are connected via a network 30.
The management server 10 and a client device 1 are connected via a network 40. The client device 1 issues search requests and generation requests to the management server 10. Details of each request are described later with reference to FIGS. 5 to 7.

The voice servers 20, 21, and so on store the data necessary for speech synthesis using the corpus-based speech synthesis method. This data includes prosodic model data, acoustic model data, and audio files. Each is described below.
(1) Prosodic model data
Data obtained by statistically modeling, in units of predetermined labels, prosodic features of a speaker's phrasing, such as pitch and duration.
(2) Acoustic model data
Data obtained by statistically modeling, in units of predetermined labels, the acoustic features of a speaker's voice that arise from factors such as the shape of the speaker's vocal tract.
(3) Audio file
A file that stores speech actually uttered by a speaker; one audio file is formed per predetermined label unit. Alternatively, it may be an audio file of a sentence read aloud rather than a label unit.
(4) Parameters
Depending on the voice server implementation, various precomputed parameters related to the audio file may be stored together with it. These parameters are used when speech synthesis is performed using the audio file of (3).
(5) Unit selection data
Data required for selecting unit waveforms in the corpus-based speech synthesis method. It stores various precomputed parameters related to the audio files.

The "statistical modeling" referred to above means creating the statistical models needed at speech synthesis time by a technique such as a hidden Markov model. The modeling technique is not limited to this, however; it suffices that the data is statistically modeled by a uniform technique.

A single voice server may store all of the data types (1) to (5), or only some of them. The data may be organized as individual data files, or as a database made up of multiple data items.
In Embodiment 1, the following description assumes that each type of data is organized as a database. The term "voice database" by itself refers generically to (1) to (5) above.
In Embodiment 1, the "prosodic model database" and the "acoustic model database" refer to (1) and (2) above, and the "audio file database" refers to (3) and (5) above. For simplicity, no particular distinction is made between (3) audio files and (4) parameters, and the term "audio file" refers to both.

The management server 10 has a speaker search unit 11, a management index 12, and a DB generation unit 14.
The management index 12 holds, in the tables described later with reference to FIGS. 2 to 4, an index of the data stored in the voice servers 20, 21, and so on. By referring to this index, the location of the data stored in each voice server can be determined.
The management index 12 can be implemented as, for example, data files stored in a storage device such as an HDD (hard disk drive).
The speaker search unit 11 accepts, via the network 40, requests to search for the location of data necessary for speech synthesis, searches the tables held in the management index 12, and returns the search result to the requester. Details are described later with reference to FIGS. 2 to 5.
The DB generation unit 14 accepts, via the network 40, requests to generate prosodic model data, acoustic model data, or unit selection data from the audio files stored in the voice servers 20, 21, and so on, and executes the generation process. After generation is complete, it stores an index entry indicating the location of the generated data in the tables held by the management index 12.

The client device 1 has a speaker selection unit 2 and a DB generation request unit 3.
The speaker selection unit 2 issues, via the network 40, a request to the speaker search unit 11 to search for the location of data necessary for speech synthesis.
The DB generation request unit 3 issues, via the network 40, a request to the DB generation unit 14 to generate prosodic model data, acoustic model data, or unit selection data from the audio files stored in the voice servers 20, 21, and so on.

Next, each table stored in the management index 12 is described.

FIG. 2 shows the structure of the database location table 12a stored in the management index 12, together with example data.
The database location table 12a has a "database number" column, a "server" column, a "DB type" column, and a "database name" column.
The "database number" column stores a unique number identifying a voice database stored in the voice servers 20, 21, and so on. The values in this column are assigned uniquely across the distributed database system for speech synthesis.
The "server" column stores the address of the server that holds the voice database identified by the "database number" value. The value may take any form, such as an IP address or a server name, as long as it identifies the server's location.
The "DB type" column stores information indicating the type of the voice database identified by the "database number" value: one of (1) prosodic model data through (3) audio file above, or (5) unit selection data.
The "database name" column stores the name of the voice database identified by the "database number" value. The values in this column need not be unique.

Next, the meaning of the example data in FIG. 2 is described.
Rows 1 to 4 represent the data stored in the voice server 20 of FIG. 1. For example, row 1 indicates that server "SV20" stores a "prosodic model" database named "20A", and that its database number is "1001".
Rows 5 to 7 represent the data stored in the voice server 21 of FIG. 1.
Rows 8 and 9 represent the data stored in server "SV22", which is not shown in FIG. 1. They show that server "SV22" stores "prosodic model 22A" and "audio file 22C".
Row 10 represents the data stored in server "SV23", which is also not shown in FIG. 1. It shows that server "SV23" stores "audio file 23C".
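The database location table can be pictured as a small list of records. The following is a minimal sketch, not the patented implementation: the class and function names are hypothetical, and only rows whose values are stated in the description are included (the remaining rows of FIG. 2 are not reproduced here).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseLocation:
    """One row of the database location table 12a (field names are illustrative)."""
    database_number: int
    server: str
    db_type: str
    database_name: str


# Only rows whose values appear in the description; FIG. 2 contains more rows.
DATABASE_LOCATIONS = [
    DatabaseLocation(1001, "SV20", "prosodic model", "20A"),
    DatabaseLocation(1003, "SV20", "audio file", "20C"),
    DatabaseLocation(1008, "SV22", "audio file", "22C"),
]


def locate(database_number, table=DATABASE_LOCATIONS):
    """Return the row for a database number, or None if it is not registered."""
    for row in table:
        if row.database_number == database_number:
            return row
    return None


def databases_on(server, table=DATABASE_LOCATIONS):
    """Return every registered voice database held by the given server."""
    return [row for row in table if row.server == server]
```

Because database numbers are unique across the whole system, a lookup by number yields at most one row, whereas a lookup by server or by database name may yield several.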

FIG. 3 shows the structure of the audio file feature table 12b stored in the management index 12, together with example data.
The audio file feature table 12b has an "audio file number" column, a "speaker / tone" column, and a "feature keyword" column.
The "audio file number" column stores those "database number" values from the database location table 12a whose "DB type" value is "audio file".
The "speaker / tone" column stores information about the speaker and tone of the audio file identified by the "audio file number" value. In the example data of FIG. 3, the audio file identified by "audio file number = 1003" was uttered by "speaker A / tone A".
The "feature keyword" column stores keywords representing the characteristics of the audio file identified by the "audio file number" value. In the example data of FIG. 3, the audio file identified by "audio file number = 1003" was uttered in a manner characterized by "male, grandfather". Whether it was actually uttered by an elderly man is a separate matter.

FIG. 4 shows the structure of the database generation source table 12c stored in the management index 12, together with example data.
The database generation source table 12c has a "database number" column and an "audio file number" column.
The "database number" column stores those "database number" values from the database location table 12a whose "DB type" value is something other than "audio file".
The "audio file number" column stores values corresponding to the "audio file number" column of the audio file feature table 12b.

Next, the meaning of the example data in FIG. 4 is described.
Each row of the database generation source table 12c means that the voice database identified in the "database number" column was generated using the audio file identified in the "audio file number" column.
For example, row 1 shows that the voice database identified by "database number = 1001" (prosodic model 20A on SV20) was created using the audio file identified by "audio file number = 1003" (audio file 20C on SV20).
Information about the source of a voice database is needed because the quality of a voice database depends on which audio file was used to create it.
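Tracing a voice database back to its source audio file amounts to following table 12c into table 12b. The sketch below is illustrative only: the names are hypothetical, the mapping assumes one source audio file per generated database, and the keyword list for audio file 1008 contains only the keyword the description confirms ("female").

```python
# Database generation source table 12c: database number -> source audio file number.
# Both pairs are taken from the description (rows 1 and 6 of FIG. 4).
GENERATION_SOURCES = {1001: 1003, 1007: 1008}

# Audio file feature table 12b: audio file number -> (speaker / tone, keywords).
AUDIO_FILE_FEATURES = {
    1003: ("speaker A / tone A", ("male", "grandfather")),
    1008: ("speaker B / tone B", ("female",)),  # FIG. 3 may list further keywords
}


def origin_of(database_number):
    """Return the source audio file and its features, or None if no source is recorded."""
    audio_file_number = GENERATION_SOURCES.get(database_number)
    if audio_file_number is None:
        return None  # e.g. the number refers to an audio file, not a generated database
    speaker_tone, keywords = AUDIO_FILE_FEATURES[audio_file_number]
    return {
        "audio_file_number": audio_file_number,
        "speaker_tone": speaker_tone,
        "keywords": keywords,
    }
```

For example, `origin_of(1001)` reports that prosodic model 1001 was built from audio file 1003, spoken as "speaker A / tone A".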

When performing speech synthesis, an ordinary user is unlikely to care which audio file the prosodic model 20A was created from. A user who is knowledgeable about speech synthesis, on the other hand, may want to know in advance how good each voice database is, for example in order to obtain higher-quality synthesized speech. For those engaged in speech synthesis research and development as well, knowing the origin of each voice database is useful in their work.
To meet these needs, the present invention does not merely manage information on the location of each voice database by index; it also manages the origin of each database, so that the voice database to be used for synthesis can be selected when speech synthesis is performed.

Next, the operation of the distributed database system for speech synthesis according to Embodiment 1 is described.
Using the data held in the tables described with reference to FIGS. 2 to 4, a variety of searches can be performed and new voice databases can be generated. Specific operations are described below, taking several processes as examples.

FIG. 5 is a sequence diagram for the case where a keyword representing the characteristics of an audio file is sent from the client terminal 1 to search for matching audio files. The data held in each table is as shown in FIGS. 2 to 4.
Each step is described below. Step numbers are not shown in FIG. 5 for reasons of space, but time is assumed to flow from top to bottom. The same applies to FIGS. 6 and 7.

(1) The user operates the client terminal 1, enters the keyword "female", and executes a search request.
(2) The speaker selection unit 2 issues a search request containing the keyword "female" to the management server 10. The search request reaches the management server 10 via the network 40.
(3) The search request that has reached the management server 10 is processed by the speaker search unit 11. The speaker search unit 11 searches the audio file feature table 12b using "female" as the key.
(4) The search hits the rows whose "feature keyword" value contains "female". In the example data of FIG. 3, these are rows 3 and 4.
(5) The speaker search unit 11 obtains, as the search result, the "audio file number" and "speaker / tone" values of rows 3 and 4.
(6) The speaker search unit 11 returns the obtained "audio file number" and "speaker / tone" values to the speaker selection unit 2.
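Steps (3) to (6) above can be sketched as a keyword filter over the audio file feature table. This is a minimal illustration under stated assumptions: the function and field names are hypothetical, a row is treated as a hit when it contains every requested keyword, and the speaker / tone label for audio file 1009 is a placeholder because the description does not give it.

```python
# Audio file feature table 12b. Rows follow the description where it gives values.
AUDIO_FILE_FEATURES = [
    {"audio_file_number": 1003, "speaker_tone": "speaker A / tone A",
     "keywords": ["male", "grandfather"]},
    {"audio_file_number": 1008, "speaker_tone": "speaker B / tone B",
     "keywords": ["female"]},
    {"audio_file_number": 1009, "speaker_tone": "speaker C / tone C",  # placeholder label
     "keywords": ["female", "joy"]},
]


def search_audio_files(keywords, table=AUDIO_FILE_FEATURES):
    """Return (audio file number, speaker / tone) for rows containing all keywords."""
    wanted = set(keywords)
    return [
        (row["audio_file_number"], row["speaker_tone"])
        for row in table
        if wanted <= set(row["keywords"])
    ]
```

Calling `search_audio_files(["female"])` returns the two audio files tagged "female", which corresponds to the result the speaker search unit 11 sends back in step (6).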

FIG. 6 is a sequence diagram for the case where, after the search result of FIG. 5 has been obtained, a speaker / tone from that result is specified and generation of a voice database is requested. Each step is described below.

(1) The user operates the client terminal 1, specifies "speaker B / tone B" from the search result obtained in step (6) of FIG. 5, and executes a generation request for "an acoustic model database for speech uttered by speaker B in tone B".
The user is asked to select a "speaker / tone" pair here not only because the user may simply want that particular pair, but also because, as explained with reference to FIG. 4, the quality of the generated acoustic model database depends on which audio file in the search result is used.
(2) The DB generation request unit 3 issues a DB generation request to the management server 10 using "speaker B / tone B", "acoustic model", and "audio file number = 1008" as keys. The DB generation request reaches the management server 10 via the network 40.
(3) The DB generation request that has reached the management server 10 is processed by the DB generation unit 14. The DB generation unit 14 searches the database generation source table 12c using "audio file number = 1008" as the key and checks whether a database has already been generated.
(4) The search hits the rows whose "audio file number" value is "1008". In the example data of FIG. 4, this is row 6.
(5) The DB generation unit 14 obtains, as the search result, the value "1007" in the "database number" column of row 6.
(6) The DB generation unit 14 searches the database location table 12a using "database number = 1007" as the key and obtains the type of that voice database.
(7) The search shows that the voice database identified by "database number = 1007" is a "prosodic model" database. In other words, no acoustic model database generated from the audio file identified by "audio file number = 1008" exists.
(8) Because no such acoustic model database exists, the DB generation unit 14 replies to the DB generation request unit 3 that it will perform the generation process. Creating an acoustic model database is time-consuming, so the client terminal 1 ends its processing without waiting for any further reply.
(9) The DB generation unit 14 starts generating the acoustic model database using the audio file identified by "audio file number = 1008". Here, the database is assumed to be generated on server "SV22", where audio file "1008" resides.
(10) When generation of the acoustic model database is complete, the DB generation unit 14 registers information such as its location in the following steps (11) and (12).
(11) The DB generation unit 14 registers a new entry representing the generated acoustic model database in the database location table 12a. In this case, the column values are "database number = 1010", "server = SV22", "DB type = acoustic model", and "database name = 22B".
(12) The DB generation unit 14 registers a new entry representing the source of the generated acoustic model database in the database generation source table 12c. In this case, the column values are "database number = 1010" and "audio file number = 1008".
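The check-then-register logic of steps (3) to (12) can be sketched as follows. This is a simplified, synchronous illustration rather than the described system: the function names are hypothetical, the model training itself is omitted, the server and name recorded for database 1007 are assumptions, and the new database number and name are passed in as arguments because the description does not say how they are allocated.

```python
# Database location table 12a, keyed by database number.
LOCATIONS = {
    1007: {"server": "SV22", "db_type": "prosodic model", "name": "22A"},  # server/name assumed
    1008: {"server": "SV22", "db_type": "audio file", "name": "22C"},
}
# Database generation source table 12c: (database number, audio file number).
SOURCES = [(1007, 1008)]


def find_generated(audio_file_number, db_type):
    """Steps (3)-(7): look for a database of db_type already built from the audio file."""
    for db_number, source in SOURCES:
        if source == audio_file_number and LOCATIONS[db_number]["db_type"] == db_type:
            return db_number
    return None


def request_generation(audio_file_number, db_type, new_number, new_name):
    """Steps (8)-(12): report an existing database or register a newly generated one."""
    existing = find_generated(audio_file_number, db_type)
    if existing is not None:
        return ("already exists", existing)
    # Generate on the server that holds the source audio file, as in step (9).
    server = LOCATIONS[audio_file_number]["server"]
    # The time-consuming model training would run here; it is outside this sketch.
    LOCATIONS[new_number] = {"server": server, "db_type": db_type, "name": new_name}  # step (11)
    SOURCES.append((new_number, audio_file_number))  # step (12)
    return ("generated", new_number)
```

In the described sequence the reply of step (8) is sent before generation starts, so a real implementation would run the generation asynchronously; the sketch collapses this into a single call for brevity.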

If the voice database targeted by the generation request has already been created, the DB generation unit 14 reports this to the DB generation request unit 3 in step (8), and the subsequent steps are omitted. The same applies to FIG. 7.

FIG. 7 is a sequence diagram for the case where the requests of FIGS. 5 and 6 are executed together in a single operation. Each step is described below.

(1) The user operates the client terminal 1, enters the keywords "female, joy", and executes a generation request for an acoustic model database.
(2) The DB generation request unit 3 issues a DB generation request to the management server 10 using "female, joy" and "prosodic model" as keys. The DB generation request reaches the management server 10 via the network 40.
(3) The DB generation request that has reached the management server 10 is processed by the DB generation unit 14. The DB generation unit 14 searches the audio file feature table 12b using "female, joy" as the key.
(4) The search hits the rows whose "feature keyword" value contains "female, joy". In the example data of FIG. 3, this is row 4.
(5) The DB generation unit 14 obtains, as the search result, the "audio file number" and "speaker / tone" values of row 4.
(6) The DB generation unit 14 searches the database generation source table 12c using "audio file number = 1009" as the key and checks whether a database has already been generated.
(7) The search looks for rows whose "audio file number" value is "1009". In the example data of FIG. 4, no such row exists.
(8) The DB generation unit 14 obtains, as the search result, a "null" value indicating that no matching data exists.
(9) The subsequent processing is the same as steps (8) to (12) of FIG. 6, so FIG. 7 shows only an outline of the sequence and the description is omitted here.
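The combined flow chains the keyword search into the generated-database check. The sketch below is a hedged illustration with hypothetical names: it returns, for each matching audio file, the number of an existing database of the requested type or `None` (the "null" result of step (8)). The requested type is left as a parameter, and the table contents are limited to values stated in the description.

```python
# Audio file feature table 12b (only the rows tagged "female" in the description).
FEATURES = [
    {"audio_file_number": 1008, "keywords": {"female"}},
    {"audio_file_number": 1009, "keywords": {"female", "joy"}},
]
# DB type per generated database (from table 12a) and generation sources (table 12c).
DB_TYPES = {1007: "prosodic model"}
SOURCES = [(1007, 1008)]


def plan_generation(keywords, db_type):
    """Return (audio file number, existing database number or None) for each hit."""
    wanted = set(keywords)
    plan = []
    for row in FEATURES:
        if not wanted <= row["keywords"]:
            continue  # steps (3)-(5): keep only rows containing every keyword
        audio = row["audio_file_number"]
        # Steps (6)-(8): look up table 12c; None plays the role of the "null" result.
        existing = next(
            (db for db, src in SOURCES if src == audio and DB_TYPES[db] == db_type),
            None,
        )
        plan.append((audio, existing))
    return plan
```

An entry whose second element is `None` marks an audio file for which generation would proceed as in steps (8) to (12) of FIG. 6.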

The description above covered examples of generating a prosodic model or an acoustic model from an audio file, but a new audio file can also be generated from an audio file. The processing in that case is the same as described above.

In Embodiment 1, FIG. 3 illustrated a table that associates the speaker and tone of an audio file with feature keywords. This arrangement was chosen for convenience in explaining the generation of prosodic models and acoustic models from audio files.
A table holding information that represents characteristics may likewise be provided for prosodic models and acoustic models to make them easier to search.

The table structures described with reference to FIGS. 2 to 4 are normalized to some extent, but the column layout is not limited to this. Depending on the content of the requests issued to the distributed database system for speech synthesis and on search efficiency, the column layout may be changed as appropriate, or the tables may be denormalized.
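As one illustration of denormalization, the three tables can be joined ahead of time into flat rows so that a single lookup answers both "where is this database" and "what was it built from". This is only a sketch with hypothetical names, using the one fully specified example (database 1001 built from audio file 1003); the description does not prescribe any particular flattened layout.

```python
# Normalized tables, restricted to values given in the description.
LOCATIONS = {  # table 12a: database number -> (server, DB type, database name)
    1001: ("SV20", "prosodic model", "20A"),
    1003: ("SV20", "audio file", "20C"),
}
FEATURES = {  # table 12b: audio file number -> (speaker / tone, keywords)
    1003: ("speaker A / tone A", ("male", "grandfather")),
}
SOURCES = [(1001, 1003)]  # table 12c: (database number, audio file number)


def denormalized_rows():
    """Join tables 12a, 12b and 12c into one flat row per generated database."""
    rows = []
    for db_number, audio_number in SOURCES:
        server, db_type, name = LOCATIONS[db_number]
        speaker_tone, keywords = FEATURES[audio_number]
        rows.append({
            "database_number": db_number,
            "server": server,
            "db_type": db_type,
            "database_name": name,
            "audio_file_number": audio_number,
            "speaker_tone": speaker_tone,
            "keywords": keywords,
        })
    return rows
```

The trade-off is the usual one: flat rows make keyword-to-location queries a single scan, at the cost of repeating the audio file's features in every database derived from it.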

Because the present invention relates to a database system that stores the data necessary for speech synthesis in a distributed manner, the execution of speech synthesis itself is not described. The speech synthesis function may be provided in the voice servers 20, 21, etc., or in the management server 10 or the client terminal 1.

In FIG. 2, the value of the “server” column indicates the location of a voice server, such as its address, but nothing is shown about the file path of a voice database within that voice server.
Each voice server may manage the file paths and similar details within itself on its own responsibility, or the management server 10 may hold a more detailed index. Which approach to take is a design choice.
In general, it is considered desirable for each voice server to manage its own internal file paths and hide them from everything outside the voice server.
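One way to picture the path-hiding option is a voice server object that keeps its local paths private and exposes only database numbers. The class name `VoiceServer`, its methods, and the example path are hypothetical and serve only to illustrate the design choice.

```python
class VoiceServer:
    """Hypothetical voice server that resolves database numbers to local paths itself."""

    def __init__(self, name, local_paths):
        self.name = name
        # database number -> local file path; kept private to this server
        self._local_paths = dict(local_paths)

    def describe(self, database_number):
        """Return what outside callers may see: server name and database number only."""
        if database_number not in self._local_paths:
            raise KeyError(database_number)
        return {"server": self.name, "database_number": database_number}

    def _resolve(self, database_number):
        """Internal lookup; a real server would open the file at this path."""
        return self._local_paths[database_number]


# Example path is an assumption made for this sketch.
sv22 = VoiceServer("SV22", {1007: "/data/prosody/1007.db"})
public_view = sv22.describe(1007)
```

With this split, the management index only needs the server name and database number, and a voice server can reorganize its storage without any change to the index.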

As described above, according to Embodiment 1, the locations of the voice databases stored in the voice servers 20, 21, etc. are managed by the management index 12. Even when the voice databases are distributed, they can therefore be searched freely, or combined to generate a new voice database.
In addition, because the system records which voice file each voice database was generated from, the quality of each voice database can be estimated to some extent, and the user can choose which voice database to use for speech synthesis.

Embodiment 2.
Embodiment 1 mainly described the processing performed when the client terminal 1 requests generation of a voice database.
Embodiment 2 of the present invention describes an example in which the client terminal 1 includes a speech synthesis unit 4 and performs speech synthesis using voice databases acquired from the voice servers 20, 21, etc.

FIG. 8 shows a schematic configuration of the distributed database system for speech synthesis according to Embodiment 2.
The configuration of the distributed database system for speech synthesis itself is unchanged from FIG. 1 described in Embodiment 1, except that the client terminal 1 is newly provided with a speech synthesis unit 4.
The client terminal 1 asks the management server 10 for the location of a voice database, issues a generation request if the database does not exist, and performs speech synthesis with the speech synthesis unit 4 using the acquired voice database.

FIG. 9 is a sequence diagram for the case in which the client terminal 1 transmits a keyword representing the characteristics of a voice file and acquires a prosodic model database generated from the matching voice file. The data held in each table is as shown in FIGS. 2 to 4.
Each step is described below.

(1) The user operates the client terminal 1, enters the keyword “female”, and executes an acquisition request for a prosodic model database.
(2) The speaker selection unit 2 issues an acquisition request for a prosodic model database, including the keyword “female”, to the management server 10. The acquisition request reaches the management server 10 via the network 40.
(3) The request that reaches the management server 10 is processed by the speaker search unit 11. The speaker search unit 11 searches the voice file feature table 12b using “female” as a key.
(4) The search hits rows whose “feature keyword” column contains “female”. In the data example of FIG. 3, the third and fourth rows match.
(5) The speaker search unit 11 searches the database generation source table 12c using “voice file number = 1008 or 1009” as a key to check whether a database has already been generated.
(6) The search hits rows whose “voice file number” column has the value “1008” or “1009”. In the data example of FIG. 4, the sixth row matches.
(7) The speaker search unit 11 obtains, as the search result, the value “1007” in the “database number” column of the sixth row.
(8) The speaker search unit 11 searches the database location table 12a using “database number = 1007” as a key and obtains the type of that voice database.
(9) The search shows that the voice database identified by “database number = 1007” is a “prosodic model” database.
(10) The speaker search unit 11 returns to the speaker selection unit 2 the value “SV22” in the “server” column for the voice database identified by “database number = 1007”, together with the voice file number “1008” from which that database was generated.
(11) The speech synthesis unit 4 accesses the server “SV22” and acquires the prosodic model 22A.
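Steps (3) to (10) amount to a chained lookup across the three tables of the management index 12, which can be sketched as follows. Only the values quoted in the steps (“female”, 1008, 1009, 1007, “prosodic model”, “SV22”) come from the text. The dictionary layout, the extra “male” row, and the function name `find_databases` are assumptions for illustration and do not reproduce FIGS. 2 to 4.

```python
# Illustrative stand-ins for the three tables held in the management index 12.
voice_file_feature_table = [  # 12b: voice file number -> feature keywords
    {"voice_file_number": 1008, "keywords": {"female"}},
    {"voice_file_number": 1009, "keywords": {"female"}},
    {"voice_file_number": 1001, "keywords": {"male"}},  # hypothetical row
]
generation_source_table = [  # 12c: database number -> source voice file
    {"database_number": 1007, "voice_file_number": 1008},
]
database_location_table = [  # 12a: database number -> type and server
    {"database_number": 1007, "type": "prosodic model", "server": "SV22"},
]


def find_databases(keyword, db_type):
    """Return (server, source voice file number) pairs for the keyword and type."""
    # Steps (3)-(4): voice files whose feature keywords include the keyword.
    voice_files = {
        row["voice_file_number"]
        for row in voice_file_feature_table
        if keyword in row["keywords"]
    }
    # Steps (5)-(7): databases already generated from those voice files.
    generated = [
        row for row in generation_source_table
        if row["voice_file_number"] in voice_files
    ]
    # Steps (8)-(10): keep databases of the requested type and report their server.
    results = []
    for gen in generated:
        for loc in database_location_table:
            if (loc["database_number"] == gen["database_number"]
                    and loc["type"] == db_type):
                results.append((loc["server"], gen["voice_file_number"]))
    return results


print(find_databases("female", "prosodic model"))  # [('SV22', 1008)]
```

An empty result list corresponds to the case in which no matching prosodic model exists and a generation request is needed.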

If no prosodic model matching the keyword “female” is found, a new one can be generated by the same procedure as described with reference to FIG. 7 in Embodiment 1.
Step (10) reveals that the prosodic model 22A was generated from the voice file “1008”. If a model generated from a different voice file is needed, the search can be re-run with a different keyword.
To actually perform speech synthesis, the prosodic model, the acoustic model, the voice file, and the segment selection data must all be acquired.

Although not described in Embodiments 1 and 2 above, a request may also simply browse the contents of a table. For example, a request such as “obtain a list of the voice databases stored in the server SV22” is possible.
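Such a browse request reduces to a filter on the “server” column of the database location table 12a. The sketch below assumes an in-memory table. The function name `list_databases_on` and the second row, including the server name “SV20”, are hypothetical.

```python
# Illustrative stand-in for the database location table 12a.
database_location_table = [
    {"database_number": 1007, "type": "prosodic model", "server": "SV22"},
    {"database_number": 1003, "type": "acoustic model", "server": "SV20"},  # hypothetical row
]


def list_databases_on(server):
    """Return every row of the location table stored on the given server."""
    return [row for row in database_location_table if row["server"] == server]


sv22_databases = list_databases_on("SV22")
```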

As described above, according to Embodiment 2, the client terminal 1 includes the speech synthesis unit 4, so the client terminal 1 can acquire each type of voice database and perform speech synthesis itself.
This is an effective implementation when speech synthesis processing is to be distributed to the client side.

In Embodiments 1 and 2 above, the initial data of each table may be set as appropriate by an administrator or the like. After that, the DB generation unit 14 registers data automatically, so no further manual management is required.

FIG. 1 shows a schematic configuration of the distributed database system for speech synthesis according to Embodiment 1.
FIG. 2 shows the structure and example data of the database location table 12a stored in the management index 12.
FIG. 3 shows the structure and example data of the voice file feature table 12b stored in the management index 12.
FIG. 4 shows the structure and example data of the database generation source table 12c stored in the management index 12.
FIG. 5 is a sequence diagram for the case in which the client terminal 1 transmits a keyword representing the characteristics of a voice file and searches for the matching voice file.
FIG. 6 is a sequence diagram for the case in which, after the search result of FIG. 5 is obtained, a voice database generation request is made by designating the speaker and tone from that search result.
FIG. 7 is a sequence diagram for the case in which the requests of FIGS. 5 and 6 are executed together.
FIG. 8 shows a schematic configuration of the distributed database system for speech synthesis according to Embodiment 2.
FIG. 9 is a sequence diagram for the case in which a keyword representing the characteristics of a voice file is transmitted and a prosodic model database generated from the matching voice file is acquired.

Explanation of symbols

1 client apparatus, 2 speaker selection unit, 3 DB generation request unit, 4 speech synthesis unit, 10 management server, 11 speaker search unit, 12 management index, 12a database location table, 12b voice file feature table, 12c database generation source table, 14 DB generation unit, 20 voice server, 21 voice server, 30 network, 40 network.

Claims

A distributed database system for speech synthesis, comprising:
one or more voice servers that store data necessary for synthesizing speech characterized by speaker and tone; and
a management server that holds an index of the data stored in each voice server,
wherein the management server
further holds a list that expresses, by keywords representing their characteristics, the speaker and tone of the voice file from which the data specified by the index was generated, and,
upon receiving a request that specifies, by the keyword, the data necessary for speech synthesis,
searches the index and the list and returns information indicating the location of the corresponding data.

The distributed database system for speech synthesis according to claim 1, wherein
each of the voice servers
stores, as the data necessary for speech synthesis, one or more of a prosodic model database, an acoustic model database, and a voice file database, and
the management server
holds in the index information indicating which of the prosodic model database, the acoustic model database, and the voice file database each voice server stores, and,
when returning information indicating the location of a database stored in a voice server,
also returns information indicating which of those databases the voice server stores.

The distributed database system for speech synthesis according to claim 2, wherein
the management server
holds in the index information indicating which voice server stores the voice file from which each of the prosodic model database, the acoustic model database, and the voice file database was generated, and,
when returning information indicating the location of a database stored in a voice server,
also returns information indicating which voice server stores the voice file from which that database was generated.

The distributed database system for speech synthesis according to claim 3, wherein
the management server,
upon receiving a request that specifies, by the keyword, the database necessary for speech synthesis,
and finding no corresponding database even after searching the index and the list,
searches the list for a voice file necessary for generating the database and generates the requested database using the corresponding voice file.