WO2024247848A1 - Information processing device, information processing method, program, and information processing system

Info

Publication number: WO2024247848A1
Application number: PCT/JP2024/018855
Authority: WO
Inventors: 和由堀江
Original assignee: ソニーグループ株式会社
Priority date: 2023-06-01
Filing date: 2024-05-22
Publication date: 2024-12-05

Abstract

This information processing device is provided with: a speaker search unit that performs a search based on the topic of text data, to select speaker data from among a plurality of sets of speaker data; and a speech synthesizing unit which generates synthesized speech data relating to the text data on the basis of the speaker data selected by the speaker search unit.

Description

Information processing device, information processing method, program, and information processing system

The present technology relates to an information processing device, an information processing method, a program, and an information processing system, and in particular to voice synthesis processing used for, for example, narration and dialogue in content.

In recent years, content that uses text-to-speech (TTS) synthesis for narration has been increasing on distribution services. The voice synthesis software and services in use provide multiple TTS speakers, so users can select a voice quality that suits their content.
Patent Document 1 below discloses a technique that selects a male or female voice by tallying, for each word in the text to be read aloud, how often that word appears in male sentences and in female sentences.

Patent Document 1: Japanese Unexamined Patent Application Publication No. H11-296193

When only a few TTS speakers are offered, a user can listen to sample voices of all of them and choose one that suits the content, but the choice is correspondingly limited. When more TTS speakers are offered, listening to sample voices of every speaker takes more time. In addition, when several voices give a similar impression, the user may be unsure which one suits the content.

To make voices easier to select, TTS speakers can be distinguished in steps by the pitch (frequency) of the voice, by the speaker's gender or age, or by the impression of the voice (for example, "mature voice" or "energetic voice").
However, even if a speaker is chosen as, for example, "a male with a medium-pitched voice," the desired voice quality is likely to differ depending on whether the content is a documentary program or a sports program.
Furthermore, the technique of Patent Document 1 is limited to text written in spoken language. It is therefore still insufficient for suggesting a voice according to the content.

The present technology therefore makes it possible to more accurately provide the user with a voice suited to the content from among the selectable voices.

An information processing device according to the present technology includes a speaker search unit that selects speaker data from among a plurality of sets of speaker data by performing a search based on the topic of text data, and a voice synthesis unit that generates synthetic voice data of the text data using the speaker data selected by the speaker search unit.
Speaker data whose voice quality corresponds to the topic of the text data (its subject matter or genre) is selected, and the read-aloud voice for the text data is synthesized with that speaker data.

FIG. 1 is an explanatory diagram of a system configuration according to an embodiment of the present technology.
FIG. 2 is an explanatory diagram of a functional configuration of an information terminal according to the embodiment.
FIG. 3 is an explanatory diagram of a functional configuration of a voice synthesis server according to the embodiment.
FIG. 4 is an explanatory diagram of a functional configuration of a content analysis server according to the embodiment.
FIG. 5 is an explanatory diagram of a functional configuration of a database according to the embodiment.
FIG. 6 is an explanatory diagram of the processing of the TTS speaker suggestion system according to the embodiment.
FIG. 7 is an explanatory diagram of an example of topics used in the embodiment.
FIG. 8 is an explanatory diagram of a learning process for voice features according to the embodiment.
FIG. 9 is a flowchart of a topic model generation process according to the embodiment.
FIG. 10 is a flowchart of a TTS reference data generation process according to the embodiment.
FIG. 11 is a flowchart of processing by the voice synthesis server according to the embodiment.
FIG. 12 is a flowchart of processing by the database according to the embodiment.
FIG. 13 is an explanatory diagram of a display example on the information terminal according to the embodiment.
FIG. 14 is a flowchart of time stamp setting according to changes in the speech feature according to the embodiment.
FIG. 15 is a flowchart of a TTS reference data generation process that refers to time stamps according to the embodiment.
FIG. 16 is a flowchart of a process in which the database selects multiple TTS reference data according to the embodiment.
FIG. 17 is an explanatory diagram of the number of reference data items according to variance in the embodiment.
FIG. 18 is an explanatory diagram of a display example on the information terminal according to the embodiment.
FIG. 19 is a block diagram of an information processing device according to the embodiment.

The embodiments will be described in the following order.
<1. System configuration>
<2. Operation of the TTS speaker suggestion system>
<3. Processing of the content analysis server>
<4. Processing of the voice synthesis server>
<5. Processing of the database>
<6. Display on the information terminal>
<7. Handling of reference content with multiple speakers>
<8. Proposal of multiple speaker candidates>
<9. Configuration of the information processing device>
<10. Summary and modifications>

<1. System configuration>
FIG. 1 shows a configuration example of a TTS speaker suggestion system 1 according to the embodiment. The TTS speaker suggestion system 1 suggests to a user 10 a TTS speaker appropriate for the text of content CT1 that the user is producing, for example, on an information terminal 100.

In the TTS speaker suggestion system 1, the information terminal 100, a voice synthesis server 200, a content analysis server 300, a database 400, and a content provider 500 are connected via a network 600.

The user 10 operates the information terminal 100 to produce the content CT1. In the present disclosure, "content CT1" refers to a work that mainly uses audio, with or without video (moving or still images). The information terminal 100 has the functions shown in FIG. 2 for content production.

The content production application 110 is a function that produces and edits the content CT1 in response to operations by the user 10.
The display 120 is a function that displays video to the user 10.
The character input unit 130 is an input function with which the user 10 enters, for example, text data for the content CT1.
The speaker 140 is a function that outputs audio to the user 10.
The network communication unit 150 is a function that communicates with other devices via the network 600.

The information terminal 100 is connected to the network 600, such as the Internet, through the network communication unit 150, and can obtain synthetic voice data AD by transmitting text data TD to the voice synthesis server 200.
The text data TD to be transmitted is entered by the user 10 during production with the content production application 110. For example, the user 10 enters the text data TD using the character input unit 130.

The voice synthesis server 200 converts the text data TD transmitted from the information terminal 100 into synthetic voice data AD.
For this purpose, as shown in FIG. 3, the voice synthesis server 200 includes a text-to-phoneme symbol conversion unit 210, a reference data acquisition unit 220, a speaker search unit 230, a retained speaker data unit 240, a voice synthesis unit 250, and a network communication unit 260.

The voice synthesis server 200 is connected to the network 600 through the network communication unit 260 and receives the text data TD transmitted from the information terminal 100.

The text-to-phoneme symbol conversion unit 210 of the voice synthesis server 200 is a function that converts the text data TD into phoneme data. A phoneme is a symbol that represents an individual speech sound. With this function, the voice synthesis server 200 generates phoneme data corresponding to the text data TD transmitted from the information terminal 100.

The reference data acquisition unit 220 is a function that acquires TTS reference data RD from the database 400. The TTS reference data RD is data used to generate synthetic voice data AD that matches the content CT1. The reference data acquisition unit 220 transmits the text data TD received from the information terminal 100 to the database 400 and requests TTS reference data RD that matches the content CT1.

The speaker search unit 230 is a function that uses the speech feature RDX in the TTS reference data RD transmitted from the database 400 to search the speaker data held by the voice synthesis server 200 for speaker data whose voice is similar to the speech feature RDX (that is, whose speech feature is similar).

The retained speaker data unit 240 is a function that stores the speaker data searched by the speaker search unit 230. Each set of speaker data includes a speaker ID that identifies it and voice quality data associated with that speaker ID.
The voice synthesis server 200 selects one of the multiple sets of speaker data stored in the retained speaker data unit 240 and suggests it to the user 10.

The voice synthesis unit 250 is a function that generates the synthetic voice data AD from the text data TD. It generates the synthetic voice data AD corresponding to the text data TD on the basis of the speaker data selected by the speaker search unit 230 and the phoneme data obtained by the text-to-phoneme symbol conversion unit 210.
The conversion performed by the text-to-phoneme symbol conversion unit 210 is also called grapheme-to-phoneme (G2P) conversion. The voice synthesis unit 250 is generally composed of components called a "synthesizer" and a "vocoder" (not shown).

The voice synthesis server 200 sends the generated synthetic voice data AD to the network 600 using the network communication unit 260. The synthetic voice data AD then passes through the network communication unit 150 of the information terminal 100 and is used by the content production application 110.

The content analysis server 300 shown in FIG. 1 analyzes content CT2 provided by the content provider 500. The content CT2 is content that is generally distributed or broadcast, and is referred to as "reference content CT2" to distinguish it from the content CT1. The reference content CT2 likewise refers to a work in which audio is recorded, with or without video.
The content analysis server 300 analyzes the reference content CT2 to generate the TTS reference data RD, which the voice synthesis server 200 uses during voice synthesis.

The aim of the present embodiment is to provide a voice quality suited to the text data TD entered by the user 10, that is, synthetic voice suited to the content CT1. To achieve this, the content analysis server 300 prepares information about voices that are suitable for reading the text data TD aloud.

As shown in FIG. 4, the content analysis server 300 includes a content acquisition unit 310, a voice extraction unit 320, a voice recognition unit 330, a storage unit 340, a topic analysis unit 350, a speech feature acquisition unit 360, and a network communication unit 370.

The content analysis server 300 can communicate with the content provider 500, the database 400, and other devices via the network communication unit 370.

The content acquisition unit 310 is a function that acquires the reference content CT2 to be analyzed. It transmits an acquisition request RQ to the content provider 500 via the network communication unit 370 and acquires various reference content CT2. For example, public domain Internet radio programs could be downloaded as reference content CT2. Alternatively, the content provider 500 could be asked individually to provide a content acquisition API (Application Programming Interface).

The voice extraction unit 320 extracts the audio from the reference content CT2 acquired from the content provider 500. When the reference content CT2 is a video, the audio can be extracted with existing software.

The voice recognition unit 330 converts the audio of the reference content CT2 extracted by the voice extraction unit 320 into text data, that is, into an utterance text TR.
The storage unit 340 stores the utterance text TR.

The topic analysis unit 350 determines a topic vector RDT for the text data transcribed by the voice recognition unit 330. The topic vector RDT is information that expresses what topic the content (text data) is about. It can therefore be regarded as an index for classifying topics.
Here, "topic" is used as a term that indicates what the text data is about, such as its subject matter or genre. Taking movies as an example, talk about a specific movie, talk about the movie industry as a whole, and genres such as "action movies" and "comedy movies" are all collectively referred to as topics.

The speech feature acquisition unit 360 converts the audio data of the reference content CT2 extracted by the voice extraction unit 320 into a speech feature RDX called an "x-vector."

With the above functions, the content analysis server 300 obtains the following for each reference content CT2:
- Topic vector RDT
- Speech feature RDX
- URL of the content (RDU)
These pieces of information are combined into one TTS reference data RD and stored in the database 400 via the network 600.
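The three items that make up one TTS reference data RD can be pictured as a simple record. The following is a minimal sketch, not the format used in the embodiment: the class name, field names, validation rule, and example URL are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TTSReferenceData:
    """One TTS reference data RD for a single reference content CT2 (hypothetical layout)."""
    topic_vector: List[float]    # RDT: probability of each topic
    speech_feature: List[float]  # RDX: x-vector of the speaker's voice
    content_url: str             # RDU: URL of the reference content


def make_reference_data(topic_vector, speech_feature, content_url):
    """Bundle the three analysis results into one record after a basic sanity check."""
    # Assumption: the topic vector is a probability distribution, so it sums to 1.
    if abs(sum(topic_vector) - 1.0) > 1e-6:
        raise ValueError("topic vector should sum to 1")
    return TTSReferenceData(list(topic_vector), list(speech_feature), content_url)


# Example with a two-topic vector and a placeholder 512-dimensional feature.
rd = make_reference_data([0.8, 0.2], [0.1] * 512, "https://example.com/content/1")
```

Keeping the URL next to the two vectors means a selected record can always be traced back to the reference content it was derived from.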

In response to a query from the voice synthesis server 200, the database 400 searches for the speech feature of a voice quality suitable for reading the text data TD aloud. The search uses the TTS reference data RD produced by the analysis in the content analysis server 300.

As shown in FIG. 5, the database 400 includes a storage unit 410, a topic analysis unit 420, a topic similarity analysis unit 430, and a network communication unit 440.
The database 400 can communicate with the content analysis server 300, the voice synthesis server 200, and other devices via the network communication unit 440.

The storage unit 410 stores the TTS reference data RD for various reference content CT2. The figure shows a state in which N sets of TTS reference data RD are stored as TTS reference data RD-1 to RD-N.
Each TTS reference data RD contains a topic vector RDT, a speech feature RDX, and the URL (RDU) of the reference content CT2.

The topic analysis unit 420 is a function that converts the text data TD transmitted from the voice synthesis server 200 into a topic vector TV.
The topic similarity analysis unit 430 is a function that searches the TTS reference data RD stored in the storage unit 410 for the one whose topic vector RDT is most similar to the topic vector TV.
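The search performed by the topic similarity analysis unit 430 can be sketched as a nearest-neighbor lookup over the stored topic vectors. The text above does not state which similarity measure is used for topic vectors, so cosine similarity here is an assumption, as are the function names and the sample values.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_most_similar_reference(tv, stored):
    """Return the ID of the stored entry whose topic vector RDT is closest to TV.

    `stored` is a list of (reference_id, rdt) pairs; this layout is hypothetical.
    """
    if not stored:
        raise ValueError("no TTS reference data stored")
    best_id, _ = max(stored, key=lambda item: cosine_similarity(tv, item[1]))
    return best_id


stored = [("RD-1", [0.9, 0.1]), ("RD-2", [0.2, 0.8]), ("RD-3", [0.5, 0.5])]
best = find_most_similar_reference([0.8, 0.2], stored)
```

A linear scan like this is enough for a small store; a larger one would likely need an index, which is outside the scope of this sketch.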

The content provider 500 shown in FIG. 1 is a provider that supplies the reference content CT2. A video posting site, for example, corresponds to the content provider 500. The content provider 500 is not limited to a single entity and may consist of multiple companies or services. The reference content CT2 it provides is also not limited to video and may be audio-only content such as that offered by Internet radio.

<2. Operation of the TTS speaker suggestion system>
The operation of the TTS speaker suggestion system 1 configured as above will now be described.
First, the purpose of the operation of the TTS speaker suggestion system 1 according to the embodiment is described.

The TTS speaker suggestion system 1 operates to provide the user 10, who is producing the content CT1, with a voice whose quality suits the content CT1.
The narration and dialogue of the content CT1 need a voice that fits what the content CT1 is about. Offering many voice qualities for the text of the content CT1 widens the choice for the user 10, but makes the selection itself laborious. And even if voices can be selected by pitch, by the speaker's gender or age, or by the impression of the voice, the selected voice does not necessarily match the content CT1.

The TTS speaker suggestion system 1 of the present embodiment therefore generates synthetic voice with a voice quality close to that of a narrator or actor appearing in the reference content CT2 provided by the content provider 500, and thereby suggests to the user 10 a TTS speaker suited to the content CT1 being produced.
Reading the narration and dialogue of the content CT1 in a voice close to that of a speaker appearing in widely viewed reference content CT2 can be expected to keep the audience from finding the tone of the synthetic voice out of place.

The flow of operations of each device in the TTS speaker suggestion system 1 will be described with reference to FIG. 6.
First, as a prerequisite for suggesting a TTS speaker to the user 10, process ST1 is performed, in which the content analysis server 300 generates a topic model TM and TTS reference data RD and stores them in the database 400.

For the voice synthesis server 200 to generate synthetic voice data AD with a speaker suited to the genre and topic of the content CT1, TTS reference data RD is required. The content analysis server 300 therefore analyzes the reference content CT2 and prepares the TTS reference data RD in the database 400. Process ST1 only needs to be performed continually, one item after another, on a wide variety of reference content CT2. In other words, it does not need to be synchronized with process ST2, which suggests a speaker to the user 10 (information terminal 100).

In process ST1, the content analysis server 300 sends an acquisition request RQ to the content provider 500. In response, the content provider 500 transmits the reference content CT2 to the content analysis server 300.

On acquiring the reference content CT2, the content analysis server 300 analyzes it and generates the topic model TM and the TTS reference data RD. It then transmits these to the database 400 for storage.

The content analysis server 300 can monitor the distribution of reference content CT2 by the content provider 500 and perform process ST1 as appropriate to generate TTS reference data RD. By focusing the analysis on popular reference content CT2, it can be expected to obtain TTS reference data RD for suggesting to the user 10 voice qualities that a wider audience will accept.

Next, process ST2 will be described. Process ST2 is executed when the user 10 transmits text data TD from the information terminal 100.

The information terminal 100 transmits the text data TD to the voice synthesis server 200. The text data TD is, for example, the text of narration or dialogue for the content CT1 that the user 10 is producing.

The voice synthesis server 200 transmits the received text data TD to the database 400.

The database 400 performs "topic analysis" and "speech feature search" on the received text data TD and selects one or more TTS reference data RD suited to the text data TD. The database 400 then transmits the selected TTS reference data RD to the voice synthesis server 200.

The voice synthesis server 200 compares the speech feature RDX in the TTS reference data RD received from the database 400 with the speech features of the speakers held in its retained speaker data unit 240, and selects the speaker data with the most similar voice quality.

The voice synthesis server 200 then performs voice synthesis on the text data TD using the voice synthesis model of that speaker data and generates the synthetic voice data AD. The voice synthesis server 200 transmits the synthetic voice data AD to the information terminal 100.

On the information terminal 100, the user 10 uses the content production application 110 to incorporate the received synthetic voice data AD into the content CT1 being produced.

In this way, the TTS speaker suggestion system 1 suggests to the user 10 synthetic voice data AD with a voice quality suited to the narration, dialogue, and so on of the content CT1.
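The order of steps in process ST2, as seen from the voice synthesis server 200, can be summarized in a short sketch. This is only an outline under stated assumptions: the function name `handle_synthesis_request`, its parameters, and the stand-in callables for the database query, similarity measure, and synthesizer are all hypothetical, and real network communication is omitted.

```python
from typing import Callable, Dict, Sequence


def handle_synthesis_request(
    text: str,
    request_reference_feature: Callable[[str], Sequence[float]],
    speakers: Dict[str, Sequence[float]],
    similarity: Callable[[Sequence[float], Sequence[float]], float],
    synthesize: Callable[[str, str], bytes],
) -> bytes:
    """Outline of process ST2 on the voice synthesis server side."""
    # 1. Forward the text data TD to the database and receive the speech feature RDX.
    rdx = request_reference_feature(text)
    # 2. Pick the retained speaker whose speech feature is most similar to RDX.
    speaker_id = max(speakers, key=lambda sid: similarity(rdx, speakers[sid]))
    # 3. Synthesize the text with that speaker and return the synthetic voice data AD.
    return synthesize(text, speaker_id)


def dot(a, b):
    """Stand-in similarity measure used only for this demonstration."""
    return sum(x * y for x, y in zip(a, b))


audio = handle_synthesis_request(
    "I went to see a movie last weekend.",
    request_reference_feature=lambda td: [1.0, 0.0],          # stub database
    speakers={"speaker-a": [0.9, 0.1], "speaker-b": [0.1, 0.9]},
    similarity=dot,
    synthesize=lambda td, sid: f"{sid}|{td}".encode("utf-8"),  # stub synthesizer
)
```

Passing the database query and the synthesizer in as callables keeps the sketch independent of any particular transport or synthesis engine.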

Topic analysis and speech features are explained here.
Topic analysis is described first. FIG. 7 shows information generated from several items of reference content CT2 (this information is defined as a topic model).

In the example of FIG. 7, to keep the explanation simple, content is divided into genres using two topics, #1 and #2. Latent Dirichlet allocation (LDA) was used for the division. In topic #1, words related to "movies" have a high occurrence probability. In topic #2, words related to "smartphones" appear. In this example, the ten words with the highest occurrence probability are shown for each topic in descending order of probability.
This topic model is used to determine which topic a given utterance belongs to.

For example, suppose the utterance is "先週末は映画を観てきました" ("I went to see a movie last weekend").
Segmenting this utterance into words (wakachigaki) gives the segmented word sequence below. Keeping only the nouns in that sequence gives the noun word sequence.

Segmented word sequence = ['先週', '末', 'は', '映画', 'を', '観', 'て', 'き', 'まし', 'た']
Noun word sequence = ['先週', '映画'] (that is, "last week" and "movie")

Performing topic analysis on this noun word sequence with the LDA method yields the topic vector below. Here, a topic vector is defined as a probability distribution that expresses which topic a given utterance text is classified into.

Topic #1: 0.8, Topic #2: 0.2
Topic vector = [0.8, 0.2]
This means that the above utterance has a high probability for topic #1, so its subject can be judged to be closer to topic #1.
A topic model of this kind is generated by the content analysis server 300 and used in the processing in the database 400.
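The path from a segmented utterance to a topic vector can be illustrated with a small sketch. It is not LDA inference: it is a simplified naive-Bayes-style approximation that multiplies per-topic word probabilities under a uniform topic prior. The part-of-speech tags, the word probabilities, and the function names are assumptions chosen so that the result matches the [0.8, 0.2] example above; a real system would use a morphological analyzer and a trained LDA model.

```python
import math

# Hypothetical pre-tagged tokens (surface form, part of speech) for the example utterance.
TAGGED_TOKENS = [
    ("先週", "noun"), ("末", "suffix"), ("は", "particle"), ("映画", "noun"),
    ("を", "particle"), ("観", "verb"), ("て", "particle"), ("き", "verb"),
    ("まし", "auxiliary"), ("た", "auxiliary"),
]

# Hypothetical per-topic word probabilities (topic #1: movies, topic #2: smartphones).
TOPIC_WORD_PROBS = [
    {"映画": 0.08, "先週": 0.01},
    {"映画": 0.02, "先週": 0.01, "スマートフォン": 0.09},
]


def extract_nouns(tagged_tokens):
    """Keep only the tokens tagged as nouns."""
    return [surface for surface, pos in tagged_tokens if pos == "noun"]


def approximate_topic_vector(nouns, topic_word_probs, unseen_prob=1e-6):
    """Approximate the topic distribution of a noun sequence (not true LDA inference)."""
    # Sum log-probabilities per topic to avoid underflow on long inputs.
    log_scores = [
        sum(math.log(word_probs.get(word, unseen_prob)) for word in nouns)
        for word_probs in topic_word_probs
    ]
    peak = max(log_scores)
    weights = [math.exp(score - peak) for score in log_scores]
    total = sum(weights)
    return [weight / total for weight in weights]


nouns = extract_nouns(TAGGED_TOKENS)
topic_vector = approximate_topic_vector(nouns, TOPIC_WORD_PROBS)
```

With these assumed probabilities the result is approximately [0.8, 0.2], because "映画" is four times as likely under topic #1 as under topic #2 while "先週" is equally likely under both.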

Next, speech features are described.
The voice quality of a speaker appearing in the reference content CT2 and the voice quality of a speaker model held in the voice synthesis server 200 are compared using a feature called an "x-vector." The x-vector is also called a "deep speaker embedding" and is a kind of speaker identification technology based on deep learning. An x-vector is typically a 512-dimensional vector.

How the x-vector is used will now be described.
FIG. 8 shows the training process for the x-vector. The voices of multiple speakers, such as speaker A, speaker B, and speaker C, are input to a neural network. For each voice, voice features are extracted for each unit time interval from t=1 to t=T.

These features are then max-pooled. Max pooling is a process in which a fixed-size window is slid across the input data and the maximum value within each window is selected, which reduces the size of the input data.
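
A minimal sketch of max pooling as defined above is shown below, applied to a one-dimensional sequence of per-frame values. The function name `max_pool_1d`, the window size, the stride, and the sample values are all assumptions chosen for illustration; the pooling configuration of the actual network is not stated in this description.

```python
def max_pool_1d(values, window, stride):
    """Slide a fixed-size window over `values` and keep each window's maximum.

    Windows that would run past the end of the sequence are dropped.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    return [
        max(values[start:start + window])
        for start in range(0, len(values) - window + 1, stride)
    ]


# Hypothetical per-frame feature values for t = 1 .. 6.
frames = [0.1, 0.7, 0.3, 0.9, 0.2, 0.4]
pooled = max_pool_1d(frames, window=2, stride=2)
print(pooled)  # half as many values as the input
```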

The max-pooled features are classified into speaker classes by a classification unit, and the network is trained on the speakers.
A part of the classification unit obtained through this training can be used as the x-vector. For example, the x-vector obtained when the voice of speaker A is input can be regarded as a kind of voiceprint of speaker A.

This neural network can also be applied to speakers other than those used for training. Therefore, by inputting the voices of narrators and performers in the reference content CT2 into this neural network, the x-vectors of the speakers in each piece of content can be obtained. Similarly, by inputting the voices of the speakers held by the speech synthesis server 200 into this neural network, the x-vector of each of those speakers can be obtained.
Comparing the x-vectors of two speakers makes it possible to judge whether their voices are similar. Voice similarity can be measured using, for example, the cosine similarity between the vectors or matching based on PLDA (probabilistic linear discriminant analysis).

<3. Analysis processing of the content analysis server>
The processing of the content analysis server 300 in process ST1 of FIG. 6 will be described.
The processing of the content analysis server 300 consists of two creation phases.
The first is a phase in which a topic model is created. The second is a phase in which the topic to which each piece of reference content CT2 corresponds is determined and the TTS reference data RD is created.

The topic model creation phase will be described.
The content analysis server 300 collects multiple pieces of reference content CT2 from the content provider 500 and creates a topic model in advance from the utterance text in the reference content CT2 using the LDA method. The purpose is to classify the reference content CT2 provided by the content provider 500 into genres.

This will be described with reference to FIG. 4 and FIG. 9.
FIG. 9 is a flowchart of the processing performed by the processor serving as the content analysis server 300. The processes to be executed are shown in solid-line boxes and, for ease of understanding, the data input to or output from each process is shown in parentheses as text or reference signs. The storage unit 340 and the database 400, where the data is stored, are also shown. Step numbers are assigned to the processes performed by the processor serving as the content analysis server 300.
The same flowchart notation is used in FIGS. 10, 11, 12, 14, 15, and 16, described later.

The content analysis server 300 acquires the reference content CT2 from the content provider 500 by means of the content acquisition unit 310.
In step S101, the audio extraction unit 320 extracts audio data from the acquired reference content CT2.

In step S102, the speech recognition unit 330 performs speech recognition on the extracted audio data and converts it into utterance text TR, which is text data.
In step S103, the utterance text TR is stored in the storage unit 340.

The processes of steps S101, S102, and S103, enclosed by the dash-dot line, are repeated for multiple pieces of reference content CT2.
When steps S101, S102, and S103 have been performed on a predetermined number (M) of pieces of reference content CT2 and M utterance texts TR have been stored in the storage unit 340, the topic analysis unit 350 performs topic analysis in step S110 using the M stored utterance texts TR. The result is the topic model TM.

In step S111, the generated topic model TM is transmitted to the database 400 via the network communication unit 370.

Next, the TTS reference data RD creation phase will be described.
The purpose of this phase is to create TTS reference data RD that indicates, for a given piece of reference content CT2, what topic vector it has, what characteristics the voice used in it has, and at which URL (Uniform Resource Locator) it is located.

This will be described with reference to FIG. 4 and FIG. 10.
In the content analysis server 300, in step S121, the audio extraction unit 320 extracts audio data from the reference content CT2 acquired by the content acquisition unit 310.

In step S123, the speech recognition unit 330 performs speech recognition on the extracted audio data to obtain the utterance text TR.
In step S124, the topic analysis unit 350 performs topic analysis on the utterance text TR. This yields the topic vector RDT, which expresses as a vector the probabilities that the utterance text TR belongs to each topic.
This topic analysis uses the topic model TM obtained in the topic model creation phase of FIG. 9.

The audio data extracted in step S121 is also used in step S125, in which the speech feature acquisition unit 360 obtains the speech feature RDX.

In step S126, the content analysis server 300 combines the topic vector RDT obtained as described above, the speech feature RDX, and the URL (RDU) of this reference content CT2 into one piece of TTS reference data RD and transmits it to the database 400. As a result, the TTS reference data RD is additionally stored in the database 400.
In this description, the utterance text TR is extracted again from the reference content CT2 in step S123, but the utterance text TR generated in step S102 of the topic model creation phase of FIG. 9 may instead be cached and reused.
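
The structure of one piece of TTS reference data RD, as assembled in step S126, can be sketched as below. The class name, field names, validation rule, and the example URL are assumptions for illustration; the description specifies only that RD combines the topic vector RDT, the speech feature RDX, and the URL (RDU).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TtsReferenceData:
    """One TTS reference data record RD for a piece of reference content."""
    topic_vector: List[float]    # RDT: probability of each topic
    speech_feature: List[float]  # RDX: for example, an x-vector
    url: str                     # RDU: location of the reference content


def make_reference_data(topic_vector, speech_feature, url, tolerance=1e-6):
    """Bundle the three parts, checking that RDT is a probability distribution."""
    if any(p < 0 for p in topic_vector) or abs(sum(topic_vector) - 1.0) > tolerance:
        raise ValueError("topic vector must be non-negative and sum to 1")
    return TtsReferenceData(list(topic_vector), list(speech_feature), url)


# Hypothetical values; a real RDX would typically have 512 dimensions.
rd = make_reference_data(
    [0.8, 0.2], [0.12, -0.40, 0.33], "https://example.com/reference-content"
)
print(rd.url)
```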

<4. Processing of the speech synthesis server>
Next, the processing of the speech synthesis server 200, in particular how the TTS reference data RD is used during speech synthesis, will be described with reference to FIG. 3 and FIG. 11.

The user 10 enters the text data TD to be converted into synthesized speech using the content production application 110 running on the information terminal 100. The text data TD is ordinary natural-language text and does not need to be specially prepared for the TTS speaker suggestion system 1. The user 10 therefore does not need to learn any special notation in order to use the TTS speaker suggestion system 1.

The speech synthesis server 200 performs the processing of FIG. 11 upon receiving the text data TD.
In step S201, the speech synthesis server 200 that has received the text data TD uses the text-to-phoneme symbol conversion unit 210 to convert the natural-language text into phoneme data for speech synthesis.

In step S202, the speech synthesis server 200 acquires the TTS reference data RD by means of the reference data acquisition unit 220. Specifically, the speech synthesis server 200 transmits the text data TD to the database 400 and requests a search for TTS reference data RD. In response, the database 400 takes the text data TD as input, selects from the TTS reference data RD it stores the one whose topic vector RDT is closest to the topic vector of the text data TD, and transmits it to the speech synthesis server 200. This processing of the database 400 is described later.
The speech synthesis server 200 receives the TTS reference data RD selected in this way by the database 400, that is, the relevant reference data for the text data TD.

Having received the TTS reference data RD as the relevant reference data from the database 400, the speech synthesis server 200 performs a speaker search in step S203 by means of the speaker search unit 230.
Here, the speech synthesis server 200 uses the speech feature RDX contained in the TTS reference data RD to find, among the speaker models held in the held speaker data unit 240, a speaker with a similar voice quality.
Specifically, a speaker with a similar voice quality is found by calculating the cosine similarity between the speech feature RDX sent from the database 400 and the speech feature of each held speaker model. This process makes it possible to derive, from among the speaker models held by the speech synthesis server 200, the speaker ID of the speaker data most suitable for the topic.
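
The speaker search of step S203 can be sketched as a nearest-neighbor lookup by cosine similarity. The function names, the speaker IDs, and the three-dimensional feature values below are hypothetical; how the held speaker data unit 240 actually stores its features is not specified here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_most_similar_speaker(rdx, held_speakers):
    """Return the ID of the held speaker whose feature is closest to `rdx`.

    held_speakers: dict mapping speaker ID to that speaker's feature vector.
    """
    return max(
        held_speakers,
        key=lambda speaker_id: cosine_similarity(rdx, held_speakers[speaker_id]),
    )


# Hypothetical held speaker models (real x-vectors are much longer).
held_speakers = {
    "speaker_a": [0.9, 0.1, 0.0],
    "speaker_b": [0.0, 1.0, 0.2],
    "speaker_c": [-0.5, 0.3, 0.8],
}
rdx = [1.0, 0.2, 0.1]  # speech feature RDX received from the database
best_speaker_id = find_most_similar_speaker(rdx, held_speakers)
print(best_speaker_id)
```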

The speech synthesis server 200 obtains the synthesized speech data AD by inputting the phoneme data obtained in step S201 and the speaker ID obtained in step S203 to the speech synthesis unit 250.
Then, in step S205, the speech synthesis server 200 transmits the synthesized speech data AD, the speaker ID, and the reference URL to the information terminal 100 via the network communication unit 260. The reference URL is the URL (RDU) of the reference content CT2 contained in the TTS reference data RD acquired in step S202.

<5. Processing of the database>
The processing of the database 400 will be described with reference to FIG. 5 and FIG. 12. This processing is performed in response to the search request from the speech synthesis server 200 in step S202 of FIG. 11 described above.

When the database 400 receives the text data TD from the speech synthesis server 200, the topic analysis unit 420 performs topic analysis in step S211 of FIG. 12.
In this topic analysis, the topic model TM stored in the storage unit 410 is used to analyze the text data TD and generate the topic vector TV.

Next, in step S212, the database 400 performs a topic search by means of the topic similarity analysis unit 430. This is a process of searching for a topic vector RDT similar to the topic vector TV. Specifically, among the TTS reference data RD (RD-1 ... RD-N) generated from various pieces of reference content CT2 and stored in the storage unit 410, it searches for the ones whose topic vector RDT is similar to the topic vector TV of the text data TD.

Similar topic vectors mean that the content belongs to the same genre or deals with similar subjects.
In other words, searching for TTS reference data RD with a similar topic vector amounts to searching for TTS reference data RD generated from reference content CT2 whose genre or subject is similar to that of the content CT1 being produced by the user 10.

Cosine similarity can be used to search for a topic vector RDT similar to the topic vector TV. Among the multiple pieces of TTS reference data RD, the one whose topic vector RDT has the highest similarity to the topic vector TV is selected as the optimal topic.
The database 400 then transmits the TTS reference data RD obtained as the optimal topic to the speech synthesis server 200 as the relevant reference data for the current text data TD.

A threshold may also be set for the similarity so that, if the similarity does not exceed the threshold, multiple pieces of TTS reference data RD are transmitted to the speech synthesis server 200.
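
The topic search of step S212, together with the optional threshold behavior, can be sketched as follows. The record layout, the key names `id` and `rdt`, the threshold value of 0.8, and the fallback count of 3 are all assumptions for illustration; the description does not give concrete values.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def search_reference_data(tv, records, threshold=0.8, fallback_count=3):
    """Return the best-matching record, or several if the best match is weak.

    tv: topic vector TV of the input text data TD.
    records: list of dicts with hypothetical keys "id" and "rdt".
    threshold, fallback_count: illustrative values, not taken from the source.
    """
    if not records:
        return []
    ranked = sorted(
        records, key=lambda r: cosine_similarity(tv, r["rdt"]), reverse=True
    )
    best_score = cosine_similarity(tv, ranked[0]["rdt"])
    if best_score > threshold:
        return ranked[:1]           # confident: a single optimal topic
    return ranked[:fallback_count]  # not confident: return several candidates


records = [
    {"id": "RD-1", "rdt": [0.7, 0.3]},
    {"id": "RD-2", "rdt": [0.1, 0.9]},
    {"id": "RD-3", "rdt": [0.5, 0.5]},
]
confident = search_reference_data([0.8, 0.2], records)
uncertain = search_reference_data([0.8, 0.2], records, threshold=0.999)
print([r["id"] for r in confident])
print([r["id"] for r in uncertain])
```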

<6. Display on the information terminal>
An example of what is displayed to the user 10 on the information terminal 100 as a result of the above processing will now be described.

The data that the information terminal 100 receives from the speech synthesis server 200 is the synthesized speech data AD and the URL of the reference content CT2.
The URL of the reference content CT2 is provided to show the user 10 which content's speaker voice was used as a reference when generating the synthesized speech.

FIG. 13 shows a display example on the display 120 of the information terminal 100. A text box 31, a speaker ID 32, a synthesis start button 33, a play button 34, and a reference URL 35 are displayed on the display 120.

The text box 31 is a box for entering the text data TD.
The speaker ID 32 is the speaker ID selected by the speech synthesis server 200 in step S203 of FIG. 11.
The synthesis start button 33 is a control for instructing the start of the speech synthesis processing.
The play button 34 is a control for playing back the synthesized speech.
The reference URL 35 is the URL of the reference content CT2 whose voice was used as a reference, and is displayed, for example, as a link to the reference content CT2.

On this screen, the user 10 can operate the play button 34 to listen to the voice of the speaker with the speaker ID 32 proposed by the speech synthesis server 200.
By operating the reference URL 35, the user 10 can also play back the reference content CT2 that the speech synthesis server 200 referred to when selecting the speaker ID 32 and listen to its narration or other voices.
The user 10 can therefore not only hear the text read aloud by the speaker ID proposed by the speech synthesis server 200, but also hear the voice in the reference content CT2, similar in genre and the like to the content CT1 being produced, that was consulted in order to select that speaker ID.

<7. Handling reference content with multiple speakers>
The description so far has assumed that only one speaker appears in each piece of reference content CT2. In actual reference content CT2, however, multiple speakers usually appear. In television programs, for example, a different announcer is often in charge of each segment, such as on-site reports, weather forecasts, traffic information, and sports news.

The x-vector was described above as a speech feature that represents a voice. Using x-vectors, it is possible to detect that the speaker has changed within the reference content CT2. This technique is called "speaker diarization," and it allows the present technology to handle cases where a single piece of content contains multiple speakers.

This will be described with reference to FIG. 14.
In step S131, the content analysis server 300 extracts audio from the reference content CT2 by means of the audio extraction unit 320 and obtains audio data.
Next, in step S132, the speech feature acquisition unit 360 extracts the speech feature RDX for each unit time, for example every 30 seconds. In step S133, feature change detection is performed on these per-unit-time speech features RDX to monitor for changes equal to or greater than a threshold.

Cosine similarity can be used to detect changes in the feature. A change equal to or greater than the threshold means that the speaker has changed, so a timestamp is recorded and stored in the timestamp database 341.
The timestamp database 341 is provided, for example, using part of the storage unit 340.
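
The feature change detection of step S133 can be sketched as below. As an assumption for illustration, the "change" between consecutive windows is measured as 1 minus their cosine similarity, and the change threshold of 0.5 and the two-dimensional feature values are hypothetical; only the 30-second unit time comes from the description.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def detect_speaker_changes(window_features, window_seconds=30.0, change_threshold=0.5):
    """Return the start times (in seconds) of windows where the speaker changed.

    window_features: one speech feature vector per unit-time window, in order.
    A change is flagged when 1 - cosine similarity to the previous window is
    at or above `change_threshold` (an assumed measure of change).
    """
    timestamps = []
    for index in range(1, len(window_features)):
        similarity = cosine_similarity(window_features[index - 1], window_features[index])
        if 1.0 - similarity >= change_threshold:
            timestamps.append(index * window_seconds)
    return timestamps


# Hypothetical features: the voice changes between the 2nd and 3rd window.
window_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
change_points = detect_speaker_changes(window_features)
print(change_points)
```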

In this manner, for example, the content analysis server 300 monitors changes in the speech feature RDX of the reference content CT2 and stores the change points.
This makes it possible to handle reference content CT2 in which multiple speakers appear, using a configuration equivalent to the system described for the TTS reference data creation phase of FIG. 10.

For example, FIG. 15 shows the processing of a TTS reference data creation phase similar to that of FIG. 10. Processes identical to those in FIG. 10 are given the same step numbers, and their description is omitted.

In the case of FIG. 15, in step S121A, the timestamp database 341 is used to extract audio from the reference content CT2 separately for each section delimited by the timestamps.
The extracted audio data is then processed in the same way as in FIG. 10 to generate the TTS reference data RD.
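
How the stored timestamps could delimit the sections used in step S121A is sketched below. The function name and the representation of a section as a (start, end) pair in seconds are assumptions; the description does not specify the data format of the timestamp database 341.

```python
def sections_from_timestamps(change_points, total_duration):
    """Split [0, total_duration] into sections at the recorded change points.

    Returns a list of (start, end) pairs in seconds; empty sections are skipped.
    """
    boundaries = [0.0] + sorted(change_points) + [total_duration]
    return [
        (start, end)
        for start, end in zip(boundaries, boundaries[1:])
        if end > start
    ]


# Hypothetical change points for a 200-second piece of reference content.
sections = sections_from_timestamps([60.0, 150.0], 200.0)
print(sections)
```

Each section would then be treated as if it were a single-speaker piece of content when generating its own TTS reference data RD.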

<8. Proposing multiple speaker candidates>
The description so far has assumed that a single speaker best suited to the content CT1 is proposed based on the words spoken in the content CT1 (the text data TD).
Some users 10, however, may want to try other speakers as alternatives. With such users 10 in mind, an example of processing that offers multiple speakers will be described with reference to FIG. 16. In FIG. 16, processes identical to those in FIG. 12 are given the same step numbers to avoid duplicate description.

In FIG. 12 described earlier, the topic similarity analysis unit 430 retrieved in step S212 only the single piece of TTS reference data RD with the most similar topic vector.
In step S212A of FIG. 16, in order to increase the variety of voices proposed, the topic similarity analysis unit 430 selects multiple pieces of TTS reference data RD in descending order of cosine similarity. The figure assumes that ten are selected and shows ten pieces of TTS reference data RD (RD#1 to RD#10).

The TTS reference data RD#1, which has the topic vector RDT with the highest similarity, is treated as the "optimal speaker" data, and the rest are designated reference data #2, ..., reference data #10 in descending order of similarity.

Cosine similarity will now be explained again.
The earlier description of the speech feature stated that the x-vector can be used as a speaker's voiceprint, and cosine similarity was used to find similar voices. The description of FIG. 11 stated that the speech synthesis server 200 obtains TTS reference data RD based on the text data TD transmitted from the information terminal 100. The TTS reference data RD contains the speech feature RDX, which is an x-vector.
Since this speech feature RDX is a vector, the similarity in voice quality can be calculated by computing its cosine similarity with another speech feature.

For example, the cosine similarity between two vectors a and b can be expressed as (Equation 1) and takes values in the range from -1 to 1.
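
(Equation 1) is not reproduced in this text. The standard definition of cosine similarity, to which it presumably corresponds, is given below, where d is the dimension of the vectors:

```latex
\cos(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{i=1}^{d} a_i b_i}{\sqrt{\sum_{i=1}^{d} a_i^{2}} \, \sqrt{\sum_{i=1}^{d} b_i^{2}}}
```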

When the cosine similarity is 1, the angle between the vectors is 0 degrees and they point in the same direction. In other words, the voice qualities are completely similar.
When the cosine similarity is 0, the angle between the vectors is 90 degrees and they are orthogonal. In other words, the voice qualities are unrelated: neither similar nor dissimilar.
When the cosine similarity is -1, the angle between the vectors is 180 degrees and they point in opposite directions. This corresponds to completely dissimilar voice qualities.

In the processing of FIG. 16, the database 400 performs the similarity evaluation of step S220 using the multiple (for example, ten) pieces of TTS reference data RD (RD#1 to RD#10).

This similarity evaluation takes the speech feature RDX of the optimal speaker's TTS reference data RD#1 as the "reference feature" and calculates its cosine similarity with each speech feature RDX contained in TTS reference data RD#2 to RD#10.

Here, TTS reference data RD whose speech feature RDX has a cosine similarity close to 0 with the reference feature is information on a speaker whose characteristics are neither similar nor dissimilar to those of the optimal speaker. Such TTS reference data RD#x is called the "orthogonal speaker."

TTS reference data RD whose speech feature RDX has a cosine similarity close to -1 with the reference feature is information on a speaker whose characteristics are dissimilar to those of the optimal speaker. Such TTS reference data RD#y is called the "opposite speaker."

The optimal speaker's TTS reference data RD#1, the orthogonal speaker's TTS reference data RD#x, and the opposite speaker's TTS reference data RD#y are combined into one data group and transmitted to the speech synthesis server 200.
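
The selection of the optimal, orthogonal, and opposite speakers in step S220 can be sketched as follows. The function name, the tuple layout of the candidates, and the two-dimensional feature values are assumptions for illustration; "closest to 0" and "closest to -1" are implemented as the smallest absolute similarity and the smallest similarity, respectively.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def pick_variation_speakers(candidates):
    """Pick the optimal, orthogonal, and opposite speakers.

    candidates: list of (name, speech_feature) pairs ordered by topic
    similarity, so that the first entry is the optimal speaker (RD#1).
    """
    if len(candidates) < 2:
        raise ValueError("need the optimal speaker and at least one other candidate")
    optimal_name, reference_feature = candidates[0]
    scored = [
        (name, cosine_similarity(reference_feature, feature))
        for name, feature in candidates[1:]
    ]
    orthogonal_name, _ = min(scored, key=lambda pair: abs(pair[1]))  # closest to 0
    opposite_name, _ = min(scored, key=lambda pair: pair[1])         # closest to -1
    return {"optimal": optimal_name, "orthogonal": orthogonal_name, "opposite": opposite_name}


# Hypothetical speech features RDX for four pieces of reference data.
candidates = [
    ("RD#1", [1.0, 0.0]),
    ("RD#2", [0.9, 0.1]),
    ("RD#3", [0.05, 1.0]),
    ("RD#4", [-1.0, 0.1]),
]
selection = pick_variation_speakers(candidates)
print(selection)
```

With very few candidates, the orthogonal and opposite speakers can turn out to be the same piece of reference data; a fuller implementation would need to decide how to handle that case.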

In step S203 of FIG. 11, the speech synthesis server 200 searches the speaker data held in the held speaker data unit 240 for speaker data similar to the TTS reference data RD. In this case, it searches for similar speaker data for each of the TTS reference data RD#1, RD#x, and RD#y.
Accordingly, a speaker ID similar to the optimal speaker, a speaker ID similar to the orthogonal speaker, and a speaker ID similar to the opposite speaker are obtained.
As a result, three voices with different voice qualities are proposed to the user 10.

In the above example, two speech features RDX with cosine similarities close to 0 and -1 were obtained, but it is also possible to obtain multiple speech features with cosine similarities anywhere from -1 to 1.
Proposing to the user 10 reference data that is orthogonal or opposite in terms of cosine similarity, chosen from among the reference data ranked highest in topic vector similarity, means searching for reference content CT2 that deals with a similar topic but has speakers of different voice qualities, and proposing to the user 10 voice qualities similar to those.

It is also conceivable to vary the number of speakers proposed to the user 10 according to the variance of the speech features.
The following (Equation 2) arranges, as a matrix, the cosine similarities between the reference feature (the speech feature RDX of the optimal speaker) and the speech features RDX of the other TTS reference data RD.

Here, s_{1,i} denotes the cosine similarity between the reference feature and the i-th feature.
The variance σ² of this matrix is given by (Equation 3), where μ is the mean.

Calculating the variance gives an indication of whether the proposed speakers converge on a single speech feature or are spread across multiple speakers.
Here, n is the number of pieces of TTS reference data RD whose similarity is evaluated. In the case of (Equation 2), n = 9: when ten pieces of TTS reference data RD (RD#1 to RD#10) are initially selected, the similarity of RD#2 to RD#10 is evaluated against RD#1, so n = 9.
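
The variance calculation can be sketched as below. The nine similarity values are invented for illustration, and the use of the population variance (dividing by n rather than n - 1) is an assumption, since (Equation 3) is not reproduced in this text.

```python
import statistics

# Hypothetical cosine similarities s_{1,2} .. s_{1,10} between the reference
# feature of RD#1 and the speech features of RD#2 .. RD#10.
similarities = [0.95, 0.90, 0.10, -0.80, 0.85, 0.05, 0.92, -0.75, 0.88]

n = len(similarities)                              # n = 9 for ten selected RD
mu = statistics.fmean(similarities)                # mean of the similarities
variance = statistics.pvariance(similarities, mu)  # assumed: divide by n

print(n, mu, variance)
```

A variance near zero would indicate that the candidate speakers all sound alike relative to the optimal speaker, while a larger value indicates a wider spread of voice qualities.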

For example, if the variance is close to zero, the voice qualities of the speakers on that topic are broadly similar. In that case, increasing the number of reference data whose similarity is evaluated beyond the top 10 by topic vector similarity, for example to the top 20, makes it easier to obtain the desired number of speakers.
Conversely, if the variance is sufficiently large, not many pieces of TTS reference data RD need to be evaluated.
This relationship is expressed as a formula in (Equation 4), which has the variance in the denominator. Here, y is the number of pieces of TTS reference data RD to be evaluated.

　図１７は（数４）をグラフ化したものである。分散の値（Ｖ）に応じて、評価する参照データの数が変わることが分る。分散がゼロに近いときには、３０個のＴＴＳ参照データＲＤを用いて、声質のバリエーションを調べる。一方、分散が大きい場合には、１０個以内のＴＴＳ参照データＲＤの中からバリエーションに富んだ声質を得ることができる。 Figure 17 is a graph of (Equation 4). It can be seen that the number of reference data to be evaluated changes depending on the variance value (V). When the variance is close to zero, 30 TTS reference data RD are used to examine the variation in voice quality. On the other hand, when the variance is large, a wide variety of voice qualities can be obtained from within 10 TTS reference data RD.

　以上のようにすることで、似たトピックでありながら話者の声質のバリエーションをもって複数の話者をユーザ１０に提案できるが、その場合、情報端末１００のディスプレイ１２０では例えば図１８のような表示を行うことが考えられる。 By doing the above, multiple speakers with variations in the voice quality of the speakers can be suggested to the user 10 while covering similar topics. In this case, it is conceivable that the display 120 of the information terminal 100 will display, for example, the image shown in FIG. 18.

"Soccer" is used as an example topic. As in FIG. 13, a text box 31, a speaker ID 32, a synthesis start button 33, a play button 34, and a reference URL 35 are displayed on the display 120. Here, however, the play button 34, the speaker ID 32, and the reference URL 35 are presented as a synthetic speech list 36. In other words, multiple speakers who are candidates for use in the content CT1 are displayed as a list.

In the synthetic speech list 36, the speakers are listed from the top in descending order of cosine similarity of the topic vectors. As described above, these speakers have different voice qualities and include an optimal speaker, an orthogonal speaker, and an opposite speaker.
For each speaker, a link to the reference content CT2 that was referenced when selecting the speaker ID is displayed as the reference URL 35. This reference content CT2 has the topic "soccer", which matches the topic of the text entered in the text box 31, and features speakers with varied voice qualities.

In the example of FIG. 18, five speakers are presented, and the user 10 can play back each speaker's voice with the play button 34. By operating the reference URL 35, the user 10 can also check the voice in the reference content CT2 that was referenced to select that speaker ID.

Here, consider the advantage of proposing multiple speaker candidates for a single piece of content CT1.
For example, assume that the present technology is applied to video content such as online lessons for elementary school students.

When there are subjects such as Japanese, arithmetic, science, and social studies, performing topic analysis on each subject separately, rather than on the text data TD of all four subjects together, yields a voice that suits the teacher of each subject.
When only the topic vector of the text data TD is considered, the voices of, for example, the Japanese and social studies teachers may turn out similar. The user 10 may want a different speaker for each subject to avoid monotony.

In such a case, a topic search is performed using the topic vectors TV for Japanese and social studies, and an orthogonal speaker and an opposite speaker are selected from the reference data with the highest cosine similarity, which avoids duplication of similar voices.

It is also conceivable to automatically divide a long text into multiple topics and change the speaker for each.
For example, even within video content for Japanese language lessons, the units cover various topics such as "novels," "criticism," and "poetry." "Poetry," for instance, should be read with rich emotion.
In such a case, performing topic analysis paragraph by paragraph makes it possible to propose a speaker for each coherent unit of text. Paragraphs can be detected by detecting indents and blank lines.
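
The paragraph detection mentioned above can be sketched as follows. This is a minimal illustration that assumes a new paragraph starts after a blank line or at a line whose first character is a space, a tab, or a full-width space; the function name `split_paragraphs` is hypothetical, and the per-paragraph topic analysis and speaker proposal are outside the sketch.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split text into paragraphs at blank lines or indented first lines."""
    paragraphs: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the paragraph collected so far.
            if current:
                paragraphs.append("\n".join(current))
                current = []
            continue
        # An indent (space, tab, or full-width space) opens a new paragraph.
        if current and re.match(r"[ \t\u3000]", line):
            paragraphs.append("\n".join(current))
            current = []
        current.append(line.strip())
    if current:
        paragraphs.append("\n".join(current))
    return paragraphs
```

Each returned paragraph could then be passed separately to topic analysis, so that a speaker is proposed per paragraph rather than for the whole text.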

9. Configuration of information processing device
An example of the configuration of an information processing device 70 that can be used as the speech synthesis server 200, the content analysis server 300, the database 400, and the information terminal 100 in the TTS speaker suggestion system 1 described above will be described with reference to FIG. 19.
The information processing device 70 can be configured as, for example, a dedicated workstation, a general-purpose personal computer, or a mobile terminal device.

The CPU 71 of the information processing device 70 shown in FIG. 19 executes various processes in accordance with programs stored in a ROM 72 or a non-volatile memory unit 74 such as an EEP-ROM (Electrically Erasable Programmable Read-Only Memory), or programs loaded from a storage unit 79 into a RAM 73. The RAM 73 also stores, as appropriate, data that the CPU 71 needs to execute these processes.
In the CPU 71, the programs implement the various control and computation functions shown in FIGS. 2, 3, 4, and 5.

In addition to the CPU 71, the device may include other processors such as a GPU (Graphics Processing Unit), a GPGPU (General-purpose computing on graphics processing units), or an AI (artificial intelligence) processor.

The CPU 71, the ROM 72, the RAM 73, and the non-volatile memory unit 74 are interconnected via a bus 83. An input/output interface 75 is also connected to the bus 83.

An input unit 76 consisting of operators and operating devices is connected to the input/output interface 75. Examples of the input unit 76 include various operators and operating devices such as a keyboard, a mouse, keys, a dial, a touch panel, a touch pad, and a remote controller.
The input unit 76 detects an operation by the user 10, and the CPU 71 interprets a signal corresponding to the input operation.

A display unit 77, such as an LCD (Liquid Crystal Display) or an organic EL (Electro-Luminescence) panel, and an audio output unit 78, such as a speaker, are also connected to the input/output interface 75, either integrally or as separate units.

The display unit 77 performs various displays as a user interface. The display unit 77 is, for example, a display device provided in the housing of the information processing device 70 or a separate display device connected to the information processing device 70.
The display unit 77 displays various images on its screen based on instructions from the CPU 71. It also displays various operation menus, icons, messages, and the like, that is, a GUI (Graphical User Interface), based on instructions from the CPU 71.

A storage unit 79 composed of an SSD (Solid State Drive), an HDD (Hard Disk Drive), or the like, and a communication unit 80 composed of a modem or the like, may also be connected to the input/output interface 75.
The storage unit 79 can be used to store various data, and a database can be built in the storage unit 79.
For example, the held speaker data unit 240 of the speech synthesis server 200, the storage unit 340 of the content analysis server 300, and the storage unit 410 of the database 400 can be configured using the storage unit 79.

The communication unit 80 performs communication processing via the network 600.
For example, the network communication unit 150 of the information terminal 100, the network communication unit 260 of the speech synthesis server 200, the network communication unit 370 of the content analysis server 300, and the network communication unit 440 of the database 400 can be configured using the communication unit 80.

A drive 82 is also connected to the input/output interface 75 as needed, and a removable recording medium 81 such as a flash memory, a memory card, a magnetic disk, an optical disk, or a magneto-optical disk is mounted as appropriate.
The drive 82 can read data files such as image files, as well as various computer programs, from the removable recording medium 81. A read data file is stored in the storage unit 79, or the images and audio it contains are output by the display unit 77 and the audio output unit 78. Computer programs and the like read from the removable recording medium 81 are installed in the storage unit 79 as needed.

In the information processing device 70, software can be installed through network communication by the communication unit 80 or from the removable recording medium 81. Alternatively, the software may be stored in advance in the ROM 72, the storage unit 79, or the like.

The information terminal 100, the speech synthesis server 200, the content analysis server 300, and the database 400 can each be configured with such an information processing device 70. The configuration of FIG. 2 as the information terminal 100, the configuration of FIG. 3 as the speech synthesis server 200, the configuration of FIG. 4 as the content analysis server 300, and the configuration of FIG. 5 as the database 400 can be realized by the hardware configuration of the information processing device 70 in FIG. 19 and the software installed on it.

10. Summary and Modifications
The embodiments described above provide the following effects.

The speech synthesis server 200 of the embodiment includes a speaker search unit 230 that selects speaker data from among a plurality of speaker data by performing a search based on the topic of the text data TD, and a speech synthesis unit 250 that generates synthetic speech of the text data TD using the speaker data selected by the speaker search unit 230 (see FIG. 3).
That is, speaker data with a voice quality suited to the topic (the subject matter or its genre) of the text data TD is selected, and the read-aloud speech for the text data TD is synthesized with that speaker data. As a result, instead of indiscriminately offering synthetic speech data AD in all kinds of voice qualities, the server can provide the user 10 with synthetic speech data AD generated from speaker data whose voice quality matches the topic of the text data TD of the content CT1.

In the speech synthesis server 200 of the embodiment, the speaker search unit 230 acquires TTS reference data RD selected based on the topic of the text data TD, and selects speaker data from among the plurality of speaker data based on the similarity between the speech features indicated by the TTS reference data RD and the speech features of each speaker data (see FIG. 11).
Acquiring the TTS reference data RD provides the speech features RDX that are generally perceived as matching the topic of the text data TD. The speech synthesis server 200 can therefore select, from the speaker data held in the held speaker data unit 240, speaker data appropriate for the topic of the text data TD being processed.
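
The selection step described above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: it assumes that speech features are plain numeric vectors and that cosine similarity is the similarity measure, and the names `cosine_similarity`, `select_speaker`, and the speaker IDs are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


def select_speaker(rdx: list[float], speakers: dict[str, list[float]]) -> str:
    """Return the ID of the held speaker whose features are closest to rdx.

    rdx:      speech features RDX taken from the TTS reference data RD
    speakers: speaker ID -> speech features of the held speaker data
    """
    if not speakers:
        raise ValueError("no speaker data held")
    return max(speakers, key=lambda sid: cosine_similarity(rdx, speakers[sid]))
```

For instance, given held speakers with features `[1.0, 0.0]`, `[0.6, 0.8]`, and `[-1.0, 0.0]`, a reference feature of `[0.5, 0.9]` selects the second speaker, whose feature vector points in nearly the same direction.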

In the speech synthesis server 200 of the embodiment, the reference data acquisition unit 220 transmits the text data TD to the external database 400 and receives the TTS reference data RD from the database 400 (see FIGS. 6 and 11).
Because the TTS reference data RD is acquired from the database 400, the speech synthesis server 200 does not need to store a large number of TTS reference data RD. The speech synthesis server 200 can select appropriate speaker data from the speaker data it holds based on the speech features RDX of the TTS reference data RD corresponding to the text data TD being processed. In other words, it can select appropriate speaker data from its stored speaker data without storing, adding, or managing the TTS reference data RD itself.

In the embodiment, the TTS reference data RD includes a topic vector RDT that serves as an index for classifying the topic of the reference content CT2 (an index for judging similarity or dissimilarity) (see FIG. 5).
This allows the database 400 to compare the topic vector RDT of each TTS reference data RD with the topic vector TV of the text data TD received from the speech synthesis server 200, and to select the TTS reference data RD that suits the topic of the text data TD being processed.
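
The comparison of topic vectors can be sketched as a ranked search. The sketch below assumes that each stored record is a pair of a record ID and its topic vector RDT, and that the top candidates by cosine similarity are returned; the function names and the record IDs in the usage note are hypothetical.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_reference_data(tv: Sequence[float],
                          records: Sequence[tuple[str, Sequence[float]]],
                          top_k: int = 10) -> list[tuple[str, float]]:
    """Rank records by similarity of their RDT to tv and keep the top_k.

    tv:      topic vector TV of the text data TD
    records: (record ID, topic vector RDT) pairs held by the database
    """
    scored = [(rid, cosine(tv, rdt)) for rid, rdt in records]
    # Highest topic similarity first.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
```

Calling `search_reference_data(tv, records, top_k=10)` would correspond to selecting the ten most topic-similar TTS reference data as the initial candidates.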

The TTS reference data RD also includes speech features obtained by feature extraction from the audio data of the reference content CT2 (see FIG. 5).
Because the TTS reference data RD includes the speech features RDX, the speech synthesis server 200 can select speaker data with a voice quality similar to the speech features RDX, and this speaker data has a voice quality that matches the topic of the text data TD.

The TTS reference data RD also includes information indicating the reference content CT2 used to create it (see FIG. 5). For example, it includes the URL (RDU) of the content as the information indicating the reference content CT2.
This enables a user interface that lets the user 10 view the reference content CT2, as in FIGS. 13 and 18. By knowing which reference content CT2 the speech synthesis server 200 based its choice of synthetic speech on, the user 10 can use that content as a reference for content production.
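
The three items that the TTS reference data RD is described as containing can be pictured as a simple record. The sketch below is only an assumed layout: the class name, field names, and the `to_list_entry` helper are hypothetical, and the example URL in the usage note is a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TTSReferenceData:
    """One TTS reference data RD entry (hypothetical field layout)."""
    rdt: tuple[float, ...]  # topic vector RDT of the reference content CT2
    rdx: tuple[float, ...]  # speech features RDX extracted from CT2's audio
    rdu: str                # URL (RDU) identifying the reference content CT2


def to_list_entry(speaker_id: str, rd: TTSReferenceData) -> dict[str, str]:
    """Build one row of a speaker list: the speaker ID plus its reference URL."""
    return {"speaker_id": speaker_id, "reference_url": rd.rdu}
```

For example, `to_list_entry("speaker-01", TTSReferenceData((0.7, 0.3), (0.1, 0.2), "https://example.com/ct2"))` yields a row that a terminal could render as a speaker ID 32 with its reference URL 35.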

The speech synthesis server 200 of the embodiment includes a network communication unit 260 that transmits information on the speaker data selected by the speaker search unit 230 and the synthetic speech generated by the speech synthesis unit 250 to the information terminal 100 that sent the text data (see FIG. 3).
As described with reference to FIG. 11, the speech synthesis server 200 uses the network communication unit 260 to transmit the speaker ID and the synthetic speech data AD to the information terminal 100. This enables a service that proposes synthetic speech suited to the content CT1 to the user 10 on a display screen such as that in FIG. 13 or FIG. 18.

In the speech synthesis server 200 of the embodiment, the network communication unit 260 transmits the information about the reference content CT2 contained in the TTS reference data RD to the information terminal 100 that sent the text data TD (see FIG. 11).
This allows the reference URL 35 to be displayed on a display screen such as that in FIG. 13 or FIG. 18, giving the user 10 a path to viewing the reference content CT2.

In the embodiment, an example was given in which the speaker search unit 230 of the speech synthesis server 200 selects speaker data for each of a plurality of TTS reference data RD selected based on the topic of the text data TD, based on the similarity between the speech features indicated by that TTS reference data RD and the speech features of the stored speaker data.
By acquiring a plurality of TTS reference data RD (for example, the TTS reference data RD#1, RD#x, and RD#y in FIG. 16), the speech synthesis server 200 obtains multiple sets of speech features RDX that match the topic of the text data TD. By selecting speaker data from the held speaker data unit 240 based on each of the TTS reference data RD#1, RD#x, and RD#y, the speech synthesis server 200 can select a plurality of speaker data that match the topic of the content CT1 and propose each voice quality to the user 10.

In the embodiment, an example was given in which the network communication unit 260 transmits the information on the speaker data selected based on the plurality of TTS reference data RD as information to be displayed as a list on the information terminal 100.
For example, the speaker IDs 32 and related items are listed as the synthetic speech list 36 in FIG. 18. This lets the user 10 listen to speech in multiple voice qualities, all of which were chosen as candidates on the condition that they match the topic of the content CT1.

These plural TTS reference data RD are reference data further selected from among a plurality of TTS reference data RD that were selected in descending order of similarity to the topic of the text data TD. As an example, they include first reference data, whose source reference content CT2 has the topic with the highest similarity to the topic of the text data TD, and one or more second reference data selected based on similarity evaluation against the first reference data.
For example, a plurality of TTS reference data RD (RD#1 to RD#10) are selected as shown in FIG. 16. From these, the first reference data with the highest similarity (TTS reference data RD#1) and one or more second reference data (TTS reference data RD#x and RD#y) are selected. This similarity evaluation makes it possible to select TTS reference data RD that match the topic yet have diverse voice qualities.

In this case, the second reference data is reference data evaluated, by cosine similarity relative to the speech features of the first reference data, as being orthogonal or pointing in the opposite direction.
As a result, the TTS reference data RD#1, RD#x, and RD#y have speech features RDX whose voice qualities are dissimilar to one another. Based on these, the speech synthesis server 200 can provide the user 10 with speaker data as variations with different voice qualities.
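
One way to picture the choice of orthogonal and opposite reference data is the sketch below. It assumes that "orthogonal" means the candidate whose cosine similarity to RD#1 is closest to 0 and "opposite" means the remaining candidate whose cosine similarity is lowest (closest to -1); these criteria and the function names are assumptions for illustration, not the embodiment's exact procedure.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pick_varied_references(features: Sequence[Sequence[float]]) -> tuple[int, int, int]:
    """Return indices of the first, orthogonal, and opposite reference data.

    features[0] holds the speech features RDX of RD#1; the remaining entries
    are the other candidates in descending order of topic similarity.
    """
    if len(features) < 3:
        raise ValueError("need RD#1 plus at least two other candidates")
    sims = [cosine(features[0], f) for f in features[1:]]
    # Orthogonal candidate: cosine similarity closest to 0.
    orth = min(range(len(sims)), key=lambda i: abs(sims[i])) + 1
    # Opposite candidate: lowest cosine similarity among the rest.
    opp = min((i for i in range(len(sims)) if i + 1 != orth),
              key=lambda i: sims[i]) + 1
    return 0, orth, opp
```

The returned indices would play the roles of RD#1, RD#x, and RD#y, each of which is then used to select speaker data with a different voice quality.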

The number of reference data selected in descending order of similarity to the topic of the text data TD is set according to the variance of the speech features (see FIGS. 16 and 17).
Thus, when the TTS reference data RD initially selected in descending order of similarity have broadly similar voice qualities, the number is increased to widen the variance, so that the TTS reference data RD#x and RD#y selected by the similarity evaluation do not end up with a voice quality similar to that of the TTS reference data RD#1. In other words, controlling the size of the pool from which TTS reference data RD are selected according to the variance keeps the range of voice qualities finally proposed to the user 10 wide.

The TTS speaker suggestion system 1 of the embodiment includes a content analysis device (the content analysis server 300) that analyzes the reference content CT2 and generates TTS reference data RD containing information on the topic and information on the speech features. The TTS speaker suggestion system 1 also includes the database 400, which stores the TTS reference data RD generated by the content analysis server 300 and selects TTS reference data RD based on the topic of the text data TD. As the speech synthesis device, it further includes the speech synthesis server 200, which has the speaker search unit 230 and the speech synthesis unit 250 as described above.
In the TTS speaker suggestion system 1, the content analysis server 300 and the database 400 generate and accumulate TTS reference data RD from a variety of reference content CT2. The speech synthesis server 200 can use these information resources to provide synthetic speech that matches the text data TD received from the information terminal 100. The richer the TTS reference data RD becomes in quality and quantity, the better the voice quality of the synthetic speech provided to the user 10 matches the text data TD.

The program of the embodiment causes, for example, a CPU, a DSP (digital signal processor), an AI processor, or an information processing device 70 including any of these to execute processing such as that shown in FIG. 11.
That is, the program of the embodiment causes an information processing device to execute a speaker search process that selects speaker data from among a plurality of speaker data by performing a search based on the topic of the text data TD, and a speech synthesis process that generates synthetic speech of the text data TD using the speaker data selected in the speaker search process.

With such a program, the information processing device serving as the speech synthesis server 200 of the embodiment can be realized on, for example, a computer device, a mobile terminal device, or other equipment capable of information processing.

Such a program can be recorded in advance on an HDD serving as a recording medium built into equipment such as a computer device, or in a ROM or the like in a microcomputer having a CPU.
Alternatively, the program can be stored (recorded) temporarily or permanently on a removable recording medium such as a flexible disk, a CD-ROM (Compact Disc Read Only Memory), an MO (Magneto Optical) disk, a DVD (Digital Versatile Disc), a Blu-ray Disc (registered trademark), a magnetic disk, a semiconductor memory, or a memory card. Such a removable recording medium can be provided as so-called packaged software.
Besides being installed from a removable recording medium onto a personal computer or the like, such a program can also be downloaded from a download site via a network such as a LAN (Local Area Network) or the Internet.

Such a program is also suitable for providing the information processing device 70 that constitutes the speech synthesis server 200 of the embodiment on a wide scale. For example, by downloading the program to a mobile terminal device such as a smartphone or tablet, an imaging device, a mobile phone, a personal computer, a game device, a video device, a PDA (Personal Digital Assistant), or the like, such a device can serve as the information processing device 70 that functions as the speech synthesis server 200 of the present disclosure.

Note that the effects described in this specification are merely examples and are not limiting; other effects may also be obtained.

The present technology can also adopt the following configurations.
(1)
a speaker search unit that selects speaker data from among the plurality of speaker data by performing a search based on a topic of the text data;
a voice synthesis unit that generates synthetic voice data of the text data using speaker data selected by the speaker search unit.
(2)
The information processing device described in (1) above, wherein the speaker search unit acquires reference data selected based on a topic of the text data, and selects speaker data from among multiple speaker data based on a similarity between voice features indicated in the reference data and voice features of the speaker data.
(3)
The information processing device according to (2) above, further comprising: transmitting the text data to an external database; and receiving the reference data from the database.
(4)
The reference data is
The information processing device according to (2) or (3) above, including a topic vector that is an index for classifying topics of the reference content used to create the reference data.
(5)
The information processing device according to any one of (2) to (4) above, wherein the reference data includes a speech feature obtained by extracting features from the audio data of the reference content used to create the reference data.
(6)
The information processing device according to any one of (2) to (5) above, wherein the reference data includes information indicating the reference content used to create the reference data.
(7)
The information processing device according to any one of (1) to (6) above, further comprising a communication unit that performs processing to transmit information on the speaker data selected by the speaker search unit and synthetic voice data generated by the voice synthesis unit to an information terminal that is a source of the text data.
(8)
The information processing device according to (7) above, wherein
the speaker search unit acquires reference data selected based on the topic of the text data, and selects speaker data from among the plurality of speaker data based on a similarity between voice features indicated in the reference data and voice features of the speaker data, and
the communication unit performs processing to transmit, to the information terminal that is the source of the text data, information on the reference content used to create the reference data, the information being included in the reference data.
(9)
The information processing device according to any one of (1) to (8) above, wherein, for each of a plurality of reference data selected based on the topic of the text data, the speaker search unit selects speaker data based on a similarity between voice features indicated in that reference data and voice features of the speaker data.
(10)
The information processing device according to (9) above, further comprising
a communication unit that performs processing to transmit information on the speaker data selected by the speaker search unit and the synthetic voice data generated by the voice synthesis unit to an information terminal that is a source of the text data,
wherein the information on the speaker data selected based on the plurality of reference data is transmitted as information to be displayed as a list on the information terminal.
(11)
The information processing device according to (9) or (10) above, wherein the plurality of reference data are reference data further selected from among a plurality of reference data selected in descending order of similarity to the topic of the text data, and include:
first reference data for which the topic of the original reference content has the highest similarity to the topic of the text data; and
one or more second reference data selected based on a similarity evaluation with respect to the first reference data.
(12)
The information processing device according to (11) above, wherein the second reference data is reference data evaluated, by cosine similarity taken with respect to the speech feature of the first reference data, as being orthogonal or opposite in direction to the first reference data.
(13)
The information processing device according to (11) or (12) above, wherein the number of the plurality of reference data selected in descending order of similarity to the topic of the text data is set according to the variance of the speech features.
(14)
An information processing method in which an information processing device executes:
a speaker search process of selecting speaker data from among a plurality of speaker data by performing a search based on a topic of text data; and
a speech synthesis process of generating synthetic speech of the text data using the speaker data selected in the speaker search process.
(15)
A program that causes an information processing device to execute:
a speaker search process of selecting speaker data from among a plurality of speaker data by performing a search based on a topic of text data; and
a speech synthesis process of generating synthetic speech of the text data using the speaker data selected in the speaker search process.
(16)
An information processing system comprising:
a content analysis device that analyzes reference content to generate reference data including information on a topic and information on voice features;
a database that stores the reference data generated by the content analysis device and selects reference data based on a topic of text data; and
a speech synthesis device,
wherein the speech synthesis device includes:
a speaker search unit that acquires the reference data selected by the database based on the topic of the text data, and selects speaker data from among a plurality of speaker data based on a similarity between the voice features indicated by the reference data and voice features of the speaker data; and
a voice synthesis unit that generates synthetic voice data of the text data using the speaker data selected by the speaker search unit.
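
As an illustration only, the speaker search outlined in configurations (2), (9), (11), and (12) can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the present technology: the dictionary keys ("url", "topic", "voice", "name"), the function names, the toy vectors, the fixed top_n, and the max_cos threshold are all hypothetical. Topic vectors and speech features are assumed to be plain lists of floats, and cosine similarity is assumed for the topic comparison and the speaker comparison as well, although the text names it only for the selection of the second reference data.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_reference_data(text_topic, references, top_n=3, max_cos=0.0):
    """Return the first reference data plus contrasting second reference data."""
    # Keep the top_n reference data in descending order of topic similarity.
    ranked = sorted(
        references,
        key=lambda r: cosine(text_topic, r["topic"]),
        reverse=True,
    )[:top_n]
    if not ranked:
        return []
    first = ranked[0]  # highest topic similarity to the text data
    # Second reference data: speech feature orthogonal or opposite to the
    # first one, approximated here (assumption) as cosine <= max_cos.
    seconds = [r for r in ranked[1:] if cosine(first["voice"], r["voice"]) <= max_cos]
    return [first] + seconds


def select_speaker(reference, speakers):
    """Pick the speaker whose voice feature is most similar to the reference."""
    return max(speakers, key=lambda s: cosine(reference["voice"], s["voice"]))


def propose_speakers(text_topic, references, speakers, top_n=3):
    """One (reference content, speaker) pair per selected reference data."""
    selected = select_reference_data(text_topic, references, top_n=top_n)
    return [(ref["url"], select_speaker(ref, speakers)["name"]) for ref in selected]


# Hypothetical toy data: two-dimensional topic vectors and speech features.
references = [
    {"url": "ref-A", "topic": [0.9, 0.1], "voice": [1.0, 0.0]},
    {"url": "ref-B", "topic": [0.8, 0.2], "voice": [0.0, 1.0]},
    {"url": "ref-C", "topic": [0.7, 0.3], "voice": [0.9, 0.1]},
    {"url": "ref-D", "topic": [0.0, 1.0], "voice": [-1.0, 0.0]},
]
speakers = [
    {"name": "speaker-1", "voice": [0.9, 0.2]},
    {"name": "speaker-2", "voice": [0.1, 0.8]},
]

proposals = propose_speakers([1.0, 0.0], references, speakers)
print(proposals)  # [('ref-A', 'speaker-1'), ('ref-B', 'speaker-2')]
```

In this toy run, ref-A is the first reference data, ref-B is kept as second reference data because its speech feature is orthogonal to that of ref-A, and ref-C is dropped because its voice is close to ref-A. The resulting list corresponds to the kind of candidate list that configuration (10) describes as being displayed on the information terminal. The sketch fixes top_n as a constant, whereas configuration (13) sets this number according to the variance of the speech features.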

１　ＴＴＳ話者提案システム
７０　情報処理装置
７１　ＣＰＵ
１００　情報端末
２００　音声合成サーバ
２１０　テキスト－音素記号変換部
２２０　参照データ取得部
２３０　話者検索部
２４０　保有話者データ部
２５０　音声合成部
２６０　ネットワーク通信部
３００　コンテンツ解析サーバ
４００　データベース
５００　コンテンツプロバイダ
ＲＤＴ　トピックベクトル
ＲＤＸ　音声特徴量
ＲＤＵ　コンテンツのＵＲＬ
ＴＤ　テキストデータ
ＡＤ　合成音声データ
ＲＤ　ＴＴＳ参照データ
ＴＭ　トピックモデル
ＴＶ　トピックベクトル
ＸＶ　音声特徴量
ＣＴ１　コンテンツ（ユーザが作る）
ＣＴ２　参照コンテンツ（分析用）
1 TTS speaker suggestion system
70 Information processing device
71 CPU
100 Information terminal
200 Speech synthesis server
210 Text-phoneme symbol conversion unit
220 Reference data acquisition unit
230 Speaker search unit
240 Retained speaker data unit
250 Speech synthesis unit
260 Network communication unit
300 Content analysis server
400 Database
500 Content provider
RDT Topic vector
RDX Speech feature
RDU Content URL
TD Text data
AD Synthetic speech data
RD TTS reference data
TM Topic model
TV Topic vector
XV Speech feature
CT1 Content (created by user)
CT2 Reference content (for analysis)

Claims

1. An information processing device comprising:
a speaker search unit that selects speaker data from among a plurality of speaker data by performing a search based on a topic of text data; and
a voice synthesis unit that generates synthetic voice data of the text data using the speaker data selected by the speaker search unit.

2. The information processing device according to claim 1, wherein the speaker search unit acquires reference data selected based on the topic of the text data, and selects speaker data from among the plurality of speaker data based on a similarity between voice features indicated in the reference data and voice features of the speaker data.

3. The information processing device according to claim 2, wherein the information processing device transmits the text data to an external database and receives the reference data from the database.

4. The information processing device according to claim 2, wherein the reference data includes a topic vector that is an index for classifying the topic of the reference content used to create the reference data.

5. The information processing device according to claim 2, wherein the reference data includes a speech feature obtained by extracting features from the audio data of the reference content used to create the reference data.

6. The information processing device according to claim 2, wherein the reference data includes information indicating the reference content used to create the reference data.

7. The information processing device according to claim 1, further comprising a communication unit that performs processing to transmit information on the speaker data selected by the speaker search unit and the synthetic voice data generated by the voice synthesis unit to an information terminal that is a source of the text data.

8. The information processing device according to claim 7, wherein
the speaker search unit acquires reference data selected based on the topic of the text data, and selects speaker data from among the plurality of speaker data based on a similarity between voice features indicated by the reference data and voice features of the speaker data, and
the communication unit performs processing to transmit, to the information terminal that is the source of the text data, information on the reference content used to create the reference data, the information being included in the reference data.

9. The information processing device according to claim 1, wherein, for each of a plurality of reference data selected based on the topic of the text data, the speaker search unit selects speaker data based on a similarity between voice features indicated in that reference data and voice features of the speaker data.

10. The information processing device according to claim 9, further comprising
a communication unit that performs processing to transmit information on the speaker data selected by the speaker search unit and the synthetic voice data generated by the voice synthesis unit to an information terminal that is a source of the text data,
wherein the information on the speaker data selected based on the plurality of reference data is transmitted as information to be displayed as a list on the information terminal.

11. The information processing device according to claim 9, wherein the plurality of reference data are reference data further selected from among a plurality of reference data selected in descending order of similarity to the topic of the text data, and include:
first reference data for which the topic of the original reference content has the highest similarity to the topic of the text data; and
one or more second reference data selected based on a similarity evaluation with respect to the first reference data.

12. The information processing device according to claim 11, wherein the second reference data is reference data evaluated, by cosine similarity taken with respect to the speech feature of the first reference data, as being orthogonal or opposite in direction to the first reference data.

13. The information processing device according to claim 11, wherein the number of the plurality of reference data selected in descending order of similarity to the topic of the text data is set according to the variance of the speech features.

14. An information processing method in which an information processing device executes:
a speaker search process of selecting speaker data from among a plurality of speaker data by performing a search based on a topic of text data; and
a speech synthesis process of generating synthetic speech of the text data using the speaker data selected in the speaker search process.

15. A program that causes an information processing device to execute:
a speaker search process of selecting speaker data from among a plurality of speaker data by performing a search based on a topic of text data; and
a speech synthesis process of generating synthetic speech of the text data using the speaker data selected in the speaker search process.

16. An information processing system comprising:
a content analysis device that analyzes reference content to generate reference data including information on a topic and information on voice features;
a database that stores the reference data generated by the content analysis device and selects reference data based on a topic of text data; and
a speech synthesis device,
wherein the speech synthesis device includes:
a speaker search unit that acquires the reference data selected by the database based on the topic of the text data, and selects speaker data from among a plurality of speaker data based on a similarity between the voice features indicated by the reference data and voice features of the speaker data; and
a voice synthesis unit that generates synthetic voice data of the text data using the speaker data selected by the speaker search unit.