JP2018005048A

JP2018005048A - Voice quality conversion system

Info

Publication number: JP2018005048A
Application number: JP2016133530A
Authority: JP
Inventors: Kenichi Takahashi; Kazuo Hikawa; Tomoki Toda; Kazuhiro Kobayashi
Original assignee: Nara Institute of Science and Technology NUC; Crimson Tech Inc
Current assignee: Nara Institute of Science and Technology NUC; Crimson Tech Inc
Priority date: 2016-07-05
Filing date: 2016-07-05
Publication date: 2018-01-11
Anticipated expiration: 2036-07-05
Also published as: JP6664670B2

Abstract

[Problem] To provide a voice quality conversion system capable of converting the voice quality of an actor into the voice quality of a target. [Solution] A voice quality learning device 10 extracts a first feature amount from a first voice signal of a target 2. The voice quality learning device 10 converts the fundamental frequency contained in a second voice signal of an actor 1 at a predetermined magnification, and extracts a second feature amount from the second voice signal whose fundamental frequency has been converted. The voice quality learning device 10 stores, in a database DB, model data obtained by modeling the correspondence between the extracted first feature amount and second feature amount. A voice quality conversion device 20 converts the fundamental frequency contained in a third voice signal of the actor 1 at the predetermined magnification, and extracts a third feature amount from the third voice signal whose fundamental frequency has been converted. Based on the model data stored in the database DB and the extracted third feature amount, the voice quality conversion device 20 generates a fourth voice signal in which the voice quality of the actor 1 has been converted into the voice quality of the target 2. [Selected drawing] FIG. 1

Description

The present invention relates to a voice quality conversion system.

In recent years, theme parks, event venues, and the like have sought to attract visitors by having characters appear.

These characters include a variety of characters that appear in, for example, movies, anime, comics, and games. In the field of music, a character may also perform as an artist.

A character may appear in real space as, for example, a costumed figure, or may appear as video. Thanks to improvements in costume fidelity and advances in computer graphics technology, the visual elements of such characters can sufficiently entertain an audience.

Kazuhiro Kobayashi, Tomoki Toda, Graham Neubig, Sakriani Sakti, Satoshi Nakamura (NAIST, Information Science), "Statistical Singing Voice Quality Conversion Based on Differential Spectrum Correction", [online], March 2014, Proceedings of the Acoustical Society of Japan, [retrieved June 27, 2016], Internet <URL:http:www.phontron.com/paper/kobayashi14asj.pdf>

Incidentally, a character is commonly made to appear to be speaking either by playing back a recording of the character's voice at the same time or by having an on-site actor (voice actor) perform the voice.

However, the playback timing or content of the recording may not match what is happening on site, or the actor's voice quality may differ from the character's, in which case the audience may find the result unnatural.

For this reason, the voice of a particular character is sometimes performed by the original voice actor in person, or only by particular actors who can produce a voice resembling the character's. For a character that appears at a theme park or event venue for long periods, however, the actor may end up damaging his or her throat. Moreover, because the actor's throat is overworked in such circumstances, the voice quality of even the same actor may differ between, for example, morning and night.

Characters appearing at theme parks or event venues are also expected to converse with the audience in real time, for example on a stage or a screen, which makes it difficult to prepare the character's voice in advance as recordings.

Accordingly, a technique is desired that allows even different actors to speak freely in the voice of the same character (target).

An object of the present invention is therefore to provide a voice quality conversion system capable of converting the voice quality of an actor into the voice quality of a target.

According to one aspect of the present invention, there is provided a voice quality conversion system that comprises a voice quality learning device and a voice quality conversion device and that converts the voice quality of an actor into the voice quality of a target. The voice quality learning device includes: first input means for inputting a first voice signal of the target; first extraction means for extracting a first feature amount from the input first voice signal; second input means for inputting a second voice signal of the actor corresponding to the first voice signal; first conversion means for converting a fundamental frequency contained in the input second voice signal at a predetermined magnification; second extraction means for extracting a second feature amount from the second voice signal whose fundamental frequency has been converted; and a database that stores model data obtained by modeling the correspondence between the extracted first feature amount and second feature amount. The voice quality conversion device includes: third input means for inputting a third voice signal of the actor; second conversion means for converting a fundamental frequency contained in the input third voice signal at the predetermined magnification; third extraction means for extracting a third feature amount from the third voice signal whose fundamental frequency has been converted; generation means for generating, based on the model data stored in the database and the extracted third feature amount, a fourth voice signal in which the voice quality of the actor has been converted into the voice quality of the target; and output means for outputting the generated fourth voice signal.

The present invention makes it possible to convert the voice quality of an actor into the voice quality of a target.

A diagram schematically showing an example of the configuration of a voice quality conversion system according to an embodiment of the present invention.

A block diagram showing an example of the functional configuration of the voice quality learning device.

A block diagram showing an example of the functional configuration of the voice quality conversion device.

A flowchart showing an example of the processing procedure of the voice quality learning device.

A flowchart showing an example of the processing procedure of the voice quality conversion device.

Embodiments of the present invention will be described below with reference to the drawings.

FIG. 1 is a diagram schematically showing the configuration of a voice quality conversion system according to the present embodiment. The voice quality conversion system according to the present embodiment is used to convert the voice quality of a person referred to as an actor, such as a voice actor (hereinafter simply "actor 1"), into the voice quality of what is referred to as a target, such as a character (hereinafter simply "target 2").

Specifically, the voice quality conversion system can be used, for example at a theme park or event venue, to convert the voice quality of the actor 1 into the voice quality of the target 2 and output the resulting voice whenever the actor 1 speaks, so that even an actor 1 whose voice quality differs from that of the target 2 can speak with the voice quality of the target 2.

Although the actor 1 is described as a person in this embodiment, the actor 1 may be anything that emits a voice, for example something that emits a mechanically generated voice. The target 2 (character) may be a person who performs the character's voice, or a device or the like that mechanically produces the character's voice. The target 2 need not be a character and may instead be the voice of a person such as a celebrity, an actor, or a singer.

As shown in FIG. 1, the voice quality conversion system includes a voice quality learning device 10 and a voice quality conversion device 20.

The voice quality learning device 10 includes a personal computer or the like equipped with a processor (computer), such as a CPU, capable of executing various programs (software). The voice quality learning device 10 has an analysis engine 10a and a database (DB) 10b.

The analysis engine 10a executes learning processing (that is, it trains a voice quality conversion model) using the voice of the target 2 (the voice uttered by the target 2) and the voice of the actor 1 uttered in imitation of the intonation, utterance timing, pitch, and so on of the target 2's voice. The analysis engine 10a performs analysis based on the result of the learning processing (the learning result) and creates the database 10b.

Although not shown in FIG. 1, the voice quality learning device 10 includes a microphone or the like for inputting the voices of the actor 1 and the target 2. The voice quality learning device 10 may instead be configured to accept, for example, audio files in which the voices of the actor 1 and the target 2 have been recorded in advance.

Like the voice quality learning device 10, the voice quality conversion device 20 includes a personal computer or the like equipped with a processor (computer), such as a CPU, capable of executing various programs (software).

The voice quality conversion device 20 uses the database 10b created by the voice quality learning device 10 to convert the voice quality of the actor 1 into the voice quality of the target 2. The converted voice is output from, for example, a speaker 20a provided in the voice quality conversion device 20. The converted voice may also be output as, for example, an audio file and managed within the voice quality conversion device 20, or transmitted to an external server device or the like.

Although not shown in FIG. 1, the voice quality conversion device 20 includes a microphone or the like for inputting the voice of the actor 1. As with the voice quality learning device 10, the voice quality conversion device 20 may be configured to accept an audio file in which the voice of the actor 1 has been recorded in advance.

In the voice quality conversion system according to the present embodiment, the voice quality learning device 10 and the voice quality conversion device 20 are described as separate devices, but they may be implemented as a single device.

Furthermore, the voice quality learning device 10 and the voice quality conversion device 20 may be implemented as electronic devices other than personal computers, such as smartphones or tablet terminals. They may also take a form such as a microphone in which the functions described in this embodiment are stored in a chip or the like and integrated, or may be implemented as dedicated equipment in some other form.

An overview of the voice quality conversion technique used in the voice quality conversion system according to the present embodiment is given below.

The voice quality conversion system according to the present embodiment is assumed to adopt a technique that converts voice quality based on a Gaussian mixture model (GMM) (hereinafter, "GMM-based voice quality conversion"). In GMM-based voice quality conversion, the voice quality learning device 10 executes learning processing and the voice quality conversion device 20 executes conversion processing.

First, the learning processing is briefly described. For the learning processing, voice signals (voice data) of the actor 1 and of the target 2 are prepared, each recorded while uttering, for example, the same lines (sentences or the like) with the same intonation and pitch.

The voice quality learning device 10 receives these voice signals of the actor 1 and the target 2 (that is, voice signals of the same content uttered by the actor 1 and by the target 2).

The voice quality learning device 10 divides both voice signals into frames and executes short-time analysis. Typically, the analysis window is shifted by a fixed length (for example, 5 ms) so that both voice signals are divided into short-time speech waveforms.
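The frame division described above can be sketched roughly as follows. This is only an illustration: the 5 ms shift comes from the text, while the 25 ms frame length, the function name split_into_frames, and the use of NumPy are assumptions made for the example.

```python
import numpy as np

def split_into_frames(signal, sample_rate, frame_ms=25.0, shift_ms=5.0):
    """Cut a 1-D signal into overlapping frames, advancing by shift_ms per frame.

    frame_ms is an assumed analysis length; only the 5 ms shift is taken from the text.
    """
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(signal) - frame_len) // shift
    # Row k holds samples [k * shift, k * shift + frame_len).
    idx = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
    return signal[idx]
```

At a 16 kHz sampling rate this gives frames of 400 samples that start every 80 samples, so consecutive frames overlap heavily.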

For each frame, the voice quality learning device 10 analyzes feature amounts that characterize the voice (spectral analysis), then locally stretches and compresses the time-frame sequences to time-synchronize them, thereby matching the frames of the two voices. It then successively computes data in which the spectra of corresponding frames are concatenated, and models the joint probability density function with a GMM.
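The frame matching and concatenation step might look like the sketch below. The text does not name the alignment algorithm; dynamic time warping (DTW) with a Euclidean frame distance is assumed here, and dtw_path and joint_features are hypothetical helper names. Fitting the GMM to the resulting joint vectors is left out.

```python
import numpy as np

def dtw_path(a, b):
    """Return the DTW alignment path between feature sequences a (Ta, D) and b (Tb, D)."""
    ta, tb = len(a), len(b)
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    cost = np.full((ta + 1, tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, ta + 1):
        for j in range(1, tb + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from the end of both sequences to the start.
    path = []
    i, j = ta, tb
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return path

def joint_features(actor_feats, target_feats):
    """Concatenate time-aligned actor and target frames into joint vectors [x_t, y_t]."""
    ia, it = zip(*dtw_path(actor_feats, target_feats))
    return np.hstack([actor_feats[list(ia)], target_feats[list(it)]])
```

Each row of the returned array pairs one actor frame with the target frame it was aligned to, which is the form of data a joint-density GMM is trained on.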

In the present embodiment, the model data (voice quality conversion model data) obtained by this learning processing is accumulated in the database 10b.

In other words, in the learning processing described above, a conversion rule for converting the voice quality of the actor 1 into the voice quality of the target 2 is statistically modeled from time-aligned pairs of feature amounts of the voices (waveforms) of the actor 1 and the target 2.

Next, the conversion processing is briefly described. Conversion processing in GMM-based voice quality conversion generally uses speech synthesis technology to generate the underlying voice, but the conversion processing in this embodiment uses the voice (waveform) of the actor 1 as it is, so as to output relatively natural rather than mechanical-sounding speech.

That is, rather than converting the feature amounts of the voice, this conversion processing estimates, based on the GMM, the difference in feature amounts between the voice of the actor 1 and the voice of the target 2 that is to be output, and convolves that difference with the voice (waveform) of the actor 1 by means of a synthesis filter (that is, it applies differential spectrum correction), thereby converting the voice quality of the actor 1 into the voice quality of the target 2.
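One way to write down the estimation of the feature difference is given below as a sketch. The notation is an assumption introduced for illustration: x_t and y_t are the time-aligned spectral feature vectors of the actor 1 and the target 2 at frame t, and the GMM has M components with weights alpha_m. A plain conditional-mean estimate is shown for brevity; the literature cited in this document may use a different estimator.

```latex
% Joint density of time-aligned actor features x_t and target features y_t
P(x_t, y_t \mid \lambda) = \sum_{m=1}^{M} \alpha_m \,
  \mathcal{N}\!\left( \begin{bmatrix} x_t \\ y_t \end{bmatrix};
  \begin{bmatrix} \mu_m^{(x)} \\ \mu_m^{(y)} \end{bmatrix},
  \begin{bmatrix} \Sigma_m^{(xx)} & \Sigma_m^{(xy)} \\
                  \Sigma_m^{(yx)} & \Sigma_m^{(yy)} \end{bmatrix} \right)

% Differential feature and its per-component statistics
d_t = y_t - x_t, \qquad
\mu_m^{(d)} = \mu_m^{(y)} - \mu_m^{(x)}, \qquad
\Sigma_m^{(dx)} = \Sigma_m^{(yx)} - \Sigma_m^{(xx)}

% Conditional-mean estimate of the differential given the actor's feature
\hat{d}_t = \sum_{m=1}^{M} P(m \mid x_t, \lambda)
  \left[ \mu_m^{(d)} + \Sigma_m^{(dx)} \left( \Sigma_m^{(xx)} \right)^{-1}
  \left( x_t - \mu_m^{(x)} \right) \right]
```

The estimated difference is then what the synthesis filter applies to the actor's waveform, so the waveform itself is never resynthesized from converted features.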

As described above, in GMM-based voice quality conversion, the conversion processing can convert the voice quality of the actor 1 into the voice quality of the target 2 by using the database 10b (the model data accumulated in it) created by the learning processing.

The GMM-based voice quality conversion adopted in this embodiment is disclosed in, for example, Kazuhiro Kobayashi, Tomoki Toda, Graham Neubig, Sakriani Sakti, Satoshi Nakamura (NAIST, Information Science), "Statistical Singing Voice Quality Conversion Based on Differential Spectrum Correction", [online], March 2014, Proceedings of the Acoustical Society of Japan, [retrieved June 27, 2016], Internet <URL:http:www.phontron.com/paper/kobayashi14asj.pdf>, and in Tomoki Toda, "Speech and Acoustic Signal Processing: Voice Conversion by Statistical Methods", [online], January 20, 2014, [retrieved June 27, 2016], Internet <http://hil.t.u-tokyo.ac.jp/~kameoka/SAP/SAP13_11.pdf>, so a detailed description is omitted here.

When, for example, the actor 1 and the target 2 are of different sexes, the vocal ranges in which they can speak differ.

When the vocal ranges of the actor 1 and the target 2 differ in this way, even if the voice quality of the actor 1 is converted into that of the target 2 by GMM-based voice quality conversion, the output voice will be so far off that it cannot be recognized as the voice of the target 2.

For this reason, the conversion processing described above must include processing that converts the fundamental frequency (f0) of the voice of the actor 1 to match the vocal range of the target 2 (hereinafter, "fundamental frequency conversion"). The fundamental frequency is one of the feature amounts of a voice and represents, among other things, its pitch. Because the conversion processing uses the voice of the actor 1 as it is, this embodiment requires fundamental frequency conversion that modifies the voice (waveform) of the actor 1 by signal processing.

One way to perform fundamental frequency conversion by processing the speech waveform with signal processing is a relatively simple technique that uses time stretching, such as W-SOLA, together with resampling. Fundamental frequency conversion by such a relatively simple technique, which processes the speech waveform on the time axis, has the advantages that, for example, it requires no fundamental frequency estimation, requires no vocoder-based speech analysis and synthesis, and places little load on the CPU. Other techniques may also be used for fundamental frequency conversion.
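The resampling half of such a technique can be illustrated with the sketch below, which assumes plain linear interpolation and uses hypothetical helper names. It shows that resampling by a given ratio scales every frequency in the signal by that ratio while shortening it. A complete pitch shifter would also apply a time stretch (W-SOLA or similar) to restore the original duration, which is not implemented here.

```python
import numpy as np

def resample_linear(signal, ratio):
    """Read the signal `ratio` times faster using linear interpolation.

    Played back at the original sample rate, all frequencies (fundamental and
    formants alike) are multiplied by `ratio` and the duration is divided by it.
    """
    signal = np.asarray(signal, dtype=float)
    n_out = int(len(signal) / ratio)
    positions = np.arange(n_out) * ratio
    return np.interp(positions, np.arange(len(signal)), signal)

def dominant_frequency(signal, sample_rate):
    """Return the frequency (Hz) of the largest peak in the windowed magnitude spectrum."""
    spectrum = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
    freqs = np.fft.rfftfreq(len(signal), 1.0 / sample_rate)
    return float(freqs[np.argmax(spectrum)])

if __name__ == "__main__":
    sample_rate = 16000
    t = np.arange(sample_rate) / sample_rate
    tone = np.sin(2.0 * np.pi * 200.0 * t)      # 200 Hz test tone, 1 second
    shifted = resample_linear(tone, 1.5)        # shorter, and higher in pitch
    print(len(shifted), dominant_frequency(shifted, sample_rate))
```

Because the whole spectrum is scaled, not just the fundamental, this kind of processing also moves the formants, which is the side effect the surrounding text goes on to address.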

However, when such fundamental frequency conversion is executed, the spectrum (formants) of the voice of the actor 1 is stretched or compressed, and the voice quality of the actor 1 therefore changes. If the fundamental frequency conversion is executed immediately before the conversion processing in the voice quality conversion device 20, the formants of the actor 1's voice supplied to the learning processing in the voice quality learning device 10 differ from the formants of the actor 1's voice supplied to the conversion processing, which makes it difficult for the conversion processing to convert the voice quality of the actor 1 into that of the target 2 appropriately.

If, on the other hand, the fundamental frequency conversion is executed after the conversion processing in the voice quality conversion device 20 (that is, the fundamental frequency of the voice is converted after the voice quality has been converted), the voice quality of the target 2 already obtained by the conversion processing is altered by the formant stretching that the fundamental frequency conversion causes. In this case, speech with the voice quality of the target 2 cannot be output.

As noted above, fundamental frequency conversion stretches or compresses the formants (spectrum) of the voice. If the fundamental frequency is always converted at a constant magnification, however, the voice quality does change, but the formants are likewise stretched or compressed at a constant magnification and are therefore stable (that is, a voice quality whose formants have a stable character is obtained).

Focusing on this property, the present embodiment is configured to execute fundamental frequency conversion on the voice of the actor 1 before both the learning processing and the conversion processing in the voice quality conversion system. The fundamental frequency conversion in this case is executed at a predetermined constant magnification (pitch conversion magnification).

The functional configurations of the voice quality learning device 10 and the voice quality conversion device 20 provided in the voice quality conversion system according to the present embodiment are described below.

FIG. 2 is a block diagram showing the functional configuration of the voice quality learning device 10. The voice quality learning device 10 has a function of learning the voice (voice quality) of the target 2 and the voice (voice quality) of the actor 1 in advance so that, as described above, the voice quality conversion device 20 can convert the voice quality of the actor 1 into the voice quality of the target 2.

As shown in FIG. 2, the voice quality learning device 10 includes a first voice input unit 11, a first analysis processing unit 12, a magnification determination unit 13, a second voice input unit 14, a fundamental frequency conversion unit 15, a second analysis processing unit 16, and a model learning unit 17.

In the present embodiment, the first voice input unit 11, the first analysis processing unit 12, the magnification determination unit 13, the second voice input unit 14, the fundamental frequency conversion unit 15, the second analysis processing unit 16, and the model learning unit 17 are functional units constituting the analysis engine 10a shown in FIG. 1, and are assumed to be implemented in software, that is, by causing a computer such as the CPU provided in the voice quality learning device 10 to execute a program. Some or all of the units 11 to 17 may instead be implemented in hardware such as an IC (integrated circuit), or as a combination of software and hardware. The program executed by the computer may be distributed on a computer-readable storage medium, or may be downloaded to the voice quality learning device 10 over a network.

The voice of the target 2 uttered to the voice quality learning device 10 for the learning processing is converted into an analog electrical signal via, for example, a microphone. The analog electrical signal is then converted into a digital electrical signal by an A/D converter and is input by the first voice input unit 11. Hereinafter, the voice (signal) input by the first voice input unit 11 is referred to, for convenience, as the learning voice signal of the target 2. The first voice input unit 11 may also accept the audio file mentioned above as the learning voice signal of the target 2.

The learning voice signal of the target 2 contains, as parameters (feature amounts) representing the characteristics of the target 2's voice, for example a spectral feature amount (spectral envelope) representing phonemic properties, voice quality, and the like, and a fundamental frequency and aperiodic components representing the pitch of the voice, its hoarseness, and the like.

The first analysis processing unit 12 analyzes the learning voice signal of the target 2 (the spectral feature amount, fundamental frequency, and aperiodic components contained in it) and extracts a spectral feature amount (first feature amount) from the learning voice signal of the target 2.

The magnification determination unit 13 determines the constant magnification (f0rate) used when executing the fundamental frequency conversion described above. Specifically, the magnification determination unit 13 determines the magnification based on, for example, the average of the frequency band (that is, the vocal range) of the actor 1's voice and the average of the frequency band (that is, the vocal range) of the target 2's voice. Hereinafter, the magnification determined by the magnification determination unit 13 is referred to, for convenience, as the fixed magnification. The fixed magnification is held in the voice quality learning device 10. It is also held in the voice quality conversion device 20, as described later, for example by being transmitted to the voice quality conversion device 20.
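A minimal sketch of how such a magnification could be computed is shown below. The text only says the magnification is based on the two averages; taking the ratio of the target's mean fundamental frequency to the actor's, and treating zeros in an F0 track as unvoiced frames, are assumptions made for this example, as is the function name fixed_f0_ratio.

```python
import numpy as np

def fixed_f0_ratio(actor_f0_hz, target_f0_hz):
    """Return mean voiced F0 of the target divided by mean voiced F0 of the actor.

    Assumption for illustration: zeros in an F0 track mark unvoiced frames
    and are excluded from the averages.
    """
    actor = np.asarray(actor_f0_hz, dtype=float)
    target = np.asarray(target_f0_hz, dtype=float)
    actor_voiced = actor[actor > 0]
    target_voiced = target[target > 0]
    if actor_voiced.size == 0 or target_voiced.size == 0:
        raise ValueError("need at least one voiced frame per speaker")
    return float(target_voiced.mean() / actor_voiced.mean())
```

For instance, an actor averaging 120 Hz and a target averaging 240 Hz would yield a magnification of 2.0, and that same value would then be reused unchanged at conversion time.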

ここで、学習処理のために声質学習装置１０に対して発せられたアクター１の音声は、例えばマイクロフォンを介して電気信号（音声信号）に変換される。第２音声入力部１４は、マイクロフォンを介して変換された音声信号（第２の音声信号）を入力する。以下、第２音声入力部１４によって入力された音声信号を便宜的にアクター１の学習用音声信号と称する。 Here, the voice of the actor 1 uttered to the voice quality learning device 10 for the learning process is converted into an electric signal (voice signal) via a microphone, for example. The second audio input unit 14 inputs an audio signal (second audio signal) converted through a microphone. Hereinafter, the audio signal input by the second audio input unit 14 is referred to as a learning audio signal of the actor 1 for convenience.

なお、アクター１の学習用音声信号には、アクター１の音声の特徴を表すパラメータ（特徴量）として、例えば音韻性及び声質等を表現するスペクトル特徴量（スペクトル包絡）と、声の高さ（音高）及び声のかすれ等を表現する基本周波数及び非周期成分とが含まれる。 Note that the learning speech signal of the actor 1 includes, as parameters (features) representing the characteristics of the speech of the actor 1, for example, a spectral feature amount (spectrum envelope) representing phonological characteristics and voice quality, and a voice pitch ( Pitch) and a basic frequency and a non-periodic component expressing voice blur.

The fundamental frequency conversion unit 15 converts the fundamental frequency contained in the learning speech signal of the actor 1 at the fixed magnification. That is, in the present embodiment, the fundamental frequency conversion unit 15 executes fundamental frequency conversion on the learning speech signal of the actor 1 at a stage preceding the learning process.

The second analysis processing unit 16 analyzes the learning speech signal of the actor 1 after its fundamental frequency has been converted at the fixed magnification (that is, the spectral feature, fundamental frequency, and aperiodic component contained in the signal), and extracts a spectral feature (second feature quantity) from that learning speech signal.

The model learning unit 17 is a functional unit that executes the learning process described above. The model learning unit 17 statistically models the conversion rule for pairs of the spectral feature extracted by the first analysis processing unit 12 (that is, the feature quantity of the voice of the target 2) and the spectral feature extracted by the second analysis processing unit 16 (that is, the feature quantity of the voice of the actor 1). The model learning unit 17 stores (accumulates) the model data obtained by this learning process in the database 10b.

図３は、声質変換装置２０の機能構成を示すブロック図である。声質変換装置２０は、アクター１の声質をターゲット２の声質に変換する機能を有する。 FIG. 3 is a block diagram showing a functional configuration of the voice quality conversion device 20. The voice quality conversion device 20 has a function of converting the voice quality of the actor 1 into the voice quality of the target 2.

図３に示すように、声質変換装置２０は、変換テーブル２１、音声入力部２２、基本周波数変換部２３、分析処理部２４、差分推定部２５、声質変換部２６及び音声出力部２７を含む。 As shown in FIG. 3, the voice quality conversion device 20 includes a conversion table 21, a voice input unit 22, a fundamental frequency conversion unit 23, an analysis processing unit 24, a difference estimation unit 25, a voice quality conversion unit 26, and a voice output unit 27.

In the present embodiment, the conversion table 21 is generated by installing the database 10b of the voice quality learning device 10 described above, and holds the model data accumulated in the database 10b. The conversion table 21 is stored in, for example, a storage device provided in the voice quality conversion device 20.

In the present embodiment, the voice input unit 22, the fundamental frequency conversion unit 23, the analysis processing unit 24, the difference estimation unit 25, the voice quality conversion unit 26, and the voice output unit 27 are assumed to be realized by causing a computer such as a CPU provided in the voice quality conversion device 20 to execute a program, that is, by software. Some or all of these units 22 to 27 may instead be realized by hardware such as an IC (Integrated Circuit), or by a combination of software and hardware. The program executed by the computer may be stored in a computer-readable storage medium and distributed, or may be downloaded to the voice quality conversion device 20 over a network.

The voice of the actor 1 uttered toward the voice quality conversion device 20 for the conversion process is converted into an analog electric signal via, for example, a microphone. The analog electric signal is further converted into a digital electric signal by an A/D converter and input by the voice input unit 22. Hereinafter, the speech (signal) input by the voice input unit 22 is referred to as the conversion speech signal of the actor 1 for convenience. Note that the voice input unit 22 may instead input an audio file, as described above, as the conversion speech signal of the actor 1.

アクター１の変換用音声信号には、上記したようにアクター１の音声の特徴を表すパラメータ（特徴量）として、スペクトル特徴量、基本周波数及び非周期成分等が含まれる。 As described above, the conversion voice signal of the actor 1 includes a spectral feature quantity, a fundamental frequency, an aperiodic component, and the like as parameters (feature quantities) representing the voice characteristics of the actor 1.

ここで、声質学習装置１０内に保持されている固定倍率（つまり、倍率決定部１３によって決定された倍率）は、上記したように声質変換装置２０内においても保持されているものとする。 Here, the fixed magnification (that is, the magnification determined by the magnification determination unit 13) held in the voice quality learning apparatus 10 is also held in the voice quality conversion apparatus 20 as described above.

The fundamental frequency conversion unit 23 converts the fundamental frequency contained in the conversion speech signal of the actor 1 at the fixed magnification held in the voice quality conversion device 20. That is, in the present embodiment, the fundamental frequency conversion unit 23 executes fundamental frequency conversion on the conversion speech signal of the actor 1 at a stage preceding the conversion process.

The analysis processing unit 24 analyzes the conversion speech signal of the actor 1 after its fundamental frequency has been converted at the fixed magnification (that is, the spectral feature, fundamental frequency, and aperiodic component contained in the signal), and extracts a spectral feature (third feature quantity) from that conversion speech signal.

差分推定部２５及び声質変換部２６は、上述した変換処理を実行する機能部である。 The difference estimation unit 25 and the voice quality conversion unit 26 are functional units that execute the conversion processing described above.

Here, the difference estimation unit 25 and the voice quality conversion unit 26 generate a speech signal (fourth speech signal) in which the voice quality of the actor 1 has been converted into the voice quality of the target 2, through a conversion process based on the model data held in the conversion table 21 and the spectral feature extracted by the analysis processing unit 24. The speech signal generated in this way corresponds to the speech signal of the target 2 that corresponds to the conversion speech signal of the actor 1.

Specifically, the difference estimation unit 25 refers to the conversion table 21 (that is, the model data) and estimates the difference (hereinafter referred to as the differential feature) between the spectral feature extracted by the analysis processing unit 24 (that is, the spectral feature contained in the conversion speech signal of the actor 1) and the spectral feature of the speech signal of the target 2 corresponding to that conversion speech signal.

声質変換部２６は、アクター１の変換用音声信号（音声波形）に対して差分推定部２５によって推定された差分特徴量を適用する処理（フィルタ処理）を実行する。これにより、音声入力部２２によって入力されたアクター１の変換用音声信号において、アクター１の声質をターゲット２の声質に変換することができる。 The voice quality conversion unit 26 executes a process (filtering process) for applying the difference feature amount estimated by the difference estimation unit 25 to the conversion voice signal (voice waveform) of the actor 1. Thereby, the voice quality of the actor 1 can be converted into the voice quality of the target 2 in the voice signal for conversion of the actor 1 input by the voice input unit 22.

The voice output unit 27 outputs the speech signal whose voice quality has been converted by the voice quality conversion unit 26 via, for example, the speaker 20a. Note that the speech signal whose voice quality has been converted by the voice quality conversion unit 26 may instead be output as an audio file, as described above.

以下、本実施形態に係る声質変換システム（声質学習装置１０及び声質変換装置２０）の動作について説明する。 Hereinafter, the operation of the voice quality conversion system (voice quality learning apparatus 10 and voice quality conversion apparatus 20) according to the present embodiment will be described.

First, the processing procedure of the voice quality learning device 10 will be described with reference to the flowchart of FIG. 4.

図４に示す処理が実行される場合、例えばターゲット２（特定のキャラクタ）が話すことが多い音素（言い回し等）の包含されたテキストが用意される。 When the process shown in FIG. 4 is executed, for example, a text including a phoneme (phrase) often spoken by the target 2 (specific character) is prepared.

The target 2 (for example, the person who performs its voice) forms an image of the voice of the target 2 and, based on that image, reads the prepared text aloud while paying attention to intonation, pitch changes, and the like. The prepared text is assumed to contain, for example, about 50 to 100 sentences (lines of dialogue or the like).

As a result, the first voice input unit 11 inputs the learning speech signal of the target 2 (that is, the specific character) in response to the utterance of the target 2 (step S1).

第１分析処理部１２は、ステップＳ１において入力されたターゲット２の学習用音声信号からスペクトル特徴量を抽出する（ステップＳ２）。 The first analysis processing unit 12 extracts a spectrum feature amount from the learning speech signal of the target 2 input in step S1 (step S2).

次に、アクター１は、上記したテキストに基づくターゲット２による発声と同様のイントネーション及び音程の変化等を真似て当該テキストに基づいて発声する。 Next, the actor 1 utters based on the text by imitating the same intonation and pitch change as the utterance by the target 2 based on the text.

As a result, the second voice input unit 14 inputs, in response to the utterance of the actor 1, the learning speech signal of the actor 1 (that is, the learning speech signal of the actor 1 corresponding to the learning speech signal of the target 2 input in step S1) (step S3).

Here, as described above, the voice quality learning device 10 holds the magnification (fixed magnification) for fundamental frequency conversion determined in advance by the magnification determination unit 13. As described above, the fixed magnification is determined based on the average value of the frequency band of the voice of the actor 1 and the average value of the frequency band of the voice of the target 2. Specifically, for example, when the average value of the frequency band of the voice of the actor 1 is 100 Hz and the average value of the frequency band of the voice of the target 2 is 130 Hz, the fixed magnification is 1.3 (130/100). Although the fixed magnification is described here as "the average value of the frequency band of the voice of the target 2 / the average value of the frequency band of the voice of the actor 1", it may be determined by another method. The average values of the frequency bands of the voices of the actor 1 and the target 2 need only be measured in advance.
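The ratio described above can be sketched in a few lines. This is a minimal illustration, not the patented implementation: it assumes that "the average value of the frequency band" can be approximated by the mean of per-frame F0 measurements, and that unvoiced frames are marked with 0.0. The function name `fixed_magnification` and the sample values are hypothetical.

```python
from statistics import fmean


def fixed_magnification(actor_f0_hz, target_f0_hz):
    """Return mean(target F0) / mean(actor F0), ignoring unvoiced frames (F0 <= 0)."""
    actor_voiced = [f for f in actor_f0_hz if f > 0]
    target_voiced = [f for f in target_f0_hz if f > 0]
    if not actor_voiced or not target_voiced:
        raise ValueError("need at least one voiced frame per speaker")
    return fmean(target_voiced) / fmean(actor_voiced)


# Hypothetical pre-measured F0 tracks: actor averages 100 Hz, target 130 Hz.
ratio = fixed_magnification([95.0, 0.0, 100.0, 105.0], [125.0, 130.0, 0.0, 135.0])
print(ratio)
```

With these inputs the ratio matches the 130/100 example in the text.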

基本周波数変換部１５は、上記した固定倍率に基づいて、ステップＳ３において入力されたアクター１の学習用音声信号に対して基本周波数変換を実行する（ステップＳ４）。これにより、アクター１の学習用音声信号に含まれる基本周波数が固定倍率で変換される。 Based on the fixed magnification described above, the fundamental frequency conversion unit 15 performs fundamental frequency conversion on the learning speech signal of the actor 1 input in step S3 (step S4). As a result, the fundamental frequency included in the learning audio signal of the actor 1 is converted at a fixed magnification.
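As a rough sketch of what "converting the fundamental frequency at a fixed magnification" means at the parameter level, the snippet below scales an F0 contour by a constant ratio. It is an assumption-laden simplification: the source does not specify how the waveform itself is modified, so the resynthesis step that would be needed before spectral features are re-extracted is not shown. The name `scale_f0` and the 0.0 marker for unvoiced frames are hypothetical.

```python
def scale_f0(f0_hz, ratio):
    """Multiply voiced F0 values by ratio; keep unvoiced frames (marked 0.0) unvoiced."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return [f * ratio if f > 0 else 0.0 for f in f0_hz]


# Hypothetical actor contour in Hz, scaled with the 1.3 ratio from the example.
converted = scale_f0([100.0, 0.0, 110.0], 1.3)
print(converted)
```

Because the same ratio is applied in both the learning stage and the conversion stage, the model only ever sees actor speech whose pitch has already been moved toward the target's range.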

第２分析処理部１６は、ステップＳ４において基本周波数が変換された後のアクター１の学習用音声信号からスペクトル特徴量を抽出する（ステップＳ５）。 The second analysis processing unit 16 extracts a spectrum feature amount from the learning speech signal of the actor 1 after the fundamental frequency is converted in step S4 (step S5).

The model learning unit 17 executes the learning process described above and models the correspondence between the spectral feature extracted in step S2 (the spectral feature of the voice of the target 2) and the spectral feature extracted in step S5 (the spectral feature of the voice of the actor 1) (step S6). Specifically, as described above, the model learning unit 17 obtains model data by statistically modeling (with a GMM) the conversion rule based on the spectral features associated with each other at each time (frame).
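To make the idea of statistically modeling frame-aligned feature pairs concrete, the following sketch fits a single joint Gaussian to scalar features, which is the one-component, one-dimensional special case of a GMM. A real system would use multi-dimensional spectral features and several mixture components; the function names and the toy data are assumptions for illustration only.

```python
from statistics import fmean


def fit_joint_gaussian(actor_frames, target_frames):
    """Fit a 1-D joint Gaussian over time-aligned (actor, target) feature pairs."""
    if len(actor_frames) != len(target_frames) or len(actor_frames) < 2:
        raise ValueError("need at least two time-aligned frame pairs")
    mu_x, mu_y = fmean(actor_frames), fmean(target_frames)
    var_x = fmean([(x - mu_x) ** 2 for x in actor_frames])
    if var_x == 0:
        raise ValueError("actor features have zero variance")
    cov_xy = fmean(
        [(x - mu_x) * (y - mu_y) for x, y in zip(actor_frames, target_frames)]
    )
    return {"mu_x": mu_x, "mu_y": mu_y, "var_x": var_x, "cov_xy": cov_xy}


def expected_target(model, x):
    """Conditional mean E[y | x] under the fitted joint Gaussian."""
    slope = model["cov_xy"] / model["var_x"]
    return model["mu_y"] + slope * (x - model["mu_x"])


# Toy frame-aligned features in which the target is exactly twice the actor.
model = fit_joint_gaussian([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(expected_target(model, 5.0))
```

The stored `model` dictionary plays the role of the model data: it is all that the conversion side needs in order to map an actor feature to an expected target feature.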

モデル学習部１７によって得られたモデルデータは、データベース１０ｂに蓄積される（ステップＳ７）。 The model data obtained by the model learning unit 17 is accumulated in the database 10b (step S7).

上記した図４に示す処理によれば、上記したようにアクター１の学習用音声信号に対して固定倍率に基づく基本周波数変換を実行した後で学習処理が実行され、当該学習処理において得られるモデルデータがデータベース１０ｂに蓄積される。 According to the process shown in FIG. 4 described above, the learning process is performed after the fundamental frequency conversion based on the fixed magnification is performed on the learning speech signal of the actor 1 as described above, and the model obtained in the learning process is obtained. Data is accumulated in the database 10b.

Next, the processing procedure of the voice quality conversion device 20 will be described with reference to the flowchart of FIG. 5.

In the present embodiment, the voice quality conversion device 20 is used, for example, when the actor 1 wearing a character costume at a theme park, event venue, or the like performs the voice of a specific character (target 2) and converses (interacts) with the audience in real time. The voice quality conversion device 20 may also be used when the actor 1 performs the voice of a specific character shown on video.

Note that the model data obtained by modeling the correspondence between the spectral feature of the voice of the actor 1 and the spectral feature of the voice of the target 2 (the specific character) is assumed to have been accumulated in the voice quality learning device 10 (database 10b) by executing the process shown in FIG. 4 described above. This model data is assumed to be installed in the voice quality conversion device 20 and held in the conversion table 21.

When the actor 1 uses the voice quality conversion device 20, the actor 1 speaks (for example, converses with the audience) while imitating the intonation, pitch changes, and the like of the target 2 to about the same degree as when the process shown in FIG. 4 was executed.

この場合、音声入力部２２は、アクター１の発声に応じて当該アクター１の変換用音声信号を入力する（ステップＳ１１）。 In this case, the voice input unit 22 inputs the conversion voice signal of the actor 1 according to the utterance of the actor 1 (step S11).

Here, as described above, the voice quality conversion device 20 holds the same fixed magnification as the one held in the voice quality learning device 10 (the magnification determined by the magnification determination unit 13).

Based on the fixed magnification held in the voice quality conversion device 20, the fundamental frequency conversion unit 23 executes fundamental frequency conversion on the conversion speech signal of the actor 1 input in step S11 (step S12). As a result, the fundamental frequency contained in the conversion speech signal of the actor 1 is converted at the fixed magnification.

分析処理部２４は、ステップＳ１２において基本周波数が変換された後のアクター１の変換用音声信号からスペクトル特徴量を抽出する（ステップＳ１３）。 The analysis processing unit 24 extracts a spectrum feature amount from the conversion voice signal of the actor 1 after the fundamental frequency is converted in step S12 (step S13).

Next, the difference estimation unit 25 and the voice quality conversion unit 26 execute the conversion process described above. Specifically, the difference estimation unit 25 estimates, based on the model data (GMM) held in the conversion table 21, the differential feature between the spectral feature extracted in step S13 (the spectral feature of the voice of the actor 1) and the spectral feature of the speech signal of the target 2 corresponding to the conversion speech signal of the actor 1 input in step S11 (step S14). In the estimation in step S14, for example, a variable transformation is applied to the GMM to derive a GMM that models the joint probability density of the spectral feature (vector) of the voice of the actor 1 and the differential feature (vector), and the differential feature is estimated based on the GMM derived in this way.
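The variable transformation mentioned in step S14 can be written out as follows. This is a sketch under assumed notation, none of which appears in the source: $x$ is the actor's spectral feature vector, $y$ the target's, $d = y - x$ the differential feature, and mixture component $m$ of the learned joint GMM has weight $\alpha_m$, mean $(\mu_m^{x}, \mu_m^{y})$ and covariance blocks $\Sigma_m^{xx}, \Sigma_m^{xy}, \Sigma_m^{yx}, \Sigma_m^{yy}$. The conditional-expectation estimator in the last line is one common choice; the source does not state which estimator is used.

```latex
% Linear change of variables from (x, y) to (x, d) with d = y - x
\begin{bmatrix} x \\ d \end{bmatrix}
  = A \begin{bmatrix} x \\ y \end{bmatrix},
\qquad
A = \begin{bmatrix} I & 0 \\ -I & I \end{bmatrix}

% Each Gaussian component keeps its weight \alpha_m; mean and covariance become
\mu_m^{d} = \mu_m^{y} - \mu_m^{x},
\qquad
A \, \Sigma_m A^{\top} =
\begin{bmatrix}
  \Sigma_m^{xx} & \Sigma_m^{xy} - \Sigma_m^{xx} \\
  \Sigma_m^{yx} - \Sigma_m^{xx} &
  \Sigma_m^{xx} + \Sigma_m^{yy} - \Sigma_m^{xy} - \Sigma_m^{yx}
\end{bmatrix}

% One possible estimator of the differential feature for an input frame x
\hat{d} = \sum_{m} P(m \mid x)
  \left[ \mu_m^{d}
    + \left( \Sigma_m^{yx} - \Sigma_m^{xx} \right)
      \left( \Sigma_m^{xx} \right)^{-1}
      \left( x - \mu_m^{x} \right) \right]
```

Because the transformation is linear, the derived model is again a GMM with the same number of components, so no retraining is needed to obtain it from the stored model data.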

Next, the voice quality conversion unit 26 generates a speech signal in which the voice quality of the actor 1 has been converted into the voice quality of the target 2 by convolving (synthesizing), with a synthesis filter, the differential feature estimated in step S14 onto the spectral feature extracted in step S13 (step S15). As the synthesis filter, for example, an MLSA (Mel-Log Spectrum Approximation) filter used in speech synthesis can be used.

この声質変換部２６によって声質が変換された後の音声信号は、音声出力部２７によって出力される（ステップＳ１６）。 The voice signal whose voice quality has been converted by the voice quality conversion unit 26 is output by the voice output unit 27 (step S16).

According to the process shown in FIG. 5 described above, the conversion process is executed after fundamental frequency conversion based on the fixed magnification has been executed on the conversion speech signal of the actor 1, which makes it possible to output, in real time, a speech signal in which the voice quality of the actor 1 has been converted into the voice quality of the target 2. The process shown in FIG. 5 is executed each time a speech signal of the actor 1 is input. Specifically, real-time voice quality conversion can be realized by processing the continuously input speech signal of the actor 1 in fixed-length units of, for example, about 5 ms.
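The fixed-length processing described above can be sketched as a simple framing loop. The 16 kHz sample rate, the function name `fixed_length_frames`, and the choice to hold back an incomplete trailing frame are assumptions made for illustration; the source only states a unit of about 5 ms.

```python
def fixed_length_frames(samples, sample_rate_hz=16000, frame_ms=5):
    """Yield consecutive non-overlapping frames of frame_ms milliseconds.

    A trailing partial frame is not yielded; a streaming system would
    keep it buffered until enough new samples arrive.
    """
    frame_len = sample_rate_hz * frame_ms // 1000
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        yield samples[start:start + frame_len]


# 200 dummy samples at 16 kHz: two full 80-sample frames, 40 samples left over.
chunks = list(fixed_length_frames(list(range(200))))
print(len(chunks), len(chunks[0]))
```

Each yielded frame would then pass through fundamental frequency conversion, feature extraction, difference estimation, and filtering (steps S12 to S15) before the next frame is handled.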

As described above, in the present embodiment, the voice quality learning device 10 inputs the learning speech signal of the target 2 (first speech signal) and the learning speech signal of the actor 1 (second speech signal) corresponding to it. At a stage preceding the learning process, the voice quality learning device 10 converts the fundamental frequency contained in the learning speech signal of the actor 1 at a predetermined magnification (the fixed magnification). As the learning process, the voice quality learning device 10 then models the correspondence between the spectral feature extracted from the learning speech signal of the target 2 (first feature quantity) and the spectral feature extracted from the learning speech signal of the actor 1 whose fundamental frequency has been converted (second feature quantity), and stores (accumulates) the resulting model data (voice quality conversion model data) in the database 10b.

The voice quality conversion device 20, on the other hand, inputs the conversion speech signal of the actor 1 (third speech signal) and, at a stage preceding the conversion process, converts the fundamental frequency contained in that signal at the predetermined magnification (the fixed magnification) described above. As the conversion process, the voice quality conversion device 20 generates a speech signal (fourth speech signal) in which the voice quality of the actor 1 has been converted into the voice quality of the target 2, based on the model data accumulated in the database 10b and the spectral feature (third feature quantity) extracted from the conversion speech signal of the actor 1 whose fundamental frequency has been converted.

Note that, in the conversion process performed by the voice quality conversion device 20, the differential feature relative to the spectral feature of the target 2 is estimated based on the model data stored in the database 10b and the spectral feature extracted from the conversion speech signal of the actor 1 whose fundamental frequency has been converted. The voice quality of the actor 1 is converted into the voice quality of the target 2 by applying this differential feature to that spectral feature as a filter.

ここで、本実施形態においては、アクター１とターゲット２との音高の差異による影響を低減するために基本周波数変換が学習処理及び変換処理の双方の前段で実行される。すなわち、本実施形態においては、基本周波数変換後のアクター１の音声（信号）で学習処理が実行されるため、変換処理の前段でアクター１の変換用音声信号に対して基本周波数変換が実行された場合であっても、学習処理によって得られたモデルデータに基づいて適切に声質を変換することが可能となる。 Here, in the present embodiment, in order to reduce the influence due to the difference in pitch between the actor 1 and the target 2, the fundamental frequency conversion is executed before both the learning process and the conversion process. That is, in this embodiment, since the learning process is executed with the voice (signal) of the actor 1 after the fundamental frequency conversion, the fundamental frequency conversion is executed with respect to the voice signal for conversion of the actor 1 before the conversion process. Even in this case, it is possible to appropriately convert the voice quality based on the model data obtained by the learning process.

In the present embodiment, this configuration makes it possible to output in real time a speech signal in which the voice quality of the actor 1 has been converted into the voice quality of the target 2, in response to a speech signal input from an utterance of the actor 1 at, for example, a theme park or event venue. The actor 1 can therefore easily produce the voice of a specific character (target 2).

また、本実施形態においては比較的簡易な基本周波数変換を使用することができるため、性能の低い電子機器（声質学習装置１０及び声質変換装置２０）であっても声質変換システムを実現することができる。 In addition, since the basic frequency conversion that is relatively simple can be used in the present embodiment, a voice quality conversion system can be realized even with low-performance electronic devices (voice quality learning device 10 and voice quality conversion device 20). it can.

Note that in the present embodiment it suffices that the magnification used in the fundamental frequency conversion executed before both the learning process and the conversion process is fixed, so the magnification may be changed as appropriate. Alternatively, for example, the fundamental frequencies of the speech signals of the target 2 and the actor 1 may be measured continuously during the learning process to determine dynamic magnifications in advance, and in the conversion process the fundamental frequency may be converted at a magnification corresponding to the fundamental frequency of the input speech signal of the actor 1.

In the present embodiment, the actor 1 and the target 2 have been described as being in a one-to-one relationship for convenience of explanation. However, by accumulating model data that models the correspondence between the feature quantities of the voice (signal) of each of a plurality of actors 1 and the voice (signal) of the target 2 (that is, model data for each actor 1), each of the plurality of actors 1 can speak with the voice quality of the same character. This makes it easy to rotate the actors 1 who perform the voice of a specific character, which reduces the physical burden on each actor 1 and improves the similarity of voice quality among the actors 1. When model data is accumulated for each actor 1, the fixed magnification described above is determined for each actor 1.

また、アクター１の音声（信号）と複数のターゲット２の各々の音声（信号）との特徴量間の対応関係をモデル化したモデルデータ（つまり、ターゲット２毎のモデルデータ）を蓄積しておくことによって、アクター１が所望のターゲット２を選択し、当該選択されたターゲット２の声質に変換された音声信号が出力されるような構成とすることも可能である。 In addition, model data (that is, model data for each target 2) that models the correspondence relationship between the voice (signal) of the actor 1 and the voice (signal) of each of the plurality of targets 2 is stored. Thus, the actor 1 can select a desired target 2 and output a voice signal converted into the voice quality of the selected target 2.
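One way to picture the multi-actor and multi-target variants is a lookup table keyed by the actor and target pair. This is a hypothetical sketch: the class name, the identifiers, and the idea of storing one fixed magnification per pair are assumptions (the source only says that the fixed magnification is determined for each actor when model data is kept per actor).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConversionProfile:
    fixed_magnification: float  # F0 ratio decided for this actor/target pair
    model_data_id: str          # hypothetical key of the stored model data


# Hypothetical registry: two actors sharing one character voice.
profiles = {
    ("actor_a", "character_x"): ConversionProfile(1.3, "model_a_x"),
    ("actor_b", "character_x"): ConversionProfile(0.9, "model_b_x"),
}


def select_profile(actor_id, target_id):
    """Return the profile for the pair, or raise LookupError if none is registered."""
    try:
        return profiles[(actor_id, target_id)]
    except KeyError:
        raise LookupError(f"no model data for {actor_id!r} -> {target_id!r}") from None


print(select_profile("actor_b", "character_x"))
```

Switching actors or targets then amounts to loading a different profile before conversion starts, with the rest of the pipeline unchanged.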

Examples of how the voice quality conversion system according to the present embodiment can be used are described below. In the present embodiment, a speech signal with the voice quality of a target 2 whose pitch range differs from that of the actor 1 can be output in response to the utterance of the actor 1. This makes it possible, for example, for a female actor 1 to converse with the voice quality of a male target 2. In addition, because the present embodiment can compensate for differences in pitch between individual voices, the actor 1 can produce a voice in a range that the actor 1 cannot normally reach. In karaoke, for example, the actor 1 can sing with the voice quality of the original singer, and the problem of vocal range is also resolved.

In addition, by saving the voice of a specific character (target 2) in advance, it is possible, for example after the person (voice actor) who performed the voice of that character has died, to obtain model data that models the correspondence between the feature quantities of the voice of another person (actor 1) and the saved voice of the character. With such a configuration, an animated film featuring the character can be produced from the utterances of another person (actor 1) even after the original performer has died. That is, the voice quality conversion system according to the present embodiment can also be applied to fields different from conventional speech synthesis, such as removing time constraints on the production of animated films.

A person such as a voice actor (target 2) may also use the voice quality conversion system according to the present embodiment as insurance against future changes in voice quality. That is, the voice of the person is saved in advance, and when the voice quality actually changes because of illness, injury, aging, or the like, model data is obtained that models the correspondence between the feature quantities of the current voice and the saved past voice. With such a configuration, the person can deliver lines or hold a conversation in a past voice (voice quality), such as the voice of their youth, even after the voice quality has changed. In this case, a service can be offered in which the voice is saved as insurance for free or at low cost and a fee is paid when the voice quality conversion system is actually used. In recent years, devices have been developed that can produce an artificial voice even after a person has lost the ability to speak, so by using such a device a person can converse in their past voice even after losing their own. There may be a concern that active voice actors will lose opportunities to perform because of this system. On the contrary, the voice actor's voice can be used simultaneously anywhere in the world without the voice actor being present, so opportunities for use increase. A business model can therefore also be offered in which usage fees are returned to the voice actor, for example as royalties, according to use.

In addition, for purposes such as issuing a warning or disabling the system when actor 1 is replaced by a third party for whom no model data is registered, the voiceprint of actor 1 may be registered in advance. Voiceprint authentication is then performed when speech is input to the voice quality conversion device 20, and if the voiceprint does not match that of actor 1, an error is displayed or the conversion is not executed.
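As an illustration only, the voiceprint check described above might be sketched as follows. The embodiment does not specify how voiceprints are represented or compared, so the fixed-length voiceprint vectors, the cosine-similarity comparison, the 0.8 threshold, and the names `VoiceprintMismatchError` and `convert_if_enrolled` are all assumptions made for this sketch.

```python
import math
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class VoiceprintMismatchError(Exception):
    """Raised when the input speaker does not match the enrolled actor (hypothetical)."""


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def convert_if_enrolled(
    enrolled: Sequence[float],
    observed: Sequence[float],
    convert: Callable[[], T],
    threshold: float = 0.8,  # assumed value, not taken from the source
) -> T:
    """Run `convert` only when the observed voiceprint matches the enrolled one."""
    if len(enrolled) != len(observed):
        raise ValueError("voiceprint vectors must have the same length")
    score = cosine_similarity(enrolled, observed)
    if score < threshold:
        # Corresponds to "an error is displayed or the conversion is not executed".
        raise VoiceprintMismatchError(f"similarity {score:.2f} below {threshold}")
    return convert()
```

A caller would catch `VoiceprintMismatchError` to display the warning; whether to warn or to block outright is left to the configuration, as in the description.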

Similarly, by storing model data obtained by modeling the correspondence between the feature quantities of the current voice and the past voice of a person appearing in a film, drama, or the like (for example, an actor), the voice quality conversion system may be used so that, in a flashback scene of the film or drama, the person speaks lines in the voice quality he or she had at the time depicted in that scene (that is, the past voice).

Furthermore, when a foreign film, drama, or the like is dubbed into Japanese, the Japanese lines can be spoken in the voice quality of the actor who actually appears in the film or drama.

As described above, the voice quality conversion system according to the present embodiment is little affected by language, so it can convert and output voice quality even for, for example, a special language with no linguistic meaning uttered by a character. The system can also be used for various purposes that take advantage of its real-time capability: for example, it can be used for karaoke as described above, or to output speech in which the voice quality of synthesized voice guidance from a device has been converted.

The voice quality conversion system according to the present embodiment has been described mainly on the assumption that GMM-based voice quality conversion is employed. However, the technique described above, in which the voice quality of actor 1 is converted into that of target 2 by synthesizing the difference between the feature quantities of the voice of actor 1 and the voice of target 2 into the voice of actor 1 (that is, by applying differential spectrum correction), is also applicable to, for example, voice quality conversion based on deep learning, in which modeling is performed with a neural network instead of a GMM. Accordingly, the system may employ such deep-learning-based voice quality conversion, or any other voice quality conversion technique that performs learning with the voice of actor 1 and the voice of target 2 as input.
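The idea of differential spectrum correction can be illustrated with a toy sketch. It assumes that features are log-domain spectral vectors (for example cepstral coefficients), so that adding an estimated difference corresponds to filtering the actor's speech. The estimator here is deliberately trivial: a per-dimension mean of (target - actor) over time-aligned frame pairs stands in for the trained GMM or neural network, and it is not the estimator of the embodiment. The function names are hypothetical.

```python
from typing import List, Sequence

Frame = Sequence[float]


def mean_difference(
    actor_frames: Sequence[Frame], target_frames: Sequence[Frame]
) -> List[float]:
    """Toy stand-in for the trained model: average (target - actor) per dimension.

    Assumes the two sequences are already time-aligned frame by frame.
    """
    if not actor_frames or len(actor_frames) != len(target_frames):
        raise ValueError("need the same non-zero number of paired frames")
    dim = len(actor_frames[0])
    totals = [0.0] * dim
    for actor, target in zip(actor_frames, target_frames):
        for i in range(dim):
            totals[i] += target[i] - actor[i]
    return [total / len(actor_frames) for total in totals]


def apply_difference(actor_frame: Frame, difference: Frame) -> List[float]:
    """Add the estimated difference to the actor's own features.

    The actor's features are corrected rather than replaced, which is the
    point of the differential approach.
    """
    if len(actor_frame) != len(difference):
        raise ValueError("frame and difference must have the same dimension")
    return [a + d for a, d in zip(actor_frame, difference)]
```

In a real system the difference would be estimated per frame from the input features by the learned model; only the final addition step carries over unchanged from this sketch.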

The present invention is not limited to the above embodiment as it is, and in the implementation stage the constituent elements can be modified and embodied without departing from the gist of the invention. Various inventions can be formed by appropriately combining the plurality of constituent elements disclosed in the above embodiment. For example, some constituent elements may be deleted from all the constituent elements shown in the embodiment. Furthermore, constituent elements of different embodiments may be combined as appropriate.

DESCRIPTION OF SYMBOLS: 10 ... voice quality learning device, 10a ... analysis engine, 10b ... database, 11 ... first voice input unit (first input means), 12 ... first analysis processing unit (first extraction means), 13 ... magnification determination unit, 14 ... second voice input unit (second input means), 15 ... fundamental frequency conversion unit (first conversion means), 16 ... second analysis processing unit (second extraction means), 17 ... model learning unit, 20 ... voice quality conversion device, 20a ... speaker, 21 ... conversion table, 22 ... voice input unit (third input means), 23 ... fundamental frequency conversion unit (second conversion means), 24 ... analysis processing unit (third extraction means), 25 ... difference estimation unit, 26 ... voice quality conversion unit, 27 ... voice output unit.

Claims

A voice quality conversion system that includes a voice quality learning device and a voice quality conversion device and that converts the voice quality of an actor into the voice quality of a target, wherein
the voice quality learning device comprises:
first input means for inputting a first audio signal of the target;
first extraction means for extracting a first feature quantity from the input first audio signal;
second input means for inputting a second audio signal of the actor corresponding to the first audio signal;
first conversion means for converting a fundamental frequency included in the input second audio signal by a predetermined magnification;
second extraction means for extracting a second feature quantity from the second audio signal whose fundamental frequency has been converted; and
a database that stores model data obtained by modeling the correspondence between the extracted first feature quantity and second feature quantity, and
the voice quality conversion device comprises:
third input means for inputting a third audio signal of the actor;
second conversion means for converting a fundamental frequency included in the input third audio signal by the predetermined magnification;
third extraction means for extracting a third feature quantity from the third audio signal whose fundamental frequency has been converted;
generating means for generating, based on the model data stored in the database and the extracted third feature quantity, a fourth audio signal in which the voice quality of the actor is converted into the voice quality of the target; and
output means for outputting the generated fourth audio signal.
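For illustration, the fundamental frequency conversion recited for both devices can be sketched as a scaling of an F0 contour by one shared magnification. The representation of the contour as a list of values in Hz with 0.0 marking unvoiced frames, and the name `scale_f0`, are assumptions of this sketch. How the magnification is determined is not shown here.

```python
from typing import List, Sequence


def scale_f0(f0_contour: Sequence[float], magnification: float) -> List[float]:
    """Multiply every voiced F0 value (Hz) by `magnification`.

    Frames with F0 == 0.0 are treated as unvoiced (an assumed convention)
    and are left unchanged.
    """
    if magnification <= 0.0:
        raise ValueError("magnification must be positive")
    return [f0 * magnification if f0 > 0.0 else 0.0 for f0 in f0_contour]


# The same value is used at learning time (second audio signal) and at
# conversion time (third audio signal), so both see the same F0 range.
PREDETERMINED_MAGNIFICATION = 1.5  # illustrative value only

learning_f0 = scale_f0([100.0, 0.0, 200.0], PREDETERMINED_MAGNIFICATION)
conversion_f0 = scale_f0([120.0, 110.0, 0.0], PREDETERMINED_MAGNIFICATION)
```

Using one constant for both calls mirrors the claim wording "the predetermined magnification" in the second conversion means.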