JPH06162168A

JPH06162168A - Composite image display system

Info

Publication number: JPH06162168A
Application number: JP4335527A
Authority: JP
Inventors: Akira Nakagawa (中川 章); Eiji Morimatsu (森松 映史); Kiichi Matsuda (松田 喜一)
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1992-11-20
Filing date: 1992-11-20
Publication date: 1994-06-10

Abstract

(57) [Abstract] (amended)

[Purpose] In a composite image display system applicable to AV and similar applications, in which merely sending text data conveys a message through a synthesized moving image of the sender and synthesized speech, to synchronize the synthesized speech accurately with the mouth movement of the synthesized image.

[Constitution] The system comprises speech synthesis means 31 that synthesizes and outputs synthesized speech; conversion means 32 that converts text information into a sequence of mouth-shape codes representing the series of mouth-shape movements made when the text is uttered; utterance time calculation means 33 that calculates the utterance time of each syllable output from the speech synthesis means and estimates the timing of each syllable boundary; and display control means 34 that switches the displayed image to the image corresponding to the mouth-shape code from the conversion means at the estimated timing of each syllable boundary. In calculating the utterance time of the synthesized speech, the utterance time calculation means obtains in advance the relationship between the display mode, such as the display size, of the person's synthesized facial moving image on the screen and the time required for the display processing, and corrects the delay time for each sound based on that relationship.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a composite image display system applicable to AV (audio/video) electronic mail and the like, in which merely sending text data conveys a message to the recipient through a synthesized moving image of the sender's speaking face and synthesized speech, much as on a videophone. In particular, it relates to a composite image display system that accurately synchronizes the synthesized speech with the mouth movement of the synthesized image.

[0002]

[Prior Art] The technique of freely generating and uttering synthesized speech corresponding to arbitrary text information is called rule-based speech synthesis, and rule-based speech synthesizers that implement it have already been built. Rule-based speech synthesis is applied in various fields to improve the human-machine interface. In recent years, in a manner similar to speech synthesis, a technique has also been developed that analyzes arbitrary text information to generate a moving image of a person, including the mouth movements made when speaking that text. Combining it with the speech synthesis technique described above yields a more natural interface.

[0003] For example, when this speech and facial moving image synthesis technique is applied to electronic mail, data files such as the mail sender's face image are prepared in advance on the receiving side. Whereas conventional electronic mail merely displays text on the recipient's screen, a moving image of the sender's speaking face then appears and the text is read aloud in synthesized speech, conveying a more expressive message to the recipient.

[0004] FIG. 3 shows a configuration example of a speech and moving image output device that synthesizes and outputs speech and a facial moving image based on such text information. In FIG. 3, reference numeral 1 denotes a text decomposition unit that receives text information. The text decomposition unit 1 analyzes the input text, generates pronunciation control data for speech output, and outputs it to the rule-based speech synthesis unit 2 and the speech/mouth-shape conversion unit 3. For example, when the text "tadaima" ("I'm home") is input, it is decomposed into the vowel and consonant phoneme data "T, A, D, A, I, M, A" and output.

[0005] The rule-based speech synthesis unit 2 generates and outputs synthesized speech that reads an arbitrary text aloud based on the pronunciation control data for that text.

[0006] The speech/mouth-shape conversion unit 3 converts the phoneme data of an arbitrary text into a sequence of mouth-shape codes representing the series of mouth movements made when the text is pronounced. There are, for example, seven mouth-shape codes: A (vowel "a"), I (vowel "i"), U (vowel "u"), E (vowel "e"), O (vowel "o"), S (consonant), and C (closed mouth). For each code, an image of the mouth shape made when pronouncing it is prepared in advance. For example, when the text "tadaima" above is input, the phoneme data "TADAIMA" is mapped as "T" → mouth-shape code S, "A" → mouth-shape code A, "D" → mouth-shape code S, "I" → mouth-shape code I, "M" → mouth-shape code C, and "A" → mouth-shape code A, and the result is output to the image display control unit 5 as a sequence of mouth-shape codes.
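The phoneme-to-code conversion described here can be sketched as a simple lookup. This is a minimal illustration, not the patented implementation: the function name `to_mouth_codes` and the set of consonants treated as "closed mouth" are assumptions (the text only shows "M" mapping to C). The sketch emits one code per phoneme, so the seven phonemes of "TADAIMA" yield seven codes.

```python
# Hypothetical mapping from romanized phonemes to the seven mouth-shape codes
# named in the text: A, I, U, E, O (vowels), S (consonant), C (closed mouth).
VOWELS = {"A", "I", "U", "E", "O"}
# Assumption: bilabial consonants close the lips; the text only shows "M" -> C.
CLOSED_MOUTH = {"M", "B", "P"}


def to_mouth_codes(phonemes):
    """Convert a phoneme sequence into a list of mouth-shape codes."""
    codes = []
    for p in phonemes:
        p = p.upper()
        if p in VOWELS:
            codes.append(p)    # each vowel has its own mouth shape
        elif p in CLOSED_MOUTH:
            codes.append("C")  # closed mouth
        else:
            codes.append("S")  # generic consonant shape
    return codes


print(to_mouth_codes("TADAIMA"))  # ['S', 'A', 'S', 'A', 'I', 'C', 'A']
```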

[0007] Composite image data is filed in the image memory 6. This data combines, in a single file, a one-frame shoulder-up image of the speaker and the data of seven mouth-region images, synthesized from that image, corresponding to the seven mouth-shape codes described above.

[0008] Based on the phoneme data from the text decomposition unit 1, the utterance time calculation unit 4 uses exactly the same algorithm as the rule-based speech synthesis unit 2 to calculate the time until each syllable is uttered during speech synthesis. That is, for the input text, it estimates the timing of each syllable boundary in the text, measured from the start of the text, as the text is synthesized and uttered by the rule-based speech synthesis unit 2, and outputs the result to the image display control unit 5.
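Estimating syllable boundaries from the start of the text amounts to a running sum of per-syllable utterance times. The sketch below assumes the per-syllable durations are already known; the function name and the millisecond values are hypothetical and chosen only for illustration.

```python
from itertools import accumulate


def syllable_boundaries(durations_ms):
    """Return the end time of each syllable, measured from the start of the text."""
    return list(accumulate(durations_ms))


# Hypothetical utterance times (ms) for the syllables "ta", "da", "i", "ma".
print(syllable_boundaries([120, 110, 90, 130]))  # [120, 230, 320, 450]
```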

[0009] Based on the timing signal from the utterance time calculation unit 4, the image display control unit 5 performs image display control so that, when the utterance timing of each syllable arrives, the mouth-shape image corresponding to the mouth-shape code of that syllable is selected from the image memory 6 and output. In other words, it performs synchronization control so that the movement of the speaker's mouth shown on the screen matches the speech uttered by the rule-based speech synthesis unit 2, that is, so that the synthesized speech and the facial moving image are synchronized.
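One way to picture this selection step is a lookup from elapsed time to the mouth-shape code currently on screen. The sketch assumes one code per timed segment and falls back to the closed-mouth code C once the text has finished; both choices, and the name `mouth_code_at`, are illustrative assumptions rather than details from the source.

```python
from bisect import bisect_right


def mouth_code_at(t_ms, boundaries_ms, codes, idle="C"):
    """Return the mouth-shape code shown at time t_ms from the start of the text.

    boundaries_ms[i] is the estimated end time of segment i, and codes[i] is
    the mouth-shape code displayed during segment i.
    """
    i = bisect_right(boundaries_ms, t_ms)  # index of the segment containing t_ms
    return codes[i] if i < len(codes) else idle


boundaries = [120, 230, 320, 450]  # hypothetical boundary times (ms)
codes = ["A", "A", "I", "A"]       # hypothetical code per segment
print(mouth_code_at(300, boundaries, codes))  # I
```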

[0010] The parameter input unit 7 is used to enter, with a keyboard or the like, various parameters such as the voice quality of the speech synthesized by the rule-based speech synthesis unit 2, the display position of the facial moving image on the screen, and the display magnification. Parameters relating to the synthesized speech are passed to the rule-based speech synthesis unit 2, and parameters relating to the facial moving image are passed to the image display control unit 5 and the image memory 6.

[0011] The operation of the device configured in this way is as follows. When text information is input, the text decomposition unit 1 analyzes it, and the phoneme data is passed as a batch to the rule-based speech synthesis unit 2, which utters it as synthesized speech. In parallel with this utterance, the speech/mouth-shape conversion unit 3 converts the phoneme data into a sequence of mouth-shape codes. The utterance time calculation unit 4 estimates the time of each syllable boundary from the phoneme data and passes this time data to the image display control unit 5. The image display control unit 5 aligns the timing of the mouth-shape codes with the utterance timing of each syllable and, from the mouth-shape images expanded in the image memory 6, transfers to the VRAM the facial moving image data corresponding to the mouth-shape code obtained by the speech/mouth-shape conversion unit 3. The speaker's facial moving image is displayed on the screen of the display device via the VRAM. The text information is thus conveyed to the recipient as a message consisting of synthesized speech that actually utters the text and a facial moving image of the speaker whose mouth movements are timed to that speech.

[0012] The device of FIG. 3 can be realized as a compact, economical system by using an existing small speech synthesis unit for the rule-based speech synthesis unit 2 and a personal computer or the like for the remaining parts.

[0013] In such a speech and facial moving image output device, the parameters used when generating the synthesized speech and the facial moving image, such as the voice quality and the display position and display magnification of the image on the screen, are those preset as initial values in the displaying system (entered in advance through the parameter input unit 7). When the voice quality of the synthesized speech and the manner of generating the facial moving image are preset on the receiving side in this way, the viewer gets an unnatural impression if, for example, the person in the pre-registered face image and the voice quality do not suit the message.

[0014] Moreover, as is typical when this device is used for AV electronic mail, when the person who creates the text information differs from the person who views it as speech and moving images, the receiving side does not necessarily utter and display it with the voice quality and image size that the creator intended. As a result, the recipient may receive an impression entirely different from what the sender intended.

[0015] A method has therefore been proposed in which, when the composite face image is created on the text-creating side, the voice quality of the synthesized speech on the display side, the display magnification and display position for the synthesized moving image, and various other parameters are attached to the composite image and passed to the display side. The display side can then synthesize the speech and the facial moving image with the display magnification, display position, voice quality, and so on that the creator intended.

[0016]

[Problems to Be Solved by the Invention] In the system described above, to make the speech match the movement of the mouth on the display side, the image display control unit performs control so that the mouth movement of the face image follows the utterance timing of each syllable of the speech generated by the rule-based speech synthesis unit.

[0017] When, as in the system described above, the environment on which the system runs (a personal computer in this example) provides no means of measuring the real time at which the face image for each syllable should be displayed, a counter or the like is implemented in software. This counter is used to delay the mouth-shape image switching process by the required interval until the utterance timing of each syllable, so that the mouth movement matches the synthesized speech.
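A software counter of this kind is essentially a calibrated busy-wait loop. The sketch below is only an illustration of the idea: the function names and the calibration figure (loop iterations per millisecond) are hypothetical, and a real system would have to measure that figure on the target machine.

```python
def loops_for_delay(delay_ms, loops_per_ms):
    """Number of counter iterations that approximates delay_ms."""
    return round(delay_ms * loops_per_ms)


def spin(n_loops):
    """Busy-wait by counting up to n_loops; returns the final counter value."""
    count = 0
    while count < n_loops:
        count += 1
    return count


# Hypothetical calibration: about 2000 loop iterations per millisecond.
n = loops_for_delay(75, loops_per_ms=2000)
spin(n)  # holds back the next mouth-image switch by roughly 75 ms
```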

[0018] That is, as shown in FIG. 4, the time taken to utter, for example, the syllable "a" or "ta" is uniquely determined by the rule-based speech synthesis unit 2. Each utterance time T_TOTAL is the sum of the data processing time T_P, the transfer time T_T of the composite image data to the VRAM, and a delay time T_D that differs for each syllable: T_TOTAL = T_P + T_T + T_D. Of these, the data processing time T_P is always constant, and the VRAM transfer time T_T is the same for every syllable as long as the display magnification is the same. The difference in utterance time between syllables is therefore adjusted by the delay time T_D preset for each syllable. In other words, the delay time T_D of each syllable is determined in advance, and the utterance time calculation unit 4 calculates the utterance time T_TOTAL of each syllable by adding that syllable's delay time T_D to the data processing time T_P and the transfer time T_T.
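The conventional calculation can be written out directly from the sum T_TOTAL = T_P + T_T + T_D. In this sketch all millisecond values are hypothetical; only the structure (a constant T_P, a T_T fixed for one display magnification, and a preset per-syllable T_D) comes from the text.

```python
T_P = 5                    # data processing time (ms), constant; hypothetical value
T_T = 20                   # VRAM transfer time (ms) at one fixed magnification; hypothetical
T_D = {"a": 75, "ta": 95}  # preset per-syllable delay times (ms); hypothetical


def utterance_time(syllable):
    """T_TOTAL = T_P + T_T + T_D for the given syllable."""
    return T_P + T_T + T_D[syllable]


print(utterance_time("a"))   # 100
print(utterance_time("ta"))  # 120
```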

[0019] However, the transfer time T_T of the composite image data to the VRAM changes with the display magnification (display size) of the image on the screen, because the amount of data transferred changes. Consequently, if the system is configured so that the display magnification and the like can be changed freely according to the sender's intention, as in the AV electronic mail system described above, and the delay time T_D for each syllable has the same value at every display magnification, the utterance time T_TOTAL calculated by the utterance time calculation unit 4 for each syllable varies with the display magnification, and the synthesized speech and the synthesized image can no longer be fully synchronized. This could be prevented by determining and holding the per-syllable delay times in advance for every display magnification, but the amount of data to be held would then become very large.
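The loss of synchronization can be illustrated numerically. The sketch assumes, purely for illustration, that the transfer time grows with the pixel count (magnification squared) and that the delay was tuned at magnification 1; neither assumption nor any of the values is taken from the source.

```python
def transfer_time(magnification, base_ms=20):
    """Hypothetical model: transfer time scales with pixel count (magnification squared)."""
    return base_ms * magnification ** 2


def display_interval(magnification, t_p=5, t_d=75):
    """Interval actually spent per syllable when T_D is not adjusted for magnification."""
    return t_p + transfer_time(magnification) + t_d


speech_ms = display_interval(1)  # delay was tuned here, so this matches the speech: 100
shown_ms = display_interval(2)   # same T_D at double magnification: 160
drift_after_10 = 10 * (shown_ms - speech_ms)
print(drift_after_10)  # 600 (ms the image lags the speech after ten syllables)
```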

[0020] The present invention has been made in view of this problem. Its object is to allow the utterance time of each syllable to be calculated correctly even when the display size or display magnification on the screen changes, so that a facial moving image with natural mouth movements synchronized with the speech can be generated.

[0021]

[Means for Solving the Problems] FIG. 1 illustrates the principle of the present invention. In one aspect, the invention provides a composite image display system comprising: speech synthesis means 31 that synthesizes and outputs synthesized speech based on text information; conversion means 32 that converts the text information into a sequence of mouth-shape codes representing the series of mouth-shape movements made when the text is uttered; utterance time calculation means 33 that, based on the text information, calculates the utterance time of each syllable of the synthesized speech output from the speech synthesis means 31 and estimates the timing of each syllable boundary; and display control means 34 that switches the displayed image to the mouth-shape image corresponding to the mouth-shape code from the conversion means 32 at the timing of each syllable boundary estimated by the utterance time calculation means 33. The system is characterized in that, in calculating the utterance time of each syllable of the synthesized speech, the utterance time calculation means 33 obtains in advance the relationship between the display mode, such as the display size, of the person's synthesized facial moving image on the screen and the time required for the display processing, and corrects, based on that relationship, the per-sound delay time used to calculate the utterance time of each syllable.

[0022] In another aspect, the invention provides a composite image display system that generates, from arbitrary text information, the corresponding synthesized speech and a synthesized facial moving image of a person whose mouth moves in time with that speech. The system is characterized in that, in calculating the utterance time of each syllable of the synthesized speech in order to switch the mouth-shape image of the synthesized facial moving image at the timing of each syllable boundary, it obtains in advance the relationship between the display mode, such as the display size, of the person's synthesized facial moving image on the screen and the time required for the display processing, and corrects, based on that relationship, the delay time for each syllable type used to calculate the utterance time of each syllable.

[0023] The delay time can be corrected by dividing it into a first delay time determined uniquely by the image display mode and a second delay time determined uniquely by the syllable type, obtaining the first delay time according to the image display mode, and adding to it the second delay time for each syllable.

[0024]

[Operation] The delay time used in calculating the utterance time of each syllable is divided into a first delay time determined uniquely by the display mode, such as the image display magnification, and a second delay time determined uniquely by the syllable type. Once the image display magnification is fixed, the first delay time is fixed accordingly, and adding the second delay time for each syllable gives the delay time of that syllable. When the utterance time of each syllable is obtained using this delay time, it takes a unique value for each syllable and does not depend on the image display magnification or the like. The speech and the moving image can therefore be synchronized regardless of the image size and the like.

[0025]

[Embodiments] Embodiments of the present invention are described below with reference to the drawings. FIG. 2 illustrates the utterance time calculation in a composite image display system according to one embodiment of the invention. The hardware configuration of the speech and facial moving image output device serving as this composite image display system is almost the same as that shown in FIG. 3; the difference is that the utterance time calculation unit 4 calculates the utterance time by the method of FIG. 2. This calculation method is described below.

[0026] Suppose, as illustrated, that the phonemes are "a", "i", "u", and "e". Their utterance times T_TOTAL differ from one another and are adjusted by the delay time T_D. That is, the utterance time T_TOTAL of each phoneme is the sum of the processing time T_P, the VRAM transfer time T_T, and the delay time T_D, and giving each phoneme a different delay time T_D gives each phoneme a different utterance time T_TOTAL.

[0027] Of these times, the processing time T_P is a constant that does not depend on the image display magnification or the like. The VRAM transfer time T_T varies with the image display magnification or the like. For example, as shown in FIG. 2, if the transfer time at display magnification a is T_T(a), the transfer time at display magnification b is T_T(b).

[0028] Now take the maximum transfer time T_T(max), which occurs at the maximum display magnification, and set a constant time T_C (the interval between A and B in FIG. 2) slightly larger than this value. The delay time T_D is then split in two using this constant time T_C: a delay time T_D1 that falls within the constant time T_C and a delay time T_D2 that falls outside it. Thus T_D = T_D1 + T_D2.

[0029] When the delay time T_D is split into T_D1 and T_D2 in this way, the delay time T_D1 does not depend on the phoneme type but varies with the display magnification (display size) of the image on the screen. This is because the constant time T_C is the sum of the VRAM transfer time T_T and the delay time T_D1, so that T_D1 = T_C - T_T, and the transfer time T_T varies with the display magnification regardless of the phoneme type. The delay time T_D2, on the other hand, varies with the phoneme type and does not change with the display magnification.
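Substituting the two relations given in the text, T_D = T_D1 + T_D2 and T_C = T_T + T_D1, into the sum for T_TOTAL shows why the transfer time drops out. The derivation below is a worked restatement of those relations and introduces no new quantities.

```latex
\begin{align*}
T_{TOTAL} &= T_P + T_T + T_D \\
          &= T_P + T_T + (T_{D1} + T_{D2}) \\
          &= T_P + T_T + (T_C - T_T) + T_{D2} \\
          &= T_P + T_C + T_{D2}
\end{align*}
```

Since T_P and T_C are constants and T_D2 depends only on the phoneme type, the resulting utterance time no longer depends on the display magnification.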

[0030] With the delay time set as described above, once the display magnification of the face image on the screen is fixed, the VRAM transfer time T_T is fixed according to that magnification, and the transfer time T_T in turn uniquely determines the delay time T_D1. For example, if the transfer time at display magnification a is T_T(a), the magnification-dependent delay time is T_D1(a) = T_C - T_T(a); if the transfer time at display magnification b is T_T(b), it is T_D1(b) = T_C - T_T(b).
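The corrected delay can be sketched as two small lookup tables plus one subtraction. All values below are hypothetical; the only constraints taken from the text are that T_C is slightly larger than the largest transfer time and that T_D1 = T_C - T_T.

```python
T_P = 5                     # processing time (ms), constant; hypothetical value
T_T = {1: 20, 2: 80}        # transfer time (ms) per display magnification; hypothetical
T_C = 100                   # constant window, slightly larger than max(T_T); hypothetical
T_D2 = {"a": 15, "ta": 35}  # phoneme-dependent delay (ms); hypothetical


def delay(phoneme, magnification):
    """T_D = T_D1 + T_D2, where T_D1 = T_C - T_T(magnification)."""
    t_d1 = T_C - T_T[magnification]
    return t_d1 + T_D2[phoneme]


def utterance_time(phoneme, magnification):
    """T_TOTAL = T_P + T_T + T_D; the same at every magnification."""
    return T_P + T_T[magnification] + delay(phoneme, magnification)


print(utterance_time("a", 1), utterance_time("a", 2))    # 120 120
print(utterance_time("ta", 1), utterance_time("ta", 2))  # 140 140
```

With these values the delay itself changes with magnification (95 ms versus 35 ms for "a"), while the total stays fixed.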

[0031] Adding the delay time T_D2 for each phoneme to this delay time T_D1 therefore gives the delay time T_D of each phoneme. With this method, the only data that must be held are the delay time T_D2 for each phoneme and the VRAM transfer time T_T for each display magnification, so the amount of held data is kept small.
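The saving in held data can be made concrete by counting table entries. The function names and the example counts (100 phoneme types, 8 display magnifications) are hypothetical and serve only to compare the two storage schemes.

```python
def entries_per_magnification_table(n_phonemes, n_magnifications):
    """Holding a separate delay for every (phoneme, magnification) pair."""
    return n_phonemes * n_magnifications


def entries_split_delay(n_phonemes, n_magnifications):
    """Holding one T_D2 per phoneme plus one T_T per magnification."""
    return n_phonemes + n_magnifications


print(entries_per_magnification_table(100, 8))  # 800
print(entries_split_delay(100, 8))              # 108
```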

[0032] Various modifications are possible in carrying out the present invention. For example, the embodiment above uses seven mouth-shape images, but the invention is of course not limited to this, and the number of mouth-shape image types may be increased to synthesize more natural mouth movements. Likewise, the embodiment above treats the mouth region as the moving part of the face image synthesized on the receiving side, but the invention is not limited to this. For example, if eye movements and the like are also varied to suit the text in addition to the mouth movements, a more expressive AV message can be delivered to the recipient.

[0033] The embodiment above applies the invention to a speech and facial moving image output system, but the invention is not limited to this and can also be applied to AV electronic mail. Furthermore, once speech recognition technology can recognize the phonemes of spoken speech in real time, the invention could also be applied to services such as a pseudo-videophone, in which an ordinary telephone call suffices to display the speaker's facial expressions as a moving image on the recipient's side.

[0034]

[Effects of the Invention] As described above, according to the present invention, a composite image display system can synthesize and display a facial moving image with natural mouth movements synchronized with the synthesized speech, unaffected by changes in the display mode such as the screen magnification.

[Brief description of drawings]

[FIG. 1] A diagram illustrating the principle of the present invention.

[FIG. 2] A diagram illustrating the utterance time calculation of the speech and moving image output device in a composite image display system according to one embodiment of the present invention.

[FIG. 3] A diagram showing a conventional speech and moving image output device.

[FIG. 4] A diagram illustrating the concept of utterance time in the conventional device.

[Explanation of symbols]

1 Text decomposition unit
2 Rule-based speech synthesis unit
3 Speech/mouth-shape conversion unit
4 Utterance time calculation unit
5 Image display control unit
6 Image memory
7 Parameter input unit

Claims

1. A composite image display system comprising: speech synthesis means (31) that synthesizes and outputs synthesized speech based on text information; conversion means (32) that converts the text information into a sequence of mouth-shape codes representing the series of mouth-shape movements made when the text information is uttered; utterance time calculation means (33) that, based on the text information, calculates the utterance time of each syllable of the synthesized speech output from the speech synthesis means and estimates the timing of each syllable boundary; and display control means (34) that switches the displayed image to the mouth-shape image corresponding to the mouth-shape code from the conversion means at the timing of each syllable boundary estimated by the utterance time calculation means, wherein, in calculating the utterance time of each syllable of the synthesized speech, the utterance time calculation means obtains in advance the relationship between the display mode of the person's synthesized facial moving image on the screen and the time required for the display processing, and corrects, based on that relationship, the per-sound delay time used to calculate the utterance time of each syllable.

2. A composite image display system that generates, from arbitrary text information, the corresponding synthesized speech and a synthesized facial moving image of a person whose mouth moves in time with that speech, wherein, in calculating the utterance time of each syllable of the synthesized speech in order to switch the mouth-shape image of the synthesized facial moving image at the timing of each syllable boundary, the system obtains in advance the relationship between the display mode of the person's synthesized facial moving image on the screen and the time required for the display processing, and corrects, based on that relationship, the delay time for each syllable type used to calculate the utterance time of each syllable.

3. The composite image display system according to claim 1 or 2, wherein the delay time is corrected by dividing it into a first delay time determined uniquely by the image display mode and a second delay time determined uniquely by the syllable type, obtaining the first delay time according to the image display mode, and adding to it the second delay time for each syllable.